Ruby 1.9.3: Why does "\x03".force_encoding("UTF-8") give "\u0003", but "\x03".force_encoding("UTF-16") give "\x03"?

irb(main):036:0* "\x03".force_encoding("UTF-16")
=> "\x03"
irb(main):040:0* "\x03".force_encoding("UTF-8")
=> "\u0003"

Why does "\x03".force_encoding("UTF-8") give "\u0003" while "\x03".force_encoding("UTF-16") ends up as "\x03"? I thought it should be the other way round.

                "\x03".force_encoding('binary').encode('utf-16') might be a little more illuminating. force_encoding lets you produce invalidly encoded text such as \x03 pretending to be UTF-16 or "\xce".force_encoding('utf-8') pretending to be UTF-8. My UTF-16 knowledge is rather sparse so just a comment.
– mu is too short
                Mar 13, 2014 at 3:15
                Can you explain why I cannot concat "\x81" with [993].pack("n"), but I can concat "\x81" with "\x03\xE1", when [993].pack("n") is "\x03\xE1"?
– tensaix2j
                Mar 13, 2014 at 3:29

Because the single byte "\x03" is not a valid UTF-16 sequence, but it is a valid one in UTF-8 (ASCII 03, ETX, end of text). UTF-16 needs at least two bytes (one 16-bit code unit) to represent a Unicode code point.

That's why "\x03" can be treated as Unicode \u0003 in UTF-8 but not in UTF-16.
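
The validity claim above can be checked directly with String#valid_encoding?. This is a minimal sketch: `well_formed?` is a hypothetical helper name, and it uses the explicit UTF-16BE tag rather than the BOM-dependent UTF-16 tag from the question, which is an assumption made to keep the check unambiguous.

```ruby
# Returns true when the given bytes form a well-formed string in the named encoding.
# `well_formed?` is a hypothetical helper used only for this sketch.
def well_formed?(bytes, encoding_name)
  bytes.pack("C*").force_encoding(encoding_name).valid_encoding?
end

p well_formed?([0x03], "UTF-8")          # true: the single byte 03 is U+0003 in UTF-8
p well_formed?([0x03], "UTF-16BE")       # false: one byte is only half a 16-bit code unit
p well_formed?([0x00, 0x03], "UTF-16BE") # true: two bytes make a complete code unit
```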

To represent "\u0003" in UTF-16, you have to use two bytes, either 00 03 or 03 00, depending on the byte order. That's why the byte order has to be specified in UTF-16. For the big-endian version with a byte order mark in front, the byte sequence should be

FE FF 00 03

For the little-endian version, the byte sequence should be

FF FE 03 00

The byte order mark (FE FF or FF FE in the sequences above) should appear at the beginning of a string, or at the beginning of a file.
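
To see why the byte order matters, the same two bytes can be read under both explicit byte-order tags. This is only a sketch: `codepoints_of` is a hypothetical helper, and it uses UTF-16BE and UTF-16LE (which carry no byte order mark) instead of the plain UTF-16 tag.

```ruby
# Interpret raw bytes under the named encoding and return the code points.
# `codepoints_of` is a hypothetical helper used only for this sketch.
def codepoints_of(bytes, encoding_name)
  bytes.pack("C*").force_encoding(encoding_name).codepoints
end

p codepoints_of([0x00, 0x03], "UTF-16BE") # [3]: big-endian reads 00 03 as U+0003
p codepoints_of([0x03, 0x00], "UTF-16LE") # [3]: little-endian needs the bytes swapped
p codepoints_of([0x00, 0x03], "UTF-16LE") # [768]: the wrong byte order yields U+0300
```

Without a byte order mark or an explicit BE/LE tag, a reader of 00 03 has no way to tell U+0003 from U+0300.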

Starting from Ruby 1.9, a String is just a byte sequence with a specific encoding attached as a tag. force_encoding only changes that encoding tag; it does not touch the byte sequence. You can verify that by inspecting "\x03".force_encoding("UTF-8").bytes.
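
The difference between retagging and transcoding can be sketched as follows. `retag` is a hypothetical helper (it duplicates the string first, because force_encoding modifies its receiver), and UTF-16BE is chosen here as an assumption so that no byte order mark is involved.

```ruby
# Return a copy of str with a different encoding tag and identical bytes.
# `retag` is a hypothetical helper used only for this sketch.
def retag(str, encoding_name)
  str.dup.force_encoding(encoding_name)
end

original   = "\u0003"                    # one byte, tagged UTF-8
retagged   = retag(original, "UTF-16BE") # same byte, different tag
transcoded = original.encode("UTF-16BE") # same code point, different bytes

p original.bytes         # [3]
p retagged.bytes         # [3]
p retagged.encoding.name # "UTF-16BE"
p transcoded.bytes       # [0, 3]
```

force_encoding answers "how should these bytes be read?", while encode answers "what bytes express this text in another encoding?".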

Seeing "\u0003" does not mean you have a String stored as the two bytes 00 03. It means the String holds whatever byte(s) represent the Unicode code point U+0003 under the encoding that String is tagged with. Those bytes may be:

03                       // tagged as UTF-8
FE FF 00 03              // tagged as UTF-16
FF FE 03 00              // tagged as UTF-16
03                       // tagged as GBK
03                       // tagged as ASCII
00 00 FE FF 00 00 00 03  // tagged as UTF-32
FF FE 00 00 03 00 00 00  // tagged as UTF-32
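
Most rows of the list above can be reproduced by transcoding U+0003 into each encoding and printing the bytes. This is a sketch, not a definitive reference: `hex_bytes_for` is a hypothetical helper, and as far as I know Ruby's encode writes a big-endian byte order mark for the plain UTF-16 and UTF-32 names, so the little-endian rows are not produced here.

```ruby
# Transcode U+0003 into the named encoding and format the bytes as hex.
# `hex_bytes_for` is a hypothetical helper used only for this sketch.
def hex_bytes_for(encoding_name)
  "\u0003".encode(encoding_name).bytes.map { |b| format("%02X", b) }.join(" ")
end

%w[UTF-8 UTF-16 GBK US-ASCII UTF-32].each do |name|
  puts "#{hex_bytes_for(name).ljust(24)} // tagged as #{name}"
end
```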
                If "\x03" is not valid in UTF-16, why does forcing it to UTF-16 give "\x03" instead of "\u0003"?
– tensaix2j
                Mar 13, 2014 at 2:57
                And in case 2, \x03 is clearly a one-byte thing, yet after forcing it to UTF-8 I ended up with a two-byte thing? I still don't get it.
– tensaix2j
                Mar 13, 2014 at 3:02
                @tensaix2j Starting from Ruby 1.9, a String is just a byte sequence with a specific encoding attached as a tag. force_encoding only changes that encoding tag; it does not touch the byte sequence. You can validate that by inspecting "\x03".force_encoding("UTF-8").bytes.
– Arie Xiao
                Mar 13, 2014 at 3:16
                "\x81" + [993].pack("n")
                Encoding::CompatibilityError: incompatible character encodings: IBM437 and ASCII-8BIT
                        from (irb):70
                        from c:/ruby/2.0.0/bin/irb:12:in `<main>'
                irb(main):071:0> [993].pack("n")
                => "\x03\xE1"
                irb(main):072:0>
                irb(main):073:0* "\x81" + "\x03\xE1"
                => "\x81\x03\xE1"
                irb(main):074:0> "\x81" + [993].pack("n").b
                Encoding::CompatibilityError: incompatible character encodings: IBM437 and ASCII-8BIT
                        from (irb):74
                        from c:/ruby/2.0.0/bin/irb:12:in `<main>'
– tensaix2j
                Mar 13, 2014 at 3:21
                @tensaix2j The error message has told you. On your machine, "\x81" is [129] with the encoding tag IBM437, while [993].pack("n") gives you [3, 225] with the encoding tag ASCII-8BIT. You can only concatenate two Strings when their encodings are compatible, and two Strings holding non-ASCII bytes under different tags are not. In this case, you can do "\x81".force_encoding("ASCII-8BIT") + [993].pack("n")
– Arie Xiao
                Mar 13, 2014 at 3:31
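
The concatenation problem from the comments can be reproduced on any machine by tagging the left operand explicitly. This is a sketch under stated assumptions: `tagged` and `concat_as_binary` are hypothetical helpers, and the IBM437 tag is applied by hand to mimic what the Windows console did to the string literal in the original irb session.

```ruby
# Build a string from raw bytes with an explicit encoding tag.
# `tagged` is a hypothetical helper used only for this sketch.
def tagged(bytes, encoding_name)
  bytes.pack("C*").force_encoding(encoding_name)
end

# Retag a copy of the left operand as binary, then concatenate.
# `concat_as_binary` is a hypothetical helper used only for this sketch.
def concat_as_binary(left, right)
  left.dup.force_encoding("ASCII-8BIT") + right
end

left  = tagged([0x81], "IBM437") # stands in for the console-tagged "\x81" literal
right = [993].pack("n")          # bytes 03 E1, tagged ASCII-8BIT by pack

begin
  left + right
rescue Encoding::CompatibilityError => e
  puts "concatenation failed: #{e.class}"
end

p concat_as_binary(left, right).bytes # [129, 3, 225]
```

Both operands contain a non-ASCII byte, so Ruby cannot pick a common encoding for the result until one side is retagged to match the other.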