Which character encoding does JS use, and how do you calculate the number of bytes a string occupies?

Character encoding basics

Unicode

Unicode is a character set that grew out of a very simple idea: collect all the characters in the world into one set. A computer that supports this character set can display every one of them, and garbled text disappears.

Unicode only assigns a code point to each character. Which byte sequence actually represents that code point is a matter of the encoding method.

UTF-32 and UTF-8

The most straightforward encoding represents every code point with four bytes whose contents correspond directly to the code point. This encoding is called UTF-32. Its advantage is that the conversion rule is simple and intuitive and lookup is efficient. Its disadvantage is wasted space: the same English text is four times larger than in ASCII. That drawback is fatal. In practice almost no one uses this encoding, and the HTML5 standard explicitly states that web pages must not be encoded as UTF-32.
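As a rough illustration of the space cost described above, the sketch below counts code points and multiplies by four. `utf32ByteLength` is a hypothetical helper name introduced here, not a built-in, and it ignores any byte order mark.

```javascript
// Hypothetical helper: bytes needed to store a string as UTF-32 (no BOM).
function utf32ByteLength(str) {
    // Array.from iterates a string by code point, not by UTF-16 code unit.
    return Array.from(str).length * 4;
}

const englishText = 'hello';
console.log(englishText.length);           // 5 (one byte each in ASCII)
console.log(utf32ByteLength(englishText)); // 20 (four bytes each in UTF-32)
```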

Thus UTF-8 was born. UTF-8 is a variable-length encoding in which a character takes from 1 to 4 bytes. The more commonly used a character is, the fewer bytes it takes. The first 128 characters take only one byte each and are encoded exactly as in ASCII.
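The variable length can be sketched as a lookup on the code point value. The function name `utf8BytesForCodePoint` is an illustrative assumption; the range boundaries are the standard UTF-8 ones.

```javascript
// Hypothetical helper: number of UTF-8 bytes needed for one code point.
function utf8BytesForCodePoint(codePoint) {
    if (codePoint <= 0x7f) return 1;   // ASCII range
    if (codePoint <= 0x7ff) return 2;
    if (codePoint <= 0xffff) return 3; // rest of the Basic Multilingual Plane
    return 4;                          // supplementary planes, up to 0x10FFFF
}

console.log(utf8BytesForCodePoint(0x41));    // 1
console.log(utf8BytesForCodePoint(0xe9));    // 2
console.log(utf8BytesForCodePoint(0x4e25));  // 3
console.log(utf8BytesForCodePoint(0x1f600)); // 4
```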

Note that UTF-8 is just one way of implementing Unicode. For example, the Unicode code point of "严" is 4E25, while its UTF-8 encoding is E4B8A5. The two are different.
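The difference between a code point and its UTF-8 bytes can be checked directly. This is a minimal sketch, assuming a runtime that provides the global `TextEncoder` (modern browsers and recent Node.js versions); the variable names are illustrative.

```javascript
const ch = '严';

// The code point, as assigned by Unicode.
const codePointHex = ch.codePointAt(0).toString(16);

// The UTF-8 byte sequence that encodes that code point.
const utf8Bytes = new TextEncoder().encode(ch);
const utf8Hex = Array.from(utf8Bytes, function (b) {
    return b.toString(16).padStart(2, '0');
}).join('');

console.log(codePointHex); // "4e25"
console.log(utf8Hex);      // "e4b8a5"
```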

UCS-2

UCS-2 (2-byte Universal Character Set) is a fixed-length encoding that uses a single 16-bit code unit to represent each code point. For code points in the range 0 to 0xFFFF (the Basic Multilingual Plane, BMP), it produces the same result as UTF-16 in most cases.

UTF-16 (16-bit Unicode Transformation Format) is an extension of UCS-2 that can represent characters beyond the BMP. It is a variable-length format in which each code point is represented by one or two 16-bit code units. This lets it encode code points from 0 to 0x10FFFF.

In short, UTF-16 is an extension of UCS-2.
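The "one or two 16-bit code units" distinction is visible from JavaScript, where a string's `length` counts code units. The helper names below are assumptions made for this sketch.

```javascript
// Hypothetical helpers for comparing the two ways of counting.
function countCodeUnits(str) {
    return str.length;             // 16-bit code units
}

function countCodePoints(str) {
    return Array.from(str).length; // Unicode code points
}

const bmpChar = '\u4e25';       // U+4E25, inside the BMP
const astralChar = '\u{1F600}'; // U+1F600, outside the BMP

console.log(countCodeUnits(bmpChar), countCodePoints(bmpChar));       // 1 1
console.log(countCodeUnits(astralChar), countCodePoints(astralChar)); // 2 1
```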

UTF-16

UTF-16 sits between UTF-32 and UTF-8 and combines the characteristics of fixed-length and variable-length encoding.

Its encoding rule is simple: characters in the Basic Multilingual Plane occupy 2 bytes, and characters in the supplementary planes occupy 4 bytes. That is, a UTF-16 code is either 2 bytes long (U+0000 to U+FFFF) or 4 bytes long (U+010000 to U+10FFFF).
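For the 4-byte case, UTF-16 splits the code point into a pair of 16-bit code units (a surrogate pair). The sketch below shows the standard arithmetic; `toUtf16CodeUnits` is a hypothetical name, and the function assumes a valid code point as input.

```javascript
// Hypothetical helper: UTF-16 code units for one code point.
function toUtf16CodeUnits(codePoint) {
    if (codePoint <= 0xffff) {
        return [codePoint];                    // one unit, 2 bytes
    }
    const offset = codePoint - 0x10000;        // 20-bit value
    const high = 0xd800 + (offset >> 10);      // top 10 bits
    const low = 0xdc00 + (offset & 0x3ff);     // bottom 10 bits
    return [high, low];                        // two units, 4 bytes
}

function toHex(units) {
    return units.map(function (u) { return u.toString(16); });
}

console.log(toHex(toUtf16CodeUnits(0x4e25)));  // [ '4e25' ]
console.log(toHex(toUtf16CodeUnits(0x1f600))); // [ 'd83d', 'de00' ]
```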

Which character encoding does JS use?

ES 5.1 contains the following passage:

A conforming implementation of this International standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it is presumed to be the UTF-16 encoding form.

Therefore a JS engine may choose to use either UCS-2 or UTF-16.

Why not UTF-8? The reasons are historical. UCS-2 came first, and when UTF-16 arrived in 1996 with the Unicode 2.0 standard, engines could migrate to it. They cannot move to UTF-8, however, because that would break the binary compatibility of the API (among other things).

Java followed the same path: it initially supported UCS-2 and then moved to UTF-16 in J2SE 5.0.

You can get the Unicode code point of "严" with the following code.

var str = '严'
str.charCodeAt(0).toString(16) // "4e25"
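For a character outside the BMP, `charCodeAt` returns each 16-bit code unit separately, while `codePointAt` (ES2015) returns the full code point. A short sketch, using U+1F600 as an arbitrary example character:

```javascript
const astral = '\u{1F600}'; // one character, stored as two code units

const firstUnit = astral.charCodeAt(0).toString(16);   // high surrogate
const secondUnit = astral.charCodeAt(1).toString(16);  // low surrogate
const fullPoint = astral.codePointAt(0).toString(16);  // whole code point

console.log(firstUnit);  // "d83d"
console.log(secondUnit); // "de00"
console.log(fullPoint);  // "1f600"
```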

Calculating the number of bytes occupied by the string

/**
   * Calculates the number of bytes a string occupies in memory. UTF-8 is used by default, which encodes each character in one to four bytes.
   * Reference: http://www.jb51.net/article/73675.htm
   *
   * 000000 - 00007F (128 code points)   0zzzzzzz(00-7F)               one byte
   * 000080 - 0007FF (1920 code points)   110yyyyy(C0-DF) 10zzzzzz(80-BF)       two bytes
   * 000800 - 00D7FF  &  00E000 - 00FFFF (61440 code points)  1110xxxx(E0-EF) 10yyyyyy 10zzzzzz      three bytes
   * 010000 - 10FFFF (1048576 code points) 11110www(F0-F7) 10xxxxxx 10yyyyyy 10zzzzzz four bytes
   *
   * Note: Unicode has no characters in the range D800-DFFF
   * {@link http://zh.wikipedia.org/wiki/UTF-8}
   *
   * UTF-16 mostly uses two bytes per character; code points above 65535 use four bytes
   * 000000 - 00FFFF two bytes
   * 010000 - 10FFFF four bytes
   *
   * {@link http://zh.wikipedia.org/wiki/UTF-16}
   * @param {String} str
   * @param {String} charset utf-8
   * @return {Number}
   */
var sizeof = function(str, charset) {
    var total = 0,
        codePoint,
        i,
        len;

    charset = charset ? charset.toLowerCase() : '';

    if (charset === 'utf-16' || charset === 'utf16') {
        for (i = 0, len = str.length; i < len; i++) {
            // codePointAt (ES2015) combines a surrogate pair into one code point;
            // charCodeAt never returns a value above 0xffff.
            codePoint = str.codePointAt(i);
            if (codePoint <= 0xffff) {
                total += 2;
            } else {
                total += 4;
                i++; // skip the low surrogate of the pair
            }
        }
    } else {
        for (i = 0, len = str.length; i < len; i++) {
            codePoint = str.codePointAt(i);
            if (codePoint <= 0x007f) {
                total += 1;
            } else if (codePoint <= 0x07ff) {
                total += 2;
            } else if (codePoint <= 0xffff) {
                total += 3;
            } else {
                total += 4;
                i++; // skip the low surrogate of the pair
            }
        }
    }

    return total;
};
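One way to cross-check a hand-written byte count is to let the runtime do the encoding. This is a sketch, assuming the global `TextEncoder` is available (modern browsers and recent Node.js versions); the helper names are hypothetical. Note that `TextEncoder` replaces a lone surrogate with U+FFFD, which takes three bytes in UTF-8.

```javascript
// Hypothetical helper: UTF-8 size measured by actually encoding the string.
function utf8ByteLength(str) {
    return new TextEncoder().encode(str).length;
}

// Hypothetical helper: UTF-16 size, two bytes per 16-bit code unit.
// A surrogate pair is two code units, so it contributes four bytes.
function utf16ByteLength(str) {
    return str.length * 2;
}

const sample = 'a\u4e25\u{1F600}'; // 'a', '严', and one supplementary character

console.log(utf8ByteLength(sample));  // 8  (1 + 3 + 4)
console.log(utf16ByteLength(sample)); // 8  (2 + 2 + 4)
```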


Origin www.cnblogs.com/everlose/p/12500856.html