[Basics] JavaScript: calculating the number of bytes a string occupies

No preamble; straight to the point.

A recent project required calculating in JavaScript how much localStorage space a set of strings would take up. As is well known, JavaScript uses Unicode. Unicode has several encoding forms, of which UTF-8 and UTF-16 are the most widely used, so this article discusses only these two.

The following definitions are taken from Wikipedia ( http://zh.wikipedia.org/zh-cn/UTF-8 ) and abridged.

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, its single-byte range is compatible with ASCII, and it uses one to four bytes per character.

Its encoding rules are as follows:

  1. Characters with code points 000000 - 00007F are encoded in one byte;
  2. 000080 - 0007FF: two bytes;
  3. 000800 - 00D7FF and 00E000 - 00FFFF: three bytes. Note: no Unicode characters are assigned in the range D800-DFFF;
  4. 010000 - 10FFFF: four bytes.
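
The four ranges above can be sketched as a small lookup. This is only an illustration: the helper name utf8BytesForCodePoint and the sample characters are assumptions, not part of the original code.

```javascript
// Hypothetical helper: number of UTF-8 bytes needed for one Unicode code point,
// following the four ranges listed above.
function utf8BytesForCodePoint(codePoint) {
    if (codePoint <= 0x7f) return 1;    // 000000 - 00007F
    if (codePoint <= 0x7ff) return 2;   // 000080 - 0007FF
    if (codePoint <= 0xffff) return 3;  // 000800 - 00FFFF
    return 4;                           // 010000 - 10FFFF
}

// Sample characters (assumed for illustration): U+0041, U+00E9, U+4E2D, U+1F600
var samples = ['A', '\u00e9', '\u4e2d', '\ud83d\ude00'];
var sizes = samples.map(function (ch) {
    // codePointAt(0) reads a whole code point, even when it is a surrogate pair
    return utf8BytesForCodePoint(ch.codePointAt(0));
});
console.log(sizes); // [ 1, 2, 3, 4 ]
```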

UTF-16 is also a variable-length encoding: most characters are encoded in two bytes, and characters whose code point exceeds 65535 use four bytes, as follows:

  1. 000000 - 00FFFF two bytes;
  2. 010000 - 10FFFF four bytes.
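
Since a JavaScript string's length counts UTF-16 code units, the two rules above reduce to a single multiplication. A minimal sketch, where the helper name utf16ByteLength is an assumption for illustration:

```javascript
// Hypothetical helper: UTF-16 byte count of a JavaScript string.
// str.length counts 16-bit code units, so a character above 0xFFFF
// (stored as a surrogate pair) already contributes 2 to the length.
function utf16ByteLength(str) {
    return str.length * 2;
}

var bmpChar = '\u4e2d';          // U+4E2D, one code unit
var astralChar = '\ud83d\ude00'; // U+1F600, two code units (surrogate pair)

console.log(utf16ByteLength(bmpChar));    // 2
console.log(utf16ByteLength(astralChar)); // 4
```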

Since the page uses UTF-8, I first assumed that strings stored in localStorage would be UTF-8 encoded as well. But testing showed that writing to localStorage threw an exception even when the calculated size was under 5MB. On reflection, a page's encoding can be changed; if localStorage stored strings in the page's encoding, wouldn't things break? Browsers presumably all use UTF-16 here. A string calculated to be 5MB in UTF-16 was indeed written successfully, and anything larger failed.
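
To make the UTF-16 accounting concrete, here is a rough sketch that estimates how many bytes a storage object currently holds. It assumes that both keys and values count at two bytes per code unit, which is an assumption: how a browser measures its quota is implementation-specific. The names estimateStorageBytes and createFakeStorage are hypothetical; the fake object only mimics the length, key() and getItem() members of the real Storage interface so the sketch also runs outside a browser.

```javascript
// Hypothetical helper: estimate the bytes used by a Storage-like object,
// assuming 2 bytes per UTF-16 code unit for both keys and values.
function estimateStorageBytes(storage) {
    var total = 0;
    for (var i = 0; i < storage.length; i++) {
        var key = storage.key(i);
        var value = storage.getItem(key) || '';
        total += (key.length + value.length) * 2;
    }
    return total;
}

// Stand-in for window.localStorage (illustration only)
function createFakeStorage(data) {
    var keys = Object.keys(data);
    return {
        length: keys.length,
        key: function (i) { return keys[i]; },
        getItem: function (k) {
            return Object.prototype.hasOwnProperty.call(data, k) ? data[k] : null;
        }
    };
}

var fake = createFakeStorage({ name: '\u4e2d\u6587', icon: '\ud83d\ude00' });
var used = estimateStorageBytes(fake);
console.log(used); // 24
```

In a browser, window.localStorage could be passed in place of the fake object.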

Now for the code. The calculation rules are given above; for speed, the two encodings are handled in separate loops.

/**
 * Calculate the number of bytes a string occupies. UTF-8 is used by default; UTF-16 may be specified instead.
 * UTF-8 is a variable-length encoding of Unicode that uses one to four bytes per character.
 *
 * 000000 - 00007F (128 code points)      0zzzzzzz (00-7F)                              one byte
 * 000080 - 0007FF (1920 code points)     110yyyyy (C0-DF) 10zzzzzz (80-BF)             two bytes
 * 000800 - 00D7FF
 * 00E000 - 00FFFF (61440 code points)    1110xxxx (E0-EF) 10yyyyyy 10zzzzzz            three bytes
 * 010000 - 10FFFF (1048576 code points)  11110www (F0-F7) 10xxxxxx 10yyyyyy 10zzzzzz   four bytes
 *
 * Note: no Unicode characters are assigned in the range D800-DFFF
 * {@link http://zh.wikipedia.org/wiki/UTF-8}
 *
 * UTF-16 uses two bytes for most characters and four bytes for those above 65535
 * 000000 - 00FFFF two bytes
 * 010000 - 10FFFF four bytes
 *
 * {@link http://zh.wikipedia.org/wiki/UTF-16}
 * @param {String} str
 * @param {String} charset utf-8, utf-16
 * @return {Number}
 */
var sizeof = function (str, charset) {
    var total = 0,
        charCode,
        i,
        len;
    charset = charset ? charset.toLowerCase() : '';
    if (charset === 'utf-16' || charset === 'utf16') {
        for (i = 0, len = str.length; i < len; i++) {
            // codePointAt (ES2015) combines a surrogate pair into one code point;
            // charCodeAt only ever returns code units up to 0xffff
            charCode = str.codePointAt(i);
            if (charCode <= 0xffff) {
                total += 2;
            } else {
                total += 4;
                i++; // skip the low surrogate, which has already been counted
            }
        }
    } else {
        for (i = 0, len = str.length; i < len; i++) {
            charCode = str.codePointAt(i);
            if (charCode <= 0x007f) {
                total += 1;
            } else if (charCode <= 0x07ff) {
                total += 2;
            } else if (charCode <= 0xffff) {
                total += 3;
            } else {
                total += 4;
                i++; // skip the low surrogate, which has already been counted
            }
        }
    }
    return total;
};
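
A hand-written UTF-8 count can be checked against the built-in TextEncoder, which is available in current browsers and in Node.js and always encodes to UTF-8. A minimal sketch; the helper name utf8ByteLength and the sample string are assumptions for illustration:

```javascript
// Hypothetical helper: UTF-8 byte length as reported by the platform encoder.
function utf8ByteLength(str) {
    // encode() returns a Uint8Array of UTF-8 bytes
    return new TextEncoder().encode(str).length;
}

// U+0041 (1 byte) + U+00E9 (2) + U+4E2D (3) + U+1F600 (4)
var sample = 'A\u00e9\u4e2d\ud83d\ude00';
console.log(utf8ByteLength(sample)); // 10
```

Note that TextEncoder replaces an unpaired surrogate with U+FFFD, which takes three bytes.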

Origin www.cnblogs.com/7qin/p/12032897.html