Learning the HTML5 Standard: Encoding

I believe every front-end engineer has run into garbled text at one time or another. However solid your fundamentals, in day-to-day work you will inevitably end up sitting down for tea with "Brother Garbled" now and then. As a front-end engineer, how do you specify the encoding of a page? And do you know how the browser determines it?

Let's start with a very simple example: a minimal HTML page, to see how each browser treats it differently:

<!DOCTYPE html>

This is minimal HTML: <head> and <body> are empty, and the server sends no encoding declaration. Opened directly from the local disk, the page is reported with the following encoding in each browser:

Browser      Displayed encoding             Remark
IE6          UTF-8
IE8          UTF-8
IE9          GB2312                         System default character set
Firefox 3.5  GBK2312                        System default character set
Firefox 4.0  ISO-8859-1                     Western European languages; default encoding for English
Chrome       GBK                            System default character set
Opera        Chinese (automatic detection)  Probably also GB2312

As the table shows, when a page declares its encoding in no way at all, each browser interprets it differently. Of course, for a page this simple the choice of encoding makes no difference (provided the encoding is a superset of ASCII), but it is enough to show why setting the encoding correctly matters.

Encoding declaration

Both HTML4 and HTML5 devote a section to describing how an encoding is declared; see the relevant sections of the HTML4 and HTML5 specifications.

First of all, what is an encoding declaration? It is a way of telling the browser (or user agent) which algorithm to use to parse the byte stream into the real, correct content. In the HTML standards, an encoding may be referred to by an alias. The aliases come from IANA's definitions, and a browser recognizes only the encodings that appear in that list. So if UTF-8 is written as UTF8, a browser may well ignore it entirely. Also, encoding aliases are not case sensitive.

HTML4 proposes three ways to specify the encoding of a page. In order of priority, they are:

  1. The charset parameter that follows the Content-Type field in the HTTP header.
  2. A <meta http-equiv="Content-Type"> tag.
  3. For certain external resources, such as a JS file loaded by a <script> tag, the charset attribute declared on the tag.
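As a rough illustration of this priority order, the sketch below reads the charset parameter from a Content-Type value with Python's standard library and then picks the first declaration that is present. The helper names charset_from_content_type and pick_encoding, and the sample values, are made up for this example; it does not describe how any browser implements the lookup.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value (lowercased), or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def pick_encoding(http_content_type: Optional[str],
                  meta_charset: Optional[str],
                  attr_charset: Optional[str]) -> Optional[str]:
    """Apply the HTML4 order: HTTP header, then <meta>, then the charset attribute."""
    from_header = (charset_from_content_type(http_content_type)
                   if http_content_type else None)
    for candidate in (from_header, meta_charset, attr_charset):
        if candidate:
            return candidate.lower()
    return None


if __name__ == "__main__":
    print(pick_encoding("text/html; charset=GBK", "utf-8", None))
    print(pick_encoding("text/html", "UTF-8", "gbk"))
```

In the first call the HTTP header wins; in the second the header carries no charset, so the <meta> value is used.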

There is nothing controversial here. Note, however, that when a page is declared with a <meta http-equiv="Content-Type"> tag, the browser, on reaching that tag, checks whether the declared encoding matches the one it has been using; if it does not, the browser goes back and re-parses the earlier part of the page. Part of the page is therefore parsed twice, so if you declare the encoding with this tag, place the tag as early as possible. A best practice is to write it immediately after the <head> tag, before any other tag. Google PageSpeed also has a write-up on this.

Changing with the times

But as time went on, developers gradually noticed something. Just as with the minimal DOCTYPE declaration, browsers do not actually follow the standard strictly when reading the encoding from a <meta> tag. In short, the encoding of a page must be settled during HTML parsing, before the tokenizer stage. The browser therefore cannot do what DOM tree construction does, that is, build the tree, break down the <meta> tag, take out its http-equiv and content attributes, and only then determine the encoding.

In reality, browsers do something very simple to read the encoding defined by a <meta> tag:

  1. Determine that this is a <meta> tag. Based on the HTML parsing state machine, this can be decided from the character "<" followed by the string "meta".
  2. Search the string (there is no notion of a tag here, only a string) for the substring "charset".
  3. Read on, skipping all whitespace characters, to find the first meaningful character c.
    • If c is not "=", go back to step 2 and keep searching.
    • If c is "=", continue.
  4. Skip all whitespace characters and any single or double quotes, then scan forward until reaching a character that should not appear in the value: a single quote, a double quote, a whitespace character, or the end of the tag. Take the scanned string as s.
  5. Parse s as an encoding alias.
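The steps above can be sketched in a few lines of Python. This is a simplified reading of the steps as listed here, not the full algorithm from the HTML5 specification, and the names sniff_meta_charset and charset_in_tag are invented for the example. It works on raw bytes, treats the input purely as a string, and does not check that the result is a known alias (step 5).

```python
from typing import Optional

WHITESPACE = b" \t\n\r\f"
QUOTES = b"\"'"


def charset_in_tag(tag: bytes) -> Optional[str]:
    """Steps 2 to 4: look for charset, then "=", then the value, inside one tag."""
    i = 0
    while True:
        i = tag.find(b"charset", i)                      # step 2
        if i == -1:
            return None
        i += len(b"charset")
        while i < len(tag) and tag[i] in WHITESPACE:     # step 3: skip whitespace
            i += 1
        if i >= len(tag) or tag[i] != ord("="):
            continue                                     # not "=": back to step 2
        i += 1
        # Step 4: skip whitespace and quotes, then read up to the next
        # quote, whitespace character, or end of the tag.
        while i < len(tag) and tag[i] in WHITESPACE + QUOTES:
            i += 1
        j = i
        while j < len(tag) and tag[j] not in WHITESPACE + QUOTES:
            j += 1
        if j > i:
            return tag[i:j].decode("ascii", "replace")


def sniff_meta_charset(data: bytes) -> Optional[str]:
    """Return the first charset value found in a "<meta" tag, lowercased, or None."""
    lower = data.lower()
    pos = 0
    while True:
        start = lower.find(b"<meta", pos)                # step 1: literal "<meta"
        if start == -1:
            return None
        end = lower.find(b">", start)
        if end == -1:
            end = len(lower)
        found = charset_in_tag(lower[start:end])
        if found:
            return found
        pos = end


if __name__ == "__main__":
    print(sniff_meta_charset(b'<meta content="text/html; charset=utf-8">'))
    print(sniff_meta_charset(b"<meta charset = 'GBK'>"))
```

Because the scan never looks at attribute names, the http-equiv form, the quoted charset form and the unquoted charset form all yield the same value.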

From the algorithm above it is easy to see that all of the following forms let the browser identify the encoding correctly:

  • <meta http-equiv="Cotnent-Type" content="text/html; charset=utf-8" />
  • <meta charset="utf-8" />
  • <meta charset=utf-8 />
  • ...and many other odd ways of writing it.

So history moved on, and one day the browser vendors finally sat down together to discuss the problem... They were surprised to find that their implementations were all very similar (perhaps they had simply learned from one another), so they decided to make this approach the standard... After long discussion, the widely loved HTML5 way of declaring an encoding was born. HTML5 calls it the "meta charset element", and its simplest form is:

<meta charset=utf-8>

Of course, that is HTML syntax. If you follow XHTML, or simply feel more at home with XHTML, writing <meta charset="utf-8" /> is no problem either.

The specific algorithm for obtaining the encoding, as described earlier, is also recorded in detail in the specification.

With HTML5, the standard ways of declaring an encoding were again revised and refined. Broadly speaking, the differences are:

  • HTML5 allows a BOM to determine the encoding, but only the UTF-16 BOM (i.e., U+FEFF) is supported, and the standard does not state what priority an encoding specified by BOM has.
  • HTML5 adds the <meta charset> tag.
  • If a page does not specify an encoding, HTML5 prescribes ASCII as the encoding, whereas HTML4 lets the browser choose according to its environment.
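To make the BOM item concrete, here is a minimal sketch that checks the first two bytes of a document for the UTF-16 byte order marks, using the constants from Python's codecs module. The function name sniff_utf16_bom is invented for the example, and it deliberately covers only the UTF-16 case described in the list above.

```python
import codecs
from typing import Optional


def sniff_utf16_bom(data: bytes) -> Optional[str]:
    """Return the UTF-16 variant indicated by a leading BOM, or None if there is none."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    return None


if __name__ == "__main__":
    page = codecs.BOM_UTF16_LE + "<!DOCTYPE html>".encode("utf-16-le")
    print(sniff_utf16_bom(page))
```

U+FEFF is a single code point; the two byte orders are simply its big-endian and little-endian serializations.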

Miscellaneous

Beyond the basic ways of declaring an encoding, the standard has many details that deserve attention:

  • If you declare the encoding with a <meta> tag, the encoding must be a superset of ASCII. Put simply, an encoding counts as a superset of ASCII if it supports the 128 ASCII characters.
  • HTML5 strongly recommends UTF-8.
  • The standard advises against UTF-32, JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB and similar character sets, and prohibits CESU-8, UTF-7, BOCU-1 and SCSU. In practice, though, browsers do recognize at least UTF-7.
  • Developers who want to comply strictly with XHTML should specify the encoding with an XML declaration, i.e. <?xml version="1.0" encoding="UTF-8" standalone="no" ?>. In IE6, however, this affects the DOCTYPE, so developers have little choice but to compromise here and obediently use the HTML form of declaration.
  • The priority of each declaration method in practice, and other details worth noting, are covered in a separate article that is worth reading.
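The ASCII-superset requirement in the first item can be approximated with a quick check: decode the 128 ASCII byte values with the candidate encoding and see whether the ASCII characters come back unchanged. This is only a rough test using Python's own codec names (which are not the IANA alias list), and the helper name is_ascii_superset is invented for the example.

```python
def is_ascii_superset(encoding: str) -> bool:
    """Rough check: do bytes 0x00-0x7F decode to the same characters as in ASCII?"""
    ascii_bytes = bytes(range(128))
    try:
        return ascii_bytes.decode(encoding) == ascii_bytes.decode("ascii")
    except (UnicodeDecodeError, LookupError):
        # Undecodable input or an encoding name Python does not know.
        return False


if __name__ == "__main__":
    for name in ("utf-8", "latin-1", "utf-16-le", "cp037"):
        print(name, is_ascii_superset(name))
```

UTF-8 and Latin-1 pass, while UTF-16 and the EBCDIC code page cp037 do not, because they map those byte values to different characters.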

Best Practices

  • Whenever possible, specify the encoding in the HTTP header.
  • Use UTF-8 wherever possible, or at least use Unicode for all of the site's resources.
  • If you use UTF-16, give the file a BOM so that Little Endian or Big Endian can be determined.
  • If you specify the encoding with a <meta> tag, you need not use the http-equiv form, but put the tag as early as possible, and at the very least before any non-ASCII character.
  • For externally linked scripts, add the charset attribute if you cannot be sure they use the same encoding.
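The advice about placing the declaration before any non-ASCII character can be checked mechanically. The sketch below is a crude test that compares the position of the first "charset" byte string with the position of the first byte outside the ASCII range; the function name declared_before_non_ascii is invented for the example, and the check assumes an ASCII-compatible encoding.

```python
def declared_before_non_ascii(data: bytes) -> bool:
    """True if "charset" appears before the first non-ASCII byte (or there is none)."""
    first_non_ascii = next((i for i, b in enumerate(data) if b >= 0x80), None)
    if first_non_ascii is None:
        return True  # pure ASCII content: nothing can be misread
    declaration = data.lower().find(b"charset")
    return declaration != -1 and declaration < first_non_ascii


if __name__ == "__main__":
    good = "<meta charset=utf-8><title>编码</title>".encode("utf-8")
    bad = "<title>编码</title><meta charset=utf-8>".encode("utf-8")
    print(declared_before_non_ascii(good), declared_before_non_ascii(bad))
```

A check like this could run as part of a build step to catch templates where a title or comment with non-ASCII text has slipped in ahead of the <meta> tag.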

Permanent address of this article: http://www.otakustay.com/learning-html5-charset/

Reposted from: https://www.cnblogs.com/GrayZhang/archive/2011/04/11/learning-html5-charset.html

Origin blog.csdn.net/weixin_33885253/article/details/93272200