Soul-searching question 4: what happens between entering a URL and the page being displayed? - Parsing algorithms

Once the network request and response are complete, if the Content-Type response header is text/html, the browser moves on to its parsing and rendering work.

Let's start with the parsing part, which is divided into the following steps:

  • Build the DOM tree
  • Compute styles
  • Generate the layout tree (Layout Tree)

Build a DOM tree

Browsers cannot work with an HTML string directly, so the incoming byte stream is converted into a meaningful, easy-to-manipulate data structure: the DOM tree. The DOM tree is essentially a multi-way tree with document as its root.

So how is the HTML parsed?

The nature of HTML grammar

First, one thing should be clear: HTML's grammar is not a context-free grammar.

Here it is worth discussing what a context-free grammar is.

Compiler theory gives a precise definition:

If every production rule of a formal grammar G = (N, Σ, P, S) takes the form V -> w, where V ∈ N and w ∈ (N ∪ Σ)*, the grammar is called context-free.

The components of G = (N, Σ, P, S) are:

  1. N is the set of nonterminal symbols (as the name implies, symbols that are not final; the same reasoning applies below).
  2. Σ is the set of terminal symbols.
  3. P is the set of productions, for example S -> aSb.
  4. S is the start symbol, which must belong to N, i.e. it is a nonterminal.

Put plainly, a context-free grammar is one in which the left-hand side of every production is a single nonterminal.

If that is a little confusing, an example should make it clear.

For example:

A -> B

In this grammar, the left-hand side of every production is a nonterminal, so it is context-free. In this case, xBy can always be reduced to xAy.

Now let's look at a counterexample:

aA -> B
Aa -> B

This is not a context-free grammar: when we meet B, we cannot tell whether it can be reduced to A, because that depends on whether an a exists to its left or right. That dependence is what makes it context-sensitive.
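To make the difference concrete, here is a minimal sketch that checks whether every production has a single nonterminal on its left-hand side. The representation (productions as pairs of strings, uppercase letters for nonterminals, lowercase for terminals) and the name is_context_free are assumptions made for this illustration only.

```python
def is_context_free(productions):
    """Return True if every left-hand side is exactly one nonterminal.

    productions: list of (left, right) string pairs. By assumption,
    uppercase letters are nonterminals and lowercase letters are terminals.
    """
    return all(len(left) == 1 and left.isupper() for left, _ in productions)


context_free = [("A", "B")]
context_sensitive = [("aA", "B"), ("Aa", "B")]

print(is_context_free(context_free))       # True
print(is_context_free(context_sensitive))  # False
```

The check only looks at the shape of the left-hand sides, which is all the definition above asks for.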

As for why HTML is not context-free: first note that standard HTML syntax does conform to a context-free grammar; what makes HTML non-context-free is non-standard syntax. I will give just one counterexample as proof.

For example, when the parser scans a form tag, the context-free approach would be to create the corresponding form DOM object directly. Real HTML5 does not work that way: the parser looks at the form's context, and if the form is nested inside another form it skips the current form tag; otherwise it creates the DOM object.

The grammars of conventional programming languages are context-free, but HTML is the opposite. It is precisely this non-context-free nature that means an HTML Parser cannot be built with a conventional programming-language parser and needs a different approach.

Parsing algorithm

The HTML5 specification describes the parsing algorithm in detail. The algorithm has two stages:

  1. Tokenization.
  2. Tree construction.

These two stages correspond to lexical analysis and syntax analysis.

Tokenization algorithm

This algorithm takes HTML text as input and outputs HTML tokens, so it is also called the tokenizer. It is implemented as a finite state machine: in the current state, on receiving one or more characters, it moves to the next state.

<html>
  <body>
    Hello sanyuan
  </body>
</html>

A simple example shows how tokenization proceeds.

On encountering <, the state becomes tag open.

On receiving a character in [a-z], the tokenizer enters the tag name state.

It stays in this state until it encounters >, which means the tag name is complete, and it then switches to the data state.

The body tag that follows is handled the same way.

At this point both the html and body tags have been recorded.

Now, at the > of <body>, the tokenizer enters the data state and stays there while it receives the characters that follow: Hello sanyuan.

Then it receives the < of </body> and goes back to tag open; the next character is /, at which point an end tag token is created.

It then enters the tag name state, and on encountering > it returns to the data state.

</html> is then processed in the same way.
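The walkthrough above can be sketched as a tiny state machine. This is a heavily simplified illustration, not the specification's tokenizer: it only knows the three states named above, it ignores attributes, comments and entities, and the token tuples it yields are an assumed format chosen for the example.

```python
# States named after the ones in the walkthrough above.
DATA, TAG_OPEN, TAG_NAME = "data", "tag open", "tag name"


def tokenize(html):
    """Yield ("start tag", name), ("end tag", name) and ("character", ch) tokens."""
    state = DATA
    kind = name = ""
    for ch in html:
        if state == DATA:
            if ch == "<":
                state = TAG_OPEN
                kind = "start tag"
            else:
                yield ("character", ch)
        elif state == TAG_OPEN:
            if ch == "/":
                kind = "end tag"        # "</" means an end tag token is being created
            elif ch.isalpha():
                name = ch.lower()
                state = TAG_NAME
        elif state == TAG_NAME:
            if ch == ">":
                yield (kind, name)      # tag name finished: emit the token
                name = ""
                state = DATA
            else:
                name += ch.lower()


print(list(tokenize("<b>x</b>")))
# [('start tag', 'b'), ('character', 'x'), ('end tag', 'b')]
```

Each character is emitted as its own token here; grouping characters into a Text node is left to the tree construction stage.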

Tree construction algorithm

As mentioned earlier, the DOM tree is a multi-way tree rooted at document. The parser therefore first creates a document object. The tokenizer sends each token to the tree builder. On receiving a token, the tree builder creates the corresponding DOM object. After creating the DOM object, it does two things:

  1. Adds the DOM object to the DOM tree.
  2. Pushes the corresponding tag onto the stack of open elements (open meaning it still corresponds to a closing tag that has not arrived yet).

Let's use the same example again:

<html>
  <body>
    Hello sanyuan
  </body>
</html>

At first, the state is initial.

On receiving the html tag from the tokenizer, the state becomes before html. An HTMLHtmlElement DOM element is created, appended to the document root object, and pushed onto the stack.

The state then automatically becomes before head. The tokenizer now sends body, which is not head, so the tree builder automatically creates an HTMLHeadElement and adds it to the DOM tree.

The state now becomes in head and then jumps straight to after head.

Now the tokenizer sends the body tag, so an HTMLBodyElement is created, inserted into the DOM tree, and pushed onto the stack of open elements.

The state then becomes in body, and the characters that follow are received: Hello sanyuan. On receiving the first character, a Text node is created with that character in it, and the Text node is inserted into the DOM tree under the body element. Later characters are appended to that Text node as they arrive.

Now the tokenizer sends the body closing tag, and the state becomes after body.

Finally the tokenizer sends the html closing tag, the state becomes after after body, and parsing ends.
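The same walkthrough can be sketched in code. This is only a toy that handles the html/body example above: the Node class, the token tuples and the build_tree function are assumptions made for illustration, and a real tree builder handles far more tokens and states.

```python
class Node:
    def __init__(self, name, text=""):
        self.name, self.text, self.children = name, text, []


def build_tree(tokens):
    """Build a tiny DOM-like tree from (kind, value) tokens.

    Returns (document, modes), where modes lists the states passed through.
    """
    document = Node("#document")
    open_elements = []                  # the stack of open elements
    mode = "initial"
    modes = [mode]

    def switch(new_mode):
        nonlocal mode
        mode = new_mode
        modes.append(new_mode)

    def insert(name):
        parent = open_elements[-1] if open_elements else document
        node = Node(name)
        parent.children.append(node)    # 1. add the object to the tree
        open_elements.append(node)      # 2. push it onto the stack of open elements

    for kind, value in tokens:
        if kind == "start tag" and value == "html":
            switch("before html")
            insert("html")
            switch("before head")
        elif kind == "start tag" and value == "body":
            if mode == "before head":   # no head tag arrived, so create one implicitly
                insert("head")
                switch("in head")
                open_elements.pop()
                switch("after head")
            insert("body")
            switch("in body")
        elif kind == "character" and mode == "in body":
            body = open_elements[-1]
            if body.children and body.children[-1].name == "#text":
                body.children[-1].text += value   # append to the existing Text node
            else:
                body.children.append(Node("#text", value))
        elif kind == "end tag" and value == "body":
            switch("after body")
        elif kind == "end tag" and value == "html":
            switch("after after body")
    return document, modes


tokens = [("start tag", "html"), ("start tag", "body")]
tokens += [("character", ch) for ch in "Hello sanyuan"]
tokens += [("end tag", "body"), ("end tag", "html")]

document, modes = build_tree(tokens)
html = document.children[0]
print([child.name for child in html.children])  # ['head', 'body']
print(html.children[1].children[0].text)        # Hello sanyuan
```

Note how head ends up in the tree even though the input never mentions it, which is the behaviour described in the before head step.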

Fault Tolerance

Speaking of the HTML5 specification, one has to mention its strong tolerance policy: its fault tolerance is very powerful. Opinions on that are mixed, but I think a senior front-end engineer should know what the HTML Parser does in the name of fault tolerance.

Below are some classic fault-tolerance examples from WebKit. If you know of others, you are welcome to add them.

  1. Using </br> instead of <br>
if (t->isCloseTag(brTag) && m_document->inCompatMode()) {
  reportError(MalformedBRError);
  t->beginTag = true;
}

All of these are converted to the <br> form.

  2. Stray tables
<table>
  <table>
    <tr><td>inner table</td></tr>
  </table>
  <tr><td>outer table</td></tr>
</table>

WebKit automatically converts this to:

<table>
  <tr><td>outer table</td></tr>
</table>
<table>
  <tr><td>inner table</td></tr>
</table>

  3. Nested form elements

In this case the inner form is simply ignored.
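One way to picture the nested form rule is a parser that remembers whether a form is currently open. The sketch below assumes a single form_pointer variable and made-up form ids; it illustrates the idea and is not WebKit's code.

```python
def created_forms(tokens):
    """Return the ids of the forms that actually get created.

    tokens: list of ("start", form_id) / ("end", form_id) pairs.
    The ids are hypothetical labels used only to tell the forms apart.
    """
    form_pointer = None          # the currently open form, if any
    created = []
    for kind, form_id in tokens:
        if kind == "start":
            if form_pointer is not None:
                continue         # already inside a form: ignore this start tag
            form_pointer = form_id
            created.append(form_id)
        elif kind == "end":
            form_pointer = None  # a closing tag ends the open form
    return created


tokens = [("start", "outer"), ("start", "inner"), ("end", "inner"), ("end", "outer")]
print(created_forms(tokens))  # ['outer']
```

The decision depends on what was seen earlier, which is exactly the context dependence that a context-free parser cannot express.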

Style computation

CSS styles generally come from three sources:

  1. Stylesheets referenced by link tags
  2. Styles inside style tags
  3. Inline style attributes on elements

Formatting the stylesheet

First, the browser cannot directly understand CSS text, so the first thing the rendering engine does on receiving CSS text is convert it into a structured object, namely styleSheets.

The formatting process is quite complicated, and different browsers use different optimization strategies, so it is not covered here.

In the browser console you can inspect the final structure through document.styleSheets. This structure contains the CSS from all three sources and provides the basis for the style operations that follow.

Standardizing style properties

Some CSS values are not easy for the rendering engine to understand, so they need to be standardized before computation, for example em -> px, red -> #ff0000, bold -> 700, and so on.
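As a rough illustration of this step, the sketch below standardizes the three kinds of values mentioned above. The lookup tables are tiny hand-written samples, the function name normalize is made up for this example, and the 16px reference font size is an assumption rather than something the engine is guaranteed to use.

```python
import re

# Small illustrative tables, not complete lists.
NAMED_COLORS = {"red": "#ff0000", "green": "#008000", "blue": "#0000ff"}
FONT_WEIGHTS = {"normal": "400", "bold": "700"}


def normalize(prop, value, font_size_px=16.0):
    """Return a standardized value for a few illustrative cases."""
    em = re.fullmatch(r"(\d*\.?\d+)em", value)
    if em:
        # em is relative to a reference font size (assumed to be 16px here)
        return f"{float(em.group(1)) * font_size_px:g}px"
    if prop == "color":
        return NAMED_COLORS.get(value, value)
    if prop == "font-weight":
        return FONT_WEIGHTS.get(value, value)
    return value


print(normalize("font-size", "2em"))     # 32px
print(normalize("color", "red"))         # #ff0000
print(normalize("font-weight", "bold"))  # 700
```

Values the sketch does not recognize are returned unchanged.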

Computing the specific styles of each node

Once the styles have been formatted and standardized, the specific style information of each node can be computed.

The computation itself is not complicated. It rests on two rules: inheritance and cascading.

By default, each child node inherits the style properties of its parent; if a value is not found on the parent, the browser's default style, also called the UserAgent style, is used. That is the inheritance rule, and it is easy to understand.

Then there is the cascade rule. The biggest feature of CSS is its cascading nature: the final result depends on how the various properties interact, and there are plenty of odd cascading phenomena, as anyone who has read "CSS World" will know well. The specific cascading rules belong to in-depth study of the language, so they are not covered here.

It is worth noting, though, that after style computation all the final style values are exposed through window.getComputedStyle, which means the computed styles can be read from JS. Very convenient.
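The two rules can be sketched as follows. This is a deliberately reduced model: the set of inherited properties, the default values, and the way a declaration is represented as (specificity, order, property, value) are all assumptions for illustration, and the real cascade also takes origin and !important into account.

```python
INHERITED = {"color", "font-size"}   # illustrative subset of inherited properties
USER_AGENT_DEFAULTS = {"color": "black", "font-size": "16px", "display": "inline"}


def computed_style(declarations, parent_style=None):
    """Compute one node's style.

    declarations: list of (specificity, order, prop, value) tuples that
    matched this node. parent_style: the parent's computed style, if any.
    """
    style = {}
    for prop, default in USER_AGENT_DEFAULTS.items():
        if parent_style is not None and prop in INHERITED:
            style[prop] = parent_style[prop]   # inheritance
        else:
            style[prop] = default              # browser default
    # Cascade: higher specificity wins, later source order breaks ties.
    for _, _, prop, value in sorted(declarations, key=lambda d: (d[0], d[1])):
        style[prop] = value
    return style


body = computed_style([(1, 0, "color", "blue"), (1, 1, "color", "red")])
p = computed_style([(1, 0, "display", "block")], parent_style=body)
print(body["color"])            # red
print(p["color"], p["display"]) # red block
```

The p node never declares a color, yet it ends up red because it inherits the value that won the cascade on body.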

Creating the layout tree

Now that the DOM tree and the DOM styles have been generated, the next step is for the browser's layout system to determine the position of each element, that is, to generate a layout tree (Layout Tree).

Generating the layout tree works roughly as follows:

  1. Traverse the nodes of the generated DOM tree and add them to the layout tree.
  2. Compute the coordinates of the layout tree nodes.

Notably, the layout tree contains only visible elements: the head tag and elements set to display: none are not put into it.
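The first step, together with the visibility rule, can be sketched like this. The dictionary-based node format and the function name build_layout_tree are assumptions for illustration, and the coordinate computation of step 2 is left out entirely.

```python
def build_layout_tree(node):
    """Copy visible nodes into a layout tree.

    node: dict with "tag" and optional "style" and "children" keys.
    Returns a layout node, or None if the node is not visible.
    """
    if node["tag"] == "head" or node.get("style", {}).get("display") == "none":
        return None   # invisible: the node and its subtree are left out
    children = (build_layout_tree(child) for child in node.get("children", []))
    return {"tag": node["tag"], "children": [c for c in children if c is not None]}


dom = {
    "tag": "html",
    "children": [
        {"tag": "head", "children": [{"tag": "title"}]},
        {"tag": "body", "children": [
            {"tag": "p"},
            {"tag": "div", "style": {"display": "none"}, "children": [{"tag": "span"}]},
        ]},
    ],
}

print(build_layout_tree(dom))
# {'tag': 'html', 'children': [{'tag': 'body', 'children': [{'tag': 'p', 'children': []}]}]}
```

Skipping a node also skips everything below it, so the span inside the hidden div never reaches the layout tree.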

Some people say that a Render Tree (rendering tree) is generated first. That was the case before 2016; the Chrome team has since done a great deal of refactoring, and there is no longer a step that generates a Render Tree. The layout tree already carries very complete information and fully covers what the Render Tree used to do.

I am not going into the details of layout here because it is too complex, and covering it would bloat this article. In most cases it is enough to know what this step does. If you want to dig into the principles and learn how it is actually done, I highly recommend the FED team's article "How browsers do layout, as seen from the Chrome source code".

Summary

To recap the main thread of this section: build the DOM tree, compute the styles, then generate the layout tree.

 


Origin www.cnblogs.com/guchengnan/p/12160657.html