Reverse-parsing HTML to generate Markdown

Reverse-parsing HTML to generate Markdown: Part 1

The parsing process is divided into four stages. The following is a brief description of each stage.

  1. Tokenization: the source text is split into HTML tags and text
  2. Generating virtual DOM nodes: each token is converted into a corresponding node
  3. Building the virtual DOM tree: the nodes are assembled, in order, into the corresponding DOM tree
  4. Generating Markdown text: the DOM tree is converted using predefined HTML-to-Markdown conversion rules

The following HTML is used as the sample text for parsing:

<h2 id="逆向解析HTMl">逆向解析HTMl</h2>
<p><a href="https://www.baidu.com" rel="nofollow" target="_blank">Markdown</a>解析过程分为四个阶段</p>
<ul>
<li>分词</li>
<li>生成虚拟DOM节点</li>
<li>构建虚拟DOM树</li>
<li><p>生成Markdown文本</p></li>
</ul>

Tokenization

We split the source HTML text according to the syntax of HTML elements into three kinds of pieces: Opening tag, Closing tag, and Enclosed text content.

Because an HTML element may contain nested elements, the Enclosed text content is split further until only plain text remains.

From this structure it is clear that an Opening tag and a Closing tag are both wrapped in the two symbols < and >. So we only need to scan the HTML text once, extract every string wrapped in < >, and put it in an array; the text between two tags becomes an entry of its own. When the scan finishes, the array is our tokenization result. The sample text is split as follows:

const result = [
  '<h2 id="逆向解析HTMl">',
  '逆向解析HTMl',
  '</h2>',
  '<p>',
  '<a href="https://www.baidu.com" rel="nofollow" target="_blank">',
  'Markdown',
  '</a>',
  '解析过程分为四个阶段',
  '</p>',
  '<ul>',
  '<li>',
  '分词',
  '</li>',
  '<li>',
  '生成虚拟DOM节点',
  '</li>',
  '<li>',
  '构建虚拟DOM树',
  '</li>',
  '<li>',
  '<p>',
  '生成Markdown文本',
  '</p>',
  '</li>',
  '</ul>'
]

Note that attribute values of HTML tags are allowed to contain the symbols < and >, which means text such as <div data-demo="<demo>asd</demo>"> can occur. We therefore cannot simply search from start to finish for < and > and extract the string between them; otherwise we would extract <div data-demo="<demo> as a tag.
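
To make the problem concrete, here is a minimal sketch of the naive approach. The regular expression and the variable names are assumptions made for illustration; they are not taken from the project.

```javascript
// Naive search: take everything from a '<' up to the next '>'.
// This ignores the fact that '<' and '>' may appear inside a quoted attribute value.
const sampleHtml = '<div data-demo="<demo>asd</demo>">text</div>';
const naiveTokens = sampleHtml.match(/<[^>]*>/g);

console.log(naiveTokens);
// [ '<div data-demo="<demo>', '</demo>', '</div>' ]
```

The first entry is cut off in the middle of the attribute value, and the fragment inside the quotes is reported as a tag of its own.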

The relatively simple method I implemented uses a stack to determine where an HTML tag begins and ends.
  1. Starting from index 0, traverse the string:
    1. If the current character is < (and the top of the stack is not ", i.e. we are not inside a quoted attribute value), push it onto the stack.
    2. If the current character is > and the top of the stack is <, an HTML tag has ended. Pop the <, extract the string from the starting < to this ending >, and save it to the result array.
    3. If the current character is " and the top of the stack is not ", push it onto the stack.
    4. If the current character is " and the top of the stack is ", pop the top element.

See lexer.js for the concrete implementation.
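
The stack-based scan described above can be sketched roughly as follows. This is not the code from lexer.js; the function name `tokenize` and the details (only tracking quotes inside a tag, trimming text and skipping whitespace-only text between tags) are assumptions made to keep the example short.

```javascript
// Sketch of a stack-based tokenizer: splits HTML into tag strings and text strings.
function tokenize(html) {
  const tokens = [];
  const stack = [];    // holds '<' and '"' markers
  let tagStart = -1;   // index of the '<' that opened the current tag
  let textStart = 0;   // index where the pending text run begins

  for (let i = 0; i < html.length; i++) {
    const ch = html[i];
    const top = stack[stack.length - 1];

    if (ch === '<' && stack.length === 0) {
      // A tag begins: flush the text collected since the previous tag.
      const text = html.slice(textStart, i).trim();
      if (text) tokens.push(text);
      stack.push('<');
      tagStart = i;
      textStart = i;
    } else if (ch === '"' && top === '<') {
      // Entering a quoted attribute value (assumption: quotes only matter inside a tag).
      stack.push('"');
    } else if (ch === '"' && top === '"') {
      // Leaving the quoted attribute value.
      stack.pop();
    } else if (ch === '>' && top === '<') {
      // The tag ends here; '<' and '>' seen while '"' is on top were ignored.
      stack.pop();
      tokens.push(html.slice(tagStart, i + 1));
      textStart = i + 1;
    }
  }

  // Whatever is left after the last tag is kept as text.
  const tail = html.slice(textStart).trim();
  if (tail) tokens.push(tail);
  return tokens;
}

console.log(tokenize('<div data-demo="<demo>asd</demo>">x</div>'));
// [ '<div data-demo="<demo>asd</demo>">', 'x', '</div>' ]
```

Because < and > are ignored while a " is on top of the stack, the attribute value stays inside its tag instead of being cut apart.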

Generate a virtual DOM node

This stage mainly filters attributes: most attributes inside an HTML tag are not needed, except on a few elements such as a and img. Each string in the tokenization result is parsed into an object that holds the information of the HTML tag. The shape of the object is as follows:

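const obj = {
    // Fixed properties
    tag,            // HTML tag name, e.g. `div`, `span`
    type,           // Custom number that corresponds to the HTML tag name
    position,       // Kind of tag. Opening tag: 1, Closing tag: 2, empty tag and text node: 3
    // Optional properties
    attr,           // Object with the attribute key/value pairs of the tag. Elements that must keep attributes, such as `a`, keep `href` and `title` for generating the Markdown text
    content         // Text nodes only: holds the text
}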

The result of this stage is as follows (there are quite a few nodes, so only the first six, which are the most representative, are shown):

const result = [
    {
        tag: 'h2',
        type: 42,           // Don't mind the `type` value; it is custom, and 42 is the number assigned to the `h2` element
        position: 1
    },
    {
        tag: 'textNode',
        type: 1,
        position: 3,
        content: '逆向解析HTMl'
    },
    {
        tag: 'h2',
        type: 42,
        position: 2
    },
    {
        tag: 'p',
        type: 6,
        position: 1
    },
    {
        tag: 'a',
        type: 2,
        position: 1,
        attr: {
            href: 'https://www.baidu.com'
        }
    },
    {
        tag: 'textNode',
        type: 1,
        position: 3,
        content: 'Markdown'
    },
]

The idea behind this part is fairly simple; it is mostly string processing.
  1. tag: The structure of an HTML tag is very simple. There are roughly the following forms (forms that need no processing are ignored):

    1. <tagName attrKey="attrValue" attrKey> <tagName attrKey="attrValue" attrKey >
    2. <tagName/> <tagName />
    3. </tagName>

    It is easy to see that to get tagName we only need to take the string between < (or </) and the first space, / or >.

  2. type: This property was added to make later processing by type more convenient; numbers are easier to handle than strings.

    I wrote a mapping table (in the configuration file) with the tag as the key and the corresponding number as the value, so the lookup is easy.

  3. position: Although this property is called position, type would actually be a more fitting name, because it identifies the kind of token: Opening tag: 1, Closing tag: 2, empty tag and text node: 3.

    The position check I wrote is fairly simple and only considers the tag forms listed above (which already cover most cases). For those forms it is enough to check whether the character at index 1 of the tag string is /, which tells us whether it is an Opening tag.

    Detecting text nodes: a text node has no tag, so if no tag can be found in the string, it is identified as a text node.

  4. attr: attr stores the attributes that are needed to generate the Markdown text. Most cases are link-related, such as the following:

    1. Link-related Markdown syntax (Links, Images, Heading IDs, Footnotes).
      1. Links: href title
      2. Images: src title alt
      3. Heading IDs: id
      4. Footnotes: id
  5. content: There is not much to say here; it simply stores the text of a text node.

See parser.js for the concrete implementation.
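
The rules above can be put together in a small sketch. This is not the code from parser.js: the names `TYPE_MAP`, `KEEP_ATTRS`, and `toNode` are hypothetical, the mapping table only contains the four numbers that appear in the sample result, and the attribute whitelist is an assumption for illustration.

```javascript
// Hypothetical mapping table; only these four values appear in the sample result above.
const TYPE_MAP = { textNode: 1, a: 2, p: 6, h2: 42 };

// Assumed whitelist of attributes worth keeping for Markdown generation.
const KEEP_ATTRS = { a: ['href', 'title'], img: ['src', 'title', 'alt'] };

function textNode(content) {
  return { tag: 'textNode', type: TYPE_MAP.textNode, position: 3, content };
}

function toNode(token) {
  // No tag can be found: treat the token as a text node.
  if (!(token.startsWith('<') && token.endsWith('>'))) return textNode(token);

  const isClosing = token[1] === '/';                 // '/' at index 1 means Closing tag
  const isEmpty = !isClosing && token.endsWith('/>'); // <tagName/> or <tagName />

  // tagName sits between '<' (or '</') and the first space, '/' or '>'.
  const match = token.slice(isClosing ? 2 : 1).match(/^[^\s/>]+/);
  if (!match) return textNode(token);
  const tag = match[0].toLowerCase();

  const node = {
    tag,
    type: TYPE_MAP[tag], // undefined for tags missing from this tiny table
    position: isClosing ? 2 : isEmpty ? 3 : 1,
  };

  // Keep only the attributes that the Markdown output needs.
  const wanted = KEEP_ATTRS[tag];
  if (!isClosing && wanted) {
    const attr = {};
    for (const [, key, value] of token.matchAll(/([^\s"=<>/]+)\s*=\s*"([^"]*)"/g)) {
      if (wanted.includes(key)) attr[key] = value;
    }
    if (Object.keys(attr).length > 0) node.attr = attr;
  }
  return node;
}

console.log(toNode('<a href="https://www.baidu.com" rel="nofollow" target="_blank">'));
// { tag: 'a', type: 2, position: 1, attr: { href: 'https://www.baidu.com' } }
```

Applying `toNode` to each string of the tokenization result yields a flat list of node objects like the one shown earlier; rel and target are dropped because they are not on the whitelist.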

End

Reverse parsing takes a while to explain properly. This is the first part; I expect to finish the series in three.


PS. If you found this interesting, please like or comment; it would motivate me to keep writing.

PSS. I'm a junior looking for an internship. Is anyone in Ningbo, Hangzhou, or Shanghai hiring front-end developers?

PSSS. I suppose I can be considered a Markdown enthusiast...

Reproduced from: https://juejin.im/post/5cf274936fb9a07ef443eeec

Origin blog.csdn.net/weixin_34077371/article/details/91428524