Reverse-parsing HTML to generate Markdown: Part 1
The parsing process is divided into four stages. Here is a brief description of each:
- Tokenization: the original HTML text is split into HTML tags and the text between them
- Generating virtual DOM nodes: each HTML tag is converted into a corresponding node
- Building the virtual DOM tree: the nodes are assembled, in order, into a DOM tree
- Generating Markdown text: the DOM tree is converted using predefined HTML-to-Markdown conversion rules
The following HTML is used as the sample text for parsing:
<h2 id="逆向解析HTMl">逆向解析HTMl</h2>
<p><a href="https://www.baidu.com" rel="nofollow" target="_blank">Markdown</a>解析过程分为四个阶段</p>
<ul>
<li>分词</li>
<li>生成虚拟DOM节点</li>
<li>构建虚拟DOM树</li>
<li><p>生成Markdown文本</p></li>
</ul>
Tokenization
We split the source HTML text according to the syntax of an HTML element into three parts: Opening tag, Closing tag, and Enclosed text content.
Because an HTML element may contain nested elements, the Enclosed text content is split further until only plain text remains.
As the syntax diagram makes clear, Opening tag and Closing tag are both wrapped in the two symbols `<` and `>`. So we only need to scan the HTML text once, extract every string wrapped in `<` `>` (along with the text between tags), and put each piece into an array. When the scan finishes, the array is our tokenization result. Splitting the sample text gives the following:
const result = [
  '<h2 id="逆向解析HTMl">',
  '逆向解析HTMl',
  '</h2>',
  '<p>',
  '<a href="https://www.baidu.com" rel="nofollow" target="_blank">',
  'Markdown',
  '</a>',
  '解析过程分为四个阶段',
  '</p>',
  '<ul>',
  '<li>',
  '分词',
  '</li>',
  '<li>',
  '生成虚拟DOM节点',
  '</li>',
  '<li>',
  '构建虚拟DOM树',
  '</li>',
  '<li>',
  '<p>',
  '生成Markdown文本',
  '</p>',
  '</li>',
  '</ul>'
]
Note that attribute values in HTML tags are allowed to contain the symbols `<` and `>`, which means text such as <div data-demo="<demo>asd</demo>"> can occur. So we cannot simply search from the start for a `<` and the next `>` and extract the string between them; otherwise we would extract something like <div data-demo="<demo>.
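The pitfall can be reproduced in a couple of lines. This is only an illustrative sketch: the regular expression below stands in for a naive "first `<` to next `>`" search and is not taken from the author's code.

```javascript
// Naive extraction: take everything from a '<' up to the next '>'.
const html = '<div data-demo="<demo>asd</demo>">';
const naive = html.match(/<[^>]*>/)[0];

// The match stops at the '>' inside the attribute value.
console.log(naive); // <div data-demo="<demo>
```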
A relatively simple approach, and the one I implemented, is to use a stack to determine where an HTML tag begins and ends.
- Starting from index 0, traverse the string character by character:
  - If the current character is `<`, push it onto the stack.
  - If the current character is `>` and the top of the stack is `<`, an HTML tag has ended. Extract the string between the start symbol `<` and the end symbol `>` and save it to the result array.
  - If the current character is `"` and the top of the stack is not `"`, push it onto the stack.
  - If the current character is `"` and the top of the stack is `"`, pop the top element.
See lexer.js for the specific implementation.
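The steps above can be sketched roughly as follows. This is a minimal sketch and not the author's lexer.js: the function name `tokenize` is hypothetical, and it makes a few assumptions the text leaves implicit, namely that `<` and `>` are ignored while a `"` is on top of the stack, that quotes are only tracked inside a tag, that text between tags is trimmed and kept as its own token, and that only double-quoted attribute values are handled.

```javascript
// Hypothetical helper: split an HTML string into tag and text tokens.
function tokenize(html) {
  const tokens = [];
  const stack = [];   // holds '<' and '"' markers
  let tagStart = -1;  // index of the '<' that opened the current tag
  let textStart = 0;  // index where the pending text run begins

  for (let i = 0; i < html.length; i++) {
    const ch = html[i];
    const top = stack[stack.length - 1];

    if (top === '"') {
      // Inside a quoted attribute value: only the closing quote matters.
      if (ch === '"') stack.pop();
    } else if (top === '<') {
      if (ch === '"') {
        stack.push(ch);
      } else if (ch === '>') {
        // End of a tag: save everything from '<' through '>'.
        stack.pop();
        tokens.push(html.slice(tagStart, i + 1));
        textStart = i + 1;
      }
    } else if (ch === '<') {
      // Start of a tag: first flush the text collected since the last tag.
      const text = html.slice(textStart, i).trim();
      if (text) tokens.push(text);
      stack.push(ch);
      tagStart = i;
      textStart = i;
    }
  }

  // Keep any trailing text after the last tag.
  const tail = html.slice(textStart).trim();
  if (tail) tokens.push(tail);
  return tokens;
}

console.log(tokenize('<div data-demo="<demo>asd</demo>">x</div>'));
// [ '<div data-demo="<demo>asd</demo>">', 'x', '</div>' ]
```

Because the quote marker sits on top of the stack while an attribute value is being read, the `<demo>` inside the value never closes the outer tag.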
Generating virtual DOM nodes
The main job at this stage is filtering attributes: most attributes inside an HTML tag are not needed, except on a few elements such as `a` and `img`. Each string in the tokenization result can be parsed into an object that holds the information of that HTML tag. The object has the following shape:
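const obj = {
  // Fixed properties
  tag, // HTML tag name, e.g. `div`, `span`
  type, // Custom number corresponding to the HTML tag name
  position, // Kind of tag: opening tag 1, closing tag 2, empty element (empty tag) and text node 3
  // Optional properties
  attr, // Key-value pairs of the tag's attributes (an object). Elements that need to keep attributes, such as `a`, retain `href` and `title` for generating the Markdown text.
  content // Text nodes only: holds the text
}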
The result of this stage looks like this (it is fairly long, so only the first six, more representative entries are shown):
const result = [
  {
    tag: 'h2',
    type: 42, // Don't worry about `type`; it is custom, and 42 is the number assigned to the `h2` element
    position: 1
  },
  {
    tag: 'textNode',
    type: 1,
    position: 3,
    content: '逆向解析HTMl'
  },
  {
    tag: 'h2',
    type: 42,
    position: 2
  },
  {
    tag: 'p',
    type: 6,
    position: 1
  },
  {
    tag: 'a',
    type: 2,
    position: 1,
    attr: {
      href: 'https://www.baidu.com'
    }
  },
  {
    tag: 'textNode',
    type: 1,
    position: 3,
    content: 'Markdown'
  },
]
The idea behind this part is relatively simple; it is mostly string processing.
- tag: The structure of an HTML tag is very simple. It is roughly one of the following forms (the last few need no special treatment and can be ignored):
  <tagName attrKey="attrValue" attrKey>
  <tagName attrKey="attrValue" attrKey >
  <tagName/>
  <tagName />
  </tagName>
  It is easy to see that to get tagName we only need the string between `<` and the first space or `/` (or the closing `>` when neither is present).
- type: This property is added to make later processing easier, since numbers are easier to handle than strings. I wrote a mapping table (in the configuration file) that uses the tag as the key and the corresponding number as the value, so the lookup is straightforward.
- position: Although this property is called position, type would actually be a more fitting name, because it identifies opening tags (1), closing tags (2), and empty elements and text nodes (3). My check for position is fairly simple and only considers the tag forms listed above (which already cover most cases). Given those forms, it is enough to check whether the character at index 1 of the tag string is `/` to know whether it is an opening tag. As for text nodes: a text node has no tag, so if no tag can be found, the token is identified as a text node.
- attr: attr stores the attributes that are needed to generate the Markdown text. Most cases are link-related, covering the following Markdown syntax (Links, Images, Heading IDs, Footnotes):
  - Links: href, title
  - Images: src, title, alt
  - Heading IDs: id
  - Footnotes: id
- content: There is not much to say here; it holds the text of a text node.
See parser.js for the specific implementation.
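As a rough illustration of the string processing described above, here is a minimal sketch. It is not the author's parser.js: `parseToken`, `TYPE_MAP` and `KEPT_ATTRS` are hypothetical names, the type table contains only the four numbers visible in the sample output, and keeping attributes only for `a` and `img` is an assumption chosen to match that sample.

```javascript
// Partial mapping: only these four numbers appear in the sample output.
// The author's real table lives in a configuration file not shown here.
const TYPE_MAP = new Map([['textNode', 1], ['a', 2], ['p', 6], ['h2', 42]]);

// Attributes to keep per tag (an assumption for this sketch).
const KEPT_ATTRS = new Map([
  ['a', ['href', 'title']],
  ['img', ['src', 'title', 'alt']],
]);

// Hypothetical helper: turn one token into a node description.
function parseToken(token) {
  // No leading '<' means no tag can be found, so this is a text node.
  if (token[0] !== '<') {
    return { tag: 'textNode', type: TYPE_MAP.get('textNode'), position: 3, content: token };
  }

  // The tag name runs from '<' (skipping the '/' of a closing tag)
  // up to the first whitespace, '/' or '>'.
  const tag = token.match(/^<\/?([^\s\/>]*)/)[1].toLowerCase();

  let position = 1;                              // opening tag
  if (token[1] === '/') position = 2;            // closing tag
  else if (/\/\s*>$/.test(token)) position = 3;  // empty element such as <br />

  const node = { tag, type: TYPE_MAP.get(tag), position };

  // Keep only the double-quoted attributes this tag needs.
  const wanted = KEPT_ATTRS.get(tag) || [];
  const attr = {};
  for (const [, key, value] of token.matchAll(/([^\s"=<>\/]+)\s*=\s*"([^"]*)"/g)) {
    if (wanted.includes(key)) attr[key] = value;
  }
  if (Object.keys(attr).length > 0) node.attr = attr;
  return node;
}

console.log(parseToken('<a href="https://www.baidu.com" rel="nofollow" target="_blank">'));
// { tag: 'a', type: 2, position: 1, attr: { href: 'https://www.baidu.com' } }
```

Tags missing from the partial table simply get an undefined `type` here; a complete table would cover every supported element.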
End
Reverse parsing takes a while to explain. This is the first part, and I expect to finish the series in three parts.
PS. If you found this interesting, please like or comment; it would motivate me to keep writing.
PPS. I'm a third-year student looking for an internship. Is anyone in Ningbo, Hangzhou, or Shanghai hiring front-end developers?
PPPS. I suppose I can be counted as a Markdown enthusiast...
Reposted from: https://juejin.im/post/5cf274936fb9a07ef443eeec