Getting and setting the inner HTML of an HtmlElement object when parsing HTML with lxml in a Python crawler

lxml is a very powerful Python module for parsing HTML and XML. Its latest version supports Python 2.6 through 3.6, and it is an essential tool for writing crawlers. It is built on the C libraries libxml2 and libxslt and wraps them in Pythonic bindings, which makes it a feature-rich and easy-to-use Python module. Rich as it is, it still lacks some interfaces for modifying nodes, such as the functions described in this article for getting and setting (modifying) a node's inner HTML.

Parsing an HTML page generally uses the lxml.html module and takes three simple steps:

(1) Import the module:

import lxml.html

(2) Convert the HTML document into an HTML tree whose root node is the <html> tag:

doc = lxml.html.fromstring(html)

(3) Use XPath to find the nodes to extract:

nodes = doc.xpath("//div[@class='the']/div[@id='xpath']")

These three steps are simple. In practice, the third step can be repeated with different XPath expressions to extract data from different nodes.
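The three steps can be put together as a short runnable sketch. The page content and the variable names target_text and links are made up for illustration and are not from the original article:

```python
import lxml.html

# Hypothetical page, invented for this example.
html = """
<html><body>
  <div class="the">
    <div id="xpath">first target</div>
  </div>
  <div class="other"><a href="/node">a link</a></div>
</body></html>
"""

# Step 2: build the tree; for a full document the root is the <html> element.
doc = lxml.html.fromstring(html)

# Step 3: run one XPath query per piece of data to extract.
nodes = doc.xpath("//div[@class='the']/div[@id='xpath']")
target_text = [node.text_content() for node in nodes]

# The same tree can be queried again with a different expression.
links = doc.xpath("//a/@href")

print(target_text, links)
```

Each call to xpath() returns a list, so a query that matches nothing yields an empty list rather than raising an error.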

It is fair to say that lxml's HTML parsing (read-only mode) is both powerful and convenient. Modifying a node of the HTML (write mode) is harder, however: lxml offers very little API for it, and that API only changes the attributes of a node's tag, such as class, id, or href.
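As a small illustration of the attribute-level changes mentioned above, the sketch below uses the set() method and the attrib mapping of an element. The sample link and the new attribute values are assumptions chosen for the example:

```python
import lxml.html

# A single-element fragment; fromstring() returns the <a> element itself.
node = lxml.html.fromstring('<a class="old" href="/node">node</a>')

# set() creates or overwrites an attribute on the tag.
node.set("class", "new")
node.set("id", "link-1")

# attrib behaves like a dictionary of the tag's attributes.
node.attrib["href"] = "/other"

print(lxml.html.tostring(node, encoding="unicode"))
```

These calls only touch the tag's attributes; the text and child nodes inside the element are left unchanged.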

So how do we actually work with the HTML string of a node?

1. Get the inner HTML of a node

So what is inner HTML? First, let's look at a sample of HTML code:

<div class="text">This is the div <a href="/node">node</a> content</div>

For this div node, the inner HTML is:

This is the div <a href="/node">node</a> content

That is, the inner HTML is everything contained inside the tag. The whole sample, including the enclosing div tag itself, is the div's outer HTML.

With the concepts of inner HTML and outer HTML understood, we can set out to obtain them.

The lxml.html.tostring(html_element) interface serializes a node of the HTML tree, together with its child nodes, into an HTML string, i.e. the node's outer HTML. From that we can get the inner HTML, implemented as the following function:


[Image of the function's source code: 13717038-aeff367e4f03df68.png]

2. Set the inner HTML of a node

Setting the inner HTML is more complex than getting it. We again use the HTML code above as the example:

<div class="text">This is the div <a href="/node">node</a> content</div>

Suppose we want to change its inner HTML to the following string:

this is div<a href="/node">node</a>text

The steps are:

(1) Empty the content inside the div node, including its text and child nodes

(2) Parse the new inner HTML into fragments

(3) Add the fragments to the emptied div node

The steps above are written as a Python function:


[Image of the function's source code: 13717038-1b802b5a38082b89.png]

With the function above, the content inside a node can be set to any desired HTML, which is useful for dynamically modifying the structure and content of a page.

Reproduced from: https://www.jianshu.com/p/cda00016a152

Origin blog.csdn.net/weixin_33834075/article/details/91071198