Software Engineering Reading Notes (1): the Python Web Crawler

  After Mr. Wang assigned us the project, I settled on Python web crawlers, but I did not yet know basic Python syntax. So I borrowed a Python web crawler tutorial from the library.

A page parser, simply put, is the tool used to parse HTML pages; it is mainly used to extract the needed data and the links to valuable information from an HTML page. In Python, parsing web pages is mainly done with three tools: regular expressions, the lxml library, and Beautiful Soup.

First, regular expressions. A regular expression describes a set of strings. It can be used to check whether a string contains a certain substring, to replace the matching substrings, or to extract from a string the substrings that meet a certain pattern. The advantage of regular expressions is that even basic usage can extract all the information you want, and it is quite efficient; the shortcoming is obvious: regular expressions are not very intuitive, and complicated ones are hard to write.
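
To see what this looks like, here is a minimal sketch of my own (the HTML snippet and the pattern are toy examples, not from the tutorial) that pulls the link targets and link text out of a page with the re module:

```python
import re

# Toy HTML; a real crawler would fetch this over the network first.
html = '<a href="/page1.html">Page 1</a> <a href="/page2.html">Page 2</a>'

# Capture the href value and the link text of each simple anchor tag.
# Regexes are brittle on HTML: this only works for flat, regular markup.
pattern = re.compile(r'<a href="([^"]+)">([^<]+)</a>')

for url, text in pattern.findall(html):
    print(url, text)  # /page1.html Page 1, then /page2.html Page 2
```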

Second, the lxml library. This library uses XPath syntax and is also a highly efficient parsing library. XPath is a language for finding information in an XML document; it can be used to traverse the elements and attributes of an XML document. XPath is intuitive and easy to understand, and with the help of the Chrome or Firefox browser it is very simple to write; the resulting code runs fast and is robust, so it is generally the best choice for parsing data.
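
Here is a small sketch of how parsing with lxml and XPath might look (the sample markup and the queries are my own assumptions, not from the book):

```python
from lxml import html  # third-party: pip install lxml

page = html.fromstring("""
<html><body>
  <div class="item"><a href="/a.html">First</a></div>
  <div class="item"><a href="/b.html">Second</a></div>
</body></html>
""")

# //div[@class="item"]/a/@href selects every href attribute of an <a>
# directly under a <div class="item">, anywhere in the document.
for href in page.xpath('//div[@class="item"]/a/@href'):
    print(href)

# text() selects the link text instead of the attribute.
for text in page.xpath('//div[@class="item"]/a/text()'):
    print(text)
```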

Third, Beautiful Soup. Beautiful Soup is a Python library that can extract data from HTML or XML files. It works with the parser you prefer to provide convenient ways of navigating and searching a document. Writing with Beautiful Soup is highly productive and can save programmers hours or even days of work. Beautiful Soup is relatively easy to learn, but compared with lxml and regular expressions its parsing speed is much slower.
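
The same toy page with Beautiful Soup (again my own example; I use the standard-library "html.parser" backend here, though Beautiful Soup can also be told to use lxml underneath for speed):

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html_doc = """
<html><body>
  <div class="item"><a href="/a.html">First</a></div>
  <div class="item"><a href="/b.html">Second</a></div>
</body></html>
"""

# "html.parser" ships with Python; passing "lxml" here instead
# is what makes Beautiful Soup faster, at the cost of a dependency.
soup = BeautifulSoup(html_doc, "html.parser")

for a in soup.find_all("a"):
    print(a["href"], a.get_text())
```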
