[Python] Day 7 study check-in: BeautifulSoup4, a parser for web crawlers


Event page: CSDN 21-day learning challenge

The biggest reason to learn is to escape mediocrity: start one day earlier and life has that much more to offer. Dear friends, if you:
want to study a particular technical topic systematically and in depth...
find it hard to keep studying alone and want to learn efficiently in a group...
want to write a blog but can't get started, and need a push to write substantial technical posts...
love writing and want to become a better version of yourself

...

You are welcome to join the CSDN Learning Challenge and become a better you. Refer to the free, high-quality columns from the event's featured bloggers (these resources are free and open for a limited time during the event), then study and document your own learning process according to your field and progress. You can start from any of the following aspects (none is mandatory), or publish column study notes based on your own understanding, as follows:

**Study diary**

1. Learning knowledge points

Installation and setup.

2. Problems encountered in learning

I had not used the API before.

3. Learning gains

How to use the BeautifulSoup4 API.

4. Practical operation

Installation:

1. From the command line (cmd): pip install beautifulsoup4

2. Import the package: from bs4 import BeautifulSoup
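
To check that the installation and import work, a minimal sketch like the following can be run. The HTML string and the variable names are made up for illustration; only BeautifulSoup, find and get_text come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for this check.
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='intro'>Hello, <b>soup</b>!</p></body></html>"
)

# Parse with the parser built into the Python standard library.
soup = BeautifulSoup(html, "html.parser")

# .title is the first <title> tag; .string is its single text child.
title_text = soup.title.string

# find() returns the first matching tag; get_text() joins all nested text.
intro_text = soup.find("p", class_="intro").get_text()

print(title_text)
print(intro_text)
```

If the import fails with ModuleNotFoundError, the pip install step did not target the interpreter that is running the script.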

Parsing library:

1. Python standard library: BeautifulSoup(html, 'html.parser'). Built into Python, with moderate speed and good tolerance of malformed documents. Versions before Python 2.7.3 or Python 3.2.2 tolerate malformed documents poorly.

2. lxml HTML parser: BeautifulSoup(html, 'lxml'). Fast, with good tolerance of malformed documents. Requires lxml, which depends on C libraries, to be installed.

3. lxml XML parser: BeautifulSoup(html, 'xml'). Fast, and the only one of these parsers that supports XML. Requires lxml, which depends on C libraries, to be installed.

4. html5lib parser: BeautifulSoup(html, 'html5lib'). The best tolerance of malformed documents: it parses the way a browser does and produces an HTML5 document. Slow. It is pure Python, so it needs no C extension, but the html5lib package itself must be installed.
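
Since lxml and html5lib are optional installs, one way to handle the choice is to try the parsers in order of preference and fall back to the built-in one. The sketch below is only an illustration: make_soup and its preference order are hypothetical names of mine, not part of BeautifulSoup. It relies on the library raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed.

    Returns a (soup, parser_name) pair.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Deliberately malformed markup: neither <p> nor <b> is closed.
broken = "<p>unclosed paragraph<b>bold"

soup, used = make_soup(broken)
print("parser used:", used)

# Each parser repairs the markup in its own way, so the output can differ.
print(soup)
```

The three HTML parsers may build different trees from the same broken input (lxml and html5lib add html and body wrappers, html.parser does not), so it is worth fixing one parser per project once the results matter.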

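Object types:

1. Tag: an HTML or XML tag (element), with a name and attributes.

2. NavigableString: the text inside a tag.

3. BeautifulSoup: the document as a whole. It can mostly be treated like a Tag, although it has no real name or attributes of its own.

4. Comment: a special kind of NavigableString that holds the text of a comment, without the comment markers.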
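
The four object types can be inspected on a small document. This is a minimal sketch with a made-up HTML snippet and variable names; the classes themselves are imported from bs4.element.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet: one tag holding a text node and a comment.
html = "<p class='note'>plain text<!-- hidden remark --></p>"
soup = BeautifulSoup(html, "html.parser")

# BeautifulSoup: the whole document; it is itself a kind of Tag.
print(type(soup).__name__, isinstance(soup, Tag), soup.name)

# Tag: has a name and dictionary-style attributes.
p = soup.p
print(type(p).__name__, p.name, p["class"])

# The <p> tag has two children: a text node and a comment node.
text_node, comment_node = p.contents

# NavigableString: the text inside the tag.
print(type(text_node).__name__, repr(str(text_node)))

# Comment: a NavigableString subclass; the <!-- --> markers are stripped.
print(type(comment_node).__name__, repr(str(comment_node)))
print(isinstance(comment_node, NavigableString))
```

Because Comment is a subclass of NavigableString, code that collects text nodes may want to check isinstance(node, Comment) first so that comments are not mistaken for visible text.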


Origin blog.csdn.net/qq_34217861/article/details/126433026