[Python web crawler] 150-lecture Python web crawler course notes (8): the bs4 parsing library, BeautifulSoup

1. Introduction to BeautifulSoup

BeautifulSoup is an HTML/XML parser, like lxml. Its main job is to parse HTML/XML documents and extract data from them.

Unlike lxml, BeautifulSoup is based on the HTML DOM (Document Object Model): it loads the whole document and builds the whole DOM tree, so its time and memory overhead are much larger, and its overall performance is lower than lxml's.

 

1.1 Comparison of parsing tools

1.2 Simple to use

If no parser is specified, parsing the document does not raise an error, but a warning is issued:

Warning content:

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

Roughly, this means that no parser was given, so the best available HTML parser (lxml) is used by default; but if you run the same code on a different system or in a different virtual environment, a different parser may be chosen, so the results are not necessarily the same.

The official recommendation is lxml.
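As a small sketch (the html string here is made up for the demo), this is what explicitly specifying a parser looks like. The built-in 'html.parser' is used so the example runs without extra dependencies; per the recommendation above, you would normally pass 'lxml' when it is installed:

```python
from bs4 import BeautifulSoup

# hypothetical sample document, deliberately left incomplete
html = "<html><head><title>Demo</title></head><body><p>hello"

# passing the parser name explicitly silences the UserWarning;
# 'lxml' is the recommended choice when it is installed
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)  # -> Demo
```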

After specifying the parser, we find that even if the html document is incomplete, lxml fills in the missing parts while parsing.

Normally, html uses indentation to distinguish each block. This is also very easy to achieve in BeautifulSoup: just call the soup object's prettify() method.
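A minimal sketch of prettify() on a made-up, incomplete snippet (using the built-in 'html.parser' so it runs without lxml installed); the parser fills in the missing closing tags, and prettify() indents the result:

```python
from bs4 import BeautifulSoup

# hypothetical snippet with no closing tags at all
html = "<html><body><p>The Dormouse's story"

soup = BeautifulSoup(html, 'html.parser')
# indented output, with the missing </p></body></html> filled in
print(soup.prettify())
```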

 

2. Four common objects

2.1 Tag

A Tag is a tag in the html document. To use it, write soup.<tag name> to get that tag's content.

Note: soup.<tag name> returns only the first matching tag in the entire document.

soup = BeautifulSoup(html, 'lxml')
print(soup.p)
print(type(soup.p))  # check the type

# two important attributes: name and attrs
print(soup.head.name)
print(soup.p.name)

print(soup.p.attrs)
print(soup.p['class'])
print(soup.p.get('class'))

soup.p['class'] = 'new'
print(soup.p.attrs)

Result:

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<class 'bs4.element.Tag'>
head
p
{'class': ['title'], 'name': 'dromouse'}
['title']
['title']
{'class': 'new', 'name': 'dromouse'}

2.2 NavigableString

A NavigableString represents the text inside a tag; tag.string returns it as a NavigableString object.

Continuing the example above, import the NavigableString class and use it:
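A small sketch (the sample html is made up for the demo; the built-in 'html.parser' is used) showing that tag.string comes back as a NavigableString:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = "<p class='title'><b>The Dormouse's story</b></p>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.b.string)        # the text inside the <b> tag
print(type(soup.b.string))  # <class 'bs4.element.NavigableString'>
print(isinstance(soup.b.string, str))  # True: NavigableString subclasses str
```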

 

2.3 BeautifulSoup

The BeautifulSoup object represents the entire document. Most of the time it can be treated as a Tag object, and it supports most of the methods for traversing and searching the document tree.

In the BeautifulSoup source code, BeautifulSoup inherits from Tag.

2.3.1 Traverse the document tree

  • contents returns a list of all child nodes
  • children returns an iterator over all child nodes
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.contents)  # returns a list
print(soup.children)  # returns an iterator
for child in soup.children:
    print(child)
  • strings: if a tag contains more than one string, .strings lets you get them in a loop; it returns a generator
for string in soup.strings:
    print(string)
    print(repr(string))
The Dormouse's story
"The Dormouse's story"


'\n'


'\n'
The Dormouse's story
"The Dormouse's story"


'\n'
Once upon a time there were three little sisters; and their names were

'Once upon a time there were three little sisters; and their names were\n'
,

',\n'
Lacie
'Lacie'
 and

' and\n'
Tillie
'Tillie'
;
and they lived at the bottom of a well.
';\nand they lived at the bottom of a well.'


'\n'
...
'...'


'\n'


'\n'

  • stripped_strings: .stripped_strings removes the extra whitespace and blank lines from the output strings; it also returns a generator
for string in soup.stripped_strings:
    print(string)
    # print(repr(string))
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...

  • get_text(): gets all the non-tag strings among a tag's descendants and returns them joined as a single ordinary Python string
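A brief sketch of get_text() on a made-up snippet (built-in 'html.parser'); it also accepts an optional separator and a strip flag:

```python
from bs4 import BeautifulSoup

html = "<p>Once upon a time there were <a href='#'>three</a> little sisters</p>"
soup = BeautifulSoup(html, 'html.parser')

# joins all descendant strings into one plain str
print(soup.p.get_text())
# optional: a separator between pieces, with whitespace stripped
print(soup.p.get_text(" ", strip=True))
```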

2.3.2 Searching the document tree

See the next blog for details:

https://blog.csdn.net/weixin_44566432/article/details/108664325

2.4 Comment

A Comment object is a special type of NavigableString; it holds the text of an html comment, without the comment markers.
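A minimal sketch (the comment text is made up; built-in 'html.parser' used) showing that the string of a tag containing only a comment comes back as a Comment, which is itself a NavigableString:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

html = "<b><!-- This is a comment --></b>"
soup = BeautifulSoup(html, 'html.parser')

comment = soup.b.string
print(comment)        # the comment text, without the <!-- --> markers
print(type(comment))  # <class 'bs4.element.Comment'>
print(isinstance(comment, NavigableString))  # True: Comment subclasses NavigableString
```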

 


Origin blog.csdn.net/weixin_44566432/article/details/108660050