[Web Crawler] 1. The BeautifulSoup library

This document contains my lecture notes and practice summary from the open course taught by Professor Song Tian of Beijing Institute of Technology. The pictures are screenshots taken from the course materials. Thank you, Professor Song.

Key point: extracting web page content really means extracting the content of tags, so the key steps are locating the tags of interest and then reading their content. A tag is obtained from a BeautifulSoup or Tag object by its tag name, e.g. soup.p or soup.body.p. Properties such as tag.name and tag.attrs are accessed directly as Python attributes, and a tag's content is read with the same style of attribute access.

Logical flow: tag tree → tag or set of tags → tag content

I. Getting started with the BeautifulSoup library

1. Overview

The library's role is to parse web page data crawled from the network. The BeautifulSoup class inherits from the Tag class, and it maps an HTML document onto Python objects. A BeautifulSoup object can be understood simply as a wrapper around a set of HTML tags: a contained tag object is accessed by its tag name, e.g. .a or .body. If several tags share the same name, only the first one is returned. To retrieve all of them, use the find_all method described later.
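
The points above can be tried on a small document. This is only a minimal sketch: the HTML string is made up for illustration, and the built-in "html.parser" is an assumed parser choice (any parser supported by bs4 should behave the same way here).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup, not taken from the course materials.
html = (
    '<html><body>'
    '<a id="link1">Basic Python</a>'
    '<a id="link2">Advanced Python</a>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# BeautifulSoup is a subclass of Tag, so the soup object is itself a Tag.
is_tag = isinstance(soup, Tag)
print(is_tag)

# Access by tag name returns only the first matching tag.
first_a = soup.a
print(first_a)

# find_all returns every matching tag.
all_a = soup.find_all("a")
print(len(all_a))
```

Here soup.a gives only the link1 tag, while find_all("a") gives both links.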

2. Basic Tag methods

Tag methods fall into two categories: methods for accessing the tag itself and methods for traversal.
There are four ways of accessing the tag itself:
[Figure: course screenshot listing the four access methods]
Note that tag.string is meant to return the string part of a tag, but when the tag has several children (for example text mixed with nested tags) the access fails and the return value is None. It works normally for simple tags with no nesting. A remedy is the .get_text() method, which returns all the text inside the tag, including the strings of nested tags.
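
The difference between .string and .get_text() can be seen in a short sketch. The markup below is invented for illustration, and the example also shows .name and .attrs, which are mentioned in the key point at the top of these notes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> nests a <b> tag, the second is plain.
html = '<p class="title">Learn <b>Python</b> here</p><p>plain text</p>'
soup = BeautifulSoup(html, "html.parser")

nested, simple = soup.find_all("p")

print(nested.name)        # tag name: p
print(nested.attrs)       # attribute dictionary: {'class': ['title']}
print(simple.string)      # plain text
print(nested.string)      # None, because the tag has several children
print(nested.get_text())  # Learn Python here
```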

The other Tag methods are traversal methods: downward traversal, upward traversal, and parallel traversal.

Downward traversal:

.contents returns a list of the child nodes; its elements are still Tag objects (text nodes appear as NavigableString objects).
.children returns an iterator; it cannot be indexed directly and has to be traversed with a for loop:
for child in soup.body.children:
    print(child)
The .descendants method raised an error when I tested it in Jupyter but ran normally in PyCharm; I am not sure why. (Note that the word is misspelled as "decendants" in the original material.)

# descendant nodes
for son in soup.body.descendants:
    print(son)
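
The three downward attributes can be compared side by side. This is a minimal sketch on made-up markup; the isinstance check against Tag is one assumed way of separating tags from text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup for illustration only.
html = "<body><p>Hello <b>world</b></p><a href='#'>link</a></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents is a plain list of direct children.
print(type(body.contents).__name__, len(body.contents))

# .children iterates over the same direct children.
child_names = [child.name for child in body.children]
print(child_names)

# .descendants walks every nested node, tags and text alike.
all_nodes = list(body.descendants)
descendant_tags = [node.name for node in all_nodes if isinstance(node, Tag)]
print(len(all_nodes), descendant_tags)
```

In this document .children only reaches p and a, while .descendants also reaches the nested b tag and the three text nodes.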

Upward traversal:

.parent returns the parent tag. The parent tag also supports traversal, and its .contents holds the tag's children. In fact, any tag that has .contents supports traversal.

tag.parents returns a generator that yields Tag objects and, at the end, the BeautifulSoup object. For example, iterating over .parents and printing the type of each ancestor gives:
1. <class 'bs4.element.Tag'>
2. <class 'bs4.element.Tag'>
3. <class 'bs4.element.Tag'>
4. <class 'bs4.BeautifulSoup'>
That is, upward traversal ends at the BeautifulSoup object, which is the root.
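
A sketch that reproduces this kind of result is shown below. The markup is hypothetical, chosen so that the a tag has three Tag ancestors before the root is reached.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <a> is nested inside <p>, <body> and <html>.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Collect the class name and tag name of every ancestor of <a>.
ancestors = [(type(parent).__name__, parent.name) for parent in soup.a.parents]
for class_name, tag_name in ancestors:
    print(class_name, tag_name)
```

The last item is the BeautifulSoup object itself, whose name is the special value [document].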

Parallel traversal:

First, tags are in a parallel (sibling) relationship only when they share the same parent node.
The parallel traversal methods are:
[Figure: course screenshot listing the parallel traversal methods]

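Parallel traversal (in either direction) walks the parent node's .contents list, so it sometimes yields newline characters.
Finally, the prettify() method of the bs4 library displays a BeautifulSoup object in a more readable form. Both bs4 and Python 3 default to UTF-8 encoding, so it is best to develop with Python 3 or later.
The find_all method is often combined with regular expressions for information retrieval. find_all can search by tag name, by attribute value, by navigable string, and so on.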

3. The find_all method:

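The find method returns the first element of the list that find_all would return. Common constraints are the tag name (name), the attribute content (attrs), and the string content (string).
For example: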

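import re

for p in soup.find_all('p'):  # all p tags in soup
    print(p)
for tag in soup.find_all(['a', 'p']):  # all a tags and p tags in soup
    print(tag)
for tag in soup.find_all(True):  # all tags in soup
    print(tag.name)
soup.find_all('p', 'course')  # all p tags whose class attribute includes course
soup.find_all(attrs='course')  # all tags whose class attribute is course
for tag in soup.find_all(attrs=re.compile('py')):  # class attribute contains py
    print(tag)
soup.find_all('p', string=re.compile('python'))  # all p tags whose string contains python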
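
To see what these calls return, the sketch below runs several of them on a small made-up page and counts the results. The markup only imitates the kind of page used in the course; it is an assumption, not the original demo file.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page with two <p> tags and two <a> tags.
html = (
    '<html><body>'
    '<p class="title">A python demo page</p>'
    '<p class="course">Courses: '
    '<a class="py1" id="link1">Basic Python</a> '
    '<a class="py2" id="link2">Advanced Python</a></p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("p")                # search by tag name
by_names = soup.find_all(["a", "p"])        # search by a list of names
every_tag = soup.find_all(True)             # every tag in the document
by_class = soup.find_all("p", "course")     # p tags with class "course"
by_string = soup.find_all("p", string=re.compile("python"))

print(len(by_name), len(by_names), len(every_tag))
print(by_class[0]["class"], by_string[0].string)

# find returns the first element that find_all would return.
print(soup.find("a") is soup.find_all("a")[0])
```

Only the first p tag matches the string constraint, because the second p tag contains nested tags and therefore has no single string.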

Note: soup.find_all is quite different from findall in the re library. re.findall returns the substrings of the target string that match the pattern, whereas when a compiled regular expression is passed to the soup method, as in the sixth statement above, it returns every tag whose value matches the pattern. For example:

for tag in soup.find_all(attrs=re.compile('py')):  # class attribute contains py
    print(tag, tag.attrs, sep='\n')

As long as the value stored under the class key of the attribute dictionary contains 'py', and it does not have to be exactly 'py', the tag is returned as one of the results.

Output:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
{'href': 'http://www.icourse163.org/course/BIT-1001870001', 'class': ['py2'], 'id': 'link2'}
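
The contrast between an exact string and a regular expression can be sketched as follows. The two a tags are a cut-down, assumed version of the ones in the output above, and the call to re.findall is included only for comparison.

```python
import re
from bs4 import BeautifulSoup

# Reduced, hypothetical version of the two links shown in the output.
html = (
    '<a class="py1" id="link1">Basic Python</a>'
    '<a class="py2" id="link2">Advanced Python</a>'
)
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal a class value, so nothing matches "py".
exact = soup.find_all(attrs="py")

# A compiled pattern only needs to be found inside a class value.
partial = soup.find_all(attrs=re.compile("py"))

print(len(exact))
print([tag["class"] for tag in partial])

# re.findall, by contrast, returns the matched substrings themselves.
substrings = re.findall("py", "py1 py2")
print(substrings)
```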

The attrs parameter of find_all deserves a mention. A tag's attribute dictionary usually contains several key-value pairs, for example:

{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}

Reading the source code shows that if attrs is not passed as a dictionary, the key defaults to class. To constrain on a key other than class, attrs must be passed as a dictionary. For example:

for tag in soup.find_all(attrs=re.compile('org')): 
    print(tag)
    print(tag.attrs)

for tag in soup.find_all(attrs={"href":re.compile('org')}): 
    print(tag)
    print(tag.attrs)


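Output:
The first loop prints nothing, because find_all returns [] (no class attribute contains 'org').
Output of the second loop: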
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
{'href': 'http://www.icourse163.org/course/BIT-1001870001', 'class': ['py2'], 'id': 'link2'}
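
A self-contained version of this comparison is sketched below. The surrounding p tag and its text are assumptions added so the snippet can run on its own; the two a tags are copied from the output above.

```python
import re
from bs4 import BeautifulSoup

# The <a> tags come from the output above; the <p> wrapper is hypothetical.
html = """
<p class="course">Python courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# A non-dictionary attrs value is matched against the class attribute only.
by_class = soup.find_all(attrs=re.compile("org"))

# A dictionary names the attribute to test, here href.
by_href = soup.find_all(attrs={"href": re.compile("org")})
ids = [tag["id"] for tag in by_href]

print(by_class)
print(ids)
```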

Code links: BeautifulSoup practice


Origin blog.csdn.net/weixin_43522964/article/details/100046711