This article introduces how to use Beautiful Soup, a module for parsing HTML. It is straightforward to operate and easy to use.
Beautiful Soup is a third-party module and needs to be installed, along with the lxml parser used below:

```shell
pip install beautifulsoup4
pip install lxml
```
Beautiful Soup object
Beautiful Soup converts a complex HTML document into a tree structure. Each node is a Python object, and all objects can be grouped into four types:
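Before going through each type in turn, here is a minimal preview sketch showing all four side by side. It uses the built-in `html.parser` and a made-up one-line fragment, not the document's example:

```python
from bs4 import BeautifulSoup

# a tiny made-up fragment: a p tag containing text and a comment
soup = BeautifulSoup('<p class="t">hi<!--note--></p>', 'html.parser')

print(type(soup))                # <class 'bs4.BeautifulSoup'>           (the whole document)
print(type(soup.p))              # <class 'bs4.element.Tag'>
print(type(soup.p.contents[0]))  # <class 'bs4.element.NavigableString'> ("hi")
print(type(soup.p.contents[1]))  # <class 'bs4.element.Comment'>         ("note")
```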
Tag
A Tag gets content through an HTML tag, for example the a tag; if there are multiple matches, the first one is taken.
html = """ <html><head><title>Baidu</title></head> <body> <div> <p class="content">搜索 <a href="http://www.baidu.com" class="link" id="link1"><!--首页--></a>, <a href="http://www.baidu.com/page/3.html" class="link" id="link2">搜索1</a> and <a href="http://www.baidu.com/page/47.html" class="link" id="link3">搜索2</a>; 请点击上面的链接.</p> <p class="content">.这是广告植入.</p> <p class="title">百度</p> </div> </body> </html> """ # tag has two attributes, one is name and the other is attr print (obj.p) print (obj.title) # aidu. com takes the content of the title tag print (obj.prettify ()) # formatted output html obj = BeautifulSoup (html, 'lxml') # Followed by specifying the use of lxml parsing, lxml parsing speed is relatively fast, and fault tolerance is high. # name, the name attribute is the name of the tag, for example, the a tag is the printed name is a # attrs, attrs is the attribute of this tag, for example, the attributes of the a tag above are class, href, id, he is a dictionary # Since attrs is a dictionary, you can use the key to get the value print (obj.a.name) # the name of a label, which is a print (obj.a.attrs) # the attribute of a label, which is class href id these, and what is the corresponding value print (obj.a.attrs ['href']) # Get the href attribute of the a tag, that is, http://www.baidu.com print (obj.a.attrs .get ('href')) # Because attrs is a dictionary, you can also use the .get method to get the same value as the brackets above
NavigableString
A NavigableString is the text content inside a tag; for example, it is how the content of the title tag above is obtained.
```python
print(obj.title.string)        # Baidu
print(obj.a.string)            # Home
print(type(obj.title.string))  # <class 'bs4.element.NavigableString'>
```
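One caveat worth knowing: `.string` only returns the text when the tag has a single string child; with mixed children it returns `None`, and `.get_text()` is the way to grab everything. A small sketch with a made-up fragment and the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><b>bold</b> and plain</div>', 'html.parser')

print(soup.b.string)        # bold            (single string child)
print(soup.div.string)      # None            (the div has mixed children)
print(soup.div.get_text())  # bold and plain  (all nested text concatenated)
```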
BeautifulSoup
The BeautifulSoup object represents the entire HTML document; the obj above is a BeautifulSoup object, and the various tags are operated on through it.

```python
print(type(obj))  # <class 'bs4.BeautifulSoup'>
```
Comment
The Comment object is a special type of NavigableString. Its output does not include the comment markers, which can cause unexpected trouble in text processing if not handled carefully. For example, in the first a tag above, the text "Home" is inside a comment.
```python
print(obj.a.string)        # Home -- the <!-- --> markers are not included
print(type(obj.a.string))  # <class 'bs4.element.Comment'>
# The string here is actually the comment's content; once extracted there are
# no comment markers left, so pay attention to the type
```
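Since a Comment prints without its markers, text-extraction code usually checks the type before trusting `.string`. A sketch of that check (built-in `html.parser`, made-up fragment):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup('<a><!--Home--></a><b>real text</b>', 'html.parser')

texts = []
for tag in soup.find_all(True):  # True matches every tag
    s = tag.string
    if s is not None and not isinstance(s, Comment):
        texts.append(str(s))

print(texts)  # ['real text'] -- the commented-out content is skipped
```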
Key operations
The examples above access content through a specific tag. To search for certain tags or attributes directly, other methods are needed.
Search tags
```python
# find_all(name, attrs, recursive, text, **kwargs)
# find_all searches all current tags, checks them against the filter
# conditions, and returns a list of the ones that match

print(obj.find_all('p'))         # find all p tags
print(obj.find_all(['a', 'p']))  # find all a and p tags

# filter by attribute
print(obj.find_all(id='link1'))             # id is link1
print(obj.find_all(id=['link1', 'link2']))  # id is link1 or link2

# class is a Python keyword, so to filter on the class attribute you cannot
# write class directly; write class_ instead
print(obj.find_all(class_='link'))               # class is link
print(obj.find_all(class_=['link', 'content']))  # class is link or content

# multiple attributes can also be written as a dictionary: the attribute
# name is the key and the attribute value is the value
print(obj.find_all(attrs={'class': 'link', 'id': 'link1'}))

print(obj.find_all('p', class_='content'))  # p tags whose class is content

# The difference between find and find_all: find_all returns every matching
# tag in a list, while find returns a single tag (the first one if there are
# several). Otherwise their usage is the same.
print(type(obj.find(class_='link')))
```
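find_all also takes a few parameters that come up often: `limit` caps the number of results, `string` (called `text` in older versions) matches by text content, and `recursive=False` restricts the search to direct children. A sketch with a made-up fragment (built-in `html.parser`):

```python
from bs4 import BeautifulSoup

html = '<div><a id="x">one</a><a id="y">two</a><a id="z">three</a></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('a', limit=2))          # only the first two a tags
print(soup.find_all(string='two'))          # match by text content: ['two']
print(soup.find_all('a', recursive=False))  # [] -- no a tag is a direct child of the document
```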
css selector
A CSS selector fetches HTML elements the way CSS selects them, which is very convenient for anyone familiar with CSS. In a CSS selector, "." selects a class and "#" selects an id.
```python
print(obj.select('p'))         # select by tag name
print(obj.select('a'))         # select by tag name
print(obj.select('.content'))  # select by class name
print(obj.select('#link1'))    # select by id

# combined search: tags with class link under a p tag
print(obj.select('p .link'))
# combined search: the tag with id link1 under a p tag
print(obj.select('p #link1'))
# combined search: a tags whose own id is link1; without the space, the id
# must be on the a tag itself rather than on a tag at a lower level
print(obj.select('a#link1'))

# tag combination search: a tags that are children of a p tag
print(obj.select('p > a'))

# attribute search: a tags whose class is link
print(obj.select('a[class=link]'))
# combined: under a p tag, find the a tag whose href is
# http://www.baidu.com/page/47.html
print(obj.select('p a[href="http://www.baidu.com/page/47.html"]'))
```
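The CSS-selector counterpart of find is `select_one`, which returns the first match (or `None`) instead of a list, which is handy when only one element is expected. A sketch (built-in `html.parser`, made-up fragment):

```python
from bs4 import BeautifulSoup

html = '<p><a class="link" id="l1">first</a><a class="link" id="l2">second</a></p>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('.link'))         # a list with both a tags
print(soup.select_one('.link'))     # just the first a tag
print(soup.select_one('#missing'))  # None when nothing matches
```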
Node content
Nodes are how the various parts of the HTML tree are reached: for example, other divs at the same level as a div, the child tags below a div, and so on.
```python
# The .contents attribute returns the tag's child nodes as a list
print(obj.div.contents)  # all the tags directly below the div

# .children also returns the child nodes, but as a generator instead of a list
print(obj.div.children)  # printing shows a generator object; loop to get the values
for child in obj.div.children:
    print(child)

# .contents and .children only reach direct children; to also get
# grandchildren and deeper nodes, use .descendants, which is likewise a
# generator
# print(obj.descendants)

# parent and ancestor nodes:
# obj.a.parent gives the a tag's parent node
# obj.a.parents gives the ancestors: the a tag's parent, that parent's
# parent, and so on; the last entries hold the whole document

# siblings:
# obj.a.next_siblings      the siblings after the a tag
# obj.a.previous_siblings  the siblings before the a tag
# obj.a.next_sibling       the next sibling tag
# obj.a.previous_sibling   the previous sibling tag
```
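Putting these pieces together, a typical crawler-style extraction that collects every link's href and text can be sketched like this (built-in `html.parser`, with made-up URLs modeled on the example above):

```python
from bs4 import BeautifulSoup

html = """
<div>
  <a href="http://www.baidu.com/page/3.html" class="link">Search 1</a>
  <a href="http://www.baidu.com/page/47.html" class="link">Search 2</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# pair each a tag's href attribute with its text
links = [(a.get('href'), a.get_text()) for a in soup.find_all('a')]
print(links)
# [('http://www.baidu.com/page/3.html', 'Search 1'),
#  ('http://www.baidu.com/page/47.html', 'Search 2')]
```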
Summary
This article mainly covered how to get the various tags and elements out of HTML. Modifying and deleting are not covered, because crawlers generally do not need to modify anything; getting the data is enough. Both find_all() and the CSS selector are very commonly used; if you are familiar with CSS, the CSS selector is recommended.