This article introduces how to use Beautiful Soup, a module for parsing HTML. It is straightforward to operate and easy to use.
Beautiful Soup is a third-party module and needs to be installed, along with the lxml parser used below:

```shell
pip install beautifulsoup4
pip install lxml
```
Beautiful Soup object
Beautiful Soup converts a complex HTML document into a tree structure. Each node is a Python object, and all objects can be grouped into four types:
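Before going through each type in turn, here is a minimal preview sketch showing all four side by side. It uses the built-in `html.parser` and a made-up one-line fragment, not the document's example:

```python
from bs4 import BeautifulSoup

# a tiny made-up fragment: a p tag containing text and a comment
soup = BeautifulSoup('<p class="t">hi<!--note--></p>', 'html.parser')

print(type(soup))                # <class 'bs4.BeautifulSoup'>           (the whole document)
print(type(soup.p))              # <class 'bs4.element.Tag'>
print(type(soup.p.contents[0]))  # <class 'bs4.element.NavigableString'> ("hi")
print(type(soup.p.contents[1]))  # <class 'bs4.element.Comment'>         ("note")
```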
Tag
A Tag gets content through an HTML tag, for example the a tag; if there are multiple matches, the first one is taken.
html = """ <html><head><title>Baidu</title></head> <body> <div> <p class="content">搜索 <a href="http://www.baidu.com" class="link" id="link1"><!--首页--></a>, <a href="http://www.baidu.com/page/3.html" class="link" id="link2">搜索1</a> and <a href="http://www.baidu.com/page/47.html" class="link" id="link3">搜索2</a>; 请点击上面的链接.</p> <p class="content">.这是广告植入.</p> <p class="title">百度</p> </div> </body> </html> """ # tag has two attributes, one is name and the other is attr print (obj.p) print (obj.title) # aidu. com takes the content of the title tag print (obj.prettify ()) # formatted output html obj = BeautifulSoup (html, 'lxml') # Followed by specifying the use of lxml parsing, lxml parsing speed is relatively fast, and fault tolerance is high. # name, the name attribute is the name of the tag, for example, the a tag is the printed name is a # attrs, attrs is the attribute of this tag, for example, the attributes of the a tag above are class, href, id, he is a dictionary # Since attrs is a dictionary, you can use the key to get the value print (obj.a.name) # the name of a label, which is a print (obj.a.attrs) # the attribute of a label, which is class href id these, and what is the corresponding value print (obj.a.attrs ['href']) # Get the href attribute of the a tag, that is, http://www.baidu.com print (obj.a.attrs .get ('href')) # Because attrs is a dictionary, you can also use the .get method to get the same value as the brackets above
NavigableString
A NavigableString is the text content inside a tag; for example, it is how the content of the title tag above is obtained.
```python
print(obj.title.string)        # Baidu
print(obj.a.string)            # Home
print(type(obj.title.string))  # <class 'bs4.element.NavigableString'>
```
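One caveat worth knowing: `.string` only returns the text when the tag has a single string child; with mixed children it returns `None`, and `.get_text()` is the way to grab everything. A small sketch with a made-up fragment and the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><b>bold</b> and plain</div>', 'html.parser')

print(soup.b.string)        # bold            (single string child)
print(soup.div.string)      # None            (the div has mixed children)
print(soup.div.get_text())  # bold and plain  (all nested text concatenated)
```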
BeautifulSoup
The BeautifulSoup object represents the entire HTML document; the obj above is a BeautifulSoup object, and the various tags are operated on through it.

```python
print(type(obj))  # <class 'bs4.BeautifulSoup'>
```
Comment
The Comment object is a special type of NavigableString. Its output does not include the comment markers, which can cause unexpected trouble in text processing if not handled carefully. For example, in the first a tag above, the text "Home" is inside a comment.
```python
print(obj.a.string)        # Home -- the <!-- --> markers are not included
print(type(obj.a.string))  # <class 'bs4.element.Comment'>
# The string here is actually the comment's content; once extracted there are
# no comment markers left, so pay attention to the type
```
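Since a Comment prints without its markers, text-extraction code usually checks the type before trusting `.string`. A sketch of that check (built-in `html.parser`, made-up fragment):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup('<a><!--Home--></a><b>real text</b>', 'html.parser')

texts = []
for tag in soup.find_all(True):  # True matches every tag
    s = tag.string
    if s is not None and not isinstance(s, Comment):
        texts.append(str(s))

print(texts)  # ['real text'] -- the commented-out content is skipped
```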
Key operations
The examples above access content through a specific tag. To search for certain tags or attributes directly, other methods are needed.
Search tags
```python
# find_all(name, attrs, recursive, text, **kwargs)
# find_all searches all current tags, checks them against the filter
# conditions, and returns a list of the ones that match

print(obj.find_all('p'))         # find all p tags
print(obj.find_all(['a', 'p']))  # find all a and p tags

# filter by attribute
print(obj.find_all(id='link1'))             # id is link1
print(obj.find_all(id=['link1', 'link2']))  # id is link1 or link2

# class is a Python keyword, so to filter on the class attribute you cannot
# write class directly; write class_ instead
print(obj.find_all(class_='link'))               # class is link
print(obj.find_all(class_=['link', 'content']))  # class is link or content

# multiple attributes can also be written as a dictionary: the attribute
# name is the key and the attribute value is the value
print(obj.find_all(attrs={'class': 'link', 'id': 'link1'}))

print(obj.find_all('p', class_='content'))  # p tags whose class is content

# The difference between find and find_all: find_all returns every matching
# tag in a list, while find returns a single tag (the first one if there are
# several). Otherwise their usage is the same.
print(type(obj.find(class_='link')))
```
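find_all also takes a few parameters that come up often: `limit` caps the number of results, `string` (called `text` in older versions) matches by text content, and `recursive=False` restricts the search to direct children. A sketch with a made-up fragment (built-in `html.parser`):

```python
from bs4 import BeautifulSoup

html = '<div><a id="x">one</a><a id="y">two</a><a id="z">three</a></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('a', limit=2))          # only the first two a tags
print(soup.find_all(string='two'))          # match by text content: ['two']
print(soup.find_all('a', recursive=False))  # [] -- no a tag is a direct child of the document
```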
css selector
A CSS selector fetches HTML elements the way CSS selects them, which is very convenient for anyone familiar with CSS. In a CSS selector, "." selects a class and "#" selects an id.
```python
print(obj.select('p'))         # select by tag name
print(obj.select('a'))         # select by tag name
print(obj.select('.content'))  # select by class name
print(obj.select('#link1'))    # select by id

# combined search: tags with class link under a p tag
print(obj.select('p .link'))
# combined search: the tag with id link1 under a p tag
print(obj.select('p #link1'))
# combined search: a tags whose own id is link1; without the space, the id
# must be on the a tag itself rather than on a tag at a lower level
print(obj.select('a#link1'))

# tag combination search: a tags that are children of a p tag
print(obj.select('p > a'))

# attribute search: a tags whose class is link
print(obj.select('a[class=link]'))
# combined: under a p tag, find the a tag whose href is
# http://www.baidu.com/page/47.html
print(obj.select('p a[href="http://www.baidu.com/page/47.html"]'))
```
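The CSS-selector counterpart of find is `select_one`, which returns the first match (or `None`) instead of a list, which is handy when only one element is expected. A sketch (built-in `html.parser`, made-up fragment):

```python
from bs4 import BeautifulSoup

html = '<p><a class="link" id="l1">first</a><a class="link" id="l2">second</a></p>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('.link'))         # a list with both a tags
print(soup.select_one('.link'))     # just the first a tag
print(soup.select_one('#missing'))  # None when nothing matches
```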
Node content
Nodes are how the various parts of the HTML tree are reached: for example, other divs at the same level as a div, the child tags below a div, and so on.
```python
# The .contents attribute returns the tag's child nodes as a list
print(obj.div.contents)  # all the tags directly below the div

# .children also returns the child nodes, but as a generator instead of a list
print(obj.div.children)  # printing shows a generator object; loop to get the values
for child in obj.div.children:
    print(child)

# .contents and .children only reach direct children; to also get
# grandchildren and deeper nodes, use .descendants, which is likewise a
# generator
# print(obj.descendants)

# parent and ancestor nodes:
# obj.a.parent gives the a tag's parent node
# obj.a.parents gives the ancestors: the a tag's parent, that parent's
# parent, and so on; the last entries hold the whole document

# siblings:
# obj.a.next_siblings      the siblings after the a tag
# obj.a.previous_siblings  the siblings before the a tag
# obj.a.next_sibling       the next sibling tag
# obj.a.previous_sibling   the previous sibling tag
```
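Putting these pieces together, a typical crawler-style extraction that collects every link's href and text can be sketched like this (built-in `html.parser`, with made-up URLs modeled on the example above):

```python
from bs4 import BeautifulSoup

html = """
<div>
  <a href="http://www.baidu.com/page/3.html" class="link">Search 1</a>
  <a href="http://www.baidu.com/page/47.html" class="link">Search 2</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# pair each a tag's href attribute with its text
links = [(a.get('href'), a.get_text()) for a in soup.find_all('a')]
print(links)
# [('http://www.baidu.com/page/3.html', 'Search 1'),
#  ('http://www.baidu.com/page/47.html', 'Search 2')]
```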
Summary
This article mainly covered how to get the various tags and elements out of HTML. Modifying and deleting are not covered, because crawlers generally do not need to modify anything; getting the data is enough. Both find_all() and the CSS selector are very commonly used; if you are familiar with CSS, the CSS selector is recommended.