Introduction to parsing HTML with BeautifulSoup

The data a crawler collects is mostly HTML, and sometimes XML. XML tags are parsed the same way as HTML tags: both use <tag> markers to delimit the data. Data in this format is laid out like a page rather than as plain records, which makes it tedious to parse by hand. BeautifulSoup provides powerful parsing functions that save a lot of that trouble. Install BeautifulSoup and lxml before use.


    pip install beautifulsoup4==4.0.1  # pins the version; if omitted, the latest version is installed
    pip install lxml==3.3.6            # pins the version; if omitted, the latest version is installed

Start the Python interpreter to check that the installation succeeded:

    >>> import bs4
    >>> import lxml
    >>>

If no error is reported, the installation succeeded. The versions and release dates of lxml can be looked up online, for example in the release history on the package's PyPI page.
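The import check can also be scripted. The sketch below is not from the original post: it assumes Python 3.8 or later, and the helper name installed_version is made up for illustration. It uses the standard library's importlib.metadata to report which version of each package is installed, or None when the package is missing.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a pip package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names exactly as they are passed to pip install
for package in ("beautifulsoup4", "lxml"):
    print(package, installed_version(package))
```

A None in the output means the corresponding pip install step still needs to be run.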

First, import the library in your code:


from bs4 import BeautifulSoup

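Then fetch the page (shown here with Python 3's urllib.request):


    import sys
    import urllib.error
    import urllib.request

    try:
        r = urllib.request.urlopen(request)  # request: the URL or Request object built beforehand
    except urllib.error.URLError as e:
        # only HTTPError carries .code; a plain URLError has just .reason
        print(getattr(e, 'code', e.reason))
        sys.exit()
    print(r.status)  # HTTP status code
    html = r.read()  # read() returns the whole response body as bytes
    mysoup = BeautifulSoup(html, 'lxml', from_encoding='utf8')  # the parsed document is now in mysoup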
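Because read() hands back raw bytes, the encoding matters as soon as the page contains non-ASCII text. As a small illustration, the sketch below makes two assumptions: a hand-written UTF-8 byte string stands in for the downloaded page, and Python's built-in 'html.parser' replaces 'lxml' so the snippet runs without extra packages. The from_encoding argument tells BeautifulSoup how to decode the bytes.

```python
from bs4 import BeautifulSoup

# Stand-in for r.read(): UTF-8 encoded bytes (made-up sample, not a real download)
html = "<data><name>张三</name></data>".encode("utf-8")

# from_encoding names the encoding BeautifulSoup should use to decode the bytes
mysoup = BeautifulSoup(html, "html.parser", from_encoding="utf-8")

name = mysoup.find("name").get_text()
print(name)
```

If from_encoding is left out, BeautifulSoup tries to detect the encoding on its own, which usually works but can guess wrong on short or mixed input.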

Suppose we are interested in the following part of the HTML:


    <data>
        <day>20200214</day>
        <id>1</id>
        <rank>11</rank>
        <name>张三</name>
    </data>
    <data>
        <day>20200214</day>
        <id>4</id>
        <rank>17</rank>
        <name>李四货</name>
    </data>

First, find the elements whose tag is <data>. There can be more than one; this example has two. That calls for BeautifulSoup's find_all function, which here should return two <data> elements. Within each <data> element, tags such as <id> and <name> occur only once, so the find function is enough for them.


    mysoup = BeautifulSoup(html, 'lxml')
    data_list = mysoup.find_all('data')
    for data in data_list:  # the list should have two elements
        day = data.find('day').get_text()  # get_text() returns the text as a string; .string also works here
        id = data.find('id').get_text()
        rank = data.find('rank').get_text()
        name = data.find('name').get_text()
        # print(name)  # uncomment to check the parsed result
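Putting the pieces together, here is a self-contained sketch that runs on the sample data. A few assumptions: the sample is pasted in as a string instead of being downloaded, Python's built-in 'html.parser' is used so that lxml is not required (passing 'lxml' as in the text should give the same result for this input), and collecting the fields into a list of dictionaries named records is just one convenient choice.

```python
from bs4 import BeautifulSoup

html = """
<data>
    <day>20200214</day>
    <id>1</id>
    <rank>11</rank>
    <name>张三</name>
</data>
<data>
    <day>20200214</day>
    <id>4</id>
    <rank>17</rank>
    <name>李四货</name>
</data>
"""

mysoup = BeautifulSoup(html, "html.parser")

records = []
for data in mysoup.find_all("data"):  # one iteration per <data> element
    record = {}
    for field in ("day", "id", "rank", "name"):
        # find() returns the first matching child tag; get_text() returns its text
        record[field] = data.find(field).get_text()
    records.append(record)

for record in records:
    print(record["id"], record["name"])
```

Note that find returns None when a tag is missing, so on real pages it is worth checking the result before calling get_text on it.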

This is the simplest way to use BeautifulSoup. Besides locating elements by tag name, find and find_all can also match on attributes such as class and style, or on the text content, which makes them very flexible.
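As a rough illustration of searching by attribute and by text, the sketch below uses a made-up HTML fragment (the tags, classes, and fruit names are hypothetical). It assumes a reasonably recent Beautiful Soup 4 release, since the class_ and string keyword arguments are not available in the earliest 4.x versions, and it again uses the built-in 'html.parser'.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment, for illustration only
html = """
<div class="item" style="color:red">apple</div>
<div class="item">banana</div>
<p class="note">apple pie</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Match on the class attribute (class_ avoids clashing with Python's class keyword)
by_class = soup.find_all("div", class_="item")

# Match on any attribute through the attrs dictionary
by_style = soup.find_all(attrs={"style": "color:red"})

# Match on text content; with only string given, the results are the text nodes themselves
by_text = soup.find_all(string=re.compile("apple"))

print([tag.get_text() for tag in by_class])
print([tag.get_text() for tag in by_style])
print([str(text) for text in by_text])
```

The conditions can be combined in one call, for example a tag name together with class_, to narrow the match further.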



Origin blog.51cto.com/15080029/2642972