One. Parsing data with regular expressions
The goal is to extract every news title and url from Baidu News. Inspecting the page source shows that each title and url sit inside an <a> </a> tag, but the attributes on those tags differ from one link to the next, so a single regular expression cannot match and extract all of the titles and urls.
The target value is fixed, so it can be hard-coded in the regex. The class value is also fixed and can be hard-coded as well, but class does not appear on every a tag (my own idea is to write two regexes, one with class and one without, and merge the results at the end). The mon values differ from tag to tag, so they have to be captured by the regex and post-processed afterwards if necessary. The code is as follows (with class; the version without class works the same way):
import re
import requests

url = 'http://news.baidu.com/'
headers = {
    "User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
# response.text guesses the encoding and is less accurate, so decode the raw bytes instead
data = requests.get(url, headers=headers).content.decode()
# parse the data with a regex; [\u4e00-\u9fa5] matches Chinese characters
# the title group is non-greedy so each match stops at the first closing </a>
pattern = re.compile('<a href="(.*?)" target="_blank" class="a3" mon="(.*?)"[^>]*>(.*?)</a>')
result = pattern.findall(data)
print(result)
(Figure: partial screenshot of the printed results.)
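The two-pattern idea described above (one regex with class, one without, then merge the results) can be sketched as follows. This is only a minimal sketch: the sample markup, the example.com urls, and the helper name extract_links are assumptions made for illustration, and the real Baidu News markup may differ.

```python
import re

# Hypothetical sample markup: the first link has class="a3", the second has no class.
SAMPLE_HTML = (
    '<li><a href="http://example.com/1" target="_blank" class="a3" '
    'mon="ct=1&a=2&c=top&pn=1">First title</a></li>\n'
    '<li><a href="http://example.com/2" target="_blank" '
    'mon="ct=1&a=2&c=top&pn=2">Second title</a></li>\n'
)

# [^"]* keeps each attribute group inside its own pair of quotes.
WITH_CLASS = re.compile(
    r'<a href="([^"]*)" target="_blank" class="a3" mon="([^"]*)">(.*?)</a>'
)
WITHOUT_CLASS = re.compile(
    r'<a href="([^"]*)" target="_blank" mon="([^"]*)">(.*?)</a>'
)


def extract_links(html):
    """Run both patterns and merge the (url, mon, title) tuples into one list."""
    return WITH_CLASS.findall(html) + WITHOUT_CLASS.findall(html)


for url, mon, title in extract_links(SAMPLE_HTML):
    print(title, url, mon)
```

Because the two patterns differ only in the class attribute, no link can be matched by both, so the merged list needs no de-duplication in this sketch.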
Two. Parsing data with XPath
1. Install lxml, a parsing library that supports both HTML and XML:
pip install lxml
2. Convert the data into a type that can be parsed (requires from lxml import etree)
xpath_data = etree.HTML(data)
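To see what the conversion step produces, here is a small offline sketch. The sample markup is hypothetical and deliberately incomplete (no html or body tags), to show that the HTML parser fills in the missing structure.

```python
from lxml import etree

# Hypothetical fragment without <html> or <body>.
SAMPLE_HTML = '<ul><li><a href="/news/1">First title</a></li></ul>'

xpath_data = etree.HTML(SAMPLE_HTML)

# The result is an lxml element (not a str), rooted at an <html> tag added by the parser.
print(type(xpath_data))
print(xpath_data.tag)
print(etree.tostring(xpath_data, encoding='unicode'))
```

Only this element object has the xpath method; calling xpath on the raw string would fail.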
3. xpath syntax
1. "/" selects nodes one level at a time, starting from the root
result = xpath_data.xpath('/html/head/title//text()')  # get the content by walking the nodes in order
2. "//" selects matching nodes at any depth
result = xpath_data.xpath('//a/text()')  # get the content across node levels
3. Select a specific tag: //a[@attribute="attribute value"]
result = xpath_data.xpath('//a[@mon="ct=1&a=2&c=top&pn=18"]')  # returns the a tag object
result = xpath_data.xpath('//a[@mon="ct=1&a=2&c=top&pn=18"]/text()')  # returns the content
4. Get the url of an a tag: @href
result = xpath_data.xpath('//a[@mon="ct=1&a=2&c=top&pn=18"]/@href')
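The four syntax forms above can be tried offline on a small page. This is a sketch under assumptions: the sample markup and the example.com urls are made up, and only the mon value format is taken from the examples above.

```python
from lxml import etree

# Hypothetical sample page. "&" is written as "&amp;" in the markup;
# the parser decodes it, so the XPath string uses a plain "&".
SAMPLE_HTML = (
    '<html><head><title>Sample news</title></head><body>'
    '<div><a href="http://example.com/1" '
    'mon="ct=1&amp;a=2&amp;c=top&amp;pn=18">First title</a></div>'
    '<ul><li><a href="http://example.com/2" '
    'mon="ct=1&amp;a=2&amp;c=top&amp;pn=19">Second title</a></li></ul>'
    '</body></html>'
)
xpath_data = etree.HTML(SAMPLE_HTML)

# 1. "/" walks one level at a time from the root.
page_title = xpath_data.xpath('/html/head/title/text()')
# <a> is never a direct child of <body> here, so this step-by-step path finds nothing.
direct_links = xpath_data.xpath('/html/body/a/text()')
# 2. "//" matches at any depth.
all_links = xpath_data.xpath('//a/text()')
# 3. A predicate narrows the match to one attribute value.
one_link = xpath_data.xpath('//a[@mon="ct=1&a=2&c=top&pn=18"]/text()')
# 4. "@href" returns the attribute value instead of the text.
one_url = xpath_data.xpath('//a[@mon="ct=1&a=2&c=top&pn=18"]/@href')

print(page_title, direct_links, all_links, one_link, one_url)
```

Note that xpath always returns a list, even when only one node matches, so a single value has to be taken with an index such as one_url[0].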
Full code:
import re
import requests
# install lxml, a parsing library that supports HTML and XML
# pip install lxml
from lxml import etree

url = 'http://news.baidu.com/'
headers = {
    "User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
# response.text guesses the encoding and is less accurate, so decode the raw bytes instead
data = requests.get(url, headers=headers).content.decode()
# 1. convert the data into a parsable type
xpath_data = etree.HTML(data)
# 2. call the xpath method
result = xpath_data.xpath('/html/head/title//text()')
result = xpath_data.xpath('//a/text()')
result = xpath_data.xpath('//a[@mon="ct=1&a=2&c=top&pn=18"]')
result = xpath_data.xpath('//a[@mon="ct=1&a=2&c=top&pn=18"]/@href')
result = xpath_data.xpath('//li/a/text()')
print(result)
# save the page locally so the markup can be inspected
with open('02news.html', 'w', encoding='utf-8') as f:
    f.write(data)
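The script above prints only the titles. One possible way to keep each title together with its url is to select the a elements first and read both values from each element. This is a sketch, not the original author's code: the sample markup and the helper name extract_titles_and_urls are assumptions.

```python
from lxml import etree

# Hypothetical sample markup: one plain link, one with nested markup, one without href.
SAMPLE_HTML = (
    '<ul>'
    '<li><a href="http://example.com/1">First title</a></li>'
    '<li><a href="http://example.com/2"><b>Second</b> title</a></li>'
    '<li><a>No link here</a></li>'
    '</ul>'
)


def extract_titles_and_urls(html):
    """Return a list of (title, url) pairs for every <li><a href=...> link."""
    xpath_data = etree.HTML(html)
    pairs = []
    # [@href] skips a tags that carry no url at all.
    for a in xpath_data.xpath('//li/a[@href]'):
        # .//text() also collects text inside nested tags such as <b>.
        title = ''.join(a.xpath('.//text()')).strip()
        pairs.append((title, a.get('href')))
    return pairs


for title, url in extract_titles_and_urls(SAMPLE_HTML):
    print(title, url)
```

Reading both values from the same element avoids the mismatch that can occur when titles and urls are collected in two separate xpath calls and one of the lists has a missing entry.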
Three. Exercise 1: crawl the btc forum titles and their corresponding urls