Commercial crawler study notes, day 6

One. Parsing data with regular expressions

The goal is to parse the title and url of each news item on Baidu News. Inspecting the page source shows that every title and url sits inside an <a></a> tag. Because the attributes of these tags do not all take the same form, a single regular expression cannot match and extract every title and url, as shown below.

 The value of target is fixed, so it can be hard-coded in the regular expression. The value of class is fixed too and can also be hard-coded, but class does not appear on every a tag (my own idea is to write two regular expressions, one with class and one without, and then merge the results). The mon values differ from tag to tag, so they have to be captured by the regular expression and post-processed afterwards if needed. The code is as follows (for the case with class; the case without class works the same way).

 

import re
import requests

url = 'http://news.baidu.com/'
headers = {
    "User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

# response.text guesses the encoding and can be inaccurate, so decode the raw bytes instead
data = requests.get(url, headers=headers).content.decode()

# Parse the data with a regular expression ([\u4e00-\u9fa5] matches Chinese characters)
pattern = re.compile(r'<a href="(.*?)" target="_blank" class="a3" mon="(.*?)"(.*?)</a>')
result = pattern.findall(data) 
print(result)

The results are as follows (partial screenshot):
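
The two-pattern idea described above (one regular expression for a tags with class, one for tags without, then merging the results) can be sketched offline. This is only a minimal sketch: the sample markup, the example.com urls and the helper name extract_links are assumptions made for illustration, not the real Baidu News page.

```python
import re

# Hypothetical markup standing in for the real page: one link with class, one without
sample_html = '''
<li><a href="http://example.com/news/1" target="_blank" class="a3" mon="ct=1&a=2&c=top&pn=1">First headline</a></li>
<li><a href="http://example.com/news/2" target="_blank" mon="ct=1&a=2&c=top&pn=2">Second headline</a></li>
'''

# target and class are hard-coded; href, mon and the title are captured
with_class = re.compile(r'<a href="(.*?)" target="_blank" class="a3" mon="(.*?)">(.*?)</a>')
without_class = re.compile(r'<a href="(.*?)" target="_blank" mon="(.*?)">(.*?)</a>')


def extract_links(html):
    """Run both patterns and merge the matches into (url, title) pairs."""
    matches = with_class.findall(html) + without_class.findall(html)
    # The mon value is captured but dropped here; keep it if it needs post-processing
    return [(url, title) for url, _mon, title in matches]


links = extract_links(sample_html)
for url, title in links:
    print(title, url)
```

Because "." does not match a newline by default, each pattern stays within one line of the sample, so neither pattern picks up the other's link. On a real page the attribute order and spacing may differ, so the patterns would likely need adjusting.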

Two. Parsing data with XPath

 1. Install lxml, a parsing library that supports both HTML and XML:

pip install lxml

 2. Convert the data into a parsable type

xpath_data = etree.HTML(data)
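
To see what the conversion produces, here is a small sketch that feeds etree.HTML a short string instead of the downloaded page. The html_text fragment is made up for illustration; it deliberately leaves out the html and body tags and one closing li tag.

```python
from lxml import etree

# Hypothetical fragment: no <html>/<body> wrapper and a missing </li>
html_text = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></ul>'

# etree.HTML parses the string and returns the root element of the tree
xpath_data = etree.HTML(html_text)

print(type(xpath_data))  # an lxml element object, not a str
print(xpath_data.tag)    # the root element is <html>

# Serialize the tree back to text to see the repaired markup
print(etree.tostring(xpath_data, encoding='unicode'))
```

The parser wraps the fragment in html and body elements and closes the unclosed li, which is why xpath can be called on the result even when the source markup is not perfectly formed.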

 3. XPath syntax

1. "/" selects a direct child node, one level at a time

result = xpath_data.xpath('/html/head/title//text()')  # get the content by walking the nodes in order

2. "//" skips across levels and matches nodes at any depth

result = xpath_data.xpath('//a/text()')  # get the content across levels
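
The difference between "/" and "//" is easier to see on a tiny document. The sample page below is an assumption made up for this sketch, with one link nested inside a div and one directly under body.

```python
from lxml import etree

# Hypothetical page: one <a> nested in a <div>, one directly under <body>
sample = (
    '<html><head><title>Demo page</title></head>'
    '<body><div><a href="/x">inner link</a></div>'
    '<a href="/y">outer link</a></body></html>'
)
tree = etree.HTML(sample)

# "/" walks one level at a time from the root
title = tree.xpath('/html/head/title/text()')

# Only the <a> that is a direct child of <body> matches
direct = tree.xpath('/html/body/a/text()')

# "//" matches every <a> regardless of depth, in document order
anywhere = tree.xpath('//a/text()')

print(title, direct, anywhere)
```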

3. Select a specific tag by attribute: //a[@attribute="value"]

result = xpath_data.xpath('//a[@mon="ct=1&a=2&c=top&pn=18"]')  # get the a tag object
result = xpath_data.xpath('//a[@mon="ct=1&a=2&c=top&pn=18"]/text()')  # get the content

4. Get the url of an a tag: @href

result = xpath_data.xpath('//a[@mon="ct=1&a=2&c=top&pn=18"]/@href')
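
The attribute selection and @href steps can be tried on a small sample. The markup and example.com urls below are assumptions for illustration; the ampersands are written as &amp; in the sample source, and the parser decodes them, so the XPath string uses a plain "&".

```python
from lxml import etree

# Hypothetical list of news links; only the mon value tells them apart
sample = '''
<ul>
  <li><a href="http://example.com/news/18" mon="ct=1&amp;a=2&amp;c=top&amp;pn=18">Headline 18</a></li>
  <li><a href="http://example.com/news/19" mon="ct=1&amp;a=2&amp;c=top&amp;pn=19">Headline 19</a></li>
</ul>
'''
tree = etree.HTML(sample)

selector = '//a[@mon="ct=1&a=2&c=top&pn=18"]'

elements = tree.xpath(selector)           # list of matching element objects
texts = tree.xpath(selector + '/text()')  # the text inside the tag
hrefs = tree.xpath(selector + '/@href')   # the value of the href attribute

print(elements[0].tag, texts, hrefs)
```

Note that xpath always returns a list, even when only one node matches, so the single result has to be indexed out.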

 Complete code

import re
import requests

# Install lxml, a parsing library that supports both HTML and XML
# pip install lxml
from lxml import etree

url = 'http://news.baidu.com/'
headers = {
    "User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

# response.text guesses the encoding and can be inaccurate, so decode the raw bytes instead
data = requests.get(url, headers=headers).content.decode()

# 1. Convert the data into a parsable type
xpath_data = etree.HTML(data)

# 2. Call the xpath method
result = xpath_data.xpath('/html/head/title//text()')
result = xpath_data.xpath('//a/text()')
result = xpath_data.xpath('//a[@mon="ct=1&a=2&c=top&pn=18"]')
result = xpath_data.xpath('//a[@mon="ct=1&a=2&c=top&pn=18"]/@href')
result = xpath_data.xpath('//li/a/text()')

print(result)

with open('02news.html', 'w', encoding='utf-8') as f:
  f.write(data)
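
The code above collects titles and urls in separate xpath calls. One possible way to keep each title paired with its own url is to select the a elements first and read both values from each element. This is a sketch under assumptions: the sample markup and the helper name extract_news are made up for illustration, and the real page structure may differ.

```python
from lxml import etree

# Hypothetical markup: a plain link, a link with nested markup, and one without href
sample = '''
<ul>
  <li><a href="http://example.com/news/1">First headline</a></li>
  <li><a href="http://example.com/news/2"><b>Second</b> headline</a></li>
  <li><a>No link here</a></li>
</ul>
'''


def extract_news(html_text):
    """Return a list of {'title', 'url'} dicts, one per <li><a> that has an href."""
    tree = etree.HTML(html_text)
    items = []
    for a in tree.xpath('//li/a'):
        # './/text()' also collects text inside nested tags such as <b>
        title = ''.join(a.xpath('.//text()')).strip()
        href = a.get('href')  # None when the attribute is missing
        if title and href:
            items.append({'title': title, 'url': href})
    return items


news = extract_news(sample)
for item in news:
    print(item['title'], item['url'])
```

Reading both values from the same element avoids the two lists drifting out of step when some a tags have no href or no text.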

 

Three. Exercise 1: crawl the titles of the btc forum and their corresponding urls


Origin www.cnblogs.com/jj1106/p/11223218.html