04 Three Data Parsing Methods

Introduction

  In most cases the requirement calls for a focused crawler: one that crawls only the specified portion of data on a page rather than the entire page. That is where data parsing comes in. The data crawling process is therefore:

  • Specify the URL
  • Initiate the request with the requests module
  • Acquire the response data
  • Parse the data
  • Persist the data to storage

  Data parsing:

  - is applied in focused crawlers.

  - the data to be parsed sits either between a tag's opening and closing tags or in a tag's attributes. A minimal sketch of the whole flow follows.
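A minimal sketch of the five stages above. The URL (https://httpbin.org/html), the regex, and the output file name are placeholders chosen purely for illustration, not part of the original examples:

import requests
import re

# 1. specify the url (httpbin.org is just a placeholder page for illustration)
url = 'https://httpbin.org/html'

# 2. initiate the request with the requests module
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url=url, headers=headers)

# 3. acquire the response data
page_text = response.text

# 4. data parsing (here: a trivial regex pulling the <h1> text)
titles = re.findall(r'<h1>(.*?)</h1>', page_text, re.S)

# 5. persistent storage
with open('./titles.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(titles))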

 

1. Regular expression parsing

Common regular expressions:

Single character:
    .  : any character except newline
    [] : character set, e.g. [aoe], [a-w] matches any one character in the set
    \d : digit, same as [0-9]
    \D : non-digit
    \w : digit, letter, underscore, or Chinese character
    \W : non-\w
    \s : any whitespace character, including space, tab, form feed, etc.; equivalent to [ \f\n\r\t\v]
    \S : non-whitespace

Quantifiers:
    *     : any number of times, >= 0
    +     : at least once, >= 1
    ?     : optional, 0 or 1 time
    {m}   : exactly m times, e.g. hello{3}
    {m,}  : at least m times
    {m,n} : from m to n times

Boundaries:
    $ : ends with
    ^ : starts with

Grouping:
    (ab)

Greedy mode: .*
Non-greedy (lazy) mode: .*?

re.I : ignore case
re.M : multi-line match
re.S : single-line mode (. also matches newline)

re.sub(pattern, replacement, string)
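A short, self-contained sketch exercising a few of the points from the reference above: greedy vs. non-greedy matching, re.S for patterns spanning newlines, grouping, and re.sub. The HTML snippet here is made up for illustration:

import re

# a made-up html snippet, spread over several lines
html = '''<div class="thumb">
<img src="/pic/1.jpg" alt="a">
</div>
<div class="thumb">
<img src="/pic/2.jpg" alt="b">
</div>'''

# greedy .* (with re.S so . also matches newlines) runs to the LAST </div>: one oversized match
print(re.findall(r'<div class="thumb">.*</div>', html, re.S))

# non-greedy .*? stops at the first </div>: one match per div
print(re.findall(r'<div class="thumb">.*?</div>', html, re.S))

# grouping () captures only the src attribute value
print(re.findall(r'<img src="(.*?)"', html))        # ['/pic/1.jpg', '/pic/2.jpg']

# re.sub(pattern, replacement, string): replace every digit (\d) with '#'
print(re.sub(r'\d', '#', 'page 12 of 36'))          # 'page ## of ##'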

Example: crawl all pictures from the Qiushibaike (糗事百科) funny-picture section

import requests
import re
import os

# create a folder for the images
if not os.path.exists('./qiutuLibs'):
    os.mkdir('./qiutuLibs')

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

# generic url template
url = 'https://www.qiushibaike.com/pic/page/%d/?s=5185803'

for page in range(1, 36):
    new_url = format(url % page)
    page_text = requests.get(url=new_url, headers=headers).text

    # data parsing (extract the image addresses)
    ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
    src_list = re.findall(ex, page_text, re.S)

    # the src attribute value is not a complete url: the protocol header is missing
    for src in src_list:
        src = 'https:' + src
        # request each image url separately; the image data in the response is binary, so use .content
        img_data = requests.get(url=src, headers=headers).content
        img_name = src.split('/')[-1]

        img_path = './qiutuLibs/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
            print(img_name, 'downloaded successfully!')

 

2. XPath parsing

  Installation: pip install lxml -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

  XPath parsing process:

        - 1. Instantiate an etree object and load the page source data to be parsed into it.

        - 2. Call the etree object's xpath method with an xpath expression to locate tags and extract data, as sketched below.
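A minimal sketch of the two steps on a made-up HTML string (the markup, class names, and values here are invented for illustration):

from lxml import etree

html = '''<html><body>
<div class="job-list">
  <ul>
    <li><a href="/job/1">python crawler engineer</a><span>15k-25k</span></li>
    <li><a href="/job/2">data analyst</a><span>10k-18k</span></li>
  </ul>
</div>
</body></html>'''

# 1. instantiate an etree object and load the page source into it
tree = etree.HTML(html)

# 2. call the xpath method with an xpath expression to locate tags and extract data
names = tree.xpath('//div[@class="job-list"]/ul/li/a/text()')   # text of every a tag
links = tree.xpath('//div[@class="job-list"]/ul/li/a/@href')    # value of every href attribute
print(names)   # ['python crawler engineer', 'data analyst']
print(links)   # ['/job/1', '/job/2']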

 

Example 1: BOSS Zhipin job listings

import requests
from lxml import etree
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

url = 'https://www.zhipin.com/job_detail/?query=python%E7%88%AC%E8%99%AB&city=101010100&industry=&position='
page_text = requests.get(url=url, headers=headers).text

tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="job-list"]/ul/li')
job_data_list = []

for li in li_list:
    # local parsing: the expression must start with ./ so it is evaluated against the li element
    job_name = li.xpath('.//div[@class="info-primary"]/h3/a/div/text()')[0]
    salary = li.xpath('.//div[@class="info-primary"]/h3/a/span/text()')[0]
    company = li.xpath('.//div[@class="company-text"]/h3/a/text()')[0]
    detail_url = 'https://www.zhipin.com' + li.xpath('.//div[@class="info-primary"]/h3/a/@href')[0]

    # page source of the detail page
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    detail_tree = etree.HTML(detail_page_text)
    job_desc = detail_tree.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]/div//text()')
    job_desc = ''.join(job_desc)

    dic = {
        'job_name': job_name,
        'salary': salary,
        'company': company,
        'job_desc': job_desc
    }
    job_data_list.append(dic)

fp = open('job.json', 'w', encoding='utf-8')
json.dump(job_data_list, fp, ensure_ascii=False)
fp.close()
print('over')
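The comment in the example above about starting local expressions with "./" deserves emphasis. A small sketch (with invented markup) of the difference between an absolute expression and a local one evaluated against a single li element:

from lxml import etree

html = '''<div class="job-list"><ul>
<li><h3><a>job A</a></h3></li>
<li><h3><a>job B</a></h3></li>
</ul></div>'''

tree = etree.HTML(html)
li_list = tree.xpath('//div[@class="job-list"]/ul/li')

first_li = li_list[0]
# an expression starting with // ignores the element and searches the whole document again
print(first_li.xpath('//h3/a/text()'))    # ['job A', 'job B']
# an expression starting with ./ restricts the search to the subtree of this li
print(first_li.xpath('.//h3/a/text()'))   # ['job A']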

 

Example 2: download image data from the Jiandan site (http://jandan.net/ooxx). [Key point: the image src is encrypted]

import requests
from lxml import etree
from fake_useragent import UserAgent
import base64
import urllib.request

url = 'http://jandan.net/ooxx'
ua = UserAgent(verify_ssl=False, use_cache_server=False).random
headers = {
    'User-Agent': ua
}
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)

# get the encrypted image url data
imgCode_list = tree.xpath('//span[@class="img-hash"]/text()')

imgUrl_list = []
for url in imgCode_list:
    img_url = 'http:' + base64.b64decode(url).decode()     # base64.b64decode(url) returns bytes, so decode it to str
    imgUrl_list.append(img_url)

for url in imgUrl_list:
    filePath = url.split('/')[-1]
    urllib.request.urlretrieve(url=url, filename=filePath)
    print(filePath + ' downloaded successfully')

 

Example 3: crawl resume templates from the Chinaz webmaster resources site (站长素材)

import requests
import random
from lxml import etree

headers = {
    'Connection': 'close',                              # disconnect as soon as each request succeeds (release the connection from the pool promptly)
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
url = 'http://sc.chinaz.com/jianli/free_%d.html'
for page in range(1, 4):                                # the first page url has a different format from the other pages, so handle it separately
    if page == 1:
        new_url = 'http://sc.chinaz.com/jianli/free.html'
    else:
        new_url = format(url % page)

    response = requests.get(url=new_url, headers=headers)
    response.encoding = 'utf-8'                         # the Chinese text is garbled otherwise, so set the encoding first
    page_text = response.text

    tree = etree.HTML(page_text)
    div_list = tree.xpath('//div[@id="container"]/div')
    for div in div_list:
        detail_url = div.xpath('./a/@href')[0]
        name = div.xpath('./a/img/@alt')[0]

        detail_page = requests.get(url=detail_url, headers=headers).text
        tree = etree.HTML(detail_page)
        download_list = tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')     # this yields every download link for the template
        download_url = random.choice(download_list)     # pick one link at random so no single link is requested too frequently and gets banned
        data = requests.get(url=download_url, headers=headers).content
        fileName = name + '.rar'
        with open(fileName, 'wb') as fp:
            fp.write(data)
            print(fileName, 'downloaded successfully')

 

3. BeautifulSoup parsing

  Installation: pip install bs4

  Parsing process:

        - Instantiate a BeautifulSoup object and load the page source data into it

        - Call the object's properties and methods to locate tags and extract data, as sketched below
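A minimal sketch of the two steps on a made-up HTML string (the markup and class names are invented for illustration):

from bs4 import BeautifulSoup

html = '''<div class="book-mulu"><ul>
<li><a href="/book/1.html">Chapter One</a></li>
<li><a href="/book/2.html">Chapter Two</a></li>
</ul></div>'''

# 1. instantiate a BeautifulSoup object and load the page source into it
soup = BeautifulSoup(html, 'lxml')

# 2. use its properties and methods to locate tags and extract data
a_list = soup.select('.book-mulu > ul > li > a')    # css selector
for a in a_list:
    print(a.string, a['href'])                      # tag text and attribute value
print(soup.find('a').string)                        # first matching tag: 'Chapter One'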

 

Example: use bs4 to crawl every chapter of the novel Romance of the Three Kingdoms from the Shicimingju site and save it to local disk

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}
url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(url=url, headers=headers).text

# data parsing: chapter titles and chapter content
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select('.book-mulu > ul > li > a')
fp = open('./sanguo.txt', 'w', encoding='utf-8')
for a in a_list:                                # each a tag can itself be used like a soup object, since it is also markup
    title = a.string
    detail_url = 'http://www.shicimingju.com' + a['href']
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    soup = BeautifulSoup(detail_page_text, 'lxml')
    content = soup.find('div', class_="chapter_content").text      # in bs4, .text joins the extracted text into a single string, unlike xpath

    fp.write(title + ':' + content + '\n')
    print(title, 'saved successfully!')
fp.close()

 

 

 

 


Origin www.cnblogs.com/Summer-skr--blog/p/11397434.html