Introduction
In most scenarios we need a focused crawler: one that crawls only the specified portion of a page's data, rather than the entire page. Data parsing is what makes a crawler "focused", so our crawling workflow becomes:
- Specify the url
- Initiate the request with the requests module
- Acquire the response data
- Parse the data
- Persist the result to storage
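The five steps above can be sketched as follows. The url, the page markup, and the regex are all made up for illustration; the "response" is hard-coded so the sketch runs without network access (in a real crawler it would come from `requests.get(url).text`):

```python
import re

url = 'https://example.com/list'  # 1. specify the url (hypothetical)

# 2-3. initiate the request and acquire the response (stubbed with a literal
# string here instead of requests.get(url).text)
page_text = '<li class="item">a</li><li class="item">b</li>'

def parse_items(page_text):
    # 4. data parsing: a focused crawler extracts only the wanted fragment
    return re.findall(r'<li class="item">(.*?)</li>', page_text, re.S)

items = parse_items(page_text)

# 5. persistent storage
with open('items.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(items))

print(items)  # ['a', 'b']
```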
Data parsing:
- is applied in focused crawlers;
- the data to be parsed is stored either between a tag's opening and closing tags, or in a tag's attributes.
1. Regular expression parsing
Common regular expressions:
- Single characters:
  - `.` : any character except newline
  - `[]` : a character set, e.g. `[aoe]`, `[a-w]` matches any single character in the set
  - `\d` : a digit, `[0-9]`
  - `\D` : a non-digit
  - `\w` : a digit, letter, underscore, or Chinese character
  - `\W` : anything `\w` does not match
  - `\s` : any whitespace character, including space, tab, and form feed; equivalent to `[ \f\n\r\t\v]`
  - `\S` : a non-whitespace character
- Quantifiers:
  - `*` : any number of times (>= 0)
  - `+` : at least once (>= 1)
  - `?` : zero or one time
  - `{m}` : exactly m times, e.g. `hello{3}`
  - `{m,}` : at least m times
  - `{m,n}` : between m and n times
- Boundaries:
  - `$` : ends with
  - `^` : starts with
- Groups: `(ab)`
- Greedy mode: `.*`
- Non-greedy (lazy) mode: `.*?`
- Flags:
  - `re.I` : ignore case
  - `re.M` : multi-line matching
  - `re.S` : single-line matching (`.` also matches newline)
- `re.sub(pattern, replacement, string)` : substitution
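A few of the rules above, checked in Python (the sample strings are made up for illustration):

```python
import re

# \d matches digits; + means one or more
assert re.findall(r'\d+', 'a1b22c333') == ['1', '22', '333']

# {m,n}: between m and n repetitions
assert re.findall(r'o{1,2}', 'go gooo') == ['o', 'oo', 'o']

# ^ and $ boundaries
assert re.match(r'^hello.*world$', 'hello brave world') is not None

# greedy .* swallows as much as possible; non-greedy .*? stops early
html = '<b>one</b><b>two</b>'
assert re.findall(r'<b>(.*)</b>', html) == ['one</b><b>two']
assert re.findall(r'<b>(.*?)</b>', html) == ['one', 'two']

# re.S lets . match newlines; re.I ignores case
assert re.findall(r'<p>(.*?)</p>', '<p>a\nb</p>', re.S) == ['a\nb']
assert re.sub(r'cat', 'dog', 'Cat cat', flags=re.I) == 'dog dog'

print('all regex checks passed')
```

The greedy/non-greedy difference is exactly why the crawler examples below use `.*?` inside their patterns together with `re.S`.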
Example: crawl all images from the photo section of Qiushibaike (糗事百科)

```python
import requests
import re
import os

# Create a folder for the images
if not os.path.exists('./qiutuLibs'):
    os.mkdir('./qiutuLibs')

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

# Generic url template
url = 'https://www.qiushibaike.com/pic/page/%d/?s=5185803'

for page in range(1, 36):
    new_url = format(url % page)
    page_text = requests.get(url=new_url, headers=headers).text

    # Data parsing (image addresses)
    ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
    src_list = re.findall(ex, page_text, re.S)

    # The src attribute value is not a complete url: the protocol header is missing
    for src in src_list:
        src = 'https:' + src
        # Request each image url separately; image data is binary, so use .content
        img_data = requests.get(url=src, headers=headers).content
        img_name = src.split('/')[-1]

        img_path = './qiutuLibs/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
        print(img_name, 'download successful!')
```
2. XPath parsing
Installation: `pip install lxml -i http://pypi.douban.com/simple --trusted-host pypi.douban.com`
XPath parsing process:
- 1. Instantiate an etree object and load the page source data into it.
- 2. Call the etree object's xpath method with an xpath expression to locate tags and extract data.
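Before the full examples, the two steps can be sketched on an inline HTML snippet (the markup and class names are made up for illustration):

```python
from lxml import etree

html = '''
<div class="job-list">
  <ul>
    <li><a href="/job/1"><span>10k</span>crawler dev</a></li>
    <li><a href="/job/2"><span>12k</span>data dev</a></li>
  </ul>
</div>
'''

# 1. Instantiate an etree object from the page source
tree = etree.HTML(html)

# 2. Call xpath() with an expression; it always returns a list
hrefs = tree.xpath('//div[@class="job-list"]/ul/li/a/@href')
print(hrefs)  # ['/job/1', '/job/2']

# A returned element can call xpath() again; a local expression starts with ./
li_list = tree.xpath('//div[@class="job-list"]/ul/li')
salaries = [li.xpath('.//span/text()')[0] for li in li_list]
print(salaries)  # ['10k', '12k']
```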
Example 1: BOSS直聘 (Zhipin) job listings

```python
import requests
from lxml import etree
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

url = 'https://www.zhipin.com/job_detail/?query=python%E7%88%AC%E8%99%AB&city=101010100&industry=&position='
page_text = requests.get(url=url, headers=headers).text

tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="job-list"]/ul/li')
job_data_list = []

for li in li_list:
    # For local parsing the expression starts with ./ ; an etree element can also call xpath()
    job_name = li.xpath('.//div[@class="info-primary"]/h3/a/div/text()')[0]
    salary = li.xpath('.//div[@class="info-primary"]/h3/a/span/text()')[0]
    company = li.xpath('.//div[@class="company-text"]/h3/a/text()')[0]
    detail_url = 'https://www.zhipin.com' + li.xpath('.//div[@class="info-primary"]/h3/a/@href')[0]

    # Page source data of the detail page
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    detail_tree = etree.HTML(detail_page_text)
    job_desc = detail_tree.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]/div//text()')
    job_desc = ''.join(job_desc)

    dic = {
        'job_name': job_name,
        'salary': salary,
        'company': company,
        'job_desc': job_desc
    }
    job_data_list.append(dic)

fp = open('job.json', 'w', encoding='utf-8')
json.dump(job_data_list, fp, ensure_ascii=False)
fp.close()
print('over')
```
Example 2: download image data from Jandan (煎蛋网): http://jandan.net/ooxx [key point: the img src is encrypted]

```python
import requests
from lxml import etree
from fake_useragent import UserAgent
import base64
import urllib.request

url = 'http://jandan.net/ooxx'
ua = UserAgent(verify_ssl=False, use_cache_server=False).random
headers = {
    'User-Agent': ua
}
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)

# Acquire the encrypted image url data
imgCode_list = tree.xpath('//span[@class="img-hash"]/text()')

imgUrl_list = []
for url in imgCode_list:
    # base64.b64decode(url) returns bytes, so it must be converted to str
    img_url = 'http:' + base64.b64decode(url).decode()
    imgUrl_list.append(img_url)

for url in imgUrl_list:
    filePath = url.split('/')[-1]
    urllib.request.urlretrieve(url=url, filename=filePath)
    print(filePath + ' downloaded successfully')
```
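The decoding step above hinges on `base64.b64decode` returning bytes rather than str. A quick stdlib-only check (the encoded path is made up, not a real value from the site):

```python
import base64

# Encode a sample protocol-relative path the way the page's hash span would carry it
code = base64.b64encode(b'//img.example.com/pic/1.jpg').decode()

decoded = base64.b64decode(code)
print(type(decoded))  # <class 'bytes'>

# .decode() turns the bytes into str so it can be concatenated with 'http:'
img_url = 'http:' + decoded.decode()
print(img_url)  # http://img.example.com/pic/1.jpg
```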
Example 3: crawl free resume templates from 站长素材 (sc.chinaz.com)

```python
import requests
import random
from lxml import etree

headers = {
    # Disconnect immediately after each successful request (release connection-pool resources in time)
    'Connection': 'close',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
url = 'http://sc.chinaz.com/jianli/free_%d.html'
# The first page's url has a different format from the other pages, so handle it separately
for page in range(1, 4):
    if page == 1:
        new_url = 'http://sc.chinaz.com/jianli/free.html'
    else:
        new_url = format(url % page)

    response = requests.get(url=new_url, headers=headers)
    response.encoding = 'utf-8'  # Chinese text is garbled, so adjust the encoding first
    page_text = response.text

    tree = etree.HTML(page_text)
    div_list = tree.xpath('//div[@id="container"]/div')
    for div in div_list:
        detail_url = div.xpath('./a/@href')[0]
        name = div.xpath('./a/img/@alt')[0]

        detail_page = requests.get(url=detail_url, headers=headers).text
        tree = etree.HTML(detail_page)
        # This yields all download links for each template
        download_list = tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')
        # Pick one at random so no single link gets banned for too-frequent requests
        download_url = random.choice(download_list)
        data = requests.get(url=download_url, headers=headers).content
        fileName = name + '.rar'
        with open(fileName, 'wb') as fp:
            fp.write(data)
        print(fileName, 'download successful')
```
3. BeautifulSoup parsing
Installation: `pip install bs4`
Parsing process:
- Instantiate a BeautifulSoup object and load the page source data into it
- Call the object's properties and methods to locate tags and extract data
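A minimal sketch of the two steps on an inline HTML snippet (the markup and class names are made up for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<div class="book-mulu">
  <ul>
    <li><a href="/chapter/1">Chapter 1</a></li>
    <li><a href="/chapter/2">Chapter 2</a></li>
  </ul>
</div>
<div class="chapter_content">Once upon a time...</div>
'''

# 1. Instantiate a BeautifulSoup object from the page source
soup = BeautifulSoup(html, 'lxml')

# 2. Locate tags with CSS selectors / find(), then extract data
a_list = soup.select('.book-mulu > ul > li > a')
titles = [a.string for a in a_list]   # .string: the tag's own direct text
print(titles)  # ['Chapter 1', 'Chapter 2']

hrefs = [a['href'] for a in a_list]   # attributes via dict-style access
print(hrefs)   # ['/chapter/1', '/chapter/2']

# .text joins all descendant text into one string
content = soup.find('div', class_='chapter_content').text
print(content)  # Once upon a time...
```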
Example: use bs4 to crawl every chapter of the novel Romance of the Three Kingdoms from the shicimingju (诗词名句) site and save it to local disk

```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}
url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(url=url, headers=headers).text

# Data parsing: chapter titles and chapter contents
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select('.book-mulu > ul > li > a')
fp = open('./sanguo.txt', 'w', encoding='utf-8')
for a in a_list:
    # An a tag can be used like a soup object, because it is also page source
    title = a.string
    detail_url = 'http://www.shicimingju.com' + a['href']
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    soup = BeautifulSoup(detail_page_text, 'lxml')
    # In bs4, .text joins the extracted text list directly into a string, unlike xpath
    content = soup.find('div', class_="chapter_content").text

    fp.write(title + ':' + content + '\n')
    print(title, 'saved successfully!')
fp.close()
```