Web Crawler - Part 2 - Data Parsing

Data parsing

  • Parsing: extracting data according to specified rules
  • Role: implementing a focused crawler
  • Focused crawler coding process (a skeleton sketch follows this list):
    • Specify the URL
    • Initiate a request
    • Fetch the response data
    • Parse the data
    • Persist the data (storage)
  • Data parsing methods:
    • Regular expressions
    • bs4
    • xpath
    • pyquery (extension)
  • What is the general principle of data parsing?
    • Data parsing operates on the page source code (which consists of a set of HTML tags)
    • What is the core role of HTML?
      • Displaying data
    • How does HTML display data?
      • Any data that HTML displays must be placed inside an HTML tag or in a tag's attributes.
    • The principle in plain words:
      • 1. Locate the tag
      • 2. Extract its text or its attributes
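
As a minimal sketch of the five-step process above (the URL, file name, and User-Agent string here are placeholders, not from the original tutorial):

import requests

# Hypothetical skeleton of the focused-crawler workflow described above
url = 'https://www.example.com/page'                      # 1. specify the URL
headers = {'User-Agent': 'Mozilla/5.0'}                   # UA spoofing
response = requests.get(url=url, headers=headers)         # 2. initiate the request
page_text = response.text                                 # 3. fetch the response data
# 4. data parsing: locate the tags, then take their text or attributes
#    (regex / bs4 / xpath would operate on page_text here)
with open('./page.html', 'w', encoding='utf-8') as fp:    # 5. persistent storage
    fp.write(page_text)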

How to crawl images

import requests
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}

Method one (requests):

url = 'https://pic.qiushibaike.com/system/pictures/12217/122176396/medium/OM37E794HBL3OFFF.jpg'
img_data = requests.get(url=url,headers=headers).content  # .content returns the data as bytes
with open('./123.jpg','wb') as fp:
    fp.write(img_data)

Method two (urllib):

from urllib import request
url = 'https://pic.qiushibaike.com/system/pictures/12217/122176396/medium/OM37E794HBL3OFFF.jpg'
request.urlretrieve(url,'./456.jpg')

Method two cannot use the UA spoofing mechanism. urllib is an older network request module; before the requests module existed, network requests were sent with urllib, and requests itself is built on top of it.
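
If UA spoofing is needed while still using urllib, a commonly used workaround (a hedged sketch, not part of this tutorial) is to skip urlretrieve and send a Request object that carries the User-Agent header:

from urllib import request

url = 'https://pic.qiushibaike.com/system/pictures/12217/122176396/medium/OM37E794HBL3OFFF.jpg'
# Request objects accept custom headers, unlike urlretrieve
req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
with request.urlopen(req) as resp, open('./456.jpg', 'wb') as fp:
    fp.write(resp.read())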

Data parsing with regular expressions

Requirement: crawl the images from the Qiushibaike (糗事百科) image section

import re
import os
import requests
from urllib import request  # for urlretrieve

# headers (UA spoofing) as defined above
dir_name = './qiutuLibs'
if not os.path.exists(dir_name):
    os.mkdir(dir_name)

url = 'https://www.qiushibaike.com/pic/'
page_text = requests.get(url,headers=headers).text
# Data parsing: extract the image addresses
ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
img_src_list = re.findall(ex,page_text,re.S)  # re.S lets . match newlines
for src in img_src_list:
    src = 'https:'+src
    img_name = src.split('/')[-1]
    img_path = dir_name+'/'+img_name
    # Send a separate request for each image address to fetch the image data
    request.urlretrieve(src,img_path)
    print(img_name,'downloaded successfully!!!')

Parsing with bs4

  • Environment installation:
    • pip install bs4
    • pip install lxml
  • The parsing principle of bs4:
    • Instantiate a BeautifulSoup object and load the page source data to be parsed into that object
    • Call the BeautifulSoup object's properties and methods for tag positioning and data extraction
  • How is a BeautifulSoup object instantiated?
    • BeautifulSoup(fp, 'lxml'): for parsing data stored in a local html file
    • BeautifulSoup(page_text, 'lxml'): for parsing page source data requested from the Internet

Tag positioning

  • soup.tagName: positions the first tag named tagName; returns a single element
  • Attribute positioning: soup.find('tagName', attrName='value'); returns a single element
    • find_all: used the same way as find, but returns a list
  • Selector positioning: select('selector'); returns a list
    • Tag, class, id, and hierarchy selectors (> : one level; space : multiple levels)

Extract data

  • Extract text:
    • tag.string: the tag's direct (immediate) text content
    • tag.text: all text inside the tag
  • Extract an attribute:
    • tag['attrName']
from bs4 import BeautifulSoup

fp = open('html/test.html', 'r', encoding='utf-8')
soup = BeautifulSoup(fp, 'lxml')  # load the page source to be parsed into this object
print(soup.p)                                  # soup.tagName: the first <p> tag
soup.find('div', class_='song')                # attribute positioning, returns a single tag
soup.find_all('div', class_='song')            # same usage as find, but returns a list
soup.select('.tang')                           # class selector
soup.select('#feng')                           # id selector
soup.select('.tang > ul > li')                 # > : one level down
soup.select('.tang li')                        # space: any number of levels down
li_6 = soup.select('.tang > ul > li')[6]
i_tag = li_6.i
i_tag.string                                   # direct text content of the <i> tag
soup.find('div', class_='tang').text           # all text inside the div
soup.find('a', id="feng")['href']              # extract the href attribute

Requirement: crawl the full text of the novel Romance of the Three Kingdoms from http://www.shicimingju.com/book/sanguoyanyi.html

  • Chapter titles
  • Chapter content
# Parse the chapter titles & each chapter's detail-page url from the home page
url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(url,headers=headers).text
soup = BeautifulSoup(page_text,'lxml')
a_list = soup.select('.book-mulu > ul > li > a')
fp = open('sanguo.txt','w',encoding='utf-8')
for a in a_list:
    detail_url = 'http://www.shicimingju.com'+a['href']
    chap_title = a.string
    # Request each chapter's detail-page url and parse the chapter content from it
    detail_page_text = requests.get(detail_url,headers=headers).text
    soup = BeautifulSoup(detail_page_text,'lxml')
    chap_content = soup.find('div',class_="chapter_content").text
    fp.write(chap_title+':'+chap_content+'\n')
    print(chap_title,'crawled successfully!')
fp.close()

Parsing with xpath

  • Environment installation: pip install lxml
  • The parsing principle of xpath:
    • Instantiate an etree object and load the page source data into that object
    • Call the etree object's xpath method, combined with different forms of xpath expressions, for tag positioning and data extraction
  • etree object instantiation:
    • etree.parse(fileName): for a local html file
    • etree.HTML(page_text): for page source requested from the Internet
  • The xpath method always returns a list

Tag positioning

  • A / at the very left of an xpath expression means positioning must start from the root tag
  • A // at the very left means the tag can be located starting from anywhere in the document
  • A // that is not at the very left means multiple levels (any depth)
  • A / that is not at the very left means exactly one level
  • Attribute positioning: //tagName[@attrName='value']
  • Index positioning: //tagName/li[3]

Extract data

  • Extract text:
    • /text(): get the direct text
    • //text(): get all the text
  • Extract an attribute:
    • tagName/@attrName
from lxml import etree
tree = etree.parse('./test.html')
tree.xpath('/html/head/meta')[0]  # absolute path
tree.xpath('//meta')[0]  # relative path: locates every meta tag in the whole page source
tree.xpath('/html//meta')[0]  # // in the middle of the expression: any number of levels
# attribute positioning
tree.xpath('//div[@class="song"]')
# index positioning
tree.xpath('//div[@class="tang"]/ul/li[3]')  # the index starts at 1
tree.xpath('//div[@class="tang"]//li[3]')  # the index starts at 1
# extract text
tree.xpath('//p[1]/text()')
tree.xpath('//div[@class="song"]//text()')

# extract an attribute
tree.xpath('//a[@id="feng"]/@href')

Requirement: crawl job postings from BOSS Zhipin (zhipin.com)

  • Job title
  • Company name
  • Salary
  • Job description
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
    'cookie':'lastCity=101010100; __c=1566877560; __g=-; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1566877561; _uab_collina=156687756118178796315757; __l=l=%2Fwww.zhipin.com%2F&r=https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DidbSvNzz2fLSl1WXiEmtINauVHUZYSNqejHny725pc5RTwaHqh5uDx1LewpyGmaT%26wd%3D%26eqid%3Dbadf667700040677000000025d64a772&friend_source=0&friend_source=0; __zp_stoken__=91d9QItKEtUk5dMMnDG7lwzq8mBW1g%2FkEsFOHXIi%2FwMd%2FPRRXc%2FPMKjsDYwsfC4b7vAT3FVnTmYBjGp8gW1OeZ5TdA%3D%3D; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1566879753; __a=69160831.1566877560..1566877560.16.1.16.16'
}
url = 'https://www.zhipin.com/job_detail/?query=python%E7%88%AC%E8%99%AB&city=101010100&industry=&position='
page_text = requests.get(url,headers=headers).text
# data parsing
tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="job-list"]/ul/li')
for li in li_list:
#     Extract the relevant data from the partial page source represented by each li
#     If an xpath expression is used inside a loop, it must start with ./ or .//
    detail_url = 'https://www.zhipin.com'+li.xpath('.//div[@class="info-primary"]/h3/a/@href')[0]
    job_title = li.xpath('.//div[@class="info-primary"]/h3/a/div/text()')[0]
    salary = li.xpath('.//div[@class="info-primary"]/h3/a/span/text()')[0]
    company = li.xpath('.//div[@class="info-company"]/div/h3/a/text()')[0]
    # Request the detail-page url and parse out the job description
    detail_page_text = requests.get(detail_url,headers=headers).text
    tree = etree.HTML(detail_page_text)
    job_desc = tree.xpath('//div[@class="text"]//text()')
    job_desc = ''.join(job_desc)
    
    print(job_title,salary,company,job_desc)
  • Another common use of xpath expressions: the pipe operator (|) inside an xpath expression, used to parse data from pages whose layout is irregular
  • Crawl the joke content and author names from Qiushibaike
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
}
url = 'https://www.qiushibaike.com/text/page/4/'
page_text = requests.get(url,headers=headers).text

tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@id="content-left"]/div')
for div in div_list:
    author = div.xpath('./div[1]/a[2]/h2/text() | ./div[1]/span[2]/h2/text()')[0]  # the | pipe covers both possible author layouts
    content = div.xpath('.//div[@class="content"]/span//text()')
    content = ''.join(content)
    print(author,content)

Dealing with garbled Chinese text

  • Crawl the images and image names from http://pic.netbian.com/4kmeishi/
# Define a generic url template
url = 'http://pic.netbian.com/4kmeishi/index_%d.html'

for page in range(1,3):
    if page == 1:
        new_url = 'http://pic.netbian.com/4kmeishi/'
    else:
        new_url = url % page
    response =  requests.get(new_url,headers=headers)
    # response.encoding = 'utf-8'  # alternative fix: set the response encoding manually
    page_text = response.text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    for li in li_list:
        img_src = 'http://pic.netbian.com'+li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/b/text()')[0]
        img_name = img_name.encode('iso-8859-1').decode('gbk')  # fix the garbled name: re-encode as iso-8859-1, then decode as gbk
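
The loop above only extracts img_src and img_name; it does not persist the images. A minimal, hedged helper for saving them (the directory name and the .jpg extension are assumptions, not from the original) could be called as save_image(img_src, img_name, headers) at the end of the inner loop:

import os
import requests

def save_image(img_src, img_name, headers, dir_name='./meishiLibs'):
    # Hypothetical helper: download one image and store it under dir_name
    if not os.path.exists(dir_name):
        os.mkdir(dir_name)
    img_data = requests.get(img_src, headers=headers).content  # raw image bytes
    with open(dir_name + '/' + img_name + '.jpg', 'wb') as fp:
        fp.write(img_data)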
