Table of Contents
02 Crawlers / Data Parsing
1. Overview of data parsing
What is data parsing, and what can it do?
- Concept: extracting a specified subset of data out of a full set of data.
- Role: used to implement focused crawlers.
General principles of data parsing
- Question: where can the data to be parsed be stored in the HTML page?
- Inside a tag (as tag text)
- In a tag attribute
- 1. Locate the tag
- 2. Extract the tag's text or its attribute
Common data-parsing methods
re
bs4
xpath
pyquery
2. Data parsing with regular expressions
Requirement: crawl the image data from the site http://duanziwang.com/category/%E6%90%9E%E7%AC%91%E5%9B%BE/
How to crawl image (binary) data
Method 1: requests

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
url = 'http://duanziwang.com/usr/uploads/2019/02/3334500855.jpg'
pic_data = requests.get(url=url, headers=headers).content  # .content returns the binary response data
with open('1.jpg', 'wb') as fp:
    fp.write(pic_data)
```
Method 2: urllib (an older, lower-level counterpart of requests)

```python
import urllib.request

url = 'http://duanziwang.com/usr/uploads/2019/02/3334500855.jpg'
urllib.request.urlretrieve(url=url, filename='./2.jpg')
```
What is the difference between the two image-crawling methods?
- Method 1 can disguise its UA via the headers argument; Method 2 (urlretrieve), as written, cannot
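The limitation of Method 2 is really a limitation of `urlretrieve`, which takes no headers argument; urllib itself can still send a custom User-Agent by building a `urllib.request.Request` object by hand. A minimal sketch, reusing the image URL from above (the shortened UA string is just a placeholder):

```python
import urllib.request

# image URL from Method 1/2 above
url = 'http://duanziwang.com/usr/uploads/2019/02/3334500855.jpg'
req = urllib.request.Request(
    url=url,
    headers={'User-Agent': 'Mozilla/5.0'},  # UA disguise, like headers= in requests
)
print(req.get_header('User-agent'))  # Mozilla/5.0
# the actual fetch would then be:
# data = urllib.request.urlopen(req).read()  # binary body, like .content
```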
What is the difference between the page source shown in the developer tools' Elements tab and the one shown in the Response tab?
Elements: displays all of the data loaded by the current page, including dynamically loaded data
Response: displays only the data returned by the current request itself, excluding dynamically loaded data
Sample code
- Requirement: crawl the image data of a single page

```python
import os
import re
import urllib.request

import requests

# headers is the UA dict defined in Method 1 above
url = 'http://duanziwang.com/category/%E6%90%9E%E7%AC%91%E5%9B%BE/'
page_text = requests.get(url, headers=headers).text  # page source data
# create a folder for the images
dirName = 'imgLibs'
if not os.path.exists(dirName):
    os.mkdir(dirName)
# parse out the address of every image
ex = '<article.*?<img src="(.*?)" alt=.*?</article>'
img_src_list = re.findall(ex, page_text, re.S)  # findall in a crawler must use re.S
for src in img_src_list:
    imgName = src.split('/')[-1]
    imgPath = dirName + '/' + imgName
    urllib.request.urlretrieve(url=src, filename=imgPath)
    print(imgName, 'downloaded!')
```
- Requirement: full-site crawling: crawl the image data of every page

```python
# imports and headers as in the previous cell
# define a generic URL template; the template itself must not be modified
url = 'http://duanziwang.com/category/搞笑图/%d/'
for page in range(1, 4):
    new_url = format(url % page)
    page_text = requests.get(new_url, headers=headers).text  # page source data
    # create a folder for the images
    dirName = 'imgLibs'
    if not os.path.exists(dirName):
        os.mkdir(dirName)
    # parse out the address of every image
    ex = '<article.*?<img src="(.*?)" alt=.*?</article>'
    img_src_list = re.findall(ex, page_text, re.S)
    for src in img_src_list:
        imgName = src.split('/')[-1]
        imgPath = dirName + '/' + imgName
        urllib.request.urlretrieve(url=src, filename=imgPath)
        print(imgName, 'downloaded!')
```
3. Data parsing with bs4
- Environmental installation:
- pip install bs4
- pip install lxml
- Parsing principle
- Instantiate a BeautifulSoup object and load the page source to be parsed into it
- Call the BeautifulSoup object's methods and properties to locate tags and extract their data
- Ways to instantiate a BeautifulSoup object:
- BeautifulSoup(fp, 'lxml'): loads the content of a local file into the object for parsing
- BeautifulSoup(page_text, 'lxml'): loads data requested from the Internet into the object for parsing
bs4 parsing operations
Tag location: the return value is the located tag itself
- soup.tagName: locates the first tagName tag; returns a single tag
- Attribute location: soup.find('tagName', attrName='value'); returns a single tag (use class_ for the class attribute)
- find_all('tagName', attrName='value'): returns a plural result (a list)
- Selector location: select('selector'); returns a list
- Hierarchy selectors:
- Greater-than sign (>): represents a single level
- Space: represents multiple levels
Taking text
- string: takes only the tag's direct text
- text: takes the entire text content of the tag
Taking an attribute
- tag['attrName']
Sample code

```python
from bs4 import BeautifulSoup

fp = open('./test.html', 'r', encoding='utf-8')
soup = BeautifulSoup(fp, 'lxml')
soup.p                              # first <p> tag
soup.find('div', class_='tang')     # attribute location
soup.find('a', id='feng')
soup.find_all('div', class_='tang')  # plural result (list)
soup.select('#feng')                # selector location
soup.select('.tang > ul > li')      # > : single level
soup.select('.tang li')             # space: multiple levels
tag = soup.title
tag.text                            # entire text content
li_list = soup.select('.tang > ul > li')
li_list[6].text
div_tag = soup.find('div', class_='tang')
div_tag.text
a_tag = soup.select('#feng')[0]
a_tag['href']                       # taking an attribute
```
Requirement: crawl the full content of the novel at http://www.shicimingju.com/book/sanguoyanyi.html
Analysis:
- Home page: parse out the chapter names + the detail-page URLs
- Detail page: parse out the chapter content
Sample code

```python
import requests
from bs4 import BeautifulSoup

# headers is the UA dict defined earlier
# crawl the page data of the home page
main_url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(main_url, headers=headers).text
fp = open('./sanguo.txt', 'a', encoding='utf-8')
# parse the chapter names + the detail-page URLs
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select('.book-mulu > ul > li > a')
for a in a_list:
    title = a.string  # chapter title
    detail_url = 'http://www.shicimingju.com' + a['href']
    # crawl the page source of the detail page
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    # parse the chapter content
    detail_soup = BeautifulSoup(detail_page_text, 'lxml')
    div_tag = detail_soup.find('div', class_="chapter_content")
    content = div_tag.text  # chapter content
    fp.write(title + ':' + content + '\n')
    print(title, 'downloaded!')
fp.close()
```
4. Data parsing with xpath
- Environmental installation:
- pip install lxml
- Parsing principle (process)
- Instantiate an etree object and load the data to be parsed into it
- Call the etree object's xpath method, combined with different xpath expressions, to locate tags and extract text or attribute data
- Instantiating an etree object
- etree.parse('filePath'): loads data from a local file into the object
- etree.HTML(page_text): loads data requested from the Internet into the object
- All HTML tags follow a tree structure, which enables efficient node traversal and lookup (location)
The return value of the xpath method is always plural (a list)
Tag location
- Leftmost /: the xpath expression must start locating from the root tag
- Non-leftmost /: represents a single level
- Leftmost //: locates the tag starting from anywhere (most commonly used)
- Non-leftmost //: represents multiple levels
- //tagName: locates all tagName tags
- Attribute location: //tagName[@attrName="value"]
- Index location: //tagName[index]; the index starts from 1
- Fuzzy matching:
- //div[contains(@class, "ng")]: class attribute contains "ng"
- //div[starts-with(@class, "ta")]: class attribute starts with "ta"
Taking text
- /text(): takes the direct text content; the list holds only one element
- //text(): takes all text content; the list can hold multiple elements
Taking an attribute
- /@attrName
Sample code

```python
from lxml import etree

tree = etree.parse('./test.html')
tree.xpath('/html/head/meta')            # locate starting from the root
tree.xpath('/html//meta')                # // : multiple levels
tree.xpath('//meta')                     # locate starting from anywhere
tree.xpath('//div')
tree.xpath('//div[@class="tang"]')       # attribute location
tree.xpath('//li[1]')                    # index location (starts at 1)
tree.xpath('//a[@id="feng"]/text()')[0]  # direct text
tree.xpath('//div[2]//text()')           # all text
tree.xpath('//a[@id="feng"]/@href')      # taking an attribute
```
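The sample above does not exercise the two fuzzy-matching forms. Here is a self-contained sketch of them using an inline HTML snippet (the div classes and contents are made up for illustration) instead of ./test.html:

```python
from lxml import etree

html = '''
<div class="tang">poetry</div>
<div class="song">lyrics</div>
<div class="tab">table</div>
'''
tree = etree.HTML(html)

# contains(): matches divs whose class attribute contains the substring "ng"
print(tree.xpath('//div[contains(@class, "ng")]/text()'))     # ['poetry', 'lyrics']
# starts-with(): matches divs whose class attribute starts with "ta"
print(tree.xpath('//div[starts-with(@class, "ta")]/text()'))  # ['poetry', 'table']
```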
Requirement: crawl the Huya live page https://www.huya.com/g/lol and parse out each room's name, heat, and detail-page URL

```python
import requests
from lxml import etree

# headers is the UA dict defined earlier
url = 'https://www.huya.com/g/lol'
page_text = requests.get(url=url, headers=headers).text
# parse the data
tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="box-bd"]/ul/li')
for li in li_list:
    # local parsing: parse the specified content under a local tag
    # the leftmost ./ in a local xpath expression refers to the tag the
    # xpath method is being called on
    title = li.xpath('./a[2]/text()')[0]
    hot = li.xpath('./span/span[2]/i[2]/text()')[0]
    detail_url = li.xpath('./a[1]/@href')[0]
    print(title, hot, detail_url)
```
- Crawling image data with xpath + handling garbled (mis-encoded) text

```python
import requests
from lxml import etree

# headers is the UA dict defined earlier
# URL template
url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
for page in range(1, 11):
    new_url = format(url % page)  # only valid for pages other than the first
    if page == 1:
        new_url = 'http://pic.netbian.com/4kmeinv/'
    page_text = requests.get(new_url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    for li in li_list:
        img_name = li.xpath('./a/img/@alt')[0] + '.jpg'
        # fix the garbled characters: re-encode, then decode as gbk
        img_name = img_name.encode('iso-8859-1').decode('gbk')
        img_src = 'http://pic.netbian.com' + li.xpath('./a/img/@src')[0]
        print(img_name, img_src)
```
- Using the xpath pipe operator (|)

```python
import requests
from lxml import etree

# headers is the UA dict defined earlier
url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
# hot_cities = tree.xpath('//div[@class="bottom"]/ul/li/a/text()')
all_cities = tree.xpath('//div[@class="bottom"]/ul/div[2]/li/a/text() | //div[@class="bottom"]/ul/li/a/text()')
all_cities
- Applying the pipe operator in xpath expressions
- Purpose: makes an xpath expression more general, so one expression can match several page structures
Summary:
.content returns the binary response data

```python
data = requests.get(url=url, headers=headers).content
```
Using re.S with regular expressions

```python
img_src_list = re.findall(ex, page_text, re.S)
```

Without the re.S flag, matching happens within each line only: if a match fails on one line, matching restarts on the next line, and no match can span lines.
With re.S, the regular expression treats the whole string as one unit; "\n" becomes an ordinary character in the string, so the match can span the entire string.
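A minimal demonstration of that effect, using a made-up two-line snippet modeled on the article regex from section 2:

```python
import re

page = '<article>\n<img src="a.jpg">\n</article>'
ex = '<article>.*?src="(.*?)".*?</article>'

# without re.S, "." never matches a newline, so the pattern cannot span lines
print(re.findall(ex, page))        # []
# with re.S, "." also matches "\n" and the whole string is matched as one unit
print(re.findall(ex, page, re.S))  # ['a.jpg']
```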
```python
new_url = format(url % page)
```

format(value, format_spec='', /) returns value.__format__(format_spec); with the default empty format_spec it simply returns the string unchanged, so format(url % page) is equivalent to plain url % page.
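The equivalence can be checked directly (the template and page number below reuse the earlier picture-site example):

```python
url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
page = 2

# format() with no format_spec is a no-op on the already-built string
assert format(url % page) == url % page
assert url % page == 'http://pic.netbian.com/4kmeinv/index_2.html'
# an f-string builds the same URL
assert f'http://pic.netbian.com/4kmeinv/index_{page}.html' == url % page
```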
With urllib, request parameters must be processed (URL-encoded) by hand before the request; with requests this is not necessary.
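A hedged sketch of that difference (the URL and parameters here are placeholders, not from the examples above): urllib needs the query string built by hand with urllib.parse.urlencode, while requests accepts a params dict and encodes it itself.

```python
from urllib.parse import urlencode

params = {'kw': '爬虫', 'page': 1}

# urllib style: encode the parameters by hand and splice them into the URL
query = urlencode(params)
url = 'https://www.example.com/s?' + query

# requests style (shown for comparison, not executed here):
# requests.get('https://www.example.com/s', params=params, headers=headers)

print(query)  # kw=%E7%88%AC%E8%99%AB&page=1
```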