Crawlers: Day 2
Review
- Problems:
    - IP banned: use a proxy (see the sketch after this list)
    - Request parameter problems:
        - dynamically changing request parameters
        - encrypted request parameters
    - Response data problems:
        - cookie
    - Request parameter encryption:
        - JS reverse engineering
- Important content:
    - dynamic parameters
    - data / params
- Anti-crawling mechanisms:
    - robots.txt
    - UA detection
    - dynamic loading of data
        - how to detect whether data is dynamically loaded
        - how to capture dynamically loaded data
        - how is dynamically loaded data generated?
            - ajax
            - js
    - dynamic parameters
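- For the "IP banned" case, a minimal sketch of routing a request through a proxy with requests; the proxy address below is a placeholder, not a live server:

In [ ]:
import requests
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
# placeholder proxy address: substitute one you actually control
proxies = {
    'http':'http://127.0.0.1:8888',
    'https':'http://127.0.0.1:8888',
}
# the target site sees the proxy's IP instead of ours
page_text = requests.get('http://www.baidu.com',headers=headers,proxies=proxies).text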
Data parsing
- Role: implements the focused crawler
- Implementation methods:
    - regular expressions
    - bs4: key
    - xpath: key
    - pyquery: self-study
- What is the general principle of data parsing?
    - The data to be parsed lives in the page source (HTML)
        - stored as tag text
        - or as tag attribute values
- Principle:
    - tag locating
    - extracting text or attribute values
- Crawler workflow (a minimal sketch follows this list):
    - specify the url
    - send the request
    - fetch the response data
    - parse the data
    - persistent storage
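- A minimal sketch of the five steps above; example.com is a placeholder url:

In [ ]:
import requests
# 1. specify the url
url = 'http://www.example.com'
# 2. send the request & 3. fetch the response data
page_text = requests.get(url=url).text
# 4. data parsing would go here (regex / bs4 / xpath)
# 5. persistent storage
with open('./page.html','w',encoding='utf-8') as fp:
    fp.write(page_text)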
In [2]:
import requests
import re
headers = {
    # request-header field that needs to be spoofed
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}
Regex parsing
Single characters:
    . : any character except newline
    [] : e.g. [aoe], [a-w] — matches any one character from the set
    \d : digit, i.e. [0-9]
    \D : non-digit
    \w : digit, letter, underscore, or Chinese character
    \W : non-\w
    \s : any whitespace character, including spaces, tabs, form feeds and the like; equivalent to [\f\n\r\t\v]
    \S : non-whitespace
Quantifiers:
* : any number of times (>=0)
+ : at least once (>=1)
? : optional (0 or 1 time)
{m} : exactly m times
{m,} : at least m times, e.g. hello{3,}
{m,n} : m to n times
Boundaries:
$ : ends with ...
^ : starts with ...
Grouping:
(ab)
Greedy mode: .*
Non-greedy (lazy) mode: .*?
re.I : ignore case
re.M : multi-line matching
re.S : single-line matching (lets . match newlines too)
re.sub(pattern, replacement, string) (a quick sketch follows)
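- re.sub and the flags above are not exercised in the cells below; a quick sketch:

In [ ]:
import re
# re.sub(pattern, replacement, string): replace every match
re.sub(r'\d+','*','hello 123 world 456')       # 'hello * world *'
# re.I: case-insensitive matching
re.findall('python','PYTHON and python',re.I)  # ['PYTHON', 'python']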
In [ ]:
import re
# extract "python"
key="javapythonc++php"
#####################################################################
# extract "hello world"
key="<html><h1>hello world<h1></html>"
#####################################################################
# extract 170
string = 'I like girls who are 170cm tall'
#####################################################################
# extract http:// and https://
key='http://www.baidu.com and https://boob.com'
#####################################################################
# extract "hit."
key='bobo@hit.edu.com' # we want to match hit. (address assumed; the original was redacted)
#####################################################################
# match sas and saas
key='saas and sas and saaas'
#####################################################################
In [3]:
key="javapythonc++php"
re.findall('python',key)
Out[3]:
['python']
In [4]:
# extract "hello world"
key="<html><h1>hello world<h1></html>"
re.findall('<h1>(.*?)<h1>',key)
Out[4]:
['hello world']
In [6]:
# extract 170
string = 'I like girls who are 170cm tall'
re.findall(r'\d+',string)
Out[6]:
['170']
In [11]:
# extract http:// and https://
key='http://www.baidu.com and https://boob.com'
re.findall('https?://',key)
Out[11]:
['http://', 'https://']
In [13]:
# extract "hit."
key='bobo@hit.edu.com' # we want to match hit. (address assumed; the original was redacted)
re.findall(r'h.*?\.',key)
Out[13]:
['hit.']
In [14]:
# match sas and saas
key='saas and sas and saaas'
re.findall('sa{1,2}s',key)
Out[14]:
['saas', 'sas']
- Requirement: use a regex to extract every post title from http://duanziwang.com/category/%E6%90%9E%E7%AC%91%E5%9B%BE/
In [16]:
url = 'http://duanziwang.com/category/搞笑图/'
# .text captures the response data as a string
page_text = requests.get(url=url,headers=headers).text
# data parsing
ex = r'<div class="post-head">.*?<a href="http://duanziwang.com/\d+\.html">(.*?)</a></h1>'
re.findall(ex,page_text,re.S) # re.S is a must when a crawler parses with regex (the page source spans many lines)
# persistent storage
Out[16]:
['比较下老婆和老妈,一比吓一跳_段子网收录最新段子',
'搞笑夫妻:烦恼里面找快乐_段子网收录最新段子',
'夫妻界的搞笑奇葩_段子网收录最新段子',
'超囧冷人小夫妻_段子网收录最新段子',
'雷人夫妻:吃泡面、偷看日记和生日送花_段子网收录最新段子',
'脸皮薄的人难以成功_段子网收录最新段子',
'12秒的雷政富和21秒李小璐_段子网收录最新段子',
'从前有只麋鹿,它在森林里玩儿,不小心走丢了。于是它给它的好朋友长颈鹿打电话:“喂…我迷路啦。”长颈鹿听见了回答说:“喂~我长颈鹿啦~”_段子网收录最新段子',
'最萌挡车球_段子网收录最新段子',
'再高贵冷艳的喵星人! 也总有一天有被吓得屁滚尿流的时候。_段子网收录最新段子']
bs4 parsing
- Environment setup:
    - pip install bs4
    - pip install lxml
- Parsing principle:
    - 1. Instantiate a BeautifulSoup object and load the page source to be parsed into that object
    - 2. Call the BeautifulSoup object's attributes and methods to locate tags and extract text (see the inline sketch after these notes)
- How to instantiate BeautifulSoup:
    - Mode 1: BeautifulSoup(fp,'lxml') parses HTML file data stored locally
    - Mode 2: BeautifulSoup(page_text,'lxml') parses data crawled straight from the Internet
- Tag locating:
    - soup.tagName: locates the first occurrence of the tag tagName
    - Attribute locating: locates the tag carrying the specified attribute
        - soup.find('tagName',attrName='value') (use class_ for the class attribute)
        - soup.find_all('tagName',attrName='value')
    - Selector locating:
        - soup.select('selector')
        - Hierarchy selectors:
            - > : one level down
            - space : multiple levels down
- Extracting text:
    - tag.string: direct text only
    - tag.text: all text, including descendants'
- Extracting attributes:
    - tag['attrName']
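- The cells below parse a local test.html whose markup is not included here; a self-contained sketch with assumed inline markup, showing string vs. text vs. attribute access:

In [ ]:
from bs4 import BeautifulSoup
# assumed markup standing in for the local test.html
html = '<div class="song"><p>direct text</p><a id="feng" href="http://www.haha.com">link text</a></div>'
demo = BeautifulSoup(html,'lxml')
demo.p.string                        # 'direct text': the tag's own direct text
demo.div.text                        # 'direct textlink text': all nested text
demo.find('div',class_='song')       # attribute locating
demo.select('.song > a')[0]['href']  # 'http://www.haha.com'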
In [18]:
from bs4 import BeautifulSoup
fp = open('./test.html','r',encoding='utf-8')
soup = BeautifulSoup(fp,'lxml')
soup
In [28]:
soup.div                             # first div in the page
# soup.find('div',class_='song')     # first div whose class is "song"
# soup.find('a',id='feng')           # the a tag whose id is "feng"
# soup.find_all('div',class_='song') # every matching div
# soup.select('.tang')               # class selector
# soup.select('#feng')               # id selector
soup.select('.tang li')              # space: li at any depth under class "tang"
In [38]:
soup.p.string                        # direct text of the first p
soup.p.text                          # all text under the first p
soup.find('div',class_='tang').text  # all text under the located div
a_tag = soup.select('#feng')[0]
a_tag['href']                        # extract the attribute value
Out[38]:
'http://www.haha.com'
- Requirement: crawl the full text of the novel Romance of the Three Kingdoms (sanguoyanyi) and store it persistently
- http://www.shicimingju.com/book/sanguoyanyi.html
In [44]:
main_url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(url=main_url,headers=headers).text
fp = open('./sanguo.txt','w',encoding='utf-8')
# data parsing
soup = BeautifulSoup(page_text,'lxml')
# parse out the chapter titles & detail-page urls
a_list = soup.select('.book-mulu > ul > li > a')
for a in a_list:
    title = a.string
    detail_url = 'http://www.shicimingju.com'+a['href']
    # capture the chapter content
    page_text_detail = requests.get(detail_url,headers=headers).text # page source of the detail page
    # parse the chapter content
    detail_soup = BeautifulSoup(page_text_detail,'lxml')
    div_tag = detail_soup.find('div',class_='chapter_content')
    content = div_tag.text
    fp.write(title+':'+content+'\n')
    print(title,'downloaded successfully!')
fp.close()
- Crawling image data
    - based on requests
    - based on urllib
    - difference: urllib's urlretrieve cannot apply UA spoofing (see the sketch after the urllib cell below)
In [47]:
# requests-based
url = 'http://pic.netbian.com/uploads/allimg/190902/152344-1567409024af8c.jpg'
img_data = requests.get(url=url,headers=headers).content # .content returns binary (bytes) data
with open('./123.png','wb') as fp:
    fp.write(img_data)
In [49]:
# urllib-based
from urllib import request
request.urlretrieve(url=url,filename='./456.jpg')
Out[49]:
('./456.jpg', <http.client.HTTPMessage at 0x217a8af3470>)
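- When UA spoofing is needed with urllib, a Request object can carry the headers that urlretrieve cannot; a sketch reusing the url and headers defined above:

In [ ]:
# urlretrieve cannot set request headers, but urlopen with a Request object can
req = request.Request(url=url,headers=headers)
img_data = request.urlopen(req).read() # binary response data
with open('./789.jpg','wb') as fp:     # example file name
    fp.write(img_data)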
xpath parsing
- Environment setup:
    - pip install lxml
- Parsing principle:
    - Instantiate an etree object and load the page source to be parsed into that object
    - Call the etree object's xpath method with different forms of xpath expressions to locate tags and extract data (see the inline sketch after these notes)
- Instantiating the object:
    - etree.parse('filePath'): loads locally stored HTML file data into the etree object
    - etree.HTML(page_text): loads data crawled straight from the Internet
- xpath expressions
    - Tag locating:
        - leftmost /: locating must start from the root tag (almost never used)
        - non-leftmost /: one level down
        - leftmost //: locate the specified tag starting from anywhere (most common)
        - non-leftmost //: multiple levels down
    - Attribute locating: //tagName[@attrName="value"]
    - Index locating: //tagName[index], where the index starts from 1
    - //div[contains(@class, "ng")]
    - //div[starts-with(@class, "ta")]
    - Extracting text:
        - /text(): direct text only
        - //text(): all text, including descendants'
    - Extracting attributes:
        - /@attrName
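- A self-contained sketch of these expressions against assumed inline markup, before the test.html demos below:

In [ ]:
from lxml import etree
# assumed markup for demonstration
tree_demo = etree.HTML('<div class="tang"><ul><li><a href="http://www.haha.com">song</a></li></ul></div>')
tree_demo.xpath('//div[@class="tang"]/ul/li[1]/a/text()')  # ['song']: direct text
tree_demo.xpath('//div[@class="tang"]//text()')            # all nested text
tree_demo.xpath('//a/@href')                               # ['http://www.haha.com']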
- Tag locating:
In [76]:
from lxml import etree
tree = etree.parse('./test.html')
tree.xpath('/html/head/meta')               # locate starting from the root
tree.xpath('/html//meta')                   # // in the middle: multiple levels
tree.xpath('//meta')                        # locate meta anywhere in the page
tree.xpath('/meta')                         # finds nothing: meta is not the root tag
tree.xpath('//p')
tree.xpath('//div[@class="tang"]')          # attribute locating
tree.xpath('//li[1]')                       # index locating (starts from 1)
tree.xpath('//a[@id="feng"]/text()')[0]     # direct text
tree.xpath('//div[@class="tang"]//text()')  # all nested text
tree.xpath('//a[@id="feng"]/@href')         # attribute value
Out[76]:
['http://www.haha.com']
- Requirement: use xpath to parse out the image urls and names, then download the images and save them locally
- http://pic.netbian.com/4kmeinv/
Key point: what ./ means in local (partial) parsing
In [86]:
import os
In [88]:
url = 'http://pic.netbian.com/4kmeinv/'
page_text = requests.get(url=url,headers=headers).text
dirName = 'imgLibs'
if not os.path.exists(dirName):
    os.mkdir(dirName)
# data parsing
tree = etree.HTML(page_text)
# xpath parses against the entire page source
img_list = tree.xpath('//div[@class="slist"]/ul/li/a/img')
# local (partial) parsing
for img in img_list:
    # ./ stands for the current tag: the caller of xpath
    img_src = 'http://pic.netbian.com'+img.xpath('./@src')[0]
    img_name = img.xpath('./@alt')[0]+'.jpg'
    img_name = img_name.encode('iso-8859-1').decode('gbk') # fix the mis-encoded file name
    filePath = './'+dirName+'/'+img_name
    request.urlretrieve(img_src,filename=filePath)
    print(img_name,'downloaded successfully!')
In [90]:
# full-site crawl
# 1. define a generic url template used to generate the url of each page number; the template itself stays fixed
url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
dirName = 'imgLibs'
if not os.path.exists(dirName):
    os.mkdir(dirName)
for pageNum in range(1,6):
    if pageNum == 1:
        page_url = 'http://pic.netbian.com/4kmeinv/' # page 1 has no index suffix
    else:
        page_url = url%pageNum
    page_text = requests.get(url=page_url,headers=headers).text
    # data parsing
    tree = etree.HTML(page_text)
    # xpath parses against the entire page source
    img_list = tree.xpath('//div[@class="slist"]/ul/li/a/img')
    # local (partial) parsing
    for img in img_list:
        # ./ stands for the current tag: the caller of xpath
        img_src = 'http://pic.netbian.com'+img.xpath('./@src')[0]
        img_name = img.xpath('./@alt')[0]+'.jpg'
        img_name = img_name.encode('iso-8859-1').decode('gbk') # fix the mis-encoded file name
        filePath = './'+dirName+'/'+img_name
        request.urlretrieve(img_src,filename=filePath)
        print(img_name,'downloaded successfully!')