Preface:
I recently learned some basic Python syntax and followed teacher Liao Xuefeng's tutorial to build a simple website on an asynchronous IO framework (parts of which I still don't fully understand). Writing crawlers is one of the fun sides of Python, so with the ample free time I have now I'm learning it properly. I can also recommend the Xiaoyuanquan course I'm following; the teacher is particularly good.
0x00: About crawlers
Crawlers classified by usage scenario:
- General-purpose crawler: an important component of crawling systems; it grabs whole pages of data.
- Focused crawler: built on top of a general-purpose crawler; it grabs specific local content within a page.
- Incremental crawler: monitors a site for data updates and crawls only the site's newest data.
Anti-crawling mechanism: a portal site can adopt strategies or techniques to keep crawlers from scraping its data.
Anti-anti-crawling strategy: a crawler can in turn adopt strategies or techniques to break through a portal's anti-crawling mechanisms and get at its data.
robots.txt protocol: a gentleman's agreement. The site declares which of its data may be crawled and which may not.
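For instance, you can check a site's robots.txt rules before crawling with the standard library's urllib.robotparser; a minimal sketch (the Sogou URL here is only an example target):

from urllib.robotparser import RobotFileParser

# download and parse the site's robots.txt
rp = RobotFileParser()
rp.set_url('https://www.sogou.com/robots.txt')
rp.read()
# True if the rules allow a generic crawler ('*') to fetch this path
print(rp.can_fetch('*', 'https://www.sogou.com/web?query=test'))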
0x01: http & https protocol
http protocol:
A form of data exchange between a server and a client.
Common request headers:
User-Agent: the identity of the request carrier
Connection: whether to close or keep the connection once the request completes
Common response headers:
Content-Type: the type of the data the server sends back to the client
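To see these headers in practice, the requests module (introduced in 0x02 below) exposes both sides of the exchange; a small sketch:

import requests

r = requests.get('https://www.sogou.com/')
print(r.request.headers.get('User-Agent'))   # the identity we sent
print(r.request.headers.get('Connection'))   # keep-alive by default
print(r.headers.get('Content-Type'))         # the data type the server sent back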
https protocol
The secure hypertext transfer protocol; it involves data encryption.
Encryption schemes:
- Symmetric-key encryption
- Asymmetric-key encryption
- Certificate-based key encryption
Symmetric-key encryption
The client first encrypts the data, then sends the key and the ciphertext together to the server; the server uses the key to decrypt the ciphertext.
Disadvantage: both key and ciphertext can be intercepted in transit, so it is insecure.
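A minimal sketch of the idea using the third-party cryptography package (pip install cryptography; the package choice is my illustration, not part of the original tutorial):

from cryptography.fernet import Fernet

# one shared key both encrypts and decrypts
key = Fernet.generate_key()
token = Fernet(key).encrypt(b'page data')   # "client" encrypts the data
# if key and token travel together, anyone intercepting both can read the data
print(Fernet(key).decrypt(token))           # "server" decrypts with the same key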
Asymmetric-key encryption
Example: the RSA cryptosystem. The server publishes a public key for encryption and keeps the private key for decryption, so the key that can unlock the data never travels over the wire.
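A hedged sketch of the asymmetric idea, again with the cryptography package (my own illustration, not the tutorial's code):

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# the server keeps the private key and publishes only the public key
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
ciphertext = public_key.encrypt(b'hello server', oaep)   # client encrypts
print(private_key.decrypt(ciphertext, oaep))             # only the server can decrypt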
Certificate-based key encryption
A trusted certificate authority signs the server's public key, so the client can verify the key really belongs to the server before using it.
0x02: the requests module
The requests module is a third-party Python module for network requests; it is very powerful and very efficient. Its role is to simulate a browser initiating a request, and since it simulates a browser, it has to do the same job a browser does.
The requests coding workflow:
- Specify the url
- Initiate the request
- Fetch the response data
- Persist the data
Exercise 1: crawling the Sogou home page (basic workflow)
Crawler code:
# Crawl the Sogou home page
import requests

if __name__ == "__main__":
    # step1: specify the url
    url = 'https://www.sogou.com/'
    # step2: initiate the request
    # get() returns a response object
    response = requests.get(url=url)
    # step3: fetch the response data
    # .text returns the response body as a string
    page_text = response.text
    print(page_text)
    # step4: persistent storage
    with open('D:/爬虫/sogou.html', 'w', encoding='utf-8') as f:
        f.write(page_text)
    print('Crawl finished')
Exercise 2: a web page collector (GET)
Before writing it, you need to understand a few anti-crawling mechanisms:
UA: User-Agent (the identity of the request carrier)
UA detection: the server inspects the User-Agent of an incoming request. If the carrier identifies as a browser, the request is treated as normal; if not, the request is treated as abnormal (a crawler) and the server is likely to reject it.
UA spoofing: wrap a browser User-Agent in a dictionary and add it to the request.
Here we crawl Sogou search. For example, when searching for Sword Art Online, the url carries a bunch of parameters, but one parameter, query, clearly holds the term we searched for. Keeping only the query parameter still works, and it makes the url easier to specify. We also need to add UA spoofing to keep the server from rejecting us.
Here is the crawler code:
import requests

if __name__ == "__main__":
    # UA spoofing: wrap the User-Agent in a dictionary
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    # specify the url
    url = "https://www.sogou.com/sogou"
    # custom parameter
    kw = input('please input param: ')
    # wrap the parameter in a dictionary
    param = {
        'query': kw
    }
    # initiate the request: params carries the custom parameters, headers the spoofed UA
    response = requests.get(url=url, params=param, headers=header)
    # fetch the response data
    page_text = response.text
    # save it
    fileName = kw + '.html'
    with open(fileName, 'w', encoding='utf-8') as f:
        f.write(page_text)
    print("Crawl succeeded")
The crawl succeeds; to crawl other content, you only need to change the parameter.
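For example, a small wrapper (the crawl_sogou name is my own, not from the tutorial) lets you reuse the code above for several queries:

import requests

def crawl_sogou(kw):
    # same spoofed UA and query parameter as in the exercise above
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}
    response = requests.get(url="https://www.sogou.com/sogou", params={'query': kw}, headers=header)
    with open(kw + '.html', 'w', encoding='utf-8') as f:
        f.write(response.text)

for kw in ['dog', 'cat']:
    crawl_sogou(kw)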
Exercise 3: cracking Baidu Translate (POST)
Before writing the code, let's first look at AJAX.
AJAX is a technique for creating fast, dynamic web pages. By exchanging small amounts of data with the server behind the scenes, AJAX lets a page update asynchronously, meaning parts of a page can be refreshed without reloading the whole page.
Opening Baidu Translate, we find the page only partially refreshes, so AJAX is in use: after a successful request, only part of the page updates. As long as we capture the corresponding AJAX request, we can store the translation result in a json text file. Why store it as json? Because the captured traffic shows that is what the server returns. Next, capture the AJAX request (for example, type dog), note the parameters it carries, and then write the crawler code:
import requests
import json

# specify the POST url
post_url = 'https://fanyi.baidu.com/sug'
# UA spoofing
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
}
# custom POST parameter
word = input('please input: ')
data = {
    'kw': word
}
# initiate the request
post_response = requests.post(url=post_url, data=data, headers=headers)
# fetch the response
# note: json() returns a Python object (only call json() if the response body really is json)
dict_text = post_response.json()
# print(dict_text)
# save the data
fileName = word + '.json'
# note: the returned json contains Chinese, so ensure_ascii=False keeps it from being ASCII-escaped
with open(fileName, 'w', encoding='utf-8') as fp:
    json.dump(dict_text, fp=fp, ensure_ascii=False)
print('Finish')
The crawl succeeds.
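If you want to print the suggestions rather than just save them, the returned object can be walked directly; this sketch assumes the usual shape of the sug response (a 'data' list of {'k': term, 'v': translation} entries), which Baidu may change:

# dict_text is the object returned by post_response.json() above
for entry in dict_text.get('data', []):
    print(entry['k'], '->', entry['v'])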
Exercise 4: crawling Douban Movies
Pick a movie category and scroll: each page shows 20 movies, and scrolling further loads more movies without the page url changing; only part of the page changes. So we can capture the AJAX request. The AJAX request carries the parameters below, and since we know what it carries, we can write the crawling script.
# Crawl Douban movies
import requests
import json

if __name__ == "__main__":
    url = "https://movie.douban.com/j/chart/top_list"
    # parameters
    params = {
        "type": "24",
        "interval_id": "100:90",
        "action": "",
        "start": "0",
        "limit": "20",
    }
    # UA spoofing
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
    }
    # request
    r = requests.get(url=url, params=params, headers=headers)
    # response
    data = r.json()
    # storage
    # indent=4 pretty-prints the json with four-space indentation
    with open("电影.json", "w", encoding="utf-8") as fp:
        json.dump(data, fp=fp, ensure_ascii=False, indent=4)
    print("Crawl succeeded")
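Since start and limit control the paging, grabbing more than the first 20 movies is just a loop over start; a sketch of my own on top of the code above:

import requests

url = "https://movie.douban.com/j/chart/top_list"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
all_movies = []
for start in range(0, 60, 20):   # first three pages of 20 movies
    params = {"type": "24", "interval_id": "100:90", "action": "",
              "start": str(start), "limit": "20"}
    # each response is a json list of movie dicts
    all_movies.extend(requests.get(url, params=params, headers=headers).json())
print(len(all_movies))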
Assignment: crawling KFC restaurant locations
When querying a restaurant, only part of the page changes, so we can tell it is AJAX; capturing the request shows it is sent via POST. Note that this time the response is no longer json but plain text.
The crawler code:
import requests

if __name__ == "__main__":
    url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    key = input("Please enter the query address\n")
    page = input("Please enter the query page number\n")
    # POST parameters
    data = {
        'cname': '',
        'pid': '',
        'keyword': key,
        'pageIndex': page,
        'pageSize': '10',
    }
    response = requests.post(url=url, data=data, headers=headers)
    # the body comes back as text this time
    page_text = response.text
    print(page_text)
    fileName = key + '.txt'
    with open(fileName, 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print("Crawl succeeded")
The crawl succeeds.
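Although we saved the body as plain text, it may well be json-shaped anyway; if so, it can still be parsed after the fact (a hedged sketch; whether the body parses as json is an assumption, not something the tutorial confirms):

import json

# page_text is the response body saved above
try:
    obj = json.loads(page_text)
    print(type(obj))   # inspect what the server actually returned
except json.JSONDecodeError:
    print('the body really is plain text, not json')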
To sum up:
That's it for this first session. I got a lot out of it, and the next post will record more of this fun crawler journey!