Python Crawler Journey: Part One

Preface:

I recently learned some basic Python syntax and followed teacher Liao Xuefeng's tutorial to build a simple website with an asynchronous IO framework (although I still don't fully understand parts of it). One of the fun sides of Python is writing crawlers, so with plenty of spare time now I'm picking them up; the crawler course I'm following is also excellent and the teacher is particularly good.

0x00: Getting to know crawlers

Crawlers can be classified by usage scenario:
- 通用爬虫 (general-purpose crawler):
an important component of a search engine's crawling system; it fetches entire pages of data.
- 聚焦爬虫 (focused crawler):
built on top of a general-purpose crawler; it extracts specific local content from a page.
- 增量式爬虫 (incremental crawler):
monitors a website for data updates and crawls only the newly updated data.

反爬机制 (anti-crawling mechanisms): portal websites can adopt strategies or techniques to prevent crawlers from scraping their data.
反反爬策略 (anti-anti-crawling strategies): crawlers can in turn adopt strategies or techniques to break through a portal's anti-crawling mechanisms and obtain its data.

The robots.txt protocol:

A gentleman's agreement: the website declares which of its data may be crawled and which may not.
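
As a quick aside (not in the original post), Python's standard library can read a site's robots.txt and check whether a URL may be fetched; the Sogou URLs below are only an illustration:

from urllib.robotparser import RobotFileParser

# download and parse the site's robots.txt
rp = RobotFileParser()
rp.set_url('https://www.sogou.com/robots.txt')
rp.read()

# can_fetch() reports whether the given user agent is allowed to crawl the URL
print(rp.can_fetch('*', 'https://www.sogou.com/web?query=test'))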

0x01: HTTP & HTTPS protocols

HTTP protocol:

A protocol by which the server and the client exchange data.

Common request headers:

User-Agent: the identity of the carrier sending the request
Connection: whether to close the connection or keep it alive after the request completes

Common response headers:

Content-Type: the type of the data the server sends back to the client

HTTPS protocol

The secure hypertext transfer protocol; it adds data encryption on top of HTTP.

Encryption methods:

  1. Symmetric key encryption
  2. Asymmetric key encryption
  3. Certificate-based key encryption

Symmetric key encryption

The client first encrypts the data, then sends the key and the ciphertext together to the server, and the server uses that key to decrypt the ciphertext.
Disadvantage: the key travels with the ciphertext and is easy to intercept, so this is insecure.
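
As an illustrative sketch only (not from the original post), the Fernet class from the third-party cryptography package shows the symmetric idea: whoever holds the key can decrypt.

from cryptography.fernet import Fernet  # pip install cryptography

# both sides share one key -- anyone who intercepts it can also decrypt
key = Fernet.generate_key()

ciphertext = Fernet(key).encrypt(b'secret data')   # "client" side: encrypt with the key
plaintext = Fernet(key).decrypt(ciphertext)        # "server" side: decrypt with the same key
print(plaintext)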

Asymmetric key encryption

A representative example is the RSA cryptosystem: the server keeps a private key and hands out the matching public key; data encrypted with the public key can only be decrypted with the private key, so intercepting the public key is not enough to read the ciphertext.
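
A minimal sketch of that idea, again using the cryptography package (an assumption added for illustration, not part of the original post):

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# the server generates the key pair and publishes only the public key
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

ciphertext = public_key.encrypt(b'secret data', oaep)   # client encrypts with the public key
print(private_key.decrypt(ciphertext, oaep))            # only the private key can decrypt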

Certificate-based key encryption

To keep the public key itself from being swapped out in transit, a trusted certificate authority signs the server's public key and issues a certificate, and the client verifies that certificate before using the key. (The original post illustrates this workflow with two diagrams, omitted here.)

0x02: The requests module

The requests module is a Python module for sending network requests. It is very powerful and efficient. Its role is to simulate a browser issuing a request, and since it simulates the browser, our requests code has to do the same work the browser does.

The coding workflow with requests:

  1. Specify the URL
  2. Send the request
  3. Fetch the response data
  4. Persist the data

Exercise 1: crawling the Sogou homepage (basic workflow)

Crawler code:

# crawl the Sogou homepage
import requests
if __name__ == "__main__":
    # step1: specify the url
    url = 'https://www.sogou.com/'
    # step2: send the request
    # the get method returns a response object
    response = requests.get(url=url)
    # step3: fetch the response data
    # .text returns the response body as a string
    page_text = response.text
    print(page_text)
    # step4: persistent storage
    with open('D:/爬虫/sogou.html','w',encoding='utf-8') as f:
        f.write(page_text)
    print('Crawl finished')

Response: the page source is printed to the console and saved to sogou.html (screenshot omitted).

Exercise 2: a simple web page collector (GET)

Before writing it, you need to understand one of the anti-crawling mechanisms:

UA: User-Agent (the identity of the carrier sending the request)

UA检测 (UA detection): the portal's server inspects the User-Agent of each incoming request. If the User-Agent identifies a browser, the request is treated as a normal request; if it does not, the request is judged abnormal (a crawler) and the server is likely to reject it.

UA伪装 (UA spoofing): wrap a browser User-Agent in a dictionary and add it to the request in the code.
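
A quick hedged sketch of the difference (the browser UA string is just an example value):

import requests

# the default identity requests sends -- easy for a server to flag as a crawler
print(requests.utils.default_user_agent())      # e.g. 'python-requests/2.x'

# UA spoofing: pass a browser User-Agent through the headers parameter
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
}
response = requests.get('https://www.sogou.com/', headers=headers)
print(response.request.headers['User-Agent'])   # the server now sees the browser UA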

Here we crawl Sogou search. For example, when searching for "Sword Art", the URL is followed by a pile of parameters, but one parameter, query, stands out: it is the term we want to search for. Keeping only that parameter still works, which makes the URL easier to specify. In addition, we need to add UA spoofing so the request is not rejected. Here is the crawler code:

import requests

if __name__ == "__main__":
    # UA spoofing: wrap the browser User-Agent in a dictionary
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    # specify the url
    url = "https://www.sogou.com/sogou"
    # custom query parameter
    kw = input('please input param')
    # wrap the parameter in a dictionary
    param = {
        'query': kw
    }
    # send the request; params passes the custom parameters, headers passes the spoofed UA
    response = requests.get(url=url, params=param, headers=header)
    # fetch the response data
    page_text = response.text
    # save the data
    fileName = kw + '.html'
    with open(fileName, 'w', encoding='utf-8') as f:
        f.write(page_text)
    print("Crawl succeeded")

The crawl succeeds. If you want to crawl other content, you only need to change the query parameter.

Exercise 3: scraping Baidu Translate (POST)

Before writing it, we should first take a look at AJAX.

AJAX is a technique for creating fast, dynamic web pages.
By exchanging small amounts of data with the server behind the scenes, AJAX lets a page update asynchronously. This means that parts of a page can be refreshed without reloading the entire page.

Open Baidu Translate and you will find that only part of the page refreshes as you type, so AJAX is being used here: after a request succeeds, only a local area of the page updates. As long as we capture the corresponding AJAX request, we can store the translation result in a json text file.

Why save it as json? Capturing the response shows that the server returns json data.

Next, capture the AJAX request itself: for example, type "dog" into the translator, capture the parameters the request carries, and then write the crawler code.

import requests
import json

# specify the post url
post_url = 'https://fanyi.baidu.com/sug'
# UA spoofing
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
}
# custom post parameters
word = input('please input')
data = {
    'kw': word
}
# send the request
post_response = requests.post(url=post_url, data=data, headers=headers)

# fetch the response
# note: json() returns a Python object (only call json() if the response really is json)
dict_text = post_response.json()
# print(dict_text)
# save the data
fileName = word + '.json'
fp = open(fileName, 'w', encoding='utf-8')
# note: the returned json contains Chinese, so use ensure_ascii=False rather than escaping to ASCII
json.dump(dict_text, fp=fp, ensure_ascii=False)
fp.close()
print('Finish')

The crawl succeeds and the result is saved to the json file.

Exercise 4: crawling Douban Movies

Pick a movie category in the rankings and look at the listing: each load returns 20 movies, and when you scroll down to see more, the page URL does not change, only part of the page updates, so we can once again capture the AJAX request. The captured request carries the parameters shown in the code below; since we know what it carries, we can write the crawler script.

# crawl the Douban movie rankings
import requests
import json
if __name__ == "__main__":
    url = "https://movie.douban.com/j/chart/top_list"
    # query parameters captured from the AJAX request
    params = {
        "type": "24",
        "interval_id": "100:90",
        "action": "",
        "start": "0",    # index of the first movie to return
        "limit": "20",   # number of movies per request
    }
    # UA spoofing
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
    }
    # send the request
    r = requests.get(url=url, params=params, headers=headers)
    # fetch the response
    data = r.json()
    # save the data
    fp = open("电影.json", "w", encoding="utf-8")
    # indent=4 pretty-prints the json with a 4-space indent
    json.dump(data, fp=fp, ensure_ascii=False, indent=4)
    fp.close()
    print("Crawl succeeded")


Assignment: crawling KFC restaurant locations

When you search for a restaurant on the KFC site, only part of the page changes, so this is AJAX again; capture the request and you will see that it is sent via POST. Note that this time the response type is no longer json but plain text.

The crawler code:

import requests

if __name__ == "__main__":
    url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
    # UA spoofing
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    key = input("Please enter the query address\n")
    page = input("Please enter the query page number\n")
    # POST parameters captured from the AJAX request
    data = {
        'cname': '',
        'pid': '',
        'keyword': key,
        'pageIndex': page,
        'pageSize': '10',
    }
    # send the POST request
    response = requests.post(url=url, data=data, headers=headers)
    # the response body is plain text this time
    page_text = response.text
    print(page_text)
    # save the data
    fileName = key + '.txt'
    with open(fileName, 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print("Crawl succeeded")

The crawl succeeds.

To sum up:

That is it for this first session. I got a lot out of it, and the next post will record more of this fun crawler journey!
