Python Crawler Series, Episode 09 (IP proxies, requests.post parameters, cracking the Youdao Translation API)
1. IP proxy
1. Common proxy platforms
Xici proxy (西刺代理), Kuaidaili (快代理), Zhima proxy (芝麻代理), Quanwang proxy (全网代理), Abuyun (阿布云), Proxy Wizard (代理精灵)
2. Purpose and how it works
Hides your real IP so the target site cannot block it.
The crawler sends its request to the proxy server, the proxy server forwards it to the target server, the target server returns the data to the proxy server, and the proxy server passes it back to the crawler.
Proxy servers change frequently. To use one, pass the proxies parameter to requests: a dictionary mapping protocol to proxy URL.
3. Types of proxy IP
Anonymous proxy: the target site knows the request comes through a proxy IP, but cannot see the user's real IP.
Transparent proxy: the target site can see both the user's real IP and the proxy IP.
High-anonymity proxy: the target site sees only the proxy IP.
4. Implementation method
4.1 Open proxy: the idea
[1] Find a site that provides proxy IPs
Xici proxy, Kuaidaili, Quanwang proxy, Proxy Wizard, Abuyun, Zhima proxy, ...
[2] Parameter format
proxies = {'protocol': 'protocol://IP:port'}
proxies = {
    'http': 'http://IP:port',
    'https': 'https://IP:port',
}
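The dictionary form above can be wrapped in a small helper that builds the proxies dict from a bare 'IP:port' string (a sketch; the helper name `make_proxies` is my own, not part of requests):

```python
def make_proxies(proxy):
    """Build the requests `proxies` dict from an 'IP:port' string."""
    return {
        'http': 'http://{}'.format(proxy),
        'https': 'https://{}'.format(proxy),
    }

print(make_proxies('1.2.3.4:8080'))
```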
4.2 Open proxy: sample code
import requests

url = 'http://httpbin.org/get'
headers = {'User-Agent': 'Mozilla/5.0'}
proxies = {
    'http': 'http://112.85.164.220:9999',
    'https': 'https://112.85.164.220:9999'
}
html = requests.get(url, proxies=proxies, headers=headers, timeout=5).text
print(html)
4.3 Private proxy and exclusive proxy
[1] Syntax
proxies = {'protocol': 'protocol://username:password@IP:port'}
[2] Example
proxies = {
    'http': 'http://username:password@IP:port',
    'https': 'https://username:password@IP:port',
}
4.4 Private proxy and exclusive proxy: sample code
import requests

url = 'http://httpbin.org/get'
proxies = {
    'http': 'http://309435365:[email protected]:16816',
    'https': 'https://309435365:[email protected]:16816',
}
headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get(url, proxies=proxies, headers=headers, timeout=5).text
print(html)
4.5 Build your own proxy IP pool (open proxy | private proxy)
"""
Build a proxy IP pool from open proxies
Idea:
    1. Fetch a batch of open proxies
    2. Test each proxy IP in turn; save the usable ones to a file
"""
import requests

class ProxyPool:
    def __init__(self):
        self.url = 'API link of the proxy site'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'}
        self.f = open('proxy.txt', 'w')

    def get_html(self):
        html = requests.get(url=self.url, headers=self.headers).text
        proxy_list = html.split('\r\n')
        for proxy in proxy_list:
            if self.check_proxy(proxy):
                self.f.write(proxy + '\n')

    def check_proxy(self, proxy):
        """Test one proxy IP; return True if usable, otherwise False"""
        test_url = 'http://httpbin.org/get'
        proxies = {
            'http': 'http://{}'.format(proxy),
            'https': 'https://{}'.format(proxy)
        }
        try:
            res = requests.get(url=test_url, proxies=proxies, headers=self.headers, timeout=2)
            if res.status_code == 200:
                print(proxy, '\033[31musable\033[0m')
                return True
            else:
                print(proxy, 'invalid')
                return False
        except Exception:
            print(proxy, 'invalid')
            return False

    def run(self):
        self.get_html()
        self.f.close()

if __name__ == '__main__':
    spider = ProxyPool()
    spider.run()
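Once proxy.txt exists, later crawls can load it and rotate through the pool. A minimal sketch (the function names are my own; it assumes the one-'IP:port'-per-line format written by ProxyPool above):

```python
import random

def load_proxies(path='proxy.txt'):
    """Read the proxies saved by ProxyPool, one 'IP:port' per line."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def pick_proxy(proxy_list):
    """Choose one proxy at random so requests are spread across the pool."""
    proxy = random.choice(proxy_list)
    return {'http': 'http://' + proxy, 'https': 'https://' + proxy}
```

Each call to pick_proxy builds a fresh proxies dict that can be passed straight to requests.get.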
2. requests.post() parameters
1. When to use
Sites that take POST requests
Inspect the request data via the browser's F12 panel or a Fiddler capture
2. Parameter format
response = requests.post(url, data=data, headers=headers)
data: the POST payload (Form data, as a dictionary)
3. Characteristics of the POST method
GET request: parameters appear in the URL
POST request: data is submitted via a form, in the request body
4. The difference between GET and POST
GET carries its parameters in the URL (visible and length-limited), while POST carries them in the request body, which suits form submission and larger or more sensitive payloads.
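The difference is easy to see without touching the network, by preparing (but not sending) the two kinds of request with requests:

```python
import requests

# Prepare (do not send) a GET and a POST to see where the data ends up.
get_req = requests.Request('GET', 'http://httpbin.org/get',
                           params={'kw': 'hello'}).prepare()
post_req = requests.Request('POST', 'http://httpbin.org/post',
                            data={'kw': 'hello'}).prepare()

print(get_req.url)    # parameters are appended to the URL
print(post_req.url)   # URL is unchanged
print(post_req.body)  # form-encoded data sits in the request body
```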
3. Cracking Youdao Translation
1. Goal
Crack the Youdao Translation API and scrape the translation result
Enter the word to translate: code
Result: 代码
**************************
Enter the word to translate: 警告
Result: warm
2. Implementation steps
1. Open the browser's F12 network capture (Network - All), translate a word on the page, and find the Form data
2. Translate a few more words and watch how the Form data changes (some fields are encrypted strings)
3. Refresh the Youdao Translation page, then capture and analyze the JS code (the encryption happens in local JS)
4. Find the JS encryption algorithm and reproduce it in Python to generate the encrypted fields
5. Put the Form data into a dictionary and send it via the data parameter of requests.post()
3. Concrete implementation
1. Open F12, capture the packet, and find the following Form data:
i: 喵喵叫
from: AUTO
to: AUTO
smartresult: dict
client: fanyideskweb
salt: 15614112641250
sign: 94008208919faa19bd531acde36aac5d
ts: 1561411264125
bv: f4d62a2579ebb44874d7ef93ba47e822
doctype: json
version: 2.1
keyfrom: fanyi.web
action: FY_BY_REALTlME
2. Translate a few more words on the page and observe which Form fields change:
salt: 15614112641250
sign: 94008208919faa19bd531acde36aac5d
ts: 1561411264125
bv: f4d62a2579ebb44874d7ef93ba47e822
3. The encryption usually lives in a local JS file; refresh the page, find that file, and analyze its code
[Method 1]: Network - JS tab - search for the keyword salt
[Method 2]: console, top-right corner - Search - search for salt - open the file - pretty-print it
[Result]: the relevant JS file turns out to be fanyi.min.js
4. Open the JS file, analyze the encryption algorithm, and implement it in Python
[ts] Analysis shows it is a 13-digit millisecond timestamp, as a string
    JS:     "" + (new Date).getTime()
    Python: str(int(time.time() * 1000))
[salt]
    JS:     ts + parseInt(10 * Math.random(), 10);
    Python: ts + str(random.randint(0, 9))
[sign] (set a breakpoint and inspect e: it is the word being translated)
    JS:     n.md5("fanyideskweb" + e + salt + "n%A-rKaT5fb[Gy?;N5@Tj")
    Python:
        from hashlib import md5
        string = "fanyideskweb" + e + salt + "n%A-rKaT5fb[Gy?;N5@Tj"
        s = md5()
        s.update(string.encode())
        sign = s.hexdigest()
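The three pieces above can be put together into one helper (a sketch; the secret string is the one extracted from fanyi.min.js above and may change whenever Youdao updates its JS):

```python
import time
import random
from hashlib import md5

def make_salt_sign_ts(word):
    """Reproduce Youdao's local JS encryption for one word."""
    ts = str(int(time.time() * 1000))        # 13-digit millisecond timestamp
    salt = ts + str(random.randint(0, 9))    # timestamp plus one random digit
    raw = "fanyideskweb" + word + salt + "n%A-rKaT5fb[Gy?;N5@Tj"
    sign = md5(raw.encode()).hexdigest()     # 32-char hex digest
    return salt, sign, ts
```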
5. Batch-quoting headers and Form data in PyCharm with a regex
[1] In PyCharm: Ctrl + R, tick Regex
[2] Process the headers and Form data with
    find:    (.*): (.*)
    replace: "$1": "$2",
[3] Click Replace All
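The same substitution can be done in plain Python with re.sub, which helps outside PyCharm (a sketch; the first group is non-greedy so values containing colons, such as URLs, stay intact):

```python
import re

raw = """\
Referer: http://fanyi.youdao.com/
User-Agent: Mozilla/5.0"""

# Same find/replace the PyCharm regex performs: quote key and value, add a comma.
quoted = re.sub(r'(.*?): (.*)', r'"\1": "\2",', raw)
print(quoted)
```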
import requests
import time
import random
from hashlib import md5

class YdSpider(object):
    def __init__(self):
        self.url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
        self.headers = {
            "Cookie": "[email protected]; OUTFOX_SEARCH_USER_ID_NCOO=570559528.1224236; _ntes_nnid=96bc13a2f5ce64962adfd6a278467214,1551873108952; JSESSIONID=aaae9i7plXPlKaJH_gkYw; td_cookie=18446744072941336803; SESSION_FROM_COOKIE=unknown; ___rl__test__cookies=1565689460872",
            "Referer": "http://fanyi.youdao.com/",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
        }

    def get_salt_sign_ts(self, word):
        # Reproduce the encryption found in fanyi.min.js
        ts = str(int(time.time() * 1000))
        salt = ts + str(random.randint(0, 9))
        string = "fanyideskweb" + word + salt + "n%A-rKaT5fb[Gy?;N5@Tj"
        s = md5()
        s.update(string.encode())
        sign = s.hexdigest()
        return salt, sign, ts

    def attack_yd(self, word):
        salt, sign, ts = self.get_salt_sign_ts(word)
        data = {
            "i": word,
            "from": "AUTO",
            "to": "AUTO",
            "smartresult": "dict",
            "client": "fanyideskweb",
            "salt": salt,
            "sign": sign,
            "ts": ts,
            "bv": "7e3150ecbdf9de52dc355751b074cf60",
            "doctype": "json",
            "version": "2.1",
            "keyfrom": "fanyi.web",
            "action": "FY_BY_REALTlME",
        }
        html = requests.post(
            url=self.url,
            data=data,
            headers=self.headers
        ).json()
        result = html['translateResult'][0][0]['tgt']
        print(result)

    def run(self):
        word = input('Enter the word to translate: ')
        self.attack_yd(word)

if __name__ == '__main__':
    spider = YdSpider()
    spider.run()