Python Crawler Series, Episode 09 (IP proxies, requests.post() parameters, cracking Youdao Translate)

One. IP proxies

1. Common proxy platforms

  • Xici Proxy, Kuai Proxy, Zhima Proxy, Quanwang Proxy, Abuyun Proxy, Daili Jingling (Proxy Wizard)

2. Purpose and how it works

  • Hides your real IP so it doesn't get blocked
  • The crawler sends its request to the proxy server, the proxy server forwards it to the target server, the target server returns the data to the proxy server, and the proxy server passes the data back to the crawler.
  • Proxy servers change frequently. To use one, pass the proxies parameter to the request; it takes the form of a dictionary.

3. Types of proxy IP

  • Ordinary (anonymous) proxy
    The target server can tell that someone is accessing it through a proxy IP, but cannot see the user's real IP
  • Transparent proxy
    The target server can see both the user's real IP and the proxy IP
  • High-anonymity (elite) proxy
    The target server sees only the proxy IP
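A rough way to check which category a proxy falls into is to request http://httpbin.org/get through it and inspect what the server echoes back. A minimal sketch; the proxy below is the free example IP used later in 4.2 and has likely expired, so substitute a live one:

import requests

# Free example proxy (likely expired); replace with a live proxy IP
proxies = {
    'http': 'http://112.85.164.220:9999',
    'https': 'https://112.85.164.220:9999',
}
res = requests.get('http://httpbin.org/get', proxies=proxies, timeout=5).json()
# 'origin' is the IP the server saw; for a transparent proxy the real IP
# may also show up in forwarding headers such as X-Forwarded-For
print(res['origin'])
print(res['headers'])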

4. Implementation method

4.1 Ordinary proxy: approach

【1】Find a proxy IP website
   Xici Proxy, Kuai Proxy, Quanwang Proxy, Daili Jingling, Abuyun, Zhima Proxy ...
【2】Parameter format
   proxies = {'protocol': 'protocol://IP:port'}
   proxies = {
       'http': 'http://IP:port',
       'https': 'https://IP:port',
   }

4.2 Ordinary proxy: sample code

# Visit the test site http://httpbin.org/get through a free ordinary proxy IP
import requests

url = 'http://httpbin.org/get'
headers = {'User-Agent': 'Mozilla/5.0'}
# Define the proxy; look up a free proxy IP on one of the proxy websites
proxies = {
    'http': 'http://112.85.164.220:9999',
    'https': 'https://112.85.164.220:9999',
}
html = requests.get(url, proxies=proxies, headers=headers, timeout=5).text
print(html)
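Free proxies die quickly, so in practice it helps to wrap the request in try/except and move on when one fails. A short sketch using the exception classes that requests raises for proxy and timeout failures:

import requests

url = 'http://httpbin.org/get'
headers = {'User-Agent': 'Mozilla/5.0'}
proxies = {
    'http': 'http://112.85.164.220:9999',
    'https': 'https://112.85.164.220:9999',
}
try:
    html = requests.get(url, proxies=proxies, headers=headers, timeout=5).text
    print(html)
except (requests.exceptions.ProxyError, requests.exceptions.Timeout):
    # This proxy is dead or too slow; discard it and try the next one
    print('proxy failed, try another one')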

4.3 Private proxy + exclusive proxy

【1】Syntax
   proxies = {'protocol': 'protocol://username:password@IP:port'}
【2】Example
   proxies = {
       'http': 'http://username:password@IP:port',
       'https': 'https://username:password@IP:port',
   }

4.4 Private proxy + exclusive proxy: sample code

import requests

url = 'http://httpbin.org/get'
proxies = {
    'http': 'http://309435365:[email protected]:16816',
    'https': 'https://309435365:[email protected]:16816',
}
headers = {
    'User-Agent': 'Mozilla/5.0',
}

html = requests.get(url, proxies=proxies, headers=headers, timeout=5).text
print(html)
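One caveat with the username:password@IP:port form: if the username or password contains characters like @ or :, percent-encode them first, otherwise the URL cannot be parsed. A small sketch with made-up placeholder credentials, using urllib.parse.quote from the standard library:

from urllib.parse import quote

# Made-up credentials containing special characters (placeholders)
username = quote('my_user')
password = quote('p@ss:word')   # becomes 'p%40ss%3Aword'

proxies = {
    'http': 'http://{}:{}@IP:16816'.format(username, password),
    'https': 'https://{}:{}@IP:16816'.format(username, password),
}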

4.5 Build your own proxy IP pool (open proxy | private proxy)

"""
Build a proxy IP pool from open (free) proxies.
Approach:
    1. Fetch a batch of open proxies from a proxy website's API
    2. Test each proxy IP in turn and save the usable ones to a file
"""
import requests

class ProxyPool:
    def __init__(self):
        self.url = 'API URL of the proxy website'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'}
        # Open a file to store the usable proxy IPs
        self.f = open('proxy.txt', 'w')

    def get_html(self):
        html = requests.get(url=self.url, headers=self.headers).text
        proxy_list = html.split('\r\n')
        for proxy in proxy_list:
            # Test each proxy IP in turn for usability
            if self.check_proxy(proxy):
                self.f.write(proxy + '\n')

    def check_proxy(self, proxy):
        """Test one proxy IP; return True if it works, otherwise False"""
        test_url = 'http://httpbin.org/get'
        proxies = {
            'http' : 'http://{}'.format(proxy),
            'https': 'https://{}'.format(proxy)
        }
        try:
            res = requests.get(url=test_url, proxies=proxies, headers=self.headers, timeout=2)
            if res.status_code == 200:
                print(proxy, '\033[31musable\033[0m')
                return True
            else:
                print(proxy, 'invalid')
                return False
        except Exception:
            print(proxy, 'invalid')
            return False

    def run(self):
        self.get_html()
        # Close the file
        self.f.close()

if __name__ == '__main__':
    spider = ProxyPool()
    spider.run()
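Once proxy.txt has been written, a crawler can load it and pick a random proxy per request. A minimal sketch:

import random
import requests

# Load the usable proxies saved by ProxyPool
with open('proxy.txt') as f:
    proxy_list = [line.strip() for line in f if line.strip()]

proxy = random.choice(proxy_list)
proxies = {'http': 'http://' + proxy, 'https': 'https://' + proxy}
html = requests.get('http://httpbin.org/get', proxies=proxies, timeout=5).text
print(html)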

Two. requests.post() parameters

1. Applicable scenarios

  • Websites that require POST requests
  • Inspect the request data via the browser's F12 developer tools or a Fiddler capture

2. Parameter type

  • response = requests.post(url,data=data,headers=headers)
  • data: the POST data (the Form Data, as a dictionary)
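As a quick illustration, http://httpbin.org/post echoes back the form data it receives, so it makes a convenient test target:

import requests

url = 'http://httpbin.org/post'
data = {'key1': 'value1', 'key2': 'value2'}
headers = {'User-Agent': 'Mozilla/5.0'}

res = requests.post(url, data=data, headers=headers)
# httpbin echoes the submitted form data under the 'form' key
print(res.json()['form'])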

3. Features of the POST request method

  • GET request: parameters are carried in the URL, visible in the address bar
  • POST request: data is submitted through a form, in the request body

4. The difference between GET and POST

In short: a GET request carries its parameters in the URL's query string (visible and length-limited), while a POST request carries them in the request body, which keeps them out of the URL and suits form submissions and larger payloads.

Three. Cracking Youdao Translate

1. Goal

Crack the Youdao Translate interface and scrape the translation results.
# Sample output
Enter the word to translate: code
Translation: 代码
**************************
Enter the word to translate: 警告
Translation: warn

2. Implementation steps

1. Open the browser's network capture with F12 (Network - All), translate a word on the page, and find the Form Data
2. Translate a few more words on the page and observe how the Form Data changes (some fields are encrypted strings)
3. Refresh the Youdao Translate page, then capture and analyze the JS code (the encryption happens in local JS)
4. Find the JS encryption algorithm and reproduce it in Python to generate the encrypted fields
5. Assemble the Form Data as a dictionary and send it via the data parameter of requests.post()

3. Concrete implementation

  • 1. Open F12, capture the request, and find the Form Data as follows:
i: 喵喵叫
from: AUTO
to: AUTO
smartresult: dict
client: fanyideskweb
salt: 15614112641250
sign: 94008208919faa19bd531acde36aac5d
ts: 1561411264125
bv: f4d62a2579ebb44874d7ef93ba47e822
doctype: json
version: 2.1
keyfrom: fanyi.web
action: FY_BY_REALTlME
  • 2. Translate a few more words on the page and observe which Form Data fields change
salt: 15614112641250
sign: 94008208919faa19bd531acde36aac5d
ts: 1561411264125
bv: f4d62a2579ebb44874d7ef93ba47e822
# but the value of bv stays the same
  • 3. The encryption is usually done in a local JS file; refresh the page, then find and analyze the JS code
【Method 1】Network - JS tab - search for the keyword salt
【Method 2】Console, top-right corner - Search - search for salt - open the file - pretty-print it

【Result】The relevant JS file is fanyi.min.js
  • 4. Open the JS file, analyze the encryption algorithm, and reproduce it in Python
【ts】Analysis shows it is a 13-digit millisecond timestamp, as a string
   JS implementation)  "" + (new Date).getTime()
   Python implementation) str(int(time.time()*1000))

【salt】
   JS implementation)  ts + parseInt(10 * Math.random(), 10);
   Python implementation)  ts + str(random.randint(0,9))

【sign】(set a breakpoint and inspect e: it turns out to be the word being translated)
   JS implementation) n.md5("fanyideskweb" + e + salt + "n%A-rKaT5fb[Gy?;N5@Tj")
   Python implementation)
   from hashlib import md5
   string = "fanyideskweb" + e + salt + "n%A-rKaT5fb[Gy?;N5@Tj"
   s = md5()
   s.update(string.encode())
   sign = s.hexdigest()
  • 5. Use PyCharm's regex replace to turn the captured headers and Form Data into dictionaries, as illustrated below
【1】In PyCharm press Ctrl + R and tick Regex
【2】Process the headers and Form Data with
    find:    (.*): (.*)
    replace: "$1": "$2",
【3】Click Replace All
  • 6. Code implementation
import requests
import time
import random
from hashlib import md5

class YdSpider(object):
  def __init__(self):
    # The url must be the Request URL captured in F12: Headers -> General -> Request URL
    self.url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
    self.headers = {
      # The 3 headers the server checks most often
      "Cookie": "[email protected]; OUTFOX_SEARCH_USER_ID_NCOO=570559528.1224236; _ntes_nnid=96bc13a2f5ce64962adfd6a278467214,1551873108952; JSESSIONID=aaae9i7plXPlKaJH_gkYw; td_cookie=18446744072941336803; SESSION_FROM_COOKIE=unknown; ___rl__test__cookies=1565689460872",
      "Referer": "http://fanyi.youdao.com/",
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
    }

  # Generate salt, sign and ts
  def get_salt_sign_ts(self, word):
    # ts: 13-digit millisecond timestamp, as a string
    ts = str(int(time.time()*1000))
    # salt: ts plus one random digit
    salt = ts + str(random.randint(0,9))
    # sign: md5 of the fixed prefix + word + salt + fixed suffix
    string = "fanyideskweb" + word + salt + "n%A-rKaT5fb[Gy?;N5@Tj"
    s = md5()
    s.update(string.encode())
    sign = s.hexdigest()

    return salt, sign, ts

  # Build the form data and send the request
  def attack_yd(self, word):
    # 1. Get salt, sign and ts first
    salt, sign, ts = self.get_salt_sign_ts(word)
    # 2. Define the form data as a dictionary: data={}
    #    (the server validates salt and sign)
    data = {
      "i": word,
      "from": "AUTO",
      "to": "AUTO",
      "smartresult": "dict",
      "client": "fanyideskweb",
      "salt": salt,
      "sign": sign,
      "ts": ts,
      "bv": "7e3150ecbdf9de52dc355751b074cf60",
      "doctype": "json",
      "version": "2.1",
      "keyfrom": "fanyi.web",
      "action": "FY_BY_REALTlME",
    }
    # 3. Send the request: requests.post(url, data=data, headers=xxx)
    html = requests.post(
      url=self.url,
      data=data,
      headers=self.headers
    ).json()
    # .json() converts the JSON string into Python data types
    result = html['translateResult'][0][0]['tgt']

    print(result)

  # Entry point
  def run(self):
    # Read the word to translate
    word = input('Enter the word to translate: ')
    self.attack_yd(word)

if __name__ == '__main__':
  spider = YdSpider()
  spider.run()
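A closing note on the hard-coded bv field: in many analyses of fanyi.min.js, bv is computed as the MD5 of navigator.appVersion, which is the User-Agent string without its leading "Mozilla/" prefix. Under that assumption (not verified here), bv can be regenerated to stay consistent with whatever User-Agent the spider sends:

from hashlib import md5

# Assumption: bv = md5(navigator.appVersion), where appVersion is the
# User-Agent string minus the leading "Mozilla/" prefix
ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
app_version = ua.replace('Mozilla/', '', 1)
bv = md5(app_version.encode()).hexdigest()
print(bv)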


Origin: blog.csdn.net/weixin_38640052/article/details/108164057