Python crawler agent

Disclaimer: This article is for reference only and may not be reproduced or copied. If any reader of this article is involved in a violation of national laws and regulations, all consequences shall be borne by that reader and have nothing to do with the author. Likewise, any disputes or consequences arising from a reader's reprinting or copying of this article in violation of national laws and regulations shall be borne by that party and are unrelated to the author.

1. Basic knowledge

Proxy : a proxy server (proxy IP), used to get around the IP-blocking anti-crawling mechanism.

What a proxy does :

  1. Breaks through the access restrictions placed on your own IP.
  2. Hides your real IP.

Proxy websites :

  1. Kuaidaili
  2. Xici Proxy
  3. www.goubanjia.com

Types of proxy IP :

  1. http : used for URLs served over the http protocol.
  2. https : used for URLs served over the https protocol.
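
In requests, this distinction shows up in the proxies argument: it is a mapping from URL scheme to proxy address, and the library picks the entry whose key matches the scheme of the target URL. A minimal sketch of that lookup (the 127.0.0.1:8080 address is a placeholder, not a real proxy):

```python
from urllib.parse import urlparse

# requests selects a proxy by matching the target URL's scheme
# against the keys of the proxies mapping.
proxies = {
    "http": "http://127.0.0.1:8080",   # used for http:// URLs
    "https": "http://127.0.0.1:8080",  # used for https:// URLs
}

def proxy_for(url):
    """Return the proxy entry requests would pick for this URL."""
    return proxies.get(urlparse(url).scheme)

print(proxy_for("http://2021.ip138.com/"))   # the "http" entry
print(proxy_for("https://example.com/"))     # the "https" entry
```

A URL whose scheme has no entry in the mapping is simply fetched without a proxy.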

Anonymity levels of proxy IP :

  1. Transparent : the server knows a proxy is being used, and also knows the real IP of the requester.
  2. Anonymous : the server knows a proxy is being used, but does not know the real IP of the requester.
  3. Elite (high anonymity) : the server neither knows a proxy is being used nor knows the real IP of the requester.
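
These three levels can be told apart by which headers reach the target server. A heuristic sketch of that classification (header conventions vary between proxy implementations, so treat this as illustrative, not authoritative):

```python
def classify_proxy(received_headers):
    """Rough anonymity classification based on the headers that reach
    the target server. Heuristic only: real proxies differ in which
    headers they add or strip."""
    if "X-Forwarded-For" in received_headers:
        return "transparent"  # the real client IP leaks through
    if "Via" in received_headers or "Proxy-Connection" in received_headers:
        return "anonymous"    # the proxy is visible, but the real IP is hidden
    return "elite"            # no obvious trace of a proxy at all

print(classify_proxy({"X-Forwarded-For": "1.2.3.4"}))
print(classify_proxy({"Via": "1.1 squid"}))
print(classify_proxy({}))
```

This is why elite proxies are preferred for crawling: from the server's point of view the request looks like a direct one.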

2. Examples

Using a proxy only requires adding one parameter to the request: proxies.
The example below uses a proxy running on a local virtual machine.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

# Imports
import requests
from lxml import etree

if __name__ == '__main__':
    # Target URL (an IP-echo page) and request headers
    url = "http://2021.ip138.com/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0",
        'Connection': 'close'
    }
    # Request without a proxy, then parse out the IP the page reports
    text = requests.get(url=url, headers=headers, timeout=15).text
    html = etree.HTML(text, etree.HTMLParser(encoding="utf-8"))
    xpath = html.xpath("//a//text()")[0]
    print(xpath)

    # Add the proxy servers
    proxies = {
        "http": "http://192.168.19.131:8080/",
        "https": "https://123.169.98.82:9999"
    }
    # Request again through the proxies and parse the reported IP
    response = requests.get(url=url, headers=headers, proxies=proxies, timeout=30)
    response_text = response.text
    html_proxies = etree.HTML(response_text, etree.HTMLParser(encoding="utf-8"))
    xpath_proxies = html_proxies.xpath("//a/text()")[0]
    print(xpath_proxies)
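
Free proxies like the ones above fail often, so in practice it helps to wrap the request in a loop that falls back to the next candidate on any error. A sketch under that assumption (`fetch_via_proxies` is a hypothetical helper written for illustration, not part of requests):

```python
import requests

def fetch_via_proxies(url, proxy_list, headers=None, timeout=10):
    """Try each proxy address in turn and return the first successful
    response, or None if every proxy fails. Public proxies are flaky,
    so any request error just moves on to the next candidate."""
    for proxy in proxy_list:
        proxies = {"http": proxy, "https": proxy}
        try:
            resp = requests.get(url, headers=headers,
                                proxies=proxies, timeout=timeout)
            resp.raise_for_status()  # treat HTTP errors as failures too
            return resp
        except requests.RequestException:
            continue  # proxy dead or blocked; try the next one
    return None
```

Called with a list of candidate proxy addresses, it degrades gracefully instead of raising on the first dead proxy.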


Origin blog.csdn.net/YKenan/article/details/111997820