Python Crawler 1.2 - urllib Advanced Usage Tutorial

Overview

This series of articles is a simple tutorial on Python crawler techniques, written mainly to explain and consolidate my own knowledge; if it happens to be useful to you, so much the better.
Python version is 3.7.4

The previous article covered the basics of urllib; this article introduces slightly more advanced usage.

Many websites require certain request headers to be set before they will serve a request. To add headers to a request, we must use the urllib.request.Request class, for example to add a User-Agent or a Referer header.

Crawlers generally use the following strategies to avoid being blocked:

  1. Dynamically set the request headers (User-Agent): randomly switch the User-Agent to simulate different browsers (a short sketch follows this list)
  2. Use an IP pool: VPNs and proxy IPs; most sites nowadays ban by IP
  3. Cookies
  4. Set a download delay (to avoid requesting too frequently; 2 seconds or more is recommended). Keep in mind that the point of a crawler is to get the data; the delay uses the time.sleep() method, which is not explained here
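
As a quick illustration of strategies 1 and 4, here is a minimal sketch (not from the original article): it rotates the User-Agent at random and pauses between requests. The User-Agent strings, the httpbin.org URL and the 2-second delay are illustrative assumptions.

    # Minimal sketch: rotate the User-Agent (strategy 1) and delay between
    # requests (strategy 4); the UA strings and target URL are only examples
    import random
    import time
    import urllib.request
    
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15',
    ]
    
    for _ in range(3):
        headers = {'User-Agent': random.choice(user_agents)}  # random User-Agent per request
        req = urllib.request.Request('http://httpbin.org/get', headers=headers)
        with urllib.request.urlopen(req) as res:
            print(res.status)
        time.sleep(2)  # wait at least 2 seconds before the next request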

Setting request header (urllib.request.Request)

urllib.request.Request is the class in urllib used to construct an HTTP request object.
Commonly used methods of the Request class:

  • Request.add_data(data): set the data parameter; if data was not supplied when the request was created, this method can be used to add it;
  • Request.get_method(): return the HTTP method of the request, generally POST or GET;
  • Request.has_data(): check whether a data parameter is present;
  • Request.get_data(): get the data parameter;
  • Request.add_header(key, val): add a header, where key is the header field name and val is its value;
  • Request.get_full_url(): get the complete request URL;
  • Request.get_host(): return the host (primary domain) of the requested URL;
  • Request.set_proxy(host, type): set a proxy, where the first parameter is the proxy's IP and port and the second is the proxy type (http/https).

  Note that add_data(), has_data(), get_data() and get_host() are legacy urllib2-style methods that were removed in Python 3.4; in Python 3 use the data and host attributes of the Request object instead.
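
For reference, a tiny sketch (not from the original article) exercising the methods above that remain available in Python 3; the httpbin.org URL and the payload are just placeholders:

    # Demonstrate the Request methods that still exist in Python 3
    import urllib.request
    
    req = urllib.request.Request('http://httpbin.org/post', data=b'kw=honey')
    req.add_header('User-Agent', 'Mozilla/5.0')   # add a header after construction
    print(req.get_method())      # 'POST', because a data payload is present
    print(req.get_full_url())    # 'http://httpbin.org/post'
    print(req.data)              # in Python 3, read/set the body via the data attribute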
  1. Sample code
    # Import urllib modules
    import urllib.parse
    import urllib.request
    
    # Declare and define the request headers
    headers = {
        # Any header you need to send can be added to this dict
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
    }
    
    # Send a request to the specified URL and get the response back
    post_url = 'https://fanyi.baidu.com/sug'
    # Parameters to pass
    form_data = {
        'kw': 'honey'
    }
    # Encode the parameters
    form_data = urllib.parse.urlencode(form_data).encode()
    # Create the Request object
    req = urllib.request.Request(url=post_url, headers=headers, data=form_data)
    
    # Send the request and print the result
    ret = urllib.request.urlopen(req)
    print(ret.read())
  2. Crawler example

Crawling Python job listings from Lagou (拉勾网). Because Lagou has anti-crawling measures, requests to its API must carry a Cookie, so I copied the request headers directly from the browser.

    # Import urllib modules
    import urllib.parse
    import urllib.request
    
    # Declare and define the request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
        'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
        'Host': 'www.lagou.com',
        'Origin': 'https://www.lagou.com',
        'X-Requested-With': 'XMLHttpRequest',
        'Cookie': '_ga=GA1.2.1158944803.1554684888; user_trace_token=20190408085447-e8216b55-5998-11e9-8cbc-5254005c3644; LGUID=20190408085447-e8216df3-5998-11e9-8cbc-5254005c3644; JSESSIONID=ABAAABAAAFCAAEG89414A0A463BB593A6FCB8B25161B297; WEBTJ-ID=20190803154150-16c566d76a7a71-0d0261dbc1a413-a7f1a3e-2073600-16c566d76a8664; _gid=GA1.2.646413362.1564818110; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1564818110; LGSID=20190803154149-2746a012-b5c2-11e9-8700-525400f775ce; PRE_UTM=; PRE_HOST=www.baidu.com; PRE_SITE=https%3A%2F%2Fwww.baidu.com%2F; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; index_location_city=%E5%85%A8%E5%9B%BD; X_HTTP_TOKEN=7b450c20fc1c8ebb1028184651d95c44e86182119a; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1564818202; _gat=1; LGRID=20190803154322-5e321a85-b5c2-11e9-8700-525400f775ce; TG-TRACK-CODE=index_search; SEARCH_ID=f2eeeba9273a435281d59597e5b8b7ba',
    }
    # Lagou API address
    url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
    # API parameters
    data = {
        'first': 'true',
        'pn': '1',
        'kd': 'python'
    }
    form_data = urllib.parse.urlencode(data).encode()
    # Declare and define the Request object
    req = urllib.request.Request(url=url, headers=headers, data=form_data)
    # Request the API and print the result
    res = urllib.request.urlopen(req)
    print(res.read().decode())
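
The Lagou interface above returns JSON, so instead of printing the raw text the body can be parsed; a small follow-on sketch (the exact structure of the response is assumed, not verified here):

    # Re-issue the request and parse the JSON body with the standard json module
    import json
    
    res = urllib.request.urlopen(req)
    result = json.loads(res.read().decode())
    print(result)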

Using a proxy (urllib.request.ProxyHandler)

Many websites detect how often a given IP visits within a certain time window (via traffic statistics, system logs, etc.). If the number of visits looks abnormal compared with a normal user, the IP will be banned. We can therefore set up proxy servers and switch proxies from time to time, so that even if one IP is banned we can switch to another and continue crawling.

  1. Basic principle

    A proxy server fetches network information on behalf of network users; figuratively speaking, it is a relay station for network information. Normally, when we request a website, the request goes to the web server and the web server sends its response back to us. If a proxy server is set up, it acts as a bridge between our machine and the web server: the machine no longer sends the request directly to the web server but to the proxy server, which forwards it to the web server and then relays the web server's response back to our machine. We can still access the same pages, but during this process the web server no longer sees our machine's real IP, so the IP masquerading succeeds. This is the basic principle of a proxy.

  2. Benefits

    • Break through your own IP access restrictions and visit sites that would otherwise be inaccessible;
    • Improve access speed;
    • Hide your real IP.
  3. Commonly used proxy sites

    • Xici Proxy (西刺代理): https://www.xicidaili.com/
    • Kuaidaili (快代理): https://www.kuaidaili.com/
    • Yun Proxy (云代理): http://www.ip3366.net/
  4. Usage Example

        # Import the required modules
        import random
        import urllib.request
        
        # Declare and define the list of proxy servers
        proxy_list = [
            {"http": "220.184.144.80:8060"},
            {"http": "180.175.170.210:8060"},
            {"http": "116.226.28.17:8060"},
            {"http": "123.123.137.72:8060"},
            {"http": "116.226.31.161:8060"}
        ]
        # Randomly choose one proxy
        proxy = random.choice(proxy_list)
        
        # Build a proxy handler object from the chosen proxy
        http_proxy_handler = urllib.request.ProxyHandler(proxy)
        # Create a custom opener object with urllib.request.build_opener()
        opener = urllib.request.build_opener(http_proxy_handler)
        # Create the Request object
        url = 'http://www.baidu.com/s?ie=UTF-8&wd=ip'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
        }
        req = urllib.request.Request(url=url, headers=headers)
        # Only requests sent with opener.open() use the custom proxy;
        # urllib.request.urlopen() does not (see the install_opener() sketch below)
        res = opener.open(req)
        print(res.read().decode())
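
To apply the proxy globally, so that plain urllib.request.urlopen() also routes through it, the opener can be installed; a short follow-on sketch (continuing from the code above) using urllib.request.install_opener():

        # Follow-on: install the opener globally so urlopen() also uses the proxy
        urllib.request.install_opener(opener)
        res = urllib.request.urlopen(req)
        print(res.read().decode())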
    

Cookie (urllib.request.HTTPCookieProcessor)

  1. What is a cookie

    An HTTP request is stateless, which means that even after the browser has connected to the server and logged in successfully, on the second request the server still does not know which user is making it. Cookies were introduced to solve this: after the first login the server returns some data (the cookie) to the browser, which stores it locally; on the next request the browser automatically carries the stored cookie data to the server, and from that data the server can determine which user the current request belongs to. Cookie storage is limited and varies between browsers, but is generally no more than 4 KB, so cookies can only hold a small amount of data. (This explanation is deliberately brief; see the Wikipedia article on HTTP cookies for a detailed description.)

  2. Cookie format

    Set-Cookie: NAME=VALUE; Expires=DATE; Domain=DOMAIN_NAME; Path=PATH; SECURE
    
    • NAME=VALUE: the one mandatory part of every cookie. NAME is the cookie's name and VALUE is its value; the "NAME=VALUE" string must not contain semicolons, commas, or spaces;
    • Expires=DATE: determines when the cookie expires. If omitted, the cookie is not saved to the user's hard drive but kept only in memory, and it disappears when the browser is closed;
    • Domain=DOMAIN-NAME: determines which domain's web server the browser may send the cookie to, i.e. only pages from this domain can use the cookie. This setting is optional; if omitted, it defaults to the domain name of the web server that set the cookie;
    • Path=PATH: defines which paths on the web server can access the cookie set by the server;
    • SECURE: if this flag is present, the browser only submits the cookie to the server over an encrypted protocol, which at present means HTTPS.

    A short sketch that parses a Set-Cookie style line follows this list.
  3. http.cookiejar

    The http.cookiejar module handles HTTP cookies on the client side: it can receive cookie data from the server and send it back to the server on later requests. It provides several classes that handle HTTP cookies automatically; the most commonly used include CookieJar, MozillaCookieJar and Cookie.
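
The sketch promised under item 2: parsing a Set-Cookie style line with the standard http.cookies module (this module is not used in the original article, and the cookie values here are made up):

    # Parse a Set-Cookie style string into its attributes with http.cookies
    from http.cookies import SimpleCookie
    
    c = SimpleCookie()
    c.load('SESSION=abc123; Domain=example.com; Path=/; Secure')
    for name, morsel in c.items():
        print(name, morsel.value, morsel['domain'], morsel['path'])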

Usage example (logging into Renren (人人网) with http.cookiejar and urllib.request.HTTPCookieProcessor):


    # Import the required modules
    import http.cookiejar
    import urllib.parse
    import urllib.request
    
    # Simulate a real browser: after the POST request is sent, the cookie is kept in the program
    # Create a CookieJar object
    cj = http.cookiejar.CookieJar()
    # Create a handler from the cookiejar
    handler = urllib.request.HTTPCookieProcessor(cj)
    # Create an opener from the handler
    opener = urllib.request.build_opener(handler)
    # Renren login URL
    post_url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=2019621044248'
    form_data = {
        'email': '188****7357',  # Renren account (masked in the original)
        'icode': '',
        'origURL': 'http://www.renren.com/home',
        'domain': 'renren.com',
        'key_id': '1',
        'captcha_type': 'web_login',
        'password': '01cb55635986f56265d3b55aaddaa79337d094cb56d6cf7724343a93ad586fe7',
        'rkey': 'd5ff51375d8eb17a011cad5622d835fd',
        'f': 'http%3A%2F%2Fwww.renren.com%2F971686685%2Fprofile'
    }
    # Declare and define the request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
    }
    # Create the Request object and encode the parameters
    req = urllib.request.Request(url=post_url, headers=headers)
    form_data = urllib.parse.urlencode(form_data).encode()
    # Make the login request
    res = opener.open(req, data=form_data)
    print(res.read().decode())
    print('*' * 50)
    
    # Renren personal profile URL
    get_url = 'http://www.renren.com/971686685/profile'
    # Create the Request object
    req1 = urllib.request.Request(url=get_url, headers=headers)
    # Make the request (cookies are carried automatically) and print the result
    res1 = opener.open(req1)
    print(res1.read().decode())
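
As a small follow-on (not in the original code), the CookieJar now holds the session cookies returned by the login response, and they can be inspected directly:

    # Inspect the cookies captured by the CookieJar after the login request
    for cookie in cj:
        print(cookie.name, cookie.value)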

  4. Saving cookie information
    # Import the required modules
    import urllib.request
    from http.cookiejar import MozillaCookieJar
    
    # Declare the cookiejar
    cj = MozillaCookieJar('cookie.txt')
    # Create the handler and opener
    handler = urllib.request.HTTPCookieProcessor(cj)
    opener = urllib.request.build_opener(handler)
    # Declare the URL and make the request
    url = 'http://httpbin.org/cookies/set?wei=weizhihua'
    res = opener.open(url)
    # Save the cookies; ignore_discard=True also saves cookies that would otherwise be discarded (e.g. session cookies)
    # If no file path was given when the cookiejar was declared, it must be passed to save()
    cj.save(ignore_discard=True)
  5. Loading cookie information
    import urllib.request
    from http.cookiejar import MozillaCookieJar
    
    # Declare the cookiejar
    cj = MozillaCookieJar('cookie.txt')
    # ignore_discard=True also loads cookies that would otherwise be discarded
    cj.load(ignore_discard=True)
    # Print the contents
    for cookie in cj:
        print(cookie)
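
A possible follow-on sketch (not in the original): attach the loaded cookiejar to an opener so that later requests carry the saved cookies; the httpbin.org URL is just a placeholder:

    # Reuse the loaded cookies by building an opener around the cookiejar
    handler = urllib.request.HTTPCookieProcessor(cj)
    opener = urllib.request.build_opener(handler)
    res = opener.open('http://httpbin.org/cookies')
    print(res.read().decode())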
