Commercial web crawler study notes - Day 2

 

1. Passing GET parameters

(1) Handling the error when the URL contains Chinese characters

urllib.request.quote("URL containing Chinese characters", safe=string.printable)
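
A minimal sketch of this call (the Baidu search URL with the Chinese word 中文 is assumed for illustration):

import urllib.request
import string

# Hypothetical URL containing Chinese characters
url = "http://www.baidu.com/s?wd=中文"
# safe=string.printable keeps ordinary ASCII characters untouched and only percent-encodes the rest
encoded_url = urllib.request.quote(url, safe=string.printable)
print(encoded_url)  # http://www.baidu.com/s?wd=%E4%B8%AD%E6%96%87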

(2) Passing parameters as a dictionary

The final URL is the base URL with the parameters appended, but the parameters (params) are a dictionary; the dictionary must be converted into a string before it can be spliced onto the URL, as follows:

import urllib.request
import urllib.parse
import string

def get_params():
    url = "http://www.baidu.com/s?"

    params = {
        "wd": "中文",
        "key": "zhang",
        "value": "san"
    }
    str_params = urllib.parse.urlencode(params)
    print(str_params)  # prints: wd=%E4%B8%AD%E6%96%87&key=zhang&value=san
    final_url = url + str_params
    # Translate the URL containing Chinese into a form the computer can understand
    end_url = urllib.parse.quote(final_url, safe=string.printable)
    print(end_url)
    response = urllib.request.urlopen(end_url)
    data = response.read().decode("utf-8")
    print(data)

get_params()

2. POST requests

urllib.request.urlopen(url, data=...)  # data is the payload sent to the server (must be bytes)
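
A minimal sketch of a POST request: the dictionary is URL-encoded and then encoded to bytes before being passed to urlopen (httpbin.org/post is an assumed test endpoint, and the form fields are made up for illustration):

import urllib.request
import urllib.parse

# Assumed test endpoint that echoes back what it receives
url = "http://httpbin.org/post"
form_data = {"key": "zhang", "value": "san"}
# urlencode the dict, then encode it to bytes as required by urlopen
data = urllib.parse.urlencode(form_data).encode("utf-8")
# Passing data makes urlopen send a POST request instead of a GET
response = urllib.request.urlopen(url, data=data)
print(response.read().decode("utf-8"))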

3. Request headers:

Looking at the source of urllib.request.urlopen(), this method has no headers attribute, so we have to define a Request object ourselves; that object does have a headers attribute.

request = urllib.request.Request(url)
request_headers = request.headers  # get the request header information

(1) Create a request object: urllib.request.Request(url)

(2) Add a User-Agent: simulate a request sent by a real browser (if many requests are sent in a short time with the same User-Agent, the target server can tell it is a crawler, so we can define a pool of User-Agents and pick one at random from it)

import urllib.request
import random

def load_baidu():

    url = "http://www.baidu.com"
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50"

    ]
    # Each browser sends a different User-Agent
    random_user_agent = random.choice(user_agent_list)
    request = urllib.request.Request(url)
    # Add the request header (User-Agent)
    request.add_header("User-Agent", random_user_agent)

    # Request the data
    response = urllib.request.urlopen(request)
    # Print the User-Agent from the request headers
    print(request.get_header("User-agent"))

load_baidu()

 

(3) request.add_header() dynamically adds header information

(4) Response headers: response.headers

Code

import urllib.request

def load_baidu():
    url = "http://www.baidu.com"
    header = {
        # Browser version
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
        "haha": "hehe"  # invalid: there is no such field in a request header; printing the request headers later shows it is not there
    }
    # Create the request object
    request = urllib.request.Request(url)
    # print(request)
    # print(request.headers)
    # Dynamically add header information
    request.add_header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36")
    # Request the network data (the header is not passed here because urlopen does not take that parameter)
    response = urllib.request.urlopen(request)
    data = response.read().decode("utf-8")

    # Get the complete URL
    final_url = request.get_full_url()
    print(final_url)
    # Get the request header information (all headers): first method
    request_headers = request.headers
    print(request_headers)
    # Second method; note: only the first letter is capitalized, all other letters are lower case
    request_headers = request.get_header("User-agent")
    print(request_headers)
    with open("02header.html", "w", encoding="utf-8") as f:
        f.write(data)

load_baidu()
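
The block above prints the request headers; for the response headers from point (4), a minimal sketch (reusing the same Baidu URL) could be:

import urllib.request

response = urllib.request.urlopen("http://www.baidu.com")
# All response headers as a list of (name, value) tuples
print(response.getheaders())
# A single response header
print(response.getheader("Content-Type"))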

 

4. IP proxies:

When many requests are sent from the same IP in a short time, the target server can tell it is a crawler, so we have to send requests from different IPs; this is where IP proxies come in.

(1) Free IPs: short-lived and error-prone

          Paid IPs: cost money, and some of them may still fail and be unusable

(2) IP classification:

Transparent: the target server knows our real IP

Anonymous: the target server does not know our real IP, but it knows we are using a proxy

Highly anonymous: the target server neither knows our real IP nor knows that we are using a proxy

5. Handlers (still not entirely clear to me)

(1) The urllib.request.urlopen() method has no way to add a proxy; we have to build that functionality ourselves, which brings us to handlers.

Code:

import urllib.request

def handler_openner():
    # The built-in urlopen does not provide proxy functionality, so we need to build it ourselves
    # urlopen can request data because it uses handler (processor) objects internally
    # Use our own opener to request the data
    # urllib.request.urlopen()
    url = "https://blog.csdn.net/m0_37499059/article/details/79003731"
    # Create our own handler
    handler = urllib.request.HTTPHandler()
    # Create our own opener
    opener = urllib.request.build_opener(handler)
    # Call the opener's open method to request the data
    response = opener.open(url)
    # data = response.read()
    data = response.read().decode("utf-8")
    with open("02header.html", "w", encoding="utf-8") as f:
        f.write(data)

handler_openner()

(2) Create the corresponding handler

1. Proxy handler: ProxyHandler

2. Use the ProxyHandler to create an opener

3. opener.open(url) can then request the data

The specific code is as follows:

import urllib.request

def proxy_user():

    proxy_list = [
        {"https":""},
        # {"https":"106.75.226.36:808"},
        # {"https":"61.135.217.7:80"},
        # {"https":"125.70.13.77:8080"},
        # {"https":"118.190.95.35:9001"} 
    ]
    for proxy in proxy_list:
        print(proxy)
        # Create a handler from each proxy IP we iterate over
        proxy_handler = urllib.request.ProxyHandler(proxy)
        # Create the opener
        opener = urllib.request.build_opener(proxy_handler)

        try:
            data = opener.open("http://www.baidu.com", timeout=1)

            haha = data.read()
            print(haha)
        except Exception as e:
            print(e)


proxy_user()

 


Origin www.cnblogs.com/jj1106/p/11210408.html