Spider data mining - 3: understanding the requests library

Requests library (simpler than other HTTP libraries)

1. Introduction
Requests exposes get() and the other request methods as interfaces that can be called directly, which basically covers all the requirements of making web requests.

2. Initiating a request
Each HTTP method has a corresponding API: a GET request uses the get() method, a POST request uses the post() method. Pass the data to be submitted through the data parameter. A POST request sends form data through data and sends JSON data through the json parameter. json and data should not be used at the same time; if both are given, the json data is dropped. Sending data through json does not produce a form body, while sending it through data produces only a form body and no JSON (see the sketch below).
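A minimal sketch of the difference between data and json; httpbin.org is used only as an echo endpoint for illustration:

```python
import requests

# Form submission: the payload is sent as application/x-www-form-urlencoded
form_resp = requests.post("https://httpbin.org/post", data={"name": "jay", "age": "20"})

# JSON submission: the payload is serialized and sent as application/json
json_resp = requests.post("https://httpbin.org/post", json={"name": "jay", "age": "20"})

print(form_resp.json()["form"])  # the form data appears here
print(json_resp.json()["json"])  # the JSON data appears here
```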
Unlike urllib, requests does not require you to construct Request objects, openers, and handlers; you simply call the request method and pass in the required parameters.
URL parameters: construct a dictionary and pass it to the params parameter (it is URL-encoded automatically). If the same URL parameter name appears several times with different values, a Python dictionary cannot hold duplicate keys, so represent that key's value as a list instead (see the sketch below).
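A minimal sketch of params; the URL and parameter names are only illustrative:

```python
import requests

# The dict is URL-encoded automatically; a list value repeats the parameter name
params = {"wd": "python", "tag": ["spider", "requests"]}
res = requests.get("https://httpbin.org/get", params=params)

print(res.url)  # ...?wd=python&tag=spider&tag=requests
```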
Custom request headers: similarly, build a dictionary and pass it to the headers parameter.
Custom cookies no longer require constructing a CookieJar object; a dictionary can be passed directly to the cookies parameter.
If you do want a cookie jar, requests provides RequestsCookieJar, which can be used across sites and paths and has more complete functionality. First construct the object: jar = requests.cookies.RequestsCookieJar(). The object has a set() method: jar.set('cookie11', 'ai', domain='site', path='/cookies') (the path within the site). If the requested site and path do not match the cookie's domain and path, the cookie is not carried; if they do, the jar can hold cookies for multiple sites and paths at once (see the sketch below).
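A minimal sketch of RequestsCookieJar, with httpbin.org standing in for the target site; the cookie names, domain, and paths are only illustrative:

```python
import requests

jar = requests.cookies.RequestsCookieJar()
# A cookie is only carried when the request's domain and path match it
jar.set("cookie11", "ai", domain="httpbin.org", path="/cookies")
jar.set("cookie22", "bi", domain="httpbin.org", path="/elsewhere")

res = requests.get("https://httpbin.org/cookies", cookies=jar)
print(res.json())  # only cookie11 is sent; cookie22's path does not match
```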
When using a proxy, build a proxy dictionary as well and pass it to the proxies parameter; proxy addresses for the http and https protocols can be written at the same time.
To limit how long a request may take, set the timeout parameter (a plain number, not a dictionary). You will almost always want it: without it the program can hang forever, because requests imposes no time limit as long as the server has not responded. Requests follows redirects by default (see the sketch below).
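A minimal sketch of proxies, timeout, and redirects; the proxy address is only a placeholder:

```python
import requests

proxies = {
    "http": "http://127.0.0.1:8888",   # placeholder proxy for http:// URLs
    "https": "http://127.0.0.1:8888",  # placeholder proxy for https:// URLs
}

try:
    # timeout is in seconds; without it the call can block indefinitely
    res = requests.get("https://httpbin.org/get", proxies=proxies, timeout=5)
    print(res.status_code, res.url)    # redirects are followed by default
except requests.exceptions.RequestException as e:
    print("request failed:", e)
```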
Some websites require certificate verification. If an SSLError appears because verification fails, you can set verify=False. After verification is turned off, a warning is printed on every request; it can be silenced in code with requests.packages.urllib3.disable_warnings(), which reaches the urllib3 library through requests without importing urllib3 separately (see the sketch below).
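A minimal sketch of turning verification off; the URL is just a public test site used for illustration:

```python
import requests

# Silence the InsecureRequestWarning that verify=False would otherwise print
requests.packages.urllib3.disable_warnings()

# Skip certificate verification (only do this for sites you trust)
res = requests.get("https://self-signed.badssl.com/", verify=False)
print(res.status_code)
```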
3. Receiving the response: attributes of the response object
text (decoding already done): the body converted directly to a string instead of bytes. When text is not decoded correctly we need to specify the encoding format manually: requests guesses what the page's encoding is and decodes the page content with that guess, and sometimes the guess is wrong, so we decode manually.
Manually: res.encoding = 'utf-8' (gbk, ascii, or another format can be used instead; any later use of text will pick it up as well).
print(res.text) (once the encoding is defined on res, text immediately uses the encoding format from res.encoding; see the sketch below).
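A minimal sketch of overriding the guessed encoding; the URL is a placeholder:

```python
import requests

res = requests.get("https://example.com/")  # placeholder URL
print(res.encoding)       # the encoding requests guessed for this page

# If the guess is wrong and res.text shows garbled characters, override it
res.encoding = "utf-8"    # or "gbk", "ascii", ... depending on the page
print(res.text)           # text is now decoded with the encoding set above
```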
content (image data uses this attribute): the most original data, as bytes. When text cannot make sense of the body (it is not text) or comes out empty, content can be used instead.
raw: even more primitive data than content, taken from the socket-level response. Before using it you must set stream=True in the request; raw.read(10) reads 10 bytes.
json(): automatically converts the body into dictionary format (the response must actually be JSON data to use this method, otherwise it raises an error). Indexing with ['headers'] afterwards takes the value of that key, and ['ko']['ui'] takes the value of the ui key inside the ko key (a dictionary inside a dictionary).
cookies: view the cookie values in the response
request.url: view the requested URL
headers: get the response headers
request.headers: get the request headers
A sketch of these response attributes follows.
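A minimal sketch of the response attributes against httpbin.org; the URL and the JSON keys are only illustrative:

```python
import requests

res = requests.get("https://httpbin.org/get")
print(res.content[:20])         # raw bytes; suitable for images and other binary data
print(res.cookies)              # cookies carried by the response
print(res.request.url)          # the URL that was actually requested
print(res.headers)              # response headers
print(res.request.headers)      # request headers

data = res.json()               # the body must be JSON, otherwise this raises an error
print(data["headers"]["Host"])  # nested keys: a dictionary inside a dictionary

# raw needs stream=True and should be read before the body is consumed
raw_res = requests.get("https://httpbin.org/get", stream=True)
print(raw_res.raw.read(10))     # the first 10 bytes straight from the socket
```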
4. The session object
The session object is another way for the requests library to initiate requests; it automatically saves the cookie values obtained when visiting a page, so that the next visit automatically carries those cookies.
Classic login logic: the session automatically keeps the request header information up to date, which is commonly used for logging into an account. First visit the login page URL, then visit the URL the data is submitted to (that URL collects the data entered on the login page for submission).
Using a session is similar to using requests.get, except that a session object must be created first:
session = requests.session()  creates a session object
session.headers = headers  adds request headers directly to the session, so there is no longer any need to pass them as a parameter on each call
session.get(url)  the session object takes the place of the requests module and works better than it: the request headers are updated automatically, so no headers parameter is needed and the cookie does not have to be changed by hand.
The cookie of the page you first visit and the cookie of the login URL on that page are not the same, so the session object automatically switches to the correct Set-Cookie value when making the request.
The cookie the server gives is Cookie: ~~ (this is the content automatically updated by the session), and the cookie for requesting verification is Set-Cookie: ~~~ (a login sketch follows below).

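A minimal login sketch with a session; the URLs and form field names are hypothetical and do not come from the original post:

```python
import requests

headers = {"User-Agent": "Mozilla/5.0"}  # illustrative request header

session = requests.session()
session.headers = headers                 # headers set once on the session

# Visiting the login page lets the session store its cookies automatically
login_url = "https://example.com/login"   # hypothetical URL
session.get(login_url)

# The form is then POSTed to the submission URL; the session carries and
# updates the cookies by itself, with no manual cookie handling
data_url = "https://example.com/login/submit"       # hypothetical URL
payload = {"username": "user", "password": "pass"}  # hypothetical field names
res = session.post(data_url, data=payload)
print(res.status_code)
```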
Using requests to crawl multiple pages of Baidu images: step through 3 consecutive pages and compare their URLs, and you find the pn (page number) value follows a pattern, 30, 60, 90: the first crawl covers 30 images, the second a total of 60, then 90, i.e. 30 images per page over three pages.
https://cn.bing.com/images/search?q=%E6%9C%BA%E7%94%B2&form=HDRSC2&first=3&tsc=ImageHoverTitle (here first is the page value; change it inside a loop and multiple pages can be crawled).
After getting the image URLs, you still need to send a request to each URL inside the for loop to get the image data; otherwise you only get base64 garbled text (see the sketch below).
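A minimal paging sketch; the step of 30 follows the pattern described above, and the regex for pulling image URLs out of the page is a placeholder assumption rather than something from the original post:

```python
import re
import requests

headers = {"User-Agent": "Mozilla/5.0"}
base = ("https://cn.bing.com/images/search"
        "?q=%E6%9C%BA%E7%94%B2&form=HDRSC2&first={}&tsc=ImageHoverTitle")

for page in range(3):
    first = page * 30 + 1                  # page offset in the URL: 1, 31, 61
    res = requests.get(base.format(first), headers=headers, timeout=10)
    # Placeholder pattern: a real crawler needs a selector matched to the page's HTML
    img_urls = re.findall(r'"murl":"(.*?)"', res.text)
    for url in img_urls:
        img = requests.get(url, headers=headers, timeout=10)  # fetch the actual image
        # img.content now holds the binary image data, ready to write to a file
```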

Using a purchased proxy IP:
from proxy_dir~~~~~ import proxy ~~~~~
proxymeta = 'http~~%~~%~%' % {'host': proxyHost, '~': ~, ~~} (the username and password data go in here, and the user needs to complete real-name verification in advance)
proxy = {"http": proxymeta, "https": proxymeta}  this step is the same as with a proxy IP that does not need to be purchased (a sketch of the pattern follows).

anaconda: a tool for managing Python, its installed packages, and the relationships between packages; using some libraries is more convenient than with plain Python.
Install Anaconda and create an Anaconda virtual environment; remember to switch into that environment at the end.
When scrapy is used with the local Python it can be incompatible with the twisted package; compatibility is better when it is used inside an Anaconda environment.

Code that enters a secondary webpage often has to construct the URL parameters to pass in, because the secondary URL is generally just the initial site's URL with extra URL parameters (see the sketch below).
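A minimal sketch of that idea; the base URL and the id parameter are hypothetical:

```python
import requests

base_url = "https://example.com/list"   # hypothetical first-level page URL

# The secondary page uses the same URL plus extra parameters, so build
# those parameters as a dict and let requests encode them
for item_id in ["101", "102", "103"]:   # hypothetical ids collected from the first page
    res = requests.get(base_url, params={"detail_id": item_id}, timeout=10)
    print(res.url)                      # the base URL with the extra parameter appended
```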

Origin blog.csdn.net/qwe863226687/article/details/114116735