Spider data mining-----2. Understanding the urllib and urllib3 libraries

I. The urllib library (the official built-in standard library); Python 3 merges the first- and second-generation libraries (urllib and urllib2) into a single package
1. urllib.request: the request module (the core of a crawler's disguise). It is used to construct network requests, and lets you add headers, a proxy, and so on.
(1) Initiating a simple network request
The urlopen method: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None); only url is required.
url: can be a string or a Request object.
data: when data is None the request is a GET (parameters appear directly in the URL); when data is given as a byte string, e.g. data=b"…", the request becomes a POST and the form is not shown in the URL. GET is less secure than POST.
timeout: if no response arrives within this many seconds an error is raised; setting it too short makes requests fail that the program could otherwise complete.
*: every parameter after the * must be passed as a keyword argument.
cafile: the CA (certificate authority) is the organization that issues and notarizes certificates.
Return value: the request returns a response object, which provides attributes and methods for processing the result.
Response attributes/methods: read() (usable only once) returns the whole body as a byte string; getcode() returns the status code; info() returns the response headers.
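A minimal sketch of urlopen for both GET and POST requests; the httpbin.org URLs are just test endpoints chosen here for illustration.

```python
from urllib import request, parse

# Simple GET: read() returns the whole body as a byte string and works only once
resp = request.urlopen("http://httpbin.org/get", timeout=10)
print(resp.getcode())        # status code, e.g. 200
print(resp.info())           # response headers
print(resp.read()[:100])     # first bytes of the body

# POST: passing bytes as `data` switches the request to POST
form = parse.urlencode({"name": "spider"}).encode()   # b'name=spider'
resp = request.urlopen("http://httpbin.org/post", data=form, timeout=10)
print(resp.read().decode())
```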
(2) Adding Headers (use a Request object together with urlopen to build a more complete request; urlopen on its own only gives you GET, or POST when data is supplied, while a Request object supports the other request methods as well)
The User-Agent header tells the server what client you are using.
urllib.request.Request builds a complete network request and returns a Request object, which carries the headers.
Headers must be given as a dictionary, e.g. {"User-Agent": "Python-urllib/3.6"} (the default value).
request.urlopen(request.Request(...))

This again returns a response object whose methods include read() (usable only once; calling it again in a second print returns an empty byte string), readline(), info(), geturl() and getcode().
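A short sketch of sending a request with custom headers through a Request object; the target URL and User-Agent string are only placeholders.

```python
from urllib import request

headers = {"User-Agent": "Mozilla/5.0"}   # overrides the default Python-urllib/3.x
req = request.Request("http://httpbin.org/get", headers=headers)

resp = request.urlopen(req)
print(resp.getcode())            # status code
print(resp.geturl())             # final URL after redirects
print(resp.read().decode())      # read() can only be called once
```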
(3) Handling cookies (used for logins and session technology; a crawler that wants to disguise itself has to handle cookies)
Workflow: create a cookie (CookieJar) object, create a cookie handler from it, create an opener object with the cookie handler as a parameter, and use that opener to send requests. A proper User-Agent also makes the request look friendlier to the server.
Cookies expire; they have a time limit.
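A minimal sketch of the cookie workflow described above, using an httpbin endpoint that sets a test cookie.

```python
from http import cookiejar
from urllib import request

cookie_jar = cookiejar.CookieJar()                       # 1. cookie object
handler = request.HTTPCookieProcessor(cookie_jar)        # 2. cookie handler
opener = request.build_opener(handler)                   # 3. opener built from the handler

# 4. requests sent through the opener store any cookies the server sets
opener.open("http://httpbin.org/cookies/set?session=demo")
for cookie in cookie_jar:
    print(cookie.name, cookie.value)
```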
(4) Using a proxy (the safest way to send requests)
Requesting an IP-echo URL such as "http…/ip" shows which address the server sees. If you keep sending crawling requests from the same IP it is easy to get that IP blocked. For a better disguise you should also add request headers, even though a request can succeed without them.
Workflow: proxy address (the address through which the HTTP traffic is routed), then a proxy handler (the rest is similar to the cookie case), then create an opener object and send the request through it once it is created. If you read() the return value of an IP-echo request, you will see that the reported address has become the proxy's address.
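A sketch of the proxy workflow; the proxy address below is a placeholder and has to be replaced with a working proxy.

```python
from urllib import request

# placeholder proxy address; substitute a real proxy here
proxy_handler = request.ProxyHandler({"http": "http://127.0.0.1:8888"})
opener = request.build_opener(proxy_handler)

resp = opener.open("http://httpbin.org/ip", timeout=10)
print(resp.read().decode())   # with a working proxy, this prints the proxy's address
```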
2. urllib.error: the exception-handling module (what to do when a request cannot be processed normally)
URLError: the base class of this module; exceptions raised by the request module can be handled with this class.
HTTPError: a subclass of URLError, which mainly carries three attributes: code, reason and headers.
You can use try/except to check for errors, e.g. except error.HTTPError as e: print(e.code).
When the server detects a crawler and refuses access, the error cause and other information are fed back, and the error can then be analysed through URLError or HTTPError.
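A small sketch of handling both exception classes; the 404 test URL is only an example.

```python
from urllib import request, error

try:
    resp = request.urlopen("http://httpbin.org/status/404", timeout=10)
except error.HTTPError as e:     # subclass of URLError: catch it first
    print(e.code, e.reason, e.headers)
except error.URLError as e:      # base class: DNS failures, refused connections, ...
    print(e.reason)
```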
3. urllib.parse: the URL parsing module (as important as the request module; used when constructing or resolving special URLs)
A URL may only contain ASCII characters, so other characters must be transcoded.
Transcoding a single parameter:
parse.quote() encodes it into percent-encoded ASCII; parse.unquote() decodes it back to the original text.
Transcoding multiple parameters:
parse.urlencode() encodes a dictionary of parameters into a query string;
parse.parse_qs() turns such a string back into a dictionary.
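A quick sketch of the four functions just listed.

```python
from urllib import parse

# single parameter
encoded = parse.quote("爬虫")      # percent-encoded ASCII, e.g. '%E7%88%AC%E8%99%AB'
print(parse.unquote(encoded))      # back to the original text

# multiple parameters
query = parse.urlencode({"name": "spider", "page": 1})   # 'name=spider&page=1'
print(parse.parse_qs(query))       # {'name': ['spider'], 'page': ['1']}
```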
4. urllib.robotparser: the robots.txt parsing module
robots.txt is an announcement posted by a website; crawlers may disobey it, but this module lets you read its content. You can usually view it by appending /robots.txt to the site's URL.
The Robots protocol (the web crawler exclusion standard) is often written so that no crawler is allowed to visit, but crawlers generally do not comply.
User-agent: Baiduspider means the rule applies to Baidu's crawler.
Disallow: /baidu means crawling anything under the /baidu directory is not allowed.
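A brief sketch of reading a robots.txt with the module; the Baidu URLs are only examples.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")   # append /robots.txt to the site URL
rp.read()

# can_fetch() reports whether a given user agent may crawl a given URL
print(rp.can_fetch("Baiduspider", "https://www.baidu.com/baidu"))
print(rp.can_fetch("*", "https://www.baidu.com/"))
```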
II. The urllib3 library (a third-party library, not official; it first establishes a connection pool for subsequent operations, and it complements urllib)
A Python HTTP client library with 100% test coverage (a very stable library).
First instantiate an object that manages the connection pool, and then send requests through it:
http = urllib3.PoolManager(); response = http.request("GET", "…")
Signature of the request method: request(self, method, url, fields=None, headers=None, **urlopen_kw), where fields plays the role of urllib's data and **urlopen_kw accepts extra keyword parameters.

url: must be a string, not a request object; that is, you no longer wrap the URL in Request(...) but pass it directly to http.request(method, url).
Return value: a response object, just as with urllib.
print("status", response.status) prints the string "status" followed by the response's status code.
Proxies work in a similar way to urllib, except that the pool-manager object is replaced by a proxy-manager object: proxy = urllib3.ProxyManager("http://…"), then res = proxy.request("GET", "…").
The stream method can be used to retrieve large amounts of data piece by piece.
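A sketch of ProxyManager and of streaming a large response; the proxy address is a placeholder, and preload_content=False is the keyword argument that enables chunked reading.

```python
import urllib3

# proxy: ProxyManager replaces PoolManager (placeholder address)
proxy = urllib3.ProxyManager("http://127.0.0.1:8888")
res = proxy.request("GET", "http://httpbin.org/ip")
print(res.data.decode())

# stream: fetch a large body in chunks instead of loading it all at once
http = urllib3.PoolManager()
resp = http.request("GET", "http://httpbin.org/bytes/102400",
                    preload_content=False)
for chunk in resp.stream(4096):
    pass                     # e.g. write each chunk to a file here
resp.release_conn()
```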
Request data:

For GET, HEAD and DELETE requests, query parameters can be added by passing a dictionary through the fields argument (this does not add a form, which only exists for POST submissions, but adds URL parameters). For POST and PUT requests, query parameters attached to the URL have to be URL-encoded into the correct format and spliced onto the URL yourself; the submitted form is then passed through the fields parameter, e.g. fields={'field': '…'}.
In a GET the content of fields becomes query parameters; in a POST the content of fields becomes the form.
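A short sketch contrasting the two behaviours of fields against the httpbin test endpoints.

```python
import urllib3

http = urllib3.PoolManager()

# GET: fields become query parameters appended to the URL
r1 = http.request("GET", "http://httpbin.org/get", fields={"name": "spider"})

# POST: fields are sent as the form body; URL query parameters would have to be
# urlencoded and spliced onto the URL by hand
r2 = http.request("POST", "http://httpbin.org/post", fields={"field": "value"})
print(r1.status, r2.status)
```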
With the json module, loads is the method that converts the returned JSON text into a dictionary: json.loads(…)["json"] first parses and then indexes to extract the JSON data; if the request carried a form, indexing with ["form"] extracts the form data, and ["files"] extracts the content of uploaded files. Several kinds of data can be extracted this way by changing the key.
File upload: pass the file through fields (the server reports it under files).
Image upload: send the raw binary data as the request body.
The response object is similar to urllib's and provides corresponding attributes; its data attribute returns the response body.
If the response contains JSON data, it can be processed with the json module;
for binary data it is better to process the response through stream().
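A sketch of parsing an httpbin response with the json module and pulling out the keys mentioned above.

```python
import json
import urllib3

http = urllib3.PoolManager()
resp = http.request("POST", "http://httpbin.org/post", fields={"field": "value"})

payload = json.loads(resp.data)   # resp.data is bytes; loads turns the JSON into a dict
print(payload["form"])            # the form submitted through fields
print(payload["files"])           # content of uploaded files, if any
```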
III. Downloading Baidu images with urllib3

General crawler development workflow (a complete sketch follows at the end of this section):
Send a request to the page to get its HTML, extract the URLs of the images from it, fetch each image resource through its URL, and then save it (data persistence; the os library can be used to create the target directory).
When writing the regular expression, the key used for extraction must actually appear in the page source before it can be used; you cannot swap in another URL key that is not there (for example, data-imgurl versus thumbURL).
Use enumerate() in the for loop when extracting images: after for there is one extra variable that receives the index of each enumerated item.
os.mkdir("cat") creates the target directory.
img_name = 'cat/' + str(index) + '.jpg'
"wb" opens the file in binary write mode: the file can only be written, and it is created if it does not exist.

Use HTTPBin to test HTTP libraries:
http://httpbin.org : visiting http://httpbin.org/get (append /get to the site) echoes back the information the client sent; without a proxy, the origin IP it shows is your local IP address.
http://httpbin.org/get?name=… : adding parameters after get? is the format for passing parameters with the request, and you can then use the methods above to see what kind of value is returned.
bytes (byte string): stores bytes (0-255).
str (string): stores Unicode characters (code points 0 to 0x10FFFF).

if '\\' in img_url:
    img_url = img_url.replace('\\', '')  # replace the backslash with nothing, so the escaped http URL becomes usable

for index, img_url in enumerate(imgs_url):  # enumerate the extracted URLs before saving, otherwise each new image would overwrite the previously saved file

The os module is used to create the folder.

To deal with anti-crawling mechanisms, you can open the image page, view the source code, and find the request headers or the request URL under XHR (in the browser's developer tools) to get around the site's protection.
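A minimal end-to-end sketch tying the fragments above together. It assumes a hypothetical search-page URL, treats thumbURL as the key present in the page source, and uses a placeholder User-Agent; it illustrates the workflow rather than reproducing the original script.

```python
import os
import re
import urllib3

http = urllib3.PoolManager()
headers = {"User-Agent": "Mozilla/5.0"}   # placeholder request header

# 1. request the search page (hypothetical URL) and decode its source
page_url = "https://image.baidu.com/search/index?tn=baiduimage&word=cat"
page = http.request("GET", page_url, headers=headers).data.decode("utf-8", "ignore")

# 2. extract image URLs with a regular expression keyed on thumbURL
imgs_url = re.findall(r'"thumbURL":"(.*?)"', page)

# 3. create the target directory and save each image
if not os.path.exists("cat"):
    os.mkdir("cat")

for index, img_url in enumerate(imgs_url):
    if '\\' in img_url:
        img_url = img_url.replace('\\', '')        # strip escaping backslashes
    img_name = 'cat/' + str(index) + '.jpg'
    img_data = http.request("GET", img_url, headers=headers).data
    with open(img_name, "wb") as f:                # binary write mode
        f.write(img_data)
```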
