Python crawlers: get the hang of requests in minutes.

At present there are two request libraries commonly used for Python crawlers: urllib and requests. I have written posts about urllib before, but urllib is more "primitive": it is more troublesome to use and harder to learn than requests. I'm a little tired of blogging about it, so I won't be updating the urllib posts anymore. After all, requests is powerful and much more user-friendly, so here is a post on how to use requests. One post should be enough.
It covers: request methods, string decoding, sending POST requests with parameters, crawler disguise (User-Agent), IP proxy settings, cookie handling, and so on.

Request method

Let me start with the two ways requests can send a request: GET and POST. The code below uses the Baidu homepage as an example.

import requests

url = 'https://www.baidu.com/'
response = requests.get(url)       # GET request
# response = requests.post(url)    # POST request works the same way

A POST request can carry a form: you send the server a username, password, and so on as form data. This will be demonstrated later in the cookie section with the QQ space example; a quick sketch follows here.
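
For reference, here is a minimal sketch of a POST request with form data. It uses http://httpbin.org/post purely as a convenient echo endpoint (my choice for illustration, not a site from the original example); a real login form would need the site's actual field names.

import requests

# httpbin.org/post simply echoes back whatever form data it receives,
# which makes it handy for checking what a POST actually sends.
url = 'http://httpbin.org/post'
form = {'username': 'test_user', 'password': 'test_password'}
response = requests.post(url, data=form)
print(response.text)   # the submitted fields appear under "form" in the echoed JSON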

Decoding the response

There are two ways to read the response a website returns.
1.1 text

import requests

url = 'https://www.baidu.com/'
response = requests.get(url)
print(response.text)      # text: the response as a decoded str

1.2 content

import requests

url = 'https://www.baidu.com/'
response = requests.get(url)
print(response.content)   # content: the response as raw bytes

Note that text returns an already-decoded unicode string, while content returns raw bytes, so with content you can decode it however you need, for example as utf-8 or gbk.
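
As a quick sketch of the difference (same request as above, just with explicit decoding):

import requests

url = 'https://www.baidu.com/'
response = requests.get(url)

# text is already a decoded str (requests guesses the encoding);
# content is raw bytes that you decode yourself.
print(type(response.text))               # <class 'str'>
print(type(response.content))            # <class 'bytes'>
print(response.content.decode('utf-8'))  # explicit utf-8 decode
# for a GBK-encoded page: response.content.decode('gbk')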

Passing parameters (User-Agent disguise, request parameters)

requests is much simpler and more convenient than urllib here: headers, form data, and query parameters can all be passed directly as keyword arguments, and POST and GET work the same way. For example, we can pass a headers dict directly:

import requests

url = 'https://www.baidu.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}

response_s = requests.get(url)                   # request without headers
response_l = requests.get(url, headers=headers)  # request with a disguised User-Agent

print(len(response_s.content.decode('utf-8')))
print(len(response_l.content.decode('utf-8')))

Here I compared the request without headers against the one with headers added, and the response without headers is noticeably shorter. This is because the server detected that we are a crawler script and fobbed us off with a stripped-down (or misleading) page.
Sometimes the target server keeps returning incomplete or incorrect data that differs from what we see in the browser. That is when we must disguise the request, for example by adding request headers. Sometimes that alone is not enough, for instance when crawling Lagou, so the request headers may need to be more complete and comprehensive, as in the sketch below.
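
Here is a rough sketch of a fuller header set, plus passing query parameters through the params argument. Which header fields a given site actually checks varies, so treat the values as illustrative rather than a guaranteed recipe.

import requests

url = 'https://www.baidu.com/s'
# a more complete disguise: User-Agent plus a few other common fields
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
    'Referer': 'https://www.baidu.com/',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8'}
# query-string parameters go in params as a dict (for POST, form fields go in data= the same way)
params = {'wd': 'python'}
response = requests.get(url, headers=headers, params=params)
print(response.url)   # the final URL with the encoded query string appended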

IP proxy

Using a proxy is also very convenient.
Everyone knows this more or less: every computer has an IP address it uses to access the network. But when a crawler hits the same website many times from the same IP, the server may start to suspect it is a crawler and restrict access. In that case you can use an IP proxy, i.e. access the website through someone else's IP, which reduces suspicion and keeps your own IP safe.
1. Get an IP
You can buy IPs from the major proxy sites or grab free ones; free ones are unstable, and for truly large-scale access you might build an IP proxy pool yourself (a minimal sketch of the idea appears at the end of this section). For ordinary use a single proxy is generally enough.
2. Use the proxy
Here I use a free proxy, which may be unstable.
Unstable or not, the usage is the same: build the proxy dict first and then pass it to the request through the proxies parameter.

import requests

url = 'http://httpbin.org/ip'   # test site that reports the requesting IP
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
proxy = {
    'http': '2121.232.144.155:9000'}   # the free proxy used here; swap in your own
response = requests.get(url, headers=headers, proxies=proxy)
print(response.content.decode('utf-8'))

Okay, this time it succeeded. Note the test site I visited, http://httpbin.org/ip, which simply reports the IP address the request came from.
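
And, as mentioned above, for large-scale crawling you may want a small proxy pool. A minimal sketch of the idea, rotating randomly through a list (the addresses below are placeholders, not working proxies):

import random
import requests

# placeholder proxy pool; fill it with proxies you have bought or collected
PROXY_POOL = [
    {'http': 'http://111.111.111.111:8888'},
    {'http': 'http://222.222.222.222:9999'}]

def fetch(url):
    proxy = random.choice(PROXY_POOL)   # pick a different proxy each time
    try:
        return requests.get(url, proxies=proxy, timeout=5)
    except requests.RequestException:
        return None   # dead free proxies are common; the caller can retry

response = fetch('http://httpbin.org/ip')
if response is not None:
    print(response.text)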

Cookie handling

1. Getting cookies

import requests

url = 'https://www.baidu.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}

response = requests.get(url, headers=headers)
# print(response.content.decode('utf-8'))
print(response.cookies.get_dict())   # the cookies the server set, as a plain dict

2. Using cookies (a simulated login exercise)
Here is an example of logging in to my own QQ space. The code is as follows:


import requests

url_get_cookie = 'https://qzone.qq.com/'               # login page, used to obtain the cookie
url_qq_space = 'https://user.qzone.qq.com/3139541502'  # my own QQ space
data = {
    'username': '3139541502',   # use your own QQ number
    'password': '********'}     # use your own QQ password
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
session = requests.session()                              # create a session object
session.post(url_get_cookie, data=data, headers=headers)  # log in, so the session obtains the cookie
resq = session.get(url_qq_space)                          # visit the space with the saved cookie
print(resq.text)
# print(session.cookies)


To log in to QQ space you go through the login form and then enter the space, and the precondition for entering the space is having submitted the account and password. At that point cookie information is generated to prove that you are logging into your own space, so I need to hold on to that information (cookies are, of course, time-sensitive).
In other words, we need to save the cookie and reuse it for the follow-up visit, and the session object in requests obtains and saves the cookie for us, so the next request can simply be made through the same session.
That is exactly the flow in the code above: post the login form with the session, then get the space page with the same session.
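
To see the "session keeps the cookie" behaviour in isolation, here is a small sketch against httpbin's cookie endpoints (my choice of test target, not part of the original example):

import requests

session = requests.session()
# the server sets a cookie; the session stores it automatically
session.get('http://httpbin.org/cookies/set?demo=12345')
# the stored cookie is sent along with the next request from the same session
response = session.get('http://httpbin.org/cookies')
print(response.text)                # shows the demo cookie coming back
print(session.cookies.get_dict())   # {'demo': '12345'}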

Handling untrusted SSL certificates

Sometimes an error is reported when crawling a site whose certificate is not trusted. Older Python versions did not verify certificates by default, so you generally will not run into this, but crawling some sites can fail with something like:

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:606)>

The solution is also simple. The snippets below show how to deal with it in urllib; just pay attention to the code itself.

from urllib import request
import re
import os
import ssl

# create an SSL context that skips certificate verification
context = ssl._create_unverified_context()
# ... some code omitted ...
b = request.urlopen(url, timeout=tolerate, context=context).read().decode('gb2312', 'ignore')
# ... some code omitted ...

Or you can set Python to ignore certificate verification globally:

from urllib import request
import re
import os
import ssl

# globally replace the default HTTPS context with an unverified one
ssl._create_default_https_context = ssl._create_unverified_context
# ... some code omitted ...
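
If you are using requests rather than urllib, note that it has its own switch for this: pass verify=False to skip certificate verification (a supplementary note, not from the original post; only do this for sites where you accept the risk).

import requests
import urllib3

# silence the InsecureRequestWarning that gets emitted when verification is off
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

response = requests.get('https://example.com/', verify=False)
print(response.status_code)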

Origin: blog.csdn.net/FUTEROX/article/details/107428496