When writing a crawler, I believe we all set the user-agent request header. By adding this header to a request, you can disguise the crawler as a browser to some extent, so the server will not immediately identify it as a spider. As far as I know, many readers copy a user-agent from somewhere on the network each time and paste it into their code. There is nothing wrong with getting it that way, and it works, but if the website's anti-crawling measures are stronger, a fixed request header can be a problem, so we need to set a random request header. Here I will share the three ways I generally use to set random request headers. Learn along and leave a comment!
Idea introduction:
- In fact, to achieve a random effect we can largely rely on the standard random library: call random.choice() on a list of user-agents to pick one at random. This is the first method.
- Python, as a language with many third-party packages, naturally has a package that can generate random request headers. That is the fake-useragent third-party library; its simple usage is introduced later.
- Since others can write a third-party library, naturally we can implement the same function ourselves. Most of the time my code directly calls a GetUserAgentCS class I implemented, which returns a random request header directly. Writing your own small function library is convenient and comfortable, and I will also show how to write one below.
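The first idea above can be shown in a few lines. This is a minimal sketch; the two user-agent strings in the list are just examples, and in practice you would keep a longer list:

```python
import random

# A hypothetical short list of user-agent strings; in real code you
# would collect many more.
user_agents = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
]

# Pick a different header (potentially) before every request.
ua = random.choice(user_agents)
```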
Write your own third-party library:
- I don't know what the structure of your code is, process-oriented or object-oriented. For one-off code, just write it straightforwardly. If you think the code will be used in many places and can be reused, then write it as a class; from other files you can import that file directly and call the methods of the class you wrote. I implemented a small library for random request headers this way, as follows:
import csv
import random


class GetUserAgentCS(object):
    """Read the local request-header file and return a request header."""

    def __init__(self):
        with open('D://pyth//scrapy 项目//setting//useragent.csv', 'r') as fr:
            fr_csv = csv.reader(fr)
            self.user_agent = [str(i[1]) for i in fr_csv]

    def get_user(self):
        return random.choice(self.user_agent)
The useragent.csv file looks like this:
1,"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
2,"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.17 Safari/537.36"
3,"Mozilla/5.0 (X11; NetBSD) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
4,"Mozilla/5.0 (X11; CrOS i686 3912.101.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
5,"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
... (about 100 rows in total)
The code is very simple: read the local csv file, then pick a random entry. Now someone will ask me how I got this file in the first place. It is very simple, and of course there is a way; I will cover it in the next module. Here, we only need to write the GetUserAgentCS class (the code can be copied directly from above), save it as get_useragent.py, put that file in your own crawler folder, and then call it like this:
from get_useragent import GetUserAgentCS


def get_headers():
    headers = {}
    headers['user-agent'] = GetUserAgentCS().get_user()
    return headers
If this call to GetUserAgentCS is unsuccessful, or the import is underlined with a red wavy line, then you have not set the current working environment. You only need to set your crawler folder as the source root: in PyCharm, right-click the folder and click Sources Root!
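If you are not using PyCharm, a portable alternative is to make the folder importable from the code itself by extending sys.path. The folder name 'my_crawler' below is a placeholder for your own crawler folder:

```python
import os
import sys

# Hypothetical folder name; replace with your own crawler folder.
crawler_dir = os.path.abspath('my_crawler')

# Make modules in that folder importable without any IDE settings.
if crawler_dir not in sys.path:
    sys.path.append(crawler_dir)

# After this, `from get_useragent import GetUserAgentCS` can resolve.
```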
Use the third-party library fake-useragent:
- This is a third-party library that someone else has written. You install it and then call its API, and it can fetch all kinds of request headers. The only drawback is that retrieval is unstable: network fluctuations can sometimes cause it to fail, which makes it not very comfortable to use in Scrapy. That is why I wrote my own package, shown above, on top of this one. As for where the request-header data comes from: when the package runs normally, it keeps changing its user-agent, continuously requests http://httpbin.org/user-agent, and keeps saving the returned data to a local file.
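The collection idea just described can be sketched roughly as follows. This is my own illustrative version, not the package's actual code: http://httpbin.org/user-agent echoes back the User-Agent header it received as JSON, so by cycling through candidate headers and recording the echoes you can build a local list. The fetch function is injected here so the logic can be shown without real network access; with requests it could be `lambda url, ua: requests.get(url, headers={'user-agent': ua}).text`.

```python
import json


def collect_user_agents(candidate_uas, fetch):
    """Request httpbin's echo endpoint with each candidate user-agent
    and keep the unique values it reports back."""
    collected = []
    for ua in candidate_uas:
        body = fetch('http://httpbin.org/user-agent', ua)
        echoed = json.loads(body)['user-agent']
        if echoed not in collected:
            collected.append(echoed)
    return collected
```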
Let's talk about how to use this package!
Installation:
pip install fake-useragent
You can run pip list to check whether the installation succeeded.
How to use
import requests
from fake_useragent import UserAgent

headers = {'User-Agent': str(UserAgent().random)}
r = requests.get(url, headers=headers)  # url is your target address
- UserAgent().random gets a request header for a random browser
- UserAgent().chrome gets a Google Chrome request header
- UserAgent().firefox gets a Firefox request header
Most of the time, just using .random directly is enough; simple.
Read from an in-memory array:
- At this point many people will say: I just want to change the request header, does it need to be so troublesome? Of course there is a simpler way, but it means copying the strings in every time, which is not a great method. It looks like this:
ua = ["Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
      "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.17 Safari/537.36",
      "Mozilla/5.0 (X11; NetBSD) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"]
Put the request header into the array in advance and use it.
import random

import requests

ua = [.....]  # the list above

r = requests.get(url, headers={"user-agent": random.choice(ua)})
The above are several ways to set the request header. If you have anything to add, leave a message in the comment area.
To recap: setting the request header (user-agent) is unavoidable for a crawler, and generating a random request header is something every crawler author must master. After reading this article, you can easily do it in three ways!
After reading, you will gain something, like, follow, add to favorites, and encourage each other!