To make a crawler harder to detect, I use Python to implement three ways of setting a random request header!

When writing a crawler, we all set the user-agent request header, right? Adding this parameter lets a request pass, to some extent, as coming from a browser, so the server will not immediately identify it as a spider. As far as I know, many readers copy a user-agent from the network panel and paste it into their code. There is nothing wrong with getting it that way, and it works, but if a site's anti-scraping measures are strong, a fixed request header can become a problem, so we need to set a random one. Here I will share the three ways I usually set random request headers. Read, learn, and leave a comment!!!



The ideas in brief:

  • In fact, to achieve a random effect, we can largely rely on the standard library module random: calling random.choice() on a list of user-agent strings picks one at random. This is the first of my methods.
  • Python, being a language rich in third-party packages, naturally has one that generates random request headers: the fake-useragent library. Its basic use is introduced below.
  • Since others can write third-party libraries, we can naturally implement the same function ourselves. Most of the time my code simply calls a GetUserAgentCS class that I wrote myself, which returns a random request header directly. Writing your own little library like this is convenient and comfortable, and I will also show how to do it below.

Write your own third-party library:

  • I don't know how your code is structured: procedural or object-oriented? For one-off code, just write it straight through. But if you think a piece of code will be used in many places and can be reused, wrap it in a class; then from any other file you can import that file and call the methods of the class directly. I implemented random request headers as such a small library, as follows:
import random
import csv


class GetUserAgentCS(object):
    """
    Read the local user-agent file and return a random request header.
    """

    def __init__(self):
        # The CSV has two columns: a row index and the user-agent string.
        with open('D://pyth//scrapy 项目//setting//useragent.csv', 'r') as fr:
            fr_csv = csv.reader(fr)
            self.user_agent = [str(i[1]) for i in fr_csv]

    def get_user(self):
        # Pick one header at random from the loaded list.
        return random.choice(self.user_agent)

The useragent file is as follows:

1,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
2,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36,Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.17 Safari/537.36"
3,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36,Mozilla/5.0 (X11; NetBSD) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
4,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36,Mozilla/5.0 (X11; CrOS i686 3912.101.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"
5,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
...  # and so on, about 100 rows in total

The code is very simple: read the local csv file, then pick one entry at random. Now someone will ask: how did you get this file? It's very simple; naturally there is a way, and I will talk about it in the next section. Here, all we need is the GetUserAgentCS class (you can copy the code above directly). Save it as get_useragent.py, put that file in your crawler folder, and then call it like this:

from get_useragent import GetUserAgentCS

def get_headers():
    headers = {}
    headers['user-agent'] = GetUserAgentCS().get_user()
    return headers
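
For completeness, here is a minimal sketch of plugging those headers into requests (the URL is just httpbin again, as a stand-in for whatever page you are crawling):

import requests
from get_useragent import GetUserAgentCS

headers = {'user-agent': GetUserAgentCS().get_user()}
r = requests.get('http://httpbin.org/user-agent', headers=headers)
print(r.json())  # httpbin echoes back the user-agent it received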

If this call to GetUserAgentCS is unsuccessful, or the import is marked with a red wavy underline, then you haven't set the current working environment. You only need to mark your crawler folder as the source root: in PyCharm, right-click the folder and choose Mark Directory as → Sources Root!


Use the third-party library fake-useragent:

  • This is a third-party library that someone else has written: you install it and then call its API to get all kinds of request headers. Its only drawback is that fetching is unstable: sometimes network fluctuations cause the retrieval to fail, which is not very comfortable to use inside Scrapy. That is why I wrote my own package, above, on the basis of this one. As for where the request-header data comes from: while this package was running normally, I kept drawing a changing user-agent from it, kept requesting http://httpbin.org/user-agent, and kept saving the returned data to the local file.
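
Based on that description, here is a minimal sketch of how such a local file could be generated (the file name, the 100-row count, and the two-column layout are my assumptions, chosen to match the GetUserAgentCS reader above):

import csv
import requests
from fake_useragent import UserAgent

ua = UserAgent()
with open('useragent.csv', 'w', newline='') as fw:
    writer = csv.writer(fw)
    for i in range(1, 101):
        agent = ua.random  # draw a changing user-agent from fake-useragent
        # Echo it through httpbin to confirm the header was actually sent.
        r = requests.get('http://httpbin.org/user-agent',
                         headers={'user-agent': agent})
        writer.writerow([i, r.json()['user-agent']])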

Let's talk about how to use this package!

Installation

pip install fake-useragent

You can run pip list to check whether the installation succeeded.

How to use

import requests
from fake_useragent import UserAgent

url = 'http://httpbin.org/user-agent'  # whatever page you want to crawl
headers = {'User-Agent': str(UserAgent().random)}
r = requests.get(url, headers=headers)
  • UserAgent().random returns a request header from a random browser
  • UserAgent().chrome returns a Google Chrome request header
  • UserAgent().firefox returns a Firefox request header

Most of the time, just use random directly. Simple.
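
One note on the instability mentioned earlier: in the 0.1.x releases of fake-useragent, which fetch their data over the network, the UserAgent constructor accepts a fallback keyword so that a failed fetch returns a fixed header instead of raising an error. Treat this as a sketch and check the docs of the version you installed:

from fake_useragent import UserAgent

# fallback: the header to return when the online data source is unreachable
# (constructor keyword from the fake-useragent 0.1.x line).
ua = UserAgent(fallback='Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) '
                        'Chrome/81.0.4044.129 Safari/537.36')
print(ua.random)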


Reading from an in-memory array:

  • At this point many people will say: I just want to change the request header, does it need to be this troublesome? Of course there is a simpler way, but it means copying the headers in by hand every time, which is not a great method. It looks like this:
ua = ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36,Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.17 Safari/537.36"
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36,Mozilla/5.0 (X11; NetBSD) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36"]

Put the request headers into the array in advance, then draw from it:

import random
import requests

ua = [.....]  # the array of user-agent strings from above
r = requests.get(url, headers={'user-agent': random.choice(ua)})  # url: the page you are crawling

The above are several ways to set the request header. If you have anything to add, leave a message in the comment area.

This post taught you three ways to set a random request header. Setting the request header (user-agent) is unavoidable for a crawler, and knowing how to generate a random one is something every crawler writer must master. After reading this article, you can do it easily!

After reading, you will gain something, like, follow, add to favorites, and encourage each other!


Origin: blog.csdn.net/qq_45906219/article/details/108563192