The Python anti-crawl artifact "fake_useragent" that doesn't play by the rules

Hello everyone, I am the Little Gray Ape, a programmer who can write bugs.

Friends who have studied HTTP will know that when we visit a website, we usually need to send request headers. Otherwise, in many cases, the server will treat the request as illegitimate and reject it.

Therefore, when we visit a website, we usually add request headers, and the most common technique is to set the User-Agent header to disguise the request as coming from a browser, so that it is not treated as an illegitimate request.

So what exactly is a User-Agent?

User-Agent, also called the user agent or UA for short, is a special string header that lets the website's server identify the operating system, CPU type, browser and browser version, and other details of the client making the request. In other words, the User-Agent tells the web server what tool the visitor used to send the request: if it looks like a crawler, the request is generally rejected; if it looks like a user's browser, the server responds normally.

Under normal circumstances, a crawler we write will, by default, tell the server that the request comes from Python, and most websites do not want to be accessed by crawlers, mainly for commercial reasons. Therefore, simply by changing the User-Agent field, we can easily fool the website and avoid triggering its anti-crawling mechanism.
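To see the difference for yourself, here is a minimal sketch: it prints the default headers that requests sends (whose User-Agent is typically something like "python-requests/2.x.x") and then sends a request with a browser-style User-Agent. The httpbin.org/headers URL is just an illustrative echo service, and the User-Agent string below is a hand-written example, not one produced by fake_useragent.

import requests

# The default headers requests attaches to every request; the User-Agent
# identifies the request as a Python script rather than a browser.
print(requests.utils.default_headers())

# Overriding User-Agent with a browser-style string (an illustrative value)
# makes the request look like it came from a browser.
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/96.0.4664.110 Safari/537.36')
}
res = requests.get('https://httpbin.org/headers', headers=headers)
print(res.json())   # the service echoes back the headers it actually received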

Python's fake_useragent library solves the problem of having to replace the user agent manually over and over. It can fairly be called a very friendly anti-crawl helper for Python crawler development.

Next, I will walk you through how to use this library in detail.

Since fake_useragent is a third-party library, we first need to install it with pip before using it. The corresponding pip command is as follows:

pip install fake-useragent

Then import the library into your program and check whether it reports an error; if there is no error, the installation was successful!
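As a minimal sketch of that "no error means success" check, you can run something like the following:

# A quick installation check: import the library and report the result
try:
    import fake_useragent
    print('fake_useragent imported successfully')
except ImportError:
    print('fake_useragent is not installed; run: pip install fake-useragent')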

 

Using fake-useragent

After the installation is successful, the specific usage of the library is as follows:

First, import the library in your program:

from fake_useragent import UserAgent

What we want is the UserAgent class it provides; with it we can generate a random User-Agent for the request header.
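Besides random, the UserAgent object in the versions I have used also exposes browser-specific attributes (the exact attribute set can vary between versions), for example:

from fake_useragent import UserAgent

ua = UserAgent()
print(ua.random)    # a random browser User-Agent string
print(ua.chrome)    # a Chrome-style User-Agent
print(ua.firefox)   # a Firefox-style User-Agent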

The details are as follows:

import requests
from fake_useragent import UserAgent

url = 'https://www.sogou.com/web'
headers = {
    'User-Agent': UserAgent().random   # a randomly generated browser User-Agent
}
res = requests.get(url=url, headers=headers)

Using the above method, we can generate a random request header each time and avoid having the request trigger the anti-crawling mechanism. It also saves us the trouble of updating the request header by hand.
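To illustrate the rotation, here is a minimal sketch that sends a few requests, each with a freshly generated User-Agent. The httpbin.org/user-agent URL is just an illustrative echo endpoint, not part of the fake_useragent library.

import requests
from fake_useragent import UserAgent

ua = UserAgent()
for i in range(3):
    # each request gets a different randomly chosen browser User-Agent
    headers = {'User-Agent': ua.random}
    res = requests.get('https://httpbin.org/user-agent', headers=headers)
    print(res.json())   # the endpoint echoes back the User-Agent it received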

 

Example verification

Next, we use a concrete example to verify the fake_useragent library.

We will crawl the Sogou search page for a keyword that we enter:

# Crawl the Sogou search page for the entered keyword
import requests, os
from fake_useragent import UserAgent

url = 'https://www.sogou.com/web'
word = input('enter a word:')
param = {
    'query': word                      # the search keyword, sent as a query parameter
}

headers = {
    'User-Agent': UserAgent().random   # a randomly generated browser User-Agent
}
res = requests.get(url=url, params=param, headers=headers)
print(res.request.headers)             # show the headers that were actually sent
# print(res.text)
html = res.text
file_name = word + '.txt'
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(html)                      # save the page source to a local file
if os.path.exists(file_name):
    print('Crawling finished')

Well, that is all I want to share about using Python's fake_useragent library. If there are any shortcomings, I hope you will point them out so we can make progress together!

If you found this helpful, remember to like and follow!

The Big Bad Wolf is here to make progress with you!

 
