requests crawler: implementing a simple web page collector

Implement a simple web page collector

  • Crawl the Sogou result page corresponding to any user-specified keyword

1 A new concept

Making URL parameters dynamic:

  • If the requested url carries parameters and we want to vary those parameters dynamically, we must:
    • 1. Encapsulate the dynamic parameters to be carried into a dictionary as key-value pairs
    • 2. Pass the dictionary to the params parameter of the get method
    • 3. Remove the hard-coded parameters from the original url, since they are now supplied separately (a minimal sketch follows)
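
As a minimal sketch of these three steps (reusing the Sogou endpoint from the demo below; 'x' stands in for any keyword):

import requests

# Before: the parameter is hard-coded into the URL
# response = requests.get('https://www.sogou.com/web?query=x')

# After: a bare URL plus a params dictionary supplies the parameter dynamically
params = {'query': 'x'}
response = requests.get('https://www.sogou.com/web', params=params)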

2 Demo

Next, we demonstrate the code and results of crawling Sogou search results for the keyword "Hello Mr. Tree".

import requests

keyWord = input("Enter a keyword to search for: ")
# The url carries request parameters; to crawl pages for different keywords,
# the parameters carried by the url must be made dynamic.
params = {
    'query': keyWord
}
url = 'https://www.sogou.com/web'
# params (dict): holds the parameters carried by the url at request time
response = requests.get(url=url, params=params)
page_text = response.text
fileName = keyWord + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)

print(fileName, 'crawl finished!')
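
As a quick check (an addition, not part of the original demo), response.url shows the final URL that requests composed from the base URL and the params dictionary:

print(response.url)  # e.g. https://www.sogou.com/web?query=... (keyword URL-encoded)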

The run shows the page was crawled successfully:

(screenshot: program output after a successful crawl)

However, opening the crawled page shows that it does not match the real search results page.

(screenshot: the real search results page)

(screenshot: the crawled page)

In addition, the saved page may contain garbled characters.

3 Solving the problems

  1. Solve the garbled characters first

    Garbled characters appear because the encoding used to write the file differs from the original encoding of the response data; making the two consistent fixes the problem. Add the following to the code:

    # Modify the encoding of the response data
    # response.encoding returns the original encoding of the response data;
    # assigning to it changes the encoding used to decode the response
    response.encoding = 'utf-8'
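    # A possible refinement (an assumption, not in the original tutorial):
    # response.apparent_encoding guesses the encoding from the response body,
    # so it can replace the hard-coded 'utf-8':
    # response.encoding = response.apparent_encoding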
    
  2. Solve the problem that the page displays "abnormal access request" and the requested data is missing

  • Abnormal access request
    • The website backend has detected that the request was initiated not by a browser but by a crawler program. (Any request not initiated through a browser is treated as abnormal.)
  • How does the website backend know whether a request was initiated through a browser?
    • It checks the User-Agent field in the request headers.
  • What is the User-Agent?
    • The identity of the request carrier.
    • What the request carrier can be:
      • Browser
        • A browser's identity is uniform and fixed, and can be read from a packet capture tool.
      • Crawler program
        • Each crawler's identity is different; requests, for instance, announces itself by library name and version (see the sketch after this list).
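
A minimal sketch of why a crawler's identity stands out (assuming only the requests library's defaults; the exact version string varies by installation):

import requests

# default_headers() returns the headers requests sends when none are supplied;
# its User-Agent names the library, not a browser.
print(requests.utils.default_headers()['User-Agent'])
# e.g. python-requests/2.28.1 -- clearly not a browser's UA string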

The code implementation that solves this problem is shown in section 6 below.

4 How to get the browser's User-Agent

  1. Press F12 on the web page to open the developer tools.
  2. Click the Network tab.
  3. Refresh the page.
  4. Select any request and open the Headers panel on the right.
  5. Find the User-Agent field.


5 The second anti-crawling mechanism

  • UA detection: the website backend inspects the User-Agent of each request to determine whether the request is abnormal.

  • Countermeasure:

    • UA spoofing: UA detection is applied on many websites, so from now on the crawler programs we write should include UA spoofing by default.
    • Spoofing process:
      • Capture a User-Agent value from a real browser request in the packet capture tool, put it into a dictionary, and pass that dictionary to the headers parameter of the request method (get, post).

6 Code solution for abnormal access requests

Store the User-Agent value obtained from the browser in a headers dictionary, then pass that dictionary to the headers parameter of the request method.

import requests

keyWord = input("Enter a keyword to search for: ")
# The url carries request parameters; to crawl pages for different keywords,
# the parameters carried by the url must be made dynamic.
params = {
    'query': keyWord
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
url = 'https://www.sogou.com/web'
# params (dict): holds the parameters carried by the url at request time
# headers added to perform UA spoofing
response = requests.get(url=url, params=params, headers=headers)
# Modify the encoding of the response data
# response.encoding returns the original encoding of the response data;
# assigning to it changes the encoding used to decode the response
response.encoding = 'utf-8'
page_text = response.text
fileName = keyWord + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(fileName, 'crawl finished!')
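
As an optional final check (an addition, not in the original code), the response status and the User-Agent that was actually sent can both be confirmed:

print(response.status_code)                    # 200 means the request itself succeeded
print(response.request.headers['User-Agent'])  # confirms the spoofed UA was sent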

The crawl now succeeds.


Origin blog.csdn.net/qq_45842943/article/details/125952152