Python web crawler from 0 to 1 (3): Getting started with crawlers based on the Requests library

  After learning the basic usage of the Requests library, we can use it for some very simple web crawling. Since we have not yet covered the Beautiful Soup 4 library, the crawler cannot automatically parse and extract content from the response. This chapter crawls several web pages as samples, analyzes common crawling problems, and gives solutions. It covers four examples: product information query, search engine related search, web image storage, and IP address location query.

Getting started with the Requests library


1. Crawling information about certain products

The process of initiating a request is the same as before. After defining the URL of the page to be crawled, we request the page from the server with the requests.get() method and print the entity content of the Response object we get back. (Note that the URL below is partially masked; you need to substitute the link of an actual product.)

import requests

url1 = 'https://item.jd.com/10000xxxxxxx.html'

r = requests.get(url1, allow_redirects=False)  # do not follow HTTP redirects automatically
print(r.status_code)
r.encoding = r.apparent_encoding
print(r.text)

But there is a problem with the response: although the status code is 200, the message body is not what we see in a browser. Instead, the server returns a piece of JavaScript that redirects the page, working much like an HTTP 303 See Other redirect.

200
<script>window.location.href='https://passport.jd.com/uc/login?ReturnUrl=http%3A%2F%2Fitem.jd.com%2F10000xxxxxxx.html'</script>

What actually pops up is a login page rather than the desired product page.
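
Because this pseudo-redirect is carried in the response body rather than in a 3xx status code, it will not show up in r.history. A minimal sketch of one way to detect it (the substring check is just a heuristic, not part of the original example):

import requests

url1 = 'https://item.jd.com/10000xxxxxxx.html'   # replace with a real product URL
r = requests.get(url1, allow_redirects=False)

# A real HTTP redirect would change the status code; this one hides in the body
if r.status_code == 200 and 'window.location.href' in r.text:
    print('Received a JavaScript pseudo-redirect, probably to a login page')
else:
    print('Received the real page, length:', len(r.text))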

 Making the crawler imitate a human visitor

  From our knowledge of HTTP, we know that the only way the server can understand the client visiting a page is to analyze the header of the request message. Just as we can tell pigs from sheep through a small gap in a fence by observing the characteristics of their coats, the server examines certain fields in the request header to distinguish users visiting with a browser from crawlers visiting automatically. The main distinguishing field is the user agent field, User-Agent.

By comparing the User-Agent header fields of requests constructed by a typical browser and by a crawler, we can easily see why the website returned a login page to the crawler.

# A typical browser's header
{'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36 Edg/85.0.564.51'}

# The crawler's header
{'User-Agent': 'python-requests/2.24.0'}
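
If you want to check these values yourself, the browser's User-Agent can be copied from its developer tools, and the one Requests sends can be printed directly. A short sketch (httpbin.org is just a convenient echo service used here as an example):

import requests

# Default User-Agent string the Requests library identifies itself with
print(requests.utils.default_user_agent())    # e.g. 'python-requests/2.24.0'

# Or inspect the headers that were actually sent with a request
r = requests.get('https://httpbin.org/get')
print(r.request.headers['User-Agent'])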

By default, the Requests library steadfastly uses its own dedicated header, as if declaring to everyone: I am a crawler, what do I have to fear?

The ending was tragic: the visit was blocked outright...

This episode teaches us an important lesson: large websites normally enable automatic user agent (User-Agent) filtering and deny access to anything that is not a browser or a recognized friendly crawler. Provided our crawler stays within ethical and legal bounds, we need a certain amount of disguise when we want to crawl content from such large sites.

We can use a User-Agent string taken from a real browser as the user agent for the crawler's requests:

import requests

url1 = 'https://item.jd.com/10000xxxxxxx.html'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36 Edg/85.0.564.51'}
r = requests.get(url1, headers=headers, allow_redirects=False)
# print(r.request.headers)  # check the headers that were actually sent
print(r.status_code)
r.encoding = r.apparent_encoding
print(r.text)

In this way, with a little disguise, we successfully bypass the site's user agent screening and obtain the page content:

200
<!DOCTYPE HTML>
<html lang="zh-CN">
<head>
...

2. Submitting keyword searches to major search engines

  In addition to collecting information from specific web pages, web crawlers can also find information through search engines. Besides the page-based "normal search" service for users, search engines provide APIs and search interfaces so that other applications can use their services conveniently. By calling the API or constructing a search URL in the required format, a program can obtain search results from the search engine, and a crawler can use these interfaces in the same way. (Note that Baidu prohibits all crawlers on its entire site except designated friendly ones. Any private access must follow the human-like access principle to minimize the load on the website's network and system resources.)

 Using a dictionary to pass request parameters

The following uses Baidu and a certain "non-existent" search engine to demonstrate related searches by passing search request parameters through the search interface.

Note: when using the URL search interface for related searches, the keywords are transmitted in plain text (unencrypted), so pay attention to access security. If you need a more secure search, use the API.

  Baidu

After a little research, Baidu's search interface format is:

https://www.baidu.com/s?wd=[key word]
# [key word] is the search keyword

In fact, the search pages all live under the /s path, and the keyword to be searched is supplied through a parameter named wd. We can use the params argument of the Requests library to merge the search keyword into the request URL.
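
As a quick illustration of how params is merged into the URL, the following sketch builds a request without sending it ('Python' is just a placeholder keyword); Requests also percent-encodes the value automatically:

import requests

# Prepare (but do not send) a request just to see the synthesized URL
req = requests.Request('GET', 'https://www.baidu.com/s', params={'wd': 'Python'}).prepare()
print(req.url)   # https://www.baidu.com/s?wd=Python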

  The "non-existent" search engine

Similarly, the search interface format of this non-existent search engine is roughly the same as Baidu's:

https://www.google.hk/search?q=[key word]

  Implementation code

Note that the user agent field introduced above must be included, otherwise a verification page will be returned to block access.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36 Edg/85.0.564.51',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6'}

url_bd = 'http://www.baidu.com/s'
url_gg = 'https://www.google.hk/search'

# key_word = 'Hello'
key_word = input('Search something:\n')
param_bd = {'wd': key_word}
param_gg = {'q': key_word}

r = requests.get(url_bd, params=param_bd, headers=headers)
# r = requests.get(url_gg, params=param_gg, headers=headers)
print(r.status_code)
print(r.request.url)
r.encoding = r.apparent_encoding
print(r.text[:1000])

For reasons unknown, a search over the HTTPS protocol returns a piece of JavaScript asking the client to jump to the HTTP protocol; simply switching the URL to HTTP allows normal access.


3. Crawling and storing web images

The response obtained by sending a request with the Requests library is not always text. When the response body is an image, the body can be written to a file and saved in a specified directory. We need the os standard library to check the validity and existence of the file path, and the file methods to open and write the file.

import requests
import os

url = 'https://http.cat/'
# status = input('Input the Status code you want download:')
status = '200'
# status = '402'
Path = 'G:\\Downloads\\' + status + '.jpg'
url = url + status + '.jpg'
# Path = 'G:\\Downloads\\' + url.split('/')[-1]
try:
    if not os.path.exists(Path):
        r = requests.get(url)
        with open(Path, 'wb') as f:
            f.write(r.content)
            print('Download Success!')
    else:
        print('File already exists')
except:
    print('Error')

(Since the server in this example is on an external network, the download may take some time; please be patient.)

  • This article will not go into file operations or the contents of the os library; readers who need them can look them up.

  • Since the backslash has an escaping function in Python, every backslash in the save path is doubled (the path can also be written with forward slashes, which do not need escaping).

  • When downloading some content, you may not know the file extension in advance. You can also replace the statement that defines the path with the following code, so that the downloaded file keeps the same name as the file in the remote URL. You may also need to check whether the directory exists and create it; a combined sketch follows after this list.

    Path = root + url.split('/')[-1]
    # root is the download root directory; the variable can be adjusted separately
    
    # Check whether the root directory exists and create it automatically if missing
    if not os.path.exists(root):
        try:
            os.mkdir(root)
        except:
            print('Error while creating the root path')
    
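Putting these pieces together, here is a minimal sketch of the whole download routine with a configurable root directory (the root path and the example URL are placeholders to adapt); os.path.join is used here so the backslash-escaping question does not arise:

import os
import requests

root = 'G:\\Downloads'                           # download root directory (example path)
url = 'https://http.cat/404.jpg'                 # example remote resource
path = os.path.join(root, url.split('/')[-1])    # file name taken from the URL

try:
    if not os.path.exists(root):                 # create the root directory if it is missing
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()                     # fail loudly instead of saving an error page
        with open(path, 'wb') as f:
            f.write(r.content)                   # the response body is the binary image data
        print('Download Success!')
    else:
        print('File already exists')
except Exception as e:
    print('Error:', e)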

4. IP address location query

  This IP location query example uses the ip138 website (https://m.ip138.com/) to look up where an IP address is located. The site's IP location service offers a paid API, so how can we query it without paying for the API? In fact, just as with the search engines above, we can implement a query by constructing a search URL, which requires using the service manually once and analyzing the structure of the URL.

Through observation, we find that this website also sends the IP address to be queried to the server through a parameterized URL, with the following structure:

https://www.ip138.com/iplookup.asp?ip=xxx.xxx.xxx.xxx

Similarly, we only need to construct such a URL to query the desired content. Since we have not yet covered the beautifulsoup library for parsing responses, we can only inspect the content of the response body manually to find the result we want.

import requests

url = 'http://m.ip138.com/iplookup.asp?ip='
url = url + input('Input IP address you want search:')

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36 Edg/85.0.564.51',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6'}

r = requests.get(url, headers=headers)
print(r.status_code)
r.encoding = r.apparent_encoding
print(r.text)
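
Since we cannot parse the HTML yet, one convenient way to "manually observe" the response body is to save it to a local file and open it in a browser. A minimal sketch, assuming an example IP address and output file name:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36 Edg/85.0.564.51'}
# 8.8.8.8 is just an example IP address
r = requests.get('https://www.ip138.com/iplookup.asp?ip=8.8.8.8', headers=headers)
r.encoding = r.apparent_encoding

# Write the body to disk so it can be inspected in a browser or a text editor
with open('iplookup_result.html', 'w', encoding='utf-8') as f:
    f.write(r.text)
print('Saved the response to iplookup_result.html')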


Summary

  Using only the Requests library, we can easily construct requests and get a rough look at the response. In this article we crawled product information from JD.com, used search engines' search interfaces to perform searches from a crawler, downloaded and saved images from a web page, and used a website's query interface to implement IP location lookup. The Requests library has given us a first glimpse of how powerful crawlers can be. After learning the beautifulsoup library and the Scrapy framework later, we will be able to extract information more precisely and perform more complex actions.
