Python Notes: Introduction to Web Information Crawling (1)

0. Introduction

We recently ran into a practical problem at work, namely:

  • Given a Taobao product link, we need to extract the SKU title and the main picture URL from the page.

Taking this opportunity, we learn the basic skills of web crawling and then use them to solve the problem above.

Furthermore, we also learn how to download pictures, videos and other files from web pages.

However, it should be noted that this is only a blog post of a learning nature and the content is just a taste: it is meant only as a simple implementation of what the work requires, and for sharing and discussion.

That said, if a reader digs deeper on the basis of this article and uses the related techniques in a way that causes legal problems, this article bears no responsibility.

1. Web page information acquisition

First, let's look at how to obtain web page information.

1. The Mofan tutorial method

In the Mofan video tutorial in reference link 1, the urllib.request.urlopen method of the urllib library is used to crawl the content of the page.

The specific commands are:

from urllib.request import urlopen

html = urlopen("https://detail.tmall.com/item.htm?spm=a230r.1.14.24.7acb2075Uiwtjj&id=601871231483&ns=1&abbucket=20").read()
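
For completeness, a minimal sketch of what the tutorial then does with the returned bytes (read() gives raw bytes, which are decoded to text) might look like this, using a shortened form of the link above:

from urllib.request import urlopen

# urlopen(...).read() returns raw bytes; decode() turns them into text.
# This assumes the page really is utf-8 encoded, which, as we will see below, is not guaranteed.
html_bytes = urlopen("https://detail.tmall.com/item.htm?id=601871231483").read()
html = html_bytes.decode("utf-8")
print(html[:200])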

In actual operation, however, we ran into quite a few pitfalls, mainly:

  1. After running the crawling command a couple of times, we hit a certificate problem: the second fetch failed to obtain the page content and raised the following error:
    URLError: <urlopen error [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:748)>

  2. When decoding the html content with html.decode("utf-8"), the page may not actually be utf-8 encoded, in which case the command raises the following error:
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 1742: invalid start byte
    

After consulting friends in the company's data team, it turned out that the former is most likely the site blocking us because the request was judged to be a crawler, and the latter is because the page content is simply not utf-8 encoded.

A better way to obtain the page content is to use requests together with header information when crawling the data.

2. Header information acquisition

With the plain urlopen method, what we get back is an ordinary byte stream, and we cannot tell from it which encoding the HTTP content uses; therefore the decoding step above does not know which codec to apply.
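
That said, when the server does declare a charset, it can be read from the response object itself; a minimal sketch (assuming the request is not blocked) is:

from urllib.request import urlopen

resp = urlopen("https://detail.tmall.com/item.htm?id=601871231483")
# The Content-Type response header often declares the charset, e.g. "text/html; charset=gbk".
charset = resp.headers.get_content_charset() or "utf-8"  # fall back to utf-8 if none is declared
text = resp.read().decode(charset, errors="replace")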

To address this, we want to know the header information of the relevant site before requesting the URL, so that the page content can be decoded and analysed smoothly in later steps. More commonly, though, we simply send the header information along with the request itself, so that we can analyse the page content as soon as it is read.

Therefore, we need to examine how to obtain the header information of the web page request.

To get the header information, we can copy the page's request as a curl command and then use an online conversion tool (for example, the tool in reference link 6) to turn it into request code.

We open the Taobao product link and press F12 to bring up the browser's developer tools. After refreshing the page, the first request in the list is the direct request to the site.


Right-click that request and copy it as a cURL command (which could be run directly from the command line), then use the online conversion tool mentioned above to convert it into Python code.

After the conversion we obtain the corresponding request code. We keep only the header information headers, delete useless fields such as cookies, and give an example as follows:

headers = {
    'authority': 'detail.tmall.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36',
    'referer': 'https://s.taobao.com/',
    'accept-language': 'zh-CN,zh;q=0.9,zh-TW;q=0.8',
}

3. Use requests to get web content

Finally, let's look at how to get the content of the webpage.

In Mofan's video, what we get after fetching the page is a data stream: we first read its content with read(), and then convert it into readable text (necessary if there is Chinese content in it) with decode().

In other words, if we follow the urllib urlopen approach used in the Mofan tutorial, we have to call read() and decode() manually to get at the content, which is somewhat similar to the way Python files are read.

However, if we instead use requests with the header information attached, this step is already taken care of: requests uses the information that comes back with the response to decode the content, so the result we obtain is directly readable.

In fact, the curl-to-Python conversion tool mentioned above generates the requests call for us. The calling code is as follows:

import requests

headers = {
    'authority': 'detail.tmall.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36',
    'referer': 'https://s.taobao.com/',
    'accept-language': 'zh-CN,zh;q=0.9,zh-TW;q=0.8'
}

params = (
    ('id', '629648945951'),
)

response = requests.get('https://detail.tmall.com/item.htm', headers=headers, params=params)
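
Once the response comes back, requests handles the decoding step for us; a minimal usage sketch is:

# requests picks the text encoding from the response headers (it can also be overridden manually).
print(response.status_code)   # 200 means the request succeeded
print(response.encoding)      # the encoding used when building response.text
html = response.text          # the decoded page content, ready for parsing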

Or, more crudely, we can pass the full URL of the page directly, without supplying the parameters via params.

import requests

headers = {
    'authority': 'detail.tmall.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36',
    'referer': 'https://s.taobao.com/',
    'accept-language': 'zh-CN,zh;q=0.9,zh-TW;q=0.8'
}

response = requests.get('https://detail.tmall.com/item.htm?spm=a230r.1.14.24.349e20750ClDZd&id=601871231483&ns=1&abbucket=20', headers=headers)

In this way, we can directly obtain the content information in the web page.

4. Investigation of the reasons for the failure of web crawling

In actual testing, we found that even with the header information added, we still ran into the problem above when the requests were sent too frequently.

Below, we briefly analyze it to see if we can circumvent this problem in any way.

Later I asked a colleague who works on data, and it turns out there is no good solution to this problem. The root cause is the page's anti-crawling mechanism: what gives the crawler away is that the request behaviour of the Python code is inconsistent with the request behaviour that actually happens in a browser.

Specifically, in a browser, every time you open a page a large number of requests for related resources are actually triggered, and the cookie information in the browser's requests changes as the browsing behaviour unfolds, whereas the cookie information in requests sent from code is usually fixed; this lets the site figure out whether a request comes from code or from a real user's browser behaviour.

Of course, this does not mean there is no strategy at all for getting around these anti-crawling mechanisms, but on the whole it is a contest of wits with the site's designers. Since this is only a general introductory blog post, there is no need to dig that deep here.
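
As a small aside, the simplest and most polite mitigation for the request-frequency problem is just to pace the requests; a minimal sketch (the delay value is arbitrary, and headers is the dict defined earlier) might be:

import time
import requests

item_ids = ["601871231483", "629648945951"]
for item_id in item_ids:
    response = requests.get("https://detail.tmall.com/item.htm",
                            headers=headers, params={"id": item_id})
    # ... parse the response here ...
    time.sleep(3)  # pause a few seconds between requests instead of firing them back to back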

2. Web page information analysis

Next, let's examine the analysis method of web page information.

We use the BeautifulSoup tool to analyze web content.

BeautifulSoup can be thought of as a high-level substitute for writing regular expressions over page content by hand: we can use its built-in methods to pull information out of the page directly, without writing complicated matching rules.

1. Installation of BeautifulSoup

First, let's quickly go through the installation of BeautifulSoup. It only requires a pip install; the one thing to note is that the pip package of the BeautifulSoup library is named beautifulsoup4 (i.e. pip install beautifulsoup4).

After the installation is complete, you also need to pay attention to the import method. The import method of BeautifulSoup is:

from bs4 import BeautifulSoup

2. Use of BeautifulSoup

Now, let's take a look at the specific use of BeautifulSoup.

To explain in detail how to use it, we need to first look at the structure of the information content on the web page.

Generally speaking, the information on the webpage will look like this:

<meta name="keywords" content="花花公子男装夹克男春季新款休闲冲锋衣连帽宽松潮流短款男士外套"/>

To get the information, the syntax of BeautifulSoup is:

soup = BeautifulSoup(html, "html.parser")  # specify a parser explicitly to avoid a warning
skutitle = soup.find("meta", {"name": "keywords"})["content"]

Here, "meta" is the tag name to look up in the document tree, the dictionary that follows gives the attribute filters, and ["content"] finally retrieves the value of the content attribute.

Similarly, we can quickly write the Python command that grabs the first (main) image of the Taobao product:

image = soup.find("img", {"id": "J_ImgBooth"})["src"]

3. Download of files from the webpage

Finally, let's take a look at how to download files from a web page, for example, how to get the pictures obtained above.

Here is the URL of the picture obtained above:

url = "https://img.alicdn.com/imgextra/i4/1851041537/O1CN01qd5ZSB1NDzO4pNexv-1851041537.jpg_430x430q90.jpg"

There are two ways to implement this part:

  1. One way to achieve this is to use the wget library to download it as a file;
  2. The second is to read it as a data stream and then write it to a file.

Below, we will examine them separately:

1. Read the file in the webpage as a data stream and write it into a file

The code sample is as follows:

import requests

url = 'https://img.alicdn.com/imgextra/i4/1851041537/O1CN01qd5ZSB1NDzO4pNexv-1851041537.jpg_430x430q90.jpg'

with open("image.jpg", "wb") as fp:
    r = requests.get(url)    # fetch the image
    fp.write(r.content)      # r.content holds the raw bytes of the response

In this way, we can get the image file from the original link.
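
For larger files (videos, for example), a common variant of the same idea is to stream the response instead of holding it all in memory at once; a sketch using the standard stream=True option of requests:

import requests

url = 'https://img.alicdn.com/imgextra/i4/1851041537/O1CN01qd5ZSB1NDzO4pNexv-1851041537.jpg_430x430q90.jpg'

with requests.get(url, stream=True) as r:
    with open("image.jpg", "wb") as fp:
        for chunk in r.iter_content(chunk_size=8192):  # write the file piece by piece
            fp.write(chunk)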

2. Use wget to download files directly

Under bash, if we want to fetch the picture above, we only need the following command:

wget -O image.jpg https://img.alicdn.com/imgextra/i4/1851041537/O1CN01qd5ZSB1NDzO4pNexv-1851041537.jpg_430x430q90.jpg

Similarly, Python also has a wget library (installed with pip install wget) that can download files quickly:

import wget

wget.download(url, "image.jpg")

4. Reference links

  1. Python crawler basic tutorial (Scraping Tutorial)
  2. Python uses wget to download network files
  3. Three ways to download files in python
  4. Beautiful Soup 4.4.0 documentation
  5. Curl to python online tool
  6. https://curl.trillworks.com/


Origin blog.csdn.net/codename_cys/article/details/109631920