3. Simple web crawler development

Table of contents

1. Legal and ethical issues in crawler development

1. Legal issues of data collection

(1) Compromising personal information security

(2) Involving national security information

(3) Interfering with the normal operation of the website

(4) Infringement of the interests of others

(5) Insider trading

2. Ethical Agreement

(1) Robots protocol

(2) Do not open source the source code of crawlers

2. Using the Requests library

1. Introduction to requests

2. Installing requests

3. Use requests to get web content

(1) get method

(2) post method

(3) The difference between the get method and the post method

3. Use Requests and regular expressions to get web page content

1. Analyze the web page structure

2. Segment extraction

3. Extract the title of the novel

4. Extract the URL of the novel

5. Extract author information

6. Complete code for single-page extraction

7. Complete code for multi-page extraction

4. Homework

References


Starting from this chapter, we will learn to use web crawlers to obtain data directly from the Internet. Before writing any code, we will briefly discuss the legal and ethical issues in crawler development and set a bottom line for ourselves, so that we neither harm the interests of others nor put ourselves at risk.

1. Legal and ethical issues in crawler development

Science and technology is an incomparably sharp double-edged sword: whether it does good or evil depends on the choices of its user. The web crawler itself is a basic technique for quickly and accurately obtaining data from the Internet, and it is of great value when processing huge amounts of network data. However, some crawler operators do not abide by the rules: they crawl data that others are unwilling to share and occupy large amounts of server resources, turning their crawlers into "network pests".

1. Legal issues of data collection

In recent years, a number of national laws and regulations have defined the reasonable scope for collecting, using, and disseminating data. Web crawling is no longer a gray industry wandering on the edge of the law; it is now an activity governed by law.

(1) Compromising personal information security

"Personal information of natural persons is protected by law" is included in the "Code of the People's Republic of China", and illegal collection, dissemination, and sale of other people's personal information will bear legal responsibility. If it is found that the massive amount of information collected by the crawler contains sensitive personal information of others, if it involves illegal activities, the crawling should be stopped immediately, the relevant code should be corrected and the collected personal information of others should be deleted immediately.

Personal information of natural persons is protected by law. Any organization or individual that needs to obtain personal information of others shall obtain it according to law and ensure information security, and shall not illegally collect, use, process, or transmit personal information of others, and shall not illegally buy, sell, provide or disclose personal information of others. "Civil Code of the People's Republic of China"

(2) Involving national security information

According to Article 285 of the "Criminal Law of the People's Republic of China", the crime of illegal intrusion into computer systems refers to acts of intrusion into computer information systems in the fields of state affairs, national defense construction, and cutting-edge science and technology in violation of state regulations.

If the data involves critical national information infrastructure and needs to be provided overseas, it must pass a security assessment by the relevant state departments. (See Article 37 of the "Network Security Law of the People's Republic of China" for details.)

The personal information and important data collected and generated by the operators of key information infrastructure during their operations within the territory of the People's Republic of China shall be stored within the territory of the People's Republic of China. If it is really necessary to provide overseas due to business needs, a security assessment shall be conducted in accordance with the measures formulated by the national network information department in conjunction with the relevant departments of the State Council; where laws and administrative regulations provide otherwise, follow their provisions. Article 37 of the "Network Security Law of the People's Republic of China"

(3) Interfering with the normal operation of the website

When a web crawler accesses a system, it occupies the website's server resources. If a crawler visits a website frequently and in large volume, it will consume a large amount of server resources and may even hinder the server's normal operation. This behavior is also explicitly addressed in the "Data Security Management Measures (Draft for Comment)".

Network operators who use automated means to access and collect website data must not hinder the normal operation of the website. If such behavior seriously affects the website's operation, for example when automated access and collection traffic exceeds one third of the website's average daily traffic, and the website requests that the automated access and collection stop, it shall be stopped. Article 16 of the "Data Security Management Measures (Draft for Comment)"

(4) Infringement of the interests of others

Public data is not necessarily allowed to be used for third-party profit-making purposes. For example, without the permission of the copyright owner, you may face legal risks if you download and disseminate the public information of a certain website and use it for profit.

Network operators analyze and use the data resources they have, and publish market forecasts, statistical information, personal and business credit and other information, and must not affect national security, economic operation, and social stability, and must not damage the legitimate rights and interests of others. Article 32 of the "Data Security Management Measures (Draft for Comment)"

(5) Insider trading

It is legal to use a crawler to collect public data from a company's website, conclude after analysis that the company's stock has investment value, and then buy the stock and profit from it.

It is also legal to crawl public data from a company's website, conclude after analysis that the stock has investment value, and then sell the analysis results and data to a fund company for income.

However, if you crawl public data from a company's website, conclude after analysis that the stock has investment value, first sell the analysis results and data to a fund company, and then buy that company's stock yourself and profit from it, this behavior is suspected of insider trading, which is a serious violation of the law. [1]

Here is an article that discusses cases in which crawler developers were punished for breaking the rules; I suggest everyone read it. The title of the article is "Wrote a crawler, the results were excellent, and more than 200 people in the company were arrested!"

2. Ethical Agreement

(1) Robots protocol

The Robots protocol is an agreement between a website and crawlers. It tells crawlers, in a simple and direct plain-text format, what they are permitted to access. In other words, robots.txt is the first file a search engine checks when it visits a website. When a search spider visits a site, it first checks whether robots.txt exists in the site's root directory. If it exists, the search robot determines its scope of access according to the contents of the file; if the file does not exist, all search spiders can access every page on the site that is not password protected.

Figure 1 Robots.txt of Zhihu website

This robots.txt file indicates that any crawler is allowed to crawl all URLs except those listed after Disallow. However, the Robots protocol is only a "gentleman's agreement" and has no legal effect; it represents a spirit of contract. Only by complying with this rule can Internet companies ensure that the private data of websites and users is not violated. Violating the Robots protocol brings considerable security risks: Baidu once sued 360 for violating Baidu's Robots protocol.

Figure 2 Baidu sued Qihoo for violating the Robots agreement
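Before crawling a site yourself, you can also check its robots.txt programmatically. Below is a minimal sketch using Python's standard urllib.robotparser module; the Zhihu URLs are only illustrative examples.

from urllib import robotparser

# Parse the site's robots.txt and ask whether a given URL may be fetched
rp = robotparser.RobotFileParser()
rp.set_url("https://www.zhihu.com/robots.txt")
rp.read()

# True or False for the generic user agent "*" (the question URL is a made-up example)
print(rp.can_fetch("*", "https://www.zhihu.com/question/12345"))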

(2) Do not open source the source code of crawlers

Open source is a good thing, but do not casually publish the source code of your crawlers, because people with ulterior motives may use the published code to attack websites or maliciously scrape data.

2. Using the Requests library

1. Introduction to requests

The official requests website introduces the library with the sentence "Requests is an elegant and simple HTTP library for Python, built for human beings." It also carries the tagline "HTTP for Humans". In addition, the introduction in the Chinese version of the requests documentation is worth a look; it is full of humor:

Requests is the only non-GMO Python HTTP library that's safe for humans to enjoy.

WARNING : Unprofessional use of other HTTP libraries can lead to dangerous side effects including: security flaws, redundant code syndrome, reinventing the wheel syndrome, document gnawing syndrome, depression, headaches, and even death. [2]

The urllib library that ships with Python (urllib2 in Python 2) can also fetch web content, but it is not as convenient to use. Sending HTTP/1.1 requests with Requests is extremely easy: there is no need to manually append query strings to URLs or to form-encode POST data, and keep-alive and HTTP connection pooling are 100% automatic, thanks to the underlying urllib3.

Requests meets all the needs of today's web. Requests supports HTTP features including:

  • Keep-Alive and connection pooling
  • International domain names and URLs
  • Sessions with cookie persistence
  • Browser-style SSL verification
  • Automatic content decoding
  • Basic/Digest authentication
  • Elegant key/value cookies
  • Automatic decompression
  • Unicode response bodies
  • HTTP(S) proxy support
  • Multipart file uploads
  • Streaming downloads
  • Connection timeouts
  • Chunked requests
  • .netrc support
  • Thread safety
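For example, the persistent-session and keep-alive features are available through a requests.Session object, which reuses the underlying TCP connection and carries cookies across requests. A minimal sketch (it will run once requests is installed, as described in the next subsection; the httpbin.org endpoints are used purely for demonstration):

import requests

session = requests.Session()                    # reuses the TCP connection (keep-alive) and stores cookies
session.headers.update({"User-Agent": "demo"})  # headers shared by every request in the session

session.get("http://httpbin.org/cookies/set?name=value")  # the server sets a cookie
response = session.get("http://httpbin.org/cookies")      # the cookie is sent back automatically
print(response.text)

session.close()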

2. Installing requests

Use pip to install the requests library; enter the following command in a terminal or command window:

pip install requests

Figure 3 requests automatic installation process

After the installation completes, enter import requests in the Python interactive environment or in PyCharm. If no error is reported, the installation was successful.

Figure 4 Test whether requests are installed successfully

3. Use requests to get web content

(1) get method

The get method is mainly used to obtain information from the server. To make a simple web request you need at least two things: a request method, such as .get(), .post(), .head(), or .put(), and the URL to request, which is passed as a parameter to the request method. Here we take the get function as an example. After it executes we receive a response from the server; if we print it, we see <Response [200]>. The 200 in brackets is the HTTP status code.

import requests

url = "http://www.baidu.com"
response = requests.get(url)
print(response)

# Output
<Response [200]>

The status code can be obtained through the .status_code attribute. Status codes are defined in the HTTP protocol: when the client sends a request, the server returns the result together with a status value that tells the client whether the request succeeded or failed. See "Common HTTP Status Codes".

import requests

url = "http://www.baidu.com"
response = requests.get(url)
status_code = response.status_code
print(status_code)

# Output
200

In the code shown in Figure 5, line 4 uses the .content attribute to obtain the page source as bytes, and line 5 uses the .decode() method to decode those bytes into a string.

Figure 5 Get the source code of the web page

The bytes object is converted to a string because Chinese characters cannot be displayed properly as raw bytes. These two lines can be combined into one:

import requests

url = "http://www.baidu.com"
response = requests.get(url)
content_string = response.content.decode()
print(content_string)

The parameter of decode() specifies the encoding; if it is omitted, UTF-8 is used by default to decode the bytes into a string. Some web pages are not encoded in UTF-8, so in those cases the name of the encoding must be passed explicitly.

Figure 6 Check the webpage encoding format in the webpage source code

 

Figure 7 A web page declared as gb2312 that actually needs to be decoded with gb18030

GB18030 is a superset of GB2312 (and GBK), so decoding with gb18030 handles pages declared as gb2312 even when they contain characters outside the GB2312 range. The following decodes web pages in different encodings using the appropriate format:

import requests
sina_url = "http://www.sina.com.cn"
jb51_url = "https://www.jb51.net"
sina_html = requests.get(sina_url).content.decode('utf-8')
print(sina_html)

jb51_html = requests.get(jb51_url).content.decode('gb18030')
print(jb51_html)

Common encodings include ASCII, GBK, GB2312, GB18030, Unicode, UTF-8, and so on. There is an interesting article about the relationships among these encodings that is worth reading, although only a reprint survives and the original link can no longer be opened: "The difference between ASCII, Unicode, GBK and UTF-8 character encoding".

Both the content and text attributes of a response return the page content, but they differ. content returns raw bytes, which must be decoded into a string before they can be displayed normally. text returns an already-decoded string: it is decoded according to the charset declared in the response headers, and if no charset is found, the chardet module is called to guess the encoding. In terms of storage, content holds the raw source and takes less space than text. However, the decoding behind text is not always correct, in which case you need to decode explicitly with content.decode().
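A small sketch of the difference; response.encoding is the encoding requests inferred from the headers, and response.apparent_encoding is the encoding guessed from the page body:

import requests

response = requests.get("http://www.sina.com.cn")

print(type(response.content))      # <class 'bytes'>
print(type(response.text))         # <class 'str'>
print(response.encoding)           # encoding taken from the response headers
print(response.apparent_encoding)  # encoding guessed from the page content

# If response.text looks garbled, decode the bytes yourself:
html = response.content.decode(response.apparent_encoding)
print(html[:200])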

Use chardet to guess the encoding format:

import requests
import chardet
sina_url = "http://www.sina.com.cn"
jb51_url = "https://www.jb51.net"
sina_html = requests.get(sina_url).content
print(chardet.detect(sina_html))

jb51_html = requests.get(jb51_url).content
print(chardet.detect(jb51_html))

# Output
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}

Next, did it also occur to you to use the 'encoding' value returned by chardet.detect to decode content? I suggest you explore this yourself and see what you find.
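If you want to compare notes afterwards, one possible sketch is shown below. Note that the detected encoding can be narrower than the page's real encoding (for example GB2312 reported for a page that really needs GB18030), so a strict decode may fail; errors='replace' keeps the sketch from crashing.

import requests
import chardet

url = "https://www.jb51.net"
raw = requests.get(url).content

# Decode the bytes with the encoding that chardet guessed
detected = chardet.detect(raw)
html = raw.decode(detected['encoding'], errors='replace')
print(detected['encoding'], len(html))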

(2) post method

The post method mainly transmits data to the server. Below we test it against the http://httpbin.org/ website.

httpbin.org is a website for testing various aspects of HTTP requests and responses, such as cookies, IP addresses, headers, and login authentication. It supports GET, POST, and other methods, which makes it very helpful for web development and testing.

Some pages require a specific request method to return content. For example, http://httpbin.org/post only accepts the post method; any other method produces a 405 error.

# Using the wrong request method
import requests
import json

url = 'http://httpbin.org/post'

response = requests.get(url).content.decode()
print(response)

# Output
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>405 Method Not Allowed</title>
<h1>Method Not Allowed</h1>
<p>The method is not allowed for the requested URL.</p>

Using the post method retrieves the page content normally:

import requests
import json

url = 'http://httpbin.org/post'

response = requests.post(url).content.decode()
print(response)

# Output
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "0", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.25.1", 
    "X-Amzn-Trace-Id": "Root=1-6054bac9-3536341b67d231c2571c3eb9"
  }, 
  "json": null, 
  "origin": "***.***.**.*", 
  "url": "http://httpbin.org/post"
}

Use the post method to submit form data and get the response:

import requests
import json

url = 'http://httpbin.org/post'
data = {
    'key1':'value1',
    'key2':'value2'
}
response = requests.post(url,data = data).content.decode()
print(response)

# Output
{
    "args":{},
    "data":"",
    "files":{},
    "form":{
        "key1":"value1",
        "key2":"value2"
    },
    "headers":{
        "Accept":"*/*",
        "Accept-Encoding":"gzip, deflate",
        "Content-Length":"23",
        "Content-Type":"application/x-www-form-urlencoded",
        "Host":"httpbin.org",
        "User-Agent":"python-requests/2.25.1",
        "X-Amzn-Trace-Id":"Root=1-6054b70b-073d1a773c27f3fa1d91862d"
    },
    "json":null,
    "origin":"***.***.**.*",  # ‘*’部分为IP
    "url":"http://httpbin.org/post"
}

(3) The difference between the get method and the post method

  • The main function of the get method is to obtain information from the server; it can also pass some information to the server, and this information tells the server what the user wants.
  • The get method submits data through the URL, so the parameters can be seen in the URL; the post method places the data in the body of the HTTP request, so it does not appear in the URL (see the sketch after this list).
import requests

url_get = 'http://httpbin.org/get'
param = {
    'key1': 'value1',
    'key2': 'value2'
}
response = requests.get(url_get, params=param)
print(response.url)

# Output
http://httpbin.org/get?key1=value1&key2=value2  # the submitted parameters are visible in the URL
  • The get method is limited by URL length, and different browsers impose different limits on it; post can transmit a large amount of data and in theory has no limit. [3]
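For comparison, a minimal sketch of the same parameters sent with the post method; the data ends up in the request body rather than in the URL:

import requests

url_post = 'http://httpbin.org/post'
param = {
    'key1': 'value1',
    'key2': 'value2'
}
response = requests.post(url_post, data=param)
print(response.url)             # http://httpbin.org/post  (no parameters in the URL)
print(response.json()['form'])  # {'key1': 'value1', 'key2': 'value2'}  (data was sent in the body)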

3. Use Requests and regular expressions to get web page content

Example: extract basic information about the completed novels on Qidian (qidian.com). This hands-on exercise is only for learning and exchange; please support original works by subscribing and reading on the Qidian Chinese website.

Main tasks: analyze the structure of the web page, use requests to obtain the page source, use regular expressions to express the extraction rules and extract the novel titles, URLs, and author information, and save the extracted information to a CSV file.

1. Analyze the web page structure

Figure 8 Analysis of web page structure

2. Segment extraction

Regular expression: 'li data-rid(.*?)</em'

3. Extract the title of the novel

Regular expression: 'data-bid="\\d*">(.*?)</a'

4. Extract the URL of the novel

Regular expression: 'href="(.*?)" data-bid'

5. Extract author information

Regular expression: 'class="author".*?_blank">(.*?)</a>'
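To see how these patterns work together before looking at the full program, here is a small sketch applied to a hypothetical HTML fragment whose structure merely imitates a list item on the page (the fragment itself is made up for illustration):

import re

# A made-up fragment imitating one list item of the finished-novels page
sample = ('<li data-rid="1"><a href="//book.qidian.com/info/1001" data-bid="1001">Example Novel</a>'
          '<a class="author" target="_blank">Example Author</a><em>...</em>')

segments = re.findall('li data-rid(.*?)</em', sample, re.S)
for seg in segments:
    title = re.findall('data-bid="\\d*">(.*?)</a', seg, re.S)[0]
    url = 'https:' + re.findall('href="(.*?)" data-bid', seg, re.S)[0]
    author = re.findall('class="author".*?_blank">(.*?)</a>', seg, re.S)[0]
    print(title, url, author)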

6. Complete code for single-page extraction

import requests
import csv
import re

url = 'https://www.qidian.com/finish'
content_list = []
response = requests.get(url).content.decode('utf-8')
content_list_1 = re.findall('li data-rid(.*?)</em', response, re.S)

for content in content_list_1:
    result = {}
    result['title'] = re.findall('data-bid="\\d*">(.*?)</a', content, re.S)[0]
    result['url'] = 'https:' + re.findall('href="(.*?)" data-bid', content, re.S)[0]
    result['author'] = re.findall('class="author".*?_blank">(.*?)</a>', content, re.S)[0]

    content_list.append(result)

with open('qidian_finish.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'url', 'author'])
    writer.writeheader()
    writer.writerows(content_list)

7. Complete code for multi-page extraction

import requests
import csv
import re

url = "https://www.qidian.com/finish?action=hidden&orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=2&page="
content_list = []
for page in range(1, 6):
    response = requests.get(url + str(page)).content.decode('utf-8')

    content_list_1 = re.findall('li data-rid(.*?)</em', response, re.S)

    for content in content_list_1:
        result = {}
        result['title'] = re.findall('data-bid="\\d*">(.*?)</a', content, re.S)[0]
        result['url'] = 'https:' + re.findall('href="(.*?)" data-bid', content, re.S)[0]
        result['author'] = re.findall('class="author".*?_blank">(.*?)</a>', content, re.S)[0]

        content_list.append(result)

with open('qidian_finish.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'url', 'author'])
    writer.writeheader()
    writer.writerows(content_list)

Figure 9 The effect of saving data to a csv file
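As discussed in the earlier section on interfering with a website's normal operation, it is good practice to pause briefly between page requests. A minimal, hypothetical adjustment to the page loop above (shown here only as a skeleton):

import time

for page in range(1, 6):
    # response = requests.get(url + str(page)).content.decode('utf-8')
    # ... parse the page and append the results as above ...
    time.sleep(1)  # wait one second between pages to avoid stressing the server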

4. Homework

Find a novel you like on Nunu Bookstore ( https://www.kanunu8.com/ ), such as the Jin Yong series or the Four Great Classic Novels, use the requests library and regular expressions to crawl the chapter URLs and chapter names, and save them to a CSV file. Files to submit for this assignment: the .py file and the .csv file.

References

[1] Xie Qiankun. Python Crawler Development: From Beginner to Practice. People's Posts and Telecommunications Press.

[2] Requests documentation. https://requests.readthedocs.io/zh_CN/latest/index.html

[3] POST data size limits. https://blog.csdn.net/w8998036/article/details/105971328
