You Have to Learn Python Secretly, and Then Stun Everyone (Day 7)

Article Directory

  • Foreword
  • Welcome to our circle
  • A first look at crawlers
  • Why crawlers
  • General crawler architecture
  • The working steps of a crawler
  • Characteristics of a good crawler
  • 1. High performance
  • 2. Scalability
  • 3. Robustness
  • 4. Friendliness
  • First hands-on with crawlers
  • requests.get()
  • Four properties commonly used in Response objects

Foreword

Previously: You Have to Learn Python Secretly, and Then Stun Everyone (Day 6)

As I said yesterday, we are going to learn about crawlers today. Yes, we start crawling today.


This series assumes you have some C or C++ background, because I picked up Python after learning a bit of C++.
This series also assumes you know how to search for answers yourself. For this "modules" part in particular, I still recommend having your own editor and interpreter; I already made recommendations in the previous article.

This series also focuses on building your ability to work things out on your own. After all, I cannot cover every single knowledge point, so the ability to solve problems yourself is especially important. Please do not treat the gaps I leave in the text as traps; they are exercises I leave for you, so use whatever means you have and solve them yourselves.

If you are a novice, you can take a look at the following paragraph:

Welcome to our circle

If you run into difficulties while learning and would like a Python learning and discussion group, you are welcome to join us.

A first look at crawlers

I'm no expert, so I won't throw a pile of fancy crawler techniques at you right away; let's take it step by step.

A web crawler, also called a web spider, fetches web page content based on the web page address (URL). The URL is the link we type into the browser; for example, https://www.baidu.com/ is a URL.

Why crawlers

General-purpose search engines process Internet web pages, and there are currently tens of billions of them. The first problem a search engine faces is therefore: how to design an efficient download system that transfers such a huge volume of web page data to local storage, forming a local mirror of the Internet's web pages.

Web crawlers play exactly this role and complete this arduous task; they are a critical, fundamental component of a search engine system.

Take a very common example: Baidu. Baidu continuously crawls tens of thousands of websites and stores the pages on its own servers. When you search on Baidu, you are essentially searching information on its servers. The results you get are hyperlinks, and following a hyperlink takes you to the other website.

General crawler architecture

(image: general crawler architecture)

Okay, does the diagram above make sense? If not, let's look at a flow chart of how a user visits a website:

(image: flow chart of a user visiting a website)

This is a human-computer interaction process; now let's see which jobs in this loop a crawler can take over:

(image: the steps a crawler takes over in this loop)

Yes, it is very consistent with the characteristics of our "artificial intelligence" and frees our hands.

The working steps of a crawler

Step 1: Fetch the data. Based on the URL we provide, the crawler sends a request to the server, which then returns data.

Step 2: Parse the data. The crawler parses the data returned by the server into a format we can read.

Step 3: Extract the data. The crawler then extracts the data we need from it.

Step 4: Store the data. The crawler saves the useful data so that you can use and analyze it later.

This is how crawlers work. No matter how the learning content changes later, the core is always this principle.

This chapter aims to give you a plain, intuitive understanding of crawlers, so I won't expand on too many unnecessary concepts.
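To make the four steps concrete, here is a minimal sketch that fetches a page, parses out its <title>, and stores it in a file. The URL https://www.baidu.com/ and the UTF-8 assumption are just stand-ins, and the title extraction uses Python's built-in html.parser; real crawlers usually use a dedicated parsing library instead.

import requests
from html.parser import HTMLParser

# Step 1: fetch the data
res = requests.get('https://www.baidu.com/')
res.encoding = 'utf-8'  # assume the page is UTF-8 encoded

# Steps 2 and 3: parse the HTML and extract the <title> text
class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''
    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed(res.text)

# Step 4: store the extracted data
with open('title.txt', 'w', encoding='utf-8') as f:
    f.write(parser.title)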

Characteristics of a good crawler

Come to think of it, good code in general seems to share these characteristics. But if someone could rattle off the characteristics of a good architecture on the spot, my eyes would light up and I would exclaim: "Boss, take me with you!"

1. High performance

The performance here mainly refers to how fast the crawler downloads web pages. A common way to evaluate it is the number of web pages the crawler can download per second: the more pages it can download per unit of time, the higher its performance.

To improve crawler performance, how the program accesses the disk (disk I/O) and which data structures are chosen are critical design decisions. For example, the queue of URLs waiting to be crawled and the collection of URLs already crawled both become very large, and different implementations differ enormously in performance, so efficient data structures have a great impact on crawler performance.
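As an illustration of why the data structures matter, here is a minimal sketch of a URL frontier, assuming nothing beyond the standard library: a deque holds the URLs waiting to be crawled and a set remembers the URLs already seen, so duplicate checks and queue operations stay cheap even when the number of URLs is huge. The function names are mine, not a standard API.

from collections import deque

to_crawl = deque(['https://www.example.com/'])  # URLs waiting to be fetched
seen = set(to_crawl)                            # URLs already queued or fetched

def add_url(url):
    # O(1) membership test keeps duplicate URLs out of the queue
    if url not in seen:
        seen.add(url)
        to_crawl.append(url)

def next_url():
    # O(1) pop from the left gives a breadth-first crawl order
    return to_crawl.popleft() if to_crawl else None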

2. Scalability

Even if a single crawler performs very well, it still takes a long time to download all the web pages. To shorten the crawl cycle as much as possible, the crawler system should have good scalability, that is, it should be easy to add crawl servers and crawlers.

In practice, large-scale web crawlers must run in a distributed manner: multiple servers are dedicated to crawling, each server runs several crawlers, and each crawler runs multiple threads, increasing concurrency in every way possible.

For giant search engine providers, it may even be necessary to deploy data centers in different regions around the world and assign crawlers to them, which helps the overall performance of the crawler system a great deal.
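A full distributed crawler is beyond this series, but even on a single machine you can increase concurrency with threads. Below is a minimal sketch using concurrent.futures.ThreadPoolExecutor to download a few pages in parallel; the URL list is just an example.

import requests
from concurrent.futures import ThreadPoolExecutor

urls = [
    'https://www.baidu.com/',
    'https://www.example.com/',
]

def fetch(url):
    # each worker thread downloads one page and reports its size
    res = requests.get(url, timeout=10)
    return url, len(res.content)

with ThreadPoolExecutor(max_workers=4) as pool:
    for url, size in pool.map(fetch, urls):
        print(url, size, 'bytes')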

3. Robustness

A crawler has to visit all kinds of web servers, so it may run into many abnormal situations: the HTML of a page is not well formed, the server being crawled suddenly goes down, or the crawler even falls into a crawler trap. It is very important that the crawler handles these situations correctly; otherwise it may stop working at unpredictable times, which is unacceptable.

Looked at from another angle: suppose the crawler program dies during the crawl, or the server it runs on goes down. A robust crawler should be able to restore the previously crawled content and data structures when it is started again, instead of redoing all the work from scratch every time. This, too, is part of a crawler's robustness.
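One simple way to get this kind of resumability is to checkpoint the crawl state to disk at regular intervals and reload it at startup. The sketch below saves the pending and seen URLs to a JSON file; the file name and layout are only an assumption for illustration.

import json
import os

STATE_FILE = 'crawl_state.json'  # hypothetical checkpoint file

def save_state(to_crawl, seen):
    # write the current frontier and visited set to disk
    with open(STATE_FILE, 'w', encoding='utf-8') as f:
        json.dump({'to_crawl': list(to_crawl), 'seen': list(seen)}, f)

def load_state():
    # restore a previous crawl instead of starting from scratch
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, encoding='utf-8') as f:
            data = json.load(f)
        return data['to_crawl'], set(data['seen'])
    return [], set()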

4. Friendliness

The friendliness of a crawler has two meanings: one is to respect the parts of a website its owner wants kept private; the other is to reduce the network load on the website being crawled. Crawlers visit all kinds of websites, and site owners may have content they do not want everyone to find, so a convention is needed to tell crawlers which content must not be crawled. At present there are two mainstream ways to achieve this: the crawler-exclusion protocol (robots.txt) and no-crawl markers inside the web page itself.

This point will be explained in detail later.
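Although the details come later, it is worth knowing that Python's standard library already ships a parser for robots.txt. A minimal sketch, using Baidu's robots.txt only as an example:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')
rp.read()  # download and parse the site's robots.txt

# ask whether a crawler with user agent '*' may fetch a given URL
print(rp.can_fetch('*', 'https://www.baidu.com/s?wd=python'))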

First hands-on with crawlers

The first step of a web crawler is to obtain the HTML of a web page from its URL. In Python 3, you can use either urllib.request or requests to fetch web pages.

The urllib library is built into Python; no extra installation is needed, it is available as soon as Python is installed.
The requests library is a third-party library that we have to install ourselves (for example with pip install requests).

The basic methods of the requests library are as follows:

(image: table of the basic methods of the requests library)
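Since that table is an image, here is a quick recap in code comments of the request methods you will meet most often; each call returns a Response object, and requests.get() is the one used throughout this article.

import requests

# requests.get(url)      fetch a resource (what we use in this article)
# requests.post(url)     submit data to the server
# requests.head(url)     fetch only the response headers
# requests.put(url)      upload or replace a resource
# requests.delete(url)   delete a resource
res = requests.get('https://www.baidu.com/')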

requests.get()

Look at a piece of pseudocode:

import requests
# import the requests library
res = requests.get('URL')
# requests.get() calls the get() method of the requests library.
# It sends a request to the server; the argument in the parentheses is the address where the data you need lives,
# and the server responds to that request.
# We assign the returned response to the variable res.

I was just telling people in the group that the most important thing in learning Python is to build the foundations, starting from data types and data structures. So let's see what data type the crawler gets back.

Let's just grab a URL first, say the URL of the little turtle image from the beginning: https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1604032500192&di=67b6cdd3eb1722f845fd0cc39625b386&imgtype=0&src=http%3A%2F%2Fwx1.sinaimg.cn%2Flarge%2F006m97Kgly1g5voen881dj30ag0aawfo.jpg

The URL is a bit long, but it will do for our experiment.

import requests
res = requests.get('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1604032500192&di=67b6cdd3eb1722f845fd0cc39625b386&imgtype=0&src=http%3A%2F%2Fwx1.sinaimg.cn%2Flarge%2F006m97Kgly1g5voen881dj30ag0aawfo.jpg')
print(type(res))
# print the data type of the variable res

Result: <class 'requests.models.Response'>

Four properties commonly used in Response objects

(image: table of the four commonly used Response properties)

The first is status_code, a very commonly used attribute that tells you whether the request succeeded; you can print its value to see.

(image: the printed status_code)
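A common pattern is to check the status code before using the response: 200 means success, and raise_for_status() turns error codes into exceptions. A small sketch:

import requests

res = requests.get('https://www.baidu.com/')
print(res.status_code)        # 200 means the request succeeded

if res.status_code == 200:
    print('request OK')
else:
    res.raise_for_status()    # raises an HTTPError for 4xx/5xx responses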

The next attribute is response.content, which returns the content of the Response object as binary data; it is suitable for downloading pictures, audio and video. An example makes it clear. Let's download that little turtle:

import requests
res = requests.get('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1604032500192&di=67b6cdd3eb1722f845fd0cc39625b386&imgtype=0&src=http%3A%2F%2Fwx1.sinaimg.cn%2Flarge%2F006m97Kgly1g5voen881dj30ag0aawfo.jpg')
# send the request and store the returned result in the variable res
pic = res.content
# take the content of the Response object as binary data
photo = open('乌龟.jpg', 'wb')
# create a new file 乌龟.jpg; since no path is given, it is saved in the directory the program runs in.
# image content must be written in binary mode ('wb'). You met this mode when learning the open() function.
photo.write(pic)
# write the binary content of pic into the file
photo.close()
# close the file
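The open()/close() pair above works, but a with block closes the file automatically even if an error occurs. A slightly safer variant of the same download, under the same assumptions, looks like this:

import requests

url = ('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000'
       '&sec=1604032500192&di=67b6cdd3eb1722f845fd0cc39625b386&imgtype=0'
       '&src=http%3A%2F%2Fwx1.sinaimg.cn%2Flarge%2F006m97Kgly1g5voen881dj30ag0aawfo.jpg')
res = requests.get(url)
if res.status_code == 200:
    with open('乌龟.jpg', 'wb') as photo:   # 'wb': write binary data
        photo.write(res.content)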

You can also crawl little photos from websites yourself. Some of you may ask: how do I find the URL of a photo? It's easy: right-click the photo, open the image in a new tab, and the URL is right there in the address bar.

If that doesn't work, just drag the photo from this article into a new browser window, and the URL will appear there.

Well, that's about it for today's practice.

After response.content, let's continue with response.text. This attribute returns the content of the Response object as a string; it is suitable for downloading text and web page source code.

Look carefully: what you get back is the page's source code.

Come on, just pick a URL, for example the URL of my article, and let's try it:

import requests
# import the requests library
res = requests.get('https://mp.toutiao.com/profile_v4/graphic/publish?pgc_id=6889211245900071428')
novel = res.text
# take the content of the Response object as a string
k = open('《第七天》.txt', 'a+')
# create a txt file named 《第七天》; mode 'a+' puts the pointer at the end of the file and appends
k.write(novel)
# write the string into the file
k.close()
# close the file

Finally, let's look at the last attribute: response.encoding, which lets us specify the encoding used to decode the Response object.

First of all, the encoding of the target data is not known in advance. After we send a request with requests.get(), we get a Response object, and the requests library makes its own guess about the encoding of the data. But! That guess may or may not be accurate.

If the guess is right, the response.text we print looks normal with no garbled characters, and res.encoding is not needed. If the guess is wrong, we get a pile of garbled characters; in that case we can look up the actual encoding of the target data and use res.encoding to set the encoding to that same type.
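In practice the pattern looks like this: inspect the guessed encoding, and if the text comes out garbled, override it. requests also exposes apparent_encoding, a guess based on the response body itself, which is often a good value to fall back to.

import requests

res = requests.get('https://www.baidu.com/')
print(res.encoding)            # encoding requests guessed from the response headers
print(res.apparent_encoding)   # encoding guessed from the response body

# if res.text is garbled, override the encoding before reading res.text again
res.encoding = res.apparent_encoding
print(res.text[:200])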
