I want to learn Python secretly, and then shock everyone (day 7)

[Image: the ad with the Python learning mind map]

No offense intended by the title; I just think this ad is fun.
Feel free to take the mind map above if you like it. I can't learn all of it anyway.

Preface

Previous installment: I want to learn Python secretly, and then shock everyone (day 6)

As I said the other day, today we start learning about crawlers. Yes, today we start crawling.


This series assumes you have some C or C++ background, since I picked up Python after learning only the basics of C++. Thanks also to senior Qi Feng for his support.
This series also assumes you know how to look things up yourself. For the chapter on modules, I still recommend having your own editor and interpreter; I made recommendations in the previous post.

I don't ask for much, just a follow.
As for the outline of this series, to be honest I personally lean toward the two Primer Plus books, so we will follow their chapter structure.

This series also aims to build your ability to work things out on your own. I can't possibly cover every detail, so the ability to solve your own problems is especially important. Please don't treat the gaps I leave in the text as pitfalls: they are exercises I leave for you. Solve them however you can.

If you are a complete beginner, you may want to read the following paragraph:

Welcome to our circle

I set up a Python Q&A group. Friends who are interested can check it out: what kind of group is this?

Portal to the group: Portal


A first look at crawlers

Like most of you, this is my first time playing with crawlers; before this, I was the one being crawled.
I'm no expert, though, so I won't show off a pile of fancy crawling tricks right away. Let's take it step by step.

A web crawler is also called a web spider. It fetches web content according to a web address (URL), and a URL is simply the link we type into the browser. For example, https://www.baidu.com/ is a URL.

Why crawlers?

General-purpose search engines process Internet web pages, and there are now tens of billions of them. The first problem a search engine faces is therefore how to design an efficient download system that transfers such a huge volume of page data to local storage, forming a local mirror of the web.

Web crawlers play exactly this role and take on this difficult task; they are a critical and fundamental component of a search engine system.

Take a very common example: Baidu.
Baidu is a company that continuously crawls thousands of websites and stores the pages on its own servers. When you search on Baidu, you are essentially searching its servers. The results you get are hyperlinks; following those hyperlinks takes you to other websites.

General crawler architecture

OK, does the general crawler architecture make sense to you? If not, let's first look at a flow chart of a user visiting a website:

[Flow chart: a user visiting a website through a browser]

This is a human-computer interaction loop, so let's take a look at which parts of this loop a crawler can take over:
[Flow chart: the crawler taking the user's place in this loop]

Yes, very much in the spirit of "artificial intelligence": it frees our hands.

Crawler work steps

Step 1: Fetch the data. The crawler sends a request to the server based on the URL we provide, and the server returns data.

Step 2: Parse the data. The crawler parses the data returned by the server into a format we can read.

Step 3: Extract the data. The crawler then extracts the pieces of data we actually need.

Step 4: Store the data. The crawler saves the useful data so that you can use and analyze it later.
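To make the four steps concrete, here is a minimal sketch that uses only the requests library and plain string handling. The URL, the naive title extraction, and the output file name are all illustrative assumptions; real parsing will come in a later post.

import requests

# Step 1: fetch -- send a request to the server for the URL we provide.
res = requests.get('https://www.example.com/')   # illustrative URL

# Step 2: parse -- turn the raw response into readable text.
html = res.text

# Step 3: extract -- naively pull out the piece we need (here, the page <title>).
start = html.find('<title>')
end = html.find('</title>')
title = html[start + len('<title>'):end] if start != -1 and end != -1 else ''

# Step 4: store -- save the useful data for later use and analysis.
with open('result.txt', 'w', encoding='utf-8') as f:
    f.write(title)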

That, in essence, is how crawlers work. However the material changes later on, the core is always these crawler principles.

This chapter aims to give you a straightforward understanding of crawlers, so I won't branch out into too many unnecessary concepts.

The characteristics of an excellent crawler

Good crawler code seems to share the following characteristics.
Can anyone describe the characteristics of a good architecture in a way that makes my eyes light up and has me exclaiming, "Big brother, take me with you"?

1. High performance

Performance here mainly means how fast the crawler downloads web pages. A common metric is the number of pages the crawler can download per second: the more pages downloaded per unit of time, the higher the crawler's performance.

To improve crawler performance, two design decisions matter a great deal: how the program accesses the disk (disk I/O) and which data structures it uses. For example, consider the queue of URLs waiting to be crawled and the set of URLs already crawled. The number of URLs is enormous, and different implementations differ hugely in performance, so efficient data structures have a big impact on overall crawler performance.
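As a minimal illustration of that data-structure choice (a sketch, not any particular crawler's implementation): a deque gives O(1) pops from the front of the to-crawl queue, and a set gives O(1) membership tests for URLs we have already seen. The seed URL and the link-extraction step below are placeholders.

from collections import deque

to_crawl = deque(['https://www.example.com/'])   # frontier: URLs waiting to be fetched
seen = set(to_crawl)                             # URLs already queued or fetched

while to_crawl:
    url = to_crawl.popleft()                     # O(1) pop from the front of the deque
    # ... download `url` here and extract new links into `new_links` ...
    new_links = []                               # placeholder for extracted links
    for link in new_links:
        if link not in seen:                     # O(1) membership test with a set
            seen.add(link)
            to_crawl.append(link)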

2. Scalability

Even if a single crawler performs well, it still takes a long time to download all the pages locally. To shorten the crawl cycle as much as possible, the crawler system should scale well; that is, it should be easy to add more crawl servers and more crawlers to reach that goal.

Today's large-scale web crawlers are necessarily distributed: multiple servers are dedicated to crawling, each server runs several crawlers, and each crawler runs multiple threads, increasing concurrency in several ways at once.
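As a tiny single-machine illustration of that last point (multiple threads per crawler), here is a hedged sketch using Python's standard thread pool; the URLs are placeholders, and a real crawler would add error handling and rate limiting.

import requests
from concurrent.futures import ThreadPoolExecutor

urls = ['https://www.example.com/a', 'https://www.example.com/b']   # illustrative URLs

def fetch(url):
    # Fetch one page and report its status code.
    return url, requests.get(url).status_code

with ThreadPoolExecutor(max_workers=4) as pool:   # four threads fetching concurrently
    for url, code in pool.map(fetch, urls):
        print(url, code)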

For giant search engine service providers, data centers may also be deployed globally and in different regions, and crawlers are also assigned to different data centers. This is very helpful for improving the overall performance of the crawler system.

3. Robustness

Crawlers have to talk to all kinds of web servers and may run into many abnormal situations: non-standard HTML, a crawled server that suddenly crashes, even spider traps. It is very important that the crawler handles these situations correctly; otherwise it will keep stopping, which is unbearable.

From another angle: suppose the crawler process dies mid-crawl, or the server it runs on goes down. A robust crawler should be able to restore the content and data structures it had already built when it is restarted, instead of starting all the work from scratch every time. That, too, is part of a crawler's robustness.
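One simple way to get that kind of resumability (just a sketch under my own assumptions, not a prescribed design) is to checkpoint the frontier and the seen-set to disk periodically and reload them on startup. The file name and seed URL below are made up for illustration.

import json
import os

STATE_FILE = 'crawler_state.json'   # hypothetical checkpoint file

def save_state(to_crawl, seen):
    # Persist the current frontier and the set of URLs already handled.
    with open(STATE_FILE, 'w', encoding='utf-8') as f:
        json.dump({'to_crawl': list(to_crawl), 'seen': list(seen)}, f)

def load_state():
    # Resume from the checkpoint if it exists; otherwise start fresh from a seed URL.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, encoding='utf-8') as f:
            state = json.load(f)
        return state['to_crawl'], set(state['seen'])
    return ['https://www.example.com/'], set()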

4. Friendly

Friendliness here has two meanings: respecting the parts of a website its owner wants kept private, and keeping the network load you place on the crawled site low. Crawlers visit all kinds of websites, and site owners may not want some content to be searchable, so a convention is needed to tell crawlers which content must not be crawled. There are currently two mainstream ways to do this: a site-wide crawl exclusion protocol and per-page no-crawl markers.

This point will be explained in detail later.


Your first crawler

The first step of a web crawler is to obtain the HTML of a page from its URL. In Python 3, you can use urllib.request or requests to fetch web pages.

The urllib library is built into Python; as long as Python is installed, you can use it without installing anything extra.
The requests library is a third-party library and has to be installed separately (for example with pip install requests).

The basic methods of the requests library are as follows:
[Table: basic methods of the requests library, e.g. requests.get(), requests.post(), requests.put(), requests.delete(), requests.head(); this post only uses requests.get()]

requests.get()

Look at a piece of pseudo code:

import requests
# Import the requests library.
res = requests.get('URL')
# requests.get() calls the get() method of the requests library.
# It sends a request to the server; the argument in the parentheses is the address
# where the data you need lives, and the server responds to the request.
# We assign the returned response to the variable res.

I just told the group that the most important thing when learning Python is to lay a solid foundation, starting with data types and data structures.
So let's check what data type the crawler's return value actually is.

First we need a URL. Let's just use the URL of the little turtle image from the beginning:
http://photogz.photo.store.qq.com/psc?/V12wi4eb4HvNdv/ruAMsa53pVQWN7FLK88i5qLH0twfxCgrwzDJPH6IRZadTdk*QTPnqFYrVt5PNiU7vBOh1cvefk4UXqNZcMdzLWowRX1pF4GqWoBZ7YPq5AQ!/b&bo=eAFyAXgBcgERECc!

The URL is a bit long, but it works for the experiment.

import requests
res = requests.get('URL')   # replace 'URL' with the image address above
print(type(res))
# Print the data type of the variable res.

Result: <class 'requests.models.Response'>

Four commonly used attributes of Response objects

The four attributes are: response.status_code (the status code of the response), response.content (the response body as binary data), response.text (the response body as a string), and response.encoding (the encoding of the response).

The first is status_code. It is a very commonly used attribute for checking whether the request succeeded; you can print its return value and take a look.
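For instance, a quick sketch, where baidu.com is just a stand-in for any reachable URL:

import requests

res = requests.get('https://www.baidu.com/')
print(res.status_code)          # 200 means the request succeeded
if res.status_code == 200:
    print('Request OK')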

The next attribute is response.content, which returns the content of the Response object as binary data. This is suitable for downloading images, audio, and video. An example will make it clear.
Come on, let's crawl that little turtle I put in my QQ space:

import requests
res = requests.get('http://photogz.photo.store.qq.com/psc?/V12wi4eb4HvNdv/ruAMsa53pVQWN7FLK88i5qLH0twfxCgrwzDJPH6IRZadTdk*QTPnqFYrVt5PNiU7vBOh1cvefk4UXqNZcMdzLWowRX1pF4GqWoBZ7YPq5AQ!/b&bo=eAFyAXgBcgERECc!')
# Send the request and put the returned result in the variable res.
pic = res.content
# Return the content of the Response object as binary data.
photo = open('乌龟.jpg', 'wb')
# Create a new file 乌龟.jpg. No path is given, so it is saved in the current working directory.
# Image content must be written in binary mode ('wb'). You met this when learning the open() function.
photo.write(pic)
# Write the binary content of pic into the file.
photo.close()
# Close the file.

You can also crawl the small photos in your own space.
Some friends will ask: how do I find the URL of my photo?
It's actually easy: right-click the photo and open it in a new tab, and there's your URL.
Failing that, you can simply drag the small photo on this blog into a new browser window, and the URL will appear in the address bar.

Well, that's about it for today's hands-on practice.


After response.content, let's look at response.text. This attribute returns the content of the Response object as a string, which is suitable for text, for example downloading a page's source code.

Read that carefully: it is the page source.

Here, just pick any website, say the address of my blog, and try it out:

import requests
# Import the requests library.
res = requests.get('https://editor.csdn.net/md?articleId=109320746')
novel = res.text
# Return the content of the Response object as a string.
k = open('《第七天》.txt', 'a+')
# Create a txt file named 《第七天》; mode 'a+' puts the pointer at the end of the file and appends.
k.write(novel)
# Write the string into the file.
k.close()
# Close the file.

Next, let's look at the last attribute: response.encoding, which lets us set the encoding of the Response object.

First of all, the encoding of the target data is not known in advance. After sending a request with requests.get(), we get a Response object, and the requests library makes its own guess at the data's encoding type. But! That guess may or may not be right.

If the guess is right, the response.text we print looks normal with no garbled characters, and res.encoding is not needed. If the guess is wrong, we get a pile of mojibake; in that case we check the encoding of the target data ourselves and use res.encoding to set the encoding to match it.
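A minimal sketch of that workflow (baidu.com is only an example page, and utf-8 is only an assumption about the target page's real encoding):

import requests

res = requests.get('https://www.baidu.com/')
print(res.encoding)              # the encoding requests guessed on its own
res.encoding = 'utf-8'           # override it if res.text comes out garbled
print(res.text[:200])            # the text should now be decoded correctly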


I'm getting tired of talking, but let's cover one more thing.

Our country has laws for this

In fact, our country's laws on crawlers are still being improved, so crawl while you can.

Under normal circumstances, a server doesn't mind small crawlers, but it will reject high-frequency, large-scale, or malicious crawlers, because they put great pressure on the server or even damage it.
Servers do, however, generally welcome search engines (as I just said, crawling is one of the core technologies of Google and Baidu). Of course, this welcome comes with conditions, and those conditions are written in the Robots protocol.

The Robots protocol is a widely accepted code of ethics for Internet crawlers. Its full name is the Robots Exclusion Protocol, and it tells crawlers which pages may be crawled and which may not.
Checking a website's Robots protocol is very simple: just append /robots.txt to the site's domain name.

The two words you will see most often in the protocol are Allow and Disallow: Allow means access is permitted, Disallow means access is forbidden.
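For example, here is a quick sketch that fetches Baidu's Robots protocol; any site's domain plus /robots.txt works the same way.

import requests

res = requests.get('https://www.baidu.com/robots.txt')
print(res.text)    # lists User-agent, Allow and Disallow rules for crawlers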

The tool is in your hands; how you use it is your choice. When you crawl a website's data, don't forget to check whether its Robots protocol allows you to crawl it.

At the same time, limiting your crawler's speed, thanking the server that provides the data by not putting too much pressure on it, and helping keep good order on the Internet is what we should all do.
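A minimal way to do that in the sketches above (the one-second pause and the URL list are just illustrative assumptions):

import requests
import time

urls = ['https://www.example.com/page1', 'https://www.example.com/page2']   # illustrative URLs
for url in urls:
    res = requests.get(url)
    # ... process res here ...
    time.sleep(1)    # pause between requests so we don't stress the server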

That's all for today. Next time, we will parse those web pages and extract what we want from them.
