Do you really understand crawlers? After reading this, you will have a deeper and more comprehensive understanding of web crawlers.

Preface

Crawling is a very interesting technology. With it you can obtain data that others cannot easily get or would have to pay for, and you can automatically crawl and save large amounts of data, saving the time and energy that tedious manual work would otherwise cost.

For many people who learn programming, it would be a lot less fun without playing with crawlers at some point. Whether as a hobby, for side projects, or as a profession, the crawler world is genuinely exciting.

Today I will briefly talk about crawlers. The purpose is to let friends who are preparing to learn crawlers or just starting out have a deeper and more comprehensive understanding of crawlers.



Article directory
    • Preface
    • 1. Understanding crawlers
      • 1. What is a crawler?
      • 2. Classification of crawlers
      • 3. The Robots protocol
    • 2. Basic process of crawler
      • 1. The four steps of a crawler
      • 2. Request and Response
    • 3. Understand Request
      • 1. Request method
      • 2. Request URL
      • 3. Request header
      • 4. Request body
      • 5. A practical look at Request
    • 4. Understand Response
      • 1. Response status
      • 2. Response header
      • 3. Response body
    • 5. What kind of data can the crawler obtain?
    • 6. How to parse data?
    • 7. How to save data?

1. Understanding crawlers

1. What is a crawler?

Here is the well-known one-sentence definition of a crawler: an automated program that requests websites and extracts data.

Let’s break that definition down:

Requesting a website means sending a request to it. For example, when you go to Baidu and search for the keyword "Python", your browser sends a request to Baidu's site;

Extracting data: pictures, text, videos and so on all count as data. After we send the request, the website presents the search results to us, which is really it returning data, and at that point we can extract it;

An automated program is the code we write to automate this whole process, for example downloading and saving the returned pictures in batches instead of saving them by hand one at a time.



2. Classification of crawlers

According to usage scenarios, crawlers can be divided into three categories:

① General-purpose crawler (large and comprehensive)
Powerful and wide-ranging in what it collects, usually used in search engines. Baidu's search engine, for example, relies on a large crawler program.

② Focused crawler (small but refined)
Relatively simple in function, it only crawls specific content from specific websites, for example fetching certain data from one site in batches. This is also the kind of crawler individuals use most often.

③ Incremental crawler (collects only updated content)
This is essentially an iterative extension of the focused crawler. It collects only newly updated data and ignores old data; it effectively runs all the time, and whenever data that meets the requirements is updated, the new data is crawled automatically.



3. The Robots protocol

In the crawler world there is a protocol you need to pay attention to called Robots, also known as the "robots exclusion standard". Its job is to tell you what a website does and does not allow to be crawled.

Where can you see this Robots protocol? Normally you can view it by appending /robots.txt to the website's home page URL. Baidu's Robots protocol, for example, is at https://www.baidu.com/robots.txt. You can see it lists many URLs that must not be crawled; for instance, Disallow: /shifen/ means that /shifen and the pages under that subdirectory cannot be crawled.
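If you want to check this programmatically, Python's standard library ships a robots.txt parser. Here is a minimal sketch using urllib.robotparser (the user-agent string and paths below are just examples):

```python
from urllib import robotparser

# Download and parse Baidu's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")
rp.read()

# Ask whether a given user agent is allowed to fetch a given URL
print(rp.can_fetch("Baiduspider", "https://www.baidu.com/shifen/some_page"))  # likely False, per the Disallow: /shifen/ rule
print(rp.can_fetch("Baiduspider", "https://www.baidu.com/"))
```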

In fact, the Robots protocol is a gentleman's agreement. For crawlers it is basically only morally binding: if you violate it you may be held legally responsible, but if you never violated it a crawler could hardly collect any data, so in practice both sides usually turn a blind eye. Just don't be too aggressive about it.



2. Basic process of crawler

1. The four steps of a crawler

How does a crawler work? A crawler program can be roughly divided into four steps:

① Initiate a request
Send a Request to the target site through an HTTP library. The request can carry extra headers and other information, then wait for the server to respond.

② Get the response content
If the server responds normally, you get a Response. Its content is the page content you wanted, and its type may be HTML, a JSON string, binary data (such as pictures and videos), and so on.

③ Parse the content
If it is HTML, it can be parsed with regular expressions or a web page parsing library; if it is JSON, it can be converted directly into a JSON object and parsed; if it is binary data, it can be saved or processed further.

④ Save the data
The data can be saved in many forms: as plain text, into a database, or as a file in a specific format.

Basically these are the four steps the crawler follows.
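To make the four steps concrete, here is a minimal sketch using the third-party requests and BeautifulSoup (bs4) libraries; the URL and the output file name are placeholders, not anything from this article:

```python
import requests
from bs4 import BeautifulSoup

# ① Initiate a request: send a Request with an extra User-Agent header
url = "https://example.com"              # placeholder target site
headers = {"User-Agent": "Mozilla/5.0"}  # disguise as an ordinary browser
response = requests.get(url, headers=headers, timeout=10)

# ② Get the response content: here it is an HTML page
html = response.text

# ③ Parse the content: use a web page parsing library to pull out what we need
soup = BeautifulSoup(html, "html.parser")
title = soup.title.get_text() if soup.title else ""

# ④ Save the data: write it out as plain text
with open("result.txt", "w", encoding="utf-8") as f:
    f.write(title)
```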


2. Request and Response

Request and Response are the two most important parts of a crawler. What is the relationship between them? Roughly this:

Put simply, when we search for something in a browser, such as searching for "Python" on Baidu as before, the moment you click, a Request has already been sent to Baidu's server. The Request contains a lot of information, such as identity information and request details. After receiving it, the server makes a judgment and returns a Response to our computer, which also contains a lot of information, for example whether the request succeeded and the results we asked for (text, pictures, videos, etc.).

This should be easy to understand, right? Next, let's take a closer look at Request and Response.


3. Understand Request

What does a Request contain? Mainly the following:

1. Request method

The request method can be understood as the way you greet the website. If you want data from it, you have to greet it in the right way so it will pay attention to you, just as when you want to borrow something from someone, you knock on the door and say hello first. If you climb straight in through the window, anyone who sees you will throw you out.


The main request methods are GET and POST, along with HEAD, PUT, DELETE, OPTIONS and others. The most commonly used is GET.
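With the popular requests library, for example, each HTTP method maps to a function of the same name. A quick sketch (httpbin.org is just a public test service used here for illustration):

```python
import requests

# GET: ask for a resource; query parameters are appended to the URL
r = requests.get("https://httpbin.org/get", params={"wd": "Python"})

# POST: submit data (e.g. a form) in the request body
r = requests.post("https://httpbin.org/post", data={"keyword": "Python"})

# Other methods work the same way
r = requests.head("https://httpbin.org/get")
print(r.status_code)
```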


2. Request URL

What is a URL? URL stands for Uniform Resource Locator. Every web document, picture, video and so on has a unique URL. In crawlers we can simply think of it as the web address or link.
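If you want to see what a URL is made of, Python's standard urllib.parse can split one into its parts. A small illustration (the URL is just an example):

```python
from urllib.parse import urlparse

url = "https://www.baidu.com/s?wd=Python"
parts = urlparse(url)
print(parts.scheme)   # https         (protocol)
print(parts.netloc)   # www.baidu.com (host)
print(parts.path)     # /s            (resource path)
print(parts.query)    # wd=Python     (query string)
```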


3. Request header

What is a request header? Request Headers are the header information carried with a request, such as User-Agent, Host, Cookies, etc.

These amount to your identity information when you send a request to a website. You often need to disguise yourself here and pretend to be an ordinary user, so the target site does not recognize you as a crawler program; this sidesteps some anti-crawling measures and lets you get the data successfully.
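In practice, disguising usually just means passing a headers dictionary along with the request. A minimal sketch with the requests library (the User-Agent string is one example you could copy from your own browser):

```python
import requests

headers = {
    # Pretend to be an ordinary Chrome browser rather than a script
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
}

response = requests.get("https://www.baidu.com/s",
                        params={"wd": "Python"},
                        headers=headers)
print(response.status_code)
```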


4. Request body

Officially, the request body is the additional data carried with a request, such as the form data when submitting a form.

How to understand it? Say you go to your future father-in-law's house to propose marriage: you can't show up empty-handed. You have to bring something so it looks like a proper proposal, and only then will he agree to betroth his daughter to you. That is basic etiquette nobody can skip.


How does this translate to crawlers? For some pages you must log in first, or you must tell the site what you are requesting. For example, if you search for "Python" on Baidu, the keyword "Python" is the request content you carry; only when Baidu sees it does it know what you want to do.

Of course, the request body is usually used with the POST method; with GET, the parameters are usually spliced into the URL instead. It is enough to understand this for now; writing real crawlers later will deepen the understanding.
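As a small sketch of that difference (the login URL and field names below are hypothetical): with GET the keyword is spliced into the URL as a query string, while with POST it travels in the request body:

```python
import requests

# GET: the parameter ends up in the URL, e.g. https://www.baidu.com/s?wd=Python
requests.get("https://www.baidu.com/s", params={"wd": "Python"})

# POST: the form data is carried in the request body, e.g. a (hypothetical) login form
requests.post("https://example.com/login",  # placeholder URL
              data={"username": "test", "password": "123456"})
```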


5. A practical look at Request

Now that we have talked about the theory of Request, we can go and see in practice where Request is located and what it contains.

Taking Google Chrome as an example, I enter the keyword "Python" and get a pile of search results. Now let's use the browser's built-in developer tools to analyze the Request we just sent.

Press F12, or right-click a blank part of the page and choose "Inspect", and you will see the developer tools with many tabs in the top menu bar. For beginner crawlers the two most used are Elements and Network; the others are not needed yet and only come up with more advanced crawling, for example the Application tab when reverse-engineering JavaScript. You will get to those later.

Elements contains every element of the rendered result, for example the source of every picture. In particular, after you click the small arrow in the upper-left corner, whatever you hover over on the page is highlighted as source code in the Elements panel.


Network is the panel crawlers use most: it shows the network traffic, including our Request. Let's take a look. In the Network panel, check Disable cache and click All.


Refresh the page to see the effect. You can see that 132 requests were issued. Don't be surprised: although we only searched Baidu for "Python", many of these are requests the page itself makes along the way.


There are many types in the list, such as png and jpeg, but if you scroll to the top, the Type column shows an entry of type document, meaning a web document. Click it and you get the information about our Request.


After clicking the document entry, a new panel appears. Under the Headers tab we can see the Request URL, which is the request URL we talked about earlier: the address we actually requested from the site. There is also the request method, which you can see is GET.


Scroll down further and you can also see the request headers we discussed earlier. There is a lot of information, but the User-Agent, Host, and Cookie we mentioned are all there. This is the information we hand to the server.


Although Request Headers contain a lot of entries, and we do need to do some disguising here when writing a crawler, we do not have to copy everything. We can selectively include the important items: User-Agent must be carried, Referer and Host are optional, and Cookie is carried when you need to be logged in. Those four are the ones most commonly used for disguise.

As for the request body, I won't look at it here, because this is a GET request and the request body only shows up for POST requests. That's fine; it will become clear naturally once you start writing crawlers.


4. Understand Response

A Response mainly consists of three parts. Let's look at them one by one.

1. Response status

After we send a request, the website returns a Response that includes a response status code. Status codes fall roughly into the following groups:

① 2xx range: for example, status code 200 means success.

② 3xx range: for example, 301 means a redirect.

③ 4xx range: for example, 404 means the page was not found.

④ 5xx range: for example, 502 means a server error (bad gateway).

For crawlers, 2xx and 3xx are the statuses we most want to see, since they mean the data can be obtained; 4xx and 5xx basically mean we cannot get the data.

For example, for the Request we just sent, under General in the Headers tab of the document entry we can see the status code is 200, meaning the site responded to our request successfully.
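In a crawler program you would typically check this status code before trying to parse anything. A minimal sketch:

```python
import requests

response = requests.get("https://www.baidu.com",
                        headers={"User-Agent": "Mozilla/5.0"})

if response.status_code == 200:
    print("Success, safe to extract the data")
else:
    # 4xx / 5xx usually means we cannot get the data this way
    print("Request failed with status", response.status_code)
```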



2. Response header

The information the server returns also includes response headers, which carry the content type, content length, server information, cookie settings, and so on.

In fact, response headers are not that important for us here; just be aware that they exist.


3. Response body

This one is very important. Besides the response status in the first point above, this is the part that matters, because it contains the content of the requested resource, such as a page's HTML or an image's binary data.

Where is the response body? It is in the Response tab of the document entry. Scroll down and you can see plenty of response data; this is the data we obtained. Some of it can be downloaded directly, while other parts need technical parsing before we can get at them.
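In code, the response body is simply what you read off the Response object: text for an HTML page, raw bytes for an image. A small sketch (the image URL and file name are placeholders):

```python
import requests

headers = {"User-Agent": "Mozilla/5.0"}

# HTML page: the response body is text
page = requests.get("https://www.baidu.com", headers=headers)
print(page.text[:200])        # first 200 characters of the HTML

# Image: the response body is binary data, so save the raw bytes
img = requests.get("https://example.com/some_picture.png", headers=headers)  # placeholder URL
with open("picture.png", "wb") as f:
    f.write(img.content)
```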


