Sesame HTTP: Fundamentals of Crawlers

We can compare the Internet to a large web, and crawlers (i.e., web crawlers) are the spiders that crawl across it. If the nodes of this web are compared to web pages, then crawling a node is equivalent to visiting that page and obtaining its information. The connections between nodes can be compared to the links between pages, so after the spider passes through one node, it can continue crawling along those connections to reach the next node, that is, it keeps obtaining subsequent pages through the current page. In this way, every node of the web can be crawled by the spider, and the data of a website can be collected.

1. Crawler overview

Simply put, a crawler is an automated program that fetches web pages and extracts and saves information, as outlined below.

(1) Get the webpage

The first job a crawler has to do is to get the web page, that is, its source code. The source code contains the useful information of the page, so as long as we obtain the source code, we can extract the desired information from it.

The concepts of request and response were discussed earlier. We send a request to the website's server, and the response body returned is the source code of the web page. So the most critical part is to construct a request, send it to the server, and then receive and parse the response. How can this process be implemented? Surely we can't copy the source code of every web page by hand, right?

Don't worry, Python provides many libraries to help us with this, such as urllib and requests. We can use these libraries to implement HTTP request operations. Both the request and the response can be represented by the data structures these libraries provide. After getting the response, we only need to parse its body to obtain the source code of the web page. In this way, a program can carry out the process of fetching web pages for us.
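
Here is a minimal sketch of this step using the requests library; the URL is only a placeholder for whatever page you want to fetch.

import requests

url = "https://example.com"        # placeholder target page
response = requests.get(url)       # construct and send the HTTP request
response.raise_for_status()        # raise an error for 4xx/5xx status codes
html = response.text               # the response body, i.e. the page source code
print(html[:200])                  # preview the first 200 characters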

(2) Extract information

After obtaining the source code of the web page, the next step is to analyze it and extract the data we want. The most general-purpose method is extraction with regular expressions, which works for almost anything, but constructing regular expressions is comparatively tedious and error-prone.
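
As a rough illustration of regex-based extraction, the snippet below pulls link URLs and link text out of a toy HTML string; both the HTML and the pattern are made up for the example.

import re

html = '<a href="https://example.com/page1">Page 1</a> <a href="https://example.com/page2">Page 2</a>'
links = re.findall(r'<a href="(.*?)">(.*?)</a>', html)   # capture href and link text
for url, text in links:
    print(url, text)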

In addition, because the structure of web pages follows certain rules, there are also libraries that extract information based on web page node attributes, CSS selectors, or XPath, such as Beautiful Soup, pyquery, and lxml. With these libraries, we can quickly and efficiently extract page information such as node attributes and text values.
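
A small sketch with Beautiful Soup (it assumes the beautifulsoup4 package is installed); the HTML here is a toy snippet, not a real page.

from bs4 import BeautifulSoup

html = '<div id="container"><p class="title">Hello</p></div>'
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p", class_="title")   # locate the node by tag name and class
print(p.get_text())                  # text value -> Hello
print(p["class"])                    # node attribute -> ['title']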

Extracting information is a very important part of crawling: it turns messy data into organized data, so that we can process and analyze it later.

(3) Save data

After extracting the information, we generally save the data somewhere for later use. There are many ways to save it: simply writing it as TXT or JSON text, storing it in a database such as MySQL or MongoDB, or uploading it to a remote server, for example over SFTP.
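
A minimal sketch of saving extracted data as JSON and as TXT; the data and file names are arbitrary examples.

import json

data = [{"title": "Page 1", "url": "https://example.com/page1"}]

# save as JSON text
with open("result.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# save as plain TXT, one record per line
with open("result.txt", "w", encoding="utf-8") as f:
    for item in data:
        f.write(f'{item["title"]}\t{item["url"]}\n')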

(4) Automated program

By automated program, I mean that a crawler can perform these operations in place of a human. We could of course extract information by hand, but when the amount of data is particularly large, or when we want to obtain a large amount of data quickly, we still have to rely on a program. A crawler is an automated program that does the crawling work on our behalf; during crawling it can handle exceptions, retry on errors, and so on, to keep the crawl running continuously and efficiently.
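
As a simplified sketch of one such behavior, the hypothetical helper below retries a failed request a few times before giving up; the retry counts and delay are arbitrary.

import time
import requests

def fetch_with_retry(url, retries=3, delay=1):
    # try the request up to `retries` times, pausing between attempts
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(delay)
    return None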

2. What kind of data can be captured

We can see all kinds of information in web pages. The most common are regular web pages, which correspond to HTML code, and the most common crawling target is this HTML source code.

In addition, some web pages return a JSON string instead of HTML code (most API interfaces use this form). Data in this format is easy to transmit and parse. It can also be crawled, and extracting data from it is even more convenient.
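
A brief sketch of fetching such a JSON response with requests; the endpoint is purely a placeholder for a real API URL.

import requests

response = requests.get("https://example.com/api/items")  # hypothetical API endpoint
data = response.json()    # parse the JSON body into Python objects
print(type(data))         # typically a dict or a list, ready to use directly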

In addition, we can also see various kinds of binary data, such as pictures, videos, and audio. With a crawler, we can grab this binary data and save it to corresponding files.
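
A minimal sketch of downloading binary data (here an image) and writing it to a file; the URL and file name are placeholders.

import requests

url = "https://example.com/favicon.ico"   # hypothetical image URL
response = requests.get(url)
with open("favicon.ico", "wb") as f:      # open in binary mode
    f.write(response.content)             # response.content is the raw binary body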

In addition, you can also see files with various extensions, such as CSS, JavaScript, and configuration files. These are really just ordinary files: as long as they are accessible in a browser, you can grab them.

All of the above content corresponds to its own URL and is transferred over HTTP or HTTPS. As long as data can be accessed this way, a crawler can crawl it.

3. JavaScript-rendered pages

Sometimes, when we crawl a web page with urllib or requests, the source code we get is actually different from what we see in the browser.

This is a very common problem. Nowadays, more and more web pages are built using Ajax and front-end modular tools. The entire web page may be rendered by JavaScript, which means that the original HTML code is just an empty shell, for example:

<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8">
        <title>This is a Demo</title>
    </head>
    <body>
        <div id="container">
        </div>
    </body>
    <script src="app.js"></script>
</html>

There is only one node with the id container inside the body, but note that app.js is introduced after the body node, and it is responsible for rendering the entire website.

When this page is opened in a browser, the HTML content is loaded first. The browser then finds that an app.js file is referenced and requests that file as well. After obtaining the file, it executes the JavaScript code in it, which modifies the nodes in the HTML and adds content to them, finally producing the complete page.

But when we request the page with libraries such as urllib or requests, all we get is this HTML code; these libraries will not go on to load the JavaScript file for us, so we cannot see the content that appears in the browser.

This also explains why sometimes the source code we get is different from what we see in the browser.

Therefore, the source code obtained with a basic HTTP request library may not be the same as the page source shown in the browser. In such cases, we can analyze the backend Ajax interface, or use libraries such as Selenium and Splash to simulate JavaScript rendering.
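
A hedged sketch of the Selenium approach, assuming Chrome and a matching driver are available on the machine; the URL is a placeholder for a JavaScript-rendered page.

from selenium import webdriver

driver = webdriver.Chrome()          # launch a real browser controlled by the script
driver.get("https://example.com")    # placeholder URL for a JavaScript-rendered page
html = driver.page_source            # the DOM after JavaScript has executed
driver.quit()
print(html[:200])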

Later, we will describe in detail how to capture JavaScript-rendered web pages.

This section introduced some basic principles of crawlers, which should help us write crawlers more smoothly later on.

 
