Basic principles of web crawlers

We can compare the Internet to a large web, and a crawler (a web spider) is a spider crawling across this web. Each node of the web corresponds to a page; when the spider reaches a node, it visits that page and obtains its information. The connections between nodes correspond to the links between pages, so after visiting one node the spider can follow a connection to reach the next node, that is, it can keep obtaining subsequent pages through the pages it already has. In this way the spider can reach every node of the web, and all of a site's crawlable data can be crawled down.

Overview of crawlers

1. Fetching the page

The first thing a crawler does is obtain the page's source code; the desired information is then extracted from it.

Recall the concepts of request and response: we send a request to the site's server, and the body of the response it returns is the page's source code.

The most critical part is constructing and sending a request to the server, and then receiving and parsing the response.

Python provides many libraries to help with this, such as urllib and requests. We can use these libraries to construct and send the HTTP request; the request and response can be represented by the data structures the libraries provide. After receiving the response, we only need to parse its body to obtain the page source, so the whole process of fetching a web page can be implemented in a program.
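As a minimal sketch of this step (assuming the requests library is installed; the URL is just an example), fetching a page and reading its source might look like this:

import requests

url = "https://example.com"           # example URL, replace with the target page
response = requests.get(url)          # construct and send the HTTP request
response.raise_for_status()           # raise an error for 4xx/5xx status codes
html = response.text                  # the response body is the page source
print(html[:200])                     # preview the first part of the source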

2. Extracting information

After obtaining the page source, the next step is to parse it and extract the data we want. The most common approach is to use regular expressions. This is a universal method, but constructing regular expressions is fairly complex and error-prone.

In addition, since web pages follow certain structural rules, there are libraries that extract information based on node attributes, CSS selectors, or XPath, such as Beautiful Soup, pyquery, and lxml. With these libraries we can quickly and efficiently extract page information, such as node attributes, text, and values.

Note: extracting information is a very important part of crawling, because it makes our subsequent data processing much easier.
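A small sketch of both approaches (assuming Beautiful Soup is installed; the HTML snippet is a made-up example):

import re
from bs4 import BeautifulSoup

html = "<html><body><h1 class='title'>Hello</h1><a href='/next'>next</a></body></html>"

# Regular expression: universal, but fragile on complex pages
match = re.search(r"<h1.*?>(.*?)</h1>", html)
if match:
    print(match.group(1))                     # -> Hello

# Beautiful Soup: extract by node attributes / CSS selectors
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("h1.title").text)       # -> Hello
print(soup.find("a")["href"])                 # -> /next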

3. Saving the data

After extracting the information, we generally save the data somewhere for later use. Saving can take many forms: the data can simply be saved as TXT or JSON text, it can be saved to a database such as MySQL or MongoDB, or it can be saved to a remote server, for example via SFTP.
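For instance, a minimal sketch of saving extracted data as TXT and JSON (the file names and data are arbitrary examples):

import json

data = {"title": "Hello", "link": "/next"}    # example extracted data

# Save as plain text
with open("result.txt", "w", encoding="utf-8") as f:
    f.write(str(data) + "\n")

# Save as JSON
with open("result.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)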

4. Automation

Automation means that the crawler can perform these operations in place of a person. Of course, we could extract information by hand, but when the content is very large or we want to obtain a large amount of data quickly, we have to rely on a program. A crawler automates this crawling work for us, and it can also handle various exceptions during crawling, retry on errors, and so on, to ensure that crawling keeps running efficiently.
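As a rough sketch of the kind of exception handling and retry logic meant here (the URL, retry count, and delay are arbitrary examples):

import time
import requests

def fetch_with_retry(url, retries=3, delay=1):
    """Fetch a page, retrying on errors so crawling can keep running."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            print(f"attempt {attempt} failed: {exc}")
            time.sleep(delay)                 # wait briefly before retrying
    return None                               # give up after all retries fail

html = fetch_with_retry("https://example.com")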

What kind of data can be crawled

The most common case is crawling the HTML source code of web pages (data that would be impractical to download by hand).

Some API interfaces return their data as JSON strings, which makes them even easier to crawl (ha ha); JSON is simply a format that organizes information according to certain rules.
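A brief sketch of fetching such an interface (the API URL is a placeholder; real Ajax interfaces can be found in the browser's network panel):

import requests

api_url = "https://example.com/api/items"     # placeholder API URL
response = requests.get(api_url)
items = response.json()                       # parse the JSON body into Python objects
print(items)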

We also commonly encounter all kinds of binary data, such as images, video, and audio. With a crawler we can download this binary data and save it under the corresponding file names. We can likewise crawl CSS, JavaScript, and configuration files.
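For example, a minimal sketch of downloading an image and saving it under a corresponding file name (the image URL is a placeholder):

import requests

img_url = "https://example.com/logo.png"      # placeholder image URL
response = requests.get(img_url)
with open("logo.png", "wb") as f:             # binary data must be written in "wb" mode
    f.write(response.content)                 # response.content holds the raw bytes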

JavaScript-rendered pages

Sometimes when we crawl a page with urllib or requests, the source code we get back is not the same as what we see in the browser. This is a very common problem. More and more web pages are now built with Ajax and modular front-end tools, and the whole page may be rendered by JavaScript, so the original HTML is just an empty shell: the body contains only a single container node (for example, a node whose id is container), while a script such as app.js, introduced after the body node, is responsible for rendering the whole site.

Solution: for such cases, we can analyze the backend Ajax interfaces, or use libraries such as Selenium or Splash to simulate JavaScript rendering.
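A hedged sketch of the Selenium route (assuming Selenium and a matching browser driver such as ChromeDriver are installed; the URL is a placeholder):

from selenium import webdriver

driver = webdriver.Chrome()                   # needs a Chrome/ChromeDriver setup
driver.get("https://example.com")             # the browser executes the page's JavaScript
rendered_html = driver.page_source            # the source after JavaScript rendering
print(rendered_html[:200])
driver.quit()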

Origin www.cnblogs.com/rstz/p/12587351.html