04 - Basic Principles of Web Crawlers

  Overview: Building on the basics covered earlier, this post takes a look at the basic principles of web crawlers!


  There are many kinds of crawling creatures, so why do we liken crawlers to spiders in particular? Because the web can be seen as a network of nodes, with the link relationships between pages seen as connections between those nodes. Doesn't that look like a spider web? After crawling one node, the spider can follow a connection to reach the next node (that is, reach other web pages through links); repeating this again and again, it can crawl all the nodes.

 

I. The Basic Concept of Crawlers

  In simple terms, a crawler is an automated program that gets web pages, extracts information from them, and saves the data.

1. Get page

  When we say "get a page", we mean obtaining the source code of the page.

  As discussed before, when we send a request to a web server, the body of the response it returns is the page source code. So the key part is how to construct a request to the server, and then how to receive and parse the response.

  We can use libraries provided by Python (e.g. urllib, requests, etc.) to construct and send HTTP requests. These libraries represent requests and responses as data structures, so after receiving a response we only need to parse its body, which is the page source code.
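As a minimal sketch of this step using only the standard library's urllib: to stay self-contained and runnable offline, the example serves a tiny made-up page from a local HTTP server; against a real site you would simply pass that site's URL to `urlopen()`.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# A made-up page served locally, standing in for a real website.
PAGE = b"<html><body><h1>Hello, crawler!</h1></body></html>"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/"
with urlopen(url) as resp:            # send the request, receive the response
    html = resp.read().decode("utf-8")  # the response body is the page source

print(html)
server.shutdown()
```

In practice the requests library offers a friendlier interface (`requests.get(url).text`), but the principle is the same: send a request, read the body of the response.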

2. Extract information

  After getting the source code, we must analyze it to extract the useful data. There are two common approaches:

  • Regular expressions: a universal method, but relatively complex and error-prone.
  • Parsing libraries: some libraries can extract page information by node attributes, CSS selectors, or XPath (e.g. Beautiful Soup, pyquery, lxml). These libraries can extract data quickly and efficiently.
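The two approaches above can be sketched on the same snippet of source code using only the standard library; `html.parser.HTMLParser` stands in here for parsing libraries like Beautiful Soup, which likewise work on the parsed node structure. The HTML snippet is made up for illustration.

```python
import re
from html.parser import HTMLParser

html = ('<html><head><title>Demo</title></head>'
        '<body><a href="/page1">One</a><a href="/page2">Two</a></body></html>')

# Approach 1: regular expression -- universal, but fragile on messy markup.
title = re.search(r"<title>(.*?)</title>", html).group(1)

# Approach 2: a real parser -- extract data by node tag and attributes.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":  # collect the href attribute of every <a> node
            self.links.extend(v for k, v in attrs if k == "href")

collector = LinkCollector()
collector.feed(html)

print(title, collector.links)
```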

3. Save data

  After extracting the useful information, we need to save the data for subsequent use. There are three common practices:

  • Save as text, e.g. TXT or JSON
  • Save to a database such as MySQL or MongoDB
  • Save to a remote server (via SFTP, etc.)
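The first option can be sketched as follows: writing extracted records out as JSON text and reading them back. The file name and record fields here are made up for illustration.

```python
import json
import os
import tempfile

# Records as a crawler might have extracted them (made-up sample data).
records = [
    {"title": "Page One", "url": "http://example.com/1"},
    {"title": "Page Two", "url": "http://example.com/2"},
]

path = os.path.join(tempfile.gettempdir(), "crawl_results.json")

# Save as JSON text; ensure_ascii=False keeps non-ASCII characters readable.
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Reading the file back shows the round trip is lossless.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == records)
```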

4. Automation

  A crawler automates the whole crawling process: it handles exceptions, retries on errors, and so on. This high degree of automation is especially valuable in scenarios with large amounts of data that must be fetched at high speed.
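One piece of that automation, retry-on-error, can be sketched as a small wrapper. The function names and the deliberately flaky fetcher below are made up for illustration; a real crawler would pass in something like `requests.get`.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=0.1):
    """Call fetch(url); on failure, wait and retry up to `retries` times."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(delay)

# Demo: a fake fetcher that fails twice before succeeding.
calls = {"n": 0}

def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"

result = fetch_with_retry(flaky_fetch, "http://example.com")
print(result, calls["n"])
```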

 

II. What Data Can Crawlers Grab?

  As long as data is available at a URL over the HTTP or HTTPS protocol, it can be crawled.

1. HTML Code

  The most common case. An ordinary web page corresponds to HTML code, and this is the content we crawl most frequently.

2. JSON string

  Some pages return not HTML code but a JSON string. API interfaces often use this form.
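When the response body is a JSON string, we parse it with `json.loads` instead of an HTML parser. The response body below is made up for illustration of the typical API shape.

```python
import json

# A made-up API-style response body (a JSON string, not HTML).
body = '{"status": "ok", "items": [{"id": 1, "title": "First post"}]}'

data = json.loads(body)  # str -> dict; no HTML parsing needed

print(data["status"], data["items"][0]["title"])
```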

3. Binary Data

  Such as pictures, audio, video and so on.
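Binary data must be written and read in binary mode (`"wb"`/`"rb"`); decoding it as text would corrupt it. As a minimal sketch, the "image" below is a stand-in byte string starting with the real PNG magic bytes, not an actual download.

```python
import os
import tempfile

# The 8 magic bytes that open every PNG file.
PNG_MAGIC = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])
fake_image = PNG_MAGIC + b"\x00" * 16  # stand-in for a downloaded response body

path = os.path.join(tempfile.gettempdir(), "demo_crawl.png")

with open(path, "wb") as f:  # binary write: no encoding involved
    f.write(fake_image)

with open(path, "rb") as f:  # binary read
    saved = f.read()

print(saved == fake_image)
```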

4. Files with various extensions

  Such as CSS, JavaScript, configuration files, and so on.

 

III. JavaScript-Rendered Pages

  Sometimes the page source code we crawl with urllib or requests is not the same as what we actually see in the browser. This is because modern front ends increasingly use Ajax and modular build tools; the page may be rendered by JavaScript, and the raw HTML code is just an empty shell.

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>This is a Demo</title>
</head>
<body>
<div id="container">
</div>
</body>
<script src="app.js"></script>
</html>

  In the body of the code above there is only one node, with id container, but after the body the page includes app.js, which is responsible for rendering the entire site. When a browser opens the page, it first loads the HTML content, then sees that the app.js file is referenced and requests that file as well. After obtaining the file, it executes the JavaScript code within, which changes the HTML nodes and adds content to them, producing the complete page.

  When we request a page with urllib or requests, we only get the raw HTML code; the library will not go on to load the JavaScript files, so naturally we cannot see the content shown in the browser.

  Since the source code obtained by a basic HTTP request library is inconsistent with the page source seen in the browser, we can either analyze the backend Ajax interface directly, or use tools such as Selenium or Splash to simulate the JavaScript rendering of a browser.


Origin www.cnblogs.com/murongmochen/p/11734532.html