Web Crawlers - A First Look at Web Crawlers

1. Concept

1.1 What is a web crawler

A web crawler (also known as a web spider or web robot) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Put simply, it uses a program to obtain the data you want from web pages; in other words, it collects data automatically.

1.2 What crawlers are used for

A crawler exists to collect data. For example, if you want to download a set of pictures from a page, downloading them one by one by hand is too slow, whereas an image crawler can fetch them quickly. The collected data can also serve as raw material for data analysis and similar work.

1.3 The nature of a crawler

In essence, a crawler is a program that imitates a user by sending requests to a server. The server returns data, and the program parses and filters the returned HTML to extract the resources we want (text, pictures, video, and so on).
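
As a minimal sketch of what "imitating a user" can look like, the snippet below builds a request that carries browser-like headers using only Python's standard library. The function names build_request and fetch, the User-Agent string, and the 10-second timeout are illustrative assumptions, not something prescribed by this article.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0 (X11; Linux x86_64)"):
    """Build (but do not send) a GET request with browser-like headers."""
    headers = {
        # Assumed example values; a real browser sends a longer User-Agent.
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch(url, timeout=10):
    """Send the request and return the response body decoded as text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch with a page URL would return that page's HTML as a string, which the program can then parse and filter.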

2. Principle

2.1 The basic flow of a crawler

Initiate a request
Use an HTTP library to send a request (a Request) to the target site. The request may carry additional header information. Then wait for the server to respond.

Get the response content
If the server responds normally, you get a Response. Its body is the page content you want to obtain, which may be HTML, a JSON string, binary data (such as images or video), or another type.

Parse the content
If the content is HTML, it can be parsed with regular expressions or a page-parsing library. If it is JSON, it can be converted directly into a JSON object and parsed. If it is binary data, it can be saved or processed further.

Save the data
The data can be stored in many forms: as plain text, in a database, or as a file in a specific format.
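
The four steps above can be sketched roughly as follows, using only the standard library. This is an illustration under stated assumptions rather than a complete crawler: the names ImageCollector, fetch, parse, save, and crawl are hypothetical, the target is assumed to be an HTML page whose image URLs we want, and the output is assumed to be a JSON file.

```python
import json
import urllib.request
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def fetch(url, timeout=10):
    # Steps 1 and 2: send the request, then read the response body.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def parse(html):
    # Step 3: extract the data we care about from the HTML.
    collector = ImageCollector()
    collector.feed(html)
    return collector.sources


def save(items, path):
    # Step 4: store the result, here as a JSON file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)


def crawl(url, path, fetcher=fetch):
    """Run the whole flow; fetcher can be swapped out for offline testing."""
    items = parse(fetcher(url))
    save(items, path)
    return items
```

Passing the fetcher as a parameter is a design choice made for this sketch: it lets the parse and save steps be exercised against a fixed HTML string without touching the network.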

2.2 Request and Response

See blog: https://www.cnblogs.com/lymlike/p/11579840.html

2.3 How to parse data

  1. Direct processing
  2. JSON parsing
  3. Regular expressions
  4. BeautifulSoup
  5. PyQuery
  6. XPath
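
A few of these approaches can be illustrated with the standard library alone. The sketch below covers JSON parsing, regular expressions, and the limited XPath subset supported by xml.etree.ElementTree, which only handles well-formed markup. BeautifulSoup, PyQuery, and full XPath (for example via lxml) are third-party packages and are not shown here. The function names and the sample markup are assumptions made for illustration.

```python
import json
import re
import xml.etree.ElementTree as ET


def parse_json(text):
    """Convert a JSON string into Python dicts and lists."""
    return json.loads(text)


def extract_links(html):
    """Pull every double-quoted href value out of raw HTML with a regex."""
    return re.findall(r'href="([^"]+)"', html)


def extract_items(markup):
    """Select <li class="item"> elements using ElementTree's XPath subset."""
    root = ET.fromstring(markup)  # raises ParseError on malformed markup
    return [li.text for li in root.findall(".//li[@class='item']")]
```

Regular expressions work for simple, predictable patterns like the href example, but they become fragile on real-world HTML, which is the usual reason to reach for a dedicated parsing library.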

2.4 How to save data

  1. Text: plain text, JSON, XML, etc.
  2. Relational databases: structured databases such as MySQL, Oracle, and SQL Server
  3. Non-relational databases: MongoDB, Redis, and other stores that keep data as documents or key-value pairs
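
As a rough sketch of the first two options, the code below writes crawled records to a JSON file and to a relational table. SQLite is used here only as a stand-in for MySQL, Oracle, or SQL Server because it ships with Python and needs no server; the record shape (url and title), the pages table, and the function names are assumptions for illustration. MongoDB and Redis need a running server and a third-party client, so they are left out.

```python
import json
import sqlite3


def save_as_json(records, path):
    """Write a list of dicts to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def save_to_sqlite(records, db_path):
    """Insert records into a 'pages' table, replacing rows with the same URL."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO pages (url, title) VALUES (:url, :title)",
                records,
            )
    finally:
        conn.close()
```

Using the URL as the primary key means that crawling the same page twice updates the existing row instead of creating a duplicate.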

Reference: https://www.cnblogs.com/zhaof/p/6898138.html

Origin www.cnblogs.com/lymlike/p/11593824.html