Web crawler process

The web crawler process is actually quite simple.

It can be divided into four parts:

1 Initiate a request

Send a request to the target site through an HTTP library. The Request may carry additional headers, data and other information; the crawler then waits for the server to respond. This is like opening a browser, typing the URL www.baidu.com into the address bar and pressing Enter: the browser acts as a client that sends a request to the server.

2 Get a response

If the server responds normally, we get a Response. Its content is the content we want to acquire, which may be HTML, a JSON string, binary data (images, video, etc.) or other types. This corresponds to the server receiving the client's request, processing it, and sending the HTML file back to the browser.

3 Parse the content

If the content is HTML, it can be parsed with regular expressions or a page parsing library. If it is JSON, it can be converted directly into a JSON object. If it is binary data, it can be saved or processed further. This step corresponds to the browser receiving the file from the server, then interpreting and displaying it.

4 Save the data

The data can be saved as text, stored in a database, or saved as files in a specific format such as jpg or mp4. This is equivalent to downloading pictures or videos from a web page while browsing.
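The four steps above can be sketched end to end with Python's standard library. This is only a minimal illustration, not the original author's code: `DemoHandler`, `crawl`, `run_demo` and the page content are hypothetical names invented for the demo, and a tiny local server stands in for the target site so the example runs without Internet access.

```python
import json
import re
import tempfile
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

PAGE = b"<html><head><title>Demo page</title></head><body>Hello</body></html>"

# Ignore any system proxy settings so the local demo server is reached directly.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny stand-in for the target site (hypothetical, for the demo only)."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # keep the demo quiet


def crawl(url, out_path):
    # 1. Initiate a request, with an extra header.
    request = urllib.request.Request(url, headers={"User-Agent": "demo-crawler/0.1"})
    # 2. Get a response.
    with OPENER.open(request, timeout=5) as response:
        status = response.status
        html = response.read().decode("utf-8")
    # 3. Parse the content (a regular expression is enough for this page).
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    title = match.group(1).strip() if match else None
    # 4. Save the data.
    record = {"url": url, "status": status, "title": title}
    Path(out_path).write_text(json.dumps(record), encoding="utf-8")
    return record


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/"
        with tempfile.TemporaryDirectory() as tmp:
            out_path = Path(tmp) / "result.json"
            record = crawl(url, out_path)
            saved = json.loads(out_path.read_text(encoding="utf-8"))
        return record, saved
    finally:
        server.shutdown()
        server.server_close()
```

Calling `run_demo()` returns the parsed record together with what was read back from the saved file. Against a real site, only the URL passed to `crawl` would change.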

Request methods

GET and POST are the two most commonly used; the others include HEAD / PUT / DELETE / OPTIONS.

The difference between GET and POST: a GET request carries its data in the URL, while a POST request carries its data in the request body.
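To make the GET / POST difference concrete, here is a small sketch that builds both kinds of request with `urllib` without sending them. The host `www.example.com` and the parameter names are placeholders chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"wd": "crawler", "page": 1}  # made-up query parameters

# GET: the parameters are encoded into the URL's query string.
get_request = Request("http://www.example.com/search?" + urlencode(params))

# POST: the same parameters travel in the request body as bytes.
post_request = Request(
    "http://www.example.com/search",
    data=urlencode(params).encode("ascii"),
)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

`urllib` infers the method from whether `data` is supplied, which mirrors the rule above: no body means GET, a body means POST.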


URL stands for Uniform Resource Locator, which is what we usually call a web address. It is a concise representation of the location of a resource on the Internet and of the method for accessing it; it is the standard address of a resource on the Internet.

Every file on the Internet has a unique URL. The information it contains indicates where the file is located and how the browser should handle it.


A URL consists of three parts: the first part is the protocol (protocol type); the second part is the IP address or domain name of the host where the resource is stored; the third part is the specific address of the resource on that host (path and file name).
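The three parts can be pulled apart with `urllib.parse.urlsplit`. The path `/index.html` below is an assumed example; only the host www.baidu.com comes from the text.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.baidu.com/index.html")

print(parts.scheme)  # first part: the protocol
print(parts.netloc)  # second part: the host
print(parts.path)    # third part: the address of the resource on the host
```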

Request header

The header information sent with the request, such as User-Agent, Host and Cookies.

Request body: the data carried by the request, such as the form data sent when a form is submitted (POST).
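As a rough illustration of request headers and a request body, the sketch below attaches both to a `urllib` request without sending it. The URL, form fields and cookie value are made up for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

form = {"username": "alice", "password": "secret"}  # made-up form fields

request = Request(
    "http://www.example.com/login",        # placeholder URL
    data=urlencode(form).encode("ascii"),  # request body (makes this a POST)
    headers={
        "User-Agent": "Mozilla/5.0 (demo crawler)",
        "Cookie": "sessionid=abc123",
    },
)

# urllib stores header names via str.capitalize(), so "User-Agent"
# is kept internally as "User-agent".
for name, value in request.header_items():
    print(f"{name}: {value}")
print(request.data)
```

The Host header does not appear here because `urllib` adds it only when the request is actually sent.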

HTTP protocol Introduction

Response: what a response contains

The first line of every HTTP response is the status line. It consists of the HTTP version, a three-digit status code, and a phrase describing the status, separated from each other by spaces.

Response Status

There are many response status codes, for example: 200 means success, 301 means a redirect, 404 means the page was not found, and 502 means a server error.
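A short sketch of how a status line breaks down, and how the codes listed above map to their standard phrases through `http.HTTPStatus`. The helper `parse_status_line` is a hypothetical name, and it assumes the line includes a reason phrase.

```python
from http import HTTPStatus


def parse_status_line(line):
    """Split 'HTTP/1.1 200 OK' into (version, code, reason phrase).

    Assumes a reason phrase is present after the status code.
    """
    version, code, reason = line.strip().split(" ", 2)
    return version, int(code), reason


version, code, reason = parse_status_line("HTTP/1.1 404 Not Found")
print(version, code, reason)

# Look up the standard phrase for each code mentioned in the text.
for value in (200, 301, 404, 502):
    status = HTTPStatus(value)
    print(status.value, status.phrase)
```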

Response header

Content type, content length, server information, Set-Cookie, and so on.
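The sketch below reads those fields from a raw header block. The header values are invented for the example; the `email` parser is used because HTTP headers share the same "Name: value" layout, and `http.cookies.SimpleCookie` handles the Set-Cookie value.

```python
from email import message_from_string
from http.cookies import SimpleCookie

# Invented example headers, followed by the blank line that ends the block.
raw_headers = (
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Length: 1024\n"
    "Server: nginx\n"
    "Set-Cookie: sessionid=abc123; Path=/; HttpOnly\n"
    "\n"
)

headers = message_from_string(raw_headers)
content_type = headers.get_content_type()   # media type without parameters
charset = headers.get_content_charset()     # charset parameter, lowercased
length = int(headers["Content-Length"])
server = headers["Server"]

cookie = SimpleCookie()
cookie.load(headers["Set-Cookie"])
session_id = cookie["sessionid"].value

print(content_type, charset, length, server, session_id)
```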

Response Body

The most important part. It contains the content of the requested resource, such as web page HTML, images, binary data, etc.

Web page text: such as HTML documents and JSON-formatted text

Images: fetched as binary data and saved in an image format

Video: likewise binary data

Other: anything that can be requested can be fetched
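One way to treat these kinds of body differently is to branch on the Content-Type header. The function below is a hypothetical helper, and it assumes text bodies are UTF-8 encoded, which a real crawler would need to check against the declared charset.

```python
import json


def decode_body(content_type, body):
    """Return a Python value for text-like bodies, or the raw bytes otherwise.

    Assumes UTF-8 for text and JSON bodies.
    """
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return json.loads(body.decode("utf-8"))
    if media_type.startswith("text/"):
        return body.decode("utf-8")
    return body  # images, video and other binary data stay as bytes


print(decode_body("text/html; charset=utf-8", b"<p>hi</p>"))
print(decode_body("application/json", b'{"a": 1}'))
print(decode_body("image/png", b"\x89PNG"))
```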

How to parse data

Direct processing

JSON parsing

Regular expressions

BeautifulSoup

PyQuery

XPath
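Several of these approaches can be tried on the same small snippet. To keep the sketch runnable without third-party packages, the standard-library `HTMLParser` stands in for BeautifulSoup and PyQuery, and `xml.etree.ElementTree`, which supports only a limited XPath subset on well-formed markup, stands in for a full XPath library. The sample HTML, JSON and the `LinkCollector` class are invented for the example.

```python
import json
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'

# JSON parsing: turn a JSON string into Python objects.
data = json.loads('{"name": "crawler", "pages": [1, 2, 3]}')

# Regular expression: pull every href value out of the markup.
hrefs_by_regex = re.findall(r'href="([^"]+)"', html)


# Page parsing: collect link targets with the standard-library HTMLParser.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


collector = LinkCollector()
collector.feed(html)

# XPath-style query: find every <a> element below the root and read its text.
link_texts = [a.text for a in ET.fromstring(html).findall(".//a")]

print(data["pages"], hrefs_by_regex, collector.links, link_texts)
```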

Why the crawled page differs from what the browser shows

This happens because many sites load their data dynamically through JavaScript and Ajax, so the page obtained by a direct request differs from the page the browser displays after rendering.
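The effect can be illustrated without any network access. In the sketch below, both strings are invented: `static_html` plays the role of what a plain request returns, and `ajax_payload` plays the role of the JSON that the page's JavaScript would later fetch from a hypothetical `/api/items` endpoint.

```python
import json
import re

# What a plain HTTP request returns: the list is empty until JavaScript runs.
static_html = '<ul id="items"></ul><script src="/static/app.js"></script>'

# What the page's JavaScript later fetches (hypothetical /api/items response).
ajax_payload = '{"items": [{"title": "First"}, {"title": "Second"}]}'

# Scraping the static HTML finds no list items at all.
titles_in_html = re.findall(r"<li>(.*?)</li>", static_html)

# Reading the Ajax response directly yields the data the browser would show.
titles_in_api = [item["title"] for item in json.loads(ajax_payload)["items"]]

print(titles_in_html)
print(titles_in_api)
```

This is why analyzing the Ajax requests is often the simplest fix: the data is usually available in a structured form from the endpoint the page itself calls.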


How to solve the JavaScript rendering problem?

Analyze the Ajax requests, Selenium / WebDriver, Splash, PyV8

How to save data

Text: plain text, JSON, XML, etc.

Relational databases: structured databases such as MySQL, Oracle, SQL Server, etc.

Non-relational databases: MongoDB, Redis and other stores that use key-value or similar forms
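The storage options above can be sketched with the standard library alone. SQLite is used here as a stand-in for a relational database such as MySQL, purely so the example runs without a database server; the table layout, the function names and the sample records are assumptions made for the illustration.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Invented sample records, as a crawler might produce them.
records = [
    {"url": "http://www.example.com/a", "title": "First"},
    {"url": "http://www.example.com/b", "title": "Second"},
]


def save_as_json(path, rows):
    """Text storage: write the records as a JSON file."""
    Path(path).write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8")


def save_to_sqlite(path, rows):
    """Relational storage: one row per page, keyed by URL."""
    connection = sqlite3.connect(path)
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
            )
            connection.executemany(
                "INSERT OR REPLACE INTO pages (url, title) VALUES (:url, :title)", rows
            )
    finally:
        connection.close()


def load_titles(path):
    connection = sqlite3.connect(path)
    try:
        return [row[0] for row in connection.execute("SELECT title FROM pages ORDER BY url")]
    finally:
        connection.close()


with tempfile.TemporaryDirectory() as tmp:
    save_as_json(Path(tmp) / "pages.json", records)
    save_to_sqlite(str(Path(tmp) / "pages.db"), records)
    print(load_titles(str(Path(tmp) / "pages.db")))
```

Using the URL as the primary key with `INSERT OR REPLACE` means crawling the same page twice updates the row instead of duplicating it.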

Origin www.cnblogs.com/wow-santa/p/12115046.html