Crawler workflow
The web crawling process is actually quite simple.
It can be divided into four parts:
1 Initiate a request
Use an HTTP library to send a Request to the target site. The request may carry additional headers, data, and other information; the crawler then waits for the server to respond. This is the same thing that happens when we open a browser, type a URL such as www.baidu.com into the address bar, and press Enter: the browser acts as a client and sends a request to the server.
2 Get a response
If the server responds normally, we get a Response. The content of the Response is the content we want to acquire; it may be HTML, a JSON string, binary data (images, video, etc.), or another type. This corresponds to the server receiving the client's request, processing it, and sending the HTML file back to the browser.
3 Parse the content
The content obtained may be HTML, which can be parsed with regular expressions or a page-parsing library. It may be JSON, which can be converted directly into a JSON object. It may be binary data, which can be saved or processed further. This step corresponds to the browser receiving the files from the server, then interpreting and displaying them.
4 Save the data
The data can be saved in several ways: as text, in a database, or as a file in a specific format such as jpg or mp4. This is equivalent to downloading a picture or video from a web page while browsing.
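The four steps above can be sketched in Python roughly as follows. This is a minimal illustration using only the standard library, not a complete crawler; the function names (fetch, extract_title, crawl), the User-Agent string, and the regular expression are all assumptions made for the example.

```python
import re
import urllib.request


def fetch(url):
    """Steps 1 and 2: send a request and return the response body as text."""
    # The User-Agent value is an arbitrary example string.
    request = urllib.request.Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urllib.request.urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_title(html):
    """Step 3: parse the content (here, pull out the <title> with a regex)."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def crawl(url, out_path, fetch=fetch):
    """Run all four steps; `fetch` can be swapped out, for example in tests."""
    html = fetch(url)
    title = extract_title(html)
    # Step 4: save the data as plain text.
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(f"{url}\t{title}\n")
    return title
```

Calling crawl("http://www.baidu.com", "titles.txt") would request the page and write its URL and title to titles.txt. Nothing is sent over the network until crawl or fetch is actually called.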
Request: sending a request
Request methods: GET and POST are the two most commonly used; others include HEAD, PUT, DELETE, and OPTIONS.
The difference between GET and POST is that a GET request carries its data in the URL, whereas a POST request carries it in the request body.
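A minimal sketch of where each method puts its data, using the standard library's urllib. The URL and parameters are placeholders, and the requests are only constructed here, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"q": "crawler", "page": 1}  # example data, not from a real site

# GET: the data is encoded into the query string of the URL.
get_request = Request("http://example.com/search?" + urlencode(params))

# POST: the URL stays clean and the encoded data travels in the request body.
post_request = Request("http://example.com/search", data=urlencode(params).encode("utf-8"))

print(get_request.get_method(), get_request.full_url, get_request.data)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

urllib chooses the method from the presence of data: a Request without data is a GET, and one with data is a POST.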
A URL, or Uniform Resource Locator, is what we commonly call a web address. It is a concise representation of the location of a resource available on the Internet and of the method for accessing it; it is the standard address of a resource on the Internet.
Each file on the Internet has a unique URL. The information it contains indicates where the file is located and how the browser should handle it.
A URL consists of three parts: the first part is the protocol (also called the scheme). The second part is the IP address or domain name of the host where the resource is stored. The third part is the specific address of the resource on the host (such as a directory and file name).
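As a small illustration, the three parts can be pulled out of a URL with urllib.parse.urlsplit. The URL below is a made-up example.

```python
from urllib.parse import urlsplit

parts = urlsplit("https://www.example.com/images/logo.jpg")

print(parts.scheme)    # protocol: https
print(parts.hostname)  # host: www.example.com
print(parts.path)      # resource address on the host: /images/logo.jpg
```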
Request header
The header information sent with the request, such as User-Agent, Host, and Cookies.
Request body
The data carried with the request, such as the form data sent when a form is submitted (POST).
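The sketch below shows one way to attach headers and a form body to a request with urllib. The URL, form fields, and header values are invented for illustration, and the request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

form = {"username": "alice", "password": "secret"}  # made-up form fields

request = Request(
    "http://example.com/login",  # placeholder URL
    data=urlencode(form).encode("utf-8"),  # request body
    headers={
        "User-Agent": "Mozilla/5.0 (example)",
        "Cookie": "sessionid=abc123",
    },
)

# urllib stores header names in capitalized form, e.g. "User-agent".
for name, value in request.header_items():
    print(f"{name}: {value}")
print(request.data)
```

The Host header does not need to be set by hand here; urllib fills it in from the URL when the request is actually sent.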
Response: what a response contains
The first line of every HTTP response is the status line, which consists of the HTTP version, a three-digit status code, and a phrase describing the status, separated from each other by spaces.
Response Status
There are many response status codes, for example: 200 means success, 301 is a redirect, 404 means the page was not found, and 502 is a server-side error (Bad Gateway).
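To make the status line concrete, here is a small sketch that splits one into its three fields and looks up the standard phrase for a few codes. The helper name parse_status_line is an assumption for this example; HTTPStatus comes from the Python standard library.

```python
from http import HTTPStatus


def parse_status_line(line):
    """Split a status line into (HTTP version, status code, reason phrase)."""
    version, _, rest = line.strip().partition(" ")
    code, _, phrase = rest.partition(" ")
    return version, int(code), phrase


print(parse_status_line("HTTP/1.1 404 Not Found"))

# The standard library knows the usual phrase for each code.
for code in (200, 301, 404, 502):
    print(code, HTTPStatus(code).phrase)
```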
Response header
Information such as the content type, content length, server information, and Set-Cookie.
Response Body
The most important part. It contains the content of the requested resource, such as web page HTML, images, or other binary data.
Web page text: for example HTML documents or JSON-formatted text.
Images: retrieved as binary data and saved in an image format.
Video: likewise binary data.
Other: anything that can be requested can be retrieved.
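One possible way to handle these body types is to branch on the Content-Type response header. The sketch below is an assumption-laden example: decode_body is a hypothetical helper, and it assumes text and JSON bodies are UTF-8 encoded, which real sites do not always guarantee.

```python
import json


def decode_body(content_type, body):
    """Turn raw response bytes into a useful Python value based on Content-Type."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return json.loads(body.decode("utf-8"))
    if media_type.startswith("text/"):
        return body.decode("utf-8", errors="replace")
    # Images, video, and anything else: keep the bytes untouched.
    return body


print(decode_body("text/html; charset=utf-8", b"<h1>Hello</h1>"))
print(decode_body("application/json", b'{"ok": true}'))
print(decode_body("image/png", b"\x89PNG\r\n\x1a\n"))
```

Binary results can then be written to disk with open(path, "wb"), while text and JSON are ready for parsing.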
How to parse data
Direct processing
JSON parsing
Regular expression processing
BeautifulSoup parsing
PyQuery parsing
XPath parsing
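Three of these approaches can be tried with the standard library alone, as sketched below on made-up sample data. BeautifulSoup and PyQuery are third-party libraries that cope better with real-world, malformed HTML; they are not shown here. The XPath part uses xml.etree.ElementTree as a stand-in, which supports only a limited XPath subset and requires well-formed markup.

```python
import json
import re
import xml.etree.ElementTree as ET

# JSON parsing: a JSON string becomes Python dicts and lists.
data = json.loads('{"name": "crawler", "tags": ["http", "html"]}')
print(data["tags"][0])

# Regular expressions: pull every href value out of a piece of HTML.
html = '<ul><li><a href="/a.html">A</a></li><li><a href="/b.html">B</a></li></ul>'
links = re.findall(r'href="([^"]+)"', html)
print(links)

# XPath: find every <a> element anywhere under the root and read its text.
root = ET.fromstring(html)
texts = [a.text for a in root.findall(".//a")]
print(texts)
```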
Why the crawled page differs from what the browser shows
This happens because much of a site's data is loaded dynamically through JavaScript and Ajax, so the page obtained by requesting the URL directly differs from the page the browser displays.
How to solve the JavaScript rendering problem?
Analyze the Ajax requests, or use Selenium / WebDriver, Splash, or PyV8.
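As a rough sketch of the first option, analyzing Ajax usually means finding the data request in the browser's network panel and reproducing it directly instead of rendering the page. Everything specific below is hypothetical: the endpoint URL, the page and size parameters, and the shape of the JSON payload are invented for illustration, and the request is only built, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, as it might appear in the browser's network panel.
API_URL = "http://example.com/api/comments"


def build_ajax_request(page):
    """Build the same request the page's JavaScript would send for one page of data."""
    query = urlencode({"page": page, "size": 20})
    # Some sites check this header to recognize Ajax calls; it is optional.
    return Request(f"{API_URL}?{query}", headers={"X-Requested-With": "XMLHttpRequest"})


def extract_comments(payload):
    """Parse the JSON text returned by the endpoint into a list of comment strings."""
    return [item["text"] for item in json.loads(payload)["items"]]


# Sample payload standing in for a real response.
sample = '{"items": [{"text": "first"}, {"text": "second"}]}'
print(build_ajax_request(2).full_url)
print(extract_comments(sample))
```

When the data request cannot be reproduced this way, a browser-driving tool such as Selenium is the usual fallback.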
How to save data
Text: plain text, JSON, XML, etc.
Relational databases: structured databases such as MySQL, Oracle, and SQL Server.
Non-relational databases: MongoDB, Redis, and other stores that use a key-value form.
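A minimal sketch of two of these options: a JSON text file, and a relational table. SQLite from the Python standard library is used here only as a stand-in for MySQL or similar servers; the function names, the pages table, and its columns are assumptions for the example.

```python
import json
import sqlite3


def save_as_json(records, path):
    """Save records as a JSON text file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def save_to_database(records, db_path):
    """Save records to a relational table and return the row count."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")
            conn.executemany(
                "INSERT OR REPLACE INTO pages (url, title) VALUES (:url, :title)",
                records,
            )
        return conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
    finally:
        conn.close()
```

Each record is a dict such as {"url": "http://example.com/a", "title": "A"}. Using the URL as the primary key means crawling the same page twice updates the existing row instead of adding a duplicate.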