Python crawler (3): a review of some crawler basics

Crawler:

Simply put: an automated program that fetches web pages, then extracts and saves the information in them

The four basic parts of a request

1. Request method:
The main ones are GET and POST; others include PUT, DELETE, OPTIONS and HEAD
2. Request URL:
URL stands for Uniform Resource Locator; a web document, a picture, a video and so on can each be uniquely identified by a URL
3. Request headers:
Header information sent with the request, such as User-Agent, Host and Cookie
4. Request body:
Additional data carried by the request, such as the form data in a form submission
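
As a rough illustration of the four parts above, here is a minimal sketch using only the standard library's urllib. The URL, header values and form fields are made-up placeholders, not values from this article, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical target: example.com is a placeholder, not a site from the article.
URL = "http://example.com/login"

# Request headers: User-Agent and Cookie are chosen by the client
# (Host is filled in from the URL when the request is sent).
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; demo-crawler)",
    "Cookie": "sessionid=abc123",
}

# Request body: form fields, URL-encoded and converted to bytes.
form_body = urlencode({"username": "alice", "password": "secret"}).encode("utf-8")

# GET request: method + URL + headers, no body.
get_req = Request(URL, headers=HEADERS, method="GET")

# POST request: the body carries the form data.
post_req = Request(URL, data=form_body, headers=HEADERS, method="POST")

for req in (get_req, post_req):
    # Print the method, the URL and the body of each prepared request.
    print(req.get_method(), req.full_url, req.data)
```

To really send one of these requests, it could be passed to urllib.request.urlopen; that step is left out here so the sketch runs without a network.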

The basic parts of a response

1. Response status: there are many status codes, for example 200 means success, 301 is a redirect, 404 means the page was not found, and 502 is a server error
2. Response headers: such as content type, content length, server information, cookies to set, and the like
3. Response body: the most important part; it contains the content of the requested resource, such as a web page's HTML, images, or other binary data
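
The parts of a response listed above can be sketched as follows. The raw response text is a hand-written sample (an assumption, not captured from a real server), and split_response and describe_status are hypothetical helper names used only for this illustration.

```python
from http import HTTPStatus

# Hand-written sample response; a real one would come from a server.
RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: 12\r\n"
    b"Set-Cookie: sessionid=abc123\r\n"
    b"\r\n"
    b"<p>hello</p>"
)


def split_response(raw: bytes):
    """Split a raw HTTP/1.1 response into status code, reason, headers, body."""
    head, body = raw.split(b"\r\n\r\n", 1)
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return int(code), reason, headers, body


def describe_status(code: int) -> str:
    """Label a status code with its standard phrase and its general class."""
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code} {HTTPStatus(code).phrase} ({kind})"


code, reason, headers, body = split_response(RAW)
print(describe_status(code), headers["Content-Type"], body)

# The status codes mentioned in the text.
for c in (200, 301, 404, 502):
    print(describe_status(c))
```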

Data a crawler can fetch

Web page text: such as HTML documents and JSON-formatted text
Pictures
Videos
Other

Parsing methods:

1. Direct processing
2. JSON
3. Regular expressions
4. BeautifulSoup
5. PyQuery
6. XPath
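
A few of the parsing methods above can be tried with the standard library alone. The sketch below is only an approximation: html.parser stands in for third-party parsers such as BeautifulSoup and PyQuery, and xml.etree.ElementTree supports only a limited subset of XPath. The sample JSON and HTML strings are invented for the example.

```python
import json
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Invented sample data; real input would be a downloaded response body.
SAMPLE_JSON = '{"title": "Crawler basics", "tags": ["python", "crawler"]}'
SAMPLE_HTML = (
    "<html><body><h1>Crawler basics</h1>"
    '<a href="/page/1">one</a><a href="/page/2">two</a>'
    "</body></html>"
)

# JSON: decode the text into Python dicts and lists.
data = json.loads(SAMPLE_JSON)

# Regular expressions: quick, but fragile on messy HTML.
links_by_regex = re.findall(r'href="([^"]+)"', SAMPLE_HTML)


# HTML parser: collect the href attribute of every <a> tag.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed(SAMPLE_HTML)

# XPath-style query: works here only because the sample is well-formed XML.
root = ET.fromstring(SAMPLE_HTML)
links_by_xpath = [a.get("href") for a in root.findall(".//a")]

print(data["title"], links_by_regex, collector.links, links_by_xpath)
```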

How to handle JavaScript-rendered pages

Analyze the Ajax requests
Selenium / WebDriver
Splash
PyV8, Ghost.py
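
Of the options above, analyzing the Ajax requests usually means finding the data endpoint the page's JavaScript calls and requesting it directly. The sketch below assumes a made-up paginated JSON endpoint and a made-up payload shape; build_ajax_url and extract_titles are hypothetical helpers, and no request is actually sent.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, as it might appear in the browser's network panel.
API_BASE = "http://example.com/api/articles"


def build_ajax_url(page: int, size: int = 20) -> str:
    """Build the URL the page's JavaScript would request for one page of data."""
    return f"{API_BASE}?{urlencode({'page': page, 'size': size})}"


# Invented payload standing in for what such an endpoint might return.
SAMPLE_PAYLOAD = (
    '{"items": [{"id": 1, "title": "first"}, {"id": 2, "title": "second"}],'
    ' "has_more": false}'
)


def extract_titles(payload: str) -> list:
    """Pull the titles out of the JSON payload; no HTML rendering is needed."""
    return [item["title"] for item in json.loads(payload)["items"]]


print(build_ajax_url(2))
print(extract_titles(SAMPLE_PAYLOAD))
```

When the data cannot be reached this way, a browser-driving tool such as Selenium is the heavier alternative.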

How to save data

Text: plain text, JSON, XML
Relational databases: MySQL, Oracle, SQL Server and so on, which store data in structured tables
Non-relational databases: MongoDB, Redis, etc., which store data in key-value form
Binary files: pictures, video, audio, etc., saved directly in their own format
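
Three of the storage options above can be sketched with the standard library. Here sqlite3 is used purely as a stand-in for a relational database such as MySQL, and the records, table name and file names are invented for the example. MongoDB and Redis are not shown because they need a running server.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Invented records standing in for scraped data.
records = [
    {"title": "first", "url": "http://example.com/1"},
    {"title": "second", "url": "http://example.com/2"},
]

# Write everything into a fresh temporary directory.
out_dir = Path(tempfile.mkdtemp())

# Text: dump the records as JSON.
json_path = out_dir / "records.json"
json_path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")

# Relational: one row per record in a structured table (sqlite3 as a stand-in).
conn = sqlite3.connect(out_dir / "records.db")
conn.execute("CREATE TABLE pages (title TEXT, url TEXT)")
conn.executemany("INSERT INTO pages (title, url) VALUES (:title, :url)", records)
conn.commit()
row_count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
conn.close()

# Binary: write the bytes unchanged. These 8 bytes are just the PNG
# signature, used as a placeholder for a downloaded image.
image_bytes = b"\x89PNG\r\n\x1a\n"
image_path = out_dir / "image.png"
image_path.write_bytes(image_bytes)

print(json_path, row_count, image_path.stat().st_size)
```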

Crawler proxies:

Because a crawler fetches pages quickly, the same IP may access a site too frequently during crawling. In that case the site may require a verification code (CAPTCHA) or a login, or block the IP outright. Using a proxy hides the real IP, so the crawl can keep running smoothly.
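
One way to route requests through a proxy, sketched with the standard library's urllib. The proxy address is a made-up placeholder, polite_fetch is a hypothetical helper, and the fixed delay between requests is an extra assumption added here to lower the access frequency; the fetch itself is not called, since it needs a real proxy.

```python
import time
import urllib.request

# Hypothetical proxy address; replace it with a proxy you are allowed to use.
PROXIES = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

# Build an opener that sends http and https requests through the proxy.
proxy_handler = urllib.request.ProxyHandler(PROXIES)
opener = urllib.request.build_opener(proxy_handler)
opener.addheaders = [("User-Agent", "Mozilla/5.0 (compatible; demo-crawler)")]


def polite_fetch(url: str, delay: float = 1.0) -> bytes:
    """Wait `delay` seconds, then fetch `url` through the proxy."""
    time.sleep(delay)
    with opener.open(url, timeout=10) as resp:
        return resp.read()


# Not called here because it needs a running proxy:
# html = polite_fetch("http://example.com/")
```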

Origin blog.csdn.net/qq_45353823/article/details/104161956