What is the Internet? What do crawlers do?
What we call "using the Internet" is really a client computer sending a request to a target server and downloading the returned data into a local process on the client.
How a user accesses network data: the browser submits a request -> the page code is downloaded -> the browser parses/renders it into a page.
What a crawler does: simulate the browser and send a request -> download the page code -> extract only the useful data -> store it in a database or a file.
Definition
A crawler is a program that simulates a browser visiting a website: it sends requests, acquires data from the Internet, then parses that data and extracts the useful parts.
The basic workflow of a crawler
1. Send a request: use an HTTP library to send a Request to the target site. The Request includes request headers, a request body, etc.
2. Get the response content: if the server responds normally, you get a Response. The Response may contain HTML, JSON, images, video, etc.
3. Parse the content: parse HTML data with regular expressions or third-party parsing libraries such as BeautifulSoup or pyquery; parse JSON data with the json module; parse binary data by writing it to a file.
4. Save the data: to a database or a file.
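Step 3 above can be sketched with the standard library alone (BeautifulSoup or pyquery would be the usual third-party choice; the HTML snippet and JSON string here are made-up stand-ins for a downloaded Response body):

```python
import json
from html.parser import HTMLParser

# Hypothetical page code, standing in for the body downloaded in step 2.
page = '<html><a href="/pic1.jpg">img</a><a href="/pic2.jpg">img</a></html>'

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag: the 'extract useful data' step."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/pic1.jpg', '/pic2.jpg']

# JSON responses are parsed with the json module instead.
data = json.loads('{"title": "hello", "views": 3}')
print(data["views"])  # 3
```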
Request and response
HTTP protocol (example page: https://home.cnblogs.com/u/waller/)
Request: the user sends their information to the server (socket server) via the browser (socket client).
Response: the server receives the request, analyzes the information the user sent, and returns data (the returned data may contain links to other resources, such as images, JS, and CSS files).
# PS: after receiving the Response, the browser parses its content and displays it to the user; a crawler, after simulating the browser's request and receiving the Response, extracts the useful data from it.
-
Request
1. Request method
Common methods: GET, POST. Other methods: HEAD, PUT, DELETE, OPTIONS.
Both GET and POST parameters are eventually assembled into the form k1=xxx&k2=yyy&k3=zzz.
POST parameters are placed in the request body (viewable in the browser's developer tools under Form Data); GET parameters are appended directly to the URL.
2. Request URL
URL stands for Uniform Resource Locator. Any resource, such as a web document, an image, or a video, can be uniquely identified by its URL.
URLs are encoded: in a request like https://www.baidu.com/s?wd=picture, the query string is percent-encoded.
Page loading: the document file is usually loaded first; while parsing the document, the browser sends a new request whenever it encounters a hyperlink, downloading images and other resources.
3. Request headers
User-Agent: if the request carries no User-Agent header, the server may treat the client as an illegal user.
Host: the target host.
Cookie: cookies are used to keep the login state.
Crawlers generally add request headers.
4. Request body
A GET request has no request body; a POST request has one, in Form Data format.
PS: 1. For login forms, file uploads, etc., the information is attached to the request body.
    2. To capture the login POST, enter a wrong username and password and then submit; after a successful login the page usually redirects immediately, so the POST cannot be captured.
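The GET/POST distinction above can be sketched with the standard library's urllib (the login endpoint and form fields are hypothetical; nothing is actually sent here, the requests are only constructed):

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: parameters are assembled as k1=xxx&k2=yyy and appended to the URL.
params = urlencode({"wd": "picture"})
get_req = Request(
    "https://www.baidu.com/s?" + params,
    headers={"User-Agent": "Mozilla/5.0"},  # without this, some servers reject you
)

# POST: parameters go in the request body (Form Data); supplying `data`
# makes urllib use the POST method automatically.
post_req = Request(
    "https://example.com/login",  # hypothetical login endpoint
    data=urlencode({"user": "a", "pwd": "b"}).encode(),
    headers={"User-Agent": "Mozilla/5.0"},
)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(post_req.data)          # b'user=a&pwd=b' -- the Form Data body
```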
-
Response
1. Response status: 200 means success; 301 means redirect; 404 means the file does not exist; 403 means no permission; 502 means a server error.
2. Response headers: Set-Cookie (there may be more than one) tells the browser to save the cookie.
3. Body (what the browser previews as the page source): the most important part; it contains the content of the requested resource, e.g. the page HTML or binary image data.
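Status, headers, and body can all be inspected from the Response object. A minimal sketch, using a tiny in-process server so the example runs offline (a real crawler would request an actual website):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class Handler(BaseHTTPRequestHandler):
    """Toy server standing in for a real website."""
    def do_GET(self):
        self.send_response(200)  # status: success
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Set-Cookie", "session=abc123")  # ask client to save a cookie
        self.end_headers()
        self.wfile.write(b"<html>hello</html>")  # the body / page source

    def log_message(self, *args):  # silence request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

resp = urlopen(f"http://127.0.0.1:{server.server_port}/")
print(resp.status)                  # 200 -> success
print(resp.headers["Set-Cookie"])   # the cookie the server told us to save
body = resp.read()                  # the most important part: the resource content
server.shutdown()
```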
Crawler classification
1. General crawler: fetches the data of an entire page.
2. Focused crawler: fetches specified data within a given page.
3. Incremental crawler: detects updates to a website's data and crawls only the site's newly updated data.
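An incremental crawler must remember what it has already fetched so that only new data is crawled on the next run. A minimal sketch using content fingerprints (the in-memory `seen` set is an assumption for illustration; real crawlers usually persist fingerprints in a store such as Redis):

```python
import hashlib

seen = set()  # fingerprints of data already crawled (persist this in practice)

def is_new(content: bytes) -> bool:
    """Return True only the first time this exact content is seen."""
    fingerprint = hashlib.md5(content).hexdigest()
    if fingerprint in seen:
        return False
    seen.add(fingerprint)
    return True

print(is_new(b"<html>post 1</html>"))  # True: new data, crawl it
print(is_new(b"<html>post 1</html>"))  # False: already crawled, skip
print(is_new(b"<html>post 2</html>"))  # True: the site was updated
```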
"Attack" and "defense"
"Attack":
Website using relevant techniques or strategies to block crawlers crawling the site data - UA detection: the server through the value of the UA request in advance of a request to determine the identity of the carrier's request
# Robots protocol (gentleman's agreement) is commonly used in portals, declare what data can climb, which data can not climb EG: HTTPS: //www.taobao.com/robots.txt
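Python's standard library can check the Robots protocol for you. A minimal sketch that parses made-up robots.txt rules inline so it runs offline (a real crawler would load the site's actual robots.txt, e.g. with `set_url` and `read`):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The gentleman's agreement: check before crawling.
print(rp.can_fetch("*", "https://example.com/index.html"))  # True: allowed
print(rp.can_fetch("*", "https://example.com/private/x"))   # False: off-limits
```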
"Defense":
The crawler gets the data anyway by cracking the anti-crawling mechanisms. # PS: different sites use different strategies, so each needs to be analyzed individually.
Commonly used header information:
Request headers
- User-Agent: the identity of the request's sender (the sender may be a browser or a crawler).
- Connection: Keep-Alive | Close. With Close, as soon as the crawler's request succeeds, the connection for that request/response is disconnected immediately.
Response headers
- Content-Type: the format in which the server is returning the response data, so the client can handle the data accordingly.
Supplement
HTTP protocol: specifies the form in which client and server exchange data.
HTTPS protocol: secure HTTP; HTTP with a security layer (SSL encryption) added on top.
- Symmetric-key encryption (insecure): the encrypted data and the encryption rule (the secret key) are both sent to the server, so anyone who intercepts the transmission can also grab the key and decrypt the data.
- Asymmetric-key encryption (low efficiency, still unsafe): the server creates the encryption key (public key) and the decryption key (private key); the client obtains the public key, encrypts its data, and sends it back. Unsafe because the client cannot be sure the public key it received really came from the server.
- Certificate-based key encryption: the server's public key -> certificate authority -> a signed certificate is issued -> the client verifies the certificate -> encrypts the data with the public key -> sends it to the server.
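The symmetric-key case can be illustrated with a toy XOR cipher. This is NOT real cryptography, only a demonstration of the point in the text: one shared key both encrypts and decrypts, so anyone who intercepts the key reads everything:

```python
def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: the same key both encrypts and decrypts."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = b"secret"  # both sides must share this single key
ciphertext = xor_cipher(b"hello", key)
plaintext = xor_cipher(ciphertext, key)  # applying the same key again decrypts

print(ciphertext != b"hello")  # True: the data is scrambled in transit
print(plaintext)               # b'hello'
# The weakness: if the key travels alongside the data, an eavesdropper
# who captures both can run the exact same decryption.
```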
PS: a crawler is just a tool; it is neither good nor bad in itself.