Crawlers - Getting to Know Crawlers

What is the Internet, and what do crawlers do?

What we call the Internet is, in essence, a client computer sending a request to a target computer and downloading the target computer's data into a local process.

How users access network data: the browser submits a request -> the page code is downloaded -> it is parsed/rendered into a page.

What a crawler does: simulate the browser sending a request -> download the page code -> extract only the useful data -> store it in a database or a file.

Definition

A program that simulates a browser sending requests to a website, fetches data from the Internet, and then parses it and extracts the useful data.

The basic flow of a crawler

1. Initiate a request
   Use an HTTP library to send a request to the target site, i.e. send a Request; the Request includes the request headers, the request body, etc.

2. Get the response content
   If the server responds normally, you get a Response; the Response may contain html, json, images, video, etc.

3. Parse the content
   Parsing html data: regular expressions, or third-party parsing libraries such as BeautifulSoup and pyquery
   Parsing json data: the json module
   Parsing binary data: write it to a file in "wb" mode

4. Save the data
   To a database or a file (an end-to-end sketch of these four steps follows this list)
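A minimal end-to-end sketch of the four steps above, using the requests and BeautifulSoup libraries; the URL and the extracted field are placeholders, not taken from the original post:

    import requests
    from bs4 import BeautifulSoup

    # 1. Initiate a request (the URL is a placeholder)
    resp = requests.get("https://example.com",
                        headers={"User-Agent": "Mozilla/5.0"})

    # 2. Get the response content
    html = resp.text

    # 3. Parse the content and extract only the useful data
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else ""

    # 4. Save the data (to a file here; a database works the same way)
    with open("result.txt", "w", encoding="utf-8") as f:
        f.write(title)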

Request and response

HTTP protocol: https://home.cnblogs.com/u/waller/

Request: the user sends their own information to the server (socket server) via the browser (socket client)

Response: the server receives the request, analyzes the information the user sent, and returns data (the returned data may contain other links, e.g. to images, js, css, etc.)

# PS: after the browser receives the Response, it parses its content and displays it to the user; a crawler simulates the browser sending a request, and after receiving the Response it extracts the useful data from it.
  • Request

1. Request methods:
    Common request methods: GET, POST
    Other request methods: HEAD, PUT, DELETE, OPTIONS
'''
POST and GET parameters are ultimately assembled into a form like: k1=xxx&k2=yyy&k3=zzz
POST request parameters go in the request body:
    viewable in the browser, stored in Form Data
GET request parameters go directly in the URL
(both cases are sketched below)
'''
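A short sketch of the difference, assuming the requests library and using httpbin.org as a throwaway test endpoint:

    import requests

    payload = {"k1": "xxx", "k2": "yyy", "k3": "zzz"}

    # GET: the parameters are appended to the URL
    r = requests.get("https://httpbin.org/get", params=payload)
    print(r.url)           # https://httpbin.org/get?k1=xxx&k2=yyy&k3=zzz

    # POST: the same key-value pairs travel in the request body (Form Data)
    r = requests.post("https://httpbin.org/post", data=payload)
    print(r.request.body)  # k1=xxx&k2=yyy&k3=zzz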
2. Request URL
    URL stands for Uniform Resource Locator; a web document, an image,
    a video, etc. can each be uniquely identified by a URL
'''
URL encoding
    https://www.baidu.com/s?wd=图片
    图片 ("picture") is URL-encoded (see the sketch below)
'''
'''
How a page loads:
    loading a page usually loads the document file first;
    while parsing the document, whenever a hyperlink such as an image link is encountered, a request is initiated for it
'''
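A minimal sketch of that encoding with the standard library's urllib.parse:

    from urllib.parse import quote, urlencode

    # the Chinese word 图片 cannot appear raw in a URL, so it is percent-encoded
    print(quote("图片"))   # %E5%9B%BE%E7%89%87
    print("https://www.baidu.com/s?" + urlencode({"wd": "图片"}))
    # -> https://www.baidu.com/s?wd=%E5%9B%BE%E7%89%87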
3. Request headers
    User-Agent: if the request headers carry no User-Agent,
    the server may treat you as an illegal user
    Host
    Cookie: cookies are used to keep the login state
'''
a crawler will generally add request headers (a sketch follows)
'''
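A sketch of attaching these headers with requests; the Cookie value and the site are placeholders:

    import requests

    headers = {
        # pretend to be a normal browser so the server does not flag us
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        # placeholder cookie that would keep the login state
        "Cookie": "sessionid=xxxx",
    }
    resp = requests.get("https://www.example.com", headers=headers)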
4. Request body
    For a GET request, there is no request body
    For a POST request, the request body is Form Data
'''
PS:
1. For login windows, file uploads, etc., the information is attached to the request body (sketched below)
2. To see the POST, log in with a wrong username and password, then submit; after a successful login the page usually redirects right away, and then the POST can no longer be captured
'''
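A hypothetical login POST illustrating the request body; the URL and the field names are assumptions, so read the real ones from the Form Data shown in the browser:

    import requests

    # hypothetical login endpoint and form field names
    form = {"username": "user", "password": "wrong-password"}
    resp = requests.post("https://www.example.com/login", data=form)
    print(resp.status_code)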
  • Response

1. Response status
     200: success
     301: redirect
     404: file not found
     403: no permission
     502: server error

2. Response headers
    Set-Cookie: there may be more than one; it tells the browser to save the cookie
    
3. Preview is the page source code
    The most important part: it contains the requested resource content,
    such as the page html, images, binary data, etc. (see the sketch below)
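How these three parts look from a crawler's side, sketched with requests (the URL is a placeholder):

    import requests

    resp = requests.get("https://example.com")

    print(resp.status_code)                  # 200, 301, 404, 403, 502, ...
    print(resp.headers.get("Set-Cookie"))    # cookies the server asks us to keep
    print(resp.headers.get("Content-Type"))

    html = resp.text      # text content, e.g. the page html
    raw = resp.content    # binary data, e.g. an image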

 

Crawler classification

1. General crawler:
   fetches the data of an entire page
2. Focused crawler:
   fetches specified data within a given page
3. Incremental crawler:
   used to detect updates to a website's data and crawl only the site's newly updated data

 

"Attack" and "defense"

"Attack": anti-climb mechanism

Website using relevant techniques or strategies to block crawlers crawling the site data
 - UA detection: the server through the value of the UA request in advance of a request to determine the identity of the carrier's request

 

# Robots protocol (a gentleman's agreement)
Commonly used by portal sites; it declares which data may be crawled and which may not (the sketch below reads it in code)
e.g.:
    https://www.taobao.com/robots.txt
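The standard library can check this agreement before crawling; the product URL below is hypothetical:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://www.taobao.com/robots.txt")
    rp.read()  # download and parse the robots.txt
    # may this user agent fetch this (hypothetical) URL?
    print(rp.can_fetch("Baiduspider", "https://www.taobao.com/product"))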

"Defense": anti-anti-climb policy

Let crawlers get the data through the crack anti-climb mechanism
 # PS: different sites different strategies need to be analyzed

 

Commonly used header information:
    Request headers
     - User-Agent: the identity of the request's carrier
        carrier: a browser, a crawler
     - Connection: keep-alive | close
        close: once a crawler's request has succeeded, the connection corresponding to that request is disconnected immediately (see the sketch below)
    Response headers
     - Content-Type: the format of the response data; the server responds with data in the corresponding format
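A sketch of sending Connection: close from a crawler (the URL is a placeholder):

    import requests

    headers = {
        "User-Agent": "Mozilla/5.0",
        "Connection": "close",  # disconnect as soon as this request is done
    }
    resp = requests.get("https://example.com", headers=headers)
    print(resp.headers.get("Content-Type"))  # the format the server responded in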

Supplement

HTTP protocol: the form in which client and server exchange data, as prescribed by the protocol
HTTPS protocol: secure HTTP; on top of HTTP a security layer (SSL encryption) is added
     - Symmetric-key encryption (insecure):
        the encrypted data and the encryption rule (the secret key) are both sent to the server, so an interceptor can crack them
     - Asymmetric-key encryption (low efficiency, and still unsafe: the public key the client gets is not necessarily from the server):
        the server creates an encryption method (public key) and a decryption method (private key); the client obtains the public key, encrypts the data, and transmits it
     - Certificate-based key encryption (see the sketch below):
        public key -> certificate authority -> certificate issued -> client -> encrypted data -> server
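A small sketch of the certificate step with the standard ssl module: the client only trusts a public key that arrives inside a certificate issued by a trusted authority:

    import socket
    import ssl

    hostname = "www.taobao.com"
    context = ssl.create_default_context()  # loads the trusted CA certificates

    with socket.create_connection((hostname, 443)) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            print(tls.version())                 # negotiated TLS version
            print(tls.getpeercert()["subject"])  # certificate the CA issued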

 

 

PS: a crawler is just a tool, neither good nor bad.
