August classroom -- Python crawler (Spider) basics

1. Crawler (Spider)

 

A crawler is an automated program that requests websites and extracts web content. Once the HTML code has been acquired, the required data has to be extracted from that text.

HTTP: the HyperText Transfer Protocol is the most widely used network protocol on the Internet. It is a request/response standard between a client and a server, carried over TCP, and is used to transfer hypertext from a WWW server to the local browser. It makes browsing more efficient and reduces network traffic.

HTTPS: an HTTP channel designed for security; put simply, it is the secure version of HTTP. HTTPS adds an SSL layer underneath HTTP. SSL is the security foundation of HTTPS, so the details of the encryption are handled by SSL.

SSL (Secure Sockets Layer): a security protocol that provides security and data integrity for network communications. SSL encrypts the network connection at the transport layer.
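
As a small illustration of the point above, Python's standard ssl module (which also covers TLS, the successor to SSL) can build the security settings that an HTTPS connection relies on. This is only a sketch: it creates a default context and inspects it, and does not open any connection.

```python
import ssl

# Build a context with the library's recommended defaults.
context = ssl.create_default_context()

# The default context requires a valid server certificate...
print("certificate required:", context.verify_mode == ssl.CERT_REQUIRED)
# ...and checks that the certificate matches the host name.
print("host name checked:", context.check_hostname)
```

A context like this can be passed to urllib.request.urlopen through its context parameter when requesting an https:// URL.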

 

The public platform's interfaces no longer support calls over HTTP: after December 30, 2017, all sites must call them over HTTPS.

URL (Uniform Resource Locator)

Basic format: scheme://host[:port#]/path/.../[?query-string][#anchor]
scheme: the protocol, such as HTTP, HTTPS, or FTP
host: the server's IP address or domain name, such as 192.168.0.11
port#: the server's port (the default port is 80 for HTTP and 443 for HTTPS)
path: the path to the resource being accessed
query-string: the parameters, that is, the data sent to the HTTP server
anchor: an anchor (jumps to a specified anchor position within the page)
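
The parts listed above can be pulled out of a URL with the standard urllib.parse module. The sketch below uses a made-up URL (the port, path, query parameters, and anchor are assumptions chosen only to exercise every part), and effective_port is a hypothetical helper that falls back to the default ports mentioned above.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL that contains every part of the basic format.
url = "https://192.168.0.11:8443/path/to/page.html?keyword=spider&page=2#section1"

parts = urlsplit(url)

# Default ports used when the URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(split_url):
    """Return the explicit port, or the default port for the scheme."""
    return split_url.port or DEFAULT_PORTS.get(split_url.scheme)


print("scheme:", parts.scheme)
print("host:", parts.hostname)
print("port:", effective_port(parts))
print("path:", parts.path)
print("query-string:", parse_qs(parts.query))
print("anchor:", parts.fragment)
```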
 
 
GET obtains data from the server; POST sends data to the server.
On the client side, GET submits data through the URL, so the data is visible in the URL; POST places the data in the body of the HTTP request.

A GET request is like running a query against a database: it does not affect the data in the database itself.
A POST request is like running a modification against a database: it does affect the data in the database itself (for example registering, posting, commenting, or earning points, after which the state of the server's resources has changed).
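
The difference in where the data travels can be seen by building both kinds of request with the standard urllib.request module. This is a minimal sketch: the example.com addresses and the parameter names are assumptions, and the requests are only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical parameters to submit.
params = {"keyword": "spider", "page": "1"}
encoded = urlencode(params)

# GET: the parameters travel in the URL's query string.
get_req = Request("https://example.com/search?" + encoded)

# POST: the parameters travel in the request body, as bytes.
post_req = Request("https://example.com/comment", data=encoded.encode("utf-8"))

for req in (get_req, post_req):
    print(req.get_method(), req.full_url, req.data)
```

Passing either object to urllib.request.urlopen would actually send it; urllib picks POST automatically whenever data is supplied.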
 
 
Simple example:
 https://www.cnblogs.com/zhaof/p/6910871.html
 
Python crawler frameworks: Python's urllib package provides a fairly complete API for accessing web documents.
Simulating browser behavior means simulating a user agent and constructing appropriate requests, for example simulating a user login or simulating session / cookie storage and settings. Python has very good third-party packages that wrap this up, such as Requests and mechanize.
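
As a rough sketch of simulating a browser, the example below sticks to the standard library (urllib and http.cookiejar) rather than Requests or mechanize, so it runs without installing anything. The User-Agent string, the example.com address, and the make_request helper are assumptions made for illustration; nothing is sent over the network.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, Request, build_opener

# The cookie jar stores cookies set by the server, so that later requests
# made through the same opener belong to the same session.
cookie_jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(cookie_jar))

# Hypothetical User-Agent string that makes the request look like a browser.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleCrawler/0.1",
}


def make_request(url):
    """Build a request that carries the browser-like headers."""
    return Request(url, headers=BROWSER_HEADERS)


req = make_request("https://example.com/")
print(req.header_items())
# opener.open(req) would send the request and fill cookie_jar with any
# cookies the server returns; it is left out here to avoid network access.
```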
Python's BeautifulSoup provides concise document-processing functions, so very short code can handle most documents.
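
To show what extracting data from acquired HTML looks like, here is a small sketch. Because BeautifulSoup is a third-party package, the sketch swaps in the standard library's html.parser so that it runs as-is; the LinkExtractor class and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect an (href, text) pair for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Made-up page standing in for HTML fetched by the crawler.
SAMPLE_HTML = """
<html><body>
  <h1>Posts</h1>
  <a href="/p/1.html">First post</a>
  <a href="/p/2.html">Second <b>post</b></a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(SAMPLE_HTML)
print(parser.links)
```

With BeautifulSoup installed, the same links could be collected much more briefly with BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("a"), which is the kind of short code the text above refers to.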
 
 

Origin www.cnblogs.com/liurg/p/11144325.html