One Crawler Principles
A crawler, or web spider, can be understood as a spider crawling across a web: the Internet is a huge net, and the crawler is a spider moving along its strands. Whenever it encounters a resource, it can grab it. What it grabs is entirely up to you.
For example, the crawler fetches a web page and finds a path in it, that is, a hyperlink to another page. It can then follow that link to the other page and fetch its data. In this way every node of the big net is within the spider's reach, and crawling the whole thing down is only a matter of time.
1. What is the Internet?
The Internet is a collection of network devices that connect computers together into one network; the whole is called the Internet.
2. Why was the Internet built?
The Internet was built so that data could be transferred and shared.
3. What is data?
..........
4. The full process of accessing the Internet
1. How an ordinary user gets data:
open a browser -> send a request to a target site -> fetch the response data -> render it in the browser
2. How a crawler gets data:
simulate a browser -> send a request to a target site -> fetch the response data -> extract the valuable data -> persist the data
5. The process of browsing a web page
Web requests use the HTTP protocol (HTTPS = HTTP + SSL).
While browsing the web, users may see many nice pictures, for example at http://image.baidu.com/, where we see images alongside the Baidu search box. What actually happens is: the user enters a URL, a DNS server resolves it to the host server, the browser sends a request to that server, the server handles the request and sends back HTML, JS, CSS, and other files, and the browser parses them so the user sees the rendered page, pictures and all.
So the page a user sees is essentially HTML code, and that is exactly what a crawler fetches; it obtains images, text, and other resources by analyzing and filtering that HTML code.
6. The full crawling process
- Send a request (requires a request library: Requests, Selenium, urllib)
- Fetch the response data (once a request reaches the server, the server returns response data)
- Parse and extract data (requires a parsing library: re (regular expressions), BeautifulSoup4, XPath)
- Save locally (plain files, databases such as MongoDB)
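The four steps above can be sketched in a few lines of Python. This is a minimal outline, not a complete crawler; the regular expression, the sample HTML, and the file path are illustrative assumptions.

```python
import re

def fetch(url):
    # Steps 1-2: send a request and fetch the response data
    import requests  # assumes the requests package is installed
    return requests.get(url).text

def parse_links(html):
    # Step 3: parse and extract data; here, absolute hyperlinks via a regex
    return re.findall(r'href="(http[^"]+)"', html)

def save(lines, path):
    # Step 4: persist the extracted data to a local file
    with open(path, 'w') as f:
        f.write('\n'.join(lines))

# the parsing step works on any HTML string, no network needed
html = '<a href="http://example.com/page1">one</a> <a href="http://example.com/page2">two</a>'
print(parse_links(html))  # ['http://example.com/page1', 'http://example.com/page2']
```

In a real crawler the output of `parse_links` would be fed back into `fetch`, which is exactly the spider-following-links picture from section One.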
7. What a URL is
URL stands for Uniform Resource Locator. It is a concise way of stating where a resource on the Internet is located and how to access it; in short, it is the address of a resource on the Internet. Every file on the Internet has a unique URL, which encodes where the file is located and how the browser should handle it.
A URL consists of three parts:
- the protocol (also called the service mode)
- the IP address or hostname of the host holding the resource (sometimes including a port number)
- the specific address of the resource on the host, such as a directory and file name
A crawler must have a target URL before it can fetch any data, so the URL is the fundamental starting point of every crawler; understanding it precisely is a great help when learning to write crawlers.
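The three parts can be seen directly with the standard library's urlparse; the URL below is made up for illustration.

```python
from urllib.parse import urlparse

parts = urlparse('http://image.baidu.com:80/search/index?tn=baiduimage')
print(parts.scheme)    # 'http' -> the protocol (service mode)
print(parts.hostname)  # 'image.baidu.com' -> host holding the resource
print(parts.port)      # 80 -> the port number, when present
print(parts.path)      # '/search/index' -> address of the resource on the host
```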
Two The Requests Library
1. What is Requests?
Requests is an HTTP library written in Python on top of urllib, released under the Apache2 License. It is more convenient than urllib, saves a lot of work, and fully covers the needs of HTTP testing.
2. Sending requests
1) Request methods
① GET: request the specified page and return its entity body; sends a request directly to obtain data.
② POST: ask the server to accept the enclosed data as a new subordinate of the resource identified by the URI.
③ HEAD: request only the page's headers.
④ PUT: replace the specified document's content with data transmitted from the client.
⑤ DELETE: ask the server to delete the specified page.
Usage:
import requests
requests.get('http://httpbin.org/get')
requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')
2) Response properties and methods
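The source leaves this subsection empty. As an illustrative sketch (assuming the requests package is installed), the snippet below spins up a throwaway local HTTP server so it needs no network access, then inspects the most commonly used response attributes; any URL behaves the same way.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # serve a tiny fixed JSON body
        body = b'{"hello": "world"}'
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

response = requests.get('http://127.0.0.1:%d/' % server.server_port)
print(response.status_code)              # 200 -> the status code
print(response.headers['Content-Type'])  # headers, a dict-like object
print(response.text)                     # body decoded to str
print(response.content)                  # raw body as bytes
print(response.json())                   # body decoded as JSON
print(response.url)                      # final URL after any redirects
server.shutdown()
```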
3) Response status codes
1xx: provisional response (the request was received; the requester should continue the operation).
2xx: success (the request was processed successfully).
3xx: redirection (further action is needed to complete the request; these codes are used for redirects).
4xx: request error (the request is probably at fault, which prevented the server from processing it).
5xx: server error (an internal error occurred while the server tried to process the request; the fault is usually the server's, not the request's).
Common response codes: 200 - the server returned the page successfully; 404 - the requested page does not exist; 503 - service unavailable.
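The leading digit alone determines the class of a status code, a convention the small helper below makes explicit (the function is my own illustration, not part of Requests):

```python
def status_class(code):
    # the first digit of an HTTP status code identifies its response class
    classes = {
        1: 'provisional response',
        2: 'success',
        3: 'redirection',
        4: 'request error',
        5: 'server error',
    }
    return classes[code // 100]

print(status_class(200))  # success
print(status_class(404))  # request error
print(status_class(503))  # server error
```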
4) The get method: requests.get(url, params=None, **kwargs)
url: required; the link of the page to fetch
params: extra URL parameters, as a dict or an unencoded byte stream; commonly used when sending GET requests
**kwargs: 12 access-control parameters
data: a dict specifying form data; commonly used when sending POST requests
import requests
url = 'http://www.httpbin.org/post'
data = {
    'users': 'value1',
    'key': 'value2'
}
response = requests.post(url=url, data=data)
print(response.text)
headers: a dict specifying the request headers
# Just pass a dict to the headers parameter; Requests will not change its own
# behavior based on the custom headers. All the header information is simply
# passed along with the final request.
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
}
response = requests.get('https://www.zhihu.com/explore', headers=headers)
print(response.text)
proxies: a dict specifying proxy settings
# Configure proxies with the proxies parameter; a proxy may require password
# authentication, and SOCKS proxies can also be used
import requests
proxies = {
    'http': 'http://127.0.0.1:9999',    # proxy address and port
    'https': 'http://127.0.0.1:8888'
}
response = requests.get('https://www.baidu.com', proxies=proxies)
print(response.text)
'''
proxies = {'http': 'http://user:[email protected]:3128/'}
proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}
'''
cookies: a dict specifying cookies; used to maintain a session
# Get cookies, and send cookies to the server with the cookies parameter
import requests
response = requests.get('url')
print(response.cookies)                          # get the cookies
# print(response.cookies['example_cookie_name'])   # get one specific cookie
for key, value in response.cookies.items():      # iterate over all cookie attributes
    print(key + '=' + value)
cookies = {'cookies_are': 'working'}
request = requests.get('http://httpbin.org/cookies', cookies=cookies)  # set the cookies parameter
# Keeping a session
'''
Cookies are used to simulate a logged-in state and to maintain a session, so that
a simulated login behaves like one continuous browser session. The cookie content
comes from the site you logged in to. If you send multiple requests to the same
host, the underlying TCP connection is reused, which brings a significant
performance improvement.
'''
import requests
s = requests.Session()  # Session() keeps the server-side login state across requests
s.get('http://httpbin.org/cookies/set/sessioncookie/123456')
response = s.get('http://httpbin.org/cookies')
print(response.text)
# A Session can also be used as a context manager, which guarantees the session
# is closed when the with block exits, even if an exception occurs.
with requests.Session() as s:
    s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
auth: a tuple specifying the account and password for login authentication
import requests
url = 'http://www.httpbin.org/basic-auth/user/password'
auth = ('user', 'password')
response = requests.get(url=url, auth=auth)
print(response.text)
verify: a Boolean specifying whether the site's certificate should be verified; defaults to True. If you do not want to verify the certificate, set it to False.
'''
If verify is set to False, Requests ignores SSL certificate validation,
but it will emit a warning.
'''
requests.get('https://kennethreitz.org', verify=False)
# 1. ignore the warning
import requests
from requests.packages import urllib3
urllib3.disable_warnings()
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)  # >>> 200
# 2. pass in a certificate to verify
requests.get('https://github.com', verify='/path/to/certfile')
# If verify is set to a folder path, the folder must have been processed with
# the c_rehash tool provided by OpenSSL.
# s = requests.Session()        # or keep the setting on a session
# s.verify = '/path/to/certfile'
files: a dict specifying files to upload
# Uploading a file with a POST request
import requests
files = {'files': open('favicon.ico', 'rb')}  # the uploaded file, opened for binary reading
response = requests.post('http://httpbin.org/post', files=files)
print(response.text)
timeout: specifies a timeout in seconds; if no response arrives within that time, an exception is raised
"" " Tell requests to stop after a set number of seconds to timeout parameter of waiting for a response, if the server does not answer within timeout seconds, will raise an exception ." "" Import requests Request = requests.get ( ' HTTP: / /www.google.com.hk ' , timeout = 0.01 ) Print (request.url)
Exception handling
Every exception that Requests raises explicitly inherits from requests.exceptions.RequestException.
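That hierarchy means one except clause can act as a catch-all. The wrapper below is my own sketch (assuming the requests package is installed); the malformed scheme in the usage line triggers an exception locally, without any network access.

```python
import requests
from requests.exceptions import ConnectionError, RequestException, Timeout

def safe_get(url, timeout=3):
    # illustrative wrapper: handle specific exceptions first, then the base class
    try:
        return requests.get(url, timeout=timeout)
    except Timeout:
        print('the request timed out')
    except ConnectionError:
        print('could not connect to the server')
    except RequestException as exc:
        # every exception Requests raises explicitly lands here at the latest
        print('request failed:', exc)
    return None

result = safe_get('htp://example.com')  # malformed scheme, fails locally
print(result)  # None
```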
5) Request header fields
User-Agent: the user agent (proves the request is sent by a real device and browser)
Cookie: real user login information (proves you are a user of the target site)
requests.Session() maintains cookie information
Referer: the previously visited URL (proves you navigated from within the target site)
6) Request body
POST requests carry a request body.
Form Data: {'user': 'Berlin', 'pwd': '123'}
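To see the request body that this Form Data produces without sending anything over the network, the request can be built and prepared locally (the target URL here is only a placeholder):

```python
import requests

req = requests.Request('POST', 'http://httpbin.org/post',
                       data={'user': 'Berlin', 'pwd': '123'})
prepared = req.prepare()  # builds the request without sending it
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded
print(prepared.body)                     # user=Berlin&pwd=123
```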