Day03 Crawler principles

One. Crawler principles

      A crawler, i.e. a web spider, can be pictured as a spider crawling over a web: if the Internet is one huge web, the crawler is the spider that moves across it, and whenever it runs into a resource it wants, it crawls it down. What it crawls is entirely up to you.

      For example, while crawling one page it finds a path, which is really a hyperlink to another page; it can then follow that link and fetch the data there. In this way every corner of the big web is within the spider's reach, and crawling everything down is only a matter of time.

  1. What is the Internet?

    The Internet is made up of network devices and individual computers connected together into one network.

  2. Why was the Internet built?

    The Internet was built so that data can be transferred and shared.

  3. What is data?

    ..........

  4. The full process of using the Internet

    1. How an ordinary user gets data:

      Open the browser -> send a request to the target site -> fetch the response data -> render it in the browser

    2. How a crawler gets data:

      Simulate a browser -> send a request to the target site -> fetch the response data -> extract the valuable data -> persist it to storage

  5. The process of browsing the web

    Requests use the HTTP protocol; HTTPS = HTTP + SSL.

    While browsing the web, users see many nice pages and pictures, for example at http://image.baidu.com/ we see images and the Baidu search box. What actually happens is this: the user enters a URL, the DNS server resolves it to locate the host server, the browser sends a request to that server, the server handles the request and returns HTML, JS, CSS and other files to the user's browser, and the browser parses them so the user can see the pictures and the rest of the page.

    Therefore, the page the user sees is essentially built from HTML code, and that is exactly what a crawler fetches: it obtains images, text and other resources by analyzing and filtering the HTML code.

  6. The whole crawling process

    - Send a request (request libraries: Requests, Selenium, urllib)

    - Fetch the response data (as long as the request reaches the server, the server will return response data)

    - Parse and extract data (parsing libraries: re (regular expressions), BeautifulSoup4, XPath)

    - Save locally (file handling, databases, MongoDB); see the end-to-end sketch below
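    A minimal sketch of the four steps above, assuming the requests library is installed; the target URL, the regular expression, and the output file name page_title.txt are made up for illustration:

import re
import requests

# 1. send a request
response = requests.get('http://httpbin.org/html')

# 2. fetch the response data
html = response.text

# 3. parse and extract data (here: the text of the first <h1> tag)
match = re.search(r'<h1>(.*?)</h1>', html, re.S)
title = match.group(1) if match else ''

# 4. save locally
with open('page_title.txt', 'w', encoding='utf-8') as f:
    f.write(title)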

  7. The meaning of a URL

    URL stands for Uniform Resource Locator, i.e. the web address. A URL is a concise way of stating where a resource is available on the Internet and how to access it; in short, it is the address of a resource on the Internet. Every file on the Internet has a unique URL, which carries information about where the file is located and how the browser should handle it.

    A URL consists of three parts:

      - the protocol (also called the access scheme)

      - the IP address or host name of the server holding the resource (sometimes including the port number)

      - the specific address of the resource on that host, such as the directory and file name

    A crawler must have a target URL before it can fetch any data, so the URL is the fundamental starting point for a crawler's access to data; understanding it accurately is a great help when learning to write crawlers.
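    A small sketch of splitting a URL into those three parts with the standard-library urllib.parse; the example URL is a made-up placeholder:

from urllib.parse import urlparse

# example URL, made up for illustration
parts = urlparse('http://www.example.com:8080/images/logo.png')

print(parts.scheme)   # protocol: 'http'
print(parts.netloc)   # host address (and port): 'www.example.com:8080'
print(parts.path)     # specific resource path on the host: '/images/logo.png'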

Two. The Requests request library

  1. What is Requests

    Requests is an open-source HTTP library written in Python, built on top of urllib and released under the Apache2 License. It is more convenient than urllib, saves a lot of work, and fully covers everyday HTTP testing needs.

  2. Send Request

     1) The various request methods

      ① GET: requests the page specified by the URL and returns its entity body; sends a request directly to obtain data.

      ② POST: asks the server to accept the enclosed data as a new subordinate of the resource identified by the URI.

      ③ HEAD: requests only the headers of the page.

      ④ PUT: data sent from the client to the server replaces the content of the specified document.

      ⑤ DELETE: asks the server to delete the specified page.

      Instructions:

          import requests

          requests.get('http://httpbin.org/get')

          requests.post('http://httpbin.org/post')

          requests.put('http://httpbin.org/put')

          requests.delete('http://httpbin.org/delete')

          requests.head('http://httpbin.org/get')

          requests.options('http://httpbin.org/get')

    2) Response properties and methods
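      A minimal sketch of the commonly used properties and methods of the Response object that Requests returns:

import requests

response = requests.get('http://httpbin.org/get')

print(response.status_code)   # HTTP status code, e.g. 200
print(response.headers)       # response headers (dict-like)
print(response.url)           # final URL after any redirects
print(response.encoding)      # encoding used to decode response.text
print(response.text)          # response body decoded to a str
print(response.content)       # raw response body as bytes
print(response.cookies)       # cookies set by the server
print(response.json())        # body parsed as JSON (httpbin returns JSON)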

          

    3) Response status codes

           1xx: informational (a provisional response; the requester should continue the request).

           2xx: success (the request was processed successfully).

           3xx: redirection (further action is needed to complete the request; these status codes are generally used for redirects).

           4xx: client error (the request is probably at fault, which prevents the server from processing it).

           5xx: server error (an internal error occurred while the server was trying to process the request; these errors usually lie with the server itself, not with the request).

           Common response codes: 200 - the server returned the page successfully; 404 - the requested page does not exist; 503 - service unavailable.
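           A minimal sketch of branching on the status code; raise_for_status() raises an HTTPError for any 4xx/5xx response:

import requests

response = requests.get('http://httpbin.org/status/404')

if response.status_code == 200:
    print('page fetched successfully')
else:
    print('got status', response.status_code)

# or let Requests raise the exception itself for 4xx/5xx codes
try:
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    print('HTTP error:', e)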

    4) The get method: requests.get(url, params=None, **kwargs)

        url: required; the link of the page to fetch

        params: extra parameters carried in the URL, as a dictionary or byte stream (no manual encoding needed); commonly used when sending a GET request (see the sketch below)
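        A minimal sketch of passing query parameters through params; the keys key1/key2 are made-up placeholders:

import requests

params = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('http://httpbin.org/get', params=params)

print(response.url)    # http://httpbin.org/get?key1=value1&key2=value2
print(response.text)   # httpbin echoes the query arguments back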

        **kwargs: 12 optional access-control parameters

          data: a dictionary specifying form data, commonly used when sending a POST request

import requests

url = 'http://www.httpbin.org/post'
data = {
    'users': 'value1',
    'key': 'value2'
}
response = requests.post(url=url, data=data)
print(response.text)

          headers: a dictionary specifying the request headers

# Just pass a dict to the headers parameter; Requests does not change its own behavior based on
# the custom headers. All of the header information is simply passed along with the final request.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
}
response = requests.get('https://www.zhihu.com/explore', headers=headers)
print(response.text)

          Proxy settings, proxies: a dictionary specifying the proxies to use

# Configure a proxy through the proxies parameter; a proxy can also carry password
# authentication, and SOCKS proxies are supported as well
import requests

proxies = {
    'http': 'http://127.0.0.1:9999',    # proxy address and port
    'https': 'http://127.0.0.1:8888'
}
response = requests.get('https://www.baidu.com', proxies=proxies)
print(response.text)
'''
proxies = {'http': 'http://user:pass@host:3128/'}
proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}
'''

          cookies: a dictionary specifying cookies; also used to keep a session alive

# Get cookies, and send cookies to the server with the cookies parameter
import requests

response = requests.get('url')                      # placeholder URL
print(response.cookies)                             # get the cookies
# print(response.cookies['example_cookie_name'])    # get one specific cookie
for key, value in response.cookies.items():         # iterate over every cookie's name and value
    print(key + '=' + value)

cookies = {'cookies_are': 'working'}                # set the cookies parameter
request = requests.get('http://httpbin.org/cookies', cookies=cookies)

# session keeping
'''
Cookies are used to simulate being logged in and to keep a session alive, so a simulated login
behaves as if you stayed on the same browser page. Get the cookies of the site after logging in;
if you then send several requests to the same host, the underlying TCP connection is reused,
which brings a significant performance improvement.
'''
s = requests.Session()    # Session() simulates the login process; the login state is kept on the session
s.get('http://httpbin.org/cookies/set/sessioncookie/123456')
response = s.get('http://httpbin.org/cookies')
print(response.text)

# A session can also be used as a context manager, so the with block guarantees the session is
# closed on exit, even if an exception occurs.
with requests.Session() as s:
    s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')

          Authentication settings, auth: a tuple specifying the login account and password

import requests
url = 'http://www.httpbin.org/basic-auth/user/password'
auth = ('user','password')
response = requests.get(url=url,auth=auth)
print(response.text)

          Certificate verification, verify: a boolean specifying whether the site's SSL certificate should be verified; the default is True. If you do not want to verify the certificate, set it to False.

'''If verify is set to False, Requests ignores SSL certificate validation, but it will emit a warning.'''
requests.get('https://kennethreitz.org', verify=False)

# 1. ignore the warnings
# 2. or pass in a certificate to verify against
import requests
from requests.packages import urllib3

urllib3.disable_warnings()
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)
# >>> 200

# If verify is set to a directory path, the directory must have been processed with the
# c_rehash tool supplied with OpenSSL.
requests.get('https://github.com', verify='/path/to/certfile')
# s = requests.Session()       # or keep the setting on a session
# s.verify = '/path/to/certfile'

          files: file upload

# Uploading a file with a POST request
import requests

files = {'files': open('favicon.ico', 'rb')}    # files names the file to upload, opened for binary reading
response = requests.post('http://httpbin.org/post', files=files)
print(response.text)

          Timeout setting, timeout: specifies a timeout in seconds; if no response arrives within that time, an exception is raised

"" " Tell requests to stop after a set number of seconds to timeout parameter of waiting for a response, if the server does not answer within timeout seconds, will raise an exception ." "" 
Import requests 
Request = requests.get ( ' HTTP: / /www.google.com.hk ' , timeout = 0.01 )
 Print (request.url)
View Code

          Exception handling: all exceptions explicitly raised by Requests inherit from requests.exceptions.RequestException
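          A minimal sketch of catching the common Requests exceptions; because they all inherit from RequestException, one final except clause can act as a catch-all (the URL is just an example):

import requests

try:
    response = requests.get('http://httpbin.org/get', timeout=3)
    response.raise_for_status()
except requests.exceptions.ConnectionError as e:
    print('connection failed:', e)
except requests.exceptions.Timeout as e:
    print('request timed out:', e)
except requests.exceptions.RequestException as e:   # base class catches everything else
    print('request failed:', e)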

      

    5) Request header information

      User-Agent: the user agent (proves that the request is sent by a real computer and browser)

      Cookies: the real user's login information (proves that you are a user of the target site)

      requests.Session() maintains the cookie information

      Referer: the URL of the previous page visited (proves that you are navigating from within the target site); see the sketch below
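      A minimal sketch that sends these three pieces of header information together; the Referer value and the cookie below are made-up placeholders:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Referer': 'http://httpbin.org/',     # made-up previous page
}
cookies = {'sessionid': 'abc123'}         # made-up login cookie

response = requests.get('http://httpbin.org/get', headers=headers, cookies=cookies)
print(response.text)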

    6) request body

      POST requests will have a request body

      Form Data: {'user': 'Berlin', 'pwd': '123'}
