Crawling with requests


  • Beautiful is better than ugly.

  • Explicit is better than implicit.

  • Simple is better than complex.

  • Complex is better than complicated.

  • Readability counts.

The basic workflow of a crawler

  1. Initiate a request: send a Request to the target site through an HTTP library. The request may carry extra headers and other information; then wait for the server to respond.

  2. Get the response content: if the server responds normally, you get a Response. Its body is the page content, which may be HTML, a JSON string, binary data (images or video), or other types.

  3. Parse the content: HTML can be parsed with regular expressions or a page-parsing library; JSON can be converted directly into a JSON object; binary data can be saved or processed further.

  4. Save the data: store it in any convenient form, such as plain text, a database, or files in a specific format.
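The four steps above can be sketched in a few lines. This is a minimal illustration, not a full crawler: it assumes httpbin.org is reachable and uses the regular-expression approach for the parsing step.

```python
import re
import requests

# 1. Initiate the request (extra headers could be passed via headers=)
response = requests.get("http://httpbin.org/html")

# 2. Get the response content (HTML text in this case)
html = response.text

# 3. Parse the content: pull the <h1> heading out with a regular expression
title = re.search(r"<h1>(.*?)</h1>", html).group(1)

# 4. Save the data as plain text
with open("result.txt", "w", encoding="utf-8") as f:
    f.write(title)

print(title)
```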

Request and Response

The browser sends a message to the server hosting the web site; this process is called an HTTP Request.

When the server receives the browser's message, it processes the message according to its content and then sends a message back to the browser; this process is the HTTP Response.

After the browser receives the server's Response, it processes the information accordingly and displays it.

What does a Request contain?

Request method

Two types are commonly used: GET and POST; in addition there are HEAD, PUT, DELETE, OPTIONS, and others.

The difference between GET and POST: a GET request carries its data in the URL, while a POST request carries it in the request body.
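This difference is easy to see against httpbin.org, which echoes back what it receives. The payload below is a made-up example; note where the data ends up in each case.

```python
import requests

payload = {"name": "leti", "age": 23}

# GET: the parameters end up in the URL query string
get_resp = requests.get("http://httpbin.org/get", params=payload)
print(get_resp.url)  # ...?name=leti&age=23

# POST: the same data travels in the request body, not in the URL
post_resp = requests.post("http://httpbin.org/post", data=payload)
print(post_resp.url)                  # no query string
print(post_resp.json()["form"])       # httpbin echoes the form body
```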

URL stands for Uniform Resource Locator. A URL is a concise representation of the location of a resource on the Internet and of how to access it; in other words, it is the address of a resource on the Internet. Every file on the Internet has a unique URL, and the information it contains tells the browser where the file is and how to handle it.

A URL consists of three parts: the first is the protocol (also called the service scheme); the second is the IP address or host name of the machine holding the resource (sometimes with a port number); the third is the path to the specific resource on that host, such as a directory and file name.
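The standard library can split a URL into exactly these parts. The URL below is a hypothetical example:

```python
from urllib.parse import urlparse

# A hypothetical URL broken into the three parts described above
parts = urlparse("http://www.example.com:8080/dir/page.html")

print(parts.scheme)    # first part, the protocol: 'http'
print(parts.hostname)  # second part, the host: 'www.example.com'
print(parts.port)      # ...and its port: 8080
print(parts.path)      # third part, the resource path: '/dir/page.html'
```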

Request header

A request carries header information, such as the User-Agent, Host, and Cookie fields.

Request body: the data carried by the request, such as the form data when a form is submitted (POST).

What does a Response contain?

The first line of every HTTP response is the status line: the HTTP version, a three-digit status code, and a short phrase describing the status, separated by spaces.

Response Status

There are many response statuses, for example: 200 means success, 301 means a redirect, 404 means the page was not found, and 502 means a server error.

Response header

Headers such as the content type, content length, server information, and cookies being set.

Response Body

The most important part: it contains the content of the requested resource, such as a web page's HTML, an image, or other binary data.
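To make the structure concrete, here is a hand-written raw HTTP response pulled apart into the status line, headers, and body described above (the response text itself is a made-up example):

```python
# A hand-written raw HTTP response, to illustrate the three parts above
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 14\r\n"
    "\r\n"
    "<h1>Hello</h1>"
)

# Headers and body are separated by a blank line
head, body = raw.split("\r\n\r\n", 1)
lines = head.split("\r\n")
version, code, phrase = lines[0].split(" ", 2)   # the status line
headers = dict(line.split(": ", 1) for line in lines[1:])

print(version, code, phrase)      # HTTP/1.1 200 OK
print(headers["Content-Type"])    # text/html
print(body)                       # the response body
```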

What kinds of data can be crawled?

Web page text: HTML documents, JSON-formatted text, and so on

Images: fetched as binary files and saved in an image format

Video: likewise binary files

Anything else: as long as you can request it, you can get it

How to parse data

  1. Direct processing

  2. JSON parsing

  3. Regular expression processing

  4. BeautifulSoup parsing

  5. PyQuery parsing

  6. XPath parsing
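As a small illustration of approach 3 above, here is a regular expression pulling items out of an HTML snippet (the snippet is a made-up stand-in for a fetched page):

```python
import re

# A small HTML snippet standing in for a fetched page
html = '<ul><li class="item">apple</li><li class="item">pear</li></ul>'

# Approach 3 above: extract every list item with a regular expression
items = re.findall(r'<li class="item">(.*?)</li>', html)
print(items)  # ['apple', 'pear']
```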

Why the crawled page can differ from what the browser shows

This happens because much of a site's data is loaded dynamically through JS and Ajax, so the page obtained by a direct request differs from what the browser displays after executing that code.

How to solve the JS-rendering problem?

Analyze the Ajax requests directly

Selenium / WebDriver

Splash

PyV8, Ghost.py

How to save data

Text: plain text, JSON, XML, etc.

Relational databases: structured databases such as MySQL, Oracle, SQL Server.

Non-relational databases: key-value stores such as MongoDB and Redis.
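As a sketch of the simplest option, text storage, here is some hypothetical crawled data written out both as plain text and as JSON:

```python
import json

# Hypothetical crawled records to persist
records = [{"name": "leti", "age": 22}]

# Save as plain text, one record per line
with open("data.txt", "w", encoding="utf-8") as f:
    for r in records:
        f.write(f"{r['name']},{r['age']}\n")

# Save as JSON
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False)

# Read the JSON back to confirm the round trip
with open("data.json", encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded)
```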

The requests library in detail

An overview of its main features
 
 
import requests
response = requests.get("https://www.baidu.com")
print(type(response))
print(response.status_code)
print(type(response.text))
print(response.text)
print(response.cookies)
print(response.content)
print(response.content.decode("utf-8"))

As you can see, requests is indeed very easy to use. One point needs attention: for many sites, response.text used directly will show garbled characters. response.content returns the raw binary data instead, and calling decode("utf-8") on it converts that data to text, which solves the garbled output that response.text can produce.

decode converts binary data into a Unicode string: for example, str1.decode('utf-8') decodes a utf-8 encoded byte string into a Unicode string.

Simply put: decode turns binary data (bytes) into text that humans can read, whether English or Chinese (decode is the one used more often).

encode does the opposite: it turns a Unicode string into binary data in a given encoding. For example, str2.encode('utf-8') encodes a Unicode string into utf-8 bytes.
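The round trip between the two looks like this (using utf-8 throughout):

```python
# Round trip between str (Unicode text) and bytes, using utf-8
text = "爬虫"                 # a Unicode string
data = text.encode("utf-8")   # encode: str -> bytes
back = data.decode("utf-8")   # decode: bytes -> str

print(type(data))  # <class 'bytes'>
print(len(data))   # 6: each Chinese character takes 3 bytes in utf-8
print(back)        # 爬虫
```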

After a request goes out, requests makes an educated guess at the encoding based on the HTTP response headers. When you access response.text, requests decodes with that guessed encoding. You can check which encoding requests is using through the response.encoding attribute, and assign to it to change it, for example:

 
 
response =requests.get("http://www.baidu.com")
response.encoding="utf-8"

Either response.content.decode("utf-8") or response.encoding = "utf-8" avoids the garbled-text problem.

Basic GET Request

 
 
import requests
response = requests.get('http://httpbin.org/get')
print(response.text)

GET request with parameters

 
 
import requests
response = requests.get("http://httpbin.org/get?name=leti&age=23")
print(response.text)

If we want to pass data in the URL query string, we would normally build a URL like httpbin.org/get?key=val by hand. The requests module lets us instead pass these parameters as a dictionary through the params keyword argument, as in the following example:

 
import requests
data = {
    "name":"leti",
    "age":22
}
response = requests.get("http://httpbin.org/get",params=data)
print(response.url)
print(response.text)

Parsing JSON

 
import requests
import json
response = requests.get("http://httpbin.org/get")
print(type(response.text))
print(response.json())
print(json.loads(response.text))
print(type(response.json()))

As the results show, the json() method integrated into requests actually performs json.loads(); the two give the same result.

Getting binary data

As mentioned above, response.content returns binary data; the same approach can be used to download images and video.
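For example, a small image can be fetched and written straight to disk. This sketch assumes httpbin.org is reachable; its /image/png endpoint serves a sample PNG:

```python
import requests

# Fetch a small sample PNG as binary data and write it to disk
# (httpbin.org/image/png is assumed reachable)
response = requests.get("http://httpbin.org/image/png")

with open("sample.png", "wb") as f:
    f.write(response.content)   # raw bytes, no decoding

# PNG files start with the signature b'\x89PNG'
print(response.content[:4])
```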

Adding headers: as with the urllib module we covered earlier, we can also customize the request headers. For example, if we request zhihu.com directly through requests, access is denied by default.

That is because access requires header information. Enter chrome://version in the Google Chrome address bar to see the User-Agent, then add that User-Agent to the headers:

[Screenshot: chrome://version showing the User-Agent string]

 
 
import requests
headers = {
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}
response =requests.get("https://www.zhihu.com",headers=headers)
print(response.text)
 
We can check the status code to confirm the request succeeded:

import requests
response = requests.get("http://www.baidu.com")
if response.status_code == requests.codes.ok:
    print("Visited successfully")

Getting cookies

 
 
import requests
response = requests.get("http://www.baidu.com")
print(response.cookies)
for key,value in response.cookies.items():
    print(key+"="+value)

Maintaining a session

One role of cookies is simulated login, that is, maintaining a session.

 
 
import requests
s = requests.Session()
s.get("http://httpbin.org/cookies/set/number/123456")
response = s.get("http://httpbin.org/cookies")
print(response.text)




Origin www.cnblogs.com/leiting7/p/11740761.html