Python web crawler: basics of the requests library

Previously we learned to use the urllib library to fetch web content, but compared with the requests library, urllib is decidedly weaker. requests provides many features that are very convenient to use, achieving in a single call what would otherwise require combining several urllib modules.

The requests library does not come with Python, which means it must be installed before first use; a quick pip install takes care of that. It is strongly recommended to use the requests library to fetch web content.
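For example, from the command line:

pip install requests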

1. GET requests

GET requests are the most common kind of request. Let's look at an example:

import requests

r = requests.get("http://httpbin.org/get")
print(r.text)

# Output:
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.18.4", 
    "X-Amzn-Trace-Id": "Root=1-5e7b817a-642ede6f41b26bdf583d312d"
  }, 
  "origin": "120.229.19.26", 
  "url": "http://httpbin.org/get"
}

The above is the result of a simple GET request; it includes the request headers, the URL, the originating IP, and other information.
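Besides text, the Response object r exposes several other useful attributes. A small sketch (all of the attributes below are part of the requests API; encoding may be None when the server declares no charset):

import requests

r = requests.get("http://httpbin.org/get")
print(r.status_code)   # HTTP status code, e.g. 200
print(r.headers)       # response headers, a dict-like object
print(r.url)           # the final URL that was requested
print(r.encoding)      # encoding requests will use for r.text (may be None)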

What if you want to add parameters to the request? You can use the params argument: store the parameters to be added in a dictionary, then pass it as params, as the following example shows:

import requests

data = {
    "name": "Marty",
    "age": 18
}
r = requests.get("http://httpbin.org/get", params=data)
print(r.text)

# Output:
{
  "args": {
    "age": "18", 
    "name": "Marty"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.18.4", 
    "X-Amzn-Trace-Id": "Root=1-5e7b877e-a8afd42005f62ca0e4ef7f20"
  }, 
  "origin": "120.229.19.26", 
  "url": "http://httpbin.org/get?name=Marty&age=18"
}

We passed the request parameters and, as you can see from the returned result, the request link was constructed automatically: the URL became http://httpbin.org/get?name=Marty&age=18

In addition, we can see that the page's return type is str, i.e. a string, but it is in JSON format. So we can call the json() method to parse it and get a dictionary. In other words, the json() method converts a string in JSON format into a dictionary.

print(type(r.text))
print(r.json())
print(type(r.json()))
# Output:
<class 'str'>
{'args': {'age': '18', 'name': 'Marty'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.18.4', 'X-Amzn-Trace-Id': 'Root=1-5e7b877e-a8afd42005f62ca0e4ef7f20'}, 'origin': '120.229.19.26', 'url': 'http://httpbin.org/get?name=Marty&age=18'}
<class 'dict'>
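
Note that if the response body is not valid JSON, the json() method raises an exception. A minimal defensive sketch (the JSON decode errors raised by requests are subclasses of ValueError, so catching that is safe across versions):

import requests

r = requests.get("http://httpbin.org/get")
try:
    data = r.json()          # parse the JSON body into a dict
except ValueError:           # raised when the body is not valid JSON
    data = None
    print("Response was not JSON:", r.text[:100])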

  • Crawling a web page

Take the "Explore" page of Zhihu as an example:

import requests
import re

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
r = requests.get("http://www.zhihu.com/explore", headers = header)
pattern = re.compile("ExploreRoundtableCard-questionTitle.*?>(.*?)</a>", re.S)
titles = re.findall(pattern, r.text)
print(titles)
# Output:
['有没有萝莉控福音的gal呢?', '如何看待galgame中男主的平行性格?', 'Steam 上有什么优秀的Galgame?', '读完《攀岩人生》,你有什么感想?', '攀岩有何魅力?为什么近两年攀岩运动开始在国内悄悄兴起?', '如果2020年发生了经济危机你会如何应对?', '为什么这么多人想转行做产品经理?', '疫情过后会滋生出新的行业吗?什么产品会火爆?', '为什么找个产品助理的职位都这么难?', '东京奥运会推迟至 2021 年将在多大程度上影响日本经济?会给日本带来多大的经济损失?', '如何看待日本奥委会被曝购买慕尼黑再保险公司的「奥运取消保险」?', '东京奥运会推迟至 2021 年夏天,会给后续产生什么影响?']

We added a headers parameter to this GET request; introducing this parameter disguises our program as a browser. Without it, the crawl would be refused.

Where do you find this User-Agent? Open any web page, inspect the elements, and select the Network tab; you will see many entries below, each representing one request sent to the server while loading that page. Click on any entry, and at the bottom you will find the User-Agent, which encodes the hardware you are using (phone or computer), its brand, the operating system, the browser, and other information. Of course, this parameter does not have to be your own browser's User-Agent; plenty of User-Agent strings can be found online.
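If you make many requests, you can also set the headers once on a Session object instead of passing them to every call. A minimal sketch using the Session API from requests:

import requests

# A Session keeps default headers (and cookies) across requests
s = requests.Session()
s.headers.update({
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
})
r = s.get("http://www.zhihu.com/explore")  # the headers are applied automatically
print(r.status_code)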

  • Grabbing binary data

import requests

r = requests.get("https://github.com/favicon.ico")
print(r.text)
print(r.content)
# Output (truncated):
�������������� +++G��������G+++
b'\x00\x00\x01\x00\x02\x00\x10\x10\x00\x00\x01\x00 \x00(\x05\x00\x00&\x00\x00\x00  \x00\x00\x01\x00 \x00(\x14\x00\x00N\x05\x00\x00(\x00\x00\x00\x10\x00\x00\x00 \x00\x00\x00\x01\x00 \x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x...

The above requests an image (the small icon shown on the site's browser tab). Images, audio, and video files are all essentially composed of binary data, each with a specific storage format and corresponding way to parse it. So to obtain them intact, we need their binary code.

Printing the two attributes of r, we can see that the r.text result is garbled, while r.content is binary data (the string is prefixed with a b). Because the image is binary, converting it to the str type for printing produces garbage.

This is a good place to note the difference between text and content: content is the raw bytes of the returned data, while text is that data decoded into a string using the response's encoding.
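The relationship can be shown directly. Roughly speaking, text is content decoded with the encoding requests guessed from the response headers (falling back to apparent_encoding when none is declared):

import requests

r = requests.get("http://httpbin.org/get")
print(type(r.content))   # <class 'bytes'>: the raw response body
print(type(r.text))      # <class 'str'>: the decoded body
# Decoding the raw bytes ourselves yields (roughly) the same string:
print(r.content.decode(r.encoding or 'utf-8') == r.text)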

The image can be saved with the code below:

import requests

r = requests.get("https://github.com/favicon.ico")
# Open the file in binary write mode ('wb'), since r.content is bytes
with open("favicon.ico", 'wb') as f:
    f.write(r.content)

Similarly, audio and video files can be obtained in this way; see the streaming sketch below for large files.
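
For large audio or video files it is better not to hold the whole body in memory at once. A minimal sketch using requests' streaming mode (the URL below is only a placeholder):

import requests

url = "https://example.com/video.mp4"  # placeholder URL for illustration
# stream=True defers downloading the body until we iterate over it
r = requests.get(url, stream=True)
with open("video.mp4", 'wb') as f:
    for chunk in r.iter_content(chunk_size=8192):  # read 8 KB at a time
        if chunk:
            f.write(chunk)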





Too sleepy to keep writing; time to sleep.
