Python Web Crawler Study Notes (2): Making HTTP Requests and Passing Parameters

Disclaimer: This is an original article by the blogger and may not be reproduced without permission. https://blog.csdn.net/bowei026/article/details/90183795

Getting the response content

The response object has the following attributes:
text: the full body of the response, decoded as a string
status_code: the HTTP status code of the response
encoding: the character encoding used to decode the response body
content: the response body as raw bytes, in which escape sequences such as \n (newline), \r (carriage return), and \t (tab) are visible
r.json(): if the response body is a JSON string, parses it using the JSON decoder built into Requests
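How these attributes relate to each other can be seen without any network traffic by constructing a Response object by hand. This is purely an illustration (setting `_content` directly touches an internal field that the library normally fills in itself), not how you would use Requests in practice:

```python
import requests

# Build a Response manually, for illustration only:
# _content is an internal field normally set by the library.
resp = requests.models.Response()
resp.status_code = 200
resp.encoding = 'utf-8'
resp._content = b'{"key1": "value1"}'

print(resp.status_code)  # 200
print(resp.content)      # raw bytes: b'{"key1": "value1"}'
print(resp.text)         # content decoded with resp.encoding
print(resp.json())       # parsed JSON: {'key1': 'value1'}
```

Note that text is always derived from content by decoding it with encoding, and json() parses that decoded text.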

Passing request parameters

import requests

payload = {'key1' : 'value1', 'key2' : 'value2'}
link = 'http://httpbin.org/get'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36', 'Content-Type': 'text/html'}
r = requests.get(link, headers=headers, params=payload)
print(r.content)
print(r.status_code)
print(r.json())
Program output:
b'{\n  "args": {\n    "key1": "value1", \n    "key2": "value2"\n  }, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Content-Type": "text/html", \n    "Host": "httpbin.org", \n    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"\n  }, \n  "origin": "223.72.90.250, 223.72.90.250", \n  "url": "https://httpbin.org/get?key1=value1&key2=value2"\n}\n'
200
{'args': {'key1': 'value1', 'key2': 'value2'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Content-Type': 'text/html', 'Host': 'httpbin.org', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}, 'origin': '223.72.90.250, 223.72.90.250', 'url': 'https://httpbin.org/get?key1=value1&key2=value2'}

As the output shows, the parameters passed via the params argument were correctly transmitted as the query string key1=value1&key2=value2. If you want to turn compact JSON into a readable, indented form, you can use the online tool http://www.bejson.com/
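An online tool is not strictly necessary: the standard library's json module can pretty-print compact JSON locally. A minimal sketch (the compact string here is a made-up stand-in for a response body):

```python
import json

# A compact JSON string, like the one returned by httpbin.org
compact = '{"args": {"key1": "value1", "key2": "value2"}}'

# Parse it and re-serialize with indentation
pretty = json.dumps(json.loads(compact), indent=2, ensure_ascii=False)
print(pretty)
```

json.loads turns the string into Python objects and json.dumps with indent=2 writes it back out in an indented layout.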

Custom request headers

In the example above, the headers parameter was used to send a custom User-Agent. We can pass more header fields in the same way, for example:
import requests

link = 'http://httpbin.org/get'
headers = {'Host': 'www.santostang.com', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36', 'Content-Type': 'text/html'}
r = requests.get(link, headers=headers)
print(r.status_code)
You can pass as many header fields as needed; to see which headers your browser sends, inspect the Request Headers section in the browser's developer tools.
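To check what Requests will actually send before the request goes over the wire, you can prepare a request without sending it and inspect the result. A small sketch (the User-Agent string here is a made-up placeholder):

```python
import requests

# Build and prepare a request without sending it
req = requests.Request('GET', 'http://httpbin.org/get',
                       headers={'User-Agent': 'my-crawler/0.1'})
prepared = req.prepare()

# Inspect the final URL and headers as they would be sent
print(prepared.url)
print(dict(prepared.headers))
```

This is handy for debugging: prepare() applies the same header and URL processing as a real request, minus the network call.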

Sending a POST request

import requests

payload = {'key1' : 'value1', 'key2' : 'value2'}
headers = {'Host' : 'www.santostang.com', 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36', 'Content-Type': 'text/html'}
r = requests.post('http://httpbin.org/post', headers=headers, data=payload)
print(r.text)
Output:
{
  "args": {}, 
  "data": "key1=value1&key2=value2", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "23", 
    "Content-Type": "text/html", 
    "Host": "www.santostang.com", 
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"
  }, 
  "json": null, 
  "origin": "223.72.90.250, 223.72.90.250", 
  "url": "https://www.santostang.com/post"
}
In a POST request, the form data is specified with the data parameter.
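data= sends a form-encoded body; Requests also accepts a json= argument, which serializes a dict to JSON and sets the Content-Type header automatically. A sketch comparing the two by preparing the requests without sending them:

```python
import requests

# Same payload sent two ways: as a form body and as a JSON body
form = requests.Request('POST', 'http://httpbin.org/post',
                        data={'key1': 'value1'}).prepare()
as_json = requests.Request('POST', 'http://httpbin.org/post',
                           json={'key1': 'value1'}).prepare()

print(form.body)                         # key1=value1
print(form.headers['Content-Type'])      # application/x-www-form-urlencoded
print(as_json.body)                      # the dict serialized as JSON
print(as_json.headers['Content-Type'])   # application/json
```

On httpbin.org, a data= body shows up under "form" in the response (as in the output above), while a json= body shows up under "json".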

Set timeout

import requests

r = requests.post('http://httpbin.org/post', timeout=0.001)
print(r.text)
Output:
Because the 0.001-second value set by the timeout parameter is far too small, the program terminates with the error socket.timeout: timed out
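Rather than letting the timeout crash the program, it is cleaner to catch it. Requests wraps the low-level socket.timeout in its own requests.exceptions.Timeout, a subclass of RequestException. A minimal sketch:

```python
import requests

try:
    r = requests.post('http://httpbin.org/post', timeout=0.001)
    print(r.status_code)
except requests.exceptions.Timeout:
    # raised when the connect or read phase exceeds the timeout
    print('request timed out')
except requests.exceptions.RequestException as e:
    # any other failure, e.g. no network connection
    print('request failed:', e)
```

Catching RequestException as the last branch covers all errors Requests can raise, so the crawler can log the failure and move on.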

Crawling the Douban Top 250 movie list

import requests
from bs4 import BeautifulSoup

def getMovies():
	headers = {'Host' : 'movie.douban.com', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
	movies = []
	for i in range(0, 10):
		r = requests.get('https://movie.douban.com/top250?start=' + str(i * 25), headers=headers)

		soup = BeautifulSoup(r.text, 'lxml')
		div_list = soup.find_all('div', class_='hd')
		for div in div_list:
			title = div.a.span.text
			movies.append(title)

	return movies
		

movies = getMovies()
for i, movie in enumerate(movies):
	print(str(i+1) + "==" + movie)

Running the program prints the titles of all 250 movies on Douban's Top 250 list.
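The title-extraction step inside getMovies() can be exercised offline on a hand-written HTML fragment. The sketch below uses the stdlib html.parser instead of lxml so it runs without extra dependencies; the movie title and URL are invented stand-ins for one entry of the real page:

```python
from bs4 import BeautifulSoup

# A minimal stand-in for one entry of the Top 250 page
html = ('<div class="hd"><a href="https://movie.douban.com/subject/1/">'
        '<span class="title">Example Movie</span></a></div>')

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='hd')   # same selector as in getMovies()
print(div.a.span.text)                # Example Movie
```

Each movie entry on the real page wraps its title in a span inside a link inside a div with class "hd", which is why div.a.span.text recovers the title.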

For the BeautifulSoup documentation, see https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

