02. Detailed explanation of the Python requests module

1. Installation of requests

pip install requests

2. Crawl the Sogou homepage with requests and save it

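1. The requests.get method

requests.get(url, params=None, **kwargs)

url: the address to request

params: query-string parameters to send with the request, usually a dict

kwargs: other optional arguments, such as headers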
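
To make the role of params concrete, here is a minimal sketch, using only the standard library, of how a parameter dict ends up in the request URL as a query string. The helper name build_url is hypothetical and exists only for illustration; requests performs this step itself when you pass params, and for simple values like these the resulting URL should look the same.

```python
from urllib.parse import urlencode


def build_url(base_url, params):
    """Append params to base_url as a URL-encoded query string.

    Hypothetical helper for illustration; requests does this internally.
    """
    query = urlencode(params)
    return f"{base_url}?{query}" if query else base_url


example_url = build_url("https://www.sogou.com/web", {"query": "python requests"})
print(example_url)
```

Note that the space in the value is encoded rather than sent literally, which is one reason to pass a dict instead of concatenating the URL by hand.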

2. Code

import requests

if __name__ == "__main__":
    # Step 1: specify the URL
    url = 'https://www.sogou.com/'
    # Step 2: send the request
    # The get method returns a response object
    response = requests.get(url=url)
    # Step 3: get the response data; .text returns it as a string
    page_text = response.text
    print(page_text)
    # Step 4: save the data to disk
    with open('./sogou.html', 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print('Finished crawling!')

3. Garbled text

If the server's response headers do not declare a charset for a text response, requests falls back to ISO-8859-1 when it decodes response.text, so a page that actually uses another encoding comes out garbled. Which encoding should you use? Open the website you want to crawl and check the charset it declares.


The Sogou homepage uses UTF-8, so set that encoding on the response before reading response.text:

response.encoding = "utf-8"

The same approach applies to other websites.
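
The garbling can be reproduced without any network access. The sketch below takes the bytes a UTF-8 page would send, then decodes them once with ISO-8859-1 and once with UTF-8. The sample string is an arbitrary choice made for this illustration, not data taken from Sogou.

```python
# Bytes as a UTF-8 page would send them over the network (sample text is arbitrary).
raw_body = "搜狗搜索".encode("utf-8")

# Decoding with the wrong charset yields garbled text rather than an error,
# because every byte value is valid in ISO-8859-1.
garbled = raw_body.decode("iso-8859-1")

# Decoding with the charset the page actually uses recovers the original text.
correct = raw_body.decode("utf-8")

print(repr(garbled))
print(correct)
```

Setting response.encoding = "utf-8" tells requests to take the second path when it builds response.text.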

3. Collect Sogou search results with requests

1. UA spoofing

UA: User-Agent, the request header that identifies the client sending the request.

UA detection: a website's server checks the User-Agent of each incoming request. If it identifies a browser, the request is treated as normal. If it does not, the request is treated as abnormal (a crawler), and the server is likely to reject it.

UA spoofing: set the crawler's User-Agent to that of a real browser so that its requests look like they come from a browser.
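
To see why spoofing helps, it can be useful to picture the check from the server's side. The following is a toy sketch only: the function looks_like_browser and the marker list are hypothetical, and real sites use their own, usually more elaborate, rules. It compares the kind of User-Agent that requests sends by default (python-requests/ followed by a version number) with a browser string.

```python
# Substrings that commonly appear in real browser User-Agent strings.
# This list is an illustrative assumption, not an actual server rule set.
BROWSER_MARKERS = ("Mozilla/", "Chrome/", "Safari/", "Firefox/")


def looks_like_browser(user_agent):
    """Return True if the User-Agent contains any browser marker."""
    return any(marker in user_agent for marker in BROWSER_MARKERS)


# Example of the default value requests sends; the version number will vary.
default_ua = "python-requests/2.31.0"
spoofed_ua = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
)

print(looks_like_browser(default_ua))
print(looks_like_browser(spoofed_ua))
```

A request carrying the default value would fail a check like this one, while the spoofed value would pass.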

2. Crawl data

import requests

if __name__ == "__main__":
    # UA spoofing: put the User-Agent into a headers dict
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    url = 'https://www.sogou.com/web'
    # Put the URL's query parameters into a dict
    kw = input('enter a word:')
    param = {
        'query': kw
    }
    # requests appends the parameters to the URL when it sends the request
    response = requests.get(url=url, params=param, headers=headers)
    response.encoding = "utf-8"
    page_text = response.text
    fileName = kw + '.html'
    with open(fileName, 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print(fileName, 'saved successfully!')

4. Crack Baidu Translate

Requirement: read the text to translate from user input, get the translation result as JSON, and save it locally.

import requests
import json

if __name__ == "__main__":
    # 1. Specify the URL
    post_url = 'https://fanyi.baidu.com/sug'
    # 2. UA spoofing
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    # 3. Prepare the POST parameters (same idea as for a GET request)
    word = input('enter a word:')
    data = {
        'kw': word
    }
    # 4. Send the request
    response = requests.post(url=post_url, data=data, headers=headers)
    # 5. Get the response data: json() returns a Python object
    #    (only call json() when the response body is known to be JSON)
    dic_obj = response.json()

    # Save the data to disk; the with block closes the file when done
    fileName = word + '.json'
    with open(fileName, 'w', encoding='utf-8') as fp:
        json.dump(dic_obj, fp=fp, ensure_ascii=False)

    print('over!!!')
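
The ensure_ascii=False argument in the json.dump call matters when the result contains Chinese text. As a small offline sketch, the snippet below serializes the same dict both ways with json.dumps. The sample dict and its keys are made up for illustration and are not the actual structure returned by Baidu.

```python
import json

# Sample data for illustration only; the keys are made up.
sample = {"word": "dog", "meaning": "狗"}

# Default behaviour (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(sample)

# With ensure_ascii=False the characters are written as-is and stay readable.
readable = json.dumps(sample, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings load back to the same data; the difference is only in how readable the saved file is, which is also why the file is opened with encoding='utf-8'.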

5. Crawl movie details from the Douban movie category ranking list (https://movie.douban.com/)

import requests
import json

if __name__ == "__main__":
    url = 'https://movie.douban.com/j/chart/top_list'
    param = {
        'type': '24',
        'interval_id': '100:90',
        'action': '',
        'start': '0',   # index of the first movie to fetch
        'limit': '20',  # number of movies to fetch per request
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    response = requests.get(url=url, params=param, headers=headers)

    list_data = response.json()

    # The with block closes the file once the data is written
    with open('./douban.json', 'w', encoding='utf-8') as fp:
        json.dump(list_data, fp=fp, ensure_ascii=False)
    print('over!!!')
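
The start and limit parameters suggest a way to fetch more than the first 20 movies. The sketch below only builds the parameter dicts for successive pages and sends nothing. It assumes the endpoint treats start as an offset and limit as a page size, as the comments in the code indicate; the helper name page_params is hypothetical.

```python
def page_params(page, page_size=20):
    """Build the query parameters for one page of results.

    page is zero-based; the fixed values are copied from the example above.
    """
    return {
        'type': '24',
        'interval_id': '100:90',
        'action': '',
        'start': str(page * page_size),  # index of the first movie on this page
        'limit': str(page_size),         # number of movies per page
    }


# Parameters for the first three pages
pages = [page_params(page) for page in range(3)]
for params in pages:
    print(params['start'], params['limit'])
```

Each dict could then be passed as params to requests.get in a loop, ideally with a pause between requests so the crawler does not put unnecessary load on the site.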


Origin blog.csdn.net/qq_40837794/article/details/109604680