[python] Crawler notes (2): the requests module

Python's modules for making network requests are:

  • urllib module
  • requests module √ (efficient and concise)

requests module:

  • Built for network requests: powerful, simple to use, and efficient
  • Role: simulates a browser sending requests

Using requests breaks the process into four steps:

  • Specify the URL
  • Send a request to the URL
  • Get the response data
  • Persist the data (storage)

Environment installation:
pip install requests
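
A quick sanity check that the install worked (the exact version will vary):

import requests
print(requests.__version__)  # any reasonably recent version works for the examples below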

Hands-on examples:

  • Crawling Sogou homepage data
import requests

# Step 1: specify the url
url = 'https://www.sogou.com/'

if __name__ == "__main__":
    # Step 2: get() sends the request to the url and returns a response object
    response = requests.get(url=url)
    # Step 3: get the response data (the page source as text)
    page_text = response.text
    print(page_text)
    # Step 4: persistent storage; here we save the page as a local html file
    with open('./sogou.html', 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print("Finished crawling the data")
  • Sogou search collector with a dynamically specified keyword
# Web page collector
# Dynamically specify a keyword and crawl the results

# UA disguise: the site's server inspects the User-Agent of each incoming request. If the
# User-Agent identifies the client as a browser, the request is treated as a normal request;
# if it is not browser-based, the request is treated as abnormal (and may be rejected).

# Spoofed User-Agent string
ua = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3775.400 QQBrowser/10.6.4209.400"

import requests

keyword = "硕了"  # could also be read via input(); see the sketch after this example

if __name__ == "__main__":
    url = "https://www.sogou.com/web?"
    # Handle the parameters carried by the url: pack them into a dict
    headers = {
        'User-Agent': ua,
    }
    param = {
        'query': keyword,
    }
    # The request sent to the url carries the parameters; requests handles encoding them
    response = requests.get(url=url, params=param, headers=headers)
    page_text = response.text

    filename = keyword + '.html'
    with open(filename, 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print(filename, "crawled successfully")
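
As noted in the comment above, the keyword could come from input() instead of being hardcoded. A minimal sketch of that variant, reusing the ua string defined above:

# Hypothetical variant: read the search keyword from stdin instead of hardcoding it
import requests

if __name__ == "__main__":
    keyword = input("Keyword to search: ")
    headers = {'User-Agent': ua}  # assumes the ua string defined earlier
    param = {'query': keyword}
    response = requests.get(url="https://www.sogou.com/web?", params=param, headers=headers)
    with open(keyword + '.html', 'w', encoding='utf-8') as fp:
        fp.write(response.text)
    print(keyword + '.html', "saved")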
  • Crawl Baidu Translate suggestions (POST + JSON)
# Packet capture shows the corresponding request is a POST request and the response is JSON data
# The POST request carries parameters
# Send the request with requests
# Learn to use the browser's capture tools; the prior knowledge here is that the page uses Ajax, so capture under the XHR filter

ua = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3775.400 QQBrowser/10.6.4209.400"
import json
import requests
if __name__ == "__main__":
    post_url = "https://fanyi.baidu.com/sug"
    # POST request parameter handling
    data = {
        "kw": "dog"
    }
    headers = {
        "User-Agent": ua,
    }
    # Send the request
    response = requests.post(url=post_url, data=data, headers=headers)
    # Get the response data; json() returns a Python object.
    # Only use it when you are sure the server's response is JSON.
    dic_obj = response.json()
    print(dic_obj)
    with open('./dog.json', 'w', encoding='utf-8') as fp:
        json.dump(dic_obj, fp=fp, ensure_ascii=False)  # Chinese characters cannot be ASCII-encoded
    print("over")
  • Crawl the Douban movie classification rankings
ua = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3775.400 QQBrowser/10.6.4209.400"
import requests

if __name__ == "__main__":
    url = "https://movie.douban.com/j/chart/top_list?type=24&interval_id=100%3A90&action=&start=20&limit=20"
    param = {
    
    
        'type': '24',
        'interval_id': '100:90',
        'action': '',
        'start':'20',
        'limit':'20',
    }
    headers = {
    
    
        "User-Agent":ua,
    }
    response = requests.get(url=url,params=param,headers=headers)
    list_obj = response.json()
    print(list_obj)
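
Since start and limit behave like an offset and a page size, paging through the ranking is a loop over start. A minimal sketch, assuming the endpoint returns a JSON list per request as above:

# Hypothetical pagination: advance 'start' by 'limit' on each request
movies = []
for page in range(3):  # first three pages, as an example
    param['start'] = str(page * int(param['limit']))
    resp = requests.get(url=url, params=param, headers=headers)
    movies.extend(resp.json())
print(len(movies), "movies fetched")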
  • Crawl KFC restaurant information
import requests
kw = "广州"

if __name__ == "__main__":
    url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3775.400 QQBrowser/10.6.4198.400'
    }

    # Form data sent in the POST body
    data = {
        'cname': '',
        'pid': '',
        'keyword': kw,
        'pageIndex': '1',
        'pageSize': '10'
    }

    response = requests.post(url=url, headers=headers, data=data)
    print(response.text)
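
The body printed above is JSON text, so it can be decoded and iterated. A hedged sketch; the 'Table1', 'storeName', and 'addressDetail' field names are assumptions based on typical captures of this endpoint:

import json

store_data = json.loads(response.text)
# Assumption: 'Table1' holds the list of store records
for store in store_data.get('Table1', []):
    print(store.get('storeName'), store.get('addressDetail'))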

Origin: blog.csdn.net/Sgmple/article/details/112003689