Python web crawler (2): the requests library

Learning notes and application examples for the requests library

Tutorial

  • Requests: automatic crawling of HTML pages and automatic submission of network requests
  • Robots protocol: the web crawler exclusion standard
  • Projects: hands-on practice projects

Unit 1: Getting Started with the Requests Library

Requests library installation: pip install requests
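A quick way to verify the installation in the interpreter (a minimal check, not part of the original tutorial):

import requests
print(requests.__version__)  # prints the installed version, confirming the install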

get() and head() are the most commonly used methods.

(figure: the seven main methods of the Requests library — requests.request(), requests.get(), requests.head(), requests.post(), requests.put(), requests.patch(), requests.delete())

get() method

import requests
r = requests.get("url")
# get -> Request: constructs a Request object that asks the server for a resource
# response -> r: returns a Response object containing the server's resource (the content fetched by the crawler)
# requests.get(url, params=None, **kwargs)
#   url: URL of the page to fetch
#   params: extra parameters for the url, dict or byte-stream format, optional
#   **kwargs: 12 optional parameters controlling access


In fact, the Requests library has only one underlying method, request(); all the other methods simply call request(), as the sketch below shows.
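A simplified sketch of this delegation (the idea, not the library's exact source code):

import requests

def get(url, params=None, **kwargs):
    # requests.get() is a thin wrapper that forwards everything to requests.request()
    return requests.request("get", url, params=params, **kwargs)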

# Two important objects in the Requests library: Request and Response
# Response attributes (hold the content returned by the crawler):
#   404 or another code: error or exception
#   200: the response attribute values can be inspected

(figure: the main Response attributes — r.status_code, r.text, r.encoding, r.apparent_encoding, r.content)
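A quick interactive check of these attributes (standard Requests API; the URL is just an example):

import requests

r = requests.get("https://www.baidu.com/")
print(r.status_code)              # 200 on success
print(r.encoding)                 # encoding guessed from the HTTP headers
print(r.apparent_encoding)        # encoding guessed from the content itself
r.encoding = r.apparent_encoding  # adopt the content-based guess before reading r.text
print(r.text[:200])               # first 200 characters of the decoded body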

A general code framework for crawling web pages


# A general code framework for crawling web pages
# Network connections carry risk, so exception handling matters
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError if the status is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "An exception occurred"

if __name__ == "__main__":
    url = "https://www.baidu.com/"
    print(getHTMLText(url))

Running it returns the correct result (the HTML of the Baidu homepage).
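With a bad link, the same framework degrades gracefully instead of crashing; for example, continuing in the same session where getHTMLText is defined, with a mistyped host (note the extra "c"):

print(getHTMLText("https://www.baidu.ccom/"))  # prints: An exception occurred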


  • The role of `if __name__ == "__main__":`

    There are two ways to use a Python file: the first is to execute it directly as a script, and the second is to import it into another Python script and call it (module reuse). The role of `if __name__ == "__main__":` is to control which code runs in these two cases: the code under `if __name__ == "__main__":` is executed only in the first case (when the file is run directly as a script), and is not executed when the file is imported into another script.
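A minimal illustration (the module name demo.py is hypothetical):

# demo.py
def greet():
    return "hello"

if __name__ == "__main__":
    # runs only when this file is executed directly: python demo.py
    print(greet())

Running python demo.py prints "hello"; a script that does import demo will not trigger the print.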

HTTP protocol and Requests library methods


HTTP: Hypertext Transfer Protocol.

HTTP is a stateless application-layer protocol based on the "request and response" model.

HTTP generally uses a URL as the identifier for locating a network resource.

URL format: http://host[:port][path]

  host: a legal Internet host domain name or IP address

  port: the port number; the default port is 80

  path: the path of the requested resource

http://www.bit.edu.cn

Understanding HTTP URLs

A URL is an Internet path for accessing a resource through the HTTP protocol; each URL corresponds to one data resource.
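As an illustration of the host/port/path breakdown (using the standard library; not part of the original tutorial):

from urllib.parse import urlsplit

parts = urlsplit("http://www.bit.edu.cn:80/research/index.html")
print(parts.hostname)  # www.bit.edu.cn -> host
print(parts.port)      # 80 -> port
print(parts.path)      # /research/index.html -> path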


# The head() method of the Requests library
import requests
r = requests.head("https://www.baidu.com/")
print(r.headers)
# The post() method of the Requests library
payload = {'key1': 'value1'}
r = requests.post('http://httpbin.org/post', data=payload)
print(r.text)
```The result:
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "key1": "value1"
  }
```
# The put() method of the Requests library
payload = {'k1': 'v1', 'k2': 'v2'}
r = requests.put('http://httpbin.org/put', data=payload)
print(r.text)
```Part of the result:
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "k1": "v1",
    "k2": "v2"
  },
```

Analysis of the main methods of the Requests library

request() method

# requests.request(method, url, **kwargs)
#   method: the request method, one of seven: 'GET'/'HEAD'/'POST'/'PUT'/'PATCH'/'DELETE'/'OPTIONS'
#       r = requests.request('GET', url, **kwargs)  # e.g.
#   url: URL of the page to fetch
#   **kwargs: 13 optional parameters controlling access
#       params: dict or byte sequence, appended to the url as query parameters
kv = {'k1': 'v1', 'k2': 'v2'}
r = requests.request('GET', 'http://python123.io/ws', params=kv)
print(r.url)
# result: https://python123.io/ws?k1=v1&k2=v2
#       data: dict, byte sequence, or file object, sent as the body of the Request
kv = {'k1': 'v1', 'k2': 'v2'}
r = requests.request('POST', 'http://python123.io/ws', data=kv)
#       json: data in JSON format, sent as the body of the Request
kv = {'k1': 'v1', 'k2': 'v2'}
r = requests.request('POST', 'http://python123.io/ws', json=kv)
#       headers: dict, custom HTTP headers
hd = {'user-agent': 'Chrome/10'}
r = requests.request('POST', 'http://python123.io/ws', headers=hd)
#       cookies: dict or CookieJar, the cookies of the Request
#       auth: tuple, supports HTTP authentication
#       files: dict, for uploading files
fs = {'file': open('data.xls', 'rb')}
r = requests.request('POST', 'http://python123.io/ws', files=fs)
#       timeout: the timeout in seconds
r = requests.request('GET', 'http://python123.io/ws', timeout=10)
#       proxies: dict, sets proxy servers for the request; can include login credentials
#       allow_redirects: True/False, default True; redirect switch
#       stream: True/False, default False; switch between downloading the body immediately (False) and streaming it (True)
#       verify: True/False, default True; switch for SSL certificate verification
#       cert: path to a local SSL certificate
        
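A sketch that combines several of these control parameters in one call (the proxy addresses are placeholder assumptions, not working servers):

import requests

# hypothetical proxies; replace with real proxy servers before running
proxies = {'http': 'http://user:pass@10.10.1.10:3128',
           'https': 'http://10.10.1.10:1080'}
r = requests.request('GET', 'http://python123.io/ws',
                     params={'k1': 'v1'},
                     headers={'user-agent': 'Mozilla/5.0'},
                     timeout=10,
                     allow_redirects=True,
                     proxies=proxies)
print(r.status_code)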


Unit 2: "Even thieves have a code": the ethics of web crawlers

Problems caused by web crawlers


  • Legal risk
  • Privacy leakage
  • Harassment (excessive load on servers)

Restrictions on web crawlers

  • Source review: checking the User-Agent field to restrict access
  • Announcement: the Robots protocol

Robots protocol (Robots Exclusion Standard)

Function: a website tells web crawlers which pages may be crawled and which may not.

Form: a robots.txt file in the root directory of the website.

JD's Robots protocol: https://www.jd.com/robots.txt

Baidu's Robots protocol: https://www.baidu.com/robots.txt

Syntax:

  User-agent: *

  Disallow: /

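A site's Robots protocol can be inspected with the same library (a minimal sketch, using JD's robots.txt as the example):

import requests

r = requests.get("https://www.jd.com/robots.txt", timeout=10)
r.raise_for_status()
print(r.text)  # the User-agent / Disallow rules the site publishes for crawlers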
Compliance with the Robots protocol

Web crawlers: should identify robots.txt automatically or manually, and then crawl accordingly.

Binding force: the Robots protocol is advisory rather than binding. A web crawler may choose not to obey it, but it then bears legal risk.

A crawler whose behavior resembles that of a human visitor need not consult the Robots protocol.

Unit 3: Web crawling in practice with the Requests library (5 examples)

Example 1: Crawling a JD (Jingdong) product page

import requests

try:
    url = "https://item.jd.com/100021007462.html"
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except:
    print("Crawl failed")

Result: the first 1000 characters of the page HTML are printed.

Example 2: Crawling an Amazon product page

Unlike the JD page, Amazon requires our code to simulate a browser when sending HTTP requests, by supplying a User-Agent through the headers field.

# Corrected source code
import requests

try:
    url = "https://www.amazon.cn/dp/B0814XNDPM/ref=s9_acsd_hps_bw_c2_x_2_i?pf_rd_m=A1U5RCOVU0NYF2&pf_rd_s=merchandised-search-2&pf_rd_r=TR3JA9FYNTNPF2PZ66V3&pf_rd_t=101&pf_rd_p=7235aeb5-a996-42a4-a46a-257db647554a&pf_rd_i=2032713071"
    kv = {"user-agent": "Mozilla/5.0"}
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[6000:10000])
except:
    print("Crawl failed")


Without a custom header, the crawler faithfully told Amazon that it was written with python-requests, and Amazon's source review rejected it.

We can change the headers to simulate a browser sending the request to Amazon: first construct a key-value pair that redefines the user-agent, then pass it to the get() function.
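What was actually sent can be verified by inspecting the request headers recorded on the response object (standard Requests API; the exact python-requests version string will vary):

import requests

r = requests.get("https://www.amazon.cn/")  # default header
print(r.request.headers["User-Agent"])      # e.g. python-requests/2.x
r = requests.get("https://www.amazon.cn/", headers={"user-agent": "Mozilla/5.0"})
print(r.request.headers["User-Agent"])      # Mozilla/5.0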


Example 3: Submitting keywords to Baidu and 360 search

Search engine keyword submission interfaces:

Baidu: http://www.baidu.com/s?wd=keyword
360: http://www.so.com/s?q=keyword


import requests

keyword = 'Python'
try:
    kv = {'wd': keyword}
    r = requests.get("http://www.baidu.com/s", params=kv)
    print(r.request.url)  # the full url actually requested, with wd=Python appended
    r.raise_for_status()
    print(len(r.text))
except:
    print("Crawl failed")
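The 360 version only swaps the parameter name and the URL, following the interface listed above:

import requests

keyword = 'Python'
kv = {'q': keyword}  # 360 uses 'q' where Baidu uses 'wd'
r = requests.get("http://www.so.com/s", params=kv)
print(r.request.url)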


Example 4: Crawling and storing web images (and video)

Format of a web image link:

http://www.example.com/picture.jpg

import requests
import os

url = "http://cj.jj20.com/2020/down.html?picurl=/up/allimg/tp05/19100120461512E-0.jpg"
root = "C:/程序员专用软件/"        # local save directory
path = root + url.split('/')[-1]   # file name: the last '/'-separated segment of the url
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:
            f.write(r.content)     # r.content is the binary form of the response
        print("File saved successfully")
    else:
        print("File already exists")
except:
    print("Crawl failed")

Example 5: Automatic lookup of an IP address's location

https://www.ip138.com/  # site for querying IP addresses

https://www.ip138.com/iplookup.asp?ip=112.224.74.158&action=2  # query links take this form

import requests

url = "http://m.ip138.com/ip.asp?ip="
try:
    r = requests.get(url + '202.204.80.112')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])
except:
    print("Crawl failed")



Origin blog.csdn.net/Barry_kk/article/details/121549354