Python crawler: urllib and requests module learning and examples

Basic knowledge

The difference between str and bytes

In Python 3:
str is converted to bytes with the encode() method;
bytes is converted back to str with the decode() method.

In Python 3 the two types are strictly separated, which requires attention in practice. Data is transmitted over the network in binary form, so we need to convert str into bytes for transmission and decode what we receive into the encoding we need when processing the data. That way, whatever encoding the other side uses, the text will not be garbled on our side.

str1 = '人生苦短'
b = str1.encode()
str1 = b.decode()
print(str1)
print(type(str1))

print('***************')
print(b)
print(type(b))
Output:
人生苦短
<class 'str'>
***************
b'\xe4\xba\xba\xe7\x94\x9f\xe8\x8b\xa6\xe7\x9f\xad'
<class 'bytes'>

urllib library

Common methods

1.request.urlopen()

import urllib.request
urllib.request.urlopen(url, data, timeout)
  • The first parameter url is the URL to open; the second parameter data is the data to send when accessing that URL; the third parameter timeout sets the timeout period.
  • The second and third parameters are optional: data defaults to None and timeout defaults to socket._GLOBAL_DEFAULT_TIMEOUT.
  • The first parameter url must be supplied. In this example we pass the URL of Baidu. After urlopen() executes, it returns a response object in which the returned information is stored (a small sketch of passing a timeout follows below).
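
As a minimal sketch (the URL and the deliberately tiny 0.01-second timeout are only illustrative), passing a timeout and catching the error it raises looks like this:

import socket
import urllib.error
import urllib.request

try:
    # a very short timeout makes the failure easy to reproduce
    resp = urllib.request.urlopen('http://www.baidu.com/', timeout=0.01)
    print(resp.read().decode('utf-8'))
except urllib.error.URLError as e:
    # a timed-out connection surfaces as URLError wrapping socket.timeout
    if isinstance(e.reason, socket.timeout):
        print('request timed out')
    else:
        raise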

2.read()

  • The read() method reads the entire contents of the response and returns them as bytes.

3.getcode()

  • getcode() returns the HTTP status code: 200 means success, 4xx indicates a page/client error, and 5xx indicates a server problem (see the error-handling sketch below).
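
Keep in mind that for 4xx/5xx responses urlopen() does not return normally but raises urllib.error.HTTPError, which carries the status code. A small sketch (the missing-page path is only an illustration):

import urllib.error
import urllib.request

try:
    resp = urllib.request.urlopen('http://www.baidu.com/some-missing-page')
    print(resp.getcode())   # 200 when the request succeeds
except urllib.error.HTTPError as e:
    print(e.code)           # e.g. 404 or 500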

4.info()

  • info() returns the HTTP headers of the server's response.
import urllib.request

url1 = "http://www.baidu.com/"
f = urllib.request.urlopen(url1)
info = f.read()
#print(info.decode())
print(f.geturl())
print('***********************')
print(f.getcode())
print('***********************')
print(f.info())

Output:
http://www.baidu.com/
***********************
200
***********************
Bdpagetype: 1
Bdqid: 0x853689130000c1c9
Cache-Control: private
Content-Type: text/html;charset=utf-8
Date: Tue, 16 Mar 2021 12:20:23 GMT
Expires: Tue, 16 Mar 2021 12:19:47 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Server: BWS/1.1
Set-Cookie: BAIDUID=E5EA9206E149F08295546FF28C768CE2:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BIDUPSID=E5EA9206E149F08295546FF28C768CE2; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1615897223; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BAIDUID=E5EA9206E149F0825C2486BEC5C001D3:FG=1; max-age=31536000; expires=Wed, 16-Mar-22 12:20:23 GMT; domain=.baidu.com; path=/; version=1; comment=bd
Set-Cookie: BDSVRTM=0; path=/
Set-Cookie: BD_HOME=1; path=/
Set-Cookie: H_PS_PSSID=33257_33344_31253_33594_33570_33392_26350_22158; path=/; domain=.baidu.com
Traceid: 161589722302381870189599010370484224457
Vary: Accept-Encoding
Vary: Accept-Encoding
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close
Transfer-Encoding: chunked

Request object

However, if you need to perform more complex operations, such as adding HTTP headers, you must create a Request instance and pass it to urlopen(); the URL to be accessed is given as a parameter when constructing the Request instance.

User-Agent

When we request other people's websites with a legitimate identity, we are obviously more welcome, so we should give our code an identity of its own: the so-called User-Agent header.

If we want our crawler to look more like a real user, the first step is to pretend to be a recognized browser. Different browsers send different User-Agent headers with their requests.

from urllib.request import urlopen,Request

url1 = "http://www.baidu.com"

head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
}
req = Request(url1,headers=head)
print(req.get_header('User-agent'))

resp = urlopen(req)
info1 = resp.read()
print(info1.decode())

GET request method

GET requests are generally used for us to get data from the server.

In the request we can see that a long string appears after http://www.baidu.com/s?, and it contains the keyword we want to query, so we can try to send the request using the default GET method. The query keyword has to be URL-encoded first, as the encoding sketch and the two methods below show.
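
As a quick illustration of the two URL-encoding helpers used by the methods below (quote() encodes a single string, urlencode() encodes a whole parameter dict):

from urllib.parse import quote, urlencode

print(quote('社区'))                              # %E7%A4%BE%E5%8C%BA
print(urlencode({'wd': '社区', 'ie': 'utf-8'}))    # wd=%E7%A4%BE%E5%8C%BA&ie=utf-8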

Method 1:

from urllib.request import Request,urlopen
from urllib.parse import quote

url1 = 'https://www.baidu.com/s?wd={}'.format(quote('社区'))
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}

request = Request(url1,headers=headers)
response = urlopen(request)
print(response.read().decode())

Method 2:

from urllib.request import urlopen,Request
from urllib.parse import urlencode

agre = {
    'wd': '社区',
    'ie': 'utf-8'
}
print(urlencode(agre))
url = "https://www.baidu.com/s?{}".format(urlencode(agre))

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}

req = Request(url,headers=headers)
res = urlopen(req)
print(res.read().decode())

Download Baidu Tieba case

from urllib.request import Request,urlopen
from urllib.parse import urlencode

def get_html(url):
    head = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    req = Request(url,headers=head)
    info = urlopen(req)
    return info.read()
def save_html(filename,html_bytes):
    with open(filename,'wb') as f:
        f.write(html_bytes)
        print(html_bytes)

def main():
    content = input("请输入要下载的内容:")
    size = int(input("请输入要下载的页数:"))
    base_url = 'https://tieba.baidu.com/f?ie=utf-8&{}'
    for pn in range(size):
        args = {
            'pn': pn * 50,
            'kw': content
        }
        args = urlencode(args)
        html_bytes = get_html(base_url.format(args))
        print('正在下载第{}页'.format(pn))
        filename = '第{}页.html'.format(pn)
        save_html(filename,html_bytes)


if __name__ == '__main__':
    main()

1. Simple web collector

import requests
if __name__ == '__main__':
    head = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    url = 'https://www.sogou.com/sie'
    kw = input("请输入需要搜索的内容:")
    param = {
        'query': kw
    }
    response = requests.get(url=url,params=param,headers=head)
    page_text = response.text
    filename = kw + '.html'
    with open(filename,'w',encoding='utf-8',) as fp:
        fp.write(page_text)
    print(filename,'保存成功!!')


2. Crawling Baidu translated content

import requests
import json

if __name__ == '__main__':
    url = 'https://fanyi.baidu.com/sug'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36 Edg/83.0.478.37'
    }
    query = input('请输入需要翻译的内容:')
    data = {
        'kw': query
    }
    response = requests.post(url=url,data=data,headers=headers)
    dic_obj = response.json()
    fileName = query + '.json'
    with open(fileName, 'w', encoding='utf-8') as fp:
        json.dump(dic_obj, fp=fp, ensure_ascii=False)
    print('保存成功!!')


3. Crawling Douban Movies

import requests
import json

url = 'https://movie.douban.com/j/chart/top_list'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}

param = {
    'type': '5',
    'interval_id': '100:90',
    'action': '',
    'start': '0',
    'limit': '20'
}
response = requests.get(url=url,params=param,headers=headers)
list_data = response.json()

fileName = 'douban.json'
with open(fileName, 'w', encoding='utf-8') as fp:
    json.dump(list_data, fp=fp, ensure_ascii=False)


4. Crawling KFC restaurant locations

import requests


url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}
kw = input('请输入想要查询的kfc地址:')
index = input('请输入想要查询的kfc地址第几页:')
data = {
    'cname': '',
    'pid': '',
    'keyword': kw,
    'pageIndex': index,
    'pageSize': '10'
}
response = requests.post(url=url,data=data,headers=headers)
kfc_data = response.text

fileName = kw+'kfc地址.txt'

with open(fileName, 'w', encoding='utf-8', ) as fp:
    fp.write(kfc_data)
print("over!!!")


5. Crawling National Medical Products Administration (NMPA) data

import requests
import json
if __name__ == '__main__':

    url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    all_data_list = []
    id_list = []  # store the enterprise IDs
    for page in range(1,10):
        data = {
            'on': 'true',
            'page': page,
            'pageSize': '15',
            'productName': '',
            'conditionType': '1',
            'applyname': '',
        }

        json_id = requests.post(url=url,headers=headers,data=data).json()
        for id in json_id['list']:
            id_list.append(id['ID'])

    post_url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById'
    for id in id_list:
        detail_data = {
            'id': id
        }
        detail_json = requests.post(url=post_url,data=detail_data,headers=headers).json()
        all_data_list.append(detail_json)
    with open('./药监总局数据.txt', 'w', encoding='utf-8') as fp:
        json.dump(all_data_list, fp=fp, ensure_ascii=False)
    print('Over!!!')


Origin blog.csdn.net/qq_43710889/article/details/114943868