[Use of urllib (Part 2)]


1. Ajax GET request

Douban movies (first page)

Requirement: crawl the first page of the Douban movie ranking. Open the page, inspect the network requests, find the interface that returns the first page of movie data, and start writing code.

  1. Send a GET request, fetch the first page of Douban movie data, and save it
url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100:90&action=&start=0&limit=20'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}

2. Customize the request object

request = urllib.request.Request(url=url, headers=headers)

3. Get response data


response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')

4. Download the data locally (method 1)

fp = open('douban.json', 'w', encoding='utf-8')
fp.write(content)
fp.close()

(method 2)

with open('douban1.json', 'w', encoding='utf-8') as fp:
    fp.write(content)

Full code:


import urllib.request

url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100:90&action=&start=0&limit=20'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}

request = urllib.request.Request(url=url, headers=headers)

response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')


# Method 1
# fp = open('douban.json', 'w', encoding='utf-8')
# fp.write(content)
# fp.close()

# Method 2: another way to save the json data
with open('douban1.json', 'w', encoding='utf-8') as fp:
    fp.write(content)

Run result: the first page of data is saved as douban1.json.
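
To sanity-check what was saved, the response can be parsed with the json module. A minimal sketch, assuming the interface returns a JSON array of movie objects (field names such as 'title' should be verified against the actual response):

import json

movies = json.loads(content)   # content as read above
print(len(movies))             # should be 20, matching limit=20
print(movies[0]['title'])      # assumption: each item carries a 'title' field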

Douban movies (first ten pages)

Observe the interface address for each page. Comparing the first, second, and third pages reveals the pattern:

Only start=XX differs:

page:   1   2   3   4
start:  0  20  40  60

So start = (page - 1) * 20
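
A quick check of the rule (illustration only):

for page in range(1, 5):
    print(page, (page - 1) * 20)   # 1 0, 2 20, 3 40, 4 60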

Start writing the code, in three steps:

  1. Customize the request object
  2. Get the response data
  3. Download the data locally

Program entry point (prints pages 1-10):

if __name__ == '__main__':
    start_page = int(input("Enter the start page: "))
    end_page = int(input("Enter the end page: "))
    for page in range(start_page, end_page+1):
        print(page)

Customize the request object; each page needs its own request object.

# Create a method (a page parameter is passed in so the function can use it)
create_request(page)

The function that builds the customized request:

def create_request(page):
    base_url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100:90&action=&'

    # This is a GET request, so the parameters can be spliced onto the url
    data = {
        'start': (page - 1) * 20,
        'limit': 20
    }
    # Build the query string; a GET request does not need .encode() afterwards
    data = urllib.parse.urlencode(data)

    url = base_url + data

    print(url)

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
    }

This outputs the addresses of pages 1-10.

Customize the request object inside the function (one request is defined per page, ten in total):

request = urllib.request.Request(url=url, headers=headers)

Step two: get the response data (define a function).

def get_content():
    response = urllib.request.urlopen()  # not working yet: urlopen needs the request

The function that fetches the response data needs the request, so the return value comes into play: the request-building function must return the request, and the main function must receive it,

request = create_request(page)

then pass it on to the function that fetches the response data:

get_content(request)

Now that function can use the request parameter.

def get_content(request):
    response = urllib.request.urlopen(request)

Step three: download the data.

# Define the method
down_load()

As in the previous step, down_load() needs the content and page parameters, so remember to pass them in!

def down_load(page, content):
    with open('DB_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)

Full code:

import urllib.parse
import urllib.request

# The tricky part: every page has a different url
def create_request(page):
    base_url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100:90&action=&'

    data = {
        'start': (page - 1) * 20,
        'limit': 20
    }

    data = urllib.parse.urlencode(data)

    url = base_url + data

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
    }


    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content

def down_load(page, content):
    with open('DB_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)

if __name__ == '__main__':
    start_page = int(input("Enter the start page: "))
    end_page = int(input("Enter the end page: "))

    for page in range(start_page, end_page+1):

        request = create_request(page)

        content = get_content(request)

        down_load(page, content)

The first ten pages of Douban movies are saved locally, one json file per page.

2. Ajax POST request

KFC official website

Requirement: crawl which locations in a city have a KFC; fetch the first ten pages of data and save them locally.

Open the official KFC website, click the restaurant query, and select the city to crawl (Chengdu here).

Copy the interface address, then compare the form data of the first and second pages against the interface. The pattern: only pageIndex differs.

This is roughly the same as the two Douban cases above; the only difference is that it is a POST request, so the form data must also be encoded to bytes, as shown in the sketch below.
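
The contrast in one place (a sketch; params is a placeholder dict):

import urllib.parse

params = {'start': 0, 'limit': 20}

# GET: urlencode and splice onto the url as a query string
query = urllib.parse.urlencode(params)                   # 'start=0&limit=20' (str)

# POST: urlencode AND encode to bytes, then pass via the data argument
body = urllib.parse.urlencode(params).encode('utf-8')    # b'start=0&limit=20'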

Attached source code:

import urllib.request
import urllib.parse

def create_request(page):
    base_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'

    data = {
        'cname': "成都",  # Chengdu; the interface expects the Chinese city name
        'pid': "",
        'pageIndex': page,
        'pageSize': "10"
    }

    data = urllib.parse.urlencode(data).encode('utf-8')

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
    }

    # Customize the request object; for POST, data is passed in as bytes
    request = urllib.request.Request(url=base_url, headers=headers, data=data)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content

def down_load(page, content):
    with open('kfc_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)


if __name__ == '__main__':
    start_page = int(input("Enter the start page: "))
    end_page = int(input("Enter the end page: "))

    for page in range(start_page, end_page+1):
        # Customize the request object
        request = create_request(page)
        # Get the page source
        content = get_content(request)
        # Download the data
        down_load(page, content)

Run result: one kfc_<page>.json file per page.
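
The saved files are JSON; to pull out just the store names, they can be parsed back. A minimal sketch, assuming the interface returns an object whose 'Table1' entry lists the stores with a 'storeName' field (verify the keys against the actual response):

import json

with open('kfc_1.json', 'r', encoding='utf-8') as fp:
    result = json.load(fp)

# assumption: the store list sits under 'Table1', each entry with a 'storeName' field
for store in result.get('Table1', []):
    print(store.get('storeName'))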

3. URLError / HTTPError

URL components:

  1. protocol
  2. host
  3. port
  4. request path
  5. parameters (wd, kw)
  6. anchor

Note: HTTPError is a subclass of URLError; both live in the urllib.error module. The components above can be split out of a url programmatically, as illustrated below.
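
A quick illustration with urllib.parse.urlparse:

from urllib.parse import urlparse

parts = urlparse('https://www.baidu.com:443/s?wd=IP#result')
print(parts.scheme)    # https          -> protocol
print(parts.hostname)  # www.baidu.com  -> host
print(parts.port)      # 443            -> port
print(parts.path)      # /s             -> request path
print(parts.query)     # wd=IP          -> parameters
print(parts.fragment)  # result         -> anchor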

Requirement: Get the source code of a web page

import urllib.request

url = 'https://blog.csdn.net/qq_64451048/article/details/127775623'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}

request = urllib.request.Request(url=url, headers=headers)

response = urllib.request.urlopen(request)

content = response.read().decode('utf-8')

print(content)

Now suppose the url is accidentally mistyped (an extra 1 appended to the address).

Running it raises an HTTPError.

Catch the exception:

try:
    request = urllib.request.Request(url=url, headers=headers)

    response = urllib.request.urlopen(request)

    content = response.read().decode('utf-8')

    print(content)
except urllib.error.HTTPError:
    print('The system is being upgraded...')

If instead the url itself is abnormal (for example, the host is wrong), a URLError is raised:

except urllib.error.URLError:
    print('The system is being upgraded...')
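
Since HTTPError is a subclass of URLError, both can be caught in one try block as long as the HTTPError clause comes first. A combined sketch (url and headers as defined above):

import urllib.request
import urllib.error

try:
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    print(content)
except urllib.error.HTTPError:   # reachable host, bad path (e.g. the extra 1)
    print('The system is being upgraded...')
except urllib.error.URLError:    # unreachable or malformed host
    print('The system is being upgraded...')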

4. Handler processor

Handlers customize requests at a more advanced level: once the business logic gets complex, request-object customization alone no longer meets our needs (dynamic cookies and proxies cannot be handled by request-object customization).

1. Basic use

Requirement: Use handler to access Baidu to obtain webpage source code
Three important words: handler, build_opener, open

import urllib.request

url = 'http://www.baidu.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}

request = urllib.request.Request(url=url, headers=headers)


# Get a handler object
handler = urllib.request.HTTPHandler()

# Build an opener from the handler
opener = urllib.request.build_opener(handler)

# Call the open method
response = opener.open(request)

content = response.read().decode('utf-8')

print(content)

2. Proxy server

Common uses of proxies

  • Break through your own IP's access restrictions to visit otherwise blocked sites
  • Access the internal resources of an organization
  • Improve access speed
  • Hide the real IP

Change your outgoing IP address via a proxy:

import urllib.request

url = 'http://www.baidu.com/s?wd=IP'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Cookie': 'BAIDUID=12196FD75453346657491E87390AC35B:FG=1; BIDUPSID=12196FD7545334661F0AE8D4B062BE2E; PSTM=1666008285; ispeed_lsm=2; BDUSS=FhRbk81OEFrZ1RFRFJrWUxCQ1dmRTZUQXp0VXA4ZGZtT0QyOUZ0T0hDRGYtSVpqRVFBQUFBJCQAAAAAAAAAAAEAAACS5FTMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAN9rX2Pfa19je; BD_UPN=13314752; baikeVisitId=9616ee70-5c47-41fe-865c-14dc1b170603; COOKIE_SESSION=195933_1_0_1_1_1_1_0_0_1_0_0_0_0_6_0_1666259938_1666064006_1666259932%7C2%230_1_1666063999%7C1; ZFY=M8B:A6gXyHZyKVBf:AqksGBg5jNPPKTmxNoclm:BgHpXzI:C; B64_BOT=1; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; BDSFRCVID=umLOJeC62ZlGj87jjNM7q-J69LozULrTH6_n1tn9KcRk7KlESLLqEG0PWf8g0KubzcDrogKKXeOTHiFF_2uxOjjg8UtVJeC6EG0Ptf8g0f5; H_BDCLCKID_SF=tJIe_C-atC-3fP36q4rVhP4Sqxby26ntamJ9aJ5nJDoADh3Fe5J8MxCIjpLLBjK8BIOE-lR-QpP-_nul5-IByPtwMNJi2UQgBgJDKl0MLU3tbb0xynoD24tvKxnMBMnv5mOnanTI3fAKftnOM46JehL3346-35543bRTLnLy5KJtMDcnK4-XjjOBDNrP; H_PS_PSSID=36554_37555_37518_37687_37492_34813_37778_37721_37794_36807_37662_37533_37720_37740_26350_22157; delPer=0; BD_CK_SAM=1; PSINO=1; BDRCVFR[Fc9oatPmwxn]=aeXf-1x8UdYcs; BD_HOME=1; sugstore=1; BA_HECTOR=0h040g8k252hak242h8g8rou1hna11u1e; H_PS_645EC=e13b%2FA3XVtQZqyt9d0m3A8twSI3IrHVjaGptlJbr4wMhPOUE0G9YUipXLjIqNjZ2UHOS; BDSVRTM=231'
}

request = urllib.request.Request(url=url, headers=headers)

# Simulate a browser visiting the server
# response = urllib.request.urlopen(request)

# Proxy ip
proxies = {
    'http': '222.74.73.202:42055'
}

handler = urllib.request.ProxyHandler(proxies=proxies)

opener = urllib.request.build_opener(handler)  # the handler must be passed in, or the proxy is ignored

response = opener.open(request)

content = response.read().decode('utf-8')

with open('daili.html', 'w', encoding='utf-8') as fp:
    fp.write(content)

3. Proxy pool

Simple version of the proxy pool:

import random

proxies_pool = [
    {'http': '222.74.73.202:42055111'},
    {'http': '222.74.73.202:42055222'}
]

proxies = random.choice(proxies_pool)

print(proxies)

Custom proxy pool source code:

import random
import urllib.request

proxies_pool = [
    {'http': '121.13.252.60:41564'},
    {'http': '121.13.252.60:41564'}
]

proxies = random.choice(proxies_pool)

url = 'http://www.baidu.com/s?wd=IP'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}

# Customize the request object
request = urllib.request.Request(url=url, headers=headers)

handler = urllib.request.ProxyHandler(proxies=proxies)

opener = urllib.request.build_opener(handler)

response = opener.open(request)

content = response.read().decode('utf-8')

with open('daili1.html', 'w', encoding='utf-8') as fp:
    fp.write(content)
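
Note that random.choice runs only once above, so every request in a longer session reuses the same proxy. To rotate per request, the choice can move into a helper; a small sketch (open_with_random_proxy is a hypothetical helper, the pool entries are placeholders):

import random
import urllib.request

def open_with_random_proxy(request, pool):
    # pick a fresh proxy for every request
    handler = urllib.request.ProxyHandler(proxies=random.choice(pool))
    opener = urllib.request.build_opener(handler)
    return opener.open(request)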
