Python crawler in practice: using Request + urllib to batch-download every music file on the NetEase Cloud Music Soaring Chart (飙升榜)

Foreword

Today I will show how to use Python to crawl all the audio on the Soaring Chart (飙升榜) and save it locally. The code is given below for anyone who needs it, along with a few tips.

First of all, a crawler should look as much like a browser as possible so that it is not recognized as a crawler. The basic step is to add request headers. However, plain-text data like this is crawled by many people, so we should also consider rotating proxy IPs and randomizing the request headers when crawling the music data.
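
The idea of randomizing request headers and switching proxy IPs can be sketched with the standard library alone. This is only a minimal illustration, not the method used in the code later in this article: the `USER_AGENTS` and `PROXIES` pools and the helper names `make_request` and `make_opener` are assumptions made up for the example, and the proxy addresses are placeholders that must be replaced with proxies you actually control.

```python
import random
from urllib.request import ProxyHandler, Request, build_opener

# Hypothetical pools for illustration; replace with your own values.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
]
PROXIES = ['http://127.0.0.1:8080', 'http://127.0.0.1:8081']  # placeholders


def make_request(url):
    """Build a Request whose User-Agent is picked at random from the pool."""
    return Request(url, headers={'User-Agent': random.choice(USER_AGENTS)})


def make_opener():
    """Build an opener that sends traffic through a randomly picked proxy."""
    proxy = random.choice(PROXIES)
    return build_opener(ProxyHandler({'http': proxy, 'https': proxy}))
```

With these helpers, a page would be fetched with `make_opener().open(make_request(url))` instead of calling `urlopen` directly, so each request can go out with a different header and proxy.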

Before writing any crawler code, the first and most important step is always to analyze the target web page.

Analysis also shows that crawling is relatively slow, so when the crawler drives a real browser we can speed it up by disabling image loading, JavaScript, and similar features in Google Chrome.

Development tools

Python version: 3.8

Related modules:

requests module

re module

urllib module

Environment setup

Install Python, add it to the PATH environment variable, and use pip to install the required modules.
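
A quick way to confirm the environment is ready is to check the interpreter version and whether the listed modules can be imported. The sketch below is only an illustration; the helper name `missing_modules` is made up for this example. Of the three modules, `re` and `urllib` ship with Python, while `requests` is a third-party package installed with pip.

```python
import sys
from importlib.util import find_spec

REQUIRED = ['re', 'urllib', 'requests']  # the modules listed above


def missing_modules(names=REQUIRED):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == '__main__':
    print('Python 3.8 or newer:', sys.version_info >= (3, 8))
    print('Missing modules:', missing_modules())
```

If the second line prints a non-empty list, install the named packages with pip before running the crawler.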

Idea analysis

Open the page we want to crawl in the browser.
Press F12 to open the developer tools and locate the music data we want.
The data we need is in the page itself.

Look at the page source code and find the pattern that the song entries follow.
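
To see what "finding the pattern" means in practice, the regular expression from the code section can be tried on a small fragment first. The HTML below is a made-up sample shaped like the entries that expression expects; the ids and song names are placeholders, not real chart data.

```python
from re import findall

# Hypothetical fragment for illustration; ids and names are invented.
sample = (
    '<ul>'
    '<li><a href="/song?id=1001">First Song</a></li>'
    '<li><a href="/song?id=1002">Second Song</a></li>'
    '</ul>'
)

# Same expression as in the code section: group 1 is the id, group 2 the name.
pattern = r'<li><a href="/song\?id=(.+?)">(.+?)</a></li>'
songs = findall(pattern, sample)
print(songs)  # [('1001', 'First Song'), ('1002', 'Second Song')]
```

Each match is an (id, name) pair, which is exactly what the download loop iterates over.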

Code

from os import mkdir
from os.path import isdir, isfile, join
from re import findall
from urllib.request import Request, urlopen

# Folder where the music files are stored
folder = r'F:\music'
if not isdir(folder):
    mkdir(folder)

# URL of the Soaring Chart (飙升榜)
url = 'https://music.163.com/discover/toplist?id=3779629'
# Pretend to be the Chrome browser
headers = {'User-Agent': 'Chrome/88.0.4324.190'}
req = Request(url, headers=headers)
# Read the page source code
with urlopen(req) as fp:
    content = fp.read().decode()

# Regular expression that extracts each song's id and name
pattern = r'<li><a href="/song\?id=(.+?)">(.+?)</a></li>'
for music_id, music_name in findall(pattern, content):
    music_file = join(folder, f'{music_name}.mp3')
    if isfile(music_file):
        print(f'File already exists, skipping...{music_name}')
        continue
    # Download URL
    download_url = f'https://music.163.com/song/media/outer/url?id={music_id}'
    req = Request(download_url, headers=headers)
    # Read the remote music file data and write it to a local file
    with urlopen(req) as fp:
        data = fp.read()
    with open(music_file, 'wb') as fp:
        fp.write(data)
    print(f'Download complete...{music_name}')
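
One thing to watch for: the song name taken from the page is used directly as a file name, so a name containing characters such as `/`, `:` or `?` cannot be saved as-is on Windows. The helper below is a hedged sketch of one way to handle this. The function `safe_filename` is not part of the original code; it is a hypothetical addition that decodes HTML entities and replaces the characters Windows does not allow in file names.

```python
import re
from html import unescape


def safe_filename(name, replacement='_'):
    """Decode HTML entities, then replace characters Windows rejects in file names."""
    name = unescape(name)
    name = re.sub(r'[\\/:*?"<>|]', replacement, name)
    # Windows also dislikes trailing dots and surrounding spaces.
    return name.strip().rstrip('.') or 'untitled'
```

In the loop above, the name would be passed through `safe_filename(music_name)` before the path to the `.mp3` file is built.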

Result display

[Image: result]

Origin blog.csdn.net/Modeler_xiaoyu/article/details/128236996