Python crawler tutorial for beginners (15): crawling audio material from a website

Preface

The text and pictures in this article are from the Internet and are for learning and communication purposes only, and do not have any commercial use. If you have any questions, please contact us for processing.

Python crawler, data analysis, website development and other case tutorial videos are free to watch online

https://space.bilibili.com/523606542

Previous articles in this series

 

Python crawler tutorial for beginners (1): crawling Douban movie ranking information

Python crawler tutorial for beginners (2): crawling novels

Python crawler tutorial for beginners (3): crawling Lianjia second-hand housing data

Python crawler tutorial for beginners (4): crawling 51job.com recruitment information

Python crawler tutorial for beginners (5): crawling Bilibili video bullet comments

Python crawler tutorial for beginners (6): making word clouds

Python crawler tutorial for beginners (7): crawling Tencent Video bullet comments

Python crawler tutorial for beginners (8): crawling forum articles and saving them as PDF

Python crawler tutorial for beginners (9): a multi-threaded crawler case study

Python crawler tutorial for beginners (10): crawling the other shore 4K ultra-clear wallpaper

Python crawler tutorial for beginners (11): crawling the latest Honor of Kings skins

Python crawler tutorial for beginners (12): crawling the latest League of Legends skins

Python crawler tutorial for beginners (13): crawling high-quality ultra-clear wallpapers

Python crawler tutorial for beginners (14): crawling audio novel website data

 

Basic development environment

  • Python 3.6
  • PyCharm

Modules used

import os
import concurrent.futures
import requests
import parsel

Install Python and add it to your PATH environment variable, then use pip to install the third-party modules (requests and parsel). os and concurrent.futures ship with Python.

1. Define the requirement

Can't find suitable audio material?  Python batch crawl audio material website

 


Although the site indicates that downloading requires payment, the audio files can in fact be downloaded for free.

2. Web page data analysis

Open the browser's developer tools and click to play an audio clip. The URL of the audio file will show up under the Media filter.

To verify that this link is the actual download address of the audio, copy it and paste it into a new browser window.

The browser automatically downloads an audio file. The file plays, and it matches the sound heard on the web page.
This confirms that the link is the audio address we are after.
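The same check can be sketched in code. This is only a heuristic, and the function name looks_like_mp3 is made up for illustration: it accepts a response whose Content-Type is audio/*, or whose first bytes look like an ID3 tag or an MPEG audio frame header.

```python
def looks_like_mp3(content_type, first_bytes):
    """Heuristic check that a response is MP3 audio rather than an HTML page."""
    # Servers usually label MP3 files as audio/mpeg; accept any audio/* type.
    if content_type and content_type.split(';')[0].strip().lower().startswith('audio/'):
        return True
    # An ID3v2 tag at the start of the file begins with b'ID3'.
    if first_bytes[:3] == b'ID3':
        return True
    # A raw MPEG audio frame begins with 11 set sync bits: 0xFF, then the top 3 bits set.
    return len(first_bytes) >= 2 and first_bytes[0] == 0xFF and (first_bytes[1] & 0xE0) == 0xE0


if __name__ == '__main__':
    print(looks_like_mp3('audio/mpeg', b''))
    print(looks_like_mp3('text/html; charset=utf-8', b'<!DOCTYPE html>'))
```

With requests, the two arguments would come from response.headers.get('Content-Type') and response.content[:3] of the downloaded link.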

https://downsc.chinaz.net/Files/DownLoad/sound1/202102/s830.mp3

The usual approach is to copy a distinctive part of the link and search for it in the developer tools. Here, s830 is evidently the ID of the audio.
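If you want to pull that ID out of a link programmatically, a small sketch could look like this. The helper name audio_id is hypothetical, and it assumes the ID is simply the file name without its extension, as in the link above.

```python
import posixpath
from urllib.parse import urlparse


def audio_id(audio_url):
    """Return the file name of the URL path without its extension."""
    path = urlparse(audio_url).path          # drops scheme, host and query string
    filename = posixpath.basename(path)      # e.g. 's830.mp3'
    return posixpath.splitext(filename)[0]   # e.g. 's830'


if __name__ == '__main__':
    print(audio_id('https://downsc.chinaz.net/Files/DownLoad/sound1/202102/s830.mp3'))
```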

Searching for s830 shows that the page source already contains the download address. The address in the page is not a complete URL, so after extracting it you have to build the full URL yourself.
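One way to complete the URL is the standard library's urljoin, shown here as a minimal sketch. It assumes the address in the page source is protocol-relative (it starts with //); the helper name absolute_audio_url is made up for illustration.

```python
from urllib.parse import urljoin


def absolute_audio_url(page_url, src):
    """Resolve src against the page it was found on."""
    # urljoin fills in whatever src is missing (here, the scheme) from page_url.
    return urljoin(page_url, src)


if __name__ == '__main__':
    page = 'https://sc.chinaz.com/yinxiao/index_2.html'
    print(absolute_audio_url(page, '//downsc.chinaz.net/Files/DownLoad/sound1/202102/s830.mp3'))
```

Compared with concatenating 'https:' by hand, urljoin also leaves an already complete URL unchanged and resolves site-relative paths.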

The page data is fairly simple, so the plan has only two steps:
1. Request the current web page and extract each audio address and audio title
2. Download each audio file and save it

3. Code implementation

Get the audio address and audio title

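def get_response(html_url):
    # This helper is not shown in the original listing; it is assumed to be
    # a plain GET request sent with a browser-like User-Agent.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
    return requests.get(url=html_url, headers=headers, timeout=10)


def main(html_url):
    html_data = get_response(html_url).text
    selector = parsel.Selector(html_data)
    # Each .audio-item element holds one clip's title and its <audio> tag
    lis = selector.css('#AudioList .container .audio-item')
    for li in lis:
        name = li.css('.name::text').get(default='').strip()
        src = li.css('audio::attr(src)').get()
        if not name or not src:
            continue  # skip items without a title or an audio source
        audio_url = 'https:' + src  # src comes without the scheme
        print(name, audio_url)
        save(name, audio_url)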

Save the data

def save(name, audio_url):
    header = {
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    # Download the raw bytes of the audio file
    audio_content = requests.get(url=audio_url, headers=header).content
    path = 'audio'
    # exist_ok avoids an error when several threads create the folder at once
    os.makedirs(path, exist_ok=True)
    # os.path.join keeps the path portable across operating systems
    with open(os.path.join(path, name + '.mp3'), mode='wb') as f:
        f.write(audio_content)
The download request needs its own headers as well; without them the file is not downloaded. The code keeps running, but nothing is saved.
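One thing save does not handle is a title containing characters that are not allowed in file names, which would make open fail. The following is a minimal sketch of a clean-up step; the helper name safe_filename and the fallback name 'untitled' are assumptions made for illustration, not part of the original code.

```python
import re


def safe_filename(name, replacement='_'):
    """Replace characters that Windows does not allow in file names."""
    # Reserved punctuation plus ASCII control characters
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', replacement, name)
    # Windows also rejects names that end in a space or a dot
    cleaned = cleaned.strip(' .')
    return cleaned or 'untitled'


if __name__ == '__main__':
    print(safe_filename('rain/thunder: night?'))
```

It could be applied to name right before the file path is built.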

Multi-threaded crawling

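if __name__ == '__main__':
    # The with block waits for every submitted page to finish before exiting
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        for page in range(1, 31):
            url = f'https://sc.chinaz.com/yinxiao/index_{page}.html'
            # main(url)  # single-threaded alternative
            executor.submit(main, url)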

Origin blog.csdn.net/m0_48405781/article/details/113725399