Preface
The text and images in this article are taken from the Internet and are intended for learning and exchange only; they may not be used commercially. If you have any concerns, please contact us.
Case-study tutorial videos on Python crawlers, data analysis, website development, and more are free to watch online:
https://space.bilibili.com/523606542
Previous articles in this series
Python crawler beginner tutorial (1): crawling Douban movie ranking information
Python crawler beginner tutorial (2): crawling novels
Python crawler beginner tutorial (3): crawling Lianjia second-hand housing data
Python crawler beginner tutorial (4): crawling 51job.com recruitment information
Python crawler beginner tutorial (5): crawling Bilibili video bullet comments
Python crawler beginner tutorial (6): making word cloud diagrams
Python crawler beginner tutorial (7): crawling Tencent Video bullet comments
Python crawler beginner tutorial (8): crawling forum articles and saving them as PDF
Python crawler beginner tutorial (9): a multi-threaded crawler case study
Python crawler beginner tutorial (10): crawling Bi'an 4K ultra-HD wallpapers
Python crawler beginner tutorial (11): crawling the latest Honor of Kings skins
Python crawler beginner tutorial (12): crawling the latest League of Legends skins
Python crawler beginner tutorial (13): crawling high-quality ultra-HD wallpapers
Python crawler beginner tutorial (14): crawling audio novel website data
Basic development environment
- Python 3.6
- PyCharm
Modules used:
```python
import os
import concurrent.futures
import requests
import parsel
```
Install Python, add it to your environment variables, and install the required modules with pip.
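Of the modules above, `requests` and `parsel` are third-party packages; `os` and `concurrent.futures` ship with Python. The third-party ones can be installed with:

```shell
pip install requests parsel
```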
1. Determine the requirement
Although the site indicates that downloads are paid, the audio files can in fact be downloaded for free.
2. Analyze the web page data
Open the browser's developer tools and click play on an audio clip; the URL of the audio file appears under the Media tab.
To verify that this link is the real download address, copy it and paste it into a new browser window.
It automatically downloads an audio file, which plays and matches the sound on the web page.
So this is the audio address we want:
https://downsc.chinaz.net/Files/DownLoad/sound1/202102/s830.mp3
The usual approach is to copy part of the link and search for it in the developer tools. Clearly, s830 is the ID of the audio.
Searching for s830 reveals that the page's own HTML already contains the download address; after extracting it, you only need to stitch the full URL together yourself.
The page data is not complicated, and the logic is fairly simple:
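As a quick illustration of the stitching step: the `src` attribute in the page is protocol-relative (it starts with `//`), so the full address is built by prepending the scheme. The sample value below is the s830 link found above.

```python
# src value as it appears in the page's audio tag (protocol-relative)
src = '//downsc.chinaz.net/Files/DownLoad/sound1/202102/s830.mp3'
audio_url = 'https:' + src
print(audio_url)  # → https://downsc.chinaz.net/Files/DownLoad/sound1/202102/s830.mp3
```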
1. Request the current page and extract each audio's address and title.
2. Download and save the file.
3. Code implementation
Extract the audio address and title:
```python
def get_response(html_url):
    # Helper referenced but not shown in the original post:
    # request the page with a browser User-Agent
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    return requests.get(url=html_url, headers=headers)

def main(html_url):
    html_data = get_response(html_url).text
    selector = parsel.Selector(html_data)
    lis = selector.css('#AudioList .container .audio-item')
    for li in lis:
        name = li.css('.name::text').get().strip()
        src = li.css('audio::attr(src)').get()
        # The src attribute is protocol-relative, so prepend the scheme
        audio_url = 'https:' + src
        save(name, audio_url)
        print(name, audio_url)
```
Save the data:
```python
def save(name, audio_url):
    headers = {
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    # Request the audio file itself; without headers the site returns nothing
    audio_content = requests.get(url=audio_url, headers=headers).content
    path = 'audio\\'
    if not os.path.exists(path):
        os.mkdir(path)
    with open(path + name + '.mp3', mode='wb') as f:
        f.write(audio_content)
```
Note that you must pass a headers parameter here as well, otherwise the file will not download: the code keeps running, but nothing is saved.
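One practical caveat: audio titles scraped from the page may contain characters that are invalid in Windows file names (such as `?` or `:`), which makes `open()` fail. A small sanitizing helper (my addition, not part of the original code) sidesteps this:

```python
import re

def sanitize(name):
    # Replace characters that Windows forbids in file names
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()

print(sanitize('audio: test?'))  # → audio_ test_
```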
Multi-threaded crawling
```python
if __name__ == '__main__':
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)
    for page in range(1, 31):
        url = f'https://sc.chinaz.com/yinxiao/index_{page}.html'
        # main(url)  # single-threaded version
        executor.submit(main, url)
    executor.shutdown()  # wait for all downloads to finish
```
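A minor robustness note: `ThreadPoolExecutor` can also be used as a context manager, which waits for all submitted tasks to finish before the program exits. A minimal sketch with a stand-in task in place of `main(url)`:

```python
import concurrent.futures

def task(page):
    # Stand-in for main(url): just echoes the page number back
    return page

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(task, page) for page in range(1, 6)]
    results = [f.result() for f in futures]

print(results)  # → [1, 2, 3, 4, 5]
```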