Preface
The text and images in this article come from the Internet and are for learning and exchange only, with no commercial use. If you have any questions, please contact us.
Previous articles in this series
Python crawler tutorial for beginners (1): crawling Douban movie rankings
Python crawler tutorial for beginners (2): crawling novels
Python crawler tutorial for beginners (3): crawling Lianjia second-hand housing data
Python crawler tutorial for beginners (4): crawling 51job.com job listings
Python crawler tutorial for beginners (5): crawling Bilibili (station B) video bullet comments
Python crawler tutorial for beginners (6): making word clouds
Python crawler tutorial for beginners (7): crawling Tencent Video bullet comments
Python crawler tutorial for beginners (8): crawling forum articles and saving them as PDF
Python crawler tutorial for beginners (9): a multi-threaded crawler case study
Python crawler tutorial for beginners (10): crawling Bi'an ("other shore") 4K ultra-HD wallpapers
Python crawler tutorial for beginners (11): crawling the latest Honor of Kings skins
Python crawler tutorial for beginners (12): crawling the latest League of Legends skins
Python crawler tutorial for beginners (13): crawling high-quality ultra-HD wallpapers
Python crawler tutorial for beginners (14): crawling data from an audiobook website
Python crawler tutorial for beginners (15): crawling music material from a website
Python crawler tutorial for beginners (16): crawling Haokan Video
Python crawler tutorial for beginners (17): crawling short videos across the YY site
Python crawler tutorial for beginners (19): crawling IP proxies and building a proxy pool
Python learning exchange group: 1039645993
Basic development environment
- Python 3.6
- PyCharm
Modules used
import requests
import re
from tqdm import tqdm
import os
Install Python and add it to your PATH environment variable, then install the required third-party modules with pip (pip install requests tqdm); re and os are part of the standard library.
Tutorial videos on Python crawlers, data analysis, web development, and other case studies are free to watch online:
https://space.bilibili.com/523606542
Determine the target
Since the goal is to crawl videos, videos of pretty girls naturally get priority.
You know how it is~
Analyze the web page and find the data source
Videos on station A (AcFun) are served in m3u8 format: the whole video is split into many small segments, and each segment corresponds to one ts file.
So once you find the source of the m3u8 playlist, you can get all the ts files.
The pkey request parameter in the URL changes between requests, but it can be found in the page source, and so can the full m3u8 request link.
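To make the playlist format concrete, here is a minimal sketch of pulling the segment entries out of m3u8 text. The sample playlist and the helper name parse_segments are made up for illustration and are not taken from AcFun; a real playlist will have its own file names and pkey values.

```python
def parse_segments(m3u8_text):
    """Return the segment URIs listed in an m3u8 playlist.

    Lines starting with '#' are tags or comments; every other
    non-empty line is the URI of one media segment.
    """
    segments = []
    for line in m3u8_text.splitlines():
        line = line.strip()
        if line and not line.startswith('#'):
            segments.append(line)
    return segments


# Hypothetical playlist, for illustration only
sample = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:5
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:4.000000,
video.00000.ts?pkey=abc
#EXTINF:4.000000,
video.00001.ts?pkey=abc
#EXT-X-ENDLIST
"""

print(parse_segments(sample))
```

Skipping every line that starts with '#' avoids having to list each tag by name, so the sketch keeps working if the playlist contains tags or duration formats you did not anticipate.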
Overall approach
1. Request the video page and extract the m3u8 URL from its source.
2. Request the m3u8 URL and get the addresses of all the ts files.
3. Save the ts files and merge them into a single mp4 video.
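For step 2, playlist entries are often relative file names rather than full addresses. One way to turn them into downloadable URLs, sketched here under the assumption that the segments live next to the playlist, is to resolve each entry against the playlist's own URL with urljoin from the standard library. The helper name resolve_segments and the example.com URLs are hypothetical.

```python
from urllib.parse import urljoin


def resolve_segments(m3u8_url, segments):
    """Turn playlist entries into absolute URLs.

    urljoin leaves entries that are already absolute untouched and
    resolves relative ones against the playlist's own location.
    """
    return [urljoin(m3u8_url, seg) for seg in segments]


# Hypothetical URLs, for illustration only
playlist_url = 'https://example.com/hls/video.m3u8?pkey=abc'
entries = [
    'video.00000.ts?pkey=abc',
    'https://cdn.example.com/x/video.00001.ts',
]

for url in resolve_segments(playlist_url, entries):
    print(url)
```

This avoids hard-coding a host and path prefix, which may change on the site's side; whether it fits a given site depends on how that site lays out its segment files.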
Implementation code
import requests
import re
from tqdm import tqdm
import os


def change_title(title):
    # Characters that are not allowed in Windows file names: / \ : * ? " < > |
    pattern = re.compile(r"[\/\\\:\*\?\"\<\>\|]")
    new_title = re.sub(pattern, "_", title)  # replace them with underscores
    return new_title


def get_response(html_url):
    # Send a browser-like User-Agent so the request is not rejected outright
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def save(name, video, title):
    # Store each segment as <name>/<title>.ts, creating the folder if needed
    os.makedirs(name, exist_ok=True)
    with open(os.path.join(name, title + '.ts'), mode='wb') as f:
        f.write(video)


def get_m3u8_url(html_url):
    html_data = get_response(html_url).text
    # The m3u8 link and the video title are both embedded in the page source
    m3u8_url = re.findall('backupUrl(.*?)\"]', html_data)[0].replace('"', '').split('\\')[-2]
    title = re.findall('"title":"(.*?)"', html_data)[0]
    new_title = change_title(title)
    m3u8_data = get_response(m3u8_url).text
    # Lines starting with '#' are playlist tags; the rest are ts file names
    lines = (line.strip() for line in m3u8_data.splitlines())
    m3u8 = [line for line in lines if line and not line.startswith('#')]
    for link in tqdm(m3u8):
        ts_url = 'https://tx-safety-video.acfun.cn/mediacloud/acfun/acfun_video/hls/' + link
        video = get_response(ts_url).content
        # Use the segment index from the file name as the local file name
        ts_title = link.split('?')[0].split('.')[1]
        save(new_title, video, ts_title)
    print(f'{title} has finished downloading. Please check the result.')


if __name__ == '__main__':
    video_id = input('Enter the ID of the video to download: ')
    url = f'https://www.acfun.cn/v/{video_id}'
    print('Downloading, please wait...')
    get_m3u8_url(url)

Merging is the easy part: once all the ts files are saved, join them in order into a single video file.
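If you would rather not merge by hand, the following is a minimal sketch of doing it in Python by concatenating the segments byte for byte. The function name merge_ts is hypothetical, and the sketch assumes every file in the folder is named after its segment index (for example 00000.ts, 00001.ts), as the save step above produces.

```python
import os
import re


def merge_ts(folder, output_path):
    """Concatenate every .ts file in `folder` into `output_path`.

    Returns the file names in the order they were written.
    """
    def index_of(filename):
        # Sort by the first number in the name so that 10 comes after 9
        match = re.search(r'\d+', filename)
        return int(match.group()) if match else 0

    names = sorted(
        (n for n in os.listdir(folder) if n.endswith('.ts')),
        key=index_of,
    )
    with open(output_path, mode='wb') as merged:
        for name in names:
            with open(os.path.join(folder, name), mode='rb') as part:
                merged.write(part.read())
    return names


# Example call (folder and file names are placeholders):
# merge_ts('my_video', 'my_video.mp4')
```

Note that byte-level concatenation produces MPEG-TS data even when the output file is named .mp4. Many players cope with that, but converting to a real MP4 container needs a separate tool such as ffmpeg.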