Python crawler beginner tutorial (20): Crawling m3u8 videos from station A (AcFun)

Preface

The text and pictures in this article are from the Internet and are for learning and communication purposes only, and do not have any commercial use. If you have any questions, please contact us for processing.

Preamble content

Python crawler beginners introductory teaching (1): crawling Douban movie ranking information

Python crawler novice introductory teaching (2): crawling novels

Python crawler beginners introductory teaching (3): crawling Lianjia second-hand housing data

Python crawler novice introductory teaching (4): crawling 51job.com recruitment information

Python crawler beginners introductory teaching (5): crawling the video barrage of station B

Python crawler novice introductory teaching (6): making word cloud diagrams

Python crawler beginners introductory teaching (7): crawling Tencent video barrage

Python crawler novice introductory teaching (8): crawl forum articles and save them as PDF

Python crawler beginners introductory teaching (9): multi-threaded crawler case explanation

Python crawler novice introductory teaching (10): crawling the other shore 4K ultra-clear wallpaper

Python crawler beginners introductory teaching (11): recent king glory skin crawling

Python crawler novice introductory teaching (12): the latest skin crawling of League of Legends

Python crawler beginners introductory teaching (13): crawling high-quality ultra-clear wallpapers

Python crawler beginners' introductory teaching (14): crawling audio novel website data

Python crawler beginners' introductory teaching (15): crawling website music materials

Python crawler beginners' introductory teaching (16): crawling good-looking videos

Python crawler beginners introductory teaching (17): crawling yy site-wide small video

Python crawler beginners' introductory teaching (19): crawl ip proxy, build proxy pool

 

Python learning exchange group: 1039645993

 

Basic development environment

  • Python 3.6
  • PyCharm

Use of related modules

import requests
import re
from tqdm import tqdm
import os

Install Python and add it to the PATH environment variable, then use pip to install the third-party modules (requests and tqdm). The re and os modules ship with the standard library.


Determine target needs

Since we are crawling videos anyway, videos of the young ladies naturally take priority.

 

I know everything~

Web data analysis, find data sources

Videos on station A (AcFun) are served in m3u8 format: the whole video is split into many small segments, and each segment corresponds to one ts file.
 

 

So once we find the source of the m3u8 playlist, we can get the addresses of all the ts files.
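As a rough illustration of what "getting all the ts files" from an m3u8 playlist means, here is a minimal sketch. The playlist text, the example.com URL, and the function name parse_segments are made up for the example; real playlists from station A may look different. It assumes a simple, unencrypted media playlist in which every line that does not start with '#' names one segment.

```python
from urllib.parse import urljoin

# Hypothetical playlist content, shaped like a typical HLS media playlist
SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:10.000000,
segment.00000.ts?pkey=abc
#EXTINF:4.500000,
segment.00001.ts?pkey=abc
#EXT-X-ENDLIST
"""


def parse_segments(playlist_text, playlist_url):
    """Return absolute URLs for every segment listed in the playlist."""
    segments = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Skip blank lines and tag/comment lines such as #EXTINF
        if not line or line.startswith('#'):
            continue
        # Segment names are usually relative to the playlist address
        segments.append(urljoin(playlist_url, line))
    return segments


if __name__ == '__main__':
    for url in parse_segments(SAMPLE_PLAYLIST, 'https://example.com/hls/index.m3u8'):
        print(url)
```

Filtering on the leading '#' avoids having to strip each tag with its own regular expression, and urljoin resolves relative segment names against the playlist address instead of a hard-coded prefix.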

 

The pkey request parameter in the m3u8 URL changes between requests, but its value can be found in the source code of the web page. In fact, the complete m3u8 request link, pkey included, is available in the page source.
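To make the idea concrete, the sketch below pulls an m3u8 link (and its pkey) out of a page-source string. The sample fragment, the field name playInfo, the example.com address, and the helper names find_m3u8_urls and extract_pkey are all hypothetical; the real page embeds the link in its own structure, so the pattern would likely need adjusting.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical page fragment: a JSON string nested inside JSON,
# so the inner quotes appear as \" in the page source
SAMPLE_HTML = (
    '{"title":"demo","playInfo":"{\\"backupUrl\\":'
    '[\\"https://example.com/hls/video.m3u8?pkey=ABC123\\"]}"}'
)

# Match a URL containing ".m3u8", stopping at a quote, backslash or whitespace
M3U8_PATTERN = re.compile(r'https?://[^"\\\s]+\.m3u8[^"\\\s]*')


def find_m3u8_urls(html_text):
    """Return the distinct m3u8 links found in the page source, in order."""
    return list(dict.fromkeys(M3U8_PATTERN.findall(html_text)))


def extract_pkey(m3u8_url):
    """Return the pkey query parameter of an m3u8 link, or None if absent."""
    values = parse_qs(urlsplit(m3u8_url).query).get('pkey')
    return values[0] if values else None


if __name__ == '__main__':
    for link in find_m3u8_urls(SAMPLE_HTML):
        print(link, extract_pkey(link))
```

Because the link is read from the freshly requested page each time, the changing pkey never has to be computed or guessed.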

 

The overall approach

1. Request the video page and extract the m3u8 URL from its source code.

2. Request the m3u8 URL and collect the addresses of all the ts files.

3. Save the ts files and merge them into a single video in mp4 format.

Implementation code

import requests
import re
from tqdm import tqdm
import os


def change_title(title):
    # Characters that are not allowed in Windows file names: / \ : * ? " < > |
    pattern = re.compile(r'[\\/:*?"<>|]')
    new_title = re.sub(pattern, "_", title)  # replace each one with an underscore
    return new_title


def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def save(name, video, title):
    # Create one folder per video and write each ts segment into it
    os.makedirs(name, exist_ok=True)
    with open(os.path.join(name, title + '.ts'), mode='wb') as f:
        f.write(video)


def get_m3u8_url(html_url):
    html_data = get_response(html_url).text
    # The page source embeds the m3u8 link (including the changing pkey) in a backupUrl field
    m3u8_url = re.findall('backupUrl(.*?)\"]', html_data)[0].replace('"', '').split('\\')[-2]
    title = re.findall('"title":"(.*?)"', html_data)[0]
    new_title = change_title(title)
    m3u8_data = get_response(m3u8_url).text

    # Lines starting with '#' are playlist tags (#EXTM3U, #EXTINF, ...);
    # every other non-empty line names one ts segment
    m3u8 = [line.strip() for line in m3u8_data.splitlines()
            if line.strip() and not line.strip().startswith('#')]

    for link in tqdm(m3u8):
        ts_url = 'https://tx-safety-video.acfun.cn/mediacloud/acfun/acfun_video/hls/' + link
        video = get_response(ts_url).content
        # Use the part between the first and second dot of the file name as the segment name
        ts_title = link.split('?')[0].split('.')[1]
        save(new_title, video, ts_title)
    print(f'{title} has finished downloading.')


if __name__ == '__main__':
    video_id = input('Enter the ID of the video to download: ')
    url = f'https://www.acfun.cn/v/{video_id}'
    print('Downloading, please wait...')
    get_m3u8_url(url)

 


Merging the downloaded ts segments is the easy part and takes almost no effort.
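The script above only saves the individual segments, so one possible way to merge them is sketched here. It assumes the segments are unencrypted MPEG-TS files sitting in one folder and named by a numeric index (for example 00001.ts), and it simply concatenates them in numeric order. The names merge_ts_files and segment_order are hypothetical helpers, not part of the original script.

```python
import os
import re
import shutil


def segment_order(filename):
    """Sort key: the last run of digits in the file name, as an integer."""
    stem = os.path.splitext(filename)[0]
    digits = re.findall(r'\d+', stem)
    return (int(digits[-1]) if digits else 0, filename)


def merge_ts_files(folder, output_path):
    """Concatenate every .ts file in folder into output_path, in numeric order.

    Returns the list of segment file names in the order they were merged.
    """
    output_abs = os.path.abspath(output_path)
    names = [
        name for name in os.listdir(folder)
        if name.endswith('.ts')
        # Do not merge a previous output file back into itself
        and os.path.abspath(os.path.join(folder, name)) != output_abs
    ]
    names.sort(key=segment_order)
    with open(output_path, 'wb') as merged:
        for name in names:
            with open(os.path.join(folder, name), 'rb') as part:
                shutil.copyfileobj(part, merged)
    return names


if __name__ == '__main__':
    folder = input('Folder containing the ts segments: ')
    merged = merge_ts_files(folder, os.path.join(folder, 'merged.ts'))
    print(f'Merged {len(merged)} segments')
```

The result is still a ts stream rather than a true mp4 container. If an mp4 file is needed, a remuxing tool can convert it without re-encoding, for example ffmpeg -i merged.ts -c copy merged.mp4.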
 

 


 


Origin blog.csdn.net/m0_48405781/article/details/114585010