【M3U8】Crawling streaming video data with Python

Introduction to HLS technology

Most video clients no longer play a video file such as an MP4 directly. Instead they use HTTP Live Streaming (HLS), a technology developed by Apple to make streaming more efficient. HLS divides the media stream into many short [TS segments] (each a few seconds long), and the client player downloads those segments one after another by following an [M3U8 list file], which gives real-time streaming playback. The general approach to crawling an HLS stream is therefore to [download the M3U8 file] and parse its content, then download the [TS segments] it lists, and finally either [combine] them into a single video file or save the TS segments as they are.

That sounds simple, but in practice you will run into many complications: the m3u8 file may not be downloadable, the ts segments may be encrypted, and even the key used to encrypt the ts segments may itself be encrypted.
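The final "combine" step can often be done by plain concatenation, because MPEG-TS segments are designed to be played back to back. The sketch below assumes the segments are already downloaded (and decrypted, if needed) and are passed in playlist order; `merge_ts` and the file names are hypothetical names chosen for illustration.

```python
import shutil
from pathlib import Path


def merge_ts(segment_paths, output_path):
    """Concatenate TS segments, in the order given, into one file.

    Returns the size of the merged file in bytes.
    """
    with open(output_path, "wb") as out:
        for path in segment_paths:
            with open(path, "rb") as seg:
                # Stream the copy so large segments are not held in memory.
                shutil.copyfileobj(seg, out)
    return Path(output_path).stat().st_size
```

Pass the paths in the order they appear in the playlist rather than sorting file names, since a plain string sort would put `seg_10.ts` before `seg_2.ts`. The result is still a TS file; turning it into a real MP4 needs a remuxing tool such as ffmpeg.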

Why HLS is used

HLS is widely used in mainstream products today. Its main advantage over traditional streaming technologies is that once the video has been segmented, distribution needs no special software at all: an ordinary web server is enough, which lowers the technical requirements on the server side.

Another advantage of packaging the stream as TS segments is that playback can start without loading the complete video, which greatly reduces the initial loading delay and improves the user experience.

Finally, the biggest advantage of HTTP Live Streaming is adaptive bitrate streaming. The client automatically selects among video streams of different bitrates according to network conditions: it uses a high bitrate when conditions permit, a low bitrate when the network is busy, and switches between them automatically. This helps a great deal in keeping playback smooth when a mobile device has an unstable connection.
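Adaptive bitrate works because a first-level (master) playlist lists several variant streams, each announced by an `#EXT-X-STREAM-INF` tag with a `BANDWIDTH` attribute. The following is a minimal sketch of how a client might pick one; the playlist text, the `example.com` URL, and the function names are made up for illustration, and a real player measures throughput continuously rather than taking a fixed number.

```python
import re
from urllib.parse import urljoin

# Hypothetical master playlist with two variants.
MASTER = """#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
low/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
high/index.m3u8
"""


def parse_variants(text, base_url):
    """Return a list of (bandwidth, absolute_url) for each variant stream."""
    variants = []
    bandwidth = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#EXT-X-STREAM-INF:"):
            # The lookbehind avoids matching AVERAGE-BANDWIDTH.
            m = re.search(r"(?<![A-Z-])BANDWIDTH=(\d+)", line)
            bandwidth = int(m.group(1)) if m else 0
        elif line and not line.startswith("#") and bandwidth is not None:
            # The first URI line after the tag belongs to that variant.
            variants.append((bandwidth, urljoin(base_url, line)))
            bandwidth = None
    return variants


def pick_variant(variants, max_bandwidth):
    """Highest-bandwidth variant that fits, else the lowest one available."""
    fitting = [v for v in variants if v[0] <= max_bandwidth]
    return max(fitting) if fitting else min(variants)


if __name__ == "__main__":
    found = parse_variants(MASTER, "https://example.com/video/index.m3u8")
    print(pick_variant(found, 1_000_000))
```

A crawler usually wants the best quality, so it would call `pick_variant` with a very large `max_bandwidth`.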

The M3U8 file in detail

To crawl resources served over HLS, you first need a good understanding of the structure and tag definitions of an M3U8 file. M3U8 is a format derived from M3U. So what is M3U?

M3U file

An M3U file is not itself an audio or video file. It is a plain-text list of audio and video files.

When playback software receives an M3U file, it does not play the file itself. It looks up the media addresses recorded in it and plays those. In other words, an M3U file only stores a multimedia playlist: an index pointing to audio and video files stored elsewhere, and it is those files that get played.

To make the concept concrete, let's build a simple M3U file (myTest.m3u). Find a few MP3 and MP4 files on your computer and enter their paths, one per line. The content of myTest.m3u is then:

E:\Users\m3u8\Andy Lau-Infernal Affairs.mp4
E:\Users\m3u8\Na Ying-Moment.mp3
E:\Users\m3u8\Jay Chou-Secrets That Can’t Be Told.mp4
E:\Users\m3u8\Flower Congee- One day when I was twenty.mp3
E:\Users\m3u8\Zhou Shen- Big Fish.mp4
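Reading such a list back takes only a few lines. This is a small sketch, assuming the simple form shown above (one entry per line, with `#` lines treated as comments or tags); `read_m3u` and the sample paths are hypothetical.

```python
def read_m3u(text):
    """Return the media entries of an M3U playlist.

    Blank lines and lines starting with '#' (comments and tags) are skipped.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            entries.append(line)
    return entries


if __name__ == "__main__":
    sample = "#EXTM3U\nE:\\music\\a.mp3\n\nE:\\video\\b.mp4\n"
    for entry in read_m3u(sample):
        print(entry)
```

The same "skip `#` lines, keep the rest" rule is what the crawler code later in this article uses to pull segment addresses out of an M3U8 file.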

M3U8 file

M3U8 is M3U saved with UTF-8 encoding (so it is still an M3U file), and HLS defines additional tags for it. Below are several important tags used in M3U8 files:

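#EXTM3U: The first line of every M3U file must be this tag. (Just be aware of it)

#EXT-X-VERSION: The protocol version; this tag is optional. (Just be aware of it)

#EXT-X-TARGETDURATION: The target duration, which defines the [maximum] duration, in seconds, of any TS segment. (Just be aware of it)

#EXT-X-ALLOW-CACHE: Whether caching is allowed. (Just be aware of it)

#EXT-X-MEDIA-SEQUENCE: The sequence number of the first segment in the current M3U8 file; every ts file in an M3U8 file has a fixed, unique sequence number. (Just be aware of it)

#EXT-X-DISCONTINUITY: Tells the player to reinitialize. (Just be aware of it)

#EXT-X-KEY: Defines the encryption: the URL of the key file, the encryption method (for example AES-128), and the IV (initialization vector). (Remember this one)

#EXTINF: The duration of a media segment (ts file). It applies only to the TS link that follows it, and consecutive media segments are separated by this tag. (Just be aware of it)

#EXT-X-ENDLIST: Marks the end of the M3U8 file. (Just be aware of it)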
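To see how these tags fit together, here is a small parser sketch that reads the target duration, the media sequence number, and each segment's duration from a second-level (media) playlist. The sample text, the function name, and the dictionary layout are assumptions made for illustration, and it handles only the tags listed above.

```python
# Hypothetical media playlist used only for this example.
SAMPLE = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:11
#EXT-X-MEDIA-SEQUENCE:7
#EXTINF:8.583,
seg_7.ts
#EXTINF:10.080,
seg_8.ts
#EXT-X-ENDLIST
"""


def parse_media_playlist(text):
    """Collect the basic tags and the segment list of a media playlist."""
    info = {"target_duration": None, "media_sequence": 0,
            "ended": False, "segments": []}
    duration = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#EXT-X-TARGETDURATION:"):
            info["target_duration"] = int(line.split(":", 1)[1])
        elif line.startswith("#EXT-X-MEDIA-SEQUENCE:"):
            info["media_sequence"] = int(line.split(":", 1)[1])
        elif line.startswith("#EXTINF:"):
            # "#EXTINF:<duration>,<optional title>"
            duration = float(line.split(":", 1)[1].split(",", 1)[0])
        elif line == "#EXT-X-ENDLIST":
            info["ended"] = True
        elif not line.startswith("#"):
            # A URI line: its sequence number counts up from MEDIA-SEQUENCE.
            seq = info["media_sequence"] + len(info["segments"])
            info["segments"].append(
                {"seq": seq, "duration": duration, "uri": line})
            duration = None
    return info


if __name__ == "__main__":
    parsed = parse_media_playlist(SAMPLE)
    total = sum(s["duration"] for s in parsed["segments"])
    print(len(parsed["segments"]), "segments, about", round(total, 1), "seconds")
```

Summing the `#EXTINF` values gives the total running time, which is a quick sanity check that the playlist you captured really is the full video.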

[Image: crawling the m3u8 file]

**M3U8 example:** This file contains a large number of links to ts files. These are the actual video data described earlier. Each ts file is a short piece of video that can be played on its own (after decryption, if it is encrypted). The goal of our video crawler is to fetch all of these ts files.

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:19
#EXT-X-ALLOW-CACHE:YES
#EXT-X-MEDIA-SEQUENCE:0

#EXT-X-KEY:METHOD=AES-128,URI="https://edu.aliyun.com/hls/1109/clef/YnBGq7zAJf1Is7xIB5v8vI7AIORwwG9W",IV=0x0fe82567a6be41afda68d82d3724976a
#EXTINF:8.583,
https://xuecdn2.aliyunedu.net/headLeader-0/20170519032524-ggauw1x00qo0okgk-conv/e_20170519032524-ggauw1x00qo0okgk-conv_hd_seg_0.ts
#EXT-X-DISCONTINUITY
#EXT-X-KEY:METHOD=AES-128,URI="https://edu.aliyun.com/hls/2452/clef/0VqtrHq9IkTfOsLqy0iC1FP9342VZm1s",IV=0xdebe4353e61b56e4ecfe0240ca3f89f5
#EXTINF:10.080,
https://xuecdn2.aliyunedu.net/courselesson-50224/20170630095028-3xsfwyxw20cgwws8-conv/e_20170630095028-3xsfwyxw20cgwws8-conv_hd_seg_0.ts
#EXT-X-KEY:METHOD=AES-128,URI="https://edu.aliyun.com/hls/2452/clef/0VqtrHq9IkTfOsLqy0iC1FP9342VZm1s",IV=0x8a3ce90cf18587963953b948487c1729
#EXT-X-KEY:METHOD=AES-128,URI="https://edu.aliyun.com/hls/2452/clef/0VqtrHq9IkTfOsLqy0iC1FP9342VZm1s",IV=0x3f1c20b9dd4459d0adf972eaba85e0a2
#EXTINF:10.000,
https://xuecdn2.aliyunedu.net/courselesson-50224/20170630095028-3xsfwyxw20cgwws8-conv/e_20170630095028-3xsfwyxw20cgwws8-conv_hd_seg_104.ts
#EXT-X-ENDLIST
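Note that the example above contains several `#EXT-X-KEY` lines: a key applies to the segments that follow it until the next `#EXT-X-KEY` appears. A crawler that reads only the first key would decrypt later segments with the wrong key or IV. The sketch below pairs every segment with the key attributes in effect for it; the playlist text, the `example.com` URLs, and the function names are hypothetical.

```python
import re

# Matches NAME=value pairs, where the value is either quoted or runs to a comma.
ATTR_RE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')

# Hypothetical playlist in which the key changes partway through.
PLAYLIST = """#EXTM3U
#EXT-X-KEY:METHOD=AES-128,URI="https://example.com/key/a",IV=0x000102030405060708090a0b0c0d0e0f
#EXTINF:8.0,
a_0.ts
#EXT-X-DISCONTINUITY
#EXT-X-KEY:METHOD=AES-128,URI="https://example.com/key/b",IV=0x0f0e0d0c0b0a09080706050403020100
#EXTINF:10.0,
b_0.ts
#EXTINF:10.0,
b_1.ts
#EXT-X-KEY:METHOD=NONE
#EXTINF:5.0,
c_0.ts
#EXT-X-ENDLIST
"""


def parse_attributes(attr_text):
    """Turn 'METHOD=AES-128,URI="..."' into a dict, dropping the quotes."""
    return {name: value.strip('"') for name, value in ATTR_RE.findall(attr_text)}


def segments_with_keys(text):
    """Return (segment_uri, key_attributes_or_None) in playlist order."""
    current_key = None
    result = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#EXT-X-KEY:"):
            attrs = parse_attributes(line[len("#EXT-X-KEY:"):])
            # METHOD=NONE means the following segments are not encrypted.
            current_key = None if attrs.get("METHOD") == "NONE" else attrs
        elif line and not line.startswith("#"):
            result.append((line, current_key))
    return result


if __name__ == "__main__":
    for uri, key in segments_with_keys(PLAYLIST):
        print(uri, "->", key["URI"] if key else "not encrypted")
```

When every segment shares one key, as in many simple sites, the pairing collapses to a single key and the simpler approach of reading the first `URI` is enough.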

Key file in EXT-X-KEY

Most M3U8 videos are not encrypted, but some video service providers do encrypt their important content. The standard encryption method for M3U8 video is AES-128. If the video is encrypted, a line like the following appears in the M3U8 file:

#EXT-X-KEY:METHOD=AES-128,URI="https://edu.aliyun.com/hls/2452/clef/0VqtrHq9IkTfOsLqy0iC1FP9342VZm1s",IV=0x3f1c20b9dd4459d0adf972eaba85e0a2

METHOD is the encryption method; the standard value is AES-128.

URI is the download address of the key file (the key itself is 16 bytes long and has to be downloaded).

IV is the initialization vector, a 16-byte value written in hexadecimal. If no IV is given, the HLS specification says to use the segment's media sequence number as the IV (a 16-byte big-endian integer). The code in this article simply fills in b"0000000000000000" in that case; because AES-128 here runs in CBC mode, a wrong IV garbles only the first 16 bytes of each segment, which players usually tolerate.

Note: the key and the IV are the inputs AES needs for encryption and decryption, and we will not explain them in depth here. You only need to know that their values are passed directly as parameters to the decryption function. If the file contains no #EXT-X-KEY tag, the media files are not encrypted.
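The IV rule can be written down in a few lines of standard-library Python. This is a sketch under the assumption that the IV attribute, when present, is a hexadecimal string with a `0x` prefix as in the example above; `resolve_iv` is a hypothetical helper name, and the fallback to the media sequence number follows the HLS specification (RFC 8216).

```python
def resolve_iv(iv_attr, media_sequence):
    """Return the 16-byte IV for one segment.

    iv_attr: the IV attribute from #EXT-X-KEY (e.g. "0x3f1c..."), or None.
    media_sequence: the segment's media sequence number, used when no IV
    attribute is present.
    """
    if iv_attr:
        hex_digits = iv_attr[2:] if iv_attr.lower().startswith("0x") else iv_attr
        # Left-pad to 32 hex digits so the result is always 16 bytes.
        return bytes.fromhex(hex_digits.zfill(32))
    # No IV attribute: the sequence number as a 128-bit big-endian integer.
    return media_sequence.to_bytes(16, "big")


if __name__ == "__main__":
    print(resolve_iv("0x3f1c20b9dd4459d0adf972eaba85e0a2", 0).hex())
    print(resolve_iv(None, 5).hex())
```

The returned bytes can be passed as the `iv` argument of `AES.new(key, AES.MODE_CBC, iv=...)` from pycryptodome, which is the call used in the crawler code in this article.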

Requirement

Crawl a video from American Drama Network (meijuw.com): https://www.meijuw.com/vodplay/3985-1-1/

[Image: setup]

[Image: locating the request]

[Image: downloading the m3u8 file]

Code

import os
import re
from urllib.parse import urljoin

import requests
# pip install pycryptodome
from Crypto.Cipher import AES

dirName = 'tsLib'
os.makedirs(dirName, exist_ok=True)

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}

# First-level m3u8 address
# m1_url = 'https://new.iskcd.com/20220311/hzfUuf6B/index.m3u8'
m1_url = 'https://v4.cdtlas.com/20220311/xEaAxRVd/index.m3u8'
m1_page_text = requests.get(url=m1_url, headers=headers).text
# print(m1_page_text)

# Parse the second-level m3u8 address out of the first-level m3u8 file
m1_page_text = m1_page_text.strip()  # strip leading and trailing newlines
# Second-level m3u8 address
m2_url = ''
for line in m1_page_text.split('\n'):
    line = line.strip()
    if line and not line.startswith('#'):
        # Resolve the address against m1_url to get the complete second-level address
        m2_url = urljoin(m1_url, line)
# print(m2_url)
# Request the content of the second-level file
m2_page_text = requests.get(url=m2_url, headers=headers).text
m2_page_text = m2_page_text.strip()
# print(m2_page_text)

# Parse out the address of the decryption key
key_url = re.findall('URI="(.*?)"', m2_page_text, re.S)[0]
# URIs inside the second-level file are relative to that file, so use m2_url as the base
key_url = urljoin(m2_url, key_url)
# print(key_url)
# Request the key address to get the key
# Note: key and iv must both be bytes
key = requests.get(url=key_url, headers=headers).content
iv = b'0000000000000000'  # placeholder IV, used because this playlist gives none
# print(key)

# Parse out the address of every ts segment
ts_url_list = []
for line in m2_page_text.split('\n'):
    line = line.strip()
    if line and not line.startswith('#'):
        ts_url_list.append(urljoin(m2_url, line))
# print(ts_url_list)

# Request the data of every ts segment
for url in ts_url_list:
    # Get the data of the ts segment
    ts_data = requests.get(url=url, headers=headers).content
    # The segment data must be decrypted (using key and iv)
    aes = AES.new(key=key, mode=AES.MODE_CBC, iv=iv)
    desc_data = aes.decrypt(ts_data)  # the decrypted data
    ts_name = url.split('/')[-1]
    ts_path = dirName + '/' + ts_name
    with open(ts_path, 'wb') as fp:
        # Write the decrypted data to the file
        fp.write(desc_data)
    print(ts_name, 'downloaded and saved!')
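Which base URL is passed to `urljoin` matters when the playlist contains relative addresses: a relative URI is resolved against the playlist file that contains it. The short demonstration below uses made-up `example.com` addresses to show the difference; only `urljoin` itself is a real standard-library call.

```python
from urllib.parse import urljoin

# Hypothetical first-level playlist address.
master_url = "https://example.com/20220311/abc/index.m3u8"

# The second-level playlist is listed relative to the first-level one.
media_url = urljoin(master_url, "1200kb/hls/index.m3u8")

# A segment listed as a bare file name inside the second-level playlist:
seg_from_media = urljoin(media_url, "seg_0.ts")    # resolved correctly
seg_from_master = urljoin(master_url, "seg_0.ts")  # points at the wrong folder

# A root-relative address resolves the same way from either base.
root_relative = "/20220311/abc/1200kb/hls/seg_0.ts"
same_a = urljoin(media_url, root_relative)
same_b = urljoin(master_url, root_relative)

if __name__ == "__main__":
    print(media_url)
    print(seg_from_media)
    print(seg_from_master)
    print(same_a == same_b)
```

Sites whose playlists use root-relative or absolute addresses work with either base, which is why the mistake often goes unnoticed until a site uses bare file names.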

Code: downloading the .ts files with coroutines


import asyncio
import os
import re
from urllib.parse import urljoin

import aiohttp
import requests
# pip install pycryptodome
from Crypto.Cipher import AES

dirName = 'tsLib'
os.makedirs(dirName, exist_ok=True)

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}

# First-level m3u8 address
# m1_url = 'https://new.iskcd.com/20220311/hzfUuf6B/index.m3u8'
m1_url = 'https://v4.cdtlas.com/20220311/xEaAxRVd/index.m3u8'
# m1_url = 'https://iqiyi.sd-play.com/20211104/6AoEDjLD/index.m3u8'
# m1_url = 'https://new.iskcd.com/20220228/hc80JRB9/1100kb/hls/playlist_up.m3u8'
# m1_url = 'https://iqiyi.sd-play.com/20220112/tvnvoPbM/index.m3u8'
m1_page_text = requests.get(url=m1_url, headers=headers).text

# print(m1_page_text)

# Parse the second-level m3u8 address out of the first-level m3u8 file
m1_page_text = m1_page_text.strip()  # strip leading and trailing newlines
# Second-level m3u8 address
m2_url = ''
for line in m1_page_text.split('\n'):
    line = line.strip()
    if line and not line.startswith('#'):
        # Resolve the address against m1_url to get the complete second-level address
        m2_url = urljoin(m1_url, line)
# print(m2_url)
# Request the content of the second-level file
m2_page_text = requests.get(url=m2_url, headers=headers).text
m2_page_text = m2_page_text.strip()
# print(m2_page_text)  # prints the whole second-level playlist

# Parse out the address of the decryption key
key_url = re.findall('URI="(.*?)"', m2_page_text, re.S)[0]
# URIs inside the second-level file are relative to that file, so use m2_url as the base
key_url = urljoin(m2_url, key_url)
# print(key_url)  # e.g. https://iqiyi.shanshanku.com/20211104/6AoEDjLD/1200kb/hls/key.key
# Request the key address to get the key
# Note: key and iv must both be bytes
key = requests.get(url=key_url, headers=headers).content
iv = b'0000000000000000'  # placeholder IV, used because this playlist gives none
# print(key)  # the decryption key
# Parse out the address of every ts segment
ts_url_list = []
for line in m2_page_text.split('\n'):
    line = line.strip()
    if line and not line.startswith('#'):
        ts_url_list.append(urljoin(m2_url, line))


# print(ts_url_list)  # the list of .ts segment addresses

# Asynchronously request the data of each ts segment
async def get_ts(sess, url):
    async with sess.get(url=url, headers=headers) as response:
        ts_data = await response.read()  # response body as bytes
        # The segment data must be decrypted (using key and iv)
        aes = AES.new(key=key, mode=AES.MODE_CBC, iv=iv)
        desc_data = aes.decrypt(ts_data)  # the decrypted data

        return [desc_data, url]


def download(t):
    r_list = t.result()
    data = r_list[0]
    url = r_list[1]  # address of the ts file
    ts_name = url.split('/')[-1]
    ts_path = dirName + '/' + ts_name
    with open(ts_path, 'wb') as fp:
        # Write the decrypted data to the file
        fp.write(data)
    print(ts_name, 'downloaded and saved!')


async def main():
    # Share one session across all requests instead of opening one per segment
    async with aiohttp.ClientSession() as sess:
        tasks = []
        for url in ts_url_list:
            task = asyncio.ensure_future(get_ts(sess, url))
            task.add_done_callback(download)
            tasks.append(task)
        await asyncio.wait(tasks)


asyncio.run(main())
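Starting one task per segment sends every request at once, which some servers answer with errors or throttling when a video has hundreds of segments. A common remedy is to cap the number of requests in flight with `asyncio.Semaphore`. The sketch below uses only the standard library and a stand-in `fake_fetch` coroutine instead of a real HTTP call; `fetch_all`, `fake_fetch`, and the limit of 3 are illustrative assumptions, not part of the original code.

```python
import asyncio


async def fetch_all(urls, fetch, limit=8):
    """Run fetch(url) for every url with at most `limit` running at once.

    Results come back in the same order as `urls`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url):
        # Only `limit` coroutines can be inside this block at the same time.
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(guarded(u) for u in urls))


async def fake_fetch(url):
    """Stand-in for a real download: waits briefly, returns the file name."""
    await asyncio.sleep(0.01)
    return url.rsplit("/", 1)[-1].encode()


if __name__ == "__main__":
    demo_urls = ["https://example.com/hls/seg_%d.ts" % i for i in range(10)]
    results = asyncio.run(fetch_all(demo_urls, fake_fetch, limit=3))
    print(results[:3])
```

Because `asyncio.gather` returns results in input order, the output can be written or merged in playlist order even though the downloads finish out of order.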


[Image: execution process]

[Image: saved files]

Origin blog.csdn.net/weixin_48321071/article/details/123507659