Extracting mp3 ID3v2.3 information with Python

Use Python to extract the song title, artist, album, duration and cover picture from an mp3 file with an ID3v2.3 tag (source code included).

Extracted content

  1. Song title;
  2. Artist name;
  3. Album name;
  4. Song duration;
  5. Cover picture.

The result is returned as a dictionary, which can easily be converted to JSON. The picture entry holds the path where the picture was saved.
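As a small illustration of the JSON conversion mentioned above, here is a sketch using the standard json module. The dictionary keys and values are hypothetical placeholders, not output from a real file.

```python
import json

# Hypothetical result in the shape described above; all values are placeholders.
song_info = {
    "title": "Example Title",
    "artist": "Example Artist",
    "album": "Example Album",
    "duration": "4:10",
    "picture path": "example.jpg",
}

# ensure_ascii=False keeps non-ASCII titles readable instead of \uXXXX escapes.
song_json = json.dumps(song_info, ensure_ascii=False, indent=2)
print(song_json)
```

Loading the string back with json.loads gives the same dictionary, so nothing is lost in the conversion.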

Implementation principle

Read the mp3 file in Python as hexadecimal (a hex editor such as WinHex helps with the analysis), then locate and extract each field by comparing the bytes against the format description on the official ID3 website.
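Before any field is extracted, the file has to be recognised as ID3v2.3. The sketch below shows one way to do that check on raw bytes instead of a hex string. The function names are made up for illustration. The size decoding follows the ID3v2 header layout (four bytes with 7 useful bits each), which the article itself does not use.

```python
def parse_id3v2_header(header: bytes):
    """Return (major_version, tag_size) for an ID3v2 header, or None if absent."""
    # The header is 10 bytes: "ID3", major version, revision, flags, 4 size bytes.
    if len(header) < 10 or header[:3] != b"ID3":
        return None
    major_version = header[3]
    # The tag size uses only the low 7 bits of each byte, most significant first.
    size = 0
    for byte in header[6:10]:
        size = (size << 7) | (byte & 0x7F)
    return major_version, size


def read_id3v2_header(mp3_path: str):
    """Read the first 10 bytes of a file and parse them as an ID3v2 header."""
    with open(mp3_path, "rb") as mp3_file:
        return parse_id3v2_header(mp3_file.read(10))


# Hand-built header: "ID3", version 3, revision 0, no flags, size bytes 00 00 02 01.
sample_header = bytes.fromhex("49443303000000000201")
print(parse_id3v2_header(sample_header))
```

A major version of 3 corresponds to the check for 494433 followed by 03 that the program performs on the hex string.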

  1. The song title, artist name and album name are stored in the ID3 tag at the start of the mp3 file, in the TIT2, TPE1 and TALB frames;
  2. The song duration is calculated from the actual audio content. To match it reliably, the byte sequence 496e666f0000000f (shown as Info... in a hex editor) is used as the mark where the calculation starts. This is not exactly where the audio begins, so there is a small error, but it does not noticeably affect the resulting duration.
    Count the bytes from the Info... mark to the end of the file, multiply by 8 to get the number of bits, then divide by the bit rate (128 kbps, that is 128 * 1000 bits per second). The result is the song duration in seconds. This assumes the file is encoded at a constant 128 kbps; for other bit rates the result will be wrong.
    Formula: byte length * 8 / (128 * 1000)
  3. The cover picture is taken from ffd8 (the JPEG start marker) up to 496e666f0000000f. Strictly speaking, the picture ends at ffd9, but the Info... mark is used as the end position because it can be matched more reliably. The extracted data therefore contains some extra bytes after the picture, which does not affect the saved picture.
    (Note: the code only extracts JPEG pictures. The file may contain a PNG picture instead, which the code does not handle.)
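The duration formula in item 2 can be tried out on its own. The sketch below works on raw bytes; the function names are hypothetical, and the 128 kbps default is the same assumption the article makes.

```python
INFO_MARKER = b"Info\x00\x00\x00\x0f"  # the bytes 496e666f0000000f


def bytes_after_marker(data: bytes) -> int:
    """Number of bytes from the Info marker to the end of the data."""
    index = data.find(INFO_MARKER)
    if index == -1:
        raise ValueError("Info marker not found")
    return len(data) - index


def estimate_duration(audio_bytes: int, bitrate_kbps: int = 128) -> str:
    """Format byte length * 8 / (bitrate * 1000) as minutes:seconds."""
    seconds = audio_bytes * 8 / (bitrate_kbps * 1000)
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


# 4,000,000 bytes at 128 kbps: 32,000,000 bits / 128,000 bits per second = 250 s.
print(estimate_duration(4_000_000))  # 4:10
```

Padding the seconds to two digits avoids ambiguous output such as 3:5 for three minutes and five seconds.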

★ Hexadecimal analysis figures for the implementation principle (read them alongside the tag format description on the official id3v2.3.0 website)

Hexadecimal Analysis Figure 1
... (omitted here)
Hexadecimal Analysis Figure 2
Tag types: TIT2 is the song title, TPE1 the artist name, TALB the album name

TIT2 tag (song title) analysis:

Hexadecimal:
5449543200000011000001FFFEE0563A4E604F20004062E54E1162

  1. 54495432 (H): the tag ID TIT2, the song title.
  2. 00000011 (H): the tag length, 17 in decimal (D). The tag content is 17 bytes long, of which 1 byte is the encoding format, so the song title itself occupies 16 bytes.
    Namely: FFFEE0563A4E604F20004062E54E1162 (two hexadecimal digits per byte)
  3. 0000 (H): flags. I am not sure what these mean either.
  4. 01 (H): the encoding format, with the following values:
    0: the characters in the frame content are encoded in ISO-8859-1;
    1: the characters in the frame content are encoded in UTF-16 with a byte order mark (BOM); here the BOM FFFE means little-endian (UTF-16LE);
    2: the characters in the frame content are encoded in UTF-16BE without a BOM (only supported by ID3v2.4);
    3: the characters in the frame content are encoded in UTF-8 (only supported by ID3v2.4).
  5. FFFEE0563A4E604F20004062E54E1162 (H): the song title, which has to be decoded according to the encoding format in item 4.

Other tags are parsed in the same way.
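The walk-through above can be reproduced on the sample bytes directly. This is a minimal sketch for a single text frame, assuming the frame starts at offset 0 and uses encoding 0 or 1; the function name is made up for illustration.

```python
# The sample TIT2 frame from the analysis above.
frame = bytes.fromhex("5449543200000011000001FFFEE0563A4E604F20004062E54E1162")


def parse_text_frame(frame: bytes):
    """Split one ID3v2.3 text frame into (frame_id, size, text)."""
    frame_id = frame[0:4].decode("ascii")      # 4-byte frame ID
    size = int.from_bytes(frame[4:8], "big")   # 4-byte size, big-endian
    # frame[8:10] holds the two flag bytes, which are not needed here.
    encoding_byte = frame[10]                  # first content byte
    codecs = {0: "iso8859-1", 1: "utf-16"}     # "utf-16" reads the BOM itself
    text = frame[11:10 + size].decode(codecs[encoding_byte])
    return frame_id, size, text.rstrip("\x00")


frame_id, size, title = parse_text_frame(frame)
print(frame_id, size, title)
```

For the sample this yields the frame ID TIT2, a size of 17, and a title of seven characters (16 content bytes minus the 2-byte BOM, at two bytes per character).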

File data

The program only needs to import the re module, because it uses regular expressions, so the code can be copied and run directly.

The program generates a log.txt file for collecting logs. It is located in the directory the program is run from.

Extracted pictures are saved in the same folder as the song.
For example:
File save layout (figure)
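To make the save location concrete, here is a sketch of how the picture path can be derived from the song path and how the picture bytes can be sliced out. It works on raw bytes and uses os.path instead of regular expressions, so it is a variation on the program's approach rather than a copy of it. All function names are hypothetical.

```python
import os

JPEG_START = b"\xff\xd8"               # ffd8, start of a JPEG picture
INFO_MARKER = b"Info\x00\x00\x00\x0f"  # 496e666f0000000f, used as the end position


def cover_path_for(mp3_path: str) -> str:
    """Same folder and base name as the song, with a .jpg extension."""
    return os.path.splitext(mp3_path)[0] + ".jpg"


def slice_cover(data: bytes):
    """Return the bytes from the JPEG start up to the Info marker, or None."""
    start = data.find(JPEG_START)
    if start == -1:
        return None
    end = data.find(INFO_MARKER, start)
    if end == -1:
        return None
    return data[start:end]


def save_cover(data: bytes, mp3_path: str) -> str:
    """Write the picture next to the song and return its path ("" if none)."""
    cover = slice_cover(data)
    if cover is None:
        return ""
    out_path = cover_path_for(mp3_path)
    with open(out_path, "wb") as out_file:
        out_file.write(cover)
    return out_path


print(cover_path_for(os.path.join("music", "song.mp3")))
```

Unlike the program, this slice stops just before the Info marker instead of including it.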

Implementation code

# Required modules
import re  # regular expressions

# Read the mp3 file and split its path
def mp3Info(input_file_url):
    # Read the whole mp3 file as a hexadecimal string (two hex digits per byte)
    with open(input_file_url, "rb") as input_file:
        mp3_data = input_file.read().hex()

    # An ID3v2.3 tag starts with "ID3" (494433) followed by the version byte 03
    if mp3_data[:6] == "494433" and mp3_data[6:8] == "03":
        print("The song has an ID3v2.3 tag, extracting information...")

        # Split the path into the folder part (with trailing separator)
        # and the file name without the ".mp3" extension
        path_match = re.match(r"(.*[\\/])?(.*?)(\.mp3)?$", input_file_url, re.IGNORECASE)
        input_file_path = path_match.group(1) or ""
        input_file_name = path_match.group(2)

        return mp3_data, input_file_path, input_file_name  # hex data, folder path, file name
    else:
        return ""


# Extract the content of one tag
def tagInfo(tag_name, mp3_data):
    # Tag names and their hexadecimal form
    tag_name_list = {"TIT2": "54495432", "TPE1": "54504531", "TALB": "54414c42"}
    if tag_name not in tag_name_list:
        return ""
    tag_hex = tag_name_list[tag_name]  # tag name as hexadecimal

    # Locate the tag; return an empty string if the file does not contain it
    tag_index = mp3_data.find(tag_hex)
    if tag_index == -1:
        return ""

    # Header layout: 4-byte name, 4-byte length, 2-byte flags (two hex digits per byte)
    tag_len = int(mp3_data[(tag_index + 2 * 4):(tag_index + 2 * (4 + 4))], 16)
    # print("The %s tag is %s bytes long" % (tag_name, tag_len))

    # The first content byte is the encoding format
    tag_data_type = mp3_data[(tag_index + 2 * (4 + 4 + 2)):(tag_index + 2 * (4 + 4 + 2 + 1))]
    if tag_data_type == "00":
        encoding_type = 'iso8859-1'
    elif tag_data_type == "01":
        encoding_type = 'utf-16'  # the BOM at the start selects the byte order
    elif tag_data_type == "02":
        encoding_type = 'utf-16-be'
    else:
        return ""  # "03" (UTF-8) is only supported by ID3v2.4

    # Tag content after the encoding byte: tag_len - 1 bytes
    content_start = tag_index + 2 * (4 + 4 + 2 + 1)
    tag_data_hex = mp3_data[content_start:(content_start + tag_len * 2 - 2)]
    tag_data_bytes = bytes.fromhex(tag_data_hex)  # hex string to bytes
    tag_info = tag_data_bytes.decode(encoding_type, 'ignore')  # decode with the detected encoding

    return tag_info.rstrip("\x00")  # tag content without trailing null characters


# Extract all required tags
def tagsInfo():
    # Tag types: TIT2 is the title, TPE1 the artist, TALB the album
    tags = {"TIT2": "title", "TPE1": "artist", "TALB": "album"}

    # Note: mp3_data is the module-level variable assigned in the main block below
    tags_info = {}
    for tag_name in tags:
        tag_info = tagInfo(tag_name, mp3_data)  # tag content
        tags_info[tags[tag_name]] = tag_info.lstrip('\ufeff')  # drop a leading BOM character if present

    return tags_info  # content of all tags


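# Estimate the song duration
def mp3Duration(mp3_data):
    # Locate the Info... mark, used as the approximate start of the audio data
    music_index = mp3_data.find("496e666f0000000f")
    if music_index == -1:
        music_index = 0  # mark not found: count the whole file, which slightly overestimates
    music_size = (len(mp3_data) - music_index) // 2  # length in bytes (two hex digits per byte)
    duration = music_size * 8 / (128 * 1000)  # duration in seconds, assuming a constant 128 kbps
    minutes, seconds = divmod(int(duration), 60)
    duration_show = "%d:%02d" % (minutes, seconds)  # formatted as m:ss
    return duration_show  # formatted song duration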


# Extract the cover picture
def imgTag(mp3_data, input_file_path, input_file_name):
    # Hex data from the JPEG start (ffd8) up to and including the Info... mark
    img_match = re.search(r"ffd8.+?496e666f0000000f", mp3_data)
    if img_match is None:
        return ""  # no JPEG picture found

    img_data_bytes = bytes.fromhex(img_match[0])  # hex string to bytes
    out_file_name = input_file_path + input_file_name + '.jpg'  # saved next to the song
    with open(out_file_name, "wb") as out_file:
        out_file.write(img_data_bytes)

    return out_file_name


if __name__ == '__main__':
    log = ""
    try:
        # Banner
        print("##### This program extracts song information from mp3 files with ID3v2.3 tags #####")

        # Path of the mp3 file
        input_file_url = input("Enter the path of the file to extract: ")

        # Read the file
        mp3_info = mp3Info(input_file_url)
        if mp3_info:
            # Hex data, folder path and file name; mp3_data is also read by tagsInfo()
            mp3_data, input_file_path, input_file_name = mp3_info

            # Tag data
            tags_info = tagsInfo()

            # Song duration
            mp3_duration = mp3Duration(mp3_data)
            tags_info["duration"] = mp3_duration

            # Cover picture
            imgInfo = imgTag(mp3_data, input_file_path, input_file_name)
            tags_info["picture path"] = imgInfo
            print(tags_info)

            # Log entry
            log = "Song path: " + input_file_path + input_file_name + ".mp3 \n" + "Song info: " + str(
                tags_info) + "\n\n"
        else:
            # Log entry
            log = "This file is not supported; the program only handles mp3 files with ID3v2.3 tags\n\n"
            print("This file is not supported; the program only handles mp3 files with ID3v2.3 tags")
    except Exception as error:
        # Log entry
        log = "Unexpected error: %s\n\n" % error
        print("Unexpected error")
    finally:
        # Save the log
        with open(r"log.txt", "a", encoding="utf-8") as out_file:
            out_file.write(log)

This is my first post, and I am new to Python, so both the post and the code may be rough or contain errors. Please bear with me, and if you find any problems, point them out in the comments. Thank you.

Origin blog.csdn.net/weixin_43832353/article/details/113106027