A brief description of the structure and function of each database file on the PC side of WeChat - Multi folder

Journey of Imagination: All of my blog posts are original, hand-written work, not reposted from elsewhere. I have no team and share this purely for fellow technology enthusiasts; none of my content contains advertising. My articles are published only on CSDN, Nuggets, and my personal blog (which must be under the Fantastic Journey domain name); copies found anywhere else are pirated!




Files in the Multi folder are decoded the same way as the other databases covered earlier.

The file structure in this folder is relatively simple: there are only three types of database, FTSMSG, MediaMSG, and MSG. I say three types rather than three files because each database is split into a new file once it reaches a certain size.

FTSMSG

Readers of the "Overview" article will recognize the FTS prefix: it marks the index used for searching.

The main contents are the following two tables:

  • FTSChatMsg2_content: contains three fields
    • docid: a number incrementing from 1, effectively the ID of the current entry
    • c0content: the searchable text (keywords typed into the WeChat search box are matched against this field)
    • c1entityId: purpose not yet clear; possibly related to verification
  • FTSChatMsg2_MetaData
    • docid: corresponds to docid in the FTSChatMsg2_content table
    • msgId: corresponds to a message in the MSG database
    • entityId: corresponds to c1entityId in the FTSChatMsg2_content table
    • type: probably the message type
    • The purpose of the remaining fields is unclear

As for the 2 in the table names, my guess is that it is the version number of the current database format.

MediaMSG

All voice messages are stored here. The database contains a single table, Media, with three meaningful fields:

  • Key
  • Reserved0
  • Buf

The Reserved0 field corresponds one-to-one to the MsgSvrID of a message in the MSG database.

The third field, Buf, holds the binary voice data. Inspecting the header shows that the data is stored in SILK format, a speech codec developed by Skype (now part of Microsoft) and later open sourced; search for it if you want the details.
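The header check mentioned above can be sketched in a few lines. This is only an illustration: the magic string "#!SILK_V3" is the usual SILK v3 file marker, while the tolerance for one extra leading byte is an assumption based on commonly reported WeChat voice blobs, not something this article verifies. The function name looks_like_silk is hypothetical.

```python
SILK_MAGIC = b"#!SILK_V3"


def looks_like_silk(buf: bytes) -> bool:
    """Return True if the SILK v3 magic string sits at the start of buf.

    One leading byte is tolerated, on the assumption that WeChat may
    prepend a single byte before the standard header.
    """
    n = len(SILK_MAGIC)
    return buf[:n] == SILK_MAGIC or buf[1:1 + n] == SILK_MAGIC
```

Calling looks_like_silk on the Buf value of a row before writing it to disk is a cheap sanity check that the blob is what the converter expects.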

Here is the code to export the data in the Buf field to a file:

import sqlite3


def writeTofile(data, filename):
    with open(filename, 'wb') as file:
        file.write(data)
    print("Stored blob data into: ", filename, "\n")


def readBlobData(key):
    # Initialise first so the finally block is safe even if connect() fails
    sqliteConnection = None
    try:
        sqliteConnection = sqlite3.connect('dbs/decoded_MediaMSG0.db')
        cursor = sqliteConnection.cursor()
        print("Connected to SQLite")

        sql_fetch_blob_query = """SELECT Key, Reserved0, Buf from Media where Key = ?"""
        cursor.execute(sql_fetch_blob_query, (key, ))
        record = cursor.fetchall()
        for row in record:
            print("Key = ", row[0], "Reserved0 = ", row[1])
            file = row[2]

            print("Storing on disk \n")
            path = f'{row[0]}.silk'
            writeTofile(file, path)

        cursor.close()

    except sqlite3.Error as error:
        print("Failed to read blob data from sqlite table", error)
    finally:
        if sqliteConnection:
            sqliteConnection.close()
            print("sqlite connection is closed")


readBlobData(1099511630953)

To look up a voice file by a message's MsgSvrID from the MSG database, change the SQL query to filter on Reserved0 and run it against every MediaMSG database in turn.
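A minimal sketch of that lookup follows. It assumes the split databases sit in one directory and follow the decoded_MediaMSG<N>.db naming used in the export code above; the function name find_voice_by_msg_svr_id and the pattern parameter are hypothetical.

```python
import sqlite3
from pathlib import Path


def find_voice_by_msg_svr_id(db_dir, msg_svr_id, pattern="decoded_MediaMSG*.db"):
    """Return the Buf blob whose Reserved0 equals msg_svr_id, or None.

    Every database file matching pattern inside db_dir is tried in name order.
    """
    for db_path in sorted(Path(db_dir).glob(pattern)):
        conn = sqlite3.connect(str(db_path))
        try:
            row = conn.execute(
                "SELECT Buf FROM Media WHERE Reserved0 = ?", (msg_svr_id,)
            ).fetchone()
        finally:
            conn.close()
        if row is not None:
            return row[0]
    return None
```

The returned bytes can be written to a .silk file exactly as in the export code above.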

The following code converts the silk file to wav. It decodes to pcm first and then wraps the pcm data in a wav container; the wav sample rate is a value I found by testing.

KEY = 1099511630953

import wave
from pathlib import Path

import pilk


def pcm2wav(pcm_file, wav_file, channels=1, bits=16, sample_rate=24000):
    pcmf = open(pcm_file, 'rb')
    pcmdata = pcmf.read()
    pcmf.close()

    if bits % 8 != 0:
        raise ValueError("bits % 8 must == 0. now bits:" + str(bits))

    wavfile = wave.open(wav_file, 'wb')
    wavfile.setnchannels(channels)
    wavfile.setsampwidth(bits // 8)
    wavfile.setframerate(sample_rate)
    wavfile.writeframes(pcmdata)
    wavfile.close()


duration = pilk.decode(f"{KEY}.silk", f"{KEY}.pcm")
# print("Voice duration:", duration)
Path(f"{KEY}.silk").unlink()

pcm2wav(f"{KEY}.pcm", f"{KEY}.wav")
Path(f"{KEY}.pcm").unlink()

I will not explain these two snippets in detail; read through them yourself.

MSG

We have finally reached the most important part of this article, or rather of the whole project: the core chat history database!

The two main tables in it are MSG and Name2ID.

Name2ID has only one column, whose values are either a WeChat ID or a group chat ID in the form <group chat ID>@chatroom. Its purpose is to let certain fields in MSG refer to these names. Although the table has no ID column, WeChat in effect uses the row number (counting from 1) as the ID.
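The implicit ID can be recovered with SQLite's built-in rowid. The sketch below assumes Name2ID is an ordinary rowid table whose rows were never deleted, so that rowid matches the row number; it reads the single column by position, so no column name is assumed. The function name load_name2id is hypothetical.

```python
import sqlite3


def load_name2id(conn):
    """Map the implicit row number (starting at 1) to the account name.

    Assumes rowid equals the row number, i.e. no rows were ever deleted.
    """
    rows = conn.execute("SELECT rowid, * FROM Name2ID ORDER BY rowid").fetchall()
    return {row[0]: row[1] for row in rows}
```

With this mapping, a TalkerId taken from the MSG table can be turned back into a WeChat ID or a chat room name.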

The rest of this section covers the MSG table (bold text marks items I still need to fill in; it is a reminder to myself, not a sign of importance):

  • localId: literally the local ID of the message; I have not yet found what it is used for
  • TalkerId: ID of the room the message belongs to, corresponding to Name2ID (this is a guess; see the StrTalker field for the reasoning)
  • MsgSvrID: Svr is presumably short for Server, so this would be the message ID stored on the server side
  • Type: message type; see Table 1 for details
  • SubType: message subtype; I have not yet seen what it is actually used for
  • IsSender: whether you sent the message yourself, i.e. whether it is shown on the left or the right of the conversation page; the value is 0 or 1
  • CreateTime: creation time of the message as a timestamp in seconds. Further experiments are needed to confirm exactly which moment it records. My guess is:
    • Messages sent from this computer: the moment the send button was clicked
    • Messages sent from your other devices or received from other users: the moment the message arrived locally from the server
  • Sequence: looks like a millisecond timestamp but is not one. It is the CreateTime value followed by three extra digits, usually 000; when two messages share the same CreateTime, the last three digits increase in turn. Whether uniqueness holds within one session or across all sessions still needs to be confirmed.
  • StatusEx, FlagEx, Status, MsgServerSeq, MsgSequence: I have not extracted any useful information from these five fields yet
  • StrTalker: the WeChat ID of the message sender. This is why the TalkerId field above most likely refers to the room the message is in rather than the sender. It may also turn out to carry the same information as TalkerId; this needs to be confirmed.
  • StrContent: data in string form. For most message types other than text, this field holds a piece of XML describing the message.
  • DisplayContent: for "pat" messages, stores the account information of the person patting and the person being patted
  • Reserved0~6: no useful information extracted yet; some of these fields are always empty
  • CompressContent: literally compressed data; in practice it holds data that WeChat chose not to put in StrContent (for example, text messages that quote another message). These messages can only be told apart by their binary content, and I do not know the format well enough to decode it.
  • BytesExtra: extra data in binary format
  • BytesTrans: so far always empty

Much of the above is guesswork, and many of the items marked for further testing are still open. After decoding, the databases cannot be refreshed in real time with newly received messages, and I cannot tell in advance which split database a new message will land in, so experimenting is extremely slow.
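The guessed relationship between Sequence and CreateTime is simple arithmetic and can be sketched as follows. This follows the description in the field list above and is no more certain than it; treating CreateTime as seconds since the Unix epoch is a further assumption. The function names are hypothetical.

```python
from datetime import datetime, timezone


def split_sequence(sequence):
    """Split Sequence into (CreateTime in seconds, three-digit counter)."""
    return divmod(sequence, 1000)


def create_time_to_utc(create_time):
    """Convert a CreateTime value in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(create_time, tz=timezone.utc)
```

For example, a Sequence of 1681000000002 would split into a CreateTime of 1681000000 and a counter of 2, meaning the third message recorded in that second.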

Table 1: MSG.Type field values and their meanings (may also apply to fields that carry message type information in other databases)

Type   SubType  Meaning
1      0        Text
3      0        Picture
34     0        Voice
43     0        Video
47     0        Animated sticker (stickers made by third parties)
49     1        Text-like message that is not quite a plain text message; so far I have only seen it for an Alibaba Cloud Disk sign-up invitation. Probably the same case as subtype 57
49     5        Card-style link; title, introduction, etc. are in CompressContent, and the locally cached cover path is in BytesExtra
49     6        File; the file name and download link are in CompressContent (but I have not decoded them), and the local save path is in BytesExtra
49     8        GIF sticker uploaded by the user; CompressContent holds a CDN link, but it does not seem to be directly downloadable
49     19       Merged and forwarded chat records; the detailed records are in CompressContent, and caches of pictures, videos, etc. are in BytesExtra
49     33/36    Shared mini program; the card information is in CompressContent, and the cover cache location is in BytesExtra
49     57       Text message with a quote (StrContent is empty for this type; both the sent and the quoted content are in CompressContent)
49     63       Channels live stream or replay, etc.
49     87       Group announcement
49     88       Channels live stream or replay, etc.
49     2000     Transfer message (including sending, receiving, and voluntary refunds)
49     2003     Gifted red envelope cover
10000  0        System notification (the gray text that appears in the center)
10000  4        Pat
10000  8000     System notification (in particular, inviting someone into a group chat)
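When processing exported rows, Table 1 is convenient to carry around as a lookup keyed by (Type, SubType). The sketch below simply restates the table; the names MESSAGE_TYPES and describe_message are hypothetical, and any pair not in the table is reported as unknown rather than guessed.

```python
MESSAGE_TYPES = {
    (1, 0): "text",
    (3, 0): "picture",
    (34, 0): "voice",
    (43, 0): "video",
    (47, 0): "animated sticker",
    (49, 1): "text-like message",
    (49, 5): "card-style link",
    (49, 6): "file",
    (49, 8): "user-uploaded GIF sticker",
    (49, 19): "merged and forwarded chat records",
    (49, 33): "shared mini program",
    (49, 36): "shared mini program",
    (49, 57): "text message with quote",
    (49, 63): "channels live stream or replay",
    (49, 87): "group announcement",
    (49, 88): "channels live stream or replay",
    (49, 2000): "transfer message",
    (49, 2003): "gifted red envelope cover",
    (10000, 0): "system notification",
    (10000, 4): "pat",
    (10000, 8000): "system notification (group invitation)",
}


def describe_message(msg_type, sub_type):
    """Return the Table 1 description for a (Type, SubType) pair."""
    key = (msg_type, sub_type)
    return MESSAGE_TYPES.get(key, f"unknown ({msg_type}, {sub_type})")
```

Reporting unknown pairs explicitly makes it easy to spot message types that the table does not cover yet.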


Origin blog.csdn.net/weixin_44495599/article/details/130163338