String Compression Algorithms for Limited Character Sets

Overview

During development, stack traces from production are often uploaded so that online problems can be analyzed and fixed. Compressing and encrypting those stack traces is therefore essential. Encryption: you can use the AES symmetric encryption algorithm. Compression: you can rely on the compression that protobuf provides natively when the string is uploaded.

However, to save more traffic and improve transmission efficiency, the stack data can be compressed once more before it is uploaded. This article introduces a string compression algorithm that the author worked out independently; as a side effect, it also obscures the content, which gives a mild encryption-like effect.

Algorithm Introduction

Usage scenario: compressing strings that are drawn from a limited character set.

One example is the fully qualified name of a Java method. Such a name is made up of uppercase and lowercase English letters, digits, and special characters. Well-formed class and method names are expected to be self-explanatory, so they consist mostly of words. According to statistics, more than 99% of fully qualified method names are composed of uppercase and lowercase English letters.

Algorithm implementation

A brief description of the compression principle
The spare bits of a char are used to store valid data. For example, map a ~ z to the numbers 1 to 26 and divide the 16-bit char into three 5-bit groups (high, middle, and low). Each group stores one number, and each number represents one character, so a single char carries three characters.
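The packing idea can be sketched as a small round trip. This is a minimal illustration that assumes only the a ~ z to 1 ~ 26 mapping described above; the class and method names (CharPacking, pack, unpack) are hypothetical and do not come from the author's code.

```java
class CharPacking {
    // Map 'a'..'z' to 1..26, as described in the text.
    static int code(char c) {
        if (c < 'a' || c > 'z') {
            throw new IllegalArgumentException("only a-z is supported: " + c);
        }
        return c - 'a' + 1;
    }

    // Inverse of code(): 1..26 back to 'a'..'z'.
    static char letter(int code) {
        return (char) ('a' + code - 1);
    }

    // Pack three letters into the low 15 bits of one char, 5 bits each.
    static char pack(char high, char mid, char low) {
        return (char) ((code(high) << 10) | (code(mid) << 5) | code(low));
    }

    // Recover the three letters from a packed char.
    static String unpack(char packed) {
        int high = (packed >> 10) & 0x1F;
        int mid = (packed >> 5) & 0x1F;
        int low = packed & 0x1F;
        return "" + letter(high) + letter(mid) + letter(low);
    }
}
```

Three 5-bit groups use 15 of the 16 bits, so one bit of each char stays unused in this sketch.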

Create a string header structure: Head

In Java code, uppercase letters make up a relatively small proportion of a fully qualified name. The uppercase letters are therefore recorded separately, in extra characters placed in front of the string (the Head). Because the string is finite and immutable, the relative positions of its characters are fixed, so each uppercase letter can be recorded as its distance from the previous one. The implementation is as follows:

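public char[] build(String s) {
    ...
    for (int i = 0; i < len; i++) {
        c = s.charAt(i);
        b = Character.isUpperCase(c);
        // An uppercase letter or the FILL character marks a recorded position.
        if (b || c == FILL) {
            // Track the largest gap seen so far.
            if (i - lastIndex >= maxDistance) {
                maxDistance = i - lastIndex;
            }
            // Store the distance from the previous recorded position.
            upCharIndex.add(i - lastIndex);
            lastIndex = i;
        }
        if (b) upCharCount++;
    }
    ...
    return handleHead(type);
}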

The first step before compression: at the beginning of the string, record the positions of the uppercase letters, stored as the distance from each uppercase letter to the previous one. (The dot character is treated as an uppercase letter.)
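To make the distance recording concrete, here is a minimal sketch of the same idea on a sample name. It is not the author's build method: the class name UpperCaseDistances is hypothetical, the dot is hard-coded as the extra marker character, and lastIndex is assumed to start at 0 because the original initialization is elided.

```java
import java.util.ArrayList;
import java.util.List;

class UpperCaseDistances {
    // Returns the gap between each marked position (uppercase letter or dot)
    // and the previous marked position.
    static List<Integer> distances(String s) {
        List<Integer> result = new ArrayList<>();
        int lastIndex = 0; // assumption: counting starts from index 0
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isUpperCase(c) || c == '.') {
                result.add(i - lastIndex);
                lastIndex = i;
            }
        }
        return result;
    }

    // Rebuilds the absolute positions by summing the gaps.
    static List<Integer> positions(List<Integer> distances) {
        List<Integer> result = new ArrayList<>();
        int index = 0;
        for (int d : distances) {
            index += d;
            result.add(index);
        }
        return result;
    }
}
```

For "com.example.MyClass.doWork" the marked positions are 3, 11, 12, 14, 19 and 22, so the stored distances are 3, 8, 1, 2, 5 and 3. Storing gaps instead of absolute positions keeps the numbers small, which is what allows a small step size.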

private char[] handleHead(int type) {
    ...
    int k, j;
    // Record the uppercase-letter positions in the Head chars.
    for (int i = 0; i < chars.length; i++) {
        if (i == 0) {
            // First char: the top 4 bits are reserved, so 12 bits hold positions.
            for (j = 0, k = 1; j < ch1; j++, k++) {
                ch = bitToLeft(ch, upCharIndex.get(j), 12 - (k * stepDistance));
            }
            chars[i] = ch;
        } else {
            // Later chars: start from an all-zero char and use all 16 bits.
            char emptyCh = 0;
            int start = (i - 1) * sizeOfChar + ch1;
            for (j = start, k = 1; j < start + sizeOfChar; j++, k++) {
                // Stop once every recorded position has been written.
                if (j == upCharIndex.size())
                    break;
                emptyCh = bitToLeft(emptyCh, upCharIndex.get(j), 16 - (k * stepDistance));
            }
            chars[i] = emptyCh;
        }
    }
    return chars;
}

The minimum length of the Head is 1 Char, that is, 16 bits. The upper 2 bits store the step-size type, and the next 2 bits record the actual Head length.

Head length: the first Char of the Head always records the step size and the Head length. Currently the longest Head is 3 + 1 Chars, and longer Heads can be supported by extending the step-size algorithm. Extension methods: getTypeBySize, getSizeByType
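The bit layout of the first Head char can be sketched as follows. This is an illustration under assumptions, not the author's code: the class name HeadBits is hypothetical, the step type is treated as an opaque 2-bit code, and the 2-bit length field is assumed to count the extra Chars after the first one (0 to 3), which matches the "3 + 1" limit mentioned above.

```java
class HeadBits {
    // Layout of the first Head char (16 bits):
    // [2 bits step type][2 bits extra Head chars][12 bits position payload]
    static char firstChar(int stepType, int extraChars, int payload12) {
        if (stepType < 0 || stepType > 3
                || extraChars < 0 || extraChars > 3
                || payload12 < 0 || payload12 > 0xFFF) {
            throw new IllegalArgumentException("field out of range");
        }
        return (char) ((stepType << 14) | (extraChars << 12) | payload12);
    }

    // Upper 2 bits: step type code.
    static int stepType(char first) {
        return (first >> 14) & 0x3;
    }

    // Next 2 bits: number of Head chars that follow the first one (assumed meaning).
    static int extraChars(char first) {
        return (first >> 12) & 0x3;
    }

    // Low 12 bits: packed uppercase-letter distances.
    static int payload(char first) {
        return first & 0xFFF;
    }
}
```

With this layout, a first char whose step code is 3 can land in the UTF-16 surrogate range (0xD800 to 0xDFFF), which may matter if the result is later converted to bytes with a charset encoder.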

When the positions of uppercase letters are stored, they are packed according to the step size. For example, a step size of 3 means that each uppercase-letter position occupies 3 bits.
The length of the Head depends on how many positions have to be stored. For example, to store 10 positions with a step size of 3: a 16-bit Char holds 16 / 3 = 5 positions (integer division), so two Chars are required.

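Step size: the step size is variable. The algorithm design provides the following step types (statistically, the longest English word has 45 characters):

    STEP_0: there are no uppercase letters
    STEP_3: the distance between uppercase letters is in (0, 8); the step size is 3
    STEP_15: the distance between uppercase letters is in [8, 16); the step size is 4
    STEP_OVER_15: the distance between uppercase letters is in [16, 63); the step size is 6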
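Choosing a step type from the largest recorded distance can be sketched like this. The enum constants reuse the names from the list above, but the class StepSelector, the choose and positionsPerChar methods, and the decision to reject distances of 63 or more are assumptions made for illustration; the author's getTypeBySize and getSizeByType are not shown in the article.

```java
class StepSelector {
    enum Step {
        STEP_0(0),        // no uppercase letters
        STEP_3(3),        // distances below 8 fit in 3 bits
        STEP_15(4),       // distances below 16 fit in 4 bits
        STEP_OVER_15(6);  // distances below 63 fit in 6 bits

        final int bits;

        Step(int bits) {
            this.bits = bits;
        }
    }

    // Pick the smallest step type that can hold the largest distance.
    static Step choose(int upperCaseCount, int maxDistance) {
        if (upperCaseCount == 0) return Step.STEP_0;
        if (maxDistance < 8) return Step.STEP_3;
        if (maxDistance < 16) return Step.STEP_15;
        if (maxDistance < 63) return Step.STEP_OVER_15;
        throw new IllegalArgumentException("distance too large: " + maxDistance);
    }

    // How many distances fit into one full 16-bit char for a given step type.
    static int positionsPerChar(Step step) {
        return step.bits == 0 ? 0 : 16 / step.bits;
    }
}
```

The step is chosen once per string from the maximum distance, so a single long gap forces the wider encoding for every position in that string.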

Create compressed string content: Content

Content compression stores one character in each of the high, middle, and low 5-bit groups of a Char. The implementation is in FormatUtil.ContentBuilder:

Padding: string lengths are not always multiples of 3. To preserve the whole original string, a number of fill characters is appended before the string is split, so that it divides evenly into groups of 3 characters.

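public String handleString(String s) {
    int f;
    // Pad with FILL until the length is a multiple of 3.
    if ((f = s.length() % 3) != 0) {
        StringBuilder sBuilder = new StringBuilder(s);
        for (f = 3 - f; f > 0; f--)
            sBuilder.append(FILL);
        s = sBuilder.toString();
    }
    // Case is dropped here; the uppercase positions are kept in the Head.
    return s.toLowerCase();
}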

Splitting and replacement: after padding, the string is split into groups of three characters. Digits and special characters have no entry in the mapping file, so whenever one appears it is replaced by "MASK".

public short buildShort(char high, char mid, char low) {
    short b = 0;
    // Place each 5-bit code in its own group: bits 10-14, 5-9, and 0-4.
    b |= getShortFromMapping(high) << 10;
    b |= getShortFromMapping(mid) << 5;
    b |= getShortFromMapping(low);
    return b;
}

public short getShortFromMapping(char ch) {
    // Characters without a mapping entry fall back to the MASK code.
    if (mapping.containsKey(ch))
        return mapping.get(ch);
    return mapping.get(MASK);
}
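Putting padding, mapping, and packing together, the Content step and its inverse can be sketched as below. This is a simplified stand-in, not FormatUtil.ContentBuilder: the class name ContentSketch, the concrete FILL and MASK characters, and the codes 27 and 28 are placeholders chosen for this example, and decompress takes the original length as a parameter to drop the padding.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class ContentSketch {
    static final char FILL = '.'; // placeholder padding character
    static final char MASK = '*'; // placeholder for unmapped characters
    static final Map<Character, Integer> MAPPING = new HashMap<>();
    static final char[] REVERSE = new char[32];

    static {
        for (char c = 'a'; c <= 'z'; c++) put(c, c - 'a' + 1);
        put(FILL, 27);
        put(MASK, 28);
    }

    static void put(char c, int code) {
        MAPPING.put(c, code);
        REVERSE[code] = c;
    }

    // Unmapped characters (digits, symbols) fall back to the MASK code.
    static int codeOf(char c) {
        Integer code = MAPPING.get(c);
        return code != null ? code : MAPPING.get(MASK);
    }

    // Lowercase, pad to a multiple of 3, then pack 3 characters per char.
    static char[] compress(String s) {
        StringBuilder sb = new StringBuilder(s.toLowerCase(Locale.ROOT));
        while (sb.length() % 3 != 0) sb.append(FILL);
        char[] out = new char[sb.length() / 3];
        for (int i = 0; i < out.length; i++) {
            int high = codeOf(sb.charAt(3 * i));
            int mid = codeOf(sb.charAt(3 * i + 1));
            int low = codeOf(sb.charAt(3 * i + 2));
            out[i] = (char) ((high << 10) | (mid << 5) | low);
        }
        return out;
    }

    // Unpack each char into 3 characters and cut off the padding.
    static String decompress(char[] packed, int originalLength) {
        StringBuilder sb = new StringBuilder();
        for (char p : packed) {
            sb.append(REVERSE[(p >> 10) & 0x1F]);
            sb.append(REVERSE[(p >> 5) & 0x1F]);
            sb.append(REVERSE[p & 0x1F]);
        }
        return sb.substring(0, originalLength);
    }
}
```

In this sketch "getName" compresses to 3 chars and decompresses to "getname": the case is lost in the Content and has to be restored from the Head. Masked characters such as digits cannot be recovered at all, so "a1b" comes back as "a*b".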

Create a compressed string

Head + Content = the compressed string.

Summary

When the algorithm was first conceived, the theoretical compression ratio was 66%, because three chars are stored in one char. Measured by the final package size, however, the overall compression ratio is only about 50%. This happens for the following reasons:

    String lengths are not always multiples of 3, so extra fill characters are added.
    The compressed characters are not valid ASCII codes. When Java encodes and decodes the character set underneath, they are treated as Chinese characters, so each character takes up the size of two characters.
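The second reason can be checked by measuring how many bytes a packed char takes once it is encoded. The article does not say which charset the upload path uses, so this sketch simply measures two common ones; the class name EncodedSize and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class EncodedSize {
    // Bytes needed for a single char in UTF-8.
    static int utf8Bytes(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed for a single char in UTF-16 (big-endian, no byte order mark).
    static int utf16Bytes(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Bytes needed for a plain ASCII string.
    static int asciiBytes(String s) {
        return s.getBytes(StandardCharsets.US_ASCII).length;
    }
}
```

Three ASCII letters take 3 bytes. A packed char always takes 2 bytes in UTF-16, while in UTF-8 any packed value of 0x800 or above takes 3 bytes, the same as the three letters it replaces. The real saving therefore depends heavily on the encoding used on the wire.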

Origin blog.csdn.net/qq_24252589/article/details/131444511