Self-developed string compression algorithm

Overview

During development, stack traces from production are often reported so that online problems can be analyzed and fixed. Compressing and encrypting those stack traces is therefore essential. For encryption, a symmetric algorithm such as AES can be used. For compression, strings can rely on the compression that protobuf provides natively when uploading.

However, to save traffic and improve transmission efficiency, the data can also be compressed once before the stack trace is uploaded. The rest of this article introduces a string compression algorithm that the author worked out independently and that also provides a degree of encryption as a side effect.

Algorithm introduction

This algorithm targets one scenario: compressing strings drawn from a limited character set.

Take the fully qualified name of a Java method as an example. Such a name is made up of uppercase and lowercase English letters, digits, and special characters. In practice, a well-formed class or method name is expected to explain itself, so it consists mostly of letters. According to statistics, more than 99% of fully qualified method names are made up of uppercase and lowercase English letters.

Algorithm implementation

Brief introduction to compression principle

Use the spare bits of a char to store useful data. For example, map a ~ z to the numbers 1 to 26, split the 16-bit char type into high, middle, and low groups of 5 bits each, and store one number (representing one character) in each group.
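The idea can be sketched as a small round trip. This is only an illustration under assumptions: the class and method names below are hypothetical, and the sketch handles only a ~ z, whereas the article's real mapping table is not shown in full.

```java
// Illustrative sketch only; class and method names are hypothetical.
class TriplePack {
    // Map 'a'..'z' to 1..26 so that each letter fits in 5 bits.
    static int toCode(char c) {
        if (c < 'a' || c > 'z') {
            throw new IllegalArgumentException("only a-z is supported in this sketch: " + c);
        }
        return c - 'a' + 1;
    }

    // Inverse of toCode: 1..26 back to 'a'..'z'.
    static char toLetter(int code) {
        return (char) ('a' + code - 1);
    }

    // Store three 5-bit codes in the low 15 bits of one 16-bit char.
    static char pack(char high, char mid, char low) {
        return (char) ((toCode(high) << 10) | (toCode(mid) << 5) | toCode(low));
    }

    // Read the three 5-bit groups back out and turn them into letters.
    static String unpack(char packed) {
        char high = toLetter((packed >> 10) & 0x1F);
        char mid = toLetter((packed >> 5) & 0x1F);
        char low = toLetter(packed & 0x1F);
        return "" + high + mid + low;
    }

    public static void main(String[] args) {
        char packed = pack('a', 'b', 'c');
        System.out.println((int) packed);   // numeric value of the packed char
        System.out.println(unpack(packed)); // the three original letters
    }
}
```

Three letters occupy 15 of the 16 bits, which leaves the top bit of each content char unused.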

Create a string header structure: Head

In Java code, uppercase letters make up a relatively small share of a fully qualified name. The uppercase letters are therefore recorded in supplementary characters placed in front of the string. Because the string is finite and immutable, the relative positions of its characters are fixed. The algorithm is implemented as follows:

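public char[] build(String s) {
    ...
    for (int i = 0; i < len; i++) {
        c = s.charAt(i);
        b = Character.isUpperCase(c);
        // Uppercase letters and FILL characters both count as markers.
        if (b || c == FILL) {
            // Track the largest gap between two consecutive markers.
            if (i - lastIndex >= maxDistance) {
                maxDistance = i - lastIndex;
            }
            // Store the distance to the previous marker, not the absolute index.
            upCharIndex.add(i - lastIndex);
            lastIndex = i;
        }
        if (b) upCharCount++;
    }
    ...
    return handleHead(type);
}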


The first step before compression: record, at the beginning of the string, the position of each uppercase letter as its distance from the previous one. (The period is treated as an uppercase letter.)
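Because the article's build method is shown with parts elided, the distance bookkeeping can be tried out with a self-contained sketch. The class name, the choice of '.' for FILL, and the starting value 0 for lastIndex are all assumptions made for this illustration, not details confirmed by the original code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the article's actual class.
class UpperCaseGaps {
    // Assumption: '.' plays the role of FILL in the article's code.
    static final char FILL = '.';

    // Returns the distance from each marker (uppercase letter or FILL)
    // to the previous marker; the first distance is measured from index 0.
    static List<Integer> gaps(String s) {
        List<Integer> result = new ArrayList<>();
        int lastIndex = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isUpperCase(c) || c == FILL) {
                result.add(i - lastIndex);
                lastIndex = i;
            }
        }
        return result;
    }

    // Rebuilds the absolute marker positions from the recorded distances.
    static List<Integer> positions(List<Integer> gaps) {
        List<Integer> result = new ArrayList<>();
        int index = 0;
        for (int gap : gaps) {
            index += gap;
            result.add(index);
        }
        return result;
    }
}
```

For the input "com.Foo.barBaz" the markers sit at indexes 3, 4, 7, and 11, so gaps returns [3, 1, 3, 4] and positions turns that list back into [3, 4, 7, 11]. Storing distances instead of absolute indexes keeps each value small, which is what lets a few bits per marker suffice.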


private char[] handleHead(int type) {
        ...
    int k, j;
    // Record the positions of the uppercase letters in the chars
    for (int i = 0; i < chars.length; i++) {
        if (i == 0) {
            for (j = 0, k = 1; j < ch1; j++, k++) {
                ch = bitToLeft(ch, upCharIndex.get(j), 12 - (k * stepDistance));
            }
            chars[i] = ch;
        } else {
            char emptyCh = FILL;
            emptyCh &= 0;
            int start = (i - 1) * sizeOfChar + ch1;
            for (j = start, k = 1; j < start + sizeOfChar; j++, k++) {
                if (j == upCharIndex.size())
                    break;
                emptyCh = bitToLeft(emptyCh, upCharIndex.get(j), 16 - (k * stepDistance));
            }
            chars[i] = emptyCh;
        }
    }
    return chars;
}


The minimum length of Head is 1 Char, that is, 16 bits. The step size is stored in the upper 2 bits of those 16 bits, and the next 2 bits record the actual Head length.
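A minimal sketch of this bit layout follows. The article does not show the body of bitToLeft, so the version here is a guess at its behaviour (OR a value into the char at a bit offset). The class name, the accessor methods, and the way the two 2-bit fields are encoded are also assumptions.

```java
// Illustrative sketch only; names and field encodings are assumptions.
class HeadBits {
    // Assumed behaviour of the article's bitToLeft helper:
    // OR `value` into `ch` at bit offset `shift`.
    static char bitToLeft(char ch, int value, int shift) {
        return (char) (ch | (value << shift));
    }

    // Builds the first head char: 2 bits for the step type, 2 bits for the
    // head length, then as many distances as fit in the remaining 12 bits.
    // Each distance is assumed to fit in stepBits bits.
    static char firstHeadChar(int stepType, int headLength, int[] distances, int stepBits) {
        char ch = 0;
        ch = bitToLeft(ch, stepType & 0x3, 14);
        ch = bitToLeft(ch, headLength & 0x3, 12);
        for (int j = 0, k = 1; j < distances.length && k * stepBits <= 12; j++, k++) {
            ch = bitToLeft(ch, distances[j], 12 - k * stepBits);
        }
        return ch;
    }

    static int stepType(char head) {
        return (head >> 14) & 0x3;
    }

    static int headLength(char head) {
        return (head >> 12) & 0x3;
    }

    // Reads the k-th distance (k starts at 1) stored in the first head char.
    static int distance(char head, int k, int stepBits) {
        return (head >> (12 - k * stepBits)) & ((1 << stepBits) - 1);
    }
}
```

With a 3-bit step, the 12 remaining bits of the first char hold four distances, which matches the 12 - (k * stepDistance) offsets used in handleHead.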

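Head length: the minimum length of Head is 1 Char, which records the step size and the Head length. Currently the fill length is at most 3+1, and the Head length can be extended through the step algorithm. Extension methods: getTypeBySize, getSizeByType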

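  • Uppercase-letter positions are filled in according to the step size. For example, a step size of 3 means that every 3 bits store one uppercase-letter position.
  • The length of Head depends on how many steps are filled. For example, to fill 10 positions with a step size of 3: ignoring the bits reserved for the step size and Head length, one Char holds 16 / 3 = 5 positions (integer division), so two Chars are needed.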

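Step size: the step size is variable. The algorithm design provides the following step types (the longest English word is reported to have 45 characters):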

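  • STEP_0: there are no uppercase letters
  • STEP_3: the distance between uppercase letters is in (0,8), step size 3
  • STEP_15: the distance between uppercase letters is in [8,16), step size 4
  • STEP_OVER_15: the distance between uppercase letters is in [16,63), step size 6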
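The selection rule in this list can be sketched as a lookup on the largest recorded distance. The enum, its bits field, and the choose method are hypothetical names. The sketch also assumes that a distance of 0 (an uppercase letter at the very start) falls under STEP_3, which the list does not state.

```java
// Illustrative sketch only; the enum and method are hypothetical.
class StepChooser {
    enum Step {
        STEP_0(0), STEP_3(3), STEP_15(4), STEP_OVER_15(6);

        final int bits; // number of bits used to store one distance

        Step(int bits) {
            this.bits = bits;
        }
    }

    // hasMarker: whether any uppercase letter was recorded.
    // maxDistance: the largest distance between two consecutive markers.
    static Step choose(boolean hasMarker, int maxDistance) {
        if (!hasMarker) {
            return Step.STEP_0;
        }
        if (maxDistance < 8) {
            return Step.STEP_3;       // fits in 3 bits
        }
        if (maxDistance < 16) {
            return Step.STEP_15;      // fits in 4 bits
        }
        if (maxDistance < 63) {
            return Step.STEP_OVER_15; // upper bound taken from the list above
        }
        throw new IllegalArgumentException("distance too large for this sketch: " + maxDistance);
    }
}
```

Choosing the smallest step that still covers the largest distance keeps Head short for typical names, where uppercase letters and periods appear every few characters.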

Build the compressed string content: Content

Content compression stores one character in each of the high, middle, and low groups of a single Char. See FormatUtil.ContentBuilder for the implementation.

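Padding: string lengths are not always multiples of 3. To preserve the original string in full, it is padded with fill characters before splitting, so that it can be divided into groups of 3 characters.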


public String handleString(String s) {
    int f;
    if ((f = s.length() % 3) != 0) {
        StringBuilder sBuilder = new StringBuilder(s);
        for (f = 3 - f; f > 0; f--)
            sBuilder.append(FILL);
        s = sBuilder.toString();
    }
    return s.toLowerCase();
}


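Split and replace: after padding, the string is split into groups of three characters. Digits and special characters have no entry in the mapping file, so whenever one appears it is replaced by "MASK".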

public short buildShort(char high, char mid, char low) {
    short b = 0;

    b |= getShortFromMapping(high) << 10;
    b |= getShortFromMapping(mid) << 5;
    b |= getShortFromMapping(low);
    return b;
}

public short getShortFromMapping(char ch) {
    if (mapping.containsKey(ch))
        return mapping.get(ch);
    return mapping.get(MASK);
}
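To see the split-and-replace step end to end, here is a self-contained sketch. The mapping contents are assumptions: the article's mapping file is not shown, so the codes 27 for FILL and 28 for MASK, the characters chosen for FILL and MASK, and the class name are all made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; the mapping values are assumptions.
class ContentSketch {
    static final char FILL = '.'; // assumed fill character
    static final char MASK = '*'; // assumed placeholder for unmapped characters
    static final Map<Character, Short> MAPPING = new HashMap<>();

    static {
        for (char c = 'a'; c <= 'z'; c++) {
            MAPPING.put(c, (short) (c - 'a' + 1));
        }
        MAPPING.put(FILL, (short) 27);
        MAPPING.put(MASK, (short) 28);
    }

    // Looks up the 5-bit code, falling back to the MASK code.
    static short code(char ch) {
        Short value = MAPPING.get(ch);
        return value != null ? value : MAPPING.get(MASK);
    }

    // Expects a lowercased string whose length is a multiple of 3.
    static char[] encode(String padded) {
        char[] out = new char[padded.length() / 3];
        for (int i = 0; i < out.length; i++) {
            int base = i * 3;
            out[i] = (char) ((code(padded.charAt(base)) << 10)
                    | (code(padded.charAt(base + 1)) << 5)
                    | code(padded.charAt(base + 2)));
        }
        return out;
    }
}
```

In this sketch "a1c" and "a2c" encode to the same char, because every unmapped character collapses to the MASK code. Digits and special characters therefore cannot be restored exactly when decompressing.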

Build the final compressed string

Head + content = the compressed string.

Summary

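Early in the design, the theoretical compression ratio was 66%, since three Chars are stored in one Char. Measured against the final package size, however, the overall compression ratio is only about 50%. The reasons are as follows: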

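  • String lengths are not always multiples of 3, so extra fill characters are added.
  • The compressed characters are not valid ASCII codes. When Java encodes and decodes them, it treats them as Chinese characters, so each one takes up the size of two characters.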
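The second point can be checked by counting encoded bytes. The sketch below assumes the string is serialized as UTF-8, which the article does not state, and uses hypothetical helper names. In UTF-8, code points from U+0080 to U+07FF take 2 bytes and code points from U+0800 to U+FFFF take 3 bytes, so the saving depends on how large the packed value is.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; assumes UTF-8 serialization.
class EncodedSize {
    // Packs three letters from a-z into one char, 5 bits each.
    static char pack(char high, char mid, char low) {
        return (char) (((high - 'a' + 1) << 10) | ((mid - 'a' + 1) << 5) | (low - 'a' + 1));
    }

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Bytes("abc"));                                // 3 ASCII bytes
        System.out.println(utf8Bytes(String.valueOf(pack('a', 'b', 'c')))); // packed value 0x0443
        System.out.println(utf8Bytes(String.valueOf(pack('x', 'y', 'z')))); // packed value 0x633A
    }
}
```

Under this assumption, "abc" shrinks from 3 bytes to 2, while "xyz" packs into a char in the CJK range and still takes 3 bytes. A different charset would give different numbers, so the measured ratio depends on how the compressed string is actually transmitted.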

Full code. Comments and suggestions are welcome!


Origin juejin.im/post/7079610382548992013