CCF-CSP past problem "202305-3 Decompression": approach + Python and C++ full-score solutions

Readers who want the problems and solutions for other questions can go to: CCF-CSP past problems with solutions


Problem No.: 202305-3
Problem name: Decompression
Time limit: 5.0s
Memory limit: 512.0MB
Problem description:

Background

Sisi Iver Island Operating Company is a large corporation responsible for maintaining and operating the island's infrastructure. Many of its departments run different businesses that need server facilities. To simplify management and reduce operating costs, the company has built a private cloud system. Besides hosting virtual machines, this private cloud also provides other services, the most popular of which is the log service. Previously, the logs of each business system were scattered across its own servers, which made them inconvenient to view and analyze and put them at risk of loss. The log service collects the logs of all business systems in one place, which makes them easy to view and manage.

The logs collected by the log server are all plain text and highly formatted, so log data can be compressed to a very small size. However, the volume of log data is very large and must be processed efficiently. It is therefore acceptable to sacrifice some compression ratio in exchange for a fast compression algorithm.
Little C has been assigned to implement the program that decompresses logs: given a piece of compressed log data, he needs to decompress it.

Problem Description

The data stream produced by this compression algorithm can be viewed as a sequence of elements. There are two types of elements: literals and back references. A literal contains a series of bytes that are output directly during decompression.
A back reference repeats part of the data that has already been decompressed. It is written <o,l> and consists of two numbers, the offset o and the length l, where o,l>0 is required.
The offset is the distance back from the current position, and the length is the number of bytes to output. Suppose p bytes have been decompressed so far. When o≥l, the back reference outputs the l bytes starting at offset (p−o), where the first byte has offset 0. For example, if the decompressed data so far is abcde, the back reference <3,2> outputs cd.
When o<l, it outputs the o bytes starting at offset (p−o) and then keeps repeating those o bytes until l bytes in total have been output. For example, if the decompressed data so far is abcde, the back reference <2,5> outputs deded.
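
The two cases above (o≥l and o<l) can be handled by a single byte-by-byte copy. The following is a minimal sketch, not part of the original solution; the helper name applyBackReference is hypothetical, and it assumes 0 < o ≤ out.size() and l > 0.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: append the bytes described by back reference <o, l> to `out`.
// Assumes 0 < o <= out.size() and l > 0, as the problem guarantees for valid data.
void applyBackReference(std::string& out, std::size_t o, std::size_t l) {
    const std::size_t start = out.size() - o;  // offset (p - o)
    out.reserve(out.size() + l);
    for (std::size_t i = 0; i < l; ++i) {
        // When o < l, start + i eventually points at bytes appended by this
        // same loop, which yields the required repetition of the o bytes.
        const char b = out[start + i];
        out.push_back(b);
    }
}
```

With out equal to "abcde", applyBackReference(out, 3, 2) appends cd and applyBackReference(out, 2, 5) appends deded, matching the two examples in the statement.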

The compressed data consists of two parts: the boot field and the data field. The boot field stores the length of the original data. Let the original data length be n. Then n can be written as $n = \sum_{k=0}^{d} c_k \times 128^k$, where $0 \le c_k < 128$ and $c_d \ne 0$. The boot field is (d+1) bytes long and stores $c_0+128, c_1+128, \cdots, c_{d-1}+128, c_d$ in that order. That is, each byte holds $c_k$ in its lower 7 bits; the highest bit is 0 in the last byte and 1 in every other byte. For example, if the original data length is 1324, then the $c_k$ are 44 and 10, that is, 0x2C and 0x0A in hexadecimal. The boot field is therefore 2 bytes long and consists of the bytes 0xAC 0x0A.
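
As an illustration of the boot-field encoding, here is a small sketch that decodes the length from raw bytes. It is not taken from the solution further down (which works on hex text); the function name decodeLength and the byte-vector representation are assumptions, and the input is assumed to be valid.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: decode the boot field that starts at data[pos].
// Advances pos past the boot field and returns the original data length n.
// Assumes the data is valid, so a byte with highest bit 0 is always reached.
std::uint64_t decodeLength(const std::vector<std::uint8_t>& data, std::size_t& pos) {
    std::uint64_t n = 0;
    int shift = 0;  // c_k is weighted by 128^k, i.e. shifted left by 7 * k bits
    while (true) {
        const std::uint8_t b = data[pos++];
        n |= static_cast<std::uint64_t>(b & 0x7F) << shift;
        shift += 7;
        if ((b & 0x80) == 0) break;  // highest bit 0 marks the last byte
    }
    return n;
}
```

For the bytes 0xAC 0x0A this gives 44 + 10 × 128 = 1324, as in the example above.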


Figure: compressed data format

The data field stores the compressed data as a sequence of consecutive elements. The lowest two bits of the first byte of each element indicate the element type. When the lowest two bits are 00, the element is a literal. If the literal contains l bytes and l≤60, the upper 6 bits of the first byte store (l−1), and the following l bytes are the original bytes of the literal. For example, the byte 0xE8 is 1110 1000 in binary; its lowest two bits are 00, so it starts a literal. Its upper six bits are 111010, the number 58, which means the literal contains 59 bytes. The 59 bytes following this byte are therefore the original bytes of the literal. If l>60, then (l−1) is stored after the first byte in 1 to 4 bytes in little-endian order.
When the upper six bits of the first byte hold 60, 61, 62 or 63, (l−1) is stored in 1, 2, 3 or 4 bytes respectively. For example, in the byte sequence 0xF4 0x01 0x0A, the first byte is 1111 0100 in binary; its lowest two bits are 00, so it starts a literal.
Its upper 6 bits are 111101, the number 61, which means the literal length is stored in the next two bytes. Those two bytes, 0x01 0x0A, form the hexadecimal number 0x0A01 in little-endian order, that is, decimal 2561, so the literal contains 2562 bytes.
The next 2562 bytes are the original bytes of the literal.
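
The literal header rules can be sketched as follows. This is an illustrative helper working on raw bytes, not the approach of the solution below; the name readLiteralLength is hypothetical, and it assumes data[pos] is the first byte of a literal (lowest two bits 00) in valid data.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: read a literal header starting at data[pos].
// Advances pos to the first payload byte and returns the literal length l.
std::uint64_t readLiteralLength(const std::vector<std::uint8_t>& data, std::size_t& pos) {
    const unsigned high6 = data[pos++] >> 2;  // upper six bits of the first byte
    if (high6 < 60) return high6 + 1;         // short form: high6 holds l - 1
    const unsigned extra = high6 - 59;        // 60..63 -> 1..4 extra length bytes
    std::uint64_t lengthMinusOne = 0;
    for (unsigned i = 0; i < extra; ++i) {
        // Little-endian: the first extra byte is the least significant one.
        lengthMinusOne |= static_cast<std::uint64_t>(data[pos++]) << (8 * i);
    }
    return lengthMinusOne + 1;
}
```

For 0xE8 this returns 59, and for 0xF4 0x01 0x0A it returns 2562, in line with the two examples above.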


Figure: literal, length at most 60 bytes


Figure: literal, longer than 60 bytes

When the lowest two bits of the first byte of an element are 01, the element is a back reference <o,l> with 4≤l≤11 and 0<o≤2047. Here o occupies 11 bits: its low 8 bits are stored in the following byte, and its high 3 bits are stored in the high 3 bits of the first byte. (l−4) occupies 3 bits and is stored in bits 2 to 4 of the first byte, as shown below:

 7 6 5 4 3 2 1 0   7 6 5 4 3 2 1 0
+-----+-----+-+-+ +----------------+
|o(h3)| l-4 |0|1| |o (lower 8 bits)|
+-----+-----+-+-+ +----------------+

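For example, in the bytes 0x2D 0x1A, the first byte is 001 011 01 in binary. Its lowest two bits are 01, so this is a back reference. Bits 2 to 4 are 011, the number 3, which means (l−4)=3, that is, l=7.
Its high 3 bits are 001, which together with the following byte 0x1A form the hexadecimal number 0x11A, that is, decimal 282, so o=282. This back reference is therefore <282,7>.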
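
The bit layout of this first form can be unpacked with shifts and masks. The sketch below is illustrative only; the struct BackRef and the function decodeForm1 are hypothetical names, and the first byte is assumed to have 01 as its lowest two bits.

```cpp
#include <cstdint>

// Hypothetical result type for a decoded back reference <o, l>.
struct BackRef {
    std::uint32_t o;  // offset
    std::uint32_t l;  // length
};

// Decode a form-1 back reference from its two bytes.
BackRef decodeForm1(std::uint8_t first, std::uint8_t next) {
    const std::uint32_t l = ((first >> 2) & 0x07u) + 4;  // bits 2..4 hold l - 4
    const std::uint32_t o =
        (static_cast<std::uint32_t>(first >> 5) << 8) | next;  // high 3 bits, then low 8 bits
    return BackRef{o, l};
}
```

Calling decodeForm1(0x2D, 0x1A) yields o = 282 and l = 7, the values worked out above.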


Figure: back reference, form 1

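When the lowest two bits of the first byte of an element are 10, the element is a back reference <o,l> with 1≤l≤64 and 0<o≤65535. Here o occupies 16 bits and is stored in little-endian order in the following two bytes.
(l−1) occupies 6 bits and is stored in the upper 6 bits of the first byte. For example, in the bytes 0x3E 0x1A 0x01, the first byte is 0011 1110 in binary. Its lowest two bits are 10, so this is a back reference. Its upper 6 bits are 001111, the number 15,
so (l−1)=15, that is, l=16. The following two bytes, 0x1A 0x01, form the hexadecimal number 0x011A in little-endian order, that is, decimal 282, so o=282. This back reference is therefore <282,16>.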
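
The second form can be decoded in the same spirit. This is again only a sketch with a hypothetical function name, decodeForm2; it returns the pair {o, l} and assumes the first byte has 10 as its lowest two bits.

```cpp
#include <cstdint>
#include <utility>

// Decode a form-2 back reference from its three bytes; returns {o, l}.
std::pair<std::uint32_t, std::uint32_t> decodeForm2(std::uint8_t first,
                                                    std::uint8_t low,
                                                    std::uint8_t high) {
    const std::uint32_t l = static_cast<std::uint32_t>(first >> 2) + 1;  // upper 6 bits hold l - 1
    const std::uint32_t o = static_cast<std::uint32_t>(low) |
                            (static_cast<std::uint32_t>(high) << 8);     // little-endian 16-bit offset
    return {o, l};
}
```

For the bytes 0x3E 0x1A 0x01 the result is {282, 16}, matching the example.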


Figure: back reference, form 2

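The lowest two bits of the first byte of an element must never be 11. If they are, the data field is invalid.

The compressed data is valid if and only if all of the following conditions hold:

  1. The boot field is at most 4 bytes long;
  2. The boot field can be correctly decoded into the length of the original data;
  3. The lowest two bits of the first byte of every element are not 11;
  4. Every element can be restored to original data according to the rules;
  5. The length of the recovered original data is exactly the length encoded in the boot field.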

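Input format

Read from standard input.

The input consists of several lines. The first line contains a positive integer s, the number of bytes of compressed data in the input.

The next ⌈s/8⌉ lines contain the compressed data. Each line contains only digits and the letters a to f.
Every two characters form a hexadecimal number representing one byte. Every line except the last contains exactly 8 bytes. The input data is guaranteed to be valid.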
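
One way to turn such input lines into raw bytes is sketched below. The helper names hexValue and appendHexLine are hypothetical, and the code assumes each line contains only the characters 0-9 and a-f in pairs, as the input format states.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Value of one lowercase hexadecimal digit (assumes c is in 0-9 or a-f).
int hexValue(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : c - 'a' + 10;
}

// Append the bytes encoded by one input line (two hex characters per byte) to `out`.
void appendHexLine(const std::string& line, std::vector<std::uint8_t>& out) {
    for (std::size_t i = 0; i + 1 < line.size(); i += 2) {
        out.push_back(static_cast<std::uint8_t>(hexValue(line[i]) * 16 + hexValue(line[i + 1])));
    }
}
```

Calling appendHexLine once per data line collects all s bytes in order; the last line simply contributes fewer than 8 bytes.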

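Output format

Write to standard output.

Output the decompressed data, 8 bytes per line, each byte written as two hexadecimal digits (digits or the letters a to f). The last line may contain fewer than 8 bytes.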
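
The output rule can be sketched as a function that builds the whole text in memory. The name formatHexLines is hypothetical, and the sketch assumes the decompressed data is held as raw bytes rather than as hex text.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Format bytes as lowercase hex, 8 bytes (16 characters) per line.
// A final partial line, if any, is also terminated with a newline.
std::string formatHexLines(const std::vector<std::uint8_t>& data) {
    static const char digits[] = "0123456789abcdef";
    std::string text;
    text.reserve(data.size() * 2 + data.size() / 8 + 1);
    for (std::size_t i = 0; i < data.size(); ++i) {
        text.push_back(digits[data[i] >> 4]);
        text.push_back(digits[data[i] & 0x0F]);
        if (i % 8 == 7) text.push_back('\n');  // a full line of 8 bytes is complete
    }
    if (data.size() % 8 != 0) text.push_back('\n');
    return text;
}
```

Building one string and writing it once avoids flushing the stream on every line, which can matter for outputs of up to 2 MiB.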

Sample input

81
8001240102030405
060708090af03c00
0102030405060708
090a0b0c0d0e0f01
0203040506070809
0a0b0c0d0e0f0102
030405060708090a
0b0c0d0e0f010203
0405060708090a0b
0c0d0e0fc603000d
78

Sample output

0102030405060708
090a000102030405
060708090a0b0c0d
0e0f010203040506
0708090a0b0c0d0e
0f01020304050607
08090a0b0c0d0e0f
0102030405060708
090a0b0c0d0e0f0d
0e0f0d0e0f0d0e0f
0d0e0f0d0e0f0d0e
0f0d0e0f0d0e0f0d
0e0f0d0e0f0d0e0f
0d0e0f0d0e0f0d0e
0f0d0e0f0d0e0f0d
0e02030405060708

Sample explanation

The input above can be arranged as:

80 01
24 0102030405060708090a
f0 3c
    000102030405060708090a0b0c0d0e0f
      0102030405060708090a0b0c0d0e0f
      0102030405060708090a0b0c0d0e0f
      0102030405060708090a0b0c0d0e0f
c6 0300
0d 78

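First, read the first byte 80. Its highest bit is 1, so read the second byte 01. Its highest bit is 0, so the boot field ends here. This gives $c_0=0$ and $c_1=1$,
so the original data length is $0 \times 128^0 + 1 \times 128^1 = 128$.

Next, read the byte 24, which is 0010 0100 in binary. Its lowest two bits are 00, so this is a literal. Its upper six bits are the decimal number 9,
so the literal is 10 bytes long. Reading the next 10 bytes gives the literal 0102030405060708090a.

Next, read the byte f0, which is 1111 0000 in binary. Its lowest two bits are 00, so this is a literal. Its upper six bits are the decimal number 60, which means the following one byte holds the literal length minus 1.
Reading the byte 3c gives the number 60, so the literal is 61 bytes long. The next 61 bytes are then read.

Next, read the byte c6, which is 1100 0110 in binary. Its lowest two bits are 10, so this is a back reference. Its upper six bits are the decimal number 49, so the length l of the back reference is 50.
The next two bytes, 0300, form the hexadecimal number 0x0003 in little-endian order, that is, decimal 3, so the offset o of the back reference is 3. This back reference is therefore <3,50>.
Since 50=16×3+2, the last three bytes in the buffer, 0d 0e 0f, are output 16 times, followed by 0d 0e, for a total of 50 bytes.

Next, read the byte 0d, which is 0000 1101 in binary. Its lowest two bits are 01, so this is a back reference. Its bits 2 to 4 are 011, the decimal number 3, so the length l of the back reference is 7.
The next byte, 78, is 0111 1000 in binary. Joined with the highest three bits 000 of this element's first byte 0d, it gives 000 0111 1000, the decimal number 120, so the offset o of the back reference is 120.
This back reference is therefore <120,7>. So far 121 bytes have been output, so 7 bytes are output starting at offset 121−120=1 from the beginning of the buffer, that is, 02030405060708.

At this point the input has been fully processed. A total of 10+61+50+7=128 bytes have been output, which matches the original data length read from the boot field, so decompression succeeds.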

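Subtasks

For 10% of the inputs, the decompressed data is at most 127 bytes long and contains only literals, each holding at most 60 bytes of data;

For 20% of the inputs, the decompressed data is at most 1024 bytes long and contains only literals, each holding at most 60 bytes of data;

For 40% of the inputs, the decompressed data is at most 1024 bytes long and contains only literals;

For 60% of the inputs, the decompressed data is at most 1024 bytes long, and every back reference it contains has 01 as the lowest two bits of its first byte;

For 80% of the inputs, the decompressed data is at most 4096 bytes long;

For 100% of the inputs, the decompressed data is at most 2 MiB ($2 \times 2^{20}$ bytes) long, $s \le 2 \times 10^6$, and the compressed data is guaranteed to be valid.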

Problem source: Decompression

Interested readers can go there to practice and submit their code.

Understanding the problem:

        You are given a piece of compressed data that splits into a boot field and a data field. The boot field gives the length of the decompressed data. The data field is a sequence of elements, and the type of each element is determined by the lowest two bits of its first byte: 00 means a literal, and 01 or 10 means a back reference. Output the decompressed data, 8 bytes per line; the last line may hold fewer than 8 bytes.

Idea analysis:

        Since bytes are read many times, it is best to wrap this in a function that reads bytes and records the current read position. Multi-byte values are stored in little-endian order, so a function that reorders a hex string from little-endian order is also worth considering. Since both the 01 and 10 element types end by performing a back reference, the step that appends the referenced bytes to the output can be a function too. Working with raw bytes is cumbersome here, so the hexadecimal text can be converted with stoi() (signed integer) or stoul() (unsigned integer).

C++ full-score solution:

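#include <bits/stdc++.h>
using namespace std;
int n, idx; // n: number of compressed bytes; idx: index of the next hex character to read
string res; // decompressed data, stored as hex characters (two per byte)

// Read num bytes (2 * num hex characters) from standard input and advance idx.
string readBytes(int num)
{
    // A std::string is used instead of a variable-length array, which is not
    // standard C++ and could overflow the stack for long literals.
    string bytes(2 * num, '0');
    for (int i = 0; i < 2 * num; i ++) cin >> bytes[i];
    idx += num * 2;
    return bytes;
}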

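// Append the l bytes described by back reference <o, l> to res.
// res holds two hex characters per byte, so offsets and lengths are doubled.
void trackBack(int o, int l)
{
    size_t start = res.length() - (size_t)o * 2;
    // Copying character by character also covers o < l: once the o source bytes
    // are used up, the copy continues into the characters this loop just appended.
    for (int i = 0; i < l * 2; i ++) res += res[start + i];
}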
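int main()
{
    cin >> n;
    string bts;
    vector<int> c;
    int v_c;
    // Read boot-field bytes until one has its highest bit equal to 0.
    // Comparing the two-character lowercase hex strings works because
    // every value from "80" to "ff" sorts at or after "80".
    while ((bts = readBytes(1)) >= "80")
    {
        v_c = stoi(bts, nullptr, 16);
        v_c -= 128;
        c.push_back(v_c);
    }
    // The highest bit is 0: store the byte in c as-is
    v_c = stoi(bts, nullptr, 16);
    c.push_back(v_c);
    // End of the boot field: compute the original data length with integer
    // arithmetic (the input is guaranteed valid, so it is not checked later)
    long long length = 0, weight = 1;
    for (size_t i = 0; i < c.size(); i ++, weight *= 128) length += c[i] * weight;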

    while (idx < n * 2)
    {
        // The data field follows
        // Read the first byte of the next element
        bts = readBytes(1);
        string string_to_binary = bitset<8>(stoi(bts, nullptr, 16)).to_string();
        string lowest_two_digits = string_to_binary.substr(6, 2);
        if (lowest_two_digits == "00")
        {
            string high_six_digits = string_to_binary.substr(0, 6);
            int ll = stoi(high_six_digits, nullptr, 2);
            // l <= 60: the upper six bits ll store l - 1
            if (ll <= 59)
                res += readBytes(ll + 1);
            else
            {
                // Upper six bits of 60, 61, 62 or 63 mean l - 1 is stored in 1, 2, 3 or 4 bytes
                int literal_length = ll - 59;
                // Reorder the little-endian hex string: 0x01 0x0A => 0x0A01
                string string1 = readBytes(literal_length);
                string string2;
                // Reverse the string two characters (one byte) at a time
                for (int i = string1.length() - 2; i >= 0; i -= 2)
                    string2 += string1.substr(i, 2);
                // stoul rather than stoi, so a 4-byte value cannot overflow a signed int
                int l = 1 + (int)stoul(string2, nullptr, 16); // literal length
                res += readBytes(l);
            }
        }
        else if (lowest_two_digits == "01")
        {
            // Bits 2 to 4 are the three characters starting at index 3, e.g. 001 011 01
            string two_to_four_digits = string_to_binary.substr(3, 3);
            // l - 4 occupies 3 bits, stored in bits 2 to 4 of the first byte
            int l = stoi(two_to_four_digits, nullptr, 2) + 4;
            // o occupies 11 bits: the low 8 bits are in the following byte,
            // the high 3 bits are in the high 3 bits of the first byte
            string high_three_digits = string_to_binary.substr(0, 3);
            string next_byte_binary = bitset<8>(stoi(readBytes(1), nullptr, 16)).to_string();
            int o = stoi(high_three_digits + next_byte_binary, nullptr, 2);
            // Perform the back reference
            trackBack(o, l);
        }
        else if (lowest_two_digits == "10")
        {
            string high_six_digits = string_to_binary.substr(0, 6);
            // l - 1 occupies 6 bits, stored in the upper 6 bits of the first byte
            int l = stoi(high_six_digits, nullptr, 2) + 1;
            // o occupies 16 bits, stored in little-endian order in the following two bytes
            string string1 = readBytes(2);
            string string2;
            // Reverse the string two characters (one byte) at a time
            for (int i = string1.length() - 2; i >= 0; i -= 2)
                string2 += string1.substr(i, 2);
            int o = stoi(string2, nullptr, 16);
            // Perform the back reference
            trackBack(o, l);
        }
    }
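    for (size_t i = 0; i < res.length(); i ++)
    {
        cout << res[i];
        // Start a new line after every 16 characters (8 bytes);
        // '\n' avoids the stream flush that endl forces on every line
        if ((i + 1) % 16 == 0) cout << '\n';
    }
    // If the last line holds fewer than 8 bytes, terminate it with a newline
    if (res.length() % 16) cout << '\n';
    return 0;
}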


Origin blog.csdn.net/weixin_53919192/article/details/131565864