Arithmetic Coding: Principles and a C++ Implementation

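Introduction

From Wikipedia: arithmetic coding is a lossless data compression method and a form of entropy coding. Other entropy coding methods usually split the input message into symbols and encode each symbol separately, whereas arithmetic coding encodes the entire input message as a single number, a fraction n with 0.0 ≤ n < 1.0.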

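How Encoding Works

The principle of arithmetic coding is hard to convey intuitively in a few words, and the mathematics behind it runs deeper still. This section therefore illustrates the principle with "HELLO WORLD" as an example.

Encoding and decoding "HELLO WORLD" proceeds as follows: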

  1. Count how often each letter occurs in "HELLO WORLD". For simplicity, the space between HELLO and WORLD is ignored:
Character  Frequency
D          1
E          1
H          1
L          3
O          2
R          1
W          1
  2. Assign each letter a sub-range of the interval from 0 to 1 in proportion to its frequency:
Character  Frequency  Probability  Range
D          1          1/10         0.0-0.1
E          1          1/10         0.1-0.2
H          1          1/10         0.2-0.3
L          3          3/10         0.3-0.6
O          2          2/10         0.6-0.8
R          1          1/10         0.8-0.9
W          1          1/10         0.9-1.0
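The two tables above can be derived mechanically. The following is a minimal sketch, not part of the original code; the helper name buildRanges is hypothetical and it assumes the symbols are laid out in alphabetical order, as in the table.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: maps each symbol to its [low, high) slice of [0, 1).
// Slices are laid out in alphabetical order and sized by relative frequency.
std::map<char, std::pair<double, double>> buildRanges(const std::string& text) {
    std::map<char, int> freq;
    for (char c : text) ++freq[c];  // step 1: count occurrences

    std::map<char, std::pair<double, double>> ranges;
    const double total = static_cast<double>(text.size());
    int cumulative = 0;
    for (const auto& entry : freq) {  // std::map iterates in sorted key order
        const double low = cumulative / total;
        const double high = (cumulative + entry.second) / total;
        ranges[entry.first] = std::make_pair(low, high);  // step 2: assign slice
        cumulative += entry.second;
    }
    return ranges;
}
```

Calling buildRanges("HELLOWORLD") yields seven entries, one per distinct letter, with L occupying the slice from 0.3 to 0.6.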
  3. The encoding algorithm is:
Set low to 0.0
Set high to 1.0
While there are still input symbols do
    get an input symbol
    code_range = high - low
    high = low + code_range * high_range of the symbol being coded
    low = low + code_range * low_range of the symbol being coded
End of While
output low
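The pseudocode above can be sketched in C++ as follows. This is an illustration only: the names helloModel and encodeInterval are hypothetical, the model is hard-coded from the range table, and double precision is only adequate for very short messages.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <utility>

using Range = std::pair<double, double>;  // [low, high) slice of [0, 1)

// Fixed model for "HELLOWORLD", copied from the range table.
std::map<char, Range> helloModel() {
    return {{'D', {0.0, 0.1}}, {'E', {0.1, 0.2}}, {'H', {0.2, 0.3}},
            {'L', {0.3, 0.6}}, {'O', {0.6, 0.8}}, {'R', {0.8, 0.9}},
            {'W', {0.9, 1.0}}};
}

// Narrows [low, high) once per symbol and returns the final interval.
// Throws std::out_of_range if the text contains a symbol outside the model.
Range encodeInterval(const std::string& text, const std::map<char, Range>& model) {
    double low = 0.0;
    double high = 1.0;
    for (char c : text) {
        const Range& r = model.at(c);
        const double codeRange = high - low;
        high = low + codeRange * r.second;  // uses the old low
        low = low + codeRange * r.first;
    }
    return std::make_pair(low, high);
}
```

The function returns both ends of the final interval rather than only low, so that a caller can pick any value inside it as the code.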

Applying this algorithm to "HELLO WORLD":

encoding H (H's range is 0.2 - 0.3), Range (code_range above) = 1 - 0 = 1
low  = 0 + (1 * 0.2) = 0.2
high = 0 + (1 * 0.3) = 0.3
no output

encoding E (E's range is 0.1 - 0.2), Range = 0.3 - 0.2 = 0.1
low  = 0.2 + (0.1 * 0.1) = 0.21
high = 0.2 + (0.1 * 0.2) = 0.22
output 0.2

encoding L (L's range is 0.3 - 0.6), Range = 0.22 - 0.21 = 0.01
low  = 0.21 + (0.01 * 0.3) = 0.213
high = 0.21 + (0.01 * 0.6) = 0.216
output 0.21

encoding the next L (L's range is 0.3 - 0.6), Range = 0.216 - 0.213 = 0.003
low  = 0.213 + (0.003 * 0.3) = 0.2139
high = 0.213 + (0.003 * 0.6) = 0.2148
no output

encoding O (O's range is 0.6 - 0.8), Range = 0.2148 - 0.2139 = 0.0009
low  = 0.2139 + (0.0009 * 0.6) = 0.21444
high = 0.2139 + (0.0009 * 0.8) = 0.21462
output 0.214

encoding W (W's range is 0.9 - 1.0), Range = 0.21462 - 0.21444 = 0.00018
low  = 0.21444 + (0.00018 * 0.9) = 0.214602
high = 0.21444 + (0.00018 * 1.0) = 0.21462
output 0.2146

encoding O (O's range is 0.6 - 0.8), Range = 0.21462 - 0.214602 = 0.000018
low  = 0.214602 + (0.000018 * 0.6) = 0.2146128
high = 0.214602 + (0.000018 * 0.8) = 0.2146164
output 0.21461

encoding R (R's range is 0.8 - 0.9), Range = 0.2146164 - 0.2146128 = 0.0000036
low  = 0.2146128 + (0.0000036 * 0.8) = 0.21461568
high = 0.2146128 + (0.0000036 * 0.9) = 0.21461604
no output

encoding L (L's range is 0.3 - 0.6), Range = 0.21461604 - 0.21461568 = 0.00000036
low  = 0.21461568 + (0.00000036 * 0.3) = 0.214615788
high = 0.21461568 + (0.00000036 * 0.6) = 0.214615896
output 0.214615

encoding D (D's range is 0.0 - 0.1), Range = 0.214615896 - 0.214615788 = 0.000000108
low  = 0.214615788 + (0.000000108 * 0.0) = 0.214615788
high = 0.214615788 + (0.000000108 * 0.1) = 0.2146157988
output 0.2146157, then the remaining 88 from low

This gives the following results:

Character  Frequency  Probability  Range      Low          High
H          1          1/10         0.2 – 0.3  0.2          0.3
E          1          1/10         0.1 – 0.2  0.21         0.22
L          3          3/10         0.3 – 0.6  0.213        0.216
L          3          3/10         0.3 – 0.6  0.2139       0.2148
O          2          2/10         0.6 – 0.8  0.21444      0.21462
W          1          1/10         0.9 – 1.0  0.214602     0.214620
O          2          2/10         0.6 – 0.8  0.2146128    0.2146164
R          1          1/10         0.8 – 0.9  0.21461568   0.21461604
L          3          3/10         0.3 – 0.6  0.214615788  0.214615896
D          1          1/10         0.0 – 0.1  0.214615788  0.2146157988

The algorithm outputs the final Low, so 0.214615788 is the code for "HELLO WORLD". Any number in the final interval from 0.214615788 up to (but not including) 0.2146157988 would decode to the same message.

  4. The idea behind decoding: find the symbol whose range contains the code, output that symbol, then update the code. In pseudocode:
get the number encoding the data
loop
    current symbol = the symbol in whose range the number falls
    current range = current symbol's high value – current symbol's low value
    subtract current symbol's low value from number
    divide the number by the current range
end loop

Decoding 0.214615788 back into "HELLO WORLD" proceeds as follows:

The number is 0.214615788
current symbol = H (range: 0.2 – 0.3)
current range = 0.3 – 0.2 = 0.1
subtract current symbol's low value from number = 0.214615788 – 0.2 = 0.014615788
divide the number by the current range = 0.014615788 / 0.1 = 0.14615788

current symbol = E (range: 0.1 – 0.2)
current range = 0.2 – 0.1 = 0.1
subtract current symbol's low value from number = 0.14615788 – 0.1 = 0.04615788
divide the number by the current range = 0.04615788 / 0.1 = 0.4615788

current symbol = L (range: 0.3 – 0.6)
current range = 0.6 – 0.3 = 0.3
subtract current symbol's low value from number = 0.4615788 – 0.3 = 0.1615788
divide the number by the current range = 0.1615788 / 0.3 = 0.538596

current symbol = L (range: 0.3 – 0.6)
current range = 0.6 – 0.3 = 0.3
subtract current symbol's low value from number = 0.538596 – 0.3 = 0.238596
divide the number by the current range = 0.238596 / 0.3 = 0.79532

current symbol = O (range: 0.6 – 0.8)
current range = 0.8 – 0.6 = 0.2
subtract current symbol's low value from number = 0.79532 – 0.6 = 0.19532
divide the number by the current range = 0.19532 / 0.2 = 0.9766

current symbol = W (range: 0.9 – 1.0)
current range = 1.0 – 0.9 = 0.1
subtract current symbol's low value from number = 0.9766 – 0.9 = 0.0766
divide the number by the current range = 0.0766 / 0.1 = 0.766

The full decoding results are:

Character  Frequency  Probability  Range      Number
H          1          1/10         0.2 – 0.3  0.214615788
E          1          1/10         0.1 – 0.2  0.14615788
L          3          3/10         0.3 – 0.6  0.4615788
L          3          3/10         0.3 – 0.6  0.538596
O          2          2/10         0.6 – 0.8  0.79532
W          1          1/10         0.9 – 1.0  0.9766
O          2          2/10         0.6 – 0.8  0.766
R          1          1/10         0.8 – 0.9  0.83
L          3          3/10         0.3 – 0.6  0.3
D          1          1/10         0.0 – 0.1  0.0
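The decoding loop can be sketched in the same style. The names helloModel and decodeValue are hypothetical, and two assumptions differ from the walkthrough: the message length is passed in explicitly because no terminator exists yet, and the usage below decodes a value from the middle of the final interval (0.2146157934) instead of Low itself, so that floating-point rounding cannot push an intermediate value across a range boundary such as 0.3.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

using Range = std::pair<double, double>;  // [low, high) slice of [0, 1)

// Fixed model for "HELLOWORLD", copied from the range table.
std::map<char, Range> helloModel() {
    return {{'D', {0.0, 0.1}}, {'E', {0.1, 0.2}}, {'H', {0.2, 0.3}},
            {'L', {0.3, 0.6}}, {'O', {0.6, 0.8}}, {'R', {0.8, 0.9}},
            {'W', {0.9, 1.0}}};
}

// Decodes `length` symbols from `value` by repeated subtract-and-divide.
std::string decodeValue(double value, std::size_t length,
                        const std::map<char, Range>& model) {
    std::string out;
    for (std::size_t i = 0; i < length; ++i) {
        for (const auto& entry : model) {
            const Range& r = entry.second;
            if (value >= r.first && value < r.second) {
                out += entry.first;  // symbol whose slice contains the value
                value = (value - r.first) / (r.second - r.first);
                break;
            }
        }
    }
    return out;
}
```

Decoding 0.2146157934 with a length of 10 recovers HELLOWORLD. Each division amplifies rounding error, which is one reason practical coders switch to integers.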

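Three Problems

The example above only illustrates the principle. A practical coder has to solve three more problems:

  1. Terminator. In real decoding, the final interval does not necessarily return exactly to 0.0-1.0, so a terminator symbol is needed to mark the end of the string.

  2. Adaptive probabilities. In the example above, the receiver already knows how often each character occurs in the string. That is unrealistic in practice, so sender and receiver need a shared mechanism that allows encoding and decoding without exchanging the string's statistics beforehand.

  3. Overflow. The longer the string to encode, the more digits the resulting fraction has. Whatever the programming language, a fixed-width floating-point number eventually runs out of precision.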
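The third problem is easy to observe. The sketch below (the function name is hypothetical) keeps narrowing a double interval to the slice 0.3-0.6, as if encoding the letter L over and over, and counts the steps until the interval can no longer shrink. The exact count depends on the platform's floating-point behavior, so none is claimed here; with IEEE 754 doubles it is expected to be a few dozen.

```cpp
#include <cassert>

// Counts how many times [low, high) can be narrowed to the slice [0.3, 0.6)
// before a double can no longer represent a smaller interval.
int stepsUntilPrecisionRunsOut() {
    double low = 0.0;
    double high = 1.0;
    int steps = 0;
    while (steps < 1000) {  // hard cap so the loop always terminates
        const double codeRange = high - low;
        const double newHigh = low + codeRange * 0.6;
        const double newLow = low + codeRange * 0.3;
        if (newHigh - newLow >= codeRange) break;  // interval stopped shrinking
        low = newLow;
        high = newHigh;
        ++steps;
    }
    return steps;
}
```

Once the interval stops shrinking, further symbols can no longer be distinguished, so a message of more than a few dozen symbols cannot be coded this way.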

Source Code

The following C++ code addresses all three problems; the next section explains how:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
typedef long long ll;

// Low and High are handled as 10-digit decimal integers.
// Exact integer constants avoid the rounding risk of pow(10, 9).
const ll head = 1000000000LL;   // 10^9: weight of the most significant digit
const ll second = 100000000LL;  // 10^8: weight of the second digit

// Converts a string of decimal digits to a long long.
ll string2ll(string a) {
	ll ans = 0;
	ll wei = 1;  // place value of the current digit
	for (int i = a.length() - 1; i >= 0; i--) {
		ans += (a[i] - '0')*wei;
		wei *= 10;
	}
	return ans;
}

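// Encoder: reads input.txt and writes the digit string to output.txt
void encode() {
	string original = "";  // plaintext
	string code = "";      // emitted decimal digits
	vector<int> fre;       // adaptive frequency table
	int total = 256;       // sum of all frequencies (256 symbols, each starting at 1)
	ll underflow = -1;     // MSD of Low when underflow began, -1 if none is pending
	ll undernum = 0;       // number of digits removed by underflow handling
	// Initialise Low and High
	ll l = 0;
	ll h = 9999999999LL;
	// Read the plaintext
	ifstream in("input.txt");
	string tmp;
	while (getline(in, tmp)) {
		if (original == "") original = tmp;
		else {
			original += '\n' + tmp;
		}
	}
	original += "#";  // terminator
	// Initialise the frequency table
	for (int i = 0; i <= 255; i++) {
		fre.push_back(1);
	}
	// Process the plaintext
	for (size_t i = 0; i < original.length(); i++) {
		int pos = (unsigned char)original[i];  // 0..255 even where char is signed
		ll range = (h - l) + 1;
		ll num = 0;  // cumulative frequency of all symbols below pos
		for (int j = 0; j < pos; j++) {
			num += fre[j];
		}
		// The divisor is total + 1; decode() must use exactly the same formula
		h = l + range * (num + fre[pos]) / (total + 1) - 1;
		l = l + range * num / (total + 1);
		total++;
		fre[pos]++;
		// Shift out leading digits on which Low and High agree
		while ((l / head) == (h / head)) {
			ll d = l / head;
			code = code + char(d + '0');
			l = (l  % head) * 10;
			h = (h   % head) * 10 + 9;
			// Flush pending underflow digits: 9s if High converged toward Low, else 0s
			char add = '0';
			if (d == underflow) {  // d and underflow must both be long long
				add = '9';
			}
			while (undernum) {
				undernum--;
				code = code + add;
			}
			underflow = -1;
		}
		// Handle underflow: drop the second digit while the interval is too narrow
		while (h - l < second) {
			if (underflow == -1)underflow = l / head;
			undernum++;
			ll tmp = l / head;
			l = tmp * head + (l%second) * 10;
			tmp = h / head;
			h = tmp * head + (h%second) * 10 + 9;
		}
	}
	// All symbols processed: emit the midpoint of the final interval as 10 digits
	ll m = (l + h) / 2;
	string magic = to_string(m);
	while (magic.length() < 10) {
		magic = "0" + magic;  // pad on the left so that leading zeros survive
	}
	// Digits still pending from underflow handling belong right after the first digit
	string pending((size_t)undernum, (m / head == underflow) ? '9' : '0');
	code = code + magic[0] + pending + magic.substr(1);
	ofstream out("output.txt");
	out << code;
}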

// Decoder: reads output.txt and writes the recovered text to ans.txt
void decode() {
	string original = "";  // recovered plaintext
	string code = "";      // digit string produced by encode()
	vector<int> fre;       // adaptive frequency table, updated exactly as in encode()
	int total = 256;       // sum of all frequencies
	// Initialise Low and High
	ll l = 0;
	ll h = 9999999999LL;
	// Read the code
	ifstream in("output.txt");
	in >> code;
	// Initialise the frequency table
	for (int i = 0; i <= 255; i++) {
		fre.push_back(1);
	}
	ll curcode = string2ll(code.substr(0, 10));  // current 10-digit window of the code
	size_t pos = 10;  // index of the next unread digit of code
	int cur = 257;    // last decoded symbol; 257 means none yet
	// Decode until the terminator appears
	while (cur != int('#')) {
		ll tmp = l;
		ll range = (h - l) + 1;
		ll num = 0;
		for (int j = 0; j < (int)fre.size(); j++) {
			ll th = tmp + range * (num + fre[j]) / (total + 1) - 1;
			ll tl = tmp + range * (num) / (total + 1);
			num += fre[j];
			if (curcode >= tl && curcode <= th) {  // th is an inclusive upper bound
				cur = j;
				l = tl;
				h = th;
				total++;
				fre[cur]++;
				break;
			}
		}
		if (cur != int('#')) {
			original = original + char(cur);
		}
		// Mirror the encoder's shift; read 0 if the code has run out of digits
		while ((l / head) == (h / head)) {
			l = (l  % head) * 10;
			h = (h   % head) * 10 + 9;
			curcode = (curcode%head) * 10 + (pos < code.length() ? code[pos] - '0' : 0);
			pos++;
		}
		// Mirror the encoder's underflow handling
		while (h - l < second) {
			ll tmp = curcode / head;
			curcode = tmp * head + (curcode%second) * 10 + (pos < code.length() ? code[pos] - '0' : 0);
			pos++;
			tmp = l / head;
			l = tmp * head + (l%second) * 10;
			tmp = h / head;
			h = tmp * head + (h%second) * 10 + 9;
		}
	}
	ofstream out("ans.txt");
	out << original;
}


int main() {
	encode();
	decode();
	system("pause");  // Windows only: keeps the console window open
	return 0;
}

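Solving the Problems

1. Terminator

This is the easiest problem to solve. The code above uses '#' as the terminator: it is appended to the input after reading, and decoding stops as soon as '#' is decoded. As a consequence, the input text itself must not contain '#'.

2. Adaptive probabilities

All characters to be encoded are assumed to be in the ASCII table. Both the encoder and the decoder build a frequency table. Every entry starts at 1, and after a character is read (or decoded) its count is incremented by one. Decoding works as long as the decoder updates its table in exactly the same way as the encoder. The relevant code is shown below.
Encoding:

	// Initialise the frequency table
	for (int i = 0; i <= 255; i++) {
		fre.push_back(1);
	}
	// Process the plaintext
	for (size_t i = 0; i < original.length(); i++) {
		int pos = (unsigned char)original[i];  // 0..255 even where char is signed
		ll range = (h - l) + 1;
		ll num = 0;  // cumulative frequency of all symbols below pos
		for (int j = 0; j < pos; j++) {
			num += fre[j];
		}
		h = l + range * (num + fre[pos]) / (total + 1) - 1;
		l = l + range * num / (total + 1);
		total++;
		fre[pos]++;
		// ... shifting and underflow handling follow here
	}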

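Decoding:

	// Initialise the frequency table
	for (int i = 0; i <= 255; i++) {
		fre.push_back(1);
	}
	ll curcode = string2ll(code.substr(0, 10));  // current 10-digit window of the code
	size_t pos = 10;  // index of the next unread digit of code
	int cur = 257;    // last decoded symbol; 257 means none yet
	// Decode until the terminator appears
	while (cur != int('#')) {
		ll tmp = l;
		ll range = (h - l) + 1;
		ll num = 0;
		for (int j = 0; j < (int)fre.size(); j++) {
			ll th = tmp + range * (num + fre[j]) / (total + 1) - 1;
			ll tl = tmp + range * (num) / (total + 1);
			num += fre[j];
			if (curcode >= tl && curcode <= th) {  // th is an inclusive upper bound
				cur = j;
				l = tl;
				h = th;
				total++;
				fre[cur]++;
				break;
			}
		}
		// ... shifting and underflow handling follow here
	}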
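The requirement that both sides update their tables identically can be isolated in a small type. The following is a sketch under assumptions: the struct AdaptiveModel and its members are hypothetical names, and it returns raw cumulative counts, whereas the article's code scales them by range / (total + 1) inline.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical adaptive model: every byte value starts with count 1, and the
// count of a symbol grows by one each time that symbol is coded.
struct AdaptiveModel {
    std::vector<int> freq = std::vector<int>(256, 1);
    int total = 256;  // sum of all counts

    // Cumulative counts [lowCount, highCount) of `symbol` out of `total`.
    std::pair<int, int> counts(unsigned char symbol) const {
        int lowCount = 0;
        for (int j = 0; j < symbol; ++j) lowCount += freq[j];
        return std::make_pair(lowCount, lowCount + freq[symbol]);
    }

    // Must be called once per symbol by the encoder and the decoder alike.
    void update(unsigned char symbol) {
        ++freq[symbol];
        ++total;
    }
};
```

If the encoder and the decoder each own one AdaptiveModel and call update with the same symbols in the same order, counts returns identical pairs on both sides at every step, which is all the decoder needs to retrace the encoder's intervals.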

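3. Overflow

This is the hardest problem. The code mainly follows the mechanism described in "Arithmetic Compression With C#", and it works in decimal to make the idea easier to follow.

Recall the encoding table from the first section: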

Character  Frequency  Probability  Range      Low          High
H          1          1/10         0.2 – 0.3  0.2          0.3
E          1          1/10         0.1 – 0.2  0.21         0.22
L          3          3/10         0.3 – 0.6  0.213        0.216
L          3          3/10         0.3 – 0.6  0.2139       0.2148
O          2          2/10         0.6 – 0.8  0.21444      0.21462
W          1          1/10         0.9 – 1.0  0.214602     0.214620
O          2          2/10         0.6 – 0.8  0.2146128    0.2146164
R          1          1/10         0.8 – 0.9  0.21461568   0.21461604
L          3          3/10         0.3 – 0.6  0.214615788  0.214615896
D          1          1/10         0.0 – 0.1  0.214615788  0.2146157988

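A close look at the table shows that the leading digits of Low and High keep converging.

Once E has been encoded, the tenths digit of both Low and High is 2, and it stays 2 no matter how many more characters are encoded. This follows from the way arithmetic coding keeps narrowing the interval.

By the time the "O" in "HELLO" is encoded, the tenths, hundredths and thousandths digits of Low and High all agree.

An implementation can therefore treat the digits on which Low and High agree as having no influence on later calculations, and output them as soon as they have converged.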
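Finding the digits that have already converged amounts to taking the common prefix of Low and High. A minimal sketch, assuming both values are given as digit strings of equal length (the helper name settledDigits is hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the leading digits shared by two digit strings.
std::string settledDigits(const std::string& low, const std::string& high) {
    std::string out;
    for (std::size_t i = 0; i < low.size() && i < high.size(); ++i) {
        if (low[i] != high[i]) break;  // first disagreement ends the prefix
        out += low[i];
    }
    return out;
}
```

For the row of the first O in the table, settledDigits("21444", "21462") returns "214", the three digits that can be output at that point.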

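For simplicity, the code implements arithmetic coding with integers. Integers simplify the representation and avoid some of the limitations of floating-point numbers. Imagine a decimal point in front of each integer. The probability table for "HELLO WORLD" can then be written as: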

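Character  Frequency  Probability  Range
D          1          1/10         0000 – 0999
E          1          1/10         1000 – 1999
H          1          1/10         2000 – 2999
L          3          3/10         3000 – 5999
O          2          2/10         6000 – 7999
R          1          1/10         8000 – 8999
W          1          1/10         9000 – 9999

Note that "9999" here stands for 0.9999... with the 9s repeating, that is, an upper bound that comes arbitrarily close to 1 without including it.

This leads to the following encoding pseudocode, where MSD denotes the most significant digit of the integer: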

set low to 00000
set high to 99999

while there are still input symbols
    get an input symbol
    code_range = (high – low) + 1
    high = low + code_range * high_range of the symbol being coded – 1
    low = low + code_range * low_range of the symbol being coded

    while the MSD of high and low match
        output MSD
        remove MSD from low and shift a new 0 into low
        remove MSD from high and shift a new 9 into high
    end while
end while
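The pseudocode can be turned into a small 5-digit encoder. This is a sketch under assumptions, not the article's implementation: the names helloCounts and encodeDigits are hypothetical, the model is the fixed "HELLOWORLD" table expressed as counts out of 10, and there is no underflow handling, so it is only suitable for short inputs like this one. As in the article's code, it finishes by appending the midpoint of the remaining interval.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Cumulative counts {low, high} out of 10 for the "HELLOWORLD" table.
std::map<char, std::pair<int, int>> helloCounts() {
    return {{'D', {0, 1}}, {'E', {1, 2}}, {'H', {2, 3}}, {'L', {3, 6}},
            {'O', {6, 8}}, {'R', {8, 9}}, {'W', {9, 10}}};
}

// Encodes `text` with 5-digit integers; no underflow handling (assumption).
std::string encodeDigits(const std::string& text) {
    const long long msdUnit = 10000;  // weight of the MSD of a 5-digit value
    const std::map<char, std::pair<int, int>> counts = helloCounts();
    long long low = 0;
    long long high = 99999;
    std::string out;
    for (char c : text) {
        const std::pair<int, int>& r = counts.at(c);
        const long long range = high - low + 1;
        high = low + range * r.second / 10 - 1;  // uses the old low
        low = low + range * r.first / 10;
        while (low / msdUnit == high / msdUnit) {  // MSDs match: shift out
            out += static_cast<char>('0' + low / msdUnit);
            low = (low % msdUnit) * 10;        // shift a 0 into low
            high = (high % msdUnit) * 10 + 9;  // shift a 9 into high
        }
    }
    // Append the midpoint of the final interval as 5 digits, zero-padded.
    const std::string tail = std::to_string((low + high) / 2);
    out += std::string(5 - tail.size(), '0') + tail;
    return out;
}
```

Calling encodeDigits("HELLOWORLD") returns the shifted-out digits followed by the five midpoint digits; read with a leading "0.", the result lies inside the final interval of the floating-point walkthrough.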

Applying this pseudocode to "HELLO WORLD" gives:

First initialize high and low:

Set low to 00000 (or 0.000...)
Set high to 99999 (or 0.999...)

encoding H (H's range is 0.2 – 0.3)
Range (code_range above) = (99999 – 00000) + 1 = 100000 (or 1.00000...)

low  = 00000 + (100000 * 0.2) = 20000 (or 0.200000...)
high = 00000 + (100000 * 0.3) = 30000 – 1 = 29999 (or 0.299999...)

The leading 2 has now converged, so shift it out, append 9 to High and 0 to Low, and continue encoding:

output 2

Set low to 00000
Set high to 99999

encoding E (E's range is 0.1 – 0.2), Range = (99999 – 00000) + 1 = 100000

low  = 00000 + (100000 * 0.1) = 10000 (or 0.100000...)
high = 00000 + (100000 * 0.2) = 20000 – 1 = 19999 (or 0.199999...)

The leading 1 has now converged, so shift it out, append 9 to High and 0 to Low, and continue encoding:

output 21

Set low to 00000
Set high to 99999

encoding L (L's range is 0.3 – 0.6), Range = (99999 – 00000) + 1 = 100000

low  = 00000 + (100000 * 0.3) = 30000 (or 0.300000...)
high = 00000 + (100000 * 0.6) = 60000 – 1 = 59999 (or 0.599999...)

no output

Set low to 30000
Set high to 59999

encoding the next L (L's range is 0.3 – 0.6), Range = (59999 – 30000) + 1 = 30000

low  = 30000 + (30000 * 0.3) = 30000 + 9000 = 39000
high = 30000 + (30000 * 0.6) = 30000 + 18000 = 48000 – 1 = 47999

no output

Set low to 39000
Set high to 47999

encoding O (O's range is 0.6 – 0.8), Range = (47999 – 39000) + 1 = 9000

low  = 39000 + (9000 * 0.6) = 39000 + 5400 = 44400
high = 39000 + (9000 * 0.8) = 39000 + 7200 = 46200 – 1 = 46199

The leading 4 has now converged, so shift it out, append 9 to High and 0 to Low, and continue encoding:

output 214

Set low to 44000
Set high to 61999

encoding W (W's range is 0.9 – 1.0), Range = (61999 – 44000) + 1 = 18000

low  = 44000 + (18000 * 0.9) = 44000 + 16200 = 60200
high = 44000 + (18000 * 1.0) = 44000 + 18000 = 62000 – 1 = 61999

The leading 6 has now converged, so shift it out, append 9 to High and 0 to Low, and continue encoding:

output 2146

Set low to 02000
Set high to 19999

encoding O (O's range is 0.6 – 0.8), Range = (19999 – 02000) + 1 = 18000

low  = 02000 + (18000 * 0.6) = 2000 + 10800 = 12800
high = 02000 + (18000 * 0.8) = 2000 + 14400 = 16400 – 1 = 16399

The leading 1 has now converged, so shift it out, append 9 to High and 0 to Low, and continue encoding:

output 21461

Set low to 28000
Set high to 63999

encoding R (R's range is 0.8 – 0.9), Range = (63999 – 28000) + 1 = 36000

low  = 28000 + (36000 * 0.8) = 28000 + 28800 = 56800
high = 28000 + (36000 * 0.9) = 28000 + 32400 = 60400 – 1 = 60399

no output

Set low to 56800
Set high to 60399

encoding L (L's range is 0.3 – 0.6), Range = (60399 – 56800) + 1 = 3600

low  = 56800 + (3600 * 0.3) = 56800 + 1080 = 57880
high = 56800 + (3600 * 0.6) = 56800 + 2160 = 58960 – 1 = 58959

The leading 5 has now converged, so shift it out, append 9 to High and 0 to Low, and continue encoding:

output 214615

Set low to 78800
Set high to 89599

encoding D (D's range is 0.0 – 0.1), Range = (89599 – 78800) + 1 = 10800

low  = 78800 + (10800 * 0.0) = 78800 + 0 = 78800
high = 78800 + (10800 * 0.1) = 78800 + 1080 = 79880 – 1 = 79879

The leading 7 has now converged, so shift it out, which leaves Low = 88000 and High = 98799. Encoding is finished, so the rest of the code is output as well; the code uses (Low + High) / 2 for this:

output 2146157

output (88000 + 98799) / 2 = 93399 (integer division)

the decimal procedure: output 214615793399

END ENCODING

Putting it all together gives the following table:

Character   Probability  Range      Low    High   Output
Initialize                          00000  99999
H           1/10         0.2 – 0.3  20000  29999  2
E           1/10         0.1 – 0.2  10000  19999  1
L           3/10         0.3 – 0.6  30000  59999
L           3/10         0.3 – 0.6  39000  47999
O           2/10         0.6 – 0.8  44400  46199  4
W           1/10         0.9 – 1.0  60200  61999  6
O           2/10         0.6 – 0.8  12800  16399  1
R           1/10         0.8 – 0.9  56800  60399
L           3/10         0.3 – 0.6  57880  58959  5
D           1/10         0.0 – 0.1  78800  79879  7

Finally 93399 is output as well, so the final code is 214615793399. Read as 0.214615793399, it lies inside the final interval of the floating-point walkthrough, 0.214615788 to 0.2146157988.

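Underflow

During actual encoding, High and Low may fail to converge, as in this case:

High:     700003
Low:      699994

Here the leading digits of High and Low differ, so nothing can be shifted out, yet the two values are so close that effectively only one significant digit is left. The calculation cannot continue.

The remedy is to prevent underflow at the source. If the MSDs of Low and High differ, but the second digit of Low is 9 and the second digit of High is 0, underflow may be imminent.

In that case, delete the second digit of both numbers, then append 9 to High and 0 to Low. Also record the MSD of Low and the number of digits removed:

               Before    After
               ------    -----
High:          40344     43449
Low:           39810     38100
Underflow:     0         3
Undernum:      0         1

Encoding then continues. When the MSDs converge again, compare the converged digit with underflow. If they are equal, High has converged toward Low, so output undernum 9s; otherwise Low has converged toward High, so output undernum 0s.

The code: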

		// Shift out leading digits on which Low and High agree
		while ((l / head) == (h / head)) {
			ll d = l / head;
			code = code + char(d + '0');
			l = (l  % head) * 10;
			h = (h   % head) * 10 + 9;
			// Flush pending underflow digits: 9s if High converged toward Low, else 0s
			char add = '0';
			if (d == underflow) {  // d and underflow must both be long long
				add = '9';
			}
			while (undernum) {
				undernum--;
				code = code + add;
			}
			underflow = -1;
		}
		// Handle underflow: drop the second digit while the interval is too narrow
		while (h - l < second) {
			if (underflow == -1)underflow = l / head;
			undernum++;
			ll tmp = l / head;
			l = tmp * head + (l%second) * 10;
			tmp = h / head;
			h = tmp * head + (h%second) * 10 + 9;
		}
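The Before/After table can be reproduced with a 5-digit version of the same digit surgery. A minimal sketch; Interval and squeezeSecondDigit are hypothetical names, and the constants are scaled down from the 10-digit head and second used in the article's code.

```cpp
#include <cassert>

struct Interval {
    long long low;
    long long high;
};

// Drops the second digit of the 5-digit low and high, then shifts a 0 into
// low and a 9 into high, keeping each value's most significant digit.
Interval squeezeSecondDigit(Interval in) {
    const long long msdUnit = 10000;    // weight of the first digit
    const long long secondUnit = 1000;  // weight of the second digit
    Interval out;
    out.low = (in.low / msdUnit) * msdUnit + (in.low % secondUnit) * 10;
    out.high = (in.high / msdUnit) * msdUnit + (in.high % secondUnit) * 10 + 9;
    return out;
}
```

Applied to Low = 39810 and High = 40344, the function yields 38100 and 43449, matching the After column. The interval width grows from 535 to 5350, which restores room for further narrowing.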

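Decoding

Everything above is described from the encoder's point of view. Decoding works the same way: the decoder simulates the encoding process step by step. See the source code above for the details.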

Reposted from blog.csdn.net/weixin_44318192/article/details/104207701