C#, Numerical Computing - Arithmetic Coding Compression Technology and Method (Compression by Arithmetic Coding) Source Code

Arithmetic Coding for Data Compression

Arithmetic coding is commonly used in both lossless and lossy data compression.

It is an entropy coding technique in which common symbols are encoded with fewer bits than rare symbols. It has some advantages over well-known techniques such as Huffman coding. This article describes the implementation of CACM87 arithmetic coding in detail, giving you a good understanding of everything needed to implement it.

From a historical perspective, this article is an update of a data compression article on arithmetic coding that I wrote more than 20 years ago. That article was published in the print edition of Dr. Dobb's Journal, which meant it was heavily edited to keep the page count down. In particular, the Dr. Dobb's article combined two topics: a description of arithmetic coding, and a discussion of compression using PPM (Prediction by Partial Matching).

Since this new article is published online, space is no longer an important constraint, and I hope that lets me do justice to the details of arithmetic coding. PPM is a worthy topic in its own right and will be discussed in a later article. I hope this new effort, while exasperatingly long-winded, will be the thorough exposition of the topic that I wanted to write in 1991.

I think the best way to understand arithmetic coding is to break it into two parts, and I'm going to use this idea in this article. First, I'll describe how arithmetic coding works, using conventional floating-point arithmetic implemented with standard C++ data types. This allows for a perfectly understandable but somewhat impractical implementation. In other words, it works, but it can only be used to encode very short messages.

The second part of the article describes an implementation that switches to a special kind of math on unbounded binary numbers. This is a somewhat mind-boggling topic in itself, so it helps if you already understand arithmetic coding; that way you are not learning two things at the same time.

Finally, I'll present working example code written in modern C++. It's not necessarily the most optimized code in the world, but it's portable and easy to add to existing projects. It should be great for learning and experimenting with this coding technique.

Fundamentals

The first thing to understand about arithmetic coding is what it produces. Arithmetic encoding takes a message (usually a file) consisting of symbols (almost always eight-bit characters) and converts it to a floating-point number greater than or equal to zero and less than 1. This floating-point number can be long -- in fact, the entire output file is one long number -- which means it's not an ordinary data type that you're used to in traditional programming languages. My implementation of the algorithm has to create this float bit by bit from scratch, and likewise, read it in and decode it bit by bit.

This coding process is done step by step. As each character in the file is encoded, some bits are added to the encoded message, so it builds up over time as the algorithm progresses.
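
To make the step-by-step narrowing concrete, here is a minimal floating-point sketch. It is not the listing that appears later in this article: the class name `FloatArithmeticDemo`, the three-symbol alphabet, and the fixed probabilities are all assumptions chosen for illustration. Each symbol shrinks the current interval to its own slice, and any number inside the final interval identifies the message.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class FloatArithmeticDemo
{
    // Hypothetical fixed model: each symbol owns a slice [Low, High) of [0, 1).
    private static readonly Dictionary<char, (double Low, double High)> Model =
        new Dictionary<char, (double Low, double High)>
        {
            ['A'] = (0.0, 0.5),   // p = 0.50
            ['B'] = (0.5, 0.75),  // p = 0.25
            ['C'] = (0.75, 1.0),  // p = 0.25
        };

    // Narrow [low, high) once per symbol and return the midpoint of the final interval.
    public static double Encode(string message)
    {
        double low = 0.0, high = 1.0;
        foreach (char c in message)
        {
            double range = high - low;
            var (symLow, symHigh) = Model[c];
            high = low + range * symHigh;
            low = low + range * symLow;
        }
        return low + (high - low) / 2.0;
    }

    // Replay the same narrowing, picking at each step the symbol whose slice contains the value.
    // This sketch has no end-of-message symbol, so the caller must supply the length.
    public static string Decode(double value, int length)
    {
        var sb = new StringBuilder();
        double low = 0.0, high = 1.0;
        for (int i = 0; i < length; i++)
        {
            double range = high - low;
            double scaled = (value - low) / range;
            foreach (var entry in Model)
            {
                if (scaled >= entry.Value.Low && scaled < entry.Value.High)
                {
                    sb.Append(entry.Key);
                    high = low + range * entry.Value.High;
                    low = low + range * entry.Value.Low;
                    break;
                }
            }
        }
        return sb.ToString();
    }
}
```

Under these assumptions, `FloatArithmeticDemo.Encode("ABAC")` returns 0.3046875, and `FloatArithmeticDemo.Decode(0.3046875, 4)` recovers "ABAC". Because a `double` carries only 53 bits of precision, the interval collapses after a few dozen symbols, which is why this form is only usable for very short messages.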

The second thing to understand about arithmetic coding is that it relies on a model to represent the symbols it is processing. The job of the model is to tell the encoder what is the probability of a character in a given message. If the model gives accurate probabilities for the characters in the message, then they will be encoded very close to optimal. If the model skews the probabilities of the symbols, the encoder might actually expand the message instead of compressing it!
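
The effect of the model can be sketched with a cumulative-count table and the ideal cost of a message under it, which is the sum of -log2(p) over its symbols. The names `ModelDemo`, `BuildCumulative`, and `IdealBits` are hypothetical helpers for this illustration; clamping each count to at least 1 follows the same idea as the constructor in the listing later in this article.

```csharp
using System;

public static class ModelDemo
{
    // Build cumulative counts: symbol s owns [cum[s], cum[s + 1]) out of cum[counts.Length].
    public static int[] BuildCumulative(int[] counts)
    {
        var cum = new int[counts.Length + 1];
        for (int s = 0; s < counts.Length; s++)
        {
            // Clamp to 1 so that no symbol ends up with a zero-width interval.
            cum[s + 1] = cum[s] + Math.Max(counts[s], 1);
        }
        return cum;
    }

    // Ideal cost in bits of coding `message` (an array of symbol indices) under the model.
    public static double IdealBits(int[] cum, int[] message)
    {
        double total = cum[cum.Length - 1];
        double bits = 0.0;
        foreach (int s in message)
        {
            double p = (cum[s + 1] - cum[s]) / total;
            bits += -Math.Log(p, 2.0);
        }
        return bits;
    }
}
```

For the message `{0, 1, 0, 2}`, a model built from counts `{2, 1, 1}` matches the message statistics and gives an ideal cost of about 6 bits. A skewed model built from counts `{1, 1, 6}` gives roughly 9.4 bits, which is worse than the 8 bits of a plain two-bits-per-symbol encoding. That is the expansion the paragraph above warns about.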

using System;

namespace Legalsoft.Truffer
{
/// <summary>
/// compression by arithmetic coding
/// </summary>
public class Arithcode
{
private int NWK { get; } = 20;
private int nch { get; set; }
private int nrad { get; set; }
private int ncum { get; set; }
private int jdif { get; set; }
private int nc { get; set; }
private int minint { get; set; }
private int[] ilob { get; set; }
private int[] iupb { get; set; }
private int[] ncumfq { get; set; }

public Arithcode(int[] nfreq, int nnch, int nnrad)
{
this.nch = nnch;
this.nrad = nnrad;
this.ilob = new int[NWK];
this.iupb = new int[NWK];
this.ncumfq = new int[nch + 2];

if (nrad > 256)
{
throw new Exception("output radix must be <= 256 in Arithcode");
}

minint = (int)(int.MaxValue / nrad);
ncumfq[0] = 0;
for (int j = 1; j <= nch; j++)
{
ncumfq[j] = ncumfq[j - 1] + Math.Max(nfreq[j - 1], 1);
}
ncum = ncumfq[nch + 1] = ncumfq[nch] + 1;
}

public void messageinit()
{
jdif = (int)(nrad - 1);
for (int j = NWK - 1; j >= 0; j--)
{
iupb[j] = nrad - 1;
ilob[j] = 0;
nc = (int)j;
if (jdif > minint)
{
return;
}
jdif = (int)((jdif + 1) * nrad - 1);
}
throw new Exception("NWK too small in arcode.");
}

public void codeone(int ich, byte[] code, ref int lcd)
{
if (ich > nch)
{
throw new Exception("bad ich in Arithcode");
}
advance(ich, code, ref lcd, 1);
}

public int decodeone(byte[] code, ref int lcd)
{
int ja = (byte)code[lcd] - ilob[nc];
for (int j = nc + 1; j < NWK; j++)
{
ja *= (int)nrad;
ja += (byte)code[lcd + j - nc] - ilob[j];
}
int ihi = (int)(nch + 1);
int ich = 0;
while (ihi - ich > 1)
{
int m = (int)((ich + ihi) >> 1);
if (ja >= multdiv(jdif, ncumfq[m], (int)ncum))
{
ich = (int)m;
}
else
{
ihi = m;
}
}
if (ich != nch)
{
advance(ich, code, ref lcd, -1);
}
return ich;
}

public void advance(int ich, byte[] code, ref int lcd, int isign)
{
int jh = multdiv(jdif, ncumfq[ich + 1], (int)ncum);
int jl = multdiv(jdif, ncumfq[ich], (int)ncum);
jdif = jh - jl;
arrsum(ilob, iupb, jh, NWK, (int)nrad, nc);
arrsum(ilob, ilob, jl, NWK, (int)nrad, nc);
int j = nc;
for (; j < NWK; j++)
{
if (ich != nch && iupb[j] != ilob[j])
{
break;
}
if (isign > 0)
{
code[lcd] = (byte)ilob[j];
}
lcd++;
}
if (j + 1 > NWK)
{
return;
}
nc = j;
for (j = 0; jdif < minint; j++)
{
jdif *= (int)nrad;
}
if (j > nc)
{
throw new Exception("NWK too small in arcode.");
}
if (j != 0)
{
for (int k = nc; k < NWK; k++)
{
iupb[k - j] = iupb[k];
ilob[k - j] = ilob[k];
}
}
nc -= j;
for (int k = (int)(NWK - j); k < NWK; k++)
{
iupb[k] = ilob[k] = 0;
}
return;
}

public int multdiv(int j, int k, int m)
{
return (int)((ulong)j * (ulong)k / (ulong)m);
}

public void arrsum(int[] iin, int[] iout, int ja, int nwk, int nrad, int nc)
{
// Add ja, expressed in base nrad, to the digits iin[nc..nwk-1] and store the result in iout.
int karry = 0;
for (int j = (int)(nwk - 1); j > nc; j--)
{
int jtmp = ja;
ja /= nrad;
iout[j] = iin[j] + (jtmp - ja * nrad) + karry;
if (iout[j] >= nrad)
{
iout[j] -= nrad;
karry = 1;
}
else
{
karry = 0;
}
}
iout[nc] = iin[nc] + ja + karry;
}
}
}
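
The two helpers at the end of the listing are easier to follow in isolation. The sketch below is a standalone illustration rather than a drop-in replacement: `RadixDemo`, `AddInRadix`, and `MulDiv` are hypothetical names, and unlike `arrsum` this version always starts at index 0 and returns a copy instead of writing into a caller-supplied array.

```csharp
using System;

public static class RadixDemo
{
    // Add the non-negative integer `value` to a number stored as base-`radix` digits,
    // most significant digit first (digits[digits.Length - 1] is the least significant).
    public static int[] AddInRadix(int[] digits, int value, int radix)
    {
        var result = (int[])digits.Clone();
        int carry = 0;
        for (int j = result.Length - 1; j > 0; j--)
        {
            int digit = value % radix;   // next base-radix digit of value
            value /= radix;
            int sum = result[j] + digit + carry;
            if (sum >= radix)
            {
                sum -= radix;
                carry = 1;
            }
            else
            {
                carry = 0;
            }
            result[j] = sum;
        }
        // The top digit absorbs whatever is left of value plus the final carry.
        result[0] += value + carry;
        return result;
    }

    // Compute j * k / m with a 64-bit intermediate so that j * k cannot overflow an int.
    public static int MulDiv(int j, int k, int m)
    {
        return (int)((long)j * k / m);
    }
}
```

For example, `RadixDemo.AddInRadix(new[] { 0, 9, 9 }, 1, 10)` yields the digits 1, 0, 0, showing the carry rippling upward, and `RadixDemo.MulDiv(2000000000, 3, 4)` yields 1500000000 even though the product 6000000000 does not fit in a 32-bit integer.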
