Computer Basics Series: Compression Algorithms

File Storage

  Before discussing compression algorithms, we need some background on how files are stored.

  A file is a form of data stored on disk or other storage media. Programs store data in files in units of bytes (B = byte); a file is a collection of byte data. One byte (8 bits) can represent 256 values, written in binary as 00000000 through 11111111. If the stored data is text, the file is a text file; if it is graphics, the file is an image file. In either case, the bytes of a file are stored consecutively.
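As a quick sketch of this byte-level view, the following Python snippet (the string "ABC" is purely illustrative) shows that each half-width character occupies one byte of 8 bits:

```python
# A file is just a sequence of bytes; each byte holds one of 256 values (0-255).
data = "ABC".encode("ascii")   # each half-width (English) character -> 1 byte

print(len(data))               # 3 characters -> 3 bytes
print(list(data))              # [65, 66, 67] -- the byte values
print(format(data[0], "08b"))  # 01000001 -- 8 bits per byte
```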

 

Defining Compression Algorithms

  When a file is too large, we typically compress it to reduce the space it occupies. For example, photos shot with a camera are compressed before being saved to a computer, usually into JPEG format.

  A compression algorithm (compaction algorithm) is a data compression algorithm, and it involves two steps: compression and restoration (decompression). It reduces the number of bytes a file occupies, and thus the space it takes up, without changing the file's content.

Compression algorithms can be classified along several dimensions:

  • Lossless and lossy
    • Lossless compression: the original data can be reconstructed exactly from the compressed data, with no distortion. It is used where strict accuracy is required, such as compressing executable files and ordinary documents, and also in disk compression; it can be applied to multimedia data as well. Compression ratios are relatively modest. Examples: differential coding, RLE, Huffman coding, LZW coding, arithmetic coding.
    • Lossy compression: the original data cannot be recovered completely and exactly; the reconstructed data is only an approximation of the original. It is used where accuracy requirements are looser, such as compressing multimedia data. Compression ratios are relatively large. Examples: predictive coding, perceptual audio coding, fractal compression, wavelet compression, JPEG/MPEG.
  • Symmetry
    • Symmetric coding: encoding and decoding take roughly the same algorithmic complexity and time; most compression algorithms are symmetric.
    • Asymmetric coding: usually encoding is hard and decoding is easy, as in Huffman coding and fractal coding. In cryptography the situation is reversed: encoding is easy and decoding is very hard.
  • Inter-frame and intra-frame: video coding uses both intra-frame and inter-frame methods
    • Intra-frame coding: each image is encoded independently, as when JPEG encodes a still image
    • Inter-frame coding: encoding and decoding must reference adjacent frames, exploiting the temporal redundancy between frames, as in MPEG
  • Real-time: some multimedia applications require data to be processed or transmitted in real time (for example, live recording and playback of MP3/RM/VCD/DVD digital video and audio, video/audio on demand, network live streaming, video telephony, video conferencing), generally requiring a codec delay of ≤ 50 ms. This demands simple, fast, efficient algorithms and high-speed, complex CPU/DSP chips.
  • Layered processing: some compression algorithms can simultaneously handle multimedia data at different resolutions, transmission rates, and quality levels, such as JPEG 2000 and MPEG-2/4.

 

Understanding Several Common Compression Algorithms

The Mechanism of the RLE Algorithm

  Compressing a file by expressing its content in the form character × repeat count is called the RLE (Run Length Encoding) algorithm. RLE is a good compression method and is frequently used to compress fax images. Because an image file is, by nature, a collection of byte data, it can be compressed with the RLE algorithm.

The following example illustrates how the RLE algorithm works:

  1. Take the file AAAAAABBCDDEEEEEF, a text file of 17 half-width (English) characters. Since each half-width character is stored as one byte, the file size is 17 bytes.

  2. Any algorithm that makes this file smaller than 17 bytes counts as compression.

  3. The most obvious approach is to deduplicate runs of the same character, that is, to compress each run as character × repeat count. Compressed this way, the file above becomes the following.

  4. The 17 characters AAAAAABBCDDEEEEEF are successfully compressed into the 12 characters A6B2C1D2E5F1, a compression ratio of 12/17 ≈ 70%. Compression succeeded.
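The character-times-count scheme above can be sketched in a few lines of Python (the function name rle_compress is ours, not from any library):

```python
def rle_compress(text: str) -> str:
    """Compress a string as character + run length (RLE)."""
    out = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                        # extend the run of text[i]
        out.append(text[i] + str(j - i))  # e.g. "AAAAAA" -> "A6"
        i = j
    return "".join(out)

original = "AAAAAABBCDDEEEEEF"
compressed = rle_compress(original)
print(compressed)                                    # A6B2C1D2E5F1
print(len(compressed) * 100 // len(original), "%")   # 70 %
```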

 

Huffman Coding and Morse Code

  Before understanding the Huffman algorithm, you need to let go of the assumption that each alphanumeric character is one byte (8 bits) of data.

  The basic idea of the Huffman algorithm: a text file is a combination of different characters, and different characters appear different numbers of times. For example, in one text file A may appear about 100 times while Q is used only 3 times; situations like this are common. The key of the Huffman algorithm is to represent frequently occurring data with fewer than 8 bits and infrequently occurring data with more than 8 bits. When A and Q are both represented with 8 bits, those characters take 100 × 8 + 3 × 8 = 824 bits; if instead A uses 2 bits and Q uses 10 bits, they take 2 × 100 + 3 × 10 = 230 bits. Note, however, that the file is ultimately stored on disk in 8-bit byte units.
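The bit counts above can be checked directly:

```python
# Frequencies from the paragraph above: A appears 100 times, Q appears 3 times.
fixed = 100 * 8 + 3 * 8       # both characters at 8 bits each
variable = 100 * 2 + 3 * 10   # A at 2 bits, Q at 10 bits
print(fixed)     # 824
print(variable)  # 230
```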

 

Next, look at Morse code. In the Morse code examples below, regard a dot (dit) as a short point and a dash (dah) as a long point.

  Morse code generally represents the characters that occur most frequently in text with short codes. If a short point (dit) is represented by the single bit 1 and a long point (dah) by the two bits 11, then the character E (dit) can be represented as 1, and C (dah dit dah dit) as the 9 bits 110101101. In actual Morse code, if the length of a dit is 1, the length of a dah is 3, and the interval between dits and dahs is 1. Length here refers to the duration of the sound.

  Now rewrite the earlier example AAAAAABBCDDEEEEEF in Morse code. In Morse code, a symbol indicating a time interval must be added between characters; here we use 00 as the separator. The text AAAAAABBCDDEEEEEF therefore becomes A × 6 + B × 2 + C × 1 + D × 2 + E × 5 + F × 1 + separator × 16 = 4 × 6 + 8 × 2 + 9 × 1 + 6 × 2 + 1 × 5 + 8 × 1 + 2 × 16 = 106 bits ≈ 14 bytes. So Morse coding achieves a compression ratio of 14/17 ≈ 82%, which is not especially impressive.
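The 106-bit count above can be reproduced with a small sketch, assuming dit = 1, dah = 11, a 0 between points inside a character, and 00 between characters, exactly as described:

```python
# Morse patterns for the six characters used in the example.
MORSE = {"A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".", "F": "..-."}

def char_bits(code: str) -> str:
    """Encode one Morse pattern: dit -> '1', dah -> '11', '0' between points."""
    return "0".join("1" if p == "." else "11" for p in code)

text = "AAAAAABBCDDEEEEEF"
encoded = "00".join(char_bits(MORSE[c]) for c in text)  # '00' between characters
print(len(encoded))            # 106 bits
print(-(-len(encoded) // 8))   # 14 bytes, rounding up
```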

  Morse code fixes the length of each character's code according to how often that character appears in everyday text. For the specific text AAAAAABBCDDEEEEEF, however, this coding system is not the most efficient.

 

The Huffman Algorithm and Binary Trees

  The Huffman compression algorithm constructs the optimal coding system for each target file and uses that coding system as the basis for compression. Accordingly, which codes (Huffman codes) represent which data differs from file to file. A file compressed with the Huffman algorithm stores both the Huffman coding information and the compressed data.

  Next, organize the characters A through F of AAAAAABBCDDEEEEEF according to the principle that more frequent characters are represented with fewer bits. Sorting by descending frequency of occurrence gives the result below, together with a first-attempt coding scheme.

 

Character   Frequency   Code (scheme)   Bits
A 6 0 1
E 5 1 1
B 2 10 2
D 2 11 2
C 1 100 3
F 1 101 3

  In the table's coding scheme, as a character's frequency of occurrence decreases, its code length gradually grows from 1 bit to 2 and then 3 bits. But this coding system has a problem: given the 3-bit sequence 100, there is no way to tell whether it should be read as the three codes 1,0,0 (E, A, A), as the two codes 10,0 (B, A), or as the single code 100 (C).

  In the Huffman algorithm, the coding system is constructed with the help of a Huffman tree, so that even without separator symbols between characters, the codes can be unambiguously distinguished. The Huffman tree algorithm is relatively complex; what follows is the construction process of a Huffman tree.

  A tree in nature grows leaves outward from its root, whereas a Huffman tree grows branches from its leaves.

  With a Huffman tree, the more frequently a datum occurs, the fewer bits it occupies; this is the core idea of the Huffman tree. As step 2 of the figure above shows, when branches are connected to data, the connection starts from the lowest-frequency data. This means lower-frequency data passes through more branches on the way to the root, and more branches mean more code bits.

  Using the data obtained from the figure above, AAAAAABBCDDEEEEEF is represented as 000000000000 100100 110 101101 0101010101 111, that is, 40 bits = 5 bytes. The data was 17 bytes before compression and an astonishing 5 bytes after, for a compression ratio of 5/17 ≈ 29%. Such a high compression rate is genuinely impressive.
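A minimal sketch of the Huffman construction using Python's min-heap (this is the generic textbook construction, not the exact tree from the figure; the individual codes it produces may differ from those shown above, but the total encoded length comes out the same 40 bits):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table by repeatedly merging the two rarest trees."""
    freq = Counter(text)
    # Heap entries: (frequency, tie-breaker, tree); a tree is a char or a pair.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:  # merge the two lowest-frequency trees each round
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
        count += 1
    codes = {}
    def walk(tree, prefix=""):  # left branch appends 0, right branch appends 1
        if isinstance(tree, str):
            codes[tree] = prefix or "0"
        else:
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
    walk(heap[0][2])
    return codes

text = "AAAAAABBCDDEEEEEF"
codes = huffman_codes(text)
encoded = "".join(codes[c] for c in text)
print(len(encoded))  # 40 bits -> 5 bytes
```

Because every code sits at a leaf of the tree, no code is a prefix of another, which is exactly why the bit stream can be decoded without separator symbols.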

  A Huffman tree can serve as a compression algorithm for any type of data:

File type    Before compression   After compression   Compression ratio
Text file    14862 bytes          4119 bytes          28%
Image file   96062 bytes          9456 bytes          10%
EXE file     24576 bytes          4652 bytes          19%

Reversible and Irreversible Compression

  Finally, consider the data formats of image files. Image files are usually meant to be output to displays, printers, and similar devices. Common image formats include BMP, JPEG, TIFF, and GIF.

  • BMP: an image format produced with the Paint program bundled with Windows
  • JPEG: an image data format commonly used by digital cameras and similar devices
  • TIFF: an image format that includes "tags" in the file so the nature of the data can be identified quickly
  • GIF: a data format developed in the United States, limited to at most 256 colors

  Image files can use the RLE and Huffman algorithms introduced earlier, but in most cases image data does not need to be restored to exactly its pre-compression state; losing some of the data is acceptable. Compression that can restore data to its pre-compression state is called reversible (lossless) compression; compression that cannot is called irreversible (lossy) compression.

  Generally speaking, JPEG files use irreversible compression, so part of the image information is blurred after restoration. GIF uses reversible compression.

 

 

Reference: 对不起，学会这些知识后我飘了

 

 

 

                      


Origin www.cnblogs.com/zhuminghui/p/12333883.html