Compression: how it works and commonly used compression formats

  • Compression is a mechanism that reduces the size of a file through a specific algorithm, so that
    the data takes up fewer bytes. Many companies hand data around as compressed packages and rarely use a database for this. A friend had just moved to a new company, and after being assigned a project he was surprised to find that all the data he was sent came as compressed packages. One of them contained 40 million user records (I don't know exactly what was in it), and his computer could not open it.
    (We're not expert programmers, just beginners.)

That made me start paying attention to compression, because apart from unpacking a few small videos with a compression tool I had never really done anything with it. So I looked up some background on Baidu and wanted to share it here. I hope the experts can give me some pointers.

Back to the topic.
A popular understanding is that compression just removes spaces. In fact, that is not quite right.
Let's talk about how compression works.

  1. File
  • (File) Put simply, file compression finds repeated byte sequences in the file, builds a dictionary of them, and uses a short code to represent each one.

  • For example, if the original file contains a phrase that repeats many times, such as "LiEnze I love you", a short code such as "sb" is produced to stand for it. Of course this is just an example; the real process is much more involved.
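The dictionary idea above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the helper names `compress` and `decompress` and the one-character code `"\x01"` are made up for illustration, and real formats build their dictionaries very differently.

```python
# Toy dictionary substitution: replace a repeated phrase with a short code.
# The phrase, the code "\x01" and both helper names are illustrative only.

def compress(text, phrase, code="\x01"):
    """Replace every occurrence of `phrase` with `code`; return text and dictionary."""
    if code in text:
        raise ValueError("the code must not already occur in the text")
    return text.replace(phrase, code), {code: phrase}

def decompress(packed, dictionary):
    """Expand every code back into the phrase it stands for."""
    for code, phrase in dictionary.items():
        packed = packed.replace(code, phrase)
    return packed

original = "LiEnze I love you. " * 4
packed, dictionary = compress(original, "LiEnze I love you")
restored = decompress(packed, dictionary)
print(len(original), len(packed), restored == original)
```

The packed text is only shorter because the phrase repeats. With a single occurrence, the dictionary entry would cost more than it saves.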

  1. Image
  • (Picture) The computer processes information in binary, and a picture contains a great many points of the same color. Instead of storing every dot, compression records how many blue dots there are starting at a given position, and expresses that in binary (0, 1).
    Compression methods are divided into two kinds
    (sometimes, pursuing efficiency necessarily costs some quality):
  1. Lossy compression:
  • For example, if one pixel in the upper-left corner goes missing when you compress a picture, can you see it with the naked eye? Hardly.
  • So lossy compression is well suited to pictures, audio and video. A typical representative format is .mpeg.
  1. Lossless compression:
  • Lossless compression is used when the data must be preserved exactly and efficiency is not the main concern. There are many representative formats, such as .zip and .rar.
    In fact, the most important part of compression is removing duplicates, that is, compressing repetition.
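To make the lossy/lossless distinction concrete, here is a minimal sketch on one row of grayscale pixel values. Run-length encoding stands in for "how many dots of the same color", and rounding values down to a multiple of 16 stands in for lossy compression. The function names, the pixel values and the step of 16 are all assumptions for illustration, not how .mpeg or .zip actually work.

```python
from itertools import groupby

def rle_encode(pixels):
    """Lossless: collapse runs of equal values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(pixels)]

def rle_decode(pairs):
    """Rebuild the exact original sequence from (value, count) pairs."""
    return [value for value, count in pairs for _ in range(count)]

def quantize(pixels, step=16):
    """Lossy: round each value down to a multiple of `step`, discarding detail."""
    return [(p // step) * step for p in pixels]

row = [200, 200, 200, 201, 202, 202, 17, 17, 17, 17]

pairs = rle_encode(row)            # lossless: decodes back to `row` exactly
coarse = quantize(row)             # lossy: small differences are thrown away
coarse_pairs = rle_encode(coarse)  # fewer, longer runs after quantizing

print(pairs)         # [(200, 3), (201, 1), (202, 2), (17, 4)]
print(coarse_pairs)  # [(192, 6), (16, 4)]
```

Quantizing first gives two runs instead of four, but the original values 200, 201 and 202 can no longer be recovered. That is the trade of quality for efficiency mentioned above.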

Repetition itself comes in two kinds:

  1. Phrase repetition
  • A sequence of three or more bytes that repeats is treated as a phrase.

  • To compress this kind of repetition, zip uses two numbers: the distance from the earlier occurrence to the current position, and the length of the repeat.

  • For example, in abcddddd the repeated d starts at index 3 (the first index is 0) and the repeat length is 5, so the run can be written as d(3,5). This notation is a simplification: zip records the distance back to the earlier copy, not an absolute index.
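In real distance/length form the abcddddd example comes out slightly differently from d(3,5). The sketch below is a greedy toy encoder in the LZ77 style, not zip's actual implementation. The names `lz77_tokens` and `lz77_expand` and the minimum match length of 3 are assumptions for illustration.

```python
def lz77_tokens(data, min_len=3):
    """Greedy toy LZ77: emit literal characters or (distance, length) pairs."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for start in range(i):  # try every earlier position as a match source
            length = 0
            # A match may run past position i (overlap), as LZ77 allows.
            while i + length < len(data) and data[start + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - start
        if best_len >= min_len:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens

def lz77_expand(tokens):
    """Rebuild the text by copying `length` characters from `distance` back."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])
        else:
            out.append(token)
    return "".join(out)

tokens = lz77_tokens("abcddddd")
print(tokens)  # ['a', 'b', 'c', 'd', (1, 4)]
print(lz77_expand(tokens))  # abcddddd
```

The first d is written as a literal, and the remaining four are "go back 1, copy 4". The copy overlaps the text it is producing, which is how a short back-reference can describe a long run.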

You might think that since one byte has 256 possible values, three bytes have 256 ^ 3 (about 16.7 million) possible combinations, so counting on repeats is simply a fantasy. In practice, real data has a strong tendency to repeat.

  • For example, the names of the hero and the heroine in a novel appear over and over, which suits phrase-repetition compression well. However, this method is only effective for one pass of compression.
  • Compressing the file a second time gains little, because the first pass has already destroyed most of the repetition in the source.
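The "second pass gains little" claim is easy to check with Python's standard `zlib` module, which implements the same Deflate algorithm that zip uses. The sample sentence is made up for illustration, and the exact byte counts depend on the zlib version, so none are quoted here.

```python
import zlib

# A deliberately repetitive sample, standing in for names recurring in a novel.
text = ("The hero met the heroine at the hotel. " * 200).encode("utf-8")

once = zlib.compress(text)   # first pass: removes most of the repetition
twice = zlib.compress(once)  # second pass: little repetition is left to find

print("original:", len(text))
print("compressed once:", len(once))
print("compressed twice:", len(twice))

# Both passes are lossless, so undoing them in order restores the text.
restored = zlib.decompress(zlib.decompress(twice))
print(restored == text)
```

On typical input the second number is far smaller than the first, while the third is about the same as the second and can even be slightly larger.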
  1. Single-byte repetition
  • A byte has only 256 possible values. Isn't repetition far more likely at this level? Because we are looking at a single byte, the range of possibilities is much smaller.

  • For example, ASCII text files mostly use letters and digits, and the letter e is said to have the highest frequency.

  • Pictures are even easier to understand: an image tends to be dominated by its dark or light tones, so the same byte values occur again and again.
    By the way, PNG is a lossless image format. Its core algorithm is the same one zip uses (Deflate). The main difference from a zip file is that PNG is an image format, so its file header stores information such as the dimensions of the picture and the number of colors used.

  • The output of the phrase compression described above shows a similar tendency: repeats tend to occur close to the current position, and repeat lengths tend to be fairly short (within 20 bytes).
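Single-byte skew is easy to measure. The sketch below counts byte frequencies in a made-up sample sentence and computes its Shannon entropy, which is the average number of bits per byte an ideal coder would need. Entropy coders such as Huffman coding, which Deflate uses, exploit this skew by giving frequent bytes shorter codes. The sample text and variable names are assumptions for illustration.

```python
import math
from collections import Counter

sample = b"eleven geese were here"

counts = Counter(sample)  # maps byte value -> number of occurrences
most_common_byte, occurrences = counts.most_common(1)[0]

# Shannon entropy in bits per byte: lower means more skewed, more compressible.
total = len(sample)
entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())

print(chr(most_common_byte), occurrences)  # e 10
print(len(counts), "distinct byte values out of 256 possible")
print(round(entropy, 2), "bits per byte, versus 8 uncompressed")
```

Only 10 of the 256 possible byte values occur in this sample, and e alone accounts for 10 of its 22 bytes, so far fewer than 8 bits per byte are needed.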

Common compression formats:

  1. JAR - Java Archive File
    JAR is Java's archive format. You can think of it as a ZIP file, often called a package. The biggest difference from a plain ZIP is that a JAR contains a META-INF/MANIFEST.MF file, which is created automatically when the JAR is generated.

  2. ZIP -
    ZIP is a very common compression format. It does not require separate compression or decompression software, because Windows has built-in support for the ZIP format.

  3. RAR -
    RAR is second only to ZIP in popularity, even though its compression ratio is generally higher than ZIP's. There is also a rising star called 7Z, which often achieves a higher compression ratio than RAR, but RAR is so firmly established in the compression field that it is hard to displace.

  4. CAB
    CAB is a compressed file format introduced by Microsoft, used mainly for installation programs. Because of that, the files inside a CAB have been processed for the installer, and the price is that they may not be directly usable after extraction.

  5. ISO -
    ISO is a disc image format: a copy of the data stored on an optical disc. Opening one can be thought of as extracting files from the disc.

  6. TAR
    TAR files use the .tar suffix. WinZip and WinRAR can both open them because they are associated with the TAR format. The key point is that TAR is a commonly used archive format on Linux.

  7. UUE -
    UUE (uuencode) is an encoding used to avoid the garbled content caused by mixed mail encodings. UUE files can be opened with WinZip and WinRAR.
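As a small illustration of the point that a JAR is a ZIP archive with a manifest inside, here is a sketch using Python's standard `zipfile` module. It builds a JAR-like archive in memory and reads it back. The entry `hello.txt` and the one-line manifest are made up for the example, and a real JAR produced by Java tooling usually has a fuller manifest.

```python
import io
import zipfile

# Build a small JAR-like ZIP archive in memory (nothing is written to disk).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
    archive.writestr("hello.txt", "hello " * 100)  # hypothetical payload

# Read it back the way any ZIP tool would.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    manifest = archive.read("META-INF/MANIFEST.MF").decode("utf-8")
    info = archive.getinfo("hello.txt")

print(names)
print(manifest.strip())
print(info.file_size, "bytes stored as", info.compress_size)
```

The same module can open an existing .jar file by path, since the container format is the same.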

Compression is a fairly involved topic, so take your time studying it.

Origin blog.csdn.net/weixin_47587864/article/details/108490861