A student who refused to take the final exam created a miracle in the history of computing!

Seventy years ago, MIT professor Robert Fano offered the students in his information theory class a choice:

(1) Take the traditional final exam

(2) Solve a difficult problem

The problem was to find the most efficient way to encode letters, numbers, and other symbols using binary digits.

This may sound like a toy puzzle, but in fact such an encoding can also be used to compress information for storage and for transmission over computer networks.

What Professor Fano did not mention was that he himself, and even Claude Shannon, the famous founder of information theory, had been struggling with this very problem.

Huffman, a 25-year-old graduate student who didn't like taking exams, decided to solve the problem.

The "tricked" Huffman had set off down a road of no return.

1

Huffman started working on this problem, considering a message consisting of letters, numbers, and punctuation marks. The simplest way to encode it is to assign each character a unique binary number of the same length, for example:

A -> 01000001

B -> 01000010

C -> 01000011

This method is very easy to parse but inefficient, because some characters occur far more frequently than others, yet every character costs the same number of bits.
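As a minimal sketch (the function name is my own), fixed-length encoding can be expressed like this:

```python
# A fixed-length encoding assigns every character the same number of bits.
# Here each character maps to its 8-bit ASCII code, so the encoded length
# is always 8 * len(text), no matter how skewed the character frequencies are.
def fixed_length_encode(text):
    return "".join(format(ord(ch), "08b") for ch in text)

print(fixed_length_encode("ABC"))  # 010000010100001001000011
```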

A better approach is Morse code, in which the frequent letter E is represented by a single dot, while the less common J requires the longer and more laborious dot-dash-dash-dash.


Although Morse code mixes long and short codes, it is still inefficient. What is more annoying is that extra pauses must be inserted between characters when sending a message; otherwise it would be impossible to distinguish two messages like these:

Dash-dot-dash-dot-dot-dot-dash-dot ("trite")

Dash-dot-dash-dot-dot-dot-dash-dot ("true")
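The ambiguity is easy to check directly. A small sketch (the table below covers only the letters needed for this example):

```python
# Partial Morse table covering only the letters used in this example
MORSE = {"T": "-", "R": ".-.", "U": "..-", "E": ".", "I": ".."}

def to_morse(word):
    # Concatenate the per-letter codes with no pauses between them
    return "".join(MORSE[ch] for ch in word)

print(to_morse("TRUE"))                       # -.-...-.
print(to_morse("TRITE"))                      # -.-...-.
print(to_morse("TRUE") == to_morse("TRITE"))  # True: indistinguishable
```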

In fact, Professor Fano had already partially solved this problem with prefix-free codes.

For example, if the letter S appears very frequently in a particular message, then it can be assigned the extremely short code 01.

The point is that no other letter in the message is assigned a code beginning with 01: codes such as 010, 011, and 0101 are all forbidden.

Since no code is a prefix of any other, the encoded message can be read from left to right without any ambiguity.

For example:

S -> 01

A -> 000

M -> 001

L -> 1

Then 0100100011 can be decoded as SMALL without ambiguity.
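Left-to-right decoding of a prefix-free code can be sketched as follows (the function name is my own):

```python
# The prefix-free code table from the example above
CODES = {"S": "01", "A": "000", "M": "001", "L": "1"}

def decode(bits, codes):
    # Because no code is a prefix of another, we can scan left to right and
    # emit a character the moment the accumulated bits match some code.
    reverse = {code: ch for ch, code in codes.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in reverse:
            out.append(reverse[buf])
            buf = ""
    return "".join(out)

print(decode("0100100011", CODES))  # SMALL
```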

So how do we find an algorithm that assigns the shortest codes to the most common characters, reserves the longest codes for the rarest ones, and guarantees that no code is a prefix of another?

Professor Fano proposed an approximate method: build a binary tree from the top down by repeatedly splitting the characters according to their frequencies in the message, and then assign codes.

The details of that algorithm are omitted here; interested readers can look up Shannon-Fano coding.

Since Professor Fano's method is not always optimal, a better compression strategy had to exist, so he posed the challenge to his students.

2

Huffman worked on the problem for months and tried various approaches, but none of them proved effective.

Huffman despaired: the professor had tricked him, and he had better start preparing for the final exam after all!

Just as he was throwing his notes into the trash, a flash of insight struck him, and the solution appeared.

"This was definitely the most unique moment of my life!"

Huffman's idea is simple and elegant: build the binary tree from the bottom up, according to the frequency of the characters.

An example will make it clear.

Suppose the message is SCHOOLROOM. First count the frequency of each character:

O: 4 times

S/C/H/L/R/M: 1 time each

Huffman first takes two characters with the lowest frequency, say R and M, and joins them into a binary tree whose parent node's frequency is the sum of the two leaves: 2.

[Figure: R and M joined under a parent node with frequency 2]

Now the most frequent character is still O, which appears 4 times; the parent node of R/M has frequency 2, and the other characters still have frequency 1.

Huffman continues by joining the two lowest-frequency nodes into a binary tree, say H and L.

[Figure: H and L joined under a parent node with frequency 2]

He continues with the next lowest-frequency pair, say S and C.

[Figure: S and C joined under a parent node with frequency 2]

The current frequency table looks like this:

O: 4 times

R/M: 2 times

H/L: 2 times

S/C: 2 times

Huffman again takes the lowest-frequency nodes, here H/L and S/C, and joins them into a binary tree.

[Figure: the H/L and S/C subtrees joined under a parent node with frequency 4]

The frequency table becomes like this:

O: 4 times

R/M: 2 times

H/L/S/C: 4 times

Then take the two nodes with the lowest frequencies and join them into a binary tree. Note that the lower-frequency node becomes the left subtree and the higher-frequency one the right subtree.

[Figure: the R/M subtree (left) and the H/L/S/C subtree (right) joined under a parent node with frequency 6]

Finally, the remaining O is joined into the binary tree as well.

[Figure: O joined with the rest of the tree, completing the final tree with frequency 10]

Then label each left branch with 0 and each right branch with 1.

[Figure: the tree with 0 on every left branch and 1 on every right branch]

This yields the code for each character.

[Figure: the resulting codes: O = 0, R = 100, M = 101, H = 1100, L = 1101, S = 1110, C = 1111]

For SCHOOLROOM, the encoding is: 11101111110000110110000101.

(Note: when several nodes have the same frequency, they may be selected in different orders, producing differently shaped trees, so the Huffman code is not unique.)
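The whole bottom-up procedure can be sketched in a few lines using a priority queue. This is an illustrative implementation, not Huffman's original code; its tie-breaking follows insertion order, so the exact codes may differ from the figures, while the total encoded length stays the same:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    # Each heap entry is (frequency, tiebreaker, tree); the unique tiebreaker
    # keeps heapq from ever comparing two trees directly.
    counts = Counter(text)
    heap = [(f, i, ch) for i, (ch, f) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # lowest frequency -> left subtree
        f2, _, right = heapq.heappop(heap)  # next lowest -> right subtree
        # Parent frequency is the sum of the two children
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
        next_id += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, str):           # leaf: a single character
            codes[node] = prefix or "0"
        else:                               # internal node: 0 = left, 1 = right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("SCHOOLROOM")
encoded = "".join(codes[ch] for ch in "SCHOOLROOM")
print(len(encoded))  # 26 bits, matching the example above
```

Any valid tie-breaking produces a code with the same minimal total length; only the shape of the tree (and hence the individual codes) can vary.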

Huffman's algorithm produces an "optimal code", achieving two goals:

(1) No character's code is a prefix of any other character's code.

(2) The total length of the encoded message is minimal.

Huffman had surpassed his teacher's algorithm. He later said:

"Had I known that Professor Fano and Shannon, the founder of information theory, had struggled with this problem, I might never have attempted to solve it, let alone solve it at age 25."


Even the great Shannon had not found such an algorithm; it is remarkable that a student stumbled upon it.

Huffman's algorithm is widely used in data compression, file compression, image encoding, and other fields; it is one of the most fundamental algorithms in the IT industry.

3

Now that the story is over, let me share a few thoughts:

(1) This algorithm was invented seventy years ago, when the United States was far ahead in computing, in both theory and practice. When a field is newly created, there is gold everywhere; once it matures, only the corners are left to pick over.

(2) Huffman's algorithm is simple and beautiful. My fifth-grader read this article, and when I gave her a problem she was quickly able to draw the binary tree and derive the final code.

However, just as with Columbus's discovery of America, most people simply cannot come up with it themselves, let alone express and prove it mathematically.

(3) Huffman never applied for a patent on his invention, and there is still debate about whether such an algorithm could even be patented. Others, however, have made millions from software built on Huffman's algorithm. Huffman's main reward was exemption from the final exam in information theory.

Knuth, the author of "The Art of Computer Programming", said: in the fields of computer science and data communications, Huffman coding is one of the fundamental ideas that people use all the time.

This may be the highest reward!



Origin blog.csdn.net/coderising/article/details/132419334