"The Beauty of Mathematics" Reading Notes & Reflections

I had the honor of taking a photo with the author of this book when I attended the filming of one of his lectures at Beijing University of Posts and Telecommunications. At the time I did not know Dr. Wu Jun well; I just thought he was a very impressive person. After reading "The Beauty of Mathematics", I could not help but hold him in awe. Although the title sounds like a math textbook, the book is really about the almost magical uses of mathematics in fields such as machine learning. Most commendably, the author explains many topics in an accessible way, in a style somewhat reminiscent of Liu Weipeng. In the author's own words, any technology or problem can be divided into "technique" (shu) and "way" (dao), and this book is mostly about the "dao". After reading it, I suddenly realized that mathematics is not just a pile of dull exam questions but the most important tool humans have for understanding the world. Mathematics is not only useful; it is endlessly fascinating.

The long detour of natural language processing (NLP)

The development of NLP is an interesting and representative case. Its roughly seventy-year history can basically be divided into two stages.

The first twenty-odd years, from the 1950s to the 1970s, were a stage in which scientists took a detour. Researchers around the world confined their thinking about NLP to the way humans learn language, that is, to using computers to simulate the human brain. The academic consensus at the time was that for a machine to do things only humans could do, such as translation or speech recognition, the computer had to understand natural language, and to do that it had to possess human-like intelligence. Why did people think so? Because that is how we humans do it; it is that simple. A person who can translate English into Chinese must understand both languages well. That is where intuition leads. So the main research directions of the period were sentence analysis and semantic extraction. In artificial intelligence, this methodology was later nicknamed the "birds fly" school: watch how birds fly and you can build an airplane by imitating them, without ever learning aerodynamics. In fact, the Wright brothers built their airplane on aerodynamics, not on biomimicry. Today machine translation and speech recognition work well and are used by hundreds of millions of people, yet most people outside the field still mistakenly believe these applications depend on computers understanding natural language. In reality, they rely on mathematics, and more precisely on statistics.

From NLP to NMP

Since its birth, natural language has gradually evolved into a context-dependent way of expressing and transmitting information. For a computer to process natural language, a basic problem is therefore to build a mathematical model of this context dependence. That model is what we call a statistical language model, which underlies all of today's NLP and is widely used in machine translation, speech recognition, print and handwriting recognition, pinyin error correction, Chinese input methods, and document search. In practice a bigram model (a Markov chain) is commonly used, which assumes that the probability of any word depends only on the word before it. Because most conditional probabilities would otherwise be zero, the counts are smoothed with the Good-Turing estimate and the Katz backoff method. Google's Rosetta translation system and voice search use a four-gram model, stored across more than 500 Google servers. In 2005, using thousands or even tens of thousands of times more data than other research institutions, Google trained a six-gram model and built the best machine translation system of its day.
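
To make the bigram idea concrete, here is a minimal sketch in Python on a toy corpus, with the Good-Turing adjustment for seen counts; the handling of unseen bigrams is a deliberate simplification of the full Katz backoff described in the book.

```python
# A minimal bigram language model sketch with Good-Turing discounting,
# on a toy corpus; an illustration of the idea, not the book's system.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

# N_r: how many distinct bigrams occur exactly r times.
freq_of_freq = Counter(bigrams.values())

def good_turing_count(r: int) -> float:
    """Adjusted count r* = (r + 1) * N_{r+1} / N_r (falls back to r
    when N_r or N_{r+1} is missing, a common practical shortcut)."""
    if freq_of_freq.get(r) and freq_of_freq.get(r + 1):
        return (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]
    return float(r)

def bigram_prob(w1: str, w2: str) -> float:
    """P(w2 | w1) using Good-Turing-adjusted counts."""
    r = bigrams.get((w1, w2), 0)
    if r == 0:
        # Mass reserved for unseen bigrams is N_1 / total bigrams,
        # spread uniformly here for simplicity (Katz backoff would
        # distribute it in proportion to unigram probabilities).
        return freq_of_freq.get(1, 0) / (sum(bigrams.values()) * len(unigrams))
    return good_turing_count(r) / unigrams[w1]

print(bigram_prob("the", "cat"))  # a seen bigram
print(bigram_prob("cat", "mat"))  # an unseen bigram
```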

Some applications of NLP

Because I built an open-source sheet-music library, I wondered whether I could train a statistical music model on it for research into Natural Music Processing (NMP). The figure below shows the prospective applications of NMP that I arrived at by analogy with NLP.

[Figure: Some applications of NMP]

Personally, I feel the first task of NMP is to build a complete and large score library, so that statistical music models have enough training samples and do not overfit. A concrete way to realize a bigram model is sketched below: split a piece into N bars according to its rhythm, then train the bigram model with bars as the unit. Different applications may call for different granularities of segmentation.
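
Here is a minimal sketch of that bar-level bigram training, assuming notes arrive as hypothetical (pitch, beats) pairs and a fixed four-beat bar; a real score library would supply a richer encoding.

```python
# Bar-level bigram counting over a toy melody; the (pitch, beats)
# encoding is hypothetical, and notes straddling a bar line are
# ignored for simplicity.
from collections import Counter

notes = [("C4", 1), ("E4", 1), ("G4", 2), ("G4", 2), ("E4", 2),
         ("C4", 1), ("E4", 1), ("G4", 2)]

def split_into_bars(notes, beats_per_bar=4):
    bars, bar, filled = [], [], 0
    for pitch, beats in notes:
        bar.append(pitch)
        filled += beats
        if filled >= beats_per_bar:
            bars.append(tuple(bar))
            bar, filled = [], 0
    if bar:
        bars.append(tuple(bar))
    return bars

bars = split_into_bars(notes)
# Bigram counts over consecutive bars, i.e. data for P(next bar | bar).
bar_bigrams = Counter(zip(bars, bars[1:]))
print(bars)
print(bar_bigrams)
```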

NLP and Communication

The Hidden Markov Model was originally used in the field of communications, so how is it related to NLP?

Consider the problem from a different angle. Speech recognition amounts to the listener guessing what the speaker means, which is exactly what happens in communication: the receiver analyzes, decodes, and restores the sender's information from the signal it receives. When we speak, the brain is the information source; the throat (vocal cords) and the air are the channel, playing the role of wires and optical fiber; the listener's ear is the receiver; and the sound heard is the transmitted signal. Inferring the speaker's meaning from the acoustic signal is speech recognition.

Many NLP applications can be understood the same way. In translation, if the speaker's meaning is expressed in Chinese but the channel "encodes" it as English, then using a computer to infer the Chinese meaning from the received English text is machine translation. Likewise, inferring the intended sentence from one containing spelling errors is automatic error correction. In this way, almost every NLP problem can be reduced to a communication decoding problem.
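
Formally, this is the noisy-channel model. The standard formulation (not spelled out in the text) is to seek the source sentence s that best explains the observed signal o, whether o is English text or an acoustic signal:

```latex
% Decoding as communication: find the most probable source s given
% the observation o; P(o) is constant over s and can be dropped.
s^{*} = \arg\max_{s} P(s \mid o)
      = \arg\max_{s} \frac{P(o \mid s)\, P(s)}{P(o)}
      = \arg\max_{s} P(o \mid s)\, P(s)
```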

The earliest successful application of the Hidden Markov Model was speech recognition; the most famous work came from the IBM Watson lab led by Frederick Jelinek, and from Sphinx, the world's first large-vocabulary continuous speech recognition system, developed by Kai-Fu Lee. Before Jelinek, scientists treated speech recognition as an artificial-intelligence and pattern-matching problem; Jelinek treated it as a communication problem and described it cleanly with two hidden Markov models, an acoustic model and a language model. The Hidden Markov Model is also a major tool of machine learning, and like almost every major tool in machine learning it needs a training algorithm (Baum-Welch) and a decoding algorithm (Viterbi). Today, hidden Markov models are widely used in machine translation, pinyin error correction, handwriting recognition, image processing, gene sequence analysis, and even stock forecasting.
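
The Viterbi decoder mentioned above fits in a few lines. Below is a compact sketch using the classic toy weather HMM; the states and probabilities are illustrative, not taken from the book.

```python
# Viterbi decoding for an HMM: find the most probable hidden state
# path given an observation sequence.
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = (best probability of any path ending in s at time t,
    #            the path itself)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            prob, path = max(
                (V[-2][prev][0] * trans_p[prev][s] * emit_p[s][o],
                 V[-2][prev][1] + [s])
                for prev in states
            )
            V[-1][s] = (prob, path)
    return max(V[-1].values())

states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

prob, path = viterbi(("walk", "shop", "clean"), states,
                     start_p, trans_p, emit_p)
print(prob, path)
```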

[Thinking]: Could music score recognition also be treated as a communication problem, solved with an acoustic model plus a music model?

Here is a small aside: since score recognition can be seen as communication, I wondered whether different musical notes could be used to encrypt transmitted information. Just as I thought I had made a major discovery, I came across Hedy Lamarr, the mother of CDMA, and it dawned on me that the idea was decades old.

[Image: Nut phone advertisement]
One day in the early 1940s, Hedy Lamarr met the composer George Antheil, an accomplished pianist, and the two hit it off. While Antheil played the piano for her, Hedy had a sudden inspiration: the jumping keys produce sounds of different frequencies, so could jumping frequencies be used for secure communication?

This is frequency hopping, the core idea of CDMA. Frequency hopping is like splitting a long story into many short segments: after each segment is sent, the frequency changes and the next segment is sent. Sender and receiver can communicate as long as they agree on the rule for changing frequencies, and as long as each segment is short enough it is buried in the background noise, so a third party cannot even detect that the communication exists, let alone intercept it. During the war years, Lamarr moved among political and military dignitaries, which made her conscious of secure communication; moreover, she had studied communications in her youth before dropping out of school to act in films.
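
As a toy illustration of the scheme, the sketch below has sender and receiver derive the same pseudo-random hop sequence from a shared secret seed; real systems rely on synchronized hopping and far more channels.

```python
# A toy frequency-hopping illustration: both ends derive identical
# channel hops from a shared seed, so only they can follow the message.
import random

CHANNELS = list(range(20))  # available carrier frequencies (indices)

def hop_sequence(seed: int, n_segments: int):
    """Derive the same pseudo-random channel per segment on both ends."""
    rng = random.Random(seed)
    return [rng.choice(CHANNELS) for _ in range(n_segments)]

message = ["seg0", "seg1", "seg2", "seg3"]
shared_seed = 42

sender_hops = hop_sequence(shared_seed, len(message))
receiver_hops = hop_sequence(shared_seed, len(message))

# The receiver listens on the right channel at the right time; an
# eavesdropper scanning a single channel sees at most one fragment.
assert sender_hops == receiver_hops
for seg, ch in zip(message, sender_hops):
    print(f"{seg} transmitted on channel {ch}")
```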

A few days later, the patent was born: the "Secret Communication System" (US 2,292,387), filed jointly with Antheil.

A quick tour of search engines

The principle of a search engine is simple. To build one, you need to do three things: automatically download as many web pages as possible; build a fast and efficient index; and rank pages fairly and accurately by relevance.

download

  • Traversal algorithms from graph theory
  • Breadth-first search (BFS)
  • Depth-first search (DFS)
  • A hash table to record whether a page has already been downloaded (see the sketch after this list)
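
A minimal sketch of the BFS crawl with a visited hash table, run over a toy in-memory link graph rather than real HTTP fetches:

```python
# BFS web crawl over a toy link graph; a real crawler fetches pages
# over HTTP and extracts links, but the traversal looks the same.
from collections import deque

links = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": ["a.com"],
    "d.com": [],
}

def crawl_bfs(seed: str):
    visited = {seed}            # hash table: pages already downloaded
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)       # "download" the page here
        for nxt in links.get(url, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return order

print(crawl_bfs("a.com"))  # ['a.com', 'b.com', 'c.com', 'd.com']
```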

index

  • Boolean operations
  • Build tiers of indexes, common and uncommon, according to pages' importance, quality, and visit frequency (a minimal index is sketched below)
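
A minimal inverted index answering a boolean AND query might look like this sketch; a tiered index would simply keep the most important pages in a smaller, faster first tier.

```python
# Inverted index: map each word to the set of documents containing it,
# then answer boolean AND queries by intersecting the sets.
docs = {
    1: "the beauty of mathematics",
    2: "mathematics of communication",
    3: "the art of war",
}

index: dict[str, set[int]] = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def search_and(*words: str) -> set[int]:
    """Boolean AND: documents containing every query word."""
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()

print(search_and("mathematics", "of"))  # {1, 2}
```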

sort

The ranking of search results depends on two sets of information: the quality of each page, and the relevance of the query to each page.

web quality

The core of the PageRank algorithm is to compute each page's weight iteratively and then rank pages by weight.

At the start of the iteration, every page has the same weight; each round then updates the weights according to the following rules:

1. The more pages link to a page, the greater its weight.

2. When a page's weight increases, the weight of the pages it links to increases as well.

3. The more pages a page links to, the less weight each linked page receives from it.

Repeating these iterations, the algorithm eventually converges to a stable ranking. A minimal implementation is sketched below.
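
The sketch runs the iteration over a toy link graph and adds the standard damping factor of 0.85, a detail the text omits.

```python
# PageRank iteration implementing the three rules above.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}   # equal weights to start

damping = 0.85
for _ in range(50):
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for p, outs in links.items():
        share = damping * rank[p] / len(outs)  # rule 3: weight is split
        for q in outs:
            new_rank[q] += share               # rules 1 and 2
    rank = new_rank

print({p: round(r, 3) for p, r in sorted(rank.items())})
```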

web page relevance

The TF-IDF method (Term Frequency–Inverse Document Frequency)

TF: term frequency, the frequency of a given word within a document, normalized by the document's length to avoid a bias toward long documents (the same word is likely to occur more times in a long document than in a short one).

IDF: inverse document frequency, a measure of how much general importance a word carries. The IDF of a word is obtained by dividing the total number of documents by the number of documents containing the word, then taking the base-10 logarithm of the quotient.

Relevance between a query and a page = Σ over the query's keywords of (term frequency × IDF weight)
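
A minimal sketch of this scoring on a toy corpus, following the TF and IDF definitions above:

```python
# TF-IDF relevance: sum over query words of TF(word, doc) * IDF(word).
import math

docs = [
    "mathematics is the queen of science",
    "the beauty of mathematics",
    "speech recognition uses statistics",
]
tokenized = [d.split() for d in docs]

def tf(word: str, doc: list[str]) -> float:
    return doc.count(word) / len(doc)       # normalized by doc length

def idf(word: str) -> float:
    containing = sum(1 for d in tokenized if word in d)
    return math.log10(len(tokenized) / containing) if containing else 0.0

def relevance(query: str, doc: list[str]) -> float:
    return sum(tf(w, doc) * idf(w) for w in query.split())

for i, d in enumerate(tokenized):
    print(i, round(relevance("beauty of mathematics", d), 4))
```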

Google AK-47 Design Philosophy

In computing, a good algorithm should be like the AK-47: simple, effective, reliable, and easy to use, not flashy. Dr. Amit Singhal, a member of the US National Academy of Engineering, is the designer of Google's "AK-47"; the A in Ascorer, Google's internal ranking algorithm, is his initial. Singhal's working philosophy, solve 80% of users' problems first and then slowly work on the remaining 20%, is one of the secrets of success in industry. Many failures happen not because the people are not capable but because the way of working is wrong: they pursue a grand, all-encompassing solution from the start, fail to finish it for ages, and in the end fade away.

I completely agree with Dr. Singhal on this point. Looking back on the days when I built the music score library alone, then watched excellent friends join our core group one after another and our Bilibili channel reach 500 followers, I feel a great deal.

At the beginning, there was a site called Liberty Shrine that was also building a harmonica score library for JE Bar. For the convenience of displaying images it was built on WordPress, which made uploading cumbersome and later blocked many extensions. Around the time I started the GitHub score library, Liberty Shrine began developing its version 2.0. But matching what GitHub issues provide out of the box is easier said than done: the GitHub score library supports drag-and-drop image upload and comes with solid search, including fuzzy search and built-in sorting. After half a year of practice, the score library accumulated experience and refined its upload template. Liberty Shrine faced several problems in building a database with the same functions:

  • The students in charge of the database were too busy and had no time.
  • Implementing the library and score-search functions would take real time.
  • The scores would have to be copied over repeatedly.

At first, many people insisted on storing scores in a database of their own and writing their own search, feeling that only that was trustworthy, but in the end nothing came of it. In the final analysis, for an interest-driven open-source project, the cost of reinventing the wheel, in both time and energy, is simply too high. GitHub is, after all, an open repository for storing code, and storing open-source musical scores there is perfectly reasonable. That way, we spent 80% of our energy building Music Library 2.0 with all the essential functions, while the development of Liberty Shrine 2.0 eventually stalled on its bottlenecks.

The importance of mathematical models

  • A correct mathematical model is simple in form.
  • A correct model may at first be less accurate than a carefully patched wrong one, but if the general direction is right, we should stick with it.
  • Large amounts of accurate data are important for research and development.
  • Even a correct model can be disturbed by noise and look inaccurate; rather than patching it with a makeshift correction, finding the source of the noise may lead to a major discovery.

Bayesian network

Mathematically, a Bayesian network is a weighted directed graph and an extension of the Markov chain. Epistemologically, it overcomes the Markov chain's mechanical linearity and can bring any set of related events into a single framework.
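
For reference, the standard factorization a Bayesian network encodes (not written out in the text) is:

```latex
% The joint distribution factorizes over the graph: each node
% depends only on its parents.
P(x_1, \dots, x_n) = \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{parents}(x_i)\bigr)
```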

Compared with artificial neural networks, Bayesian networks account for contextual correlations more easily, so they can decode an input sequence, for example recognizing a stretch of speech as text or translating an English sentence into Chinese. The outputs of an artificial neural network are relatively isolated: it can recognize words one at a time but struggles with whole sequences. Its main use is therefore to estimate the parameters of probability models, such as training acoustic-model parameters for speech recognition or language-model parameters for machine translation, rather than to act as the decoder.

[Thinking]: A Bayesian network may be better suited to music score recognition.
