The Beauty of Numbers
After reading the book "The Beauty of Numbers", I couldn't help marveling at how wonderful mathematical models are. After all, I was a mathematics major: I went through coursework, mathematical modeling competitions, the postgraduate entrance examination, and written tests for job hunting, and along the way I studied plenty of discrete mathematics, yet I had never thought about where this mathematical knowledge is applied on the Internet. Reading this book was thoroughly satisfying.
Excerpts from my notes on the book follow:
Statistical Language Models
Statistical language models have proven more effective than any known rule-based solution. Using the Bayesian formula (also behind Google's Chinese-English automatic translation), Kai-Fu Lee applied statistical language models to reduce the 997-word speech recognition problem to the equivalent of a 20-word recognition problem, achieving the first large-vocabulary, speaker-independent continuous speech recognition in history.
An application of statistical language models in Chinese processing.
The Hidden Markov Model is a mathematical model that, to date, has been considered the most successful method for implementing fast and accurate speech recognition systems.
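To make the idea concrete, here is a minimal bigram language model sketch in Python. The toy corpus, the add-one smoothing, and the chain-rule approximation are my own illustrative choices, not details from the book.

```python
from collections import defaultdict

# Minimal bigram statistical language model sketch (corpus and smoothing are invented).
corpus = "the cat sat on the mat the cat ate".split()

bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)
for w1, w2 in zip(corpus, corpus[1:]):
    bigram_counts[(w1, w2)] += 1
    unigram_counts[w1] += 1
unigram_counts[corpus[-1]] += 1

vocab = set(corpus)

def p_bigram(w2, w1):
    """P(w2 | w1) with add-one smoothing so unseen pairs keep a small probability."""
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + len(vocab))

def sentence_probability(words):
    """Chain-rule approximation: P(w1..wn) ~ P(w1) * product of P(wi | wi-1)."""
    prob = unigram_counts[words[0]] / len(corpus)
    for w1, w2 in zip(words, words[1:]):
        prob *= p_bigram(w2, w1)
    return prob

print(sentence_probability("the cat sat".split()))
print(sentence_probability("mat the ate".split()))  # less fluent, lower probability
```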
 
Measuring information
In 1948, Shannon proposed the concept of "information entropy", which solved the problem of measuring information quantitatively: the measure of the amount of information equals the amount of uncertainty.
On the Internet and in computers, the "bit" is the unit used to measure the amount of information.
The greater the uncertainty of the variable, the greater the entropy, and the greater the amount of information required to figure it out.
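As a small illustration of measuring information in bits, the sketch below computes Shannon entropy for a few distributions: a fair coin carries 1 bit, a biased coin less, and a uniform choice among 32 equally likely outcomes carries 5 bits.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy H = -sum(p * log2(p)), measured in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy_bits([0.5, 0.5]))        # 1.0  (fair coin)
print(entropy_bits([0.9, 0.1]))        # ~0.469 (less uncertainty, less information needed)
print(entropy_bits([1 / 32] * 32))     # 5.0  (guessing one of 32 equally likely outcomes)
```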
 
Boolean Algebra and Indexing for Search Engines
There cannot be a simpler counting system than binary, and there cannot be a simpler operation than a Boolean operation. Although every search engine today claims to be smart and intelligent, at its core none of them goes beyond Boolean operations.
Boolean algebra couldn't be simpler. There are only two elements, 1 (TRUE) and 0 (FALSE), and the basic operations are "and" (AND), "or" (OR) and "not" (NOT). By comparison, today's search engines are much smarter: they automatically convert the user's query into a Boolean operation. The simplest index structure uses a long binary number to indicate whether a keyword appears in each document. There are as many digits as there are documents; each digit corresponds to a document, with 1 meaning the corresponding document contains the keyword and 0 meaning it does not. The entire index becomes so large that it is impossible to store it on a single computer. The common practice is to divide the index into many shards according to the serial number of the web page and store them on different servers. Whenever a query arrives, it is distributed to those servers, which process the user's request in parallel, send their results to a main server for merging, and finally return the result to the user.
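A toy sketch of this bitmap index: one long binary number per keyword, one bit per document, and a query that reduces to a bitwise AND. The three documents are invented for illustration.

```python
# Toy bitmap index: bit i of index[word] is 1 iff document i contains that word.
documents = [
    "atomic energy application",   # doc 0
    "atomic physics lecture",      # doc 1
    "energy policy report",        # doc 2
]

index = {}
for doc_id, text in enumerate(documents):
    for word in text.split():
        index[word] = index.get(word, 0) | (1 << doc_id)

# The query "atomic AND energy" becomes a single bitwise AND of two binary numbers.
hits = index["atomic"] & index["energy"]
matching_docs = [doc_id for doc_id in range(len(documents)) if hits & (1 << doc_id)]
print(matching_docs)   # [0]
```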
 
Graph Theory and Web Crawler
Discrete mathematics is an important branch of contemporary mathematics and the mathematical foundation of computer science. It includes four branches: mathematical logic, set theory, graph theory, and modern algebra. Mathematical logic is based on Boolean operations. To automatically download all the web pages on the Internet, a crawler uses the traversal algorithms of graph theory, a field whose origin goes back to Euler.
The Internet is really one big graph: we can treat each web page as a node and the hyperlinks as arcs connecting the pages. When you click, the browser jumps to the corresponding page through these implied URLs; the URLs hidden behind the text are called "hyperlinks". With hyperlinks, we can start from any web page, use a graph-traversal algorithm to automatically visit every page, and store them. Programs that perform this function are called web crawlers.
In a web crawler, we use a data structure called a "hash table", rather than a notepad, to keep track of whether a web page has already been downloaded.
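A minimal crawler sketch along these lines: breadth-first traversal of a link graph with a hash set recording which pages have been downloaded. The hard-coded link_graph stands in for real HTTP fetching and link extraction.

```python
from collections import deque

# Stand-in link graph (in reality each page would be fetched and its links extracted).
link_graph = {
    "pageA": ["pageB", "pageC"],
    "pageB": ["pageA", "pageD"],
    "pageC": ["pageD"],
    "pageD": [],
}

def crawl(start_url):
    visited = set()            # the "hash table" of already-downloaded pages
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)       # "download and store" the page
        for link in link_graph.get(url, []):
            if link not in visited:
                queue.append(link)
    return visited

print(crawl("pageA"))   # all four pages are reached exactly once
```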
 
The Application of Information Theory in Information Processing
Two other important concepts in information theory, after entropy, are "mutual information" and "relative entropy" (the Kullback-Leibler divergence).
"Mutual information" is an extended concept of information entropy, which is a measure of the correlation between two random events.
In natural language processing, we often need to measure the correlation between linguistic phenomena. For example, in machine translation the hardest problem is word-sense ambiguity: the word "Bush" can be the name of a U.S. president, or it can mean a shrub. The concrete solution is roughly as follows: first, from a large amount of text, find the words with the greatest mutual information with Bush the president, such as President, United States, Congress, Washington, and so on; then, in the same way, find the words with the greatest mutual information with bush the plant, such as soil, plant, wild, and so on. With these two sets of words in hand, when translating "Bush" it suffices to look at which set of related words appears in the context.
Another important concept in information theory is "relative entropy", which some literature calls "cross entropy". In English it is the Kullback-Leibler divergence, named after its two proponents, Kullback and Leibler. Relative entropy measures how similar two positive functions are: for two identical functions, the relative entropy is zero. In natural language processing, relative entropy can be used to measure whether two commonly used words are synonymous (grammatically and semantically), whether the contents of two articles are similar, and so on. Using relative entropy, we can also derive one of the most important concepts in information retrieval: term frequency-inverse document frequency (TF-IDF).
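Both measures can be written down directly from their definitions. In the sketch below, the joint distribution for the Bush example and the toy distributions for relative entropy are invented numbers, used only to show the computation.

```python
import math

def mutual_information(joint, px, py):
    """I(X;Y) = sum over x,y of P(x,y) * log2( P(x,y) / (P(x)P(y)) )."""
    return sum(
        joint[x][y] * math.log2(joint[x][y] / (px[x] * py[y]))
        for x in joint for y in joint[x] if joint[x][y] > 0
    )

def kl_divergence(p, q):
    """Relative entropy D(P||Q) = sum over x of P(x) * log2( P(x)/Q(x) ); zero iff P == Q."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

# Invented joint distribution: seeing "Bush" (X) and seeing "president" (Y) in a sentence.
joint = {"bush": {"president": 0.30, "other": 0.10},
         "no_bush": {"president": 0.05, "other": 0.55}}
px = {"bush": 0.40, "no_bush": 0.60}
py = {"president": 0.35, "other": 0.65}
print(mutual_information(joint, px, py))   # > 0: the two events are correlated

print(kl_divergence({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}))   # 0.0 for identical functions
print(kl_divergence({"a": 0.9, "b": 0.1}, {"a": 0.5, "b": 0.5}))   # > 0 otherwise
```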
 
How to determine the relevance of web pages and queries
The phrase "application of atomic energy" can be divided into three key words: atomic energy, of, application. From our intuition, we know that pages that contain more of these three words should be more relevant than pages that contain less of them. Of course, this method has an obvious loophole, that is, long web pages are cheaper than short web pages, because long web pages generally contain more keywords. Therefore, we need to normalize the number of keywords according to the length of the web page, that is, divide the number of keywords by the total number of words on the web page. We call this quotient "keyword frequency", or "single text word frequency" (Term Frequency). appears 2 times, 35 times and 5 times, then their word frequencies are 0.002, 0.035 and 0.005 respectively. We add these three numbers, and the sum of 0.042 is the corresponding web page and the query "applications of atomic energy".
A simple measure of correlation. In general, if a query contains keywords w1,w2,...,wN, their word frequencies in a particular web page are: TF1, TF2, ..., TFN. (TF: term frequency). Then, the correlation between this query and the page is:
TF1 + TF2 + ... + TFN.
In the example above, however, the word "of" accounts for more than 80% of the total term frequency while being of almost no use for determining the theme of the page. We call such words "stop words", meaning their frequency should not be counted when measuring relevance. Moreover, "application" is a very general word while "atomic energy" is a very specialized one, and the latter should matter more than the former in the relevance ranking. Therefore we need to give each word a weight, and the weighting must satisfy the following two conditions:
1. The stronger the ability of a word to predict the topic, the greater the weight, and vice versa, the smaller the weight.
2. The weight of words that should be deleted should be zero.
In information retrieval, the most commonly used weight is the "inverse document frequency" (IDF), whose formula is log(D/Dw), where D is the total number of web pages and Dw is the number of web pages in which the word w appears.
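Putting term frequency and inverse document frequency together, here is a hedged sketch of the relevance score TF1*IDF1 + ... + TFN*IDFN. The three tiny documents are invented, and for simplicity each keyword is a single token; note that "of", which appears in every document, automatically gets weight log(D/Dw) = log(1) = 0.

```python
import math

documents = {
    "doc1": "application of atomic energy in power plants".split(),
    "doc2": "application of new teaching methods in schools".split(),
    "doc3": "the history of atomic physics".split(),
}
query = ["atomic", "energy", "of", "application"]

def term_frequency(word, doc):
    return doc.count(word) / len(doc)            # keyword count / total words on the page

def inverse_document_frequency(word, docs):
    d = len(docs)                                 # D: total number of documents
    dw = sum(1 for doc in docs.values() if word in doc)   # Dw: documents containing w
    return math.log(d / dw) if dw else 0.0        # weight is log(D / Dw)

def relevance(query, doc, docs):
    # Weighted word frequency: TF1*IDF1 + TF2*IDF2 + ... + TFN*IDFN
    return sum(term_frequency(w, doc) * inverse_document_frequency(w, docs)
               for w in query)

for name, doc in documents.items():
    print(name, round(relevance(query, doc, documents), 4))
```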
 
Finite State Machines and Address Recognition
Address recognition and analysis are essential techniques for local search, and although there are many ways to identify and analyze addresses, the most effective is the finite state machine.
A finite state machine is a special kind of directed graph. Every finite state machine has a start state, a stop state and several intermediate states. Each arc carries a condition for moving from one state to the next.
To use a finite state machine to recognize addresses, the key is to solve two problems: building the state machine from a set of valid addresses, and finding a matching algorithm for address strings once the machine is given. Fortunately, both problems have existing solutions. With a finite state machine for addresses, we can use it to analyze web pages, find the address portions, and build a local-search database. Similarly, we can analyze the query entered by the user and pick out the part that describes an address; the remaining keywords are then what the user is looking for. However, when the address the user enters is not standard or contains typos, the finite state machine is helpless, because it can only do strict matching.
To solve this problem, we would like to be able to do fuzzy matching and give the probability that a string is a correct address. To achieve this, scientists proposed probability-based finite state machines.
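A minimal strict-matching finite state machine sketch for address recognition. The states, the city list, and the street suffixes are invented; the second test shows how a small variant in the input breaks strict matching, which is exactly the limitation the probabilistic version is meant to fix.

```python
# Toy FSM: START -> CITY -> STREET -> ACCEPT, with invented word lists on the arcs.
cities = {"beijing", "shanghai", "seattle"}
street_suffixes = ("road", "street", "avenue")

def next_state(state, token):
    """Each arc carries the condition for moving from one state to the next."""
    if state == "START" and token in cities:
        return "CITY"
    if state == "CITY" and token.endswith(street_suffixes):
        return "STREET"
    if state == "STREET" and token.isdigit():
        return "ACCEPT"
    return None          # no arc matches: strict matching fails here

def is_address(text):
    state = "START"
    for token in text.lower().split():
        state = next_state(state, token)
        if state is None:
            return False
    return state == "ACCEPT"

print(is_address("Beijing Chang'an-avenue 1"))   # True
print(is_address("Beijing Chang'an avenue 1"))   # False: a small variant breaks strict matching
```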
 
The Cosine Theorem and News Classification (Junior High School Knowledge)
Google News is classified and organized automatically. Classifying news simply means putting similar news articles into the same category. A computer cannot actually read the news; it can only compute quickly. So we need to design an algorithm that computes the similarity of any two news articles, and to do that we need a way to describe a piece of news with a set of numbers.
For all the content words in a piece of news, we can compute their term frequency/inverse document frequency values (TF-IDF). It is not hard to imagine that the content words related to the news topic appear frequently and have large TF-IDF values. We arrange these TF-IDF values according to the positions of the words in the vocabulary; if a word in the vocabulary does not appear in the news, its corresponding value is zero. This vector represents the news and becomes its feature vector. If the feature vectors of two news articles are similar, the corresponding contents are similar and they should be grouped into one class, and vice versa.
Anyone who has studied vector algebra knows that a vector is really a directed line segment in a multidimensional space. If two vectors point in the same direction, that is, the angle between them is close to zero, then the two vectors are close. To determine whether two vectors point in the same direction, we use the law of cosines to compute the angle between them.
When the cosine of the angle between two news vectors equals one, the two articles are identical (this can be used to remove duplicate web pages); when the cosine is close to one, the two articles are similar and can be grouped into one class; the smaller the cosine of the angle, the less related the two articles are.
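A short sketch of the cosine measure over feature vectors. The two similar "news" vectors and the unrelated one are made-up stand-ins for real TF-IDF vectors over a full vocabulary.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

news_1 = [0.0, 0.12, 0.0, 0.30, 0.05]   # invented TF-IDF values, one slot per vocabulary word
news_2 = [0.0, 0.10, 0.0, 0.33, 0.04]   # similar story: angle close to 0, cosine near 1
news_3 = [0.25, 0.0, 0.40, 0.0, 0.0]    # unrelated story: cosine near 0

print(round(cosine_similarity(news_1, news_2), 3))   # ~0.995
print(round(cosine_similarity(news_1, news_3), 3))   # 0.0
```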
 
Information fingerprint and its application
Any piece of text can be mapped to a random number that is not too long, which serves as a fingerprint distinguishing it from other information.
Information fingerprints have a wide range of applications in encryption, information compression and processing.
We record the web addresses (URLs) that have already been visited in a hash table. However, storing URLs directly as strings in the hash table wastes both memory and lookup time, since today's URLs are generally long. Instead, we can find a function that maps each URL to 16 bytes rather than the original hundred or so, which cuts the memory needed to store URLs to about one sixth. This 16-byte random number is called the information fingerprint of the URL.
The key algorithm for generating information fingerprints is the pseudo-random number generator algorithm.
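A sketch of turning URLs into fixed-length fingerprints for deduplication. MD5 is used here only because its digest happens to be 16 bytes, matching the figure above; the book does not prescribe this particular function.

```python
import hashlib

def url_fingerprint(url: str) -> bytes:
    # MD5 chosen for illustration: it always yields a 16-byte digest.
    return hashlib.md5(url.encode("utf-8")).digest()

seen = set()
for url in ["https://example.com/a?id=1",
            "https://example.com/a?id=1",     # duplicate URL
            "https://example.com/b"]:
    fp = url_fingerprint(url)
    if fp in seen:
        print("already visited:", url)
    else:
        seen.add(fp)

print(len(url_fingerprint("https://example.com/a?id=1")))   # 16 bytes per URL
```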
The uses of information fingerprints go far beyond deduplicating URLs. The twin brother of the information fingerprint is the password. One characteristic of an information fingerprint is its irreversibility: the original information cannot be recovered from the fingerprint. This property is exactly what encrypted network transmission needs. For example, a website can identify different users by their cookies, which are information fingerprints.
 
Don't put all your eggs in one basket: on the maximum entropy model
When investing, it is often said not to put all your eggs in one basket, which can reduce risk. In information processing, the same principle applies. Mathematically, this principle is called the principle of maximum entropy. To put it bluntly, it is to retain all the uncertainty and minimize the risk.
The principle of maximum entropy states that when we need to predict the probability distribution of a random event, our prediction should satisfy all known conditions without making any subjective assumptions about the unknown. (It is important not to make subjective assumptions.) In this case, the probability distribution is the most uniform, and the risk of the prediction is the least. Because the information entropy of the probability distribution is the largest at this time, people call this model the "maximum entropy model".
How do we construct a maximum entropy model? Maximum entropy models all take the form of exponential functions, so we only need to determine the parameters of that exponential function; this process is called model training.
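A sketch of that exponential form, P(y|x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x). The two binary features and the lambda weights below are invented; in practice, determining the lambda values is exactly what model training does.

```python
import math

def features(x, y):
    # Two invented binary features tying context words to candidate labels.
    return [1.0 if ("president" in x and y == "person") else 0.0,
            1.0 if ("plant" in x and y == "shrub") else 0.0]

lambdas = [1.8, 2.1]          # assumed (pretend-trained) weights
labels = ["person", "shrub"]

def p_max_entropy(y, x):
    scores = {label: math.exp(sum(l * f for l, f in zip(lambdas, features(x, label))))
              for label in labels}
    z = sum(scores.values())  # the normalization factor Z(x)
    return scores[y] / z

context = ["the", "president", "spoke", "about", "bush"]
print(round(p_max_entropy("person", context), 3))   # higher: the context supports "person"
print(round(p_max_entropy("shrub", context), 3))
```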
What shines is not necessarily gold: on search engine spam
Since there have been search engines, there has been cheating on search engine page rankings (SPAM). As a result, users find that the top-ranked web pages in search engines are not necessarily high-quality. As the saying goes, what shines is not necessarily gold.
Although there are many ways to cheat a search engine, they all share one goal: to use improper means to raise the ranking of one's own web pages. The most common early method was to repeat keywords.
 
Matrix Operations and Two Classification Problems in Text Processing
The two most common classification problems in natural language processing are classifying texts by topic (for example, grouping all news about the Asian Games under sports) and classifying the words of a vocabulary by meaning (for example, grouping the names of various sports into one class). Both problems can be solved satisfactorily, and at the same time, with the method introduced in "The Cosine Theorem and News Classification" by means of matrix operations. In theory this algorithm is very good, but the computation takes particularly long for text classification. Another way is to use singular value decomposition from matrix algebra.
We can use a large matrix A to describe the correlation between, say, one million articles and half a million words. In this matrix, each row corresponds to an article and each column to a word; the element in row i, column j is the weighted frequency (for example, the TF-IDF value) of the j-th word of the dictionary in the i-th article.
Singular value decomposition breaks this large matrix into the product of three smaller matrices: in the example above, a one-million-by-one-hundred matrix X, a one-hundred-by-one-hundred matrix B, and a one-hundred-by-five-hundred-thousand matrix Y. The corresponding storage and computation shrink by more than three orders of magnitude.
The three matrices have very clear physical meanings. Each row of the first matrix X corresponds to an article, and its elements indicate how strongly that article belongs to each class of articles on the same topic; each column of the last matrix Y corresponds to a word, and its elements indicate how strongly that word belongs to each class of semantically related words; the matrix in the middle expresses the correlation between the article classes and the word classes. Therefore, a single singular value decomposition of the correlation matrix A completes the classification of synonyms and of articles at the same time (and also yields the correlation between each class of articles and each class of words).
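A small-scale sketch of the decomposition using NumPy. The 4-by-5 matrix of made-up weighted word frequencies stands in for the one-million-by-half-a-million matrix A, and keeping k = 2 singular values plays the role of the 100 classes.

```python
import numpy as np

# Invented weighted word frequencies: 4 articles (rows) x 5 words (columns).
A = np.array([
    [2.1, 1.8, 0.0, 0.0, 0.0],   # article 0: one cluster of vocabulary
    [1.9, 2.2, 0.1, 0.0, 0.0],   # article 1: same cluster
    [0.0, 0.0, 0.2, 2.5, 1.7],   # article 2: a different cluster
    [0.0, 0.1, 0.0, 2.3, 1.9],   # article 3: same as article 2
])

# Truncated SVD: keep only the k largest singular values, so A is approximately X @ B @ Y.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
X = U[:, :k]                 # one row per article, one column per article class (topic)
B = np.diag(s[:k])           # correlation between article classes and word classes
Y = Vt[:k, :]                # one column per word, one row per class of related words

print(np.round(X, 2))          # articles 0-1 load on one class, 2-3 on the other
print(np.round(Y, 2))          # words 0-1 group together, words 3-4 group together
print(np.round(X @ B @ Y, 1))  # reconstructs A approximately
```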
 
Extending Markov Chains: Bayesian Networks
When a directed graph is regarded as a network, the probability at each node can be computed with the Bayesian formula; this is how the Bayesian network gets its name. Since each arc of the network carries a degree of credibility, Bayesian networks are also called belief networks.
To use a Bayesian network it is necessary to know the probability of correlation between states. The process of getting these parameters is called training.
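A minimal two-node sketch (Disease -> Symptom) showing how the probability at a node is computed with the Bayesian formula. The conditional probability tables are invented; estimating such tables from data is what the training step produces.

```python
# Invented conditional probability tables for a two-node network.
p_disease = {"yes": 0.01, "no": 0.99}
p_symptom_given_disease = {"yes": {"yes": 0.90, "no": 0.10},
                           "no":  {"yes": 0.05, "no": 0.95}}

def p_symptom(symptom):
    """Marginal probability of the symptom, summing over the parent node."""
    return sum(p_disease[d] * p_symptom_given_disease[d][symptom]
               for d in p_disease)

def p_disease_given_symptom(disease, symptom):
    """Bayes' formula: P(D | S) = P(S | D) * P(D) / P(S)."""
    return (p_symptom_given_disease[disease][symptom] * p_disease[disease]
            / p_symptom(symptom))

print(round(p_symptom("yes"), 4))                        # 0.0585
print(round(p_disease_given_symptom("yes", "yes"), 4))   # ~0.1538
```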
 
Bloom Filters
Sets in a computer are usually stored in a hash table. The advantage is that lookups are fast and exact; the disadvantage is the cost in storage space. When the set is relatively small this is not a big problem, but when the set is huge, the low storage efficiency of the hash table becomes apparent.
The Bloom filter was proposed by Burton Bloom in 1970. It is essentially a long binary vector together with a series of random mapping (hash) functions.
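A minimal Bloom filter sketch: a long binary vector plus several hash functions derived from MD5 with different salts. The vector size and the number of hash functions are arbitrary illustrative choices.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # the long binary vector, kept as one big int

    def _positions(self, item):
        # Derive several hash functions by salting MD5 with the function index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means probably present
        # (false positives are possible, false negatives are not).
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
bf.add("https://example.com/page1")
print(bf.might_contain("https://example.com/page1"))   # True
print(bf.might_contain("https://example.com/page2"))   # almost certainly False
```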
 
Why simple pinyin-based input methods dominate: the average length of the full pinyin spelling of a Chinese character is 2.98 letters. As long as a pinyin input method can use context to fully resolve the homophone problem (one pronunciation mapping to many characters), the average number of keystrokes per character should be about three, so entering 100 characters per minute (roughly five keystrokes per second) is entirely possible.
 
Moore's Law: the performance of IT products such as computers doubles every 18 months; equivalently, an IT product of the same performance halves in price every 18 months.
An IT company that today sells only the same quantity of the same product as it did eighteen months ago will see its revenue cut in half. The IT industry calls this the reverse Moore's Law.


