Read "mathematical beauty" felt

  At first, when the teacher had us read "The Beauty of Mathematics", I did not understand why. This is not a language class, so why write a book review? And I could not see from the math itself what was supposed to be "beautiful" about it. But after finishing the book, I found that it really is useful.

  In fact, I have only read a few chapters so far, but the statistical language model discussed near the beginning not only caught my interest, it also gave me real inspiration. The book explains that if you want to know the probability that a word sequence S appears in text, you multiply together the conditional probability of each word in the sequence: P(S) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1,w2,...,wn-1), where P(w2|w1) is the probability that the second word appears given that the first word has already appeared. But if the probability of each word is conditioned on all of the n-1 words before it, the amount of computation is far too large to be practical. With the Markov assumption, the probability of any word wi depends only on the word wi-1 immediately in front of it, and the formula simplifies to: P(S) = P(w1) P(w2|w1) P(w3|w2) ... P(wn|wn-1).
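Here is a minimal sketch of that bigram model. The toy corpus, the add-one smoothing, and the function names are my own assumptions for illustration, not something taken from the book:

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams from tokenized sentences (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        words = ["<s>"] + words          # sentence-start marker
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def sentence_logprob(words, unigrams, bigrams, vocab_size):
    """log P(S) = sum of log P(w_i | w_{i-1}), with add-one smoothing."""
    words = ["<s>"] + words
    logp = 0.0
    for prev, cur in zip(words, words[1:]):
        # P(cur | prev) ~= (count(prev, cur) + 1) / (count(prev) + V)
        logp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return logp

# Toy corpus (assumed, for illustration only)
corpus = [["I", "learn", "C++"], ["I", "learn", "Java"], ["I", "like", "math"]]
uni, bi = train_bigram(corpus)
V = len(uni)
print(sentence_logprob(["I", "learn", "Java"], uni, bi, V))
print(sentence_logprob(["Java", "learn", "I"], uni, bi, V))  # lower: unusual word order
```

The second sentence gets a lower score because its word pairs were never seen together in the toy corpus, which is exactly how the model prefers fluent word sequences.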

  This reminded me of the statistical machine translation system from my freshman year: after setting up each machine translation system with Moses in the lab, we all had to measure its BLEU score. BLEU measures the degree of similarity between two sentences. A simple example: for a candidate S1 = "I learn C++" and a reference S2 = "I learn Java", the similarity is 2/3, where the numerator is the number of candidate words that appear anywhere in the reference translations (not necessarily in the same position), and the denominator is the number of words in the candidate. The reason position in the reference does not matter is that BLEU compares the machine translation against several reference translations at once and computes a combined score, rather than matching a single reference sentence word for word. To keep common words from inflating the score, the count of each candidate word is also clipped to the maximum number of times it appears in any single reference translation, and the final BLEU is computed from these clipped counts.
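A minimal sketch of the clipped (modified) unigram precision that BLEU is built from; real BLEU also combines higher-order n-grams and a brevity penalty, and the function name here is my own:

```python
from collections import Counter

def modified_unigram_precision(candidate, references):
    """Clipped unigram precision, the basic building block of BLEU.

    candidate:  list of words from the machine translation
    references: list of reference translations, each a list of words
    """
    cand_counts = Counter(candidate)
    clipped = 0
    for word, count in cand_counts.items():
        # Clip each word's count by the most times it appears in any one reference
        max_ref = max(ref.count(word) for ref in references)
        clipped += min(count, max_ref)
    return clipped / len(candidate)

# The example from the text: candidate "I learn C++" vs reference "I learn Java"
print(modified_unigram_precision(["I", "learn", "C++"], [["I", "learn", "Java"]]))  # 2/3
```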

  The book also uses a statistical model to solve the problem of Chinese word segmentation: use the statistical language model to compute the probability of the sentence under each candidate segmentation, and the segmentation with the highest probability is the best one.
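A minimal sketch of that idea, using a tiny unigram model instead of the full language model; the word probabilities, the example string, and the candidate segmentations are all assumed for illustration:

```python
import math

# Toy unigram probabilities for a few Chinese words (assumed values, illustration only)
word_prob = {"北京": 0.002, "大学": 0.003, "北京大学": 0.001, "生": 0.004, "大学生": 0.003}

def segmentation_logprob(words):
    """Score one candidate segmentation as the sum of log word probabilities.
    Unknown words get a very small probability, so they are heavily penalized."""
    return sum(math.log(word_prob.get(w, 1e-8)) for w in words)

def best_segmentation(candidates):
    """Pick the candidate segmentation with the highest probability."""
    return max(candidates, key=segmentation_logprob)

# Two candidate segmentations of the same string "北京大学生"
candidates = [["北京", "大学生"], ["北京大学", "生"]]
print(best_segmentation(candidates))
```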

This also made me think of the Corpus Preparation step when installing Moses:

Tokenize: insert a space between words and punctuation.
Truecasing: convert each word in a sentence to its most likely casing, which helps reduce data sparsity.
Cleaning: long sentences and empty lines can cause problems during training, so they are removed, and clearly misaligned sentence pairs are deleted as well (a small filtering sketch follows this list).
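A minimal sketch of such a cleaning filter; the length limit, the length-ratio threshold, and the example pairs are my own assumptions (Moses ships its own cleaning script for this step):

```python
def clean_parallel_corpus(pairs, max_len=80, max_ratio=3.0):
    """Drop sentence pairs that are empty, too long, or badly mismatched in length.

    pairs: list of (source_words, target_words) tuples
    """
    kept = []
    for src, tgt in pairs:
        if not src or not tgt:                            # empty lines break training
            continue
        if len(src) > max_len or len(tgt) > max_len:      # overly long sentences
            continue
        ratio = len(src) / len(tgt)
        if ratio > max_ratio or ratio < 1.0 / max_ratio:  # likely misaligned pair
            continue
        kept.append((src, tgt))
    return kept

# Example: the second pair is empty on the target side, the third is wildly mismatched
pairs = [
    (["我", "学习", "数学"], ["I", "learn", "math"]),
    (["你好"], []),
    (["好"], ["This", "is", "a", "very", "long", "unrelated", "sentence"] * 2),
]
print(clean_parallel_corpus(pairs))  # only the first pair survives
```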

When preprocessing the corpus, the Chinese side has to be word-segmented first; once the parallel corpus is prepared this way, it is convenient to align it with GIZA++.

 I believe the statistical language model described in "The Beauty of Mathematics" is a great help to statistical machine translation.

   Slowly read on "mathematical beauty" I have found to be able to learn a lot, mathematics and computer still inseparable, many algorithms, training models related to mathematics, I will continue to look after the "mathematical beauty," I believe there will be a deeper understanding. 
