NLP Series (1): Natural Language Processing Basics, Starting from Deciphering Alien Writing

Authors: longxinchen_ml && han_xiaoyang
Date: January 2016
Source:
http://blog.csdn.net/longxinchen_ml/article/details/50543337
http://blog.csdn.net/han_xiaoyang/article/details/50545650
Disclaimer: All rights reserved. Please contact the authors and credit the source when reprinting.

1. What would you do if you were asked to decipher Trisolaran writing?

Let's stretch our imagination a bit: suppose you have a USB flash drive full of Internet text written by the Trisolarans (the intelligent aliens in Liu Cixin's novels). How would you use this material to understand their civilization, and extract valuable technical intelligence from it? Of course, Trisolaran writing all looks like this:

(figure: a sample of Trisolaran writing)

"It's all garbled. I can't make anything of it!"

True enough. In fact, to a computer, human languages look no different from alien ones.

Getting computers to "understand" the information carried in human language, and even to respond the way humans do, is the core business of natural language processing (NLP).

So how do we analyze it? First, we look for the smallest observable unit. The alien writing appears to be built from square blocks, and each block can serve as a basic linguistic unit for analysis. By running some basic statistics over these blocks, we can roughly work out the basic vocabulary of the Trisolaran language: its common words, rare words, frequent fixed collocations, and so on. Evidently, statistical methods are a rather useful weapon here.
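As a concrete (and purely hypothetical) sketch of this kind of basic corpus statistics, suppose each alien glyph is transcribed as a letter and space-separated groups are candidate words; Python's `collections.Counter` is enough to tally glyph and word frequencies:

```python
from collections import Counter

# A tiny hypothetical "Trisolaran" corpus: each glyph is transcribed
# as a letter, and space-separated groups are candidate words.
corpus = "ab ac ab ad ab ac"

glyph_counts = Counter(corpus.replace(" ", ""))  # per-glyph frequencies
word_counts = Counter(corpus.split())            # per-"word" frequencies

most_common_word, count = word_counts.most_common(1)[0]
print(most_common_word, count)
```

The same counting, applied to a real annotated corpus, is the starting point for most of the statistical methods discussed below.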

Moreover, we notice that some blocks are separated by spaces, dividing the text into distinct segments. Can each segment be understood as a sentence? This task is **sentence segmentation**, another classic problem in natural language processing.

Similarly, blank lines can be treated as paragraph boundaries. By analogy with human languages, the first sentence of a paragraph probably carries more information than the rest.

Can we push the analysis further? That seems harder. But then you discover that part of the alien corpus on this USB drive is "labeled".
For example, you find a **"Trisolaran-Chinese bilingual dictionary"**: it turns out the Trisolarans have long been eavesdropping on Chinese communications, and they even compiled a dictionary. This dictionary is exactly what you need, because the word correspondences in it are a form of labeling, and the bilingual example sentences it contains are labeled information too. In NLP terms, these are "annotated corpora". With them, a machine translation model can be trained, for instance with neural networks.
Or you find that some of the text is organized like Douban reviews, with each passage carrying tags such as "like" and "dislike". From these tags you can count which words are more likely to appear in positive reviews than in negative ones; such words are probably "praise words". Likewise you can collect "criticism words". With these word lists, you can judge the polarity of other texts. This is the process of **sentiment analysis** in natural language processing.
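A minimal Python sketch of that counting idea, with made-up reviews and labels (the data, the ratio score, and the add-one smoothing are illustrative assumptions, not the article's method):

```python
from collections import Counter

def polarity_scores(reviews):
    """Score each word by how much more often it appears in "like"
    reviews than in "dislike" reviews (with add-one smoothing).

    `reviews` is a list of (text, label) pairs; the data and labels
    below are hypothetical examples.
    """
    pos, neg = Counter(), Counter()
    for text, label in reviews:
        words = text.lower().split()
        (pos if label == "like" else neg).update(words)
    scores = {}
    for word in set(pos) | set(neg):
        scores[word] = (pos[word] + 1) / (neg[word] + 1)
    return scores

reviews = [
    ("great story and great pacing", "like"),
    ("wonderful characters", "like"),
    ("boring plot", "dislike"),
    ("terrible and boring", "dislike"),
]
scores = polarity_scores(reviews)
```

Words with a score well above 1 are candidate "praise words"; those well below 1 are candidate "criticism words", which can then be used to judge the polarity of new texts.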

……

The lesson is that, when faced with an unknown language, the most direct approach seems to be to amass a large corpus, ideally annotated in various ways, and then run all kinds of statistics on it to uncover valuable information. Learning knowledge through statistical methods is the so-called **empiricist** perspective on natural language processing. Deep learning also requires statistics over large corpora, so it is empiricist in nature; at the same time, because of their interconnected network structure, deep neural networks also embody a **connectionist** perspective. We will discuss these questions more fully in later articles.

2. Problems natural language processing aims to solve

Natural language processing is, in fact, widely applied. For example:

  • Spam recognition

    Automatically analyze the text of an email to determine whether it is spam.

  • Chinese input method

    Given an input pinyin string, predict the Chinese characters the user intends to type.

  • Machine translation

    Translate text from one language to another, such as Chinese-English machine translation.

  • Automatic question answering, customer-service bots

    Given a question in text form, return a passage of text as the answer.

    ……

Here is a brief list of common NLP areas: word segmentation, part-of-speech tagging, named entity recognition, syntactic parsing, semantic analysis, spam detection, spelling correction, word sense disambiguation, speech recognition, text-to-speech conversion, machine translation, automatic question answering...

If you are not yet familiar with the application scenarios of natural language processing, you can visit Baidu's NLPC natural language processing platform and try a few examples to get a feel for them. Its APIs are quite rich; they are built on Baidu's decade-plus of accumulated NLP technology and the country's largest annotated corpus, and the results are excellent.

3. The development status of natural language processing

According to Stanford professor Dan Jurafsky:

  • Some problems are essentially solved, such as part-of-speech tagging, named entity recognition, and spam detection.

  • Some have seen great progress, such as sentiment analysis, coreference resolution, word sense disambiguation, syntactic parsing, machine translation, and information extraction.

  • Some remain full of challenges, such as automatic question answering, paraphrasing, summarization, and dialogue systems.

4. Simple classification of natural language processing problems

You may already sense that natural language processing problems are complex and not easy to organize systematically at a glance. Here we briefly list the mainstream categories:

  • Text classification problems: spam detection, pornographic-content detection, sentiment analysis. These problems amount to directly estimating the probability that an input text belongs to a given class.

  • Sequence labeling problems: word segmentation, part-of-speech tagging, named entity recognition. Here we must decide, for each token of the input sentence, questions such as: is this character the beginning, middle, or end of a word, or a single-character word? What is each word's part of speech? Is a given noun a person's name, a place name, or some other proper noun? These problems resemble classification, but the decision for each token depends on its context, sometimes on the whole sentence, and the decisions influence one another, which sets them apart from text classification.

  • Sequence generation problems: machine translation, question answering, reading comprehension, automatic summarization. In these problems the input and output sequences differ in length and rarely align one-to-one, making them more complex text generation problems.

  • Unsupervised learning problems: text clustering, topic models, and so on. The three categories above all rely on annotated corpora and are called supervised learning problems. When we have only a large unannotated corpus, we speak of unsupervised learning problems.

  • Other important problems, such as text representation (bag-of-words models, word vectors, sentence vectors, etc.), semantic similarity computation, and multi-task learning.

5. Basics of text processing

Text processing is often the foundation for solving NLP problems, and it matters a great deal; here is a brief introduction.

5.1 Regular expression

For string-type data such as English text, regular expressions handle simple processing well, e.g. crude stemming and case conversion.

Mainstream tools and programming languages all support regular expressions well: grep, awk, sed, Python, Perl, Java, C/C++. Many basic tasks can be accomplished with simple programs.
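For instance, a rough Python sketch of case conversion plus naive suffix stripping (the suffix list here is an illustrative assumption; real stemmers such as Porter's algorithm use far more careful rules):

```python
import re

text = "The Cats were RUNNING and jumped over boxes."

# Case conversion: lower-case everything.
lower = text.lower()

# A very crude "stemmer": strip a few common English suffixes.
# This over- and under-stems; it only illustrates the mechanism.
stemmed = re.sub(r"(ing|ed|es|s)\b", "", lower)

# Pull out the remaining alphabetic word stems.
tokens = re.findall(r"[a-z]+", stemmed)
print(tokens)
```

Even this toy rule turns "cats", "jumped", and "boxes" into "cat", "jump", and "box", which is often enough for quick corpus statistics.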

5.2 Word segmentation

For English, word segmentation (tokenization) is fairly intuitive: words are generally separated by spaces. But some expressions require careful judgment:

(figure: tokenization examples that need special handling)

This requires a few simple, condition-dependent judgment rules.

This approach works well for languages like English that have fixed separators. **But for Chinese, Japanese, German compounds, and our Trisolaran text above, it no longer applies; dedicated word segmentation techniques are required.** We will discuss them in a later article.

  • Example: Sharapova now lives in Florida in the southeastern United States.
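To make the contrast concrete, here is a small Python sketch comparing naive whitespace splitting with a regex tokenizer that splits punctuation into its own tokens (a simplistic stand-in for a real English tokenizer):

```python
import re

sentence = "Sharapova now lives in Florida in the southeastern United States."

# Naive approach: split on whitespace. The sentence-final period
# stays glued to the last word.
naive = sentence.split()

# Slightly better: match runs of word characters OR single
# punctuation marks, so "States" and "." become separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(naive[-1], tokens[-2:])
```

Whitespace splitting yields the token "States.", while the regex version separates "States" from the final ".", which is usually what downstream statistics want.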

5.3 Edit distance

Minimum edit distance (MED), also known as Levenshtein distance, is the minimum number of edit operations required to transform one string into the other.

Allowed editing operations include:

  • Replace one character with another character (substitution, s)

  • Insert a character (insert, i)

  • Delete a character (delete, d)

A simple schematic diagram is as follows:

(figure: schematic of the three edit operations)

We can compute the minimum edit distance with a dynamic programming algorithm. With unit costs it is commonly defined as follows: let D(i, j) be the edit distance between the first i characters of x and the first j characters of y. Then D(i, 0) = i, D(0, j) = j, and for i, j > 0:

D(i, j) = min( D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + (0 if x[i] = y[j], else 1) )

This defines a quantitative notion of "distance" between strings, and one that is easy to interpret.

In machine learning, many things can be done with a "distance": judging the similarity of two strings, for example, or driving classification and clustering.

In engineering, **edit distance can supply candidate words for spelling correction.** Suppose I type the word "girlfriand" with an English input method, but "girlfriand" is not in the lexicon. We can look up other strings within edit distance 1 or 2 of "girlfriand", such as "girlfriend" and "girlfriends", as correction candidates. The remaining problem is deciding which candidate is the most probable correction.
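A Python sketch of the dynamic-programming computation with unit costs, plus the candidate-generation idea from the paragraph above (the tiny lexicon is a hypothetical stand-in for a real dictionary):

```python
def edit_distance(a, b):
    """Minimum edit distance (Levenshtein) with unit costs,
    computed by dynamic programming over a (m+1) x (n+1) table."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                               # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[m][n]

# Spelling-correction candidates: lexicon words within distance <= 2.
# This four-word lexicon is a made-up stand-in for a real word list.
lexicon = ["girlfriend", "girlfriends", "boyfriend", "friend"]
typo = "girlfriand"
candidates = [w for w in lexicon if edit_distance(typo, w) <= 2]
```

Ranking the surviving candidates by how likely each is as a correction (e.g. by word frequency) is the remaining step the paragraph above mentions.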

6. Basics of Classification

A considerable share of natural language processing problems can be abstracted as classification problems, so here we add some basic knowledge about classification to ease later discussion.

6.1 Types of classification problems

  1. Binary classification: true-or-false questions

1.1 Sentiment analysis: decide whether a text is "positive" or "negative".

1.2 Spam detection: decide whether an email is "normal mail" or "spam".

  2. Multi-class classification: single-choice questions

  3. Multi-label classification: multiple-choice questions

Sometimes the multiple-choice setting is called soft classification, and the single-choice setting hard classification.

6.2 Evaluation metrics for multi-class classification

For ordinary binary classification, the usual evaluation metrics are recall, precision, and the F-score. Multi-class classification has analogous criteria. Let c_ij be the number of documents of true class c_i that the system assigns to class c_j; then:

Recall for class c_i: R_i = c_ii / Σ_j c_ij (the fraction of class-c_i documents correctly recovered).
Precision for class c_i: P_i = c_ii / Σ_j c_ji (the fraction of documents assigned to c_i that truly belong there).
The F-score combines the two, e.g. F1_i = 2 P_i R_i / (P_i + R_i).
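As a sketch, these per-class metrics can be computed from such a confusion matrix in a few lines of Python (the matrix values below are made up for illustration):

```python
def per_class_metrics(c):
    """Per-class (precision, recall, F1) from a confusion matrix.

    c[i][j] = number of documents whose true class is i that the
    system classified as j (same convention as c_ij in the text).
    """
    n = len(c)
    metrics = []
    for i in range(n):
        tp = c[i][i]
        row = sum(c[i])                       # documents truly in class i
        col = sum(c[k][i] for k in range(n))  # documents predicted as i
        recall = tp / row if row else 0.0
        precision = tp / col if col else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics

# Hypothetical 3-class confusion matrix.
c = [[8, 1, 1],
     [2, 6, 2],
     [0, 2, 8]]
metrics = per_class_metrics(c)
```

Macro-averaged scores follow by averaging these per-class values; micro-averaging instead pools the counts across classes first.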

7. Summary

This article has covered some of the shallower parts of natural language processing. Starting from deciphering alien writing, we derived NLP's empiricist perspective. Because real business scenarios are complicated, we gave the problems a simple taxonomy. Basic text processing, such as regular expressions, word segmentation, and sentence segmentation, supplies the common tools of NLP preprocessing, while edit distance is the traditional measure of similarity between two strings. With these basics in hand, you can tackle some typical natural language processing problems, such as text classification. We will introduce them one by one in the articles to come.
