Overview of Natural Language Processing

Natural language processing studies language at several levels of granularity, such as words, sentences, and documents.

The concept of hierarchy

Theoretical schools

1. Formal grammars (complex feature sets)

2. Lexical resources (WordNet, ConceptNet, FrameNet): manually curated concepts, hierarchies, structures, etc.

3. Statistical language models: language has statistical regularities, so let the machine learn the rules by itself.

 

Refining statistical language models (how to describe the structural composition of language: how words form phrases, sentences, and articles)

1. Words combine to form phrases (ignoring order and context information): describe a phrase as a bag of words (one-hot representation), as in the sketch after this list.

2. Combinations plus sequences form phrases: a distributed representation that keeps order and context information.
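
As an illustration of the bag-of-words idea, here is a minimal sketch (assuming scikit-learn 1.x is available): two sentences with different word order map to the same vector.

```python
# Minimal bag-of-words sketch with scikit-learn's CountVectorizer (assumed
# installed); word order is lost, only per-word counts remain.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the mat sat on the cat",   # same bag of words as the first sentence
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # vocabulary: one dimension per word
print(X.toarray())                         # identical rows: order information is gone
```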

 

Forms of distributed representation

1. A matrix description, e.g. a word–context co-occurrence matrix over n-word contexts (the dimension becomes too large); see the sketch after this list.

2. A neural network represents the n-gram model; the network structure describes the context of each word.

3. CBOW (Continuous Bag-of-Words) and Skip-gram: extract features by deep learning while simplifying the network structure.
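
For item 1, a rough plain-Python sketch of a word–context co-occurrence matrix over a hypothetical toy corpus; the matrix is vocabulary-by-vocabulary, which is why the dimensionality explodes for real data.

```python
# Rough sketch of a word-context co-occurrence matrix with a 1-word window on
# each side. The matrix is |V| x |V|, so the dimension grows with the vocabulary.
from collections import defaultdict

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]  # toy tokenized corpus

counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    for i, word in enumerate(sent):
        for j in (i - 1, i + 1):           # 1-word context window
            if 0 <= j < len(sent):
                counts[word][sent[j]] += 1

vocab = sorted({w for s in corpus for w in s})
for w in vocab:
    print(w, [counts[w][c] for c in vocab])  # one row of the co-occurrence matrix
```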

 

CBOW and skip-gram training methods

word2vec

 

Simply put: statistical model -> bag of words -> n-gram -> CBOW -> word2vec

A sentence can then be described as a sequence of word vectors.
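
A minimal sketch of training CBOW and Skip-gram vectors with Gensim's Word2Vec (assuming gensim 4.x and a toy corpus; real models need far more text):

```python
# Minimal Word2Vec sketch with Gensim (assumes gensim 4.x).
# sg=0 selects CBOW, sg=1 selects Skip-gram.
from gensim.models import Word2Vec

sentences = [
    ["natural", "language", "processing", "is", "fun"],
    ["word", "vectors", "capture", "context"],
]  # toy tokenized corpus; real models need much more text

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

vec = cbow.wv["language"]          # a dense word vector
print(vec.shape)                   # (50,)

# A sentence can then be represented as the sequence of its word vectors.
sentence_matrix = [cbow.wv[w] for w in sentences[0]]
print(len(sentence_matrix), sentence_matrix[0].shape)
```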

 

2. Preprocessing

1. Remove HTML tags

2. Normalize character encoding

3. Document -> sentences -> words (part-of-speech tagging, etc.)

4. Remove punctuation and words that are too short

5. Remove stop words 

6. Stemming/lemmatization: map participles, base forms, past tenses, and synonyms onto a single form
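
A rough sketch of these preprocessing steps with NLTK (assuming the required tokenizer, stopword, and WordNet resources have already been downloaded; the regex is only a crude stand-in for a real HTML parser):

```python
# Rough preprocessing sketch with NLTK (assumes nltk.download() has been run
# for the punkt tokenizer, stopwords, and wordnet resources).
import re
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

raw = "<p>The cats were sitting on the mats.</p>"

text = re.sub(r"<[^>]+>", " ", raw)          # 1. strip HTML tags (crude regex)
text = text.lower()                          # 2. normalize case
sentences = sent_tokenize(text)              # 3. document -> sentences
lemmatizer = WordNetLemmatizer()
stop = set(stopwords.words("english"))

tokens = []
for sent in sentences:
    for w in word_tokenize(sent):            # 3. sentence -> words
        if not w.isalpha() or len(w) < 3:    # 4. drop punctuation / short tokens
            continue
        if w in stop:                        # 5. drop stop words
            continue
        tokens.append(lemmatizer.lemmatize(w))  # 6. reduce to a base form

print(tokens)  # ['cat', 'sitting', 'mat'] (lemmatizer defaults to noun POS)
```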

 

3. Analysis (first split, then summarize and understand):

1. Word segmentation, tagging, word frequency statistics, etc.

2. Information extraction (recognizing phrases + identifying entities + extracting relations): unstructured -> structured (knowledge representation)

3. Automatic extraction of keywords and abstracts; similarity comparison (document level)

4. Topic extraction (single document), e.g. Gensim LDA: it is based on the bag-of-words model, so the results are not as good as CBOW-based approaches (a minimal sketch follows this list)

5. Classification and clustering (multiple documents)

6. Sentiment Analysis

7. Disambiguation

8. Syntactic analysis (predicate logic / SQL -> question answering and translation)

9. Abstract meaning? Entailment? Inference rules?
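
As mentioned in item 4, a minimal Gensim LDA sketch on a hypothetical toy corpus; the model sees only bag-of-words counts, so word order and context are ignored:

```python
# Minimal Gensim LDA sketch: topics are inferred from bag-of-words counts
# (assumes gensim is installed; toy pre-tokenized documents only).
from gensim import corpora
from gensim.models import LdaModel

texts = [
    ["stock", "market", "price", "trading"],
    ["match", "team", "player", "score"],
    ["market", "price", "investor"],
    ["team", "score", "goal"],
]

dictionary = corpora.Dictionary(texts)               # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]      # bag-of-words per document

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)                           # top words per topic
```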

 

Summary: abstracts, topics, sentiment, and classification all operate at the semantic level.

 

4. Applications

Recommender systems

Question answering systems

Dialogue systems

Machine translation

 

5. Basic Concepts

1. TF (term frequency): how often a word appears in a document

2. IDF (inverse document frequency): the logarithm of the total number of documents divided by the number of documents containing the word (high when the word appears in few documents)

TF-IDF (term frequency–inverse document frequency): the product of TF and IDF, which can be used as a document feature. The main idea of TF-IDF: if a word or phrase has a high TF in one article but rarely appears in other articles, it is considered to have good discriminating power between categories and is suitable for classification (see the sketch at the end of this list).

3. Named Entity Recognition (NER), also known as "proper name recognition", refers to identifying entities with specific meanings in text, mainly names of persons, places, and institutions, proper nouns, etc. Named entities are commonly grouped into three categories (entity, time, and number) and seven subcategories (person, institution, place, time, date, currency, and percentage).

4. n-gram: uses a context of n words to determine part of speech and meaning; that is, the next word depends on the previous n-1 words

5. WordNet: a lexical database of synonyms
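
To make the TF-IDF definition in item 2 concrete, a small plain-Python sketch over toy documents (real libraries such as scikit-learn add smoothing and normalization):

```python
# Plain-Python TF-IDF sketch: tf = count / doc_length,
# idf = log(N / number_of_docs_containing_word). Toy example only.
import math

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked"],
    ["the", "cat", "chased", "the", "dog"],
]

N = len(docs)

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs if word in d)   # documents containing the word
    idf = math.log(N / df)
    return tf * idf

print(tf_idf("cat", docs[0]))   # moderately discriminative word
print(tf_idf("the", docs[0]))   # appears everywhere, so idf = log(1) = 0
```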

6. Relevant libraries

Traditional bag-of-words based libraries:

Extracting key words and sentences: xiaoxu193/PyTeaser (similarity comparison)

Similarity comparison: nhirakawa/BM25

Sentiment analysis: sloria/TextBlob

NLTK

SnowNLP (an integrated toolkit)

 

Deep-learning-based libraries:

Headline/summary generation: https://github.com/rockingdingo/deepnlp/tree/master/deepnlp/textsum

Topic extraction: Gensim

Sentiment analysis: xiaohan2012/twitter-sent-dnn, wendykan/DeepLearningMovies

 

