Introduction to Natural Language Processing

1. Natural language processing (NLP)

 

Natural language processing is an important direction in computer science and artificial intelligence. It studies theories and methods for enabling effective communication between humans and computers in natural language, and it draws on linguistics, computer science, and mathematics. Because it deals with natural language, the language people use every day, it is closely related to linguistics, but with important differences: natural language processing is not the general study of natural language itself, but the development of computer systems, and in particular software systems, that can communicate effectively in natural language. It is therefore a part of computer science.

Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) language.

 

The main problems addressed by natural language processing include:

(1) Spam identification 

(2) Chinese input method 

(3) Machine translation 

(4) Automatic question answering and customer service robots 

Here is a brief list of some common NLP tasks: word segmentation, part-of-speech tagging, named entity recognition, syntactic analysis, semantic recognition, spam detection, spelling correction, word sense disambiguation, speech recognition, phonetic conversion, machine translation, automatic question answering, and so on.

 

 

2. Corpus knowledge  

A corpus is a collection of language material with a certain structure, representativeness, and scale, which can be retrieved by computer programs and is collected for one or more specific applications.

Corpus classification: ① by time ② by depth of processing (annotated vs. unannotated corpora) ③ by structure ④ by language ⑤ by degree of dynamic updating

Corpus construction principles: ① representativeness ② structure ③ balance ④ scale ⑤ metadata (descriptive information about the corpus)

The advantages and disadvantages of corpus annotation:

① Advantages: convenient for research, reusable, supports a variety of uses, and makes the analysis explicit.

② Disadvantages: the corpus loses objectivity (manual annotation has high accuracy but poor consistency, while automatic or semi-automatic annotation has high consistency but poor accuracy), annotations can be inconsistent, and accuracy is limited.

 

 

 

3. Machine Learning Dimensionality Reduction

Main methods: feature selection, random forests, principal component analysis (PCA), and linear dimensionality reduction; a small PCA sketch follows below.
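
As an illustration of one of these techniques, here is a minimal sketch of principal component analysis using scikit-learn. The random input data and the choice of 2 output components are arbitrary choices for the example, not values from the original text.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy feature matrix: 100 samples with 10 original features (made-up data).
rng = np.random.RandomState(0)
X = rng.rand(100, 10)

# Reduce to 2 principal components (the target dimension is an example value).
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```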

 

 

4. Naive Bayes Principle  

--> Preprocess the training text and build the classifier.

--> Construct the prediction (classification) function.

--> Preprocess the test data.

--> Classify the test data with the trained classifier (a minimal sketch of this pipeline follows below).
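
The following is a minimal sketch of that pipeline, assuming scikit-learn's multinomial Naive Bayes with a bag-of-words representation; the sample texts and labels are made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training texts and labels (spam vs. ham), for illustration only.
train_texts = ["win a free prize now", "meeting at ten tomorrow",
               "free cash offer click here", "project report attached"]
train_labels = ["spam", "ham", "spam", "ham"]

# Preprocess the training text: turn it into bag-of-words count vectors.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Build the classifier from the training data.
clf = MultinomialNB()
clf.fit(X_train, train_labels)

# Preprocess the test data with the same vocabulary, then classify it.
test_texts = ["free prize waiting for you", "see you at the meeting"]
X_test = vectorizer.transform(test_texts)
print(clf.predict(X_test))   # e.g. ['spam' 'ham']
```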

 

5. LIBSVM -- A Library for Support Vector Machines

SVMs (Support Vector Machines) are a useful technique for data classification. Although SVM is considered easier to use than neural networks, users not familiar with it often get unsatisfactory results at first. Here we outline a "cookbook" approach which usually gives reasonable results.

 

Note that this guide is not for SVM researchers, nor do we guarantee you will achieve the highest accuracy. Also, we do not intend to solve challenging or difficult problems. Our purpose is to give SVM novices a recipe for rapidly obtaining acceptable results.

 

Although users do not need to understand the underlying theory behind SVM, we briefly introduce the basics necessary for explaining our procedure. A classification task usually involves separating data into training and testing sets. Each instance in the training set contains one "target value" (i.e. the class label) and several "attributes" (i.e. the features or observed variables). The goal of SVM is to produce a model (based on the training data) which predicts the target values of the test data given only the test data attributes.
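
As a concrete illustration of this setup, here is a minimal sketch using scikit-learn's SVC, which is implemented on top of LIBSVM. The iris dataset, the RBF kernel, and the C/gamma values are example choices for the sketch, not recommendations from the original guide.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Example dataset: attributes (features) and target values (class labels).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Scale the attributes; SVMs are sensitive to feature ranges.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train an RBF-kernel SVM on the training set (C and gamma are example values).
model = SVC(kernel="rbf", C=1.0, gamma="scale")
model.fit(X_train, y_train)

# Predict the target values of the test data from its attributes only.
print("test accuracy:", model.score(X_test, y_test))
```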

 

 

 

6. Text word frequency algorithm idea:

1. Convert all documents, whatever their original format, into plain-text (txt) files, normalize them (remove non-English content such as Chinese characters, punctuation, and extra whitespace), and remove stop words (a list of 891 stop words is used).

2. De-duplicate the cleaned words and count word frequencies, using a Map to store word-frequency entries. (An array would also work, but with very large data, arrays run into out-of-bounds problems.) Sort the entries by frequency or alphabetically.

3. Extract the core vocabulary, e.g. words occurring more than 5 and fewer than 25 times; the thresholds can be set as needed. When traversing the list of entries, the size of the selected vocabulary is controlled by checking each entry's frequency attribute. (A minimal sketch of this procedure follows below.)
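
Here is a minimal sketch of these three steps in Python, assuming the documents have already been converted to plain text; the stop-word list, the input file name, and the 5/25 thresholds below are placeholders for the values described above.

```python
import re
from collections import Counter

# Placeholder stop-word list; the original uses a list of 891 stop words.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def clean_and_tokenize(text):
    # Step 1: keep only English words (drop Chinese characters, punctuation,
    # digits), lowercase them, and remove stop words.
    words = re.findall(r"[A-Za-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def core_vocabulary(text, low=5, high=25):
    # Step 2: count word frequencies (Counter plays the role of the Map,
    # storing word -> frequency) and sort by frequency.
    counts = Counter(clean_and_tokenize(text))
    ranked = counts.most_common()

    # Step 3: keep only the "core" words whose frequency lies between the thresholds.
    return [(word, freq) for word, freq in ranked if low < freq < high]

if __name__ == "__main__":
    with open("document.txt", encoding="utf-8") as f:  # hypothetical input file
        print(core_vocabulary(f.read()))
```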

 

 
