[NLP] Natural language processing steps

Table of Contents

Get corpus

Corpus preprocessing

Feature engineering

Feature selection

Model training

Model evaluation

Model prediction



Natural Language Processing (NLP) is a subfield of artificial intelligence that uses computers to process human language. An NLP project generally involves the following steps.

Get corpus

A collection of language materials and texts is called a corpus. In machine learning, we usually treat one line of data used for model training as a "text", whereas in daily life we usually call a whole file a text. This difference easily confuses beginners. For example, the raw data may come from many files, but after preprocessing and feature engineering those files may be merged so that each training sample becomes a single line; at that point the input files no longer correspond to "texts". In natural language processing, the notion of a text refers to the form the data takes after feature engineering.

How to obtain corpus:

(1) Existing corpus, that is, corpus accumulated by own business

(2) Online crawling, that is, corpus acquired on the Internet through crawlers and other tools

(3) Public corpus, corpus published by some companies or research institutions

Corpus preprocessing

In engineering applications of natural language processing, corpus preprocessing accounts for roughly 50% or more of the total workload, so most of a developer's time is spent here. Preprocessing usually involves the following important tasks:

(1) Data cleaning, that is, keeping what is useful in the corpus through operations such as deduplication, alignment, deletion, merging, and splitting.

(2) Word segmentation. A corpus usually consists of sentences and paragraphs, and in Chinese text in particular there are no obvious boundaries between words, so sentences or paragraphs must be segmented into words. Common word segmentation methods include:

1) Word segmentation method based on understanding;

2) Rule-based word segmentation method;

3) Word segmentation method based on statistics;

4) String-matching-based word segmentation method.
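As a minimal sketch of the string-matching approach, the following implements forward maximum matching (FMM) in pure Python. The tiny dictionary is a hypothetical example for illustration; real systems use large dictionaries or libraries such as jieba.

```python
# Forward maximum matching (FMM): at each position, greedily match the
# longest dictionary word. The toy dictionary below is illustrative only.
DICT = {"自然", "语言", "自然语言", "处理", "爱"}
MAX_LEN = max(len(w) for w in DICT)

def fmm_segment(sentence):
    """Segment a sentence by forward maximum matching against DICT."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(MAX_LEN, len(sentence) - i), 0, -1):
            chunk = sentence[i:i + size]
            # Accept a dictionary match, or fall back to a single character.
            if size == 1 or chunk in DICT:
                words.append(chunk)
                i += size
                break
    return words

print(fmm_segment("我爱自然语言处理"))  # ['我', '爱', '自然语言', '处理']
```

Greedy matching is simple but can make mistakes on ambiguous strings, which is why statistics-based methods are often preferred in practice.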

(3) Part-of-speech tagging, i.e., labeling each word with its part of speech, such as adjective, verb, or noun. Part-of-speech tagging is not always necessary: text classification, for example, usually does not need it, while sentiment analysis does care about parts of speech.

(4) Stop-word removal. Words that are useless or contribute little to downstream processing are called stop words, such as pronouns, modal particles, and punctuation marks; they are generally removed after word segmentation.
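Stop-word removal is typically a simple filter over the segmented tokens. The stop-word set below is a toy example; production systems load a curated list.

```python
# Filter out stop words from a segmented token list.
# The stop-word set here is a small illustrative sample.
STOP_WORDS = {"的", "了", "是", "，", "。"}

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["今天", "的", "天气", "很", "好", "。"]
print(remove_stop_words(tokens))  # ['今天', '天气', '很', '好']
```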

Feature engineering

After the corpus is preprocessed, we need to consider how to convert words into a form the computer can process, such as converting Chinese words into numbers. Commonly used methods are:

(1) The bag-of-words model (BOW), which ignores the order in which words appear, puts the words into a set, and represents each text by the number of times each word occurs.

(2) Vector models, which convert words into vectors (a matrix), such as one-hot encoding and word2vec.
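The bag-of-words idea can be sketched in a few lines of pure Python: build a fixed vocabulary, then map each text to a vector of word counts. The tiny corpus below is a made-up example.

```python
from collections import Counter

# Bag-of-words: represent each (already segmented) text by word counts,
# ignoring word order. The corpus is a toy example.
corpus = [["machine", "learning", "is", "fun"],
          ["deep", "learning", "is", "machine", "learning"]]

# Fix the vocabulary order so every text maps to the same vector space.
vocab = sorted({w for doc in corpus for w in doc})

def bow_vector(doc):
    counts = Counter(doc)
    return [counts[w] for w in vocab]

for doc in corpus:
    print(bow_vector(doc))
# vocab: ['deep', 'fun', 'is', 'learning', 'machine']
# → [0, 1, 1, 1, 1] and [1, 0, 1, 2, 1]
```

A one-hot vector for a single word is the special case where exactly one entry is 1 and the rest are 0.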

Feature selection

After feature engineering there are generally many feature vectors. Feature selection means choosing the features that contribute most to model training. This step is very important: experienced developers with a solid grasp of model theory can often pick the right features, which greatly reduces training time and improves efficiency.
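As one simple illustration (not the only technique — chi-square or mutual information are common in practice), a filter-style selector can keep the top-k terms by document frequency. The documents below are hypothetical.

```python
from collections import Counter

# Filter-style feature selection: keep the k terms that appear in the
# most documents. Toy data for illustration only.
docs = [["spam", "offer", "free"],
        ["meeting", "free", "today"],
        ["free", "offer", "now"]]

def select_top_k(docs, k):
    # Document frequency: count each word once per document.
    df = Counter(w for doc in docs for w in set(doc))
    return [w for w, _ in df.most_common(k)]

print(select_top_k(docs, 2))  # ['free', 'offer']
```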

Model training

For different business problems, choose an appropriate model to train. These models can come from open-source algorithm frameworks or be developed in-house; examples include Naïve Bayes, SVM, FP-Growth, and LSTM. During training, watch for over-fitting and under-fitting. Over-fitting can be mitigated by adding training data and increasing regularization; under-fitting can be addressed by increasing model complexity, reducing regularization, or adding feature dimensions.
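To make one of the named models concrete, here is a minimal multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing, sketched in pure Python. The training data is invented for illustration; real projects would typically use a library such as scikit-learn.

```python
import math
from collections import Counter, defaultdict

# Toy training data: (segmented words, label) pairs.
train = [(["buy", "cheap", "pills"], "spam"),
         (["cheap", "offer", "now"], "spam"),
         (["meeting", "at", "noon"], "ham"),
         (["project", "meeting", "notes"], "ham")]

class NaiveBayes:
    def fit(self, data):
        self.class_counts = Counter(label for _, label in data)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for words, label in data:
            self.word_counts[label].update(words)
            self.vocab.update(words)
        self.total = sum(self.class_counts.values())

    def predict(self, words):
        best, best_score = None, float("-inf")
        for label in self.class_counts:
            # log prior + sum of log likelihoods with add-one smoothing
            score = math.log(self.class_counts[label] / self.total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

nb = NaiveBayes()
nb.fit(train)
print(nb.predict(["cheap", "pills", "now"]))  # 'spam'
```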

Model evaluation

Model evaluation is to measure whether the trained model has reached the established goal. Commonly used evaluation methods are:

(1) Accuracy, precision, recall, specificity, and sensitivity

(2) F-Measure, ROC curve, AUC, PR curve
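Precision, recall, and F1 (the most common F-Measure) follow directly from the confusion-matrix counts. A pure-Python sketch, with invented labels where the positive class is 1:

```python
# Compute precision, recall, and F1 from true vs. predicted binary labels.
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```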

Model prediction

After the model is trained and evaluated, it can be used to make predictions on business data. In real production, the same business problem is usually predicted with multiple models, whose results are then analyzed and compared.


Origin blog.csdn.net/henku449141932/article/details/110224532