Using fastText to build your first text classifier

Foreword

I've recently been doing some work on intent recognition, so I tried building a text classifier with fastText. My study notes follow.

Brief introduction

First of all, what do we use fastText for? Text classification: given a piece of text, decide which category it belongs to.

The goal of text classification is to assign documents (such as emails, blog posts, SMS messages, product reviews, etc.) to one or more categories. These categories might be based on review scores, spam vs. non-spam, or the language the document is written in. Today, the main way to build such a classifier is machine learning, i.e., learning classification rules from examples. To build one, we need labeled data, which consists of documents paired with their corresponding categories (also called tags or labels).

What is fastText?

fastText is a fast, open-source text classification library from Facebook. It provides simple and efficient methods for text classification and for learning word representations, approaching the accuracy of deep models while being much faster.

Principles

I'm going to skip the theory in this post. There are already plenty of good articles about the principles online, and frankly my own understanding isn't solid enough to explain them any better here; I'm just an engineer, and my job is to make things work. If you're interested, search for them yourself or see the links at the end of this article.

Practical application

First, understand that fastText is just a toolkit; how you use it and what you build with it are up to you. Here I chose to train the model from the command line and then serve it online with Java. You can of course train and serve in a variety of languages, because fastText has bindings for several of them.

Download and install

We can download an official release directly:


wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
unzip v0.1.0.zip


Personally, I recommend cloning the project directly from GitHub instead:

git clone [email protected]:facebookresearch/fastText.git

Then enter its directory and run make.

After installation, you can run the command directly with no arguments to get the help text.

(Screenshot: fastText command-line help output.)

Data processing

The official tutorial trains on a Stack Exchange dataset, which of course works, but I think we'd rather look at some Chinese training samples.

First, let me describe the format of the training samples. It looks like this:

__label__name 呼 延 十
__label__name 张 伟
__label__city 北京
__label__city 西安

Each line of the text file contains one training sample: the label followed by the corresponding document. All labels start with the __label__ prefix, which is how fastText tells a label apart from an ordinary word. The model is then trained to predict the label for a given document.

Note that when generating your samples, you need to split them into training and test sets; the common ratio is train:test = 8:2.

My own training samples contain city names (city), person names (name), and a number of other labels: about 40 million training samples and 10 million test samples, mostly generated from dictionaries whose entries had already been verified. For better results, make the samples as accurate, and as numerous, as possible.
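To make the data-generation step concrete, here is a minimal Python sketch, with hypothetical dictionary contents and output file names, that emits __label__-prefixed samples from dictionaries and splits them 8:2 into train and test sets:

```python
import random

# Hypothetical label -> dictionary mapping; real dictionaries would
# hold millions of verified entries.
dictionaries = {
    "name": ["呼 延 十", "张 伟"],
    "city": ["北京", "西安"],
}

# One sample per line: "__label__<label> <text>", the format fastText expects.
samples = [f"__label__{label} {text}"
           for label, entries in dictionaries.items()
           for text in entries]

random.shuffle(samples)
split = int(len(samples) * 0.8)  # train:test = 8:2
train, test = samples[:split], samples[split:]

with open("data.train", "w") as f:
    f.write("\n".join(train))
with open("data.test", "w") as f:
    f.write("\n".join(test))
```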

Training

Run the following command and you'll see output similar to this; wait for it to finish (-input is your training data, -output is the name of the model file to write):

./fasttext supervised -input data.train -output model_name
Read 0M words
Number of words:  14598
Number of labels: 734
Progress: 100.0%  words/sec/thread: 75109  lr: 0.000000  loss: 5.708354  eta: 0h0m 

After training completes, you can run the model against your test set to see some key metrics:

The test subcommand takes your model file and your test dataset. The metrics below are precision and recall, which I'll explain later.

./fasttext test model_name.bin data.test              
N  3000
P@5  0.0668
R@5  0.146
Number of examples: 3000

To spot-check some common cases directly, we can run the predict command interactively. Some of my tests:

(Screenshot: interactive prediction results.)

Tuning

First, let's define precision and recall.

Precision is the number of correct labels among the labels predicted by fastText. Recall is the number of true labels that were successfully predicted. An example to illustrate:

Why not put knives in the dishwasher?

On Stack Exchange, this sentence is tagged with three labels: equipment, cleaning, and knives. The model's top five predicted labels can be obtained with:

>> ./fasttext predict model_cooking.bin - 5
The top five are food-safety, baking, equipment, substitutions, and bread.

So one of the five labels predicted by the model is correct, giving a precision of 0.20. Of the three true labels, only equipment was predicted by the model, giving a recall of 0.33.
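The arithmetic above can be checked with a small sketch of precision@k and recall@k as fastText reports them:

```python
def precision_recall_at_k(predicted, true_labels):
    """P@k: fraction of the k predicted labels that are correct.
    R@k: fraction of the true labels that appear in the predictions."""
    hits = len(set(predicted) & set(true_labels))
    return hits / len(predicted), hits / len(true_labels)

# The Stack Exchange example from above.
predicted = ["food-safety", "baking", "equipment", "substitutions", "bread"]
true_labels = ["equipment", "cleaning", "knives"]

p, r = precision_recall_at_k(predicted, true_labels)
print(f"P@5 = {p:.2f}, R@5 = {r:.2f}")  # P@5 = 0.20, R@5 = 0.33
```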

Whether or not our goal is to predict multiple labels, we obviously want both of these values to be as high as possible.

Cleaning up the samples

Our samples are generated by a program, so their correctness can't be guaranteed; manual labeling would be best, but annotating data at the million scale by hand is hard. At the very least we should do some basic cleanup: removing stop words, stripping symbols, lowercasing, and similar normalization. In theory, anything that doesn't help distinguish your classes should be removed.
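A minimal cleanup sketch along those lines, assuming a hypothetical stop-word list (use one suited to your language):

```python
import re

STOP_WORDS = {"the", "a", "an"}  # hypothetical; pick a list for your language

def clean(text):
    """Lowercase, strip punctuation/symbols, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # \w matches Unicode word chars in Python 3
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean("Why not put the Knives in the dishwasher?"))
# why not put knives in dishwasher
```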

More epochs and a better learning rate

In short, tweak the runtime parameters: make training run for more epochs with a better learning rate by adding -lr 1.0 -epoch 25. You can of course keep adjusting and testing against your own data.

Using the n-gram

This builds on the earlier model: the earlier training runs did not use the n-gram feature, so word order was not taken into account. If n-grams are new to you, it's worth a quick read-up.
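For intuition, the word bigrams that -wordNgrams 2 adds can be sketched like this (an illustrative helper, not fastText's internal code):

```python
def word_ngrams(tokens, n):
    """All contiguous n-token sequences; these let the model see local word order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "why not put knives in dishwasher".split()
print(word_ngrams(tokens, 2))
# ['why not', 'not put', 'put knives', 'knives in', 'in dishwasher']
```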

This is the final training command:


./fasttext  supervised -input data.train -output ft_model -epoch 25 -lr 1.0 -wordNgrams 2 -bucket 200000 -dim 50 -loss hs


And here are the precision and recall on my test set:

N       10997060
P@1     0.985
R@1     0.985

After the few simple steps above, precision reached 98.5%, which is already a good result for my use case, so I stopped optimizing there for now. If I optimize further later, I'll update this article and record the methods used.

Demo

First, add the dependency to your pom file:

        <dependency>
            <groupId>com.github.vinhkhuc</groupId>
            <artifactId>jfasttext</artifactId>
            <version>0.4</version>
        </dependency>


Then the code is simple:

import com.github.jfasttext.JFastText;

/**
 * Created by pfliu on 2019/11/17.
 */
public class FastTextDemo {

    public static void main(String[] args) {
        JFastText jt = new JFastText();
        // Load the model file produced by the command-line training step.
        jt.loadModel("/tmp/ft_model_5.bin");

        // Predict the label for a whitespace-tokenized input,
        // matching the tokenization used in the training samples.
        String ret = jt.predict("呼 延 十");
        System.out.println(ret);
    }
}

The Python code is even simpler:

(Screenshot: the equivalent Python prediction code.)

Of course, remember to install the package first: pip3 install fasttext.

Related articles

fastText Principles and Practice

Natural Language Processing n-gram Model


Done.

ChangeLog

2019-11-17 completed

The above is all my personal understanding; if anything is wrong, corrections in the comments are welcome.

Reprints are welcome; please credit the author and keep the original link.

Contact E-mail: [email protected]

For more study notes, see my personal blog or follow my WeChat official account: Huyan ten.


Origin juejin.im/post/5dd158866fb9a0202a602bfd