Introduction

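fastText is a library for efficient learning of word representations and sentence classification. It ships as a command-line tool together with a set of example scripts.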

Requirements

fastText generally builds on Mac OS and Linux. Because it uses C++11 features, it needs a compiler with good C++11 support, such as:

(g++-4.7.2 or newer) or (clang-3.3 or newer)

Compilation is driven by a Makefile, so you need a working make, or alternatively cmake (version 2.8.9 or newer).
To run the word-similarity evaluation you will also need:

Python 2.6 or newer
NumPy
SciPy

To use fastText from Python you will also need:

Python 2.7 or >=3.4
NumPy
SciPy
pybind11

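Installation

Getting the source code

The latest stable release is available on GitHub.
If you are a developer, you can build directly from the master branch, though it may be unstable.

Building fastText with make (recommended)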

$ wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
$ unzip v0.1.0.zip
$ cd fastText-0.1.0
$ make

This produces object files for all the classes as well as the main binary, fasttext.

Building fastText with cmake

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install

Installing fastText for Python

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip install .

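Usage

fastText has two main use cases: learning word representations (paper) and text classification (paper).

Learning word representations

To learn word vectors, run: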

$ ./fasttext skipgram -input data.txt -output model

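Here data.txt is a training file containing UTF-8 encoded text. When training finishes, the program writes two files: model.bin and model.vec. model.vec is a text file containing the learned word vectors, one word per line. model.bin is a binary file containing all of the model's parameters and hyperparameters. The binary file can be used later to compute word vectors or to resume optimization.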
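As a rough illustration of the model.vec layout, the sketch below parses such a file in plain Python. It assumes the usual text layout: a header line with the vocabulary size and the vector dimension, then one line per word with the word followed by its components. The helper name load_vec and the sample data are made up for this example.

```python
import io

def load_vec(stream):
    """Read word vectors from a .vec-style text stream (assumed layout)."""
    # Header line: "<number of words> <dimension>"
    n_words, dim = map(int, stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return n_words, dim, vectors

# Tiny made-up sample standing in for a real model.vec file.
sample = io.StringIO("2 3\nthe 0.1 -0.2 0.3\ncat 0.5 0.0 -1.0\n")
n_words, dim, vectors = load_vec(sample)
print(n_words, dim, vectors["cat"])
```

With a real model you would pass open("model.vec", encoding="utf-8") instead of the in-memory sample.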

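Word vectors for out-of-vocabulary words

A trained model can compute vectors for words that did not appear in the training data. Given a text file queries.txt containing the words you want vectors for, run: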

$ ./fasttext print-word-vectors model.bin < queries.txt

or, equivalently:

$ cat queries.txt | ./fasttext print-word-vectors model.bin

This prints the vector of each word to standard output.
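Out-of-vocabulary vectors are possible because fastText represents a word through its character n-grams (controlled by the -minn and -maxn options). The sketch below shows a simplified version of that extraction only. It leaves out how fastText hashes n-grams into buckets and how the final vector is assembled. char_ngrams is a hypothetical helper, not part of fastText, and its default values are only illustrative.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word`, with < and > marking its boundaries."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

print(char_ngrams("where", minn=3, maxn=3))
# → ['<wh', 'whe', 'her', 'ere', 're>']
```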
fastText ships an example script:

$ ./word-vector-example.sh

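The script compiles the code, downloads the data, trains a model on the enwik9 corpus, and uses that model to compute word vectors for the RW dataset.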
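Vectors printed by print-word-vectors can be compared directly, for instance with cosine similarity. The sketch below is a minimal version of that. It assumes each output line holds a word followed by its space-separated components. The two sample lines are invented, and parse_vector_line and cosine are hypothetical helpers.

```python
import math

def parse_vector_line(line):
    """Split 'word v1 v2 ...' into (word, [v1, v2, ...])."""
    parts = line.split()
    return parts[0], [float(x) for x in parts[1:]]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Invented lines in the assumed output format.
output = ["king 1.0 2.0 2.0", "queen 2.0 4.0 4.0"]
vectors = dict(parse_vector_line(line) for line in output)
similarity = cosine(vectors["king"], vectors["queen"])
print(round(similarity, 6))
```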

Text classification

fastText can also train supervised text classifiers, for example for sentiment analysis. To train one, run:

$ ./fasttext supervised -input train.txt -output model

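Here train.txt is a text file with one training sentence per line along with its labels. By default, labels are words carrying the prefix __label__. Training writes two files: model.bin and model.vec. Once the model is trained, you can evaluate it on a test set by computing precision and recall at k (P@k and R@k):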

$ ./fasttext test model.bin test.txt k

The argument k is optional and defaults to 1.
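To make the two metrics concrete: P@k is the fraction of predicted labels (up to k per example) that are correct, and R@k is the fraction of true labels that appear among those predictions, both pooled over the whole test set. The sketch below follows that definition with made-up labels. precision_recall_at_k is a hypothetical helper, not fastText's own implementation.

```python
def precision_recall_at_k(predictions, gold, k=1):
    """Micro-averaged P@k and R@k.

    predictions: per-example label lists, ordered most likely first
    gold: per-example sets of true labels
    """
    correct = n_predicted = n_gold = 0
    for ranked, truth in zip(predictions, gold):
        top_k = ranked[:k]
        correct += sum(1 for label in top_k if label in truth)
        n_predicted += len(top_k)
        n_gold += len(truth)
    if n_predicted == 0 or n_gold == 0:
        raise ValueError("need at least one prediction and one true label")
    return correct / n_predicted, correct / n_gold

# Made-up predictions and true labels for two test examples.
predictions = [["__label__pos", "__label__neu"], ["__label__neg", "__label__pos"]]
gold = [{"__label__pos"}, {"__label__pos", "__label__neu"}]
p1, r1 = precision_recall_at_k(predictions, gold, k=1)
p2, r2 = precision_recall_at_k(predictions, gold, k=2)
print(p1, r1, p2, r2)
```

In this toy data, raising k from 1 to 2 leaves precision at 0.5 while recall rises from 1/3 to 2/3.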
To obtain the k most likely labels for each line of a test set, use:

$ ./fasttext predict model.bin test.txt k

or use predict-prob to also get the probability of each of the k most likely labels:

$ ./fasttext predict-prob model.bin test.txt k

Here test.txt contains one piece of text to classify per line.
fastText ships an example script for this as well: classification-example.sh.
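The output of predict-prob is easy to post-process. The sketch below assumes each output line alternates labels and probabilities separated by spaces, and that labels carry the default __label__ prefix. The sample line and the helper parse_predictions are invented for illustration.

```python
def parse_predictions(line, prefix="__label__"):
    """Turn 'label prob label prob ...' into [(label, prob), ...]."""
    tokens = line.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating labels and probabilities")
    pairs = []
    for label, prob in zip(tokens[0::2], tokens[1::2]):
        if label.startswith(prefix):
            label = label[len(prefix):]  # drop the prefix for readability
        pairs.append((label, float(prob)))
    return pairs

# Invented line in the assumed output format.
line = "__label__positive 0.83 __label__neutral 0.12"
print(parse_predictions(line))
# → [('positive', 0.83), ('neutral', 0.12)]
```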
To compute vector representations of sentences, use:

$ ./fasttext print-sentence-vectors model.bin < text.txt

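Here text.txt contains the sentences you want vectors for. The output has one line per sentence, each holding that sentence's vector.
You can also quantize a model to reduce the space it takes up: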

$ ./fasttext quantize -output model

This creates a .ftz file, which takes up less space. All the standard commands, such as test or predict, work the same way on the quantized model:

$ ./fasttext test model.ftz test.txt

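The quantization procedure is described in the paper. fastText ships an example script: quantization-example.sh.

API documentation

To see the documentation for a command, run it without arguments and it will list all of its options: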

$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -verbose            verbosity level [2]

The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurences [1]
  -minCountLabel      minimal number of label occurences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]
