fastText in Practice

1. fastText in gensim

fastText can be used both to generate word vectors and to classify text. In gensim, however, the FastText module appears to offer only the word-vector functionality, and its usage closely mirrors word2vec, so only two examples are given here; for the rest, see the gensim word2vec walkthrough.

from gensim.models.fasttext import FastText
from gensim.test.utils import common_texts
print(common_texts)  # the training corpus
model = FastText(common_texts, size=5, window=5, min_count=1)  # sentences may be any iterable; gensim >= 4.0 renames size to vector_size
print(model.wv['human'])
print(model['human'])  # same lookup via the model itself (deprecated in newer gensim)
[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]
[ 0.0395214  -0.02950497  0.020394    0.00305868 -0.0096908 ]
[ 0.0395214  -0.02950497  0.020394    0.00305868 -0.0096908 ]
from gensim.models import FastText
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = FastText(min_count=1, size=5)  # an empty model, no training yet
model.build_vocab(sentences)  # prepare the model vocabulary
print(model.corpus_count, model.iter)  # 2 sentences seen; 5 epochs by default
model.train([['model', 'say'], ['cat', 'dog']], total_examples=model.corpus_count, epochs=model.iter)  # can be a non-repeatable, 1-pass generator
model.wv['meow']  # 'meow' is in the vocabulary built from sentences; train() only updates the network weights and does not extend the vocabulary, so looking up a word none of whose character n-grams are known raises a KeyError
2 5
array([ 0.04004489,  0.02346651,  0.02794289,  0.00852839, -0.06426186],
      dtype=float32)

2. fastText in the fasttext library

In the fasttext library, the fastText algorithm can be used both for training word vectors and for text classification.

2.1 Word vectors

import fasttext
model = fasttext.train_unsupervised('data/fil9')  # skipgram by default
model.words  # the vocabulary
model.get_word_vector('human')
model.save_model('result/fil9.bin')
model = fasttext.load_model('result/fil9.bin')
model.get_nearest_neighbors('asparagus')
model.get_analogies('berlin', 'germany', 'france')  # berlin - germany + france
  • train_unsupervised parameters (defaults in brackets; for details see the text-classification section):
    • input # training file path (required)
    • model # unsupervised fasttext model {cbow, skipgram} [skipgram]
    • lr # learning rate [0.05]
    • dim # size of word vectors [100]
    • ws # size of the context window [5]
    • epoch # number of epochs [5]
    • minCount # minimal number of word occurrences [5]
    • minn # min length of char ngram [3]
    • maxn # max length of char ngram [6]
    • neg # number of negatives sampled [5]
    • wordNgrams # max length of word ngram [1]
    • loss # loss function {ns, hs, softmax, ova} [ns]
    • bucket # number of buckets [2000000]
    • thread # number of threads [number of cpus]
    • lrUpdateRate # change the rate of updates for the learning rate [100]
    • t # sampling threshold [0.0001]
    • verbose # verbose [2]
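The minn/maxn settings above control fastText's character n-grams: each word is wrapped in the boundary markers < and >, and every substring of length minn to maxn becomes a subword feature, whose vectors are summed to form the word vector. A minimal pure-Python sketch of just the extraction step (the hashing of n-grams into bucket slots is omitted):

```python
def char_ngrams(word, minn=3, maxn=6):
    """Extract fastText-style character n-grams: wrap the word in the
    boundary markers '<' and '>' and enumerate all substrings whose
    length is between minn and maxn."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

print(char_ngrams("cat"))  # ['<ca', 'cat', 'at>', '<cat', 'cat>', '<cat>']
```

This is why fastText can produce vectors for out-of-vocabulary words: as long as some of a word's n-grams were seen in training, their vectors can be summed.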

2.2 Text classification

For text classification, fasttext requires each line of the training file to begin with __label__ followed by the label, then the tokens of the text (preferably separated by spaces).
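Converting (label, text) pairs into this line format is straightforward; the helper name below is my own, not part of the fasttext API:

```python
def to_fasttext_line(label, text):
    """Format one training sample as fastText expects:
    the label prefixed with __label__, then the space-separated tokens."""
    return f"__label__{label} {text.strip()}"

samples = [("baking", "How to bake a banana bread"),
           ("equipment", "Which pan works on an induction stove")]
lines = [to_fasttext_line(lab, txt) for lab, txt in samples]
print(lines[0])  # __label__baking How to bake a banana bread
# write one such line per sample into e.g. cooking.train
```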

import fasttext
model = fasttext.train_supervised(input='cooking.train')
model.save_model('model_cooking.ftz')
model = fasttext.load_model('model_cooking.ftz')
model.test('cooking.valid', k=1)  # evaluate the whole validation file; k is the number of predicted labels to consider. An output such as (3000, 0.124, 0.0541) gives the number of samples, precision at k, and recall at k
model.test_label('cooking.valid')['__label__baking']  # results for a single label
model.predict('Which baking dish is best to bake a banana bread')  # predict sample by sample
  • train_supervised parameters (defaults in brackets):
    • input # training file path (required)
    • lr # learning rate [0.1]
    • dim # size of word vectors [100]
    • ws # size of the context window [5]
    • epoch # number of epochs [5]
    • minCount # minimal number of word occurrences [1]
    • minCountLabel # minimal number of label occurrences [1]
    • minn # min length of char ngram [0]
    • maxn # max length of char ngram [0]
    • neg # number of negatives sampled [5]
    • wordNgrams # max length of word ngram [1]
    • loss # loss function {ns, hs, softmax, ova} [softmax]
    • bucket # number of buckets [2000000]
    • thread # number of threads [number of cpus]
    • lrUpdateRate # change the rate of updates for the learning rate [100]
    • t # sampling threshold [0.0001]
    • label # label prefix [__label__]
    • verbose # verbose [2]
    • pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []
  • model.predict parameters:
    • k: number of labels to return. With k=2 the two most likely labels are returned, e.g. (('__label__CTRL', '__label__AD'), array([0.50008875, 0.49993122])); by default only the single most likely label is returned.
    • threshold: probability cutoff; a label is returned only if its predicted probability exceeds this value.
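The interplay of k and threshold can be mimicked in plain Python: rank labels by probability, keep the top k, and drop those below the threshold. This is a sketch of the selection logic only, not the library's internals:

```python
def select_labels(probs, k=1, threshold=0.0):
    """Mimic predict's label selection: take the k most probable labels,
    keeping only those whose probability is at least the threshold."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return [(lab, p) for lab, p in ranked[:k] if p >= threshold]

probs = {'__label__CTRL': 0.50008875, '__label__AD': 0.49993122}
print(select_labels(probs, k=2))                  # both labels
print(select_labels(probs, k=2, threshold=0.5))   # only __label__CTRL survives
```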

2.3 Automatic hyperparameter tuning for text classification

fasttext can tune its hyperparameters automatically. The drawbacks are that each run produces different results, and the found hyperparameters cannot be exported directly — the only way to keep them is to save the trained model with save_model and reload it with load_model.

import fasttext
model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid')
# fasttext searches for the hyperparameters that give the best F1 score on the validation file

By default the search runs for 5 minutes; set autotuneDuration (in seconds) to change the search time.

model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid', autotuneDuration=600)
# to stop early, interrupt with CTRL+C; the best model found so far is kept

fasttext can also cap the model size and search for the best hyperparameters under that size budget:

model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid', autotuneModelSize="2M")

By default the F1 score is computed over all labels; it can also be restricted to a single label:

model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid', autotuneMetric="f1:__label__baking")

This corresponds to evaluating manually with:

model.test_label('cooking.valid')['__label__baking']

Sometimes we care about the two most likely predictions; the autotune target can be configured for that as well (optimizing over two predicted labels), which is equivalent to evaluating with:

model.test("cooking.valid", k=2)
Reposted from blog.csdn.net/weixin_43178406/article/details/102465629