fastText in Practice

1. fastText in gensim

fastText can be used both to generate word vectors and to classify text. In gensim, however, the FastText module appears to offer only the word-vector functionality, and its usage closely mirrors word2vec, so only two examples are given here; for the rest, see the gensim word2vec walkthrough.

from gensim.models.fasttext import FastText
from gensim.test.utils import common_texts
print(common_texts)  # the training corpus
model = FastText(common_texts, size=5, window=5, min_count=1)  # sentences may be any iterable; gensim >= 4.0 renames size to vector_size
print(model.wv['human'])
print(model['human'])  # same lookup via the model itself (deprecated in newer gensim)
[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]
[ 0.0395214  -0.02950497  0.020394    0.00305868 -0.0096908 ]
[ 0.0395214  -0.02950497  0.020394    0.00305868 -0.0096908 ]
from gensim.models import FastText
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = FastText(min_count=1, size=5)  # an empty model, no training yet
model.build_vocab(sentences)  # prepare the model vocabulary
print(model.corpus_count, model.iter)  # 2 sentences seen; 5 epochs by default
model.train([['model', 'say'], ['cat', 'dog']], total_examples=model.corpus_count, epochs=model.iter)  # can be a non-repeatable, 1-pass generator
model.wv['meow']  # 'meow' is in the vocabulary built from sentences; train() only updates the network weights and does not extend the vocabulary, so looking up a word none of whose character n-grams are known raises a KeyError
2 5
array([ 0.04004489,  0.02346651,  0.02794289,  0.00852839, -0.06426186],
      dtype=float32)

2. fastText in the fasttext library

In the fasttext library, the fastText algorithm can be used both for training word vectors and for text classification.

2.1 Word vectors

import fasttext
model = fasttext.train_unsupervised('data/fil9')  # skipgram by default
model.words  # the vocabulary
model.get_word_vector('human')
model.save_model('result/fil9.bin')
model = fasttext.load_model('result/fil9.bin')
model.get_nearest_neighbors('asparagus')
model.get_analogies('berlin', 'germany', 'france')  # berlin - germany + france
  • train_unsupervised parameters (defaults in brackets; for details see the text-classification section):
    • input # training file path (required)
    • model # unsupervised fasttext model {cbow, skipgram} [skipgram]
    • lr # learning rate [0.05]
    • dim # size of word vectors [100]
    • ws # size of the context window [5]
    • epoch # number of epochs [5]
    • minCount # minimal number of word occurrences [5]
    • minn # min length of char ngram [3]
    • maxn # max length of char ngram [6]
    • neg # number of negatives sampled [5]
    • wordNgrams # max length of word ngram [1]
    • loss # loss function {ns, hs, softmax, ova} [ns]
    • bucket # number of buckets [2000000]
    • thread # number of threads [number of cpus]
    • lrUpdateRate # change the rate of updates for the learning rate [100]
    • t # sampling threshold [0.0001]
    • verbose # verbose [2]
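The minn/maxn settings above control fastText's character n-grams: each word is wrapped in the boundary markers < and >, and every substring of length minn to maxn becomes a subword feature, whose vectors are summed to form the word vector. A minimal pure-Python sketch of just the extraction step (the hashing of n-grams into bucket slots is omitted):

```python
def char_ngrams(word, minn=3, maxn=6):
    """Extract fastText-style character n-grams: wrap the word in the
    boundary markers '<' and '>' and enumerate all substrings whose
    length is between minn and maxn."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

print(char_ngrams("cat"))  # ['<ca', 'cat', 'at>', '<cat', 'cat>', '<cat>']
```

This is why fastText can produce vectors for out-of-vocabulary words: as long as some of a word's n-grams were seen in training, their vectors can be summed.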

2.2 Text classification

For text classification, fasttext requires each line of the training file to begin with __label__ followed by the label, then the tokens of the text (preferably separated by spaces).
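Converting (label, text) pairs into this line format is straightforward; the helper name below is my own, not part of the fasttext API:

```python
def to_fasttext_line(label, text):
    """Format one training sample as fastText expects:
    the label prefixed with __label__, then the space-separated tokens."""
    return f"__label__{label} {text.strip()}"

samples = [("baking", "How to bake a banana bread"),
           ("equipment", "Which pan works on an induction stove")]
lines = [to_fasttext_line(lab, txt) for lab, txt in samples]
print(lines[0])  # __label__baking How to bake a banana bread
# write one such line per sample into e.g. cooking.train
```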

import fasttext
model = fasttext.train_supervised(input='cooking.train')
model.save_model('model_cooking.ftz')
model = fasttext.load_model('model_cooking.ftz')
model.test('cooking.valid', k=1)  # evaluate the whole validation file; k is the number of predicted labels to consider. An output such as (3000, 0.124, 0.0541) gives the number of samples, precision at k, and recall at k
model.test_label('cooking.valid')['__label__baking']  # results for a single label
model.predict('Which baking dish is best to bake a banana bread')  # predict sample by sample
  • train_supervised parameters (defaults in brackets):
    • input # training file path (required)
    • lr # learning rate [0.1]
    • dim # size of word vectors [100]
    • ws # size of the context window [5]
    • epoch # number of epochs [5]
    • minCount # minimal number of word occurrences [1]
    • minCountLabel # minimal number of label occurrences [1]
    • minn # min length of char ngram [0]
    • maxn # max length of char ngram [0]
    • neg # number of negatives sampled [5]
    • wordNgrams # max length of word ngram [1]
    • loss # loss function {ns, hs, softmax, ova} [softmax]
    • bucket # number of buckets [2000000]
    • thread # number of threads [number of cpus]
    • lrUpdateRate # change the rate of updates for the learning rate [100]
    • t # sampling threshold [0.0001]
    • label # label prefix [__label__]
    • verbose # verbose [2]
    • pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []
  • model.predict parameters:
    • k: number of labels to return. With k=2 the two most likely labels are returned, e.g. (('__label__CTRL', '__label__AD'), array([0.50008875, 0.49993122])); by default only the single most likely label is returned.
    • threshold: probability cutoff; a label is returned only if its predicted probability exceeds this value.
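The interplay of k and threshold can be mimicked in plain Python: rank labels by probability, keep the top k, and drop those below the threshold. This is a sketch of the selection logic only, not the library's internals:

```python
def select_labels(probs, k=1, threshold=0.0):
    """Mimic predict's label selection: take the k most probable labels,
    keeping only those whose probability is at least the threshold."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return [(lab, p) for lab, p in ranked[:k] if p >= threshold]

probs = {'__label__CTRL': 0.50008875, '__label__AD': 0.49993122}
print(select_labels(probs, k=2))                  # both labels
print(select_labels(probs, k=2, threshold=0.5))   # only __label__CTRL survives
```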

2.3 Automatic hyperparameter tuning for text classification

fasttext can tune its hyperparameters automatically. The drawbacks are that each run produces different results, and the found hyperparameters cannot be exported directly — the only way to keep them is to save the trained model with save_model and reload it with load_model.

import fasttext
model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid')
# fasttext searches for the hyperparameters that give the best F1 score on the validation file

By default the search runs for 5 minutes; set autotuneDuration (in seconds) to change the search time.

model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid', autotuneDuration=600)
# to stop early, interrupt with CTRL+C; the best model found so far is kept

fasttext can also cap the model size and search for the best hyperparameters under that size budget:

model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid', autotuneModelSize="2M")

By default the F1 score is computed over all labels; it can also be restricted to a single label:

model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid', autotuneMetric="f1:__label__baking")

This corresponds to evaluating manually with:

model.test_label('cooking.valid')['__label__baking']

Sometimes we care about the two most likely predictions; the autotune target can be configured for that as well (optimizing over two predicted labels), which is equivalent to evaluating with:

model.test("cooking.valid", k=2)
Reposted from blog.csdn.net/weixin_43178406/article/details/102465629