DOC2VEC: Parameters Involved, and the Parameters of WORD2VEC

DOC2VEC: Parameters Involved
class gensim.models.doc2vec.Doc2Vec(documents=None, dm_mean=None, dm=1, dbow_words=0, dm_concat=0, dm_tag_count=1, docvecs=None, docvecs_mapfile=None, comment=None, trim_rule=None, **kwargs)
Bases: gensim.models.word2vec.Word2Vec
Class for training, using and evaluating neural networks described in http://arxiv.org/pdf/1405.4053v2.pdf
Initialize the model from an iterable of documents. Each document is a TaggedDocument object that will be used for training.
The documents iterable can be simply a list of TaggedDocument elements, but for larger corpora, consider an iterable that streams the documents directly from disk/network.
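A streaming iterable of this kind can be sketched as follows. This is a minimal illustration, not gensim code: `TaggedDoc` is a stand-in namedtuple assumed here so the example runs without gensim, and the one-document-per-line file layout and the `StreamedCorpus` and `demo_stream` names are hypothetical.

```python
import os
import tempfile
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (an assumption so the sketch runs without gensim).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


class StreamedCorpus:
    """Yield one tagged document per non-empty line, reading the file lazily."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopening the file on every call lets a trainer make several passes.
        with open(self.path, encoding="utf-8") as handle:
            for line_no, line in enumerate(handle):
                words = line.split()
                if words:
                    yield TaggedDoc(words=words, tags=[line_no])


def demo_stream():
    """Write a tiny corpus to a temporary file and stream it twice."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "corpus.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("the quick brown fox\n\njumps over the lazy dog\n")
        corpus = StreamedCorpus(path)
        first_pass = list(corpus)
        second_pass = list(corpus)
    return first_pass, second_pass
```

Using a class with `__iter__` rather than a plain generator matters because training usually needs more than one pass over the data, and a generator is exhausted after the first.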
If you don’t supply documents, the model is left uninitialized – use if you plan to initialize it in some other way.
dm defines the training algorithm. By default (dm=1), ‘distributed memory’ (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is employed.
dm: the training algorithm. The default, 1, selects DM; with dm=0, DBOW is used.
size is the dimensionality of the feature vectors.
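· size: the dimensionality of the feature vectors; the default is 100. A larger size needs more training data but tends to give better results. Recommended values range from a few tens to a few hundred.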
window is the maximum distance between the predicted word and context words used for prediction within a document.
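window: the window size, i.e. the maximum distance within a sentence between the word being predicted and the context words used to predict it.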
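To make the window concrete, here is a small sketch that collects the context words around one position. It always uses the full window; real implementations may shrink the window randomly per position, so treat this as the upper bound only. The `context_words` helper is a hypothetical name.

```python
def context_words(tokens, position, window):
    """Return the tokens at most `window` positions away from tokens[position]."""
    start = max(0, position - window)
    end = min(len(tokens), position + window + 1)
    # Exclude the word at `position` itself: it is the one being predicted.
    return tokens[start:position] + tokens[position + 1:end]


tokens = "a b c d e f g".split()
print(context_words(tokens, 3, 2))  # context of "d" with window=2
```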
alpha is the initial learning rate (will linearly drop to min_alpha as training progresses).
alpha: the initial learning rate, which decreases linearly to min_alpha as training progresses.
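The linear decay can be written out in a few lines. This is a sketch of the schedule as described, not gensim's internal code; the `alpha` and `min_alpha` values below are illustrative assumptions, and `progress` is a hypothetical argument meaning the fraction of training completed.

```python
def linear_alpha(progress, alpha=0.025, min_alpha=0.0001):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training."""
    progress = min(max(progress, 0.0), 1.0)  # clamp to the valid range
    return alpha - (alpha - min_alpha) * progress


for fraction in (0.0, 0.5, 1.0):
    print(fraction, linear_alpha(fraction))
```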

seed = for the random number generator. Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread, to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization.)
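The effect of PYTHONHASHSEED can be checked without gensim. The sketch below launches fresh interpreters and compares the hash of the same string; `string_hash_in_fresh_interpreter` is a hypothetical helper written for this illustration.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(text, hash_seed):
    """Launch a new Python process and return hash(text) as computed there."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Two separate launches with the same PYTHONHASHSEED agree on the hash.
first = string_hash_in_fresh_interpreter("gensim", 42)
second = string_hash_in_fresh_interpreter("gensim", 42)
print(first == second)
```

For a reproducible training run this would be combined with a fixed seed and a single worker thread, as the note above says.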
min_count = ignore all words with total frequency lower than this.
min_count: truncates the vocabulary. Words that occur fewer than min_count times are discarded; the default is 5.
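The truncation amounts to a frequency filter over the corpus. A minimal sketch, assuming the corpus is a list of token lists; `trim_vocab` is a hypothetical helper, not a gensim function.

```python
from collections import Counter


def trim_vocab(sentences, min_count=5):
    """Count every word and keep only those seen at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


sentences = [["a", "b", "a"], ["a", "c", "b"]]
print(trim_vocab(sentences, min_count=2))  # "c" occurs once and is dropped
```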

max_vocab_size = limit RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit (default).
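max_vocab_size: limits RAM use while the vocabulary is being built. If the number of unique words exceeds this value, the least frequent ones are pruned. Every 10 million unique words need about 1GB of RAM. Set to None for no limit.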
sample = threshold for configuring which higher-frequency words are randomly downsampled; default is 1e-3, values of 1e-5 (or lower) may also be useful, set to 0.0 to disable downsampling.
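sample: the threshold that controls random downsampling of high-frequency words. The default is 1e-3, and the official documentation notes that 1e-5 may also be useful. In my runs, setting it to 0 (no downsampling) gave the fewest distinct words in the results, while 1e-5 gave noticeably richer results.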
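To see how the threshold acts on individual words, here is a sketch of the keep probability using the formula from the original word2vec C code. Whether a given gensim version matches it term for term is an assumption, and `keep_probability` is a hypothetical helper.

```python
import math


def keep_probability(word_count, total_words, sample=1e-3):
    """Probability that one occurrence of a word survives downsampling."""
    if sample <= 0:
        return 1.0  # downsampling disabled: every occurrence is kept
    threshold = sample * total_words
    prob = (math.sqrt(word_count / threshold) + 1) * (threshold / word_count)
    return min(prob, 1.0)


total = 1_000_000
print(keep_probability(50_000, total, sample=1e-3))  # frequent word, often dropped
print(keep_probability(50_000, total, sample=1e-5))  # dropped far more often
print(keep_probability(10, total, sample=1e-3))      # rare word, always kept
```

A smaller threshold removes more occurrences of the most frequent words, which leaves relatively more room for the rarer ones in each training window.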
workers = use this many worker threads to train the model (=faster training with multicore machines).
workers: the number of parallel worker threads used for training.
iter = number of iterations (epochs) over the corpus. The default inherited from Word2Vec is 5, but values of 10 or 20 are common in published ‘Paragraph Vector’ experiments.
iter: the number of iterations over the corpus. The default of 5 is inherited from Word2Vec, but 10 or 20 is common in Paragraph Vector experiments.
hs = if 1, hierarchical softmax will be used for model training. If set to 0 (default), and negative is non-zero, negative sampling will be used.
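A comparison: with hs=1, so that negative sampling is not used, my results were not ideal. Unlike hierarchical softmax, negative sampling does not use a Huffman tree, which can improve performance considerably.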
negative = if > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). Default is 5. If set to 0, no negative sampling is used.
negative: sets how many noise words are drawn for negative sampling.
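Drawing the noise words can be sketched as weighted sampling from the vocabulary. The 0.75 exponent on word counts follows the original word2vec work; excluding the target up front (rather than redrawing on a collision) is a simplification made here, and `draw_noise_words` is a hypothetical helper.

```python
import random


def draw_noise_words(counts, target, negative=5, power=0.75, seed=0):
    """Draw `negative` noise words, weighted by count ** power, excluding the target."""
    rng = random.Random(seed)
    candidates = [word for word in counts if word != target]
    weights = [counts[word] ** power for word in candidates]
    return rng.choices(candidates, weights=weights, k=negative)


counts = {"the": 1000, "model": 50, "vector": 30, "paragraph": 5}
print(draw_noise_words(counts, target="model", negative=5))
```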
dm_mean = if 0 (default), use the sum of the context word vectors. If 1, use the mean. Only applies when dm is used in non-concatenative mode.
dm_mean: when the DM training algorithm is used, the context vectors are summed (default 0); if set to 1, they are averaged.
dm_concat = if 1, use concatenation of context vectors rather than sum/average; default is 0 (off). Note concatenation results in a much-larger model, as the input is no longer the size of one (sampled or arithmetically combined) word vector, but the size of the tag(s) and all words in the context strung together.
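dm_concat: defaults to 0. When set to 1 with the DM training algorithm, the context vectors and the doc vector are concatenated directly.
That is, the context vectors are concatenated rather than summed or averaged; the default is 0 (off). Note that concatenation yields a much larger model, because the input is no longer the size of one (sampled or arithmetically combined) word vector but the size of the tag(s) and all the context words strung together.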
dm_tag_count = expected constant number of document tags per document, when using dm_concat mode; default is 1.
dm_tag_count: the expected constant number of document tags per document when dm_concat mode is used; the default is 1.
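The three ways of combining vectors in DM mode can be compared side by side. This is a plain-Python sketch with vectors as lists of floats, not gensim's implementation; `combine_context` and `concat_input_size` are hypothetical helpers, and the size formula assumes one slot per tag plus one slot for each of the 2 × window context positions.

```python
def combine_context(doc_vectors, word_vectors, dm_mean=0, dm_concat=0):
    """Combine doc and context word vectors by sum, mean, or concatenation."""
    vectors = list(doc_vectors) + list(word_vectors)
    if dm_concat:
        # Concatenation: the input grows with the number of vectors.
        return [x for vec in vectors for x in vec]
    summed = [sum(column) for column in zip(*vectors)]
    if dm_mean:
        return [x / len(vectors) for x in summed]
    return summed


def concat_input_size(size, window, dm_tag_count=1):
    """Input width in dm_concat mode: every tag and context slot keeps its own vector."""
    return (dm_tag_count + 2 * window) * size


doc = [[1.0, 2.0]]
words = [[3.0, 4.0], [5.0, 6.0]]
print(combine_context(doc, words))               # sum
print(combine_context(doc, words, dm_mean=1))    # mean
print(combine_context(doc, words, dm_concat=1))  # concatenation
print(concat_input_size(size=100, window=5))
```

The last line shows why concatenation gives a much larger model: with size 100 and window 5 the input is eleven times wider than in the sum or mean case.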
dbow_words if set to 1 trains word-vectors (in skip-gram fashion) simultaneous with DBOW doc-vector training; default is 0 (faster training of doc-vectors only).
dbow_words: when set to 1, word vectors are trained (skip-gram) at the same time as the DBOW doc vectors; the default is 0, which trains only doc vectors and is faster.

trim_rule = vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used), or a callable that accepts parameters (word, count, min_count) and returns either util.RULE_DISCARD, util.RULE_KEEP or util.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
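trim_rule: sets the vocabulary trimming rule, specifying which words are kept and which are removed. It can be None (min_count is then used), or a callable that accepts the parameters (word, count, min_count) and returns util.RULE_DISCARD, util.RULE_KEEP, or util.RULE_DEFAULT. Note: the rule, if given, is only used to prune the vocabulary during build_vocab() and is not stored as part of the model.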
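A trim rule of this shape might look like the sketch below. The three `RULE_*` names are placeholder constants defined here so the example runs without gensim; in real use the library's own constants would be returned instead. The word sets, `my_trim_rule`, and `apply_rule` are all hypothetical.

```python
# Placeholder constants standing in for the library's RULE_* values (assumption).
RULE_DEFAULT, RULE_DISCARD, RULE_KEEP = "default", "discard", "keep"

PROTECTED = {"doc2vec", "word2vec"}  # always keep, however rare
STOPWORDS = {"the", "a", "of"}       # always drop, however frequent


def my_trim_rule(word, count, min_count):
    """Callable with the (word, count, min_count) signature described above."""
    if word in PROTECTED:
        return RULE_KEEP
    if word in STOPWORDS:
        return RULE_DISCARD
    return RULE_DEFAULT  # fall back to the min_count test


def apply_rule(counts, min_count, rule):
    """Simulate vocabulary pruning with a trim rule."""
    kept = {}
    for word, count in counts.items():
        decision = rule(word, count, min_count)
        if decision == RULE_KEEP or (decision == RULE_DEFAULT and count >= min_count):
            kept[word] = count
    return kept


counts = {"the": 100, "doc2vec": 1, "model": 7, "rare": 2}
print(apply_rule(counts, min_count=5, rule=my_trim_rule))
```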

WORD2VEC parameters:
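**Architecture: skip-gram (slower, better for rare words) vs CBOW (faster)
· Training algorithm: hierarchical softmax (better for rare words) vs negative sampling (better for frequent words and low-dimensional vectors)
    Negative sampling improves accuracy but slows training; word2vec without negative sampling is very fast in itself, but not very accurate.
· Subsampling of frequent words: can improve both accuracy and speed (useful range: 1e-3 to 1e-5)
· Context (window) size: usually around 10 for skip-gram and around 5 for CBOW**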
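1. Parameters for the DM model
· dm: the training algorithm. The default, 1, selects DM; with dm=0, DBOW is used.
· dm_mean: when the DM training algorithm is used, the context vectors are summed (default 0); if set to 1, they are averaged.
· dm_concat: defaults to 0. When set to 1 with the DM training algorithm, the context vectors and the doc vector are concatenated directly.
· dbow_words: when set to 1, word vectors are trained (skip-gram) at the same time as the DBOW doc vectors; the default is 0, which trains only doc vectors and is faster.
The other parameters are similar to the Word2vec training parameters.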
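2. Other parameters
sentences: can be a list; for large corpora, building it with BrownCorpus, Text8Corpus, or LineSentence is recommended.
· size: the dimensionality of the feature vectors; the default is 100.
· alpha: the initial learning rate, which decreases linearly to min_alpha during training.
· window: the window size, i.e. the maximum distance within a sentence between the word being predicted and the context words used to predict it.
· min_count: truncates the vocabulary. Words that occur fewer than min_count times are discarded; the default is 5.
· max_vocab_size: limits RAM use while the vocabulary is being built; set to None for no limit.
· sample: the threshold that controls random downsampling of high-frequency words. The default is 1e-3, and the official documentation notes that 1e-5 may also be useful. In my runs, setting it to 0 (no downsampling) gave the fewest distinct words in the results, while 1e-5 gave noticeably richer results.
· seed: the seed for the random number generator; it affects how the word vectors are initialized.
· workers: the number of parallel worker threads used for training.
· min_alpha: the minimum value of the learning rate.
· sg: selects the training algorithm. The default, 0, is CBOW; sg=1 uses skip-gram.
· hs: if 1, hierarchical softmax is used. If set to 0 (the default) and negative is non-zero, negative sampling is used.
· negative: if > 0, negative sampling is used; the value sets how many noise words are drawn (usually 5-20).
· cbow_mean: if 0, the sum of the context word vectors is used; if 1 (the default), their mean is used. Only applies when CBOW is used.
· hashfxn: the hash function used to initialize the weights; by default Python's built-in hash function is used.
· iter: the number of iterations; the default is 5.
· trim_rule: sets the vocabulary trimming rule, specifying which words are kept and which are removed. It can be set to None (min_count is then used).
· sorted_vocab: if 1 (the default), words are sorted by descending frequency before word indexes are assigned.
· batch_words: the number of words in each batch passed to the worker threads; the default is 10000.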
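The defaults stated in this post can be gathered into one keyword dictionary that is adjusted per experiment. This sketch only builds the dictionary and does not call gensim; the parameter names follow the API version described here (newer gensim releases rename `size` to `vector_size` and `iter` to `epochs`), and `with_overrides` is a hypothetical helper.

```python
# Defaults as stated in this post; parameters whose defaults are not given are left out.
word2vec_defaults = {
    "size": 100,
    "min_count": 5,
    "max_vocab_size": None,
    "sample": 1e-3,
    "sg": 0,
    "hs": 0,
    "negative": 5,
    "cbow_mean": 1,
    "iter": 5,
    "sorted_vocab": 1,
    "batch_words": 10000,
}


def with_overrides(**overrides):
    """Return a copy of the defaults with some values replaced."""
    unknown = set(overrides) - set(word2vec_defaults)
    if unknown:
        raise KeyError("unknown parameter(s): " + ", ".join(sorted(unknown)))
    return {**word2vec_defaults, **overrides}


# Example: switch to skip-gram and train for more iterations.
params = with_overrides(sg=1, iter=20)
print(params)
```

Rejecting unknown names catches typos early, which is useful because a misspelled keyword would otherwise only surface when the model is constructed.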
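Choosing and comparing some parameters:
1. Skip-gram (slow to train, effective for rare words) vs CBOW (fast to train). Skip-gram is the usual choice;
2. Training method: Hierarchical Softmax (better for rare words) vs Negative Sampling (better for frequent words and low-dimensional vectors);
3. Subsampling frequent words can improve both accuracy and speed (1e-3 to 1e-5);
4. Window size: usually about 10 for skip-gram and about 5 for CBOW.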
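These rules of thumb can be encoded as a starting point for experiments. The sketch is only one reading of the list above, not a recommendation from the library; `suggest_settings` and its `rare_words_matter` flag are hypothetical.

```python
def suggest_settings(rare_words_matter):
    """Map the rules of thumb above onto sg, hs, negative, and window."""
    if rare_words_matter:
        # Skip-gram with hierarchical softmax and a wider window.
        return {"sg": 1, "hs": 1, "negative": 0, "window": 10}
    # CBOW with negative sampling and a narrower window.
    return {"sg": 0, "hs": 0, "negative": 5, "window": 5}


print(suggest_settings(rare_words_matter=True))
print(suggest_settings(rare_words_matter=False))
```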

