from gensim.models import Word2Vec

Word2Vec(self, sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False, callbacks=())

"""
Initialize the model from an iterable of `sentences`. Each sentence is a list of words (unicode strings) that will be used for training.

Parameters
----------
sentences : iterable of iterables
    The corpus to be analyzed. It may be an in-memory list, or streamed from a file. For a large corpus, BrownCorpus, Text8Corpus, or LineSentence is recommended.
sg : int {1, 0}
    Training algorithm: sg=1 selects skip-gram (the input word predicts its context); sg=0 selects CBOW (the context predicts the input word). The default is 0, i.e. the CBOW model.
size : int
    Dimensionality of the word vectors. The default is 100.
window : int
    Maximum distance between the current word and the predicted word; both skip-gram and CBOW predict within this sliding window. The default is 5. In practice the window size can be adjusted to the task; values between 5 and 10 are recommended for a general corpus.
alpha : float
    Initial learning rate, linearly decreased to min_alpha during training.
min_alpha : float
    The learning rate is gradually reduced during training; min_alpha gives its minimum value.
seed : int
    Seed for the random number generator. The initial vector for each word is seeded with a hash of word + str(seed).
min_count : int
    Minimum frequency cutoff; words that occur fewer than min_count times are discarded. The default is 5.
max_vocab_size : int
    RAM limit during vocabulary building; set to None for no limit. Every 10 million word types need about 1 GB of RAM.
sample : float
    Threshold for randomly downsampling high-frequency words. The default is 1e-3; the useful range is (0, 1e-5).
workers : int
    Number of worker threads used for parallel training.
hs : int {1, 0}
    Selects between the two word2vec training objectives: if 1, hierarchical softmax is used; if 0 and negative is greater than 0, negative sampling is used. The default is 0, i.e. negative sampling.
negative : int
    If greater than 0, negative sampling is used and this sets the number of noise words (typically 5-20).
cbow_mean : int {1, 0}
    Only applies to CBOW: if 0, use the sum of the context word vectors for the projection; if 1, use their mean. The default is 1, and changing it is not recommended.
hashfxn : function
    Hash function used to randomly initialize the weights; Python's built-in hash function is used by default.
iter : int
    Number of iterations (epochs) of stochastic gradient descent over the corpus. The default is 5; for a large corpus this value can be increased.
trim_rule : function
    Vocabulary trimming rule specifying which words to keep and which to discard. May be None, in which case min_count is used.
sorted_vocab : int {1, 0}
    If 1 (the default), sort the vocabulary by descending frequency before assigning word indexes.
batch_words : int
    Number of words per batch passed to worker threads. The default is 10000.

Examples
--------
Initialize and train a Word2Vec model:

from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)
say_vector = model['say']  # get vector for word
"""
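To tie several of the parameters above together, here is a minimal sketch of training on a larger corpus streamed from disk with LineSentence, using skip-gram with negative sampling. It assumes the pre-4.0 gensim API that matches the signature above (in gensim 4.0, size and iter were renamed to vector_size and epochs); the file name 'corpus.txt' is a hypothetical placeholder for a one-sentence-per-line text file.

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream sentences from disk instead of holding the whole corpus in memory.
sentences = LineSentence('corpus.txt')  # hypothetical corpus file

model = Word2Vec(
    sentences,
    sg=1,         # skip-gram; sg=0 would select CBOW
    size=100,     # dimensionality of the word vectors
    window=5,     # maximum context distance
    min_count=5,  # discard words seen fewer than 5 times
    hs=0,         # use negative sampling rather than hierarchical softmax
    negative=5,   # number of noise words drawn per positive example
    workers=3,    # parallel training threads
    iter=5,       # passes over the corpus
)

vector = model.wv['say']                # vector for a single word
similar = model.wv.most_similar('say')  # nearest neighbours by cosine similarity
model.save('word2vec.model')            # persist the trained model for later reuse

Accessing vectors through model.wv (rather than indexing the model directly, as in the short example above) is the form that remains valid in newer gensim releases.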