Word2Vec in gensim

from gensim.models import Word2Vec
Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5,
         max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
         sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
         trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH,
         compute_loss=False, callbacks=())
        """
        Initialize the model from an iterable of `sentences`. Each sentence is a
        list of words (unicode strings) that will be used for training.

        Parameters
        ----------
        sentences: iterable of iterables
            The corpus to train on. It can be a plain list of tokenized sentences or an iterable
            that streams them from disk. For a large corpus it is recommended to use the provided
            BrownCorpus, Text8Corpus or LineSentence classes.
        sg: int {1, 0}
            Defines the training algorithm. sg=1: skip-gram (the input word predicts the context);
            sg=0: CBOW (the context predicts the word). The default is sg=0, i.e. the CBOW model.
        size: int
            Dimensionality of the word vectors; the default is 100.
        window: int
            Maximum distance between the current and predicted word; both skip-gram and CBOW
            predict within a sliding window. The default is 5. In practice the window size can be
            adjusted to the task; for a general corpus, values in [5, 10] are a reasonable range.
        alpha: float
            Initial learning rate, decreased linearly to min_alpha during training.
        min_alpha: float
            The learning rate is reduced gradually over the iterations; min_alpha is its final
            (minimum) value.
        seed: int
            Seed for the random number generator. The initial vector for each word is seeded with
            a hash of the word concatenated with `str(seed)`.
        min_count: int
            Minimum frequency cutoff: words occurring fewer than min_count times are discarded.
            The default is 5.
        max_vocab_size: int
            Limits the RAM used while building the vocabulary; set it to None for no limit.
            Every 10 million word types need about 1GB of RAM.
        sample: float
            Threshold for randomly downsampling high-frequency words. The default is 1e-3;
            the useful range is (0, 1e-5).
        workers: int
            Number of worker threads used to train the model in parallel.
        hs: int {1, 0}
            Selects between the two word2vec training objectives: if 1, hierarchical softmax is
            used; if 0 (the default) and negative is greater than 0, negative sampling is used.
        negative: int
            If greater than 0, negative sampling is used and this value gives the number of
            "noise words" to draw (usually 5-20).
        cbow_mean: int {1, 0}
            Only used with CBOW. If 0, the projection is the sum of the context word vectors;
            if 1, their mean. The default is 1, and changing it is not recommended.
        hashfxn: function
            Hash function used to randomly initialize the weights; the default is Python's
            built-in hash function.
        iter: int
            Number of iterations (epochs) of stochastic gradient descent over the corpus;
            the default is 5. For a large corpus this value can be increased.
        trim_rule: function
            Vocabulary trimming rule specifying which words to keep and which to discard.
            Can be None (min_count will be used); a custom rule is sketched after this
            parameter list.
        sorted_vocab: int {1, 0}
            If 1 (the default), sort the vocabulary by descending frequency before assigning
            word indexes.
        batch_words: int
            Number of words passed to each worker thread per batch; the default is 10000.
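        To make trim_rule concrete: in gensim the rule is a callable taking (word, count, min_count)
        and returning one of the constants gensim.utils.RULE_KEEP, gensim.utils.RULE_DISCARD or
        gensim.utils.RULE_DEFAULT. The sketch below is only an illustration, not gensim's own
        example: the drop_say function and the toy sentences are made up, and the rule simply
        discards the token "say" regardless of how often it occurs.

       from gensim.models import Word2Vec
       from gensim import utils

       sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]

       def drop_say(word, count, min_count):
           # Illustrative rule: discard "say" no matter how frequent it is;
           # every other word falls back to the usual min_count behaviour.
           if word == "say":
               return utils.RULE_DISCARD
           return utils.RULE_DEFAULT

       model = Word2Vec(sentences, min_count=1, trim_rule=drop_say)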
       
        Examples
        --------
        Initialize and train a `Word2Vec` model:

       from gensim.models import Word2Vec
       sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
       model = Word2Vec(sentences, min_count=1)
       say_vector = model['say']  # get vector for word
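        As a second, slightly larger sketch (not part of the original docstring): the parameters
        above can be combined to train a skip-gram model with negative sampling on a corpus
        streamed from disk with LineSentence. The file name corpus.txt and the query word "dog"
        are placeholders, and the size/iter names assume the pre-4.0 gensim API shown above
        (gensim 4.0 renamed them to vector_size and epochs).

       from gensim.models import Word2Vec
       from gensim.models.word2vec import LineSentence

       # Stream a large corpus from disk: one sentence per line,
       # tokens separated by whitespace ("corpus.txt" is a placeholder path).
       sentences = LineSentence("corpus.txt")

       # Skip-gram (sg=1) with negative sampling (hs=0, negative=5),
       # 200-dimensional vectors, a window of 5 and 4 worker threads.
       model = Word2Vec(sentences, sg=1, hs=0, negative=5,
                        size=200, window=5, min_count=5,
                        workers=4, iter=5)

       model.save("word2vec.model")                  # persist the trained model
       print(model.wv.most_similar("dog", topn=5))   # nearest words by cosine similarity

        Passing sentences to the constructor builds the vocabulary and trains in one step;
        setting sg=0 in the same call would train a CBOW model instead.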

 

Origin: www.cnblogs.com/jeshy/p/11434241.html