Overview of word2vec API in gensim


In gensim, the word2vec-related APIs live in the module gensim.models.word2vec, and the algorithm's parameters are set on the class gensim.models.word2vec.Word2Vec. The parameters worth paying attention to are:

  • sentences: The corpus we want to analyze. It can be an in-memory list of tokenized sentences, or streamed from a file (e.g. with gensim.models.word2vec.LineSentence).
  • size: The dimension of the word vectors; the default is 100. A reasonable value depends on the size of the corpus: for a small corpus (say, less than 100 MB of text) the default is generally sufficient, while for a very large corpus it is recommended to increase the dimension.
  • window: The maximum distance between a word and its context words. This parameter is denoted c in our article on the algorithm's principles. The larger the window, the more distant words are treated as context. The default is 5; in practice it can be tuned to the task. For a small corpus this value can be set smaller; for a typical corpus a value in [5, 10] is recommended.
  • sg: Selects between the two word2vec models. If 0, the CBOW model is used; if 1, the Skip-Gram model. The default is 0, i.e. CBOW.
  • hs: Selects between the two word2vec training solutions. If 1, Hierarchical Softmax is used; if 0 and the number of negative samples (negative) is greater than 0, Negative Sampling is used. The default is 0, i.e. Negative Sampling.
  • negative: The number of negative samples when using Negative Sampling; the default is 5. A value in [3, 10] is recommended. This parameter is denoted neg in our article on the algorithm's principles.
  • cbow_mean: Only used in the CBOW projection step. If 0, the projection x_w in the algorithm is the sum of the context word vectors; if 1, it is their average. Our principles article describes CBOW in terms of the average, and I personally prefer the average as the representation of x_w. The default is 1, and it is not recommended to change it.
  • min_count: The minimum word frequency required for a word to receive a vector. This threshold removes very rare low-frequency words; the default is 5. For a small corpus you can lower this value.
  • iter: The maximum number of iterations (epochs) of stochastic gradient descent over the corpus; the default is 5. For a large corpus this value can be increased.
  • alpha: The initial learning rate (step size) of the stochastic gradient descent, denoted η in our article on the algorithm's principles. The default is 0.025.
  • min_alpha: Since the algorithm gradually decreases the step size during training, min_alpha gives the minimum learning rate. The per-round step size in stochastic gradient descent is determined jointly by iter, alpha, and min_alpha. Because this is not core to the word2vec algorithm, we did not cover it in the principles article. For a large corpus, alpha, min_alpha, and iter need to be tuned together to find three suitable values.


Origin: blog.csdn.net/jokerxsy/article/details/106567777