word2vec: training your own word vectors, with a detailed explanation of the parameters

Code

from gensim.models import Word2Vec

# Prepare the training data
sentences = [['I', 'love', 'coding'],
             ['Python', 'is', 'great'],
             ['Machine', 'learning', 'is', 'fascinating']]

# Pass the data to Word2Vec and train the model
model = Word2Vec(sentences, vector_size=10, window=5, min_count=1, workers=4)

# Get the trained word vectors
word_vectors = model.wv

# Get the vector for a single word, e.g. 'Python'
python_vector = word_vectors['Python']
print("Word vector dimension:", len(python_vector))
print("Word vector for 'Python':", python_vector)

# Get the vocabulary after training
print("Vocabulary after training:", word_vectors.index_to_key)

# Get the index assigned to each word after training
print("Each word and its index in the vocabulary:", word_vectors.key_to_index)

# Get each word and its corresponding vector
print("Each word and its corresponding vector:")
for word, index in word_vectors.key_to_index.items():
    print(word, ":", word_vectors.vectors[index])
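
As a quick follow-up (not part of the original code above), the trained vectors can also be queried for similarity with gensim's KeyedVectors API; with such a tiny toy corpus the resulting numbers are essentially arbitrary, so this is only a usage sketch.

# Query the trained vectors for the words most similar to 'Python'
similar_words = word_vectors.most_similar('Python', topn=3)
print("Words most similar to 'Python':", similar_words)

# Cosine similarity between two specific words
print("Similarity of 'Python' and 'coding':", word_vectors.similarity('Python', 'coding'))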

Detailed explanation of parameters

  • sentences: the training data, a list of sentences, where each sentence is itself a list of words (tokens).

  • vector_size: the dimensionality of the word vectors, i.e. how many dimensions are used to represent each word. Larger values generally yield richer semantic representations, but also require more computational resources. A value between 100 and 300 is a common starting point.

  • window: the window size, i.e. the maximum distance between the current word and a predicted word. The window size determines how much contextual information the model considers: a larger window captures more distant context, but may result in a sparser model. The appropriate window size typically depends on the characteristics of the training data.

  • min_count: the minimum word frequency; words that occur fewer times than this are ignored during training. Larger values filter out low-frequency noise words, but may also discard useful information. Setting this to a reasonable value such as 1 or 5 is generally recommended.

  • workers: the number of threads used for training; multiple threads can be used to speed up the training process. Setting this to the number of CPU cores is generally a reasonable choice.

  • sg: selects the training algorithm. The default value is 0, which uses the CBOW (Continuous Bag of Words) algorithm; setting it to 1 uses the Skip-gram algorithm. Skip-gram generally works better for rare words, while CBOW trains faster and works well for frequent words overall.

  • hs: hierarchical softmax setting. The default value is 0, which means negative sampling is used (together with negative > 0). When set to 1, the model uses hierarchical softmax instead, which can speed up the softmax computation, especially for large vocabularies.

  • negative: the number of negative samples used in negative sampling. The default value is 5, and values between 5 and 20 are typical. Negative sampling approximates the softmax loss during training and reduces computational complexity. Larger values mean more computation per update and therefore slower training, but can improve quality, especially on smaller datasets.

  • alpha: the initial learning rate. The default value is 0.025. The learning rate controls how quickly the model updates its weights on each training example: a larger learning rate can speed up convergence, but if set too large the model may diverge. During training the learning rate gradually decreases (a configuration sketch that sets these training parameters explicitly follows this list).
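
To tie the parameters above together, the following is a minimal sketch (not part of the original post) that trains a Skip-gram model with negative sampling, reusing the toy sentences from the code section; min_alpha is an additional gensim parameter (the floor the learning rate decays to) that is not covered in the list above.

from gensim.models import Word2Vec

# Toy corpus, reused from the code section above
sentences = [['I', 'love', 'coding'],
             ['Python', 'is', 'great'],
             ['Machine', 'learning', 'is', 'fascinating']]

# Skip-gram (sg=1) with negative sampling (hs=0, negative=5)
skipgram_model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the word vectors
    window=5,          # maximum distance between the current and predicted word
    min_count=1,       # keep every word in this tiny toy corpus
    workers=4,         # number of training threads
    sg=1,              # 1 = Skip-gram, 0 = CBOW
    hs=0,              # 0 = negative sampling (used together with negative > 0)
    negative=5,        # number of negative samples per positive example
    alpha=0.025,       # initial learning rate
    min_alpha=0.0001,  # learning rate floor that alpha decays to during training
)

# First 5 dimensions of the vector learned for 'Python'
print(skipgram_model.wv['Python'][:5])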
