Word2Vec model - a method for converting text into vectors

I used this model when I competed in the MCM (Mathematical Contest in Modeling), and I'm writing it up here.

Word2Vec is a technique for converting text into vector representations. Developed by Google in 2013, it is mainly used to map words to vectors and to capture the semantic relationships between words in the vector space. The Word2Vec model has two architectures: Continuous Bag-of-Words (CBOW) and Skip-Gram.

In the CBOW architecture, the model tries to infer the current word from its context, while in the Skip-Gram architecture, the model tries to infer the context words from the current word. The goal of Word2Vec is to learn a vector space in which semantically similar words lie relatively close together. Specifically, Word2Vec represents each word as a dense real-valued vector designed to capture the probability distribution of words in context. Once trained, these vectors can be used in various natural language processing tasks, such as text classification, machine translation, and sentiment analysis.

In general, the Skip-gram algorithm performs better on smaller corpora and for low-frequency words, while the CBOW algorithm performs better on larger corpora and for high-frequency words.
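
To make the difference concrete, here is a minimal sketch of the training pairs each architecture is built from, using a toy sentence. This is my own illustration, not gensim's internal code:

sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 2  # how many words on each side count as context

for i, target in enumerate(sentence):
    # collect the context words within the window around position i
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    # CBOW: predict the current word from its surrounding context
    print(f"CBOW:      {context} -> {target}")
    # Skip-gram: predict each context word from the current word
    for c in context:
        print(f"Skip-gram: {target} -> {c}")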

Without further ado, here's the code.

import pandas as pd
from gensim.models import Word2Vec

# Read the training text: one sentence per line, tokens separated by spaces
with open('output.txt', 'r', encoding='utf-8') as f:
    sentences = [line.strip().split() for line in f]

# Train the Word2Vec model and save it
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4, sg=1)
model.save('word2vec.model')


# Read another file, extract each word's feature vector, and save to WordVector.csv
df = pd.read_csv('word.csv', encoding="gbk")
word_list = df['Word'].tolist()
vectors = []
for word in word_list:
    if word in model.wv:
        vectors.append(model.wv[word])
    else:
        vectors.append([0] * 100)  # pad with a zero vector if the word is not in the vocabulary
vectors_df = pd.DataFrame(vectors)
vectors_df.to_csv('WordVector.csv', index=False, header=None)

Now I'll explain what each step does.

with open('output.txt', 'r', encoding='utf-8') as f:
    sentences = [line.strip().split() for line in f]

Open the file named "output.txt" and read its text into a nested list: each inner list represents one sentence, with the sentence's words split into individual elements.
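
For instance (hypothetical file contents), if output.txt held the two lines "the quick brown fox" and "natural language processing", sentences would be:

sentences = [
    ["the", "quick", "brown", "fox"],
    ["natural", "language", "processing"],
]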

      

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4, sg=1)
model.save('word2vec.model')

The sentences are used to train a Word2Vec model. Here, vector_size is the dimensionality of the feature vectors, window is the maximum distance between the current word and a predicted word within a sentence, min_count is the minimum number of occurrences a word needs in order to be kept in the vocabulary, workers is the number of threads used for parallel training, and sg selects the training algorithm (sg=1 uses Skip-gram, while sg=0 uses CBOW). Finally, the trained model is saved to a file named "word2vec.model".
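
Once training finishes, a quick sanity check can be run on the model. 'example' below is a placeholder; substitute a word that actually appears in your corpus:

vec = model.wv['example']  # the 100-dimensional vector for one word
print(vec.shape)           # -> (100,)
print(model.wv.most_similar('example', topn=5))  # nearest words in the vector space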

        

df = pd.read_csv('word.csv', encoding="gbk")
word_list = df['Word'].tolist()

Use the pandas library to read the file named "word.csv", extract the data in the "Word" column, and convert it into a list. These are the words whose feature vectors we want to extract.
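
As a hypothetical example of what this step produces:

import pandas as pd

# stand-in for word.csv: a single "Word" column, one word per row
df = pd.DataFrame({'Word': ['apple', 'banana', 'cherry']})
word_list = df['Word'].tolist()
print(word_list)  # -> ['apple', 'banana', 'cherry']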

vectors = []
for word in word_list:
    if word in model.wv:
        vectors.append(model.wv[word])
    else:
        vectors.append([0] * 100)

For each word in the list, check whether it is in the trained Word2Vec model's vocabulary. If it is, its feature vector is extracted and appended to the vectors list; otherwise, a zero vector is appended in its place.

Here is the catch: if you don't have enough training text and a word you want a vector for never appears in it, that word's result will be all zeros.
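
One way to catch this early (a small addition of my own, reusing word_list and model from above) is to report out-of-vocabulary words before extracting:

missing = [w for w in word_list if w not in model.wv]
print(f"{len(missing)} of {len(word_list)} words are out of vocabulary: {missing[:10]}")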

vectors_df = pd.DataFrame(vectors)
vectors_df.to_csv('WordVector.csv', index=False, header=None)

Convert the list of vectors into a pandas DataFrame and save it as a file named "WordVector.csv".
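
If you later need these vectors, for example as features for a classifier, they can be read back like this (a small usage sketch; header=None because the file was written without a header row):

import pandas as pd

vectors_df = pd.read_csv('WordVector.csv', header=None)
features = vectors_df.to_numpy()  # shape: (number_of_words, 100)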

output.txt is the model's training data. Google originally provided training data as well, but I was unable to download it. Google's dataset is much larger, so training on it should take a long time.

Data URL: https://code.google.com/archive/p/word2vec/

If the download succeeds, you can replace the training code with loading the pre-trained model (just make sure the path is correct):

from gensim.models import KeyedVectors

# Load the pre-trained model
model_path = 'path/to/GoogleNews-vectors-negative300.bin.gz'
model = KeyedVectors.load_word2vec_format(model_path, binary=True)
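
Note that a KeyedVectors object is indexed directly, without the .wv attribute used earlier, and the GoogleNews vectors are 300-dimensional. A brief sketch, assuming the download worked ('king' is just an example word from its vocabulary):

vec = model['king']  # a 300-dimensional vector
print(model.most_similar('king', topn=5))

# The zero-padding in the extraction loop above would also need to become
# [0] * 300 to match the pre-trained dimensionality.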

Now let's look at the data.

Word.csv looks like this (screenshot omitted): a single "Word" column with one word per row.

The extracted vectors look like this (screenshot omitted): each row is the feature vector of one word, with 100 columns making up the 100-dimensional vector.
