Using word2vec to construct features for atypical NLP problems

Recently, embedding time-series data has become very popular in all kinds of competitions, and it works remarkably well: for example, the rank-1 solution shared for the Smart Ocean construction track of the Digital China Innovation Contest 2020, and the 2020 Tencent Advertising Algorithm Competition.

To get a decent score, this NLP beginner had to spend some time chewing through tf-idf, word2vec, and doc2vec.

The following is the code that uses gensim's word2vec implementation to build features (now, show you the code):

# -*- coding: utf-8 -*-
"""
Created on Thu Jun  4 16:23:02 2020

@author: csdn lanxuml
"""
from gensim.test.utils import common_texts
from gensim.models import Word2Vec

import numpy as np
import pandas as pd

# build n_dims-dimensional features
n_dims = 64
# train the model (note: in gensim < 4.0 the vector_size argument was called size)
model = Word2Vec(common_texts, vector_size=n_dims, window=5, min_count=1, workers=4)
# take the column-wise mean of the word vectors in each sentence of common_texts
# as that sentence's feature vector, and stack the rows into a numpy array
vector_corpus_np = np.vstack([np.mean(model.wv[common_texts[i]], axis=0) for i in range(len(common_texts))])
# convert the numpy array to a pandas DataFrame
vector_corpus_df = pd.DataFrame(vector_corpus_np)
# rename the columns so that integer feature names do not cause errors when modelling later
vector_corpus_df.columns = ['dim_' + str(i) for i in range(len(vector_corpus_df.columns))]
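
A quick sanity check (purely illustrative): the result has one row per sentence in common_texts and n_dims columns per row.

print(vector_corpus_df.shape)   # (len(common_texts), n_dims)
print(vector_corpus_df.head())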

Notes (a bit messy; if anything is unclear, you can copy the code above and run it in your notebook or IDE):

    1. If min_count is set to a value n greater than 1, words that appear fewer than n times in common_texts need to be removed first, because gensim drops them from the vocabulary and looking up their vectors will raise a KeyError;

    2. Since the non-numeric features given in these atypical NLP problems are usually categorical (ID-like) data, jieba word segmentation is not used in this article;

    3. In practice, common_texts has to be constructed from the known features. For example, the set of advertisement IDs that user i clicked within 90 days (spliced into a list) can be used as one text sentence common_texts[i]; the column-wise mean of model.wv[common_texts[i]] then serves as the n_dims-dimensional feature vector of that user (see the sketch below).
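
A minimal sketch of note 3, reusing the imports and n_dims from the script above. The click_log DataFrame and its user_id / ad_id columns are made up purely for illustration; the point is to group each user's clicked ad IDs into a list and treat that list as one sentence:

# hypothetical click log: one row per click (column names are illustrative)
click_log = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'ad_id':   ['a101', 'a205', 'a101', 'a205', 'a307', 'a101'],
})
# one "sentence" per user: the list of ad IDs that user clicked
user_sentences = click_log.groupby('user_id')['ad_id'].apply(list)
# train word2vec on these sentences (min_count=1 so rare ad IDs are kept, cf. note 1)
model = Word2Vec(user_sentences.tolist(), vector_size=n_dims, window=5, min_count=1, workers=4)
# each user's feature vector is the column-wise mean of the vectors of the ads he/she clicked
user_features = pd.DataFrame(
    np.vstack([np.mean(model.wv[sent], axis=0) for sent in user_sentences]),
    index=user_sentences.index,
)
user_features.columns = ['dim_' + str(i) for i in range(n_dims)]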

That is all. Once the features are constructed, they can be fed into whatever models you like for model building (and parameter tuning).
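
For example, a minimal sketch of this last step, assuming scikit-learn is available; y here is a made-up label vector standing in for the real competition target:

from sklearn.linear_model import LogisticRegression

y = [0, 1, 0]  # dummy labels, one per user in user_features, purely for illustration
clf = LogisticRegression(max_iter=1000)
clf.fit(user_features, y)
print(clf.predict(user_features))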

Origin blog.csdn.net/lanxuxml/article/details/106573356