1. Word2Vec Parameters Explained
Word2Vec is a ready-made module in gensim; the name gensim is short for "generate similar".
This article assumes basic familiarity with word vectors. The parameters:
```python
from gensim.models import Word2Vec

# All values below are the defaults
Word2Vec(sentences=None,      # a list of tokenized sentences, or a streamed corpus
         size=100,            # dimensionality of the feature vectors
         alpha=0.025,         # initial learning rate
         window=5,            # maximum distance between the current and predicted word within a sentence
         min_count=5,         # ignore words whose total frequency is below this
         max_vocab_size=None, # cap on vocabulary size during building; None = no limit
         sample=0.001,        # threshold for random downsampling of high-frequency words
         seed=1,              # random seed
         workers=3,           # number of worker threads
         min_alpha=0.0001,    # floor that the decaying learning rate drops to
         sg=0,                # training algorithm: sg=1 skip-gram, sg=0 CBOW
         hs=0,                # hs=1: hierarchical softmax; hs=0 (with negative > 0): negative sampling
         negative=5,          # if > 0, use negative sampling with this many "noise words" (usually 5-20); 0 disables it
         cbow_mean=1,         # 0: use the sum of the context word vectors; 1: use their mean (CBOW only)
         iter=5,              # number of training epochs
         null_word=0,
         trim_rule=None,      # vocabulary trimming rule; None means simply keep words with count >= min_count
         sorted_vocab=1,      # sort the vocabulary by descending frequency
         batch_words=10000,   # number of words per training batch
         compute_loss=False,
         callbacks=())
```
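To make `window` and `min_count` concrete, here is a stdlib-only toy sketch (not gensim's actual implementation; the function name `context_pairs` is made up for illustration). It filters the vocabulary by `min_count` and then emits (center, context) pairs within the window, which is the raw material both CBOW and skip-gram train on. Note that gensim actually samples a random radius up to `window` per word; for clarity this sketch always uses the full radius.

```python
from collections import Counter

def context_pairs(tokens, window=2, min_count=1):
    """Toy illustration of two Word2Vec parameters:
    min_count drops words rarer than the threshold, and
    window bounds the distance between a center word and
    the context words paired with it."""
    counts = Counter(tokens)
    kept = [t for t in tokens if counts[t] >= min_count]
    pairs = []
    for i, center in enumerate(kept):
        lo, hi = max(0, i - window), min(len(kept), i + window + 1)
        pairs.extend((center, kept[j]) for j in range(lo, hi) if j != i)
    return pairs

sentence = "the cat sat on the mat".split()
# window=1: each word is paired only with its immediate neighbors
print(context_pairs(sentence, window=1))
# min_count=2: only "the" appears twice, so every other word is dropped
print(context_pairs(sentence, window=1, min_count=2))
```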
2. Kaggle Movie Review Walkthrough
- Import the required modules
```python
import pandas as pd
import numpy as np
from gensim.models import word2vec
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
import nltk.data
import re
```
- A look at the training data
```python
train = pd.read_csv('../Bag of Words Meets Bags of Popcorn/labeledTrainData.tsv/labeledTrainData.tsv',
                    header=0, delimiter='\t', quoting=3)
print(train.head())  # first 5 rows
print(train.tail())  # last 5 rows
```
Result:
```
         id  sentiment                                             review
0  "5814_8"          1  "With all this stuff going down at the moment ...
1  "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"          0  "The film starts with a manager (Nicholas Bell...
3  "3630_4"          0  "It must be assumed that those who praised thi...
4  "9495_8"          1  "Superbly trashy and wondrously unpretentious ...
             id  sentiment                                             review
24995  "3453_3"          0  "It seems like more consideration has gone int...
24996  "5064_1"          0  "I don't believe they made this film. Complete...
24997  "10905_3"         0  "Guy is a loser. Can't get girls, needs to bui...
24998  "10194_3"         0  "This 30 minute documentary Buñuel made in the...
24999  "8478_8"          1  "I saw this movie as a child and it broke my h...
```
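A note on `quoting=3` in the `read_csv` call above: 3 is the value of `csv.QUOTE_NONE`, which tells the parser to treat quote characters as literal text. That matters here because each review field begins and ends with a double quote that belongs to the data. A small stdlib check (the sample line is abbreviated from the output above):

```python
import csv
import io

# quoting=3 passed to pandas.read_csv is csv.QUOTE_NONE
print(csv.QUOTE_NONE)  # 3

sample = 'id\tsentiment\treview\n"5814_8"\t1\t"With all this stuff..."\n'
reader = csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE)
rows = list(reader)
# The surrounding quotes are preserved as part of each field
print(rows[1])
```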