1. Word2Vec Parameters Explained
Word2Vec is a ready-made module in gensim; the name gensim is short for "generate similar".
This article assumes basic familiarity with word vectors. The parameters:
```python
from gensim.models import Word2Vec

# All values below are the defaults
Word2Vec(sentences=None,      # a list of tokenized sentences, or a streamed corpus
         size=100,            # dimensionality of the feature vectors
         alpha=0.025,         # initial learning rate
         window=5,            # maximum distance between the current and predicted word within a sentence
         min_count=5,         # ignore words whose total frequency is below this
         max_vocab_size=None, # cap on vocabulary size during building; None = no limit
         sample=0.001,        # threshold for random downsampling of high-frequency words
         seed=1,              # random seed
         workers=3,           # number of worker threads
         min_alpha=0.0001,    # floor that the decaying learning rate drops to
         sg=0,                # training algorithm: sg=1 skip-gram, sg=0 CBOW
         hs=0,                # hs=1: hierarchical softmax; hs=0 (with negative > 0): negative sampling
         negative=5,          # if > 0, use negative sampling with this many "noise words" (usually 5-20); 0 disables it
         cbow_mean=1,         # 0: use the sum of the context word vectors; 1: use their mean (CBOW only)
         iter=5,              # number of training epochs
         null_word=0,
         trim_rule=None,      # vocabulary trimming rule; None means simply keep words with count >= min_count
         sorted_vocab=1,      # sort the vocabulary by descending frequency
         batch_words=10000,   # number of words per training batch
         compute_loss=False,
         callbacks=())
```
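To make `window` and `min_count` concrete, here is a stdlib-only toy sketch (not gensim's actual implementation; the function name `context_pairs` is made up for illustration). It filters the vocabulary by `min_count` and then emits (center, context) pairs within the window, which is the raw material both CBOW and skip-gram train on. Note that gensim actually samples a random radius up to `window` per word; for clarity this sketch always uses the full radius.

```python
from collections import Counter

def context_pairs(tokens, window=2, min_count=1):
    """Toy illustration of two Word2Vec parameters:
    min_count drops words rarer than the threshold, and
    window bounds the distance between a center word and
    the context words paired with it."""
    counts = Counter(tokens)
    kept = [t for t in tokens if counts[t] >= min_count]
    pairs = []
    for i, center in enumerate(kept):
        lo, hi = max(0, i - window), min(len(kept), i + window + 1)
        pairs.extend((center, kept[j]) for j in range(lo, hi) if j != i)
    return pairs

sentence = "the cat sat on the mat".split()
# window=1: each word is paired only with its immediate neighbors
print(context_pairs(sentence, window=1))
# min_count=2: only "the" appears twice, so every other word is dropped
print(context_pairs(sentence, window=1, min_count=2))
```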
2. Kaggle Movie Review Walkthrough
- Import the required modules
```python
import pandas as pd
import numpy as np
from gensim.models import word2vec
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
import nltk.data
import re
```
- A look at the training data
```python
train = pd.read_csv('../Bag of Words Meets Bags of Popcorn/labeledTrainData.tsv/labeledTrainData.tsv',
                    header=0, delimiter='\t', quoting=3)
print(train.head())  # first 5 rows
print(train.tail())  # last 5 rows
```
Result:
```
         id  sentiment                                             review
0  "5814_8"          1  "With all this stuff going down at the moment ...
1  "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"          0  "The film starts with a manager (Nicholas Bell...
3  "3630_4"          0  "It must be assumed that those who praised thi...
4  "9495_8"          1  "Superbly trashy and wondrously unpretentious ...
             id  sentiment                                             review
24995  "3453_3"          0  "It seems like more consideration has gone int...
24996  "5064_1"          0  "I don't believe they made this film. Complete...
24997  "10905_3"         0  "Guy is a loser. Can't get girls, needs to bui...
24998  "10194_3"         0  "This 30 minute documentary Buñuel made in the...
24999  "8478_8"          1  "I saw this movie as a child and it broke my h...
```
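A note on `quoting=3` in the `read_csv` call above: 3 is the value of `csv.QUOTE_NONE`, which tells the parser to treat quote characters as literal text. That matters here because each review field begins and ends with a double quote that belongs to the data. A small stdlib check (the sample line is abbreviated from the output above):

```python
import csv
import io

# quoting=3 passed to pandas.read_csv is csv.QUOTE_NONE
print(csv.QUOTE_NONE)  # 3

sample = 'id\tsentiment\treview\n"5814_8"\t1\t"With all this stuff..."\n'
reader = csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE)
rows = list(reader)
# The surrounding quotes are preserved as part of each field
print(rows[1])
```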