Kaggle movie review sentiment analysis


  • Kaggle is nothing to be afraid of.
  • Simple algorithms can be very effective; plain logistic regression goes a long way.
  • Data preprocessing and feature engineering matter a great deal.

Kaggle competition website:
https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews

1. Import the data set

import pandas as pd

# Load the training and test sets (tab-separated files from the competition page)
data_train = pd.read_csv('./train.tsv', sep='\t')
data_test = pd.read_csv('./test.tsv', sep='\t')

# Preview the first rows and the shape of the training set
data_train.head()
data_train.shape
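
The Sentiment column holds five integer labels, from 0 (negative) to 4 (positive). A quick look at how the classes are distributed (my addition, not in the original post):

# Count how many phrases fall into each of the five sentiment classes
data_train['Sentiment'].value_counts().sort_index()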

2. Building the corpus

# Extract the text column from the training set
train_sentences = data_train['Phrase']

# Extract the text column from the test set
test_sentences = data_test['Phrase']

# Build a corpus: use pandas' concat to merge the training and test text
sentences = pd.concat([train_sentences, test_sentences])

# Size of the merged corpus
sentences.shape

# Extract the sentiment labels from the training set
label = data_train['Sentiment']

# Load the stop-word list
with open('./stop_words.txt', encoding='utf-8') as f:
    stop_words = f.read().splitlines()

3. Feature Engineering

Text feature engineering can be done with a bag-of-words model, a TF-IDF model, or a word2vec model. This post uses the first two.
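
word2vec is not used in the rest of this post, but for completeness, here is a minimal sketch of one common approach, averaging word vectors per phrase, assuming gensim 4.x is installed; all hyperparameter values below are illustrative only:

from gensim.models import Word2Vec
import numpy as np

# Tokenize each phrase by whitespace (a crude tokenizer, for illustration)
tokenized = [s.split() for s in sentences]

# Train word vectors on the corpus; the hyperparameters here are illustrative
w2v = Word2Vec(tokenized, vector_size=100, window=5, min_count=2, workers=4)

def phrase_vector(tokens):
    # Average the vectors of the in-vocabulary tokens of one phrase
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

# Stack one averaged vector per phrase into a dense feature matrix
X_w2v = np.vstack([phrase_vector(t) for t in tokenized])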

3.1 Build a vectorizer
Choose one of the two:
Bag-of-words model

from sklearn.feature_extraction.text import CountVectorizer
co = CountVectorizer(
    analyzer='word',        # split the text into words
    ngram_range=(1, 4),     # use 1-grams up to 4-grams as features
    stop_words=stop_words,  # drop words from the stop-word list
    max_features=15000      # keep only the 15,000 most frequent features
)

TF-IDF model

from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(
    analyzer='word',        # split the text into words
    ngram_range=(1, 4),     # use 1-grams up to 4-grams as features
    max_features=150000     # keep only the 150,000 most frequent features
)
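
Either way, ngram_range=(1, 4) means each feature is a run of 1 to 4 consecutive words. A tiny made-up example shows what ends up in the vocabulary (get_feature_names_out needs sklearn 1.0+; older versions use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

demo = CountVectorizer(analyzer='word', ngram_range=(1, 4))
demo.fit(['a truly great movie', 'a truly awful movie'])

# The vocabulary now contains unigrams, bigrams, trigrams and 4-grams
print(demo.get_feature_names_out())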

3.2 Fit the vectorizer
Choose one of the two. Fitting on the merged corpus gives the vectorizer a vocabulary that covers both the training and the test phrases:

co.fit(sentences)
tf.fit(sentences)

3.3 Data set split

Randomly split the training set into new training and validation sets

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train_sentences, label, random_state=1234)
# - x_train: training data      (like homework exercises)
# - x_test:  validation data    (like mock-exam questions)
# - y_train: training labels    (like the homework answers)
# - y_test:  validation labels  (like the mock-exam answers)
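
train_test_split defaults to a 75/25 split. A variant that makes the fraction explicit and stratifies by label, so both splits keep the same class proportions (my addition, not in the original):

x_train, x_test, y_train, y_test = train_test_split(
    train_sentences, label,
    test_size=0.25,     # explicit 25% validation share (the default)
    stratify=label,     # preserve the 5-class proportions in both splits
    random_state=1234
)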

3.4 Feature engineering of the split training set and validation set
Choose one of the two:
Use the bag-of-words model to transform the training and validation sets into feature vectors.

x_train = co.transform(x_train)
x_test = co.transform(x_test)
# Inspect one sample from the training set
x_train[1]

Or use the TF-IDF model to transform the training and validation sets into feature vectors.

x_train = tf.transform(x_train)
x_test = tf.transform(x_test)
x_train[1]
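
Both vectorizers return SciPy sparse matrices, so x_train[1] prints a sparse-row summary rather than plain numbers. To look at the actual contents (my addition):

# Overall shape: (number of phrases, number of features)
print(x_train.shape)

# Non-zero entries in one row, and that row as a dense array
print(x_train[1].nnz)
print(x_train[1].toarray())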

4. Build the classifiers

4.1 Multinomial naive Bayes classifier

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(x_train, y_train)
print("Validation accuracy with bag-of-words features and sklearn's default multinomial naive Bayes classifier:", classifier.score(x_test, y_test))

4.2 Logistic regression classifier

from sklearn.linear_model import LogisticRegression
# Instantiate a logistic regression classifier
lg1 = LogisticRegression()
# Train the model
lg1.fit(x_train, y_train)
# Prediction accuracy on the validation set
print("Validation accuracy with bag-of-words features and sklearn's default logistic regression classifier:", lg1.score(x_test, y_test))
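
As a quick sanity check, a new raw phrase can be scored by pushing it through the same fitted vectorizer first (the phrase is made up; swap co for tf if you chose TF-IDF):

# Vectorize a new phrase with the already-fitted vectorizer, then predict
sample = co.transform(['A gorgeous, heartfelt film'])
print(lg1.predict(sample))   # one of the five labels, 0 (negative) to 4 (positive)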

4.3 Logistic regression classifier with two extra parameters

lg2 = LogisticRegression(C=3, dual=True, solver='liblinear')

Hyperparameter grid search with GridSearchCV
Adding the C and dual parameters to the logistic regression raises the prediction accuracy on the validation set, but trying the combinations by hand, one at a time, is tedious.

sklearn provides a powerful grid search facility for running hyperparameter experiments in batch.
Search space: C from 1 to 9; for each C, try both dual=True and dual=False.

At the end, the search reports the parameter combination that gives the model the highest cross-validated accuracy.

from sklearn.model_selection import GridSearchCV
param_grid = {
    'C': range(1, 10),
    'dual': [True, False]
}

# dual=True is only supported by the liblinear solver
lgGS = LogisticRegression(solver='liblinear')
grid = GridSearchCV(lgGS, param_grid=param_grid, cv=3, n_jobs=-1)
grid.fit(x_train, y_train)

# Best parameter combination
grid.best_params_

# Get the best model
lg_final = grid.best_estimator_
print("Validation accuracy of the logistic regression model with the best hyperparameters found by grid search:", lg_final.score(x_test, y_test))
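
To see how every combination fared rather than only the winner, the full search log can be viewed as a table (my addition):

# One row per parameter combination, with its mean cross-validated accuracy
results = pd.DataFrame(grid.cv_results_)
print(results[['param_C', 'param_dual', 'mean_test_score']])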

5. Make predictions on the test set

# Vectorize the test-set text with TF-IDF (use whichever vectorizer the classifier was trained on)
test_X = tf.transform(data_test['Phrase'])

# Predict the test-set phrases with the lg_final logistic regression classifier
predictions = lg_final.predict(test_X)

# Inspect the predictions
predictions

# Attach the predictions to the test set
data_test.loc[:, 'Sentiment'] = predictions
data_test.head()

6. Format the results as required by the Kaggle competition site

# loc selects columns by index label:
final_data = data_test.loc[:, ['PhraseId', 'Sentiment']]
final_data.head()

# Save as a .csv file; this is the final submission
final_data.to_csv('final_data.csv', index=None)

Open question

lg2 = LogisticRegression(C=3, dual=True)

This statement raises an error. The reason: sklearn only implements the dual formulation for the liblinear solver (with an L2 penalty), and the default solver (lbfgs in recent versions) rejects dual=True. Passing solver='liblinear' explicitly, as in section 4.3, fixes it.


Source: blog.csdn.net/Cirtus/article/details/109320548