"Youth of you" sentiment analysis Commentary - logic machine learning regression

Background

"Youth of you" is a domestic film whose influence is still considerable. Its box office reached 1.4 billion, which by that measure makes it one of the more successful films produced in the mainland. The leads are Jackson Yee (Yi Yangqianxi) and Zhou Dongyu, which certainly brings some star-traffic effect, but many viewers also rate the plot and the acting highly, my sister among them. The film has also been caught up in accusations of plagiarizing Keigo Higashino's "Journey Under the Midnight Sun" ("White Night") and "The Devotion of Suspect X", which left many fans of the originals dissatisfied. Here I use logistic regression (LogisticRegression) to run a sentiment analysis on some of the short reviews of "Youth of you" and see how viewers rate the film.

Retrieve data

The data comes from the short reviews on the IMDb page for "Youth of you".
Although the page shows 220,200 short reviews, I only crawled 600 of them, but a small sample is sufficient for this exercise.
The crawling process is not hard, so I will not go into it here.

Data processing

Libraries and tools needed

import pandas as pd
import jieba
import re

Tool: Jupyter Notebook

Data cleansing

After reading the data in, the columns are: name, short review (short), rating (rating).
Because the reviews were crawled in two batches, 500 samples and 100 samples, the two sets first have to be combined into a single data set.
The pandas merge method can be used for this.
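The screenshot of this step is not available, so the exact call is unknown. As a minimal sketch, assuming both batches share the same columns, the rows can be stacked with pd.concat (swapped in here for merge, which is geared toward key-based joins). The frames batch_500 and batch_100, and every value in them, are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-ins for the two crawled batches (same columns in both)
batch_500 = pd.DataFrame({
    'name': ['user_a', 'user_b'],
    'short': ['first review', 'second review'],
    'rating': ['allstar50 rating', 'allstar20 rating'],
})
batch_100 = pd.DataFrame({
    'name': ['user_c'],
    'short': ['third review'],
    'rating': ['allstar40 rating'],
})

# Stack the rows and rebuild a clean 0..n-1 index
data = pd.concat([batch_500, batch_100], ignore_index=True)
print(data.shape)
```

ignore_index=True matters here: without it both batches keep their own row labels and the combined frame ends up with duplicate index values.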
The rating column is stored in a list-like format and is hard to read. It is easy to infer that the values 10-50 correspond to five levels, which are the star ratings shown on the page. To make the analysis easier, write a function that maps rating to five grades, 1-5.

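def rating(e):
    # The raw value is list-like; compare on its string form
    e = str(e)
    if '50' in e:
        return 5
    elif '40' in e:
        return 4
    elif '30' in e:
        return 3
    elif '20' in e:
        return 2
    else:
        # 10 and anything unrecognized fall through to 1
        return 1
data['new_rating'] = data['rating'].map(rating)
data.head()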

After running this, the data has a new_rating column.
That raises another question: a review is either positive or negative, but rating has five levels. What to do?
Three-star reviews can be dropped first, since they are most likely neutral. Four-star and five-star reviews are then treated as positive, denoted by 1, and one-star and two-star reviews as negative, denoted by -1.

# .copy() so the new column is added to an independent frame, not a view of data
new_data = data[data['new_rating'] != 3].copy()
new_data['sentiment'] = new_data['new_rating'].apply(lambda x: 1 if x > 3 else -1)
new_data

Only 557 samples remain, which means 43 neutral three-star reviews were dropped.
The ratio of positive to negative reviews is about 3.5:1, so quite a lot of people like the film.
However, this leaves the classes imbalanced, which will affect the model later on.
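To make the imbalance concrete, the arithmetic can be sketched from the counts reported in this post (557 labelled reviews, 122 of them negative). The majority-class baseline is an added reference point, not something the original computes: it is the accuracy of always predicting "positive", and it gives a floor against which any model score can be read. It is computed on the full data set, so the figure for a particular test split may differ slightly.

```python
total = 557      # reviews left after dropping the three-star ones
negative = 122   # reviews with sentiment -1 (count reported later in the post)
positive = total - negative

ratio = positive / negative
baseline_accuracy = positive / total  # accuracy of always predicting +1

print(f'positive={positive}, negative={negative}')
print(f'ratio = {ratio:.2f} : 1')
print(f'majority-class baseline accuracy = {baseline_accuracy:.3f}')
```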

How do we decide whether a short review is positive or negative?
I like you    I hate you
like          hate

A few words in a sentence are enough to judge it, so the next step is to segment every short review into words with the jieba library.
Before segmenting, look at the text first: a lot of the content has no effect on sentiment analysis, such as digits and letters, so it can be removed along with the segmentation.

# Word segmentation: join jieba's tokens with spaces
def cut_word(text):
    text = jieba.cut(str(text), cut_all=False)
    return ' '.join(text)
new_data['new_short'] = new_data['short'].apply(cut_word)
# Remove digits
def remove_num(new_short):
    return re.sub(r'\d+', '', new_short)
# Remove Latin letters (upper and lower case)
def remove_word(new_short):
    return re.sub(r'[a-zA-Z]+', '', new_short)
new_data['new_short'] = new_data['new_short'].apply(remove_num)
new_data['new_short'] = new_data['new_short'].apply(remove_word)

In the segmented text you can already see some words that carry personal feeling, such as dedication, affection, and so on.
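The effect of the cleanup steps can be checked on a single string. This is a minimal sketch on a hand-segmented, made-up review (jieba is not called here); the helper name clean_tokens is hypothetical. The pattern covers both upper-case and lower-case letters, and the function also collapses the extra spaces left behind where tokens were removed.

```python
import re

def clean_tokens(segmented: str) -> str:
    """Drop digits and Latin letters from a space-joined token string."""
    no_digits = re.sub(r'\d+', '', segmented)
    no_letters = re.sub(r'[A-Za-z]+', '', no_digits)
    # Collapse the runs of spaces left where tokens were removed
    return ' '.join(no_letters.split())

# Made-up review, already segmented by hand
sample = '2019 年 最 好 的 电影 MVP 演技 在线'
print(clean_tokens(sample))
```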

Logistic regression modeling

Required libraries

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np
from pandas import DataFrame

Analysis and Modeling

The first step is to split the prepared data into a training set and a test set.

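train_data, test_data = train_test_split(new_data, train_size=0.8, random_state=0)
# Text feature extraction (bag of words); the vocabulary is fitted on the training set only
transfer = CountVectorizer()
train_word = transfer.fit_transform(train_data['new_short'])
test_word = transfer.transform(test_data['new_short'])
# Dense view of the sparse count matrix
print('train_word:\n', train_word.toarray())
# Feature names (use get_feature_names() on scikit-learn older than 1.0)
print('feature_name:\n', transfer.get_feature_names_out())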

The second step is feature extraction: the segmented text is turned into a sparse matrix of word counts, and each column of the matrix corresponds to one feature name (one word of the vocabulary).
The third step is the logistic regression itself: fit the model on the feature values and target values of the training set.

# Features come from the vectorizer; labels come from the same split, so rows stay aligned
x_train, x_test = train_word, test_word
y_train, y_test = train_data['sentiment'], test_data['sentiment']
model = LogisticRegression()
model.fit(x_train, y_train)
y_predict = model.predict(x_test)
print('Element-wise comparison:\n', y_predict == y_test)
score = model.score(x_test, y_test)
print('Model accuracy:\n', score)

Running this prints the comparison of predictions with the true labels, followed by the model accuracy.
The model accuracy is 85.7%, which is a mediocre result.
We can pick a few examples from the test set and check whether the sentiment analysis is correct.

example = test_data[50:55]
example[['short','new_rating','sentiment']]

To read the complete reviews, write a loop that prints each one in full.
Even in the truncated table view the meaning of these reviews comes through. For example, the third one involves plagiarism, and its sentiment is -1.
Logistic regression's predict_proba gives the probability that a review is positive: the closer the probability is to 1, the more likely the short review is praise; likewise, a negative review has a probability close to 0.
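That probability comes from the logistic (sigmoid) function applied to a weighted sum of the word counts. The following is a minimal sketch in plain Python with made-up weights and intercept; a fitted model learns these values from the training set. In scikit-learn the columns of predict_proba follow the order of model.classes_, so with labels -1 and 1, column 1 is the positive class.

```python
import math

def positive_probability(counts, weights, intercept):
    """P(sentiment = +1) = sigmoid(intercept + sum of weight * count)."""
    z = intercept + sum(weights.get(word, 0.0) * n for word, n in counts.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients: positive words push the score up, negative words down
weights = {'like': 1.2, 'acting': 0.8, 'plagiarism': -2.0}
intercept = 0.5

p_good = positive_probability({'like': 1, 'acting': 1}, weights, intercept)
p_bad = positive_probability({'plagiarism': 2}, weights, intercept)
print(round(p_good, 3), round(p_bad, 3))
```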

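# Column 1 of predict_proba is the probability of the positive class (+1)
possibility = model.predict_proba(test_word)[:, 1]
test_data = test_data.copy()  # work on an independent copy before adding a column
test_data['possibility'] = possibility
test_data.head()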

The resulting data has a new possibility column.
Ranking on it gives the five most positive and the five most negative short reviews of "Youth of you".
Printing the full reviews the same way shows that the top five positive reviews are longer and written with more care; most of them say the film reflects a social problem, school bullying. The five most negative reviews all point out that the film is copied, which upset many people.
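The ranking and the full-text printout can be sketched as follows. The original screenshots are missing, so this is one plausible way to do it rather than the author's exact code; the small frame scored is a made-up stand-in for test_data with its possibility column.

```python
import pandas as pd

# Made-up stand-in for test_data after the possibility column was added
scored = pd.DataFrame({
    'short': ['review a', 'review b', 'review c', 'review d'],
    'possibility': [0.97, 0.12, 0.55, 0.03],
})

n = 2  # the post looks at 5
most_positive = scored.nlargest(n, 'possibility')
most_negative = scored.nsmallest(n, 'possibility')

# Print each review in full instead of the truncated table view
for text, p in zip(most_positive['short'], most_positive['possibility']):
    print(f'{p:.2f}  {text}')
for text, p in zip(most_negative['short'], most_negative['possibility']):
    print(f'{p:.2f}  {text}')
```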
The words that appear most often in the short reviews:
( acting like youth or hope ) - these words come mostly from praise; they are positive words and carry weight in the sentiment of a review
( bullying protecting school bullying ) - these words describe the film's background; some of them carry negative emotion, but whether a review is positive or negative has to be judged from the meaning of the whole sentence
( plagiarism ) - this word appears 67 times, while there are only 122 reviews with sentiment -1 in total, so a review that mentions plagiarism is most likely a negative one
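Counts like these can be produced with collections.Counter over the segmented reviews. This is a minimal sketch on made-up data; the helper name word_counts and the sample reviews are hypothetical. Besides raw occurrences it also counts how many reviews contain each word, which may be the fairer number to set against a count of reviews such as the 122 negative ones.

```python
from collections import Counter

def word_counts(segmented_reviews):
    """Return (total occurrences, number of reviews containing each word)."""
    occurrences = Counter()
    review_hits = Counter()
    for review in segmented_reviews:
        tokens = review.split()
        occurrences.update(tokens)
        review_hits.update(set(tokens))  # count each word once per review
    return occurrences, review_hits

# Made-up segmented reviews
reviews = ['抄袭 就是 抄袭', '演技 很 好', '演技 在线 但是 抄袭']
occurrences, review_hits = word_counts(reviews)
print(occurrences.most_common(2))
print(review_hits['抄袭'])
```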

Summary

A film whose box office reaches 1.4 billion must have something unique about it, whether the cast or the subject matter. But a film found to be plagiarized is destined not to be a good film. As a bystander, I cannot judge whether "Youth of you" is plagiarized or not, but one thing is certain: nobody's intellectual property rights may be violated!

Reply "young you" in the backend of the official account "toffee cat" to get the source code and data for reference. Thanks for the support.


Origin blog.51cto.com/14746554/2476358