The road of data analysis and learning (9): Labeling message data with sentiment is actually very simple

        I've been busy looking for a job recently, so I have rarely found time to update my blog. Some time ago I carefully studied an article on using machine learning to predict sentiment classes, which is an application of the most basic classification models in data analysis, and found it very interesting. Building on that article, I improved the approach and the code, adding a visualization of the data and getting a good prediction result (mainly by tuning the parameters). This post draws on an article about text sentiment classification from the WeChat public account R language Chinese community.

        Speaking of text analysis and sentiment classification, the customer service system I work on has a related business requirement: judge how friendly customers are from the chat data between agents and customers, label customers based on those conversations, and predict the labels of new customers. Since I have access to production data from some customers in my actual work, I have experimented in this area. However, because that data is confidential, this article uses the public message data of a hotel as an example; the analysis approach is the same.

Analysis approach

        All message data is stored in one Excel file with two sheets: the first sheet is train_data and the second is test_data. Features have to be selected over both sheets together, so the two sheets are merged before feature selection; only train_data is needed to train the model. Before features can be selected, though, there is one important step. Each row in a sheet is a long sentence that cannot be analyzed directly, so it has to be segmented into words, and the important words are then chosen as features. The two key preprocessing steps are therefore word segmentation and feature vector selection.

Word segmentation

        There are many ways to segment words. Python has the jieba package, which handles this very conveniently. jieba offers three segmentation modes: full mode ( jieba.cut(text, cut_all=True) ), precise mode, which is the default ( jieba.cut(text, cut_all=False) ), and search engine mode ( jieba.cut_for_search(text) ). Choose the mode and parameters that suit the task; I used the default precise mode in this project. When segmenting, custom words need to be set according to the business (for example, in my fund industry "fund net value" is a fixed term, and without a custom word it is easily split into two words, "fund" and "net value"). In addition, common words that carry no meaning need to be listed. They contribute nothing to feature selection or model building and only increase the program's running time; these are called stop words.
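To make the effect of custom words and stop words concrete, here is a minimal sketch using a toy greedy longest-match segmenter. This is not how jieba works internally; the function segment, the tiny vocabulary, and the sample sentence are all made up for illustration.

```python
def segment(text, vocabulary):
    """Toy segmenter: at each position take the longest word found in the vocabulary.

    Characters that match nothing are emitted one at a time.
    """
    max_len = max(len(word) for word in vocabulary)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


base_vocab = {"基金", "净值", "查询"}   # hypothetical dictionary: fund, net value, query
stop_words = {"的"}                      # hypothetical stop word list
text = "基金净值的查询"

# Without the custom word, "fund net value" is split into two words.
without_custom = segment(text, base_vocab)

# Adding the custom word keeps the business term in one piece.
with_custom = segment(text, base_vocab | {"基金净值"})

# Stop words are dropped after segmentation.
filtered = [token for token in with_custom if token not in stop_words]

print(without_custom)
print(with_custom)
print(filtered)
```

With jieba itself, the same two ideas correspond to jieba.add_word for custom words and a plain membership check against the stop word list after calling jieba.cut.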

Feature vector

        Feature selection means that, once word segmentation is complete, we need to determine which words appear most frequently and how much influence they have on a message. Influence here means that a certain word, or a few words, can indicate which category a message is more likely to belong to. For example, if a message contains "poor service attitude", the feature "poor attitude" makes it more probable that the message is classified as a negative review. So how are the feature vectors defined and calculated? This exercise uses TF-IDF weights to construct the document-term matrix.

        There are many introductions to TF-IDF (term frequency - inverse document frequency) on the Internet, so I won't go into detail. TF is the proportion of a sentence's words that a given word accounts for, a measure at the level of the word itself. IDF is the logarithm of the total number of sentences divided by the number of sentences containing the word, a measure at the level of the sentences. Multiplying the two values gives the weight used in our feature vector.
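The definition above can be worked through by hand on a few toy messages. The sketch below is only an illustration with made-up, already segmented data; the helper names tf, idf, and tf_idf are hypothetical.

```python
import math

# Three toy "messages", already segmented into words (made-up data).
docs = [
    ["服务", "态度", "差"],   # service, attitude, poor
    ["服务", "很", "好"],     # service, very, good
    ["房间", "干净"],         # room, clean
]


def tf(word, doc):
    """Share of the words in one message that are `word`."""
    return doc.count(word) / len(doc)


def idf(word, docs):
    """log(total number of messages / number of messages containing `word`)."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)


def tf_idf(word, doc, docs):
    """Weight of `word` in `doc`: term frequency times inverse document frequency."""
    return tf(word, doc) * idf(word, docs)


common = tf_idf("服务", docs[0], docs)   # appears in 2 of the 3 messages
rare = tf_idf("差", docs[0], docs)       # appears in only 1 of the 3 messages
print(round(common, 4), round(rare, 4))
```

The rarer word gets the larger weight, which is the point of the IDF factor. Note that sklearn's TfidfVectorizer, by default, uses a smoothed IDF and normalizes each row, so its numbers will differ from this textbook version.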

        So far, I have completed the construction of features and weights, and the next step is to train the model with train_data and make predictions.

Implementation process

Data acquisition

        The first step is data acquisition. The training data and the test data are read separately, and because the feature words have to be selected over all messages when building the model, the two parts are then merged.

import pandas as pd

# Newer pandas versions spell this argument sheet_name (older ones used sheetname)
evaluation = pd.read_excel('C:/Users/Nekyo/DA/HelloBI/Hotel Evaluation.xlsx', sheet_name=0)     # training data
evaluation1 = pd.read_excel('C:/Users/Nekyo/DA/HelloBI/Hotel Evaluation.xlsx', sheet_name=1)    # test data
evaluation_new = pd.concat([evaluation, evaluation1])       # combine the two parts, then select the features

Word segmentation

        The next step is to set the custom words and stop words and to segment the messages. Segmentation produces some useless tokens such as "333" and "n", so I use a regular expression to keep only the Chinese words.

import re
import jieba

'''Read custom words; depending on the business scenario, new words can be added to this file'''
with open('C:/Users/Nekyo/DA/HelloBI/all_words.txt') as words:
    my_words = [i.strip() for i in words.readlines()]
for word in my_words:
    jieba.add_word(word)
'''Read stop words'''
with open('C:/Users/Nekyo/DA/HelloBI/mystopwords.txt') as words:
    stop_words = [i.strip() for i in words.readlines()]
'''Word segmentation function'''
def cut_word(sentence):
    words = [i for i in jieba.cut(sentence) if i not in stop_words]
    result = ' '.join(words)                                   # segmented sentence with stop words removed
    return result
'''Process the message data in the dataset and return a list of strings containing only Chinese words'''
def get_sentence(evaluation):
    pattern = re.compile(u"[\u4e00-\u9fa5]+")                  # matches runs of Chinese characters
    swords = []
    for eva in evaluation['Content']:
        words = cut_word(eva)
        result = re.findall(pattern, words)                    # after segmentation, keep only Chinese words
        swords.append(' '.join(result))
    return swords

        This gives the list of segmented messages, which is the key result of this step and will be used later:

s_words=get_sentence(evaluation_new)

Visual Analysis

        Next, let's visualize the words we obtained. A word cloud shows the frequency of all the words very intuitively, and from these words we can roughly judge the overall tone of the messages.

from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud

'''Visualize the word cloud'''
def set_wordcloud(s_words):
    all_words = []
    for sentence in s_words:
        all_words.extend(str(sentence).split(' '))
    words_count = Counter(all_words)                           # word cloud needs a dict: {'xx':100,'yy':200,'zz':300}
    words_count.pop('', None)                                  # empty strings can appear after splitting; drop them
    wordcloud = WordCloud(font_path='C:/Users/Nekyo/tools/SOFTWARE/Anaconda2/Library/lib/fonts/songti.ttf',
                          width=1000, height=600, background_color='white')
    wordcloud.fit_words(words_count)
    plt.imshow(wordcloud)
    plt.axis('off')                                            # hide the axes around the image
    plt.show()

            

Feature weights

        As the word cloud shows, positive words make up the vast majority, which indicates that the hotel has left a very good impression on its customers; comparatively more customers gave it a high rating. The next step is to build a model and test how well it predicts.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

'''Build the TF-IDF weighted document-term matrix'''
def setModel(s_words):
    tfidf = TfidfVectorizer(max_features=100)                 # keep the 100 most frequent terms as features
    dtm = tfidf.fit_transform(s_words).toarray()
    columns = tfidf.get_feature_names_out()                   # get_feature_names() in scikit-learn before 1.0
    X_data = pd.DataFrame(dtm, columns=columns)               # convert the matrix to a data frame
    return X_data
X = setModel(s_words)
y = evaluation.Emotion                                        # emotion label variable
X_train, X_test, y_train, y_test = train_test_split(
    X[:3890], y, train_size=0.8, random_state=1)              # X[:3890] is the aforementioned train_data

Analytical model

        The last step is to build a model on the training data X[:3890] and use it to classify the test data X[3890:], by calling Python's powerful sklearn package. I tried three models, naive Bayes, random forest, and SVM, and found that the random forest predicts really well, with an accuracy that can reach 80%. The SVM model performs very poorly in this project, and its overfitting is too severe. So in practice we need to try the prediction quality and stability of several models for each business setting and choose the appropriate one.

from sklearn import metrics

print('################################## Naive Bayes Model ##################################')
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
fit = nb.fit(X_train, y_train)
pred = fit.predict(X_test)                                   # X_test is the 20% held out from train_data
accuracy = metrics.accuracy_score(y_test, pred)
print('The accuracy rate on the held-out part of the training set is %-8.6f%%' % (accuracy*100))
pred1 = fit.predict(X[3890:])
print('Predicted classification result:\n', pred1)
accuracy1 = metrics.accuracy_score(evaluation1.Emotion, pred1)
print('The accuracy rate on the test set is %-8.6f%%' % (accuracy1*100))
print('################################## Random Forest ##################################')
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=150, n_jobs=-1,     # parameter tuning is manual work
                            oob_score=True, max_depth=150, min_samples_split=2, min_samples_leaf=1)
rfit = rf.fit(X_train, y_train)
rpred = rfit.predict(X_test)
raccuracy = metrics.accuracy_score(y_test, rpred)
print('The accuracy rate on the held-out part of the training set is %-8.6f%%' % (raccuracy*100))
rpred1 = rfit.predict(X[3890:])
print('Predicted classification result:\n', rpred1)
raccuracy1 = metrics.accuracy_score(evaluation1.Emotion, rpred1)
print('The accuracy rate on the test set is %-8.6f%%' % (raccuracy1*100))
print('################################## SVM ##################################')
from sklearn.svm import SVC, LinearSVC
#svc = LinearSVC()   # different SVM models can be selected
svc = SVC(kernel='sigmoid')
#svc = SVC(kernel='poly', degree=3)
sfit = svc.fit(X_train, y_train)
spred = sfit.predict(X_test)
saccuracy = metrics.accuracy_score(y_test, spred)
print('The accuracy rate on the held-out part of the training set is %-8.6f%%' % (saccuracy*100))
spred1 = sfit.predict(X[3890:])
print('Predicted classification result:\n', spred1)
saccuracy1 = metrics.accuracy_score(evaluation1.Emotion, spred1)
print('The accuracy rate on the test set is %-8.6f%%' % (saccuracy1*100))

               

Model evaluation

        Finally, I use the ROC curve to evaluate the model. Why is this step necessary? Because what we get from the predictions is an accuracy rate, and accuracy alone is not enough to say that a model is good. As an extreme example, suppose a data set has two categories, with 990 samples in category A and 10 in category B, and a model predicts category A for everything. Its accuracy is 99%, but does that show the model is very good? Obviously not. In such cases, besides accuracy, criteria such as precision, recall, and F1-score are also needed to evaluate the model comprehensively.
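The 990 versus 10 example can be checked in a few lines. This is a minimal sketch in plain Python; the helper binary_metrics is a hypothetical name, and a precision with no predicted positives is reported as 0.0 here by convention.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0   # no predicted positives: report 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = ["A"] * 990 + ["B"] * 10   # 990 samples of category A, 10 of category B
y_pred = ["A"] * 1000               # a "model" that always answers A

result = binary_metrics(y_true, y_pred, positive="B")
print(result)
```

Accuracy comes out at 0.99 while recall and F1 for category B are both 0, which is exactly why accuracy cannot be trusted on its own. In a real project, sklearn.metrics provides precision_score, recall_score, and f1_score for the same purpose.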

        The ROC curve (there are many introductions online, so I won't go into detail here) is a way of evaluating a model comprehensively. The horizontal axis, FPR, is the false positive rate, and the vertical axis, TPR, is the true positive rate. The area under the curve (AUC) can be read as the probability that the model ranks a randomly chosen positive sample above a randomly chosen negative one. It lies between 0 and 1 and is generally above 0.5; the larger it is, the better the model predicts. Evaluating the random forest model this way gives a value of 0.7673, which together with the accuracy above shows that this model classifies reasonably well.
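For readers who want to see where the ROC points and the area under the curve come from, here is a small hand-rolled sketch on made-up labels and scores. The function names are hypothetical, and this is not the code that produced the 0.7673 above.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by lowering the decision threshold step by step."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 1)
        fp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))


def auc_by_pairs(labels, scores):
    """Share of (positive, negative) pairs where the positive gets the higher score."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = positive review, scores = predicted probability of "positive".
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = trapezoid_area(points)
pairwise = auc_by_pairs(labels, scores)
print(points)
print(area, pairwise)
```

Both calculations give the same number, which illustrates the ranking interpretation of the area. With sklearn, the usual route is to take class probabilities from the model's predict_proba and pass them to sklearn.metrics.roc_curve and roc_auc_score.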

        

Epilogue

        This article is really just a simple case of text analysis + machine learning + classification prediction, and this is the end of the walkthrough. There is still a lot to think about, though, namely where the implementation could be optimized. For example: 1. The feature selection method needs improving. When a new message arrives, segmentation may produce a new word that the original model has never seen; 2. The prediction procedure is too simple. A cross-validation step should be added so that the bias and variance of the model are balanced and the model is more robust.
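As a rough idea of what the cross-validation step in point 2 involves, here is a minimal k-fold sketch in plain Python. Everything in it is an assumption for illustration: the helper names, the toy labels, and the trivial majority-class "model" that stands in for a real classifier.

```python
import random


def k_fold_indices(n_samples, k, seed=1):
    """Yield (train_indices, valid_indices) for each of k folds over shuffled indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_validate(X, y, fit, predict, k=5):
    """Train on k-1 folds, score accuracy on the remaining fold, repeat k times."""
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in valid_idx])
        correct = sum(1 for p, i in zip(preds, valid_idx) if p == y[i])
        scores.append(correct / len(valid_idx))
    return scores


# Stand-in "model": always predict the most common label seen during training.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, X_valid):
    return [model] * len(X_valid)


X_toy = list(range(20))        # placeholder features
y_toy = [1] * 15 + [0] * 5     # 15 positive and 5 negative toy labels

scores = cross_validate(X_toy, y_toy, fit_majority, predict_majority, k=5)
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

Looking at the spread of the per-fold scores, not just their mean, gives a feel for how stable a model is. With sklearn, sklearn.model_selection.cross_val_score performs the same loop for any of the three classifiers used above.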

