[R] Text Mining Learning Notes, Day 4: Sentiment Analysis

1. Objective: To analyze tweets mentioning Apple and determine, as accurately as possible, their sentiment toward Apple (positive, negative, or other).

 

2. Data source: Twitter API. The dependent variable was constructed via Amazon Mechanical Turk; the independent variables are derived from the tweet text.

Amazon Mechanical Turk: Amazon Mechanical Turk is a crowdsourcing marketplace where individuals and businesses can outsource tasks that computers cannot yet perform with artificial intelligence. As one of the world's largest crowdsourcing markets, Mechanical Turk provides an on-demand, scalable workforce, connecting start-ups, enterprises, researchers, artists, well-known technology companies, government agencies, and individuals to solve problems in computer vision, machine learning, and natural language processing.

 

tweets <- read.csv("tweets.csv", stringsAsFactors=FALSE) 
View(tweets)

  

 

str(tweets) # View the data structure

  

 

 

Create the dependent variable

tweets$Negative <- as.factor(tweets$Avg <= -1)
table(tweets$Negative)
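To illustrate how the threshold builds the factor, here is the same expression applied to a few made-up Avg scores (the values below are hypothetical, not from the dataset):

```r
# Hypothetical average sentiment scores for four tweets
avg <- c(2.0, -1.5, 0.0, -1.0)

# A tweet is labelled negative when its average score is at most -1
negative <- as.factor(avg <= -1)
negative
# [1] FALSE TRUE  FALSE TRUE
# Levels: FALSE TRUE
```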

 

 

 

 

3. Data pre-processing: the bag-of-words model

 

Create the corpus

library(tm)
corpus <- VCorpus(VectorSource(tweets$Tweet))
# VCorpus() creates a volatile (in-memory) corpus.
# VectorSource() interprets each element of the vector as a document.

# Look at corpus
corpus
corpus[[1]]$content

# output: [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"

  

3.1 Convert to lower case (remove irregularities)

# Convert to lower-case
corpus <- tm_map(corpus, content_transformer(tolower))
corpus[[1]]$content

# output: [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"

  

3.2 Remove punctuation

# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
corpus[[1]]$content

# output: [1] "i have to say apple has by far the best customer care service i have ever received apple appstore"

  

However, punctuation sometimes carries meaning. On Twitter, for example, @apple marks a message addressed to Apple, while #apple tags the tweet as being about Apple; punctuation is also essential in URLs.
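A minimal sketch of one way to keep that information: rewrite '@' and '#' into plain-text markers before stripping punctuation. The helper name and marker strings are my own invention for illustration, not part of the original analysis:

```r
# Hypothetical helper: protect Twitter '@' and '#' prefixes by rewriting
# them as plain words, so removePunctuation() cannot destroy them
keepTwitterMarks <- function(x) {
  x <- gsub("@(\\w+)", "AT\\1", x)    # "@apple" -> "ATapple"
  x <- gsub("#(\\w+)", "HASH\\1", x)  # "#apple" -> "HASHapple"
  x
}

keepTwitterMarks("great service @apple #apple")
# [1] "great service ATapple HASHapple"
```

It could then be applied before the punctuation step with `corpus <- tm_map(corpus, content_transformer(keepTwitterMarks))`.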

 

3.3 Remove stop words

Stop words are words that are unlikely to improve the accuracy of a machine-learning model, for example: the, is, at, and which. Removing them reduces the size of the data.

# Look at stop words 
stopwords("english")[1:10]

# output: [1] "i"         "me"        "my"        "myself"    "we"        "our"       "ours"     "ourselves"     "you"       "your"


# Remove stop words and "apple"
corpus <- tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus[[1]]$content

# output: [1] " say far best customer care service ever received appstore"

 

However, removing stop words has potential pitfalls: two stop words together can carry important meaning. For example, 'The Who', composed of two stop words, is the name of a band.
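One possible workaround, sketched with a tiny hand-made sample standing in for stopwords("english"): drop the problematic entries from the list before removal.

```r
# Tiny illustrative sample standing in for stopwords("english")
stops <- c("i", "the", "who", "is", "at", "which")

# Keep "the" and "who" so that phrases like "The Who" survive
myStops <- setdiff(stops, c("the", "who"))
myStops
# [1] "i"     "is"    "at"    "which"
```

The reduced list would then replace the full one, e.g. `corpus <- tm_map(corpus, removeWords, c("apple", myStops))`.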

 

3.4 Stemming (keeping word roots)

When different endings are attached to the same root without changing its meaning, we can keep only the stem and remove the redundancy. For example, argue, argued, argues, and arguing all express the same meaning but would otherwise be counted as different words, so stemming replaces them all with the common root argu.

# Stem document 
corpus <- tm_map(corpus, stemDocument)
corpus[[1]]$content

# output: [1] "say far best custom care servic ever receiv appstor"

 

Similarly, the drawback of this approach is that some words share a root but have different meanings under different endings; collapsing them harms prediction accuracy.

 

3.5 Create the document-term matrix

The result is a numerical matrix in which each row represents one observation (a tweet), each column corresponds to a word appearing in the tweets, and each cell records how many times that word occurs in that tweet.
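The idea can be shown with a toy base-R version of a document-term matrix for two made-up documents:

```r
# Two toy "tweets"
docs <- c("apple good", "bad apple apple")

# Tokenise and build the vocabulary
tokens <- strsplit(docs, " ")
vocab  <- sort(unique(unlist(tokens)))

# Count each vocabulary word in each document:
# rows = documents, columns = words
dtm <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
dtm
#      apple bad good
# [1,]     1   0    1
# [2,]     2   1    0
```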

# Create frequency matrix
frequencies <- DocumentTermMatrix(corpus)
frequencies

 

 

Thus, frequencies contains 1,181 observations (documents) and 3,289 words (terms).

 

View the matrix

# Look at matrix 
inspect(frequencies[1000:1005,505:515])

 

 

 

View high-frequency terms

# Check for sparsity
findFreqTerms(frequencies, lowfreq=20) # lowfreq=20 returns the terms that appear at least 20 times

 

 

 

Among the 3,289 words, only 56 appear at least 20 times. We can therefore remove the low-frequency terms: they contribute little to the model's predictions, and keeping them would add a great deal of computation and slow down model training.

# Remove sparse terms
sparse <- removeSparseTerms(frequencies, 0.995) # sparsity threshold 0.995: keep only terms that appear in at least 0.5% of the tweets
sparse
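What the 0.995 threshold means in document counts, as a quick back-of-the-envelope check (removeSparseTerms() drops terms whose fraction of empty documents exceeds the threshold):

```r
n_docs <- 1181      # number of tweets in the corpus
threshold <- 0.995  # sparsity threshold passed to removeSparseTerms()

# A term survives only if it appears in more than (1 - 0.995) = 0.5% of documents
min_docs <- ceiling(n_docs * (1 - threshold))
min_docs
# [1] 6
```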

 

 

 

After removing the low-frequency terms, sparse contains only 309 terms.

 

 

Next, convert the matrix to a data frame so it can be used to build predictive models, and make sure the column names of the data frame are valid.

# Convert to a data frame that we'll be able to use for our predictive models.
tweetsSparse <- as.data.frame(as.matrix(sparse))

# Make all variable names R-friendly, since R struggles with variable
# names that start with a number, and some words here probably do
colnames(tweetsSparse) <- make.names(colnames(tweetsSparse))
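A quick look at what make.names() does to awkward column names (the example names are hypothetical):

```r
# Names starting with a digit get an "X" prefix;
# other invalid characters are replaced with dots
make.names(c("300", "new", "can't"))
# [1] "X300"  "new"   "can.t"
```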

 

Add the dependent variable, and split the data into a training set (70%) and a test set (30%).

# Add dependent variable
tweetsSparse$Negative <- tweets$Negative

# Split the data
library(caTools)
set.seed(123)
split <- sample.split(tweetsSparse$Negative, SplitRatio = 0.7)
trainSparse <- subset(tweetsSparse, split==TRUE) # training set
testSparse <- subset(tweetsSparse, split==FALSE) # test set

 

 

4. Build predictive models

 

4.1 CART classification tree

# Build a CART model
library(rpart)
library(rpart.plot)

tweetCART <- rpart(Negative ~ ., data=trainSparse, method="class")
prp(tweetCART)

 

Thus, when a tweet contains any of the words freak, hate, or wtf, it is classified as negative; when it contains none of these three words, it is classified as not negative.

 

Evaluate the model's accuracy:

# Evaluate the performance of the model
predictCART <- predict(tweetCART, newdata=testSparse, type="class")

# Compute accuracy
library(caret)
confusionMatrix(predictCART, testSparse$Negative)

 

 

According to the confusion matrix, the CART model achieves an accuracy of 87.89% on the test set.

 

# Baseline accuracy 
table(testSparse$Negative) # baseline accuracy is about 84.50704% = 300/(300+55)
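The baseline arithmetic, using the class counts from the table above (300 non-negative and 55 negative tweets in the test set):

```r
# Majority-class baseline: always predict "not negative"
baseline <- 300 / (300 + 55)
round(baseline, 4)
# [1] 0.8451
```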

 

Thus, the CART model improves on the baseline accuracy (from 84.51% to 87.89%).

 

Next, tune the complexity of the CART model to further improve its accuracy:

# Tune the complexity parameter (cp)
tweetCART2 <- rpart(Negative ~ ., data=trainSparse, method="class", cp = 0.000001, minsplit = 5,
                    xval = 10)
printcp(tweetCART2)
prune.cart <- prune(tweetCART2, cp = 0.0236220)
prp(prune.cart)

 

According to the improved CART model, when a tweet contains any of the words freak, hate, wtf, or shame, it is classified as negative; when it contains none of these four words, it is not negative.

 

# predict
prune.cart.pred <- predict(prune.cart, newdata = testSparse, type = 'class')
confusionMatrix(prune.cart.pred, testSparse$Negative)

 

 

According to the confusion matrix, the pruned CART model improves the test-set accuracy from 87.89% to 88.17%.

 

 

4.2 Random forest

# Random forest model
library(randomForest)
set.seed(123)
tweetRF <- randomForest(Negative ~ ., data=trainSparse)

# Make predictions:
predictRF <- predict(tweetRF, newdata=testSparse)
confusionMatrix(predictRF, testSparse$Negative)

 

 

 

The random forest achieves an accuracy of 88.45% on the test set. However, it is far less interpretable than the improved CART model, whose accuracy already reaches 88.17%; the improved CART model is therefore preferred.

 

 

4.3 Logistic regression

# Logistic regression model
tweetLog <- glm(Negative ~ ., data = trainSparse, family = 'binomial')
predictions <- predict(tweetLog, newdata=testSparse, type="response")
confusionMatrix(as.factor(predictions>0.5), as.factor(testSparse$Negative))

 

 

 

With a cutoff of 0.5, the model's accuracy is 80.28%, below the baseline accuracy.
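How the 0.5 cutoff turns predicted probabilities into class labels (the probabilities below are made up for illustration; in the analysis they come from predict(..., type = "response")):

```r
# Hypothetical predicted probabilities of a tweet being negative
p <- c(0.12, 0.63, 0.50, 0.91)

# Label a tweet negative only when its probability exceeds the cutoff;
# note that exactly 0.5 falls on the "not negative" side
pred <- factor(p > 0.5, levels = c("FALSE", "TRUE"))
pred
# [1] FALSE TRUE  FALSE TRUE
# Levels: FALSE TRUE
```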

 

In summary, for this case the tuned CART model is recommended.

 

 

 


Origin www.cnblogs.com/shanshant/p/11910894.html