[R] Language Learning Notes Day 5: Text Mining and Sentiment Analysis ---- Wikipedia Vandalism Detector

 

 

1. BACKGROUND AND PURPOSE: Wikipedia is a free online encyclopedia that anyone can edit and contribute to. It supports multiple languages and keeps growing. The English version of Wikipedia currently has 4.7 million articles, which have been edited more than 760 million times in total. One consequence of allowing anyone to edit is that some people vandalize pages. This can take the form of deleting content, adding promotional or inappropriate content, or making more subtle changes that alter the meaning of an article. With so many articles and so many edits every day, it is difficult for humans to detect every act of vandalism and revert (undo) it. As a result, Wikipedia uses bots (computer programs) that automatically revert edits that look like vandalism. In this exercise, we use text mining to develop a vandalism detector that uses machine learning to distinguish valid edits from vandalism.

 

2. Data Description: The data for this problem is based on the revision history of the Wikipedia page "Language". Wikipedia provides the history of each page, including the state of the page after every revision. Instead of manually reviewing each revision, a script was run to check whether each edit was kept or reverted. If a change was eventually reverted, the revision was labeled as vandalism. This may cause some misclassification, but the script meets our needs.

As part of this preprocessing, some common text-processing tasks have already been completed, including lowercasing and punctuation removal. The data set contains the following variables:

  • Vandal: 1 if this edit was vandalism, 0 otherwise.
  • Minor: 1 if the user marked this edit as a "minor edit", 0 otherwise.
  • Loggedin: 1 if the user made this edit while logged into a Wikipedia account, 0 otherwise.
  • Added: the unique words added in this edit.
  • Removed: the unique words removed in this edit.

Note: Instead of the traditional bag-of-words representation, the data contains the set of unique words that were added or removed. For example, if a word was removed several times in a revision, it appears only once in the "Removed" column.
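To make the distinction concrete, here is a minimal toy sketch (the example strings are made up, not taken from the data set) contrasting bag-of-words counts with the unique-word sets stored in Added and Removed:

words <- c("spam", "spam", "link", "spam")
table(words)   # bag-of-words counts: link appears 1 time, spam appears 3 times
unique(words)  # unique-word set, as stored in the Added/Removed columns: "spam" "link"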

 

3. Application and Analysis

 

# Read data
wiki <- read.csv("wiki.csv", stringsAsFactors = F)
str(wiki)

 

 

 

 

# Convert the "Vandal" column to a factor
wiki$Vandal <- as.factor(wiki$Vandal)
table(wiki$Vandal)
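As a quick sanity check, here is a minimal sketch (using base R's prop.table) that shows the same counts as proportions; the share of non-vandalism edits is the baseline that the models below will be compared against:

# Share of non-vandalism (0) vs. vandalism (1) edits in the full data
prop.table(table(wiki$Vandal))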

 

 

 

Data preprocessing

Note: the text in this data set is already lowercased and has punctuation removed
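For completeness, if the raw text had not already been cleaned, the same two steps could be done with tm; a minimal, self-contained sketch on a made-up toy corpus (not needed for this data set):

library(tm)
toy <- VCorpus(VectorSource(c("Hello, World!", "Some VANDALISM here.")))
toy <- tm_map(toy, content_transformer(tolower))  # lowercase the text
toy <- tm_map(toy, removePunctuation)             # strip punctuation
content(toy[[1]])                                 # "hello world"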

library(tm)
library(NLP)
corpusAdded <- VCorpus(VectorSource(wiki$Added)) # create corpus for the Added column
length(stopwords("english")) # check the length of english-language stopwords
corpusAdded <- tm_map(corpusAdded, removeWords, stopwords('english')) # remove the english-language stopwords
corpusAdded <- tm_map(corpusAdded, stemDocument) # stem the words

dtmAdded <- DocumentTermMatrix(corpusAdded) # build the document term matrix
sparseAdded <- removeSparseTerms(dtmAdded, 0.997) # keep only terms appearing in at least 0.3% of revisions

wordsAdded <- as.data.frame(as.matrix(sparseAdded)) # convert sparseAdded into a data frame
colnames(wordsAdded) = paste("A", colnames(wordsAdded)) # prepend all the words with the letter A


corpusRemoved <- VCorpus(VectorSource(wiki$Removed)) # create corpus for the Removed column
length(stopwords("english")) # check the length of english-language stopwords
corpusRemoved <- tm_map(corpusRemoved, removeWords, stopwords('english')) # remove the english-language stopwords
corpusRemoved <- tm_map(corpusRemoved, stemDocument) # stem the words

dtmRemoved <- DocumentTermMatrix(corpusRemoved) # build the document term matrix
sparseRemoved <- removeSparseTerms(dtmRemoved, 0.997) # keep only terms appearing in at least 0.3% of revisions

wordsRemoved <- as.data.frame(as.matrix(sparseRemoved)) # convert sparseRemoved into a data frame
colnames(wordsRemoved) = paste("R", colnames(wordsRemoved)) # prepend all the words with the letter R

wikiWords = cbind(wordsAdded, wordsRemoved)
wikiWords$Vandal <- wiki$Vandal # final data frame

  

The data is divided into a training set and a test set

library(caTools)
set.seed(123)
spl <- sample.split(wikiWords$Vandal, 0.7)
wiki.train <- subset(wikiWords, spl ==T)
wiki.test <- subset(wikiWords, spl == F)

# baseline accuracy
table(wiki.test$Vandal)/nrow(wiki.test)

 

 

 

Building the CART model

# CART model
library(rpart)
library(rpart.plot)
wiki.train.cart1 <- rpart(Vandal ~ ., data = wiki.train, method = 'class')
prp(wiki.train.cart1)

wiki.cart1.pred <- predict(wiki.train.cart1, newdata = wiki.test, type = 'class') # predict on the test set
library(caret)
confusionMatrix(wiki.cart1.pred, wiki.test$Vandal)

 

 

Although the CART model beats the baseline, its accuracy is only 54.17%. Predicting vandalism from the added and removed words alone is therefore not a very good approach, so next we will try to improve the model.
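For reference, a minimal sketch of how that accuracy can be computed directly from the test-set predictions (equivalent to the Accuracy value reported by confusionMatrix(); the object name cmCART1 is used here only for illustration):

# Accuracy = correct predictions / total test-set observations
cmCART1 <- table(wiki.test$Vandal, wiki.cart1.pred)
sum(diag(cmCART1)) / sum(cmCART1)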

 

Model Improvement 1: Identifying Key Words

A "web address" (also called a URL, Uniform Resource Locator) consists of two main parts. Take "http://www.google.com" as an example: the first part is the protocol, usually "http" (HyperText Transfer Protocol); the second part is the address of the website, e.g. "www.google.com". Since all punctuation was removed, a link to a website appears in the data as a single word such as "httpwwwgooglecom". We hypothesize that, because a lot of vandalism seems to involve adding links to promotional or irrelevant websites, the presence of a web address is a sign of vandalism. We can therefore check whether a web address was added by searching the "Added" column for the string "http".

 

# Create a copy of the data frame from the previous step:
wikiWords2 <- wikiWords

# Make a new column in wikiWords2 that is 1 if "http" was in Added:
wikiWords2$HTTP <- ifelse(grepl("http",wiki$Added,fixed=TRUE), 1, 0)
table(wikiWords2$HTTP)

 

 

 

 

Split the new data into training and test sets, and build a CART model

wikiTrain2 <- subset(wikiWords2, spl==TRUE)
wikiTest2 <- subset(wikiWords2, spl==FALSE)


# CART model
wiki2.train.cart1 <- rpart(Vandal ~ ., data = wikiTrain2, method = 'class')
prp(wiki2.train.cart1)

Use the new CART model to predict on the test set and compute its accuracy

wiki2.cart1.pred <- predict(wiki2.train.cart1, newdata = wikiTest2, type = 'class')
confusionMatrix(wiki2.cart1.pred, wikiTest2$Vandal)

 

 

 

 

Model Improvement 2: Word Counts

Here we hypothesize that the number of words added or removed is predictive of vandalism. We therefore create new columns holding the total number of words added and removed for each observation, split the data into training and test sets again, and build a CART model.

# We already have word counts available in the form of the document-term matrices (DTMs).
wikiWords2$NumWordsAdded <- rowSums(as.matrix(dtmAdded))     # number of words added in each revision
wikiWords2$NumWordsRemoved <- rowSums(as.matrix(dtmRemoved)) # number of words removed in each revision
wikiTrain3 <- subset(wikiWords2, spl==TRUE)
wikiTest3 <- subset(wikiWords2, spl==FALSE)

# CART model
wiki3.train.cart1 <- rpart(Vandal ~ ., data = wikiTrain3, method = 'class')
prp(wiki3.train.cart1)

 

Use the new CART model to predict on the test set and compute its accuracy

wiki3.cart1.pred <- predict(wiki3.train.cart1, newdata = wikiTest3, type = 'class')
confusionMatrix(wiki3.cart1.pred, wikiTest3$Vandal)

 

 

 

 

Model Improvement 3: Using More Information

Here we bring in two additional variables, Minor and Loggedin

wikiWords3 <- wikiWords2
wikiWords3$Minor <- wiki$Minor
wikiWords3$Loggedin <- wiki$Loggedin
wikiTrain4 <- subset(wikiWords3, spl==TRUE)
wikiTest4 <- subset(wikiWords3, spl==FALSE)

# CART model
wiki4.train.cart1 <- rpart(Vandal ~ ., data = wikiTrain4, method = 'class')
prp(wiki4.train.cart1)

 

 

Predict with the new CART model and compute its accuracy

wiki4.cart1.pred <- predict(wiki4.train.cart1, newdata = wikiTest4, type = 'class')
confusionMatrix(wiki4.cart1.pred, wikiTest4$Vandal)

 

 

 

Through these successive improvements, the model's test-set accuracy rose from 54.17% to 71.88%.
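As a wrap-up, a minimal sketch that puts the four test-set accuracies side by side, assuming the predictions from the steps above are still in the workspace (the accuracy() helper is defined here only for illustration):

# Fraction of correct class predictions
accuracy <- function(pred, actual) mean(pred == actual)

accuracy(wiki.cart1.pred, wiki.test$Vandal)    # words only
accuracy(wiki2.cart1.pred, wikiTest2$Vandal)   # + HTTP indicator
accuracy(wiki3.cart1.pred, wikiTest3$Vandal)   # + word counts
accuracy(wiki4.cart1.pred, wikiTest4$Vandal)   # + Minor and Loggedin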

 

 

 

 
