1. BACKGROUND AND PURPOSE: Wikipedia is a free online encyclopedia that anyone can edit and contribute to. It supports many languages and continues to grow. The English-language Wikipedia currently has about 4.7 million articles, which together have been edited more than 760 million times. One consequence of allowing anyone to edit is that some people vandalize pages. Vandalism can take several forms: deleting content, adding promotional or otherwise inappropriate content, or making more subtle changes that alter an article's meaning. With so many articles being edited every day, it is difficult for humans to detect every act of vandalism and revert (undo) it. As a result, Wikipedia uses bots (computer programs) that automatically revert edits that look like vandalism. In this exercise we use text mining to develop a vandalism detector: a machine-learning model that distinguishes valid edits from vandalism.
2. DATA DESCRIPTION: The data for this problem come from the revision history of a language-related Wikipedia page. Wikipedia makes the full revision history of every page available, including the state of the page after each revision. Rather than inspecting each revision manually, a script was run that checked whether each edit was kept or reverted. If a change was eventually reverted, the revision was labeled as vandalism. This may cause some misclassifications, but the labels are good enough for our purposes.
As part of this preprocessing, some common processing tasks have already been completed, including lowercasing the text and removing punctuation. The data set contains the following columns:
- Vandal: 1 if this edit was vandalism, 0 otherwise.
- Minor: 1 if the user marked this edit as a "minor edit", 0 otherwise.
- Loggedin: 1 if the user made this edit while logged into a Wikipedia account, 0 otherwise.
- Added: the unique words added by this edit.
- Removed: the unique words removed by this edit.
Note: rather than a traditional bag of words, the data contain the set of unique words added or removed. For example, if a word was removed several times in a revision, it appears only once in the "Removed" column.
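To make the set representation concrete, here is a minimal sketch (with made-up edit text, not taken from the data set) of how repeated words collapse to a single entry:

```r
# Hypothetical removed text in which "word" appears twice
removed_text <- "word page word content"

# Collapse to the set of unique words, as stored in the Removed column
unique_words <- unique(strsplit(removed_text, " ")[[1]])
print(unique_words)  # "word" "page" "content"
```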
3. APPLICATION AND ANALYSIS
# Read the data
wiki <- read.csv("wiki.csv", stringsAsFactors = FALSE)
str(wiki)
# Convert the "Vandal" column to a factor wiki$Vandal <- as.factor(wiki$Vandal) table(wiki$Vandal)
Data preprocessing
Note: this data set has already been lowercased and had its punctuation removed.
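For reference, if the text had not already been preprocessed, these two steps could be performed with tm's standard transformations (sketched here on a toy corpus, not the wiki data):

```r
library(tm)

# Toy corpus to illustrate the two preprocessing steps
toy <- VCorpus(VectorSource(c("Hello, World! This IS a Test.")))
toy <- tm_map(toy, content_transformer(tolower))  # lowercase all text
toy <- tm_map(toy, removePunctuation)             # strip punctuation
content(toy[[1]])  # "hello world this is a test"
```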
library(tm)

# Build a corpus from the Added column
corpusAdded <- VCorpus(VectorSource(wiki$Added))
length(stopwords("english"))                                           # number of English stopwords
corpusAdded <- tm_map(corpusAdded, removeWords, stopwords("english"))  # remove English stopwords
corpusAdded <- tm_map(corpusAdded, stemDocument)                       # stem the words
dtmAdded <- DocumentTermMatrix(corpusAdded)                            # build the document-term matrix
sparseAdded <- removeSparseTerms(dtmAdded, 0.997)                      # remove sparse terms
wordsAdded <- as.data.frame(as.matrix(sparseAdded))                    # convert sparseAdded to a data frame
colnames(wordsAdded) <- paste("A", colnames(wordsAdded))               # prefix the added words with "A"

# Repeat the same steps for the Removed column
corpusRemoved <- VCorpus(VectorSource(wiki$Removed))
corpusRemoved <- tm_map(corpusRemoved, removeWords, stopwords("english"))
corpusRemoved <- tm_map(corpusRemoved, stemDocument)
dtmRemoved <- DocumentTermMatrix(corpusRemoved)
sparseRemoved <- removeSparseTerms(dtmRemoved, 0.997)
wordsRemoved <- as.data.frame(as.matrix(sparseRemoved))                # convert sparseRemoved to a data frame
colnames(wordsRemoved) <- paste("R", colnames(wordsRemoved))           # prefix the removed words with "R"
wikiWords <- cbind(wordsAdded, wordsRemoved)  # combine the added and removed words
wikiWords$Vandal <- wiki$Vandal # final data frame
Split the data into a training set and a test set
library(caTools)
set.seed(123)
spl <- sample.split(wikiWords$Vandal, SplitRatio = 0.7)
wiki.train <- subset(wikiWords, spl == TRUE)
wiki.test <- subset(wikiWords, spl == FALSE)

# Baseline accuracy: always predict the more common class
table(wiki.test$Vandal)/nrow(wiki.test)
Build a CART model
# CART model
library(rpart)
library(rpart.plot)
wiki.train.cart1 <- rpart(Vandal ~ ., data = wiki.train, method = 'class')
prp(wiki.train.cart1)  # plot the tree
# Predict on the test set
wiki.cart1.pred <- predict(wiki.train.cart1, newdata = wiki.test, type = 'class')
library(caret)
confusionMatrix(wiki.cart1.pred, wiki.test$Vandal)
Although the CART model beats the baseline, its accuracy is only 54.17%. Predicting vandalism from the added and removed words alone is therefore not very effective, so we will now try to improve the model.
Model improvement 1: identifying key words
A web address (also known as a URL, Uniform Resource Locator) consists of two main parts. For example, in "http://www.google.com", the first part is the protocol, usually "http" (HyperText Transfer Protocol), and the second part is the address of the site, e.g. "www.google.com". Since all punctuation was removed, a link to a website appears in the data as a single word, such as "httpwwwgooglecom". We hypothesize that, because much vandalism seems to consist of adding links to promotional or irrelevant websites, the presence of a URL is a sign of vandalism. We can therefore test for the presence of a web address among the added words by searching for "http" in the Added column.
# Create a copy of the data frame from the previous model
wikiWords2 <- wikiWords
# New column that is 1 if "http" appears in Added, 0 otherwise
wikiWords2$HTTP <- ifelse(grepl("http", wiki$Added, fixed = TRUE), 1, 0)
table(wikiWords2$HTTP)
Split the new data into training and test sets and build a CART model
wikiTrain2 <- subset(wikiWords2, spl == TRUE)
wikiTest2 <- subset(wikiWords2, spl == FALSE)
# CART model
wiki2.train.cart1 <- rpart(Vandal ~ ., data = wikiTrain2, method = 'class')
prp(wiki2.train.cart1)
Predict with the new CART model and compute its accuracy
wiki2.cart1.pred <- predict(wiki2.train.cart1, newdata = wikiTest2, type = 'class')
confusionMatrix(wiki2.cart1.pred, wikiTest2$Vandal)
Model improvement 2: word counts
Here we hypothesize that vandalism can be predicted from the number of words added or removed. We therefore create new columns holding the total number of words added and removed in each revision, re-split the data into training and test sets, and build a new CART model.
# We already have word counts available in the document-term matrices (DTMs): each row sum is the number of distinct words in that revision.
wikiWords2$NumWordsAdded <- rowSums(as.matrix(dtmAdded))
wikiWords2$NumWordsRemoved <- rowSums(as.matrix(dtmRemoved))
wikiTrain3 <- subset(wikiWords2, spl == TRUE)
wikiTest3 <- subset(wikiWords2, spl == FALSE)
# CART model
wiki3.train.cart1 <- rpart(Vandal ~ ., data = wikiTrain3, method = 'class')
prp(wiki3.train.cart1)
Predict on the test set with the new CART model and compute its accuracy
wiki3.cart1.pred <- predict(wiki3.train.cart1, newdata = wikiTest3, type = 'class')
confusionMatrix(wiki3.cart1.pred, wikiTest3$Vandal)
Model improvement 3: using more information
Here we add the Minor and Loggedin variables to the model.
wikiWords3 <- wikiWords2
wikiWords3$Minor <- wiki$Minor
wikiWords3$Loggedin <- wiki$Loggedin
wikiTrain4 <- subset(wikiWords3, spl == TRUE)
wikiTest4 <- subset(wikiWords3, spl == FALSE)
# CART model
wiki4.train.cart1 <- rpart(Vandal ~ ., data = wikiTrain4, method = 'class')
prp(wiki4.train.cart1)
Model prediction and accuracy
wiki4.cart1.pred <- predict(wiki4.train.cart1, newdata = wikiTest4, type = 'class')
confusionMatrix(wiki4.cart1.pred, wikiTest4$Vandal)
Through these successive improvements, the model's accuracy rose from 54.17% to 71.88%.