Spam Email Classification in R with Naive Bayes (Machine Learning)

An email-classification exercise with naive Bayes

Outline

  • Data import
  • Data preprocessing
  • Building the training and test sets
  • Word-cloud visualization
  • Dimensionality reduction
  • Training the model
  • Testing the model
  • Improving the model

1. Data Import

The mail files live under C:\Users\kelanj\Documents\data\spam\… and C:\Users\kelanj\Documents\data\ham\…
# Data import
# Build paths to the spam and ham directories
setwd("C:\\Users\\kelanj\\Documents")
spam.path <- file.path("data","spam")
ham.path <- file.path("data","ham")
# Get the vector of file names in each directory
spam.docs <- dir(spam.path)
ham.docs <- dir(ham.path)
spam.docs[1:4] # peek at the first four file names
## [1] "00001.317e78fa8ee2f54cd4890fdc09ba8176"
## [2] "00001.7848dde101aa985090474a91ec93fcf0"
## [3] "00002.9438920e9a55591b18e60d1ed37d992b"
## [4] "00002.d94f1b97e48ed3b553b3508d116e6a09"
ham.docs[1:4]
## [1] "00001.1a31cc283af0060967a233d26548a6ce"
## [2] "00001.7c53336b37003a9286aba55d2945844c"
## [3] "00002.5a587ae61666c5aa097c8e866aedcc59"
## [4] "00002.9c4069e25e1ef370c078db7ee85ff9ac"
# getContent reads one mail file and returns its body as a single string.
# Note: by mail-format convention, each message consists of a header and a body,
# separated by the first blank line.
getContent <- function(path){
  conn <- file(path, open = "rt") # do not force an encoding here; doing so makes some files fail to read
  line <- readLines(conn, warn = F)
  # keep everything after the first blank line; fall back to the error object if none is found
  content <- tryCatch(line[seq(which(line == "")[1]+1, length(line), 1)], error = function(e) e)
  close(conn)
  content <- paste(content, collapse = '\n')
  return(content)
}
# Read the message bodies: 1897 spam files and 3900 ham files
spamContent <- sapply(spam.docs, function(path) getContent(file.path(spam.path,path)))
hamContent <- sapply(ham.docs, function(path) getContent(file.path(ham.path,path)))
s.h.content <- c(spamContent, hamContent) # combine spam and ham bodies
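
As a quick sanity check, the vector lengths should match the file counts stated above (a small verification sketch, not in the original post):

length(spamContent) # should be 1897
length(hamContent)  # should be 3900
substr(s.h.content[1], 1, 60) # peek at the start of the first body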

2. Data Preprocessing

First, load the packages NLP, tm, SnowballC, and slam.

1. Define helper functions that strip HTML tags and normalize URLs, plus a custom stop-word list;

2. Define the text-cleaning function, which returns a corpus.

library(NLP)
library(tm)
library(SnowballC) # word stemming (loaded here, though no stemming step is applied below)
library(slam)

# Normalize every URL (http/https/ftp/file) to the single token "http"
myremoveURL <- function(x){
  gsub(pattern = "(https?|ftp|file):\\/\\/[-A-Za-z0-9+&@#\\/%?=~_|!:,\\.;]+[-A-Za-z0-9+&@#\\/%=~_|]", "http", x)
}

# Strip HTML tags
myremoveHTML <- function(x){
  gsub(pattern = "<[^>]+>", "", x)
}

# Custom English stop-word list: tm's defaults plus a few extras
myenstopwords <- function(){
  c(stopwords(), "will", "also")
}

# The full text-cleaning pipeline: build a corpus and apply the cleaning steps in order
cleanContent1 <- function(content){
  contentCorpus <- Corpus(VectorSource(content))
  contentCorpus <- tm_map(contentCorpus, PlainTextDocument)
  contentCorpus <- tm_map(contentCorpus, myremoveURL)
  contentCorpus <- tm_map(contentCorpus, myremoveHTML)
  contentCorpus <- tm_map(contentCorpus, tolower)
  contentCorpus <- tm_map(contentCorpus, removeNumbers)
  contentCorpus <- tm_map(contentCorpus, removeWords, myenstopwords())
  contentCorpus <- tm_map(contentCorpus, removePunctuation)
  contentCorpus <- tm_map(contentCorpus, stripWhitespace)
  return(contentCorpus)
}
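
To see what the two custom cleaners do, here is a quick check on made-up snippets (the sample strings are hypothetical, not from the corpus):

myremoveURL("visit https://example.com/offer?id=1 today")
## [1] "visit http today"
myremoveHTML("<p>Hello <b>world</b></p>")
## [1] "Hello world"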

Set the locale explicitly to avoid encoding errors:

Sys.setlocale(category = "LC_ALL", locale = "us")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
# run the cleaning pipeline on the combined mail bodies
s.h.corpus<-cleanContent1(s.h.content)
inspect(s.h.corpus[1:3])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] greetings receiving letter expressed interest receiving information online business opportunities erroneous please accept sincere apology onetime mailing removal necessary burned betrayed backstabbed multilevel marketing mlm please read letter important one ever landed inbox multilevel marketing huge mistake people mlm failed deliver promises past years pursuit mlm dream cost hundreds thousands people friends fortunes sacred honor fact mlm fatally flawed meaning work people companies earn big money mlm going tell real story finally someone courage cut hype lies tell truth mlm good news alternative mlm works works big yet abandoned dreams need see earning kind income dreamed easier think permission like send brief letter tell mlm work people introduce something new refreshing wonder heard promise unwanted follow sales pitch one call email address used send information period receive free lifechanging information simply click reply type send info subject box hit send get information within hours just look words mlm wall shame inbox cordially siddhi ps someone recently sent letter eyeopening financially beneficial information ever received honestly believe feel way read free email never sent unsolicited spam receiving email explicitly signed list online signup form use ffa links page emaildom systems explicit terms use state use agree receive emailings may member altra computer systems list one many numerous free marketing services agreed signed list receiving emailing due email message considered unsolicitated spam irish linux users group iluglinuxie http unsubscription information list maintainer listmasterlinuxie                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
## [2]  save life insurance spend life quote savings ensurin g familys financial security important life quote savings ma kes buying life insurance simple affordable provide free access best companies lowest rates life quote savings fast eas y saves money let us help get started best val ues country new coverage can save hundreds even tho usands dollars requesting free quote lifequote savings service take less minutes complete shop d compare save types life insurance click free quote protecting family best investment eve r make receipt email error wish removed list please click type remove reside state prohibits email solicitations insuran ce please disregard email                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
## [3]  need safety real might get one chance ready free shipping handling within usa order may day super sale now may save items getting spring protect walk jog exercise outside protect loved ones return home college legal protection college students great coming outdoor protection gifts nothing worth protecting life stun devices pepper products legal protection join war crime stun guns batons effective safe nonlethal protect loved ones matter matter city town live live america touched crime hear tv read newspaper secret crime major problem us today criminals finding easier commit crimes time weapons readily available cities police forces work can handle even criminal caught spending long nations overcrowded jails lawmakers well aware crime problem seem effective answers email address merchantsallaolcom interested protecting within days wait visit web page join war crime http well effective answer take responsibility security site variety quality personal security products visit site choose personal security products right use join war crime free pepper spray stun unit purchase value ship orders within days every state us ups fedex us postal service visa mastercard american express debt card gladly accepted ask question help loved ones interested http the stun monster volts the zforce slim style volts the stunmaster volts straight the stunmaster volts curb the stunmaster volts straight the stunmaster volts curb the stunbaton volts the stunbaton volts pen knife one two wildfire pepper spray one two add shipping handling charge order postal mail please send address make payable mega safety technology mega safety technology merrimac ave dayton ohio email address merchantsallaolcom order hour fax important credit card information please read credit card address city state zip code must match billing address processed check moneyorder visa mastercard americanexpress debt card name appears check credit card address appears check credit card citystatezip appears check credit card country credit card number expiration month year authorized signature important note shipping address different billing address please fill information shipping name shipping address shipping citystatezip country email address phone numberplease write neat
s.h.dtm<-DocumentTermMatrix(s.h.corpus)
s.h.dtm # 5797 rows (documents) x 71600 columns (terms); the first 1897 rows are spam, the remaining 3900 are ham
## <<DocumentTermMatrix (documents: 5797, terms: 71600)>>
## Non-/sparse entries: 553817/414511383
## Sparsity           : 100%
## Maximal term length: 868
## Weighting          : term frequency (tf)

3. Building the Training and Test Sets

Note: roughly 75% of the data goes to the training set and 25% to the test set.

# Split the corpus
s.h.corpus.train <- s.h.corpus[c(1:1423,1897:4822)] # 4349 documents
s.h.corpus.test <- s.h.corpus[c(1424:1896,4823:5797)] # 1448 documents
# Split the DTM the same way
# (note an off-by-one: row 1897 is the last *spam* document, yet it lands in the
#  training set and is labeled ham below; the reported results use this split as-is)
s.h.dtm.train <- s.h.dtm[c(1:1423,1897:4822),]
s.h.dtm.test <- s.h.dtm[c(1424:1896,4823:5797),]
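
An index-based split like this is fragile (see the off-by-one noted above). A random 75/25 split is a common alternative; a minimal sketch, not the split used for the results reported below:

set.seed(123) # for reproducibility
train.idx <- sample(length(s.h.corpus), floor(0.75 * length(s.h.corpus)))
test.idx <- setdiff(seq_along(s.h.corpus), train.idx)
alt.dtm.train <- s.h.dtm[train.idx, ]
alt.dtm.test <- s.h.dtm[test.idx, ]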

4. Word-Cloud Visualization

Load the package wordcloud2.
Note: the cloud is built from the spam training set, i.e. the first 1423 rows of s.h.dtm.train.

# Convert the spam rows to a dense matrix and compute per-term frequencies
s.dtm.train <- as.matrix(s.h.dtm.train[1:1423,])
s.sum <- col_sums(s.dtm.train)
s.term <- names(s.sum)
s.freq <- as.numeric(s.sum)
# Assemble a data frame; cbind() coerces everything to character and the data frame
# turns it into a factor, so go through as.character() first (a bare as.numeric()
# on the factor would return level codes, not the counts)
s.frame <- as.data.frame(cbind(s.term,s.freq), row.names=NULL, optional=F)
s.frame$s.freq <- as.numeric(as.character(s.frame$s.freq))
head(s.frame)
##        s.term s.freq
## 1   abandoned     43
## 2      accept      8
## 3     address    330
## 4       agree     88
## 5      agreed    164
## 6 alternative    269

library(wordcloud2)
# wordcloud2(s.frame) # draws the interactive word cloud shown below


[Figure: word cloud of the most frequent spam terms]
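
If the interactive cloud is not needed, the most frequent spam terms can also be listed directly from the frequency table built above:

head(s.frame[order(s.frame$s.freq, decreasing = TRUE), ], 10) # ten most frequent spam terms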

5. Dimensionality Reduction

Keep only the terms that appear at least 100 times in the training set.

# A variant of tm::findFreqTerms() that returns the filtered matrix itself
# rather than just the term names
myfindFreqTerms <- function(x, lowfreq=0, highfreq=Inf){
  stopifnot(inherits(x, c("DocumentTermMatrix","TermDocumentMatrix","simple_triplet_matrix")),
            is.numeric(lowfreq), is.numeric(highfreq))
  if(inherits(x, "DocumentTermMatrix"))
    x <- t(x) # work term-wise: rows become terms
  rs <- slam::row_sums(x)  # total frequency of each term
  y <- which(rs >= lowfreq & rs <= highfreq)
  return(x[y,])
}
s.h.dict <- Terms(myfindFreqTerms(s.h.dtm.train,100))
length(s.h.dict) # 1151 terms survive the cutoff
## [1] 1151
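
Since only the term names are needed here, tm's built-in findFreqTerms() should yield the same dictionary in a single call (left commented as an alternative):

# s.h.dict <- findFreqTerms(s.h.dtm.train, lowfreq = 100)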

Rebuild the DTMs, restricting both to the filtered vocabulary via the dictionary argument:

s.h.train <- DocumentTermMatrix(s.h.corpus.train, list(dictionary=s.h.dict))
s.h.train # 4349 rows, 1151 columns
## <<DocumentTermMatrix (documents: 4349, terms: 1151)>>
## Non-/sparse entries: 224076/4781623
## Sparsity           : 96%
## Maximal term length: 35
## Weighting          : term frequency (tf)
s.h.test <- DocumentTermMatrix(s.h.corpus.test, list(dictionary=s.h.dict))
s.h.test # 1448 rows, 1151 columns
## <<DocumentTermMatrix (documents: 1448, terms: 1151)>>
## Non-/sparse entries: 52503/1614145
## Sparsity           : 97%
## Maximal term length: 35
## Weighting          : term frequency (tf)

6. Training the Model

Load the package e1071.

# Convert term counts to a binary No/Yes factor: this naive Bayes model only
# uses whether a term appears in a message, not how often
convert_counts <- function(x){
  x <- ifelse(x>0, 1, 0)
  x <- factor(x, levels=c(0,1), labels=c("No","Yes"))
  return(x)
}
s_h_train <- apply(s.h.train, MARGIN=2, convert_counts)
s_h_test<-apply(s.h.test, MARGIN = 2, convert_counts)
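
To see what convert_counts does, apply it to a toy count vector (hypothetical values):

convert_counts(c(0, 2, 1, 0))
## [1] No  Yes Yes No 
## Levels: No Yes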

Train the classifier:

library(e1071)
# Class labels matching the row order of the train and test sets
s_h_train_type <- factor(c(rep("spam",1423), rep("ham",2926)))
s_h_test_type <- factor(c(rep("spam",473), rep("ham",975)))

model_s_h <- naiveBayes(s_h_train, s_h_train_type, laplace=1)
s_h_prediction <- predict(model_s_h, s_h_test, type = "class")
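
predict() can also return posterior class probabilities rather than hard labels, which helps when inspecting borderline messages (type = "raw" is part of e1071's predict.naiveBayes interface):

# head(predict(model_s_h, s_h_test, type = "raw"))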

7. Testing the Model

Load the package gmodels.

library(gmodels)
CrossTable(s_h_prediction,s_h_test_type,prop.chisq=TRUE,prop.t=FALSE,dnn=c("predicted","actual"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1448 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |       967 |        69 |      1036 | 
##              |   104.053 |   214.486 |           | 
##              |     0.933 |     0.067 |     0.715 | 
##              |     0.992 |     0.146 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         8 |       404 |       412 | 
##              |   261.648 |   539.337 |           | 
##              |     0.019 |     0.981 |     0.285 | 
##              |     0.008 |     0.854 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       975 |       473 |      1448 | 
##              |     0.673 |     0.327 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

Analysis: the table shows that 8 of the 975 ham messages were misclassified as spam (0.8%), while 69 of the 473 spam messages were misclassified as ham (14.6%).
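
Overall accuracy can be read off the same table:

(967 + 404) / 1448 # about 0.947 overall accuracy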

8. Improving the Model

Retrain with a smaller Laplace smoothing value:

model_s_h <- naiveBayes(s_h_train, s_h_train_type, laplace=0.001)
s_h_prediction<-predict(model_s_h,s_h_test,type = "class")
CrossTable(s_h_prediction,s_h_test_type,prop.chisq=TRUE,prop.t=FALSE,dnn=c("predicted","actual"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1448 
## 
##  
##              | actual 
##    predicted |       ham |      spam | Row Total | 
## -------------|-----------|-----------|-----------|
##          ham |       969 |        66 |      1035 | 
##              |   106.231 |   218.975 |           | 
##              |     0.936 |     0.064 |     0.715 | 
##              |     0.994 |     0.140 |           | 
## -------------|-----------|-----------|-----------|
##         spam |         6 |       407 |       413 | 
##              |   266.220 |   548.762 |           | 
##              |     0.015 |     0.985 |     0.285 | 
##              |     0.006 |     0.860 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       975 |       473 |      1448 | 
##              |     0.673 |     0.327 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

Analysis:
Lowering the Laplace smoothing value to 0.001 improves the model slightly: ham misclassified as spam drops to 0.6% (6 of 975), and spam misclassified as ham drops to 14.0% (66 of 473).
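
In overall-accuracy terms, the tuned model improves from about 94.7% to:

(969 + 407) / 1448 # about 0.950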


Done!


Reposted from blog.csdn.net/kelanj/article/details/80883849