Learning the Naive Bayes Classification API with Python 3
Covers extracting feature vectors from text strings.
Source code is available on my GitHub: https://github.com/linyi0604/kaggle
from sklearn.datasets import fetch_20newsgroups
# train_test_split now lives in sklearn.model_selection
# (sklearn.cross_validation was deprecated and later removed)
from sklearn.model_selection import train_test_split
# Text feature vector conversion module
from sklearn.feature_extraction.text import CountVectorizer
# Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# Model evaluation module
from sklearn.metrics import classification_report

'''
Naive Bayes models are widely used for large-scale internet text classification tasks.
Because the features are assumed to be conditionally independent of each other, the number
of parameters to estimate drops from exponential to linear in the number of features,
saving memory and computing time.
However, the model cannot capture relationships between features, so it performs poorly
on classification tasks where the features are strongly correlated.
'''

'''
1 Read the data
'''
# The API downloads the data on first use, so an internet connection is required
news = fetch_20newsgroups(subset="all")
# Check the data size and a sample document
# print(len(news.data))
# print(news.data[0])
'''
18846

From: Mamatha Devineni Ratnam <[email protected]>
Subject:
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu

I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway.
I was very disappointed not to see the Islanders lose the final
regular season game. PENS RULE!!!
'''

'''
2 Split the data
'''
x_train, x_test, y_train, y_test = train_test_split(news.data,
                                                    news.target,
                                                    test_size=0.25,
                                                    random_state=33)

'''
3 Use a Bayesian classifier to predict the news category
'''
# Convert text to feature vectors
vec = CountVectorizer()
x_train = vec.fit_transform(x_train)
x_test = vec.transform(x_test)
# Initialize the naive Bayes model
mnb = MultinomialNB()
# Train on the training set (estimate the parameters)
mnb.fit(x_train, y_train)
# Predict on the test set and save the predictions
y_predict = mnb.predict(x_test)

'''
4 Model evaluation
'''
print("Accuracy:", mnb.score(x_test, y_test))
print("Other metrics:\n", classification_report(y_test, y_predict, target_names=news.target_names))
'''
Accuracy: 0.8397707979626485
Other metrics:
                              precision    recall  f1-score   support

                 alt.atheism       0.86      0.86      0.86       201
               comp.graphics       0.59      0.86      0.70       250
     comp.os.ms-windows.misc       0.89      0.10      0.17       248
    comp.sys.ibm.pc.hardware       0.60      0.88      0.72       240
       comp.sys.mac.hardware       0.93      0.78      0.85       242
              comp.windows.x       0.82      0.84      0.83       263
                misc.forsale       0.91      0.70      0.79       257
                   rec.autos       0.89      0.89      0.89       238
             rec.motorcycles       0.98      0.92      0.95       276
          rec.sport.baseball       0.98      0.91      0.95       251
            rec.sport.hockey       0.93      0.99      0.96       233
                   sci.crypt       0.86      0.98      0.91       238
             sci.electronics       0.85      0.88      0.86       249
                     sci.med       0.92      0.94      0.93       245
                   sci.space       0.89      0.96      0.92       221
      soc.religion.christian       0.78      0.96      0.86       232
          talk.politics.guns       0.88      0.96      0.92       251
       talk.politics.mideast       0.90      0.98      0.94       231
          talk.politics.misc       0.79      0.89      0.84       188
          talk.religion.misc       0.93      0.44      0.60       158

                 avg / total       0.86      0.84      0.82      4712
'''
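The same pipeline can be reduced to a minimal, self-contained sketch on a tiny hand-made corpus (the sentences and labels below are invented for illustration, not taken from the 20newsgroups data), which runs instantly and makes the fit_transform vs. transform distinction easy to see:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training corpus with two made-up categories
train_docs = [
    "the pens beat the devils in the playoffs",
    "jagr scored twice in the hockey game",
    "my graphics card driver crashes on windows",
    "how do I upgrade the ram in my pc",
]
train_labels = ["hockey", "computers", "computers", "computers"][:1] + \
               ["hockey", "computers", "computers"]
train_labels = ["hockey", "hockey", "computers", "computers"]

# fit_transform learns the vocabulary from the training corpus
# and builds the word-count matrix in one step
vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)

mnb = MultinomialNB()
mnb.fit(X_train, train_labels)

# New documents must go through transform() only (never fit_transform),
# so they are mapped onto the vocabulary already learned from training data;
# unseen words are simply ignored
X_new = vec.transform(["the hockey playoffs start tonight"])
print(mnb.predict(X_new))  # ['hockey']
```

Note that calling fit_transform on the test data is a common mistake: it would rebuild the vocabulary from the test set, producing feature columns that no longer line up with what the model was trained on.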