The Road to Machine Learning: Python Naive Bayes Classifier to Predict News Categories

 

Learn to use the naive Bayes classification API in Python 3.

This involves extracting feature vectors from strings.
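As a minimal sketch of what extracting feature vectors from strings means, the snippet below runs CountVectorizer on a two-sentence toy corpus. The sentences are made up for illustration and are not taken from the 20 newsgroups data. Each row of the resulting matrix counts how often each vocabulary word appears in one document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented purely for illustration (not from the 20 newsgroups data).
docs = [
    "the pens beat the devils",
    "the devils lost the game",
]

vec = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse document-term matrix.
counts = vec.fit_transform(docs)

# vocabulary_ maps each token to its column index; list the tokens in column order.
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
dense = counts.toarray()

print(vocab)
print(dense)
```

Word order within a sentence is discarded: only the per-word counts survive, which is the bag-of-words representation that the naive Bayes model consumes.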

The source code is available for download from my GitHub: https://github.com/linyi0604/kaggle

 

from sklearn.datasets import fetch_20newsgroups
# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
# Import the text feature vectorization module
from sklearn.feature_extraction.text import CountVectorizer
# Import the naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# Import the model evaluation module
from sklearn.metrics import classification_report

'''
Naive Bayes models are widely used for large-scale internet text classification tasks.
Because the features are assumed to be conditionally independent of each other, the number of parameters to estimate drops from exponential order to linear order, which saves memory and computing time.
However, the model cannot take relationships between features into account, so it performs poorly on classification tasks where the features are strongly correlated.
'''

'''
1 Read the data
'''
# This API downloads the data over the network when it is called
news = fetch_20newsgroups(subset="all")
# Check the data size and details
# print(len(news.data))
# print(news.data[0])
'''
18846

From: Mamatha Devineni Ratnam <[email protected]>
Subject:
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu

I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!
'''

'''
2 Split the data
'''
x_train, x_test, y_train, y_test = train_test_split(news.data,
                                                    news.target,
                                                    test_size=0.25,
                                                    random_state=33)

'''
3 Predict news categories with the naive Bayes classifier
'''
# Convert the text to feature vectors
vec = CountVectorizer()
x_train = vec.fit_transform(x_train)
x_test = vec.transform(x_test)
# Initialize the naive Bayes model
mnb = MultinomialNB()
# Fit on the training set to estimate the parameters
mnb.fit(x_train, y_train)
# Predict on the test set and save the predictions
y_predict = mnb.predict(x_test)

'''
4 Model evaluation
'''
print("Accuracy:", mnb.score(x_test, y_test))
print("Other metrics:\n", classification_report(y_test, y_predict, target_names=news.target_names))
'''
Accuracy: 0.8397707979626485
Other metrics:
                          precision    recall  f1-score   support

             alt.atheism       0.86      0.86      0.86       201
           comp.graphics       0.59      0.86      0.70       250
 comp.os.ms-windows.misc       0.89      0.10      0.17       248
comp.sys.ibm.pc.hardware       0.60      0.88      0.72       240
   comp.sys.mac.hardware       0.93      0.78      0.85       242
          comp.windows.x       0.82      0.84      0.83       263
            misc.forsale       0.91      0.70      0.79       257
               rec.autos       0.89      0.89      0.89       238
         rec.motorcycles       0.98      0.92      0.95       276
      rec.sport.baseball       0.98      0.91      0.95       251
        rec.sport.hockey       0.93      0.99      0.96       233
               sci.crypt       0.86      0.98      0.91       238
         sci.electronics       0.85      0.88      0.86       249
                 sci.med       0.92      0.94      0.93       245
               sci.space       0.89      0.96      0.92       221
  soc.religion.christian       0.78      0.96      0.86       232
      talk.politics.guns       0.88      0.96      0.92       251
   talk.politics.mideast       0.90      0.98      0.94       231
      talk.politics.misc       0.79      0.89      0.84       188
      talk.religion.misc       0.93      0.44      0.60       158

             avg / total       0.86      0.84      0.82      4712
'''
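
The opening note in the listing says that the independence assumption reduces the number of parameters to a linear order. A small sketch on a made-up four-document corpus can make that concrete. The documents, labels, and test phrase here are assumptions for illustration only, not part of the original experiment. After fitting, MultinomialNB holds one log-probability per class and word, plus one prior per class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus and labels, invented for illustration only.
docs = [
    "goal scored in the hockey game",
    "the team won the hockey playoffs",
    "new graphics card renders images",
    "the graphics driver draws images fast",
]
labels = ["hockey", "hockey", "graphics", "graphics"]

vec = CountVectorizer()
x = vec.fit_transform(docs)

mnb = MultinomialNB()
mnb.fit(x, labels)

n_classes = len(mnb.classes_)
n_features = x.shape[1]

# One log-probability per (class, word) pair plus one prior per class:
# the parameter count grows linearly with the vocabulary size.
print(mnb.feature_log_prob_.shape)  # (n_classes, n_features)
print(mnb.class_log_prior_.shape)   # (n_classes,)

# Words unseen during fitting (here "tonight") are ignored by transform.
prediction = mnb.predict(vec.transform(["hockey game tonight"]))
print(prediction)
```

Doubling the vocabulary only doubles the size of feature_log_prob_, which is why the model stays practical on the roughly 18,000 newsgroup posts used above.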

 
