'''
Natural Language Processing (NLP)

Bag-of-words model: the semantics of a piece of text depend, to a large
extent, on how often each word appears. So take every word that occurs in
the corpus as a feature name, treat every sentence as a sample, and use the
number of times each word appears in the sentence as the feature value.
The resulting model is called the bag-of-words model. For example:

1 The brown dog is running
2 The black dog is in the black room
3 Running in the room is forbidden

    the  brown  dog  is  running  black  in  room  forbidden
1    1     1     1    1     1       0     0    0       0
2    2     0     1    1     0       2     1    1       0
3    1     0     0    1     1       0     1    1       1

Bag-of-words model API:
import sklearn.feature_extraction.text as ft
# build the bag-of-words model object
cv = ft.CountVectorizer()
# train the model: every word that occurs becomes a feature name, every
# sentence is a sample, and the count of each word in the sentence is the
# feature value
bow = cv.fit_transform(sentences).toarray()
print(bow)
# get all the feature (word) names
words = cv.get_feature_names()

Term frequency (TF): the number of times a word appears in a sentence
divided by the total number of words in that sentence, i.e. the relative
frequency of the word within the sentence. Compared with raw counts, term
frequency assesses a word's semantic contribution more objectively: the
higher the term frequency, the greater the semantic contribution.
Normalizing the rows of the bag-of-words matrix yields the term
frequencies.

Document frequency (DF): the number of sample documents containing a word
divided by the total number of sample documents. The higher the document
frequency, the lower the word's semantic contribution.

Inverse document frequency (IDF): the total number of samples divided by
the number of samples containing the word. The higher the inverse document
frequency, the higher the word's semantic contribution.
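The definitions above can be checked by hand. A minimal pure-Python sketch
(independent of sklearn; note that sklearn's TfidfTransformer uses a
smoothed IDF, so its numbers differ slightly) for the three example
sentences:

```python
import math

# The three example sentences from the table above.
sents = ['The brown dog is running',
         'The black dog is in the black room',
         'Running in the room is forbidden']

# Fixed vocabulary, in the column order used by the table above.
vocab = ['the', 'brown', 'dog', 'is', 'running',
         'black', 'in', 'room', 'forbidden']

# Bag-of-words matrix: one row per sentence, one count per vocabulary word.
bow = [[s.lower().split().count(w) for w in vocab] for s in sents]

# Term frequency: word count divided by total words in the sentence.
tf = [[n / sum(row) for n in row] for row in bow]

# Document frequency: fraction of sentences that contain the word.
containing = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
df = [c / len(bow) for c in containing]

# Inverse document frequency (log of total samples / containing samples).
idf = [math.log(len(bow) / c) for c in containing]

# TF-IDF: term frequency scaled by inverse document frequency.
tfidf = [[tf_ij * idf[j] for j, tf_ij in enumerate(row)] for row in tf]
```

Words such as 'the' and 'is' appear in every sentence, so their IDF (and
hence TF-IDF weight) is 0: they contribute little to distinguishing the
samples.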
Term frequency-inverse document frequency (TF-IDF): multiply each element
of the term frequency matrix by the inverse document frequency of the
corresponding word. The larger the value, the greater the word's semantic
contribution to the sample, and a learning model can be built from these
per-word contribution strengths.

API to get the TF-IDF matrix:
# get the bag-of-words model
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
# get the TF-IDF model trainer
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow).toarray()
'''
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
import numpy as np

doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
sents = tk.sent_tokenize(doc)
print(sents)
# build the bag-of-words model
cv = ft.CountVectorizer()
bow = cv.fit_transform(sents)   # sparse matrix
bow = bow.toarray()             # dense matrix
print(bow)
# print the feature names (in sklearn >= 1.0 use get_feature_names_out())
print(cv.get_feature_names())
# get the TF-IDF matrix
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)
print(np.round(tfidf.toarray(), 2))
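For reference, the numbers printed by TfidfTransformer come from a smoothed
variant of the definitions above: with the defaults smooth_idf=True and
norm='l2', sklearn computes idf = ln((1 + n) / (1 + df)) + 1, scales the
counts by it, and L2-normalizes each row. A small dependency-free sketch of
that formula on a toy count matrix (the matrix is illustrative, not from
the course data):

```python
import math

# Toy bag-of-words count matrix: 2 documents, 3 vocabulary words.
bow = [[1, 1, 0],
       [2, 0, 1]]
n_docs = len(bow)
n_words = len(bow[0])

# Smoothed IDF, matching TfidfTransformer's default smooth_idf=True:
# idf = ln((1 + n) / (1 + df)) + 1, df = documents containing the word.
df = [sum(1 for row in bow if row[j] > 0) for j in range(n_words)]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# Scale counts by IDF, then L2-normalize each row (default norm='l2').
tfidf = []
for row in bow:
    scaled = [c * i for c, i in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in scaled))
    tfidf.append([v / norm for v in scaled])
```

A word that appears in every document gets idf = 1 (not 0, because of the
+1 term), and within a row the rarer words end up with the larger weights.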
'''
Text classification (topic identification): use the given text data set to
train a topic classifier, then test the model's accuracy on a custom test
set.
'''
import sklearn.datasets as sd
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb
import numpy as np

train = sd.load_files('./ml_data/20news', encoding='latin1',
                      shuffle=True, random_state=7)
# get the training samples
train_data = train.data
# get the sample outputs (labels)
train_y = train.target
# get the category label names
categories = train.target_names
print(np.array(train_data).shape)
print(np.array(train_y).shape)
print(categories)
# build the TF-IDF matrix
cv = ft.CountVectorizer()
bow = cv.fit_transform(train_data).toarray()
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)
# train a MultinomialNB model, because the distribution of the TF-IDF
# sample matrix matches the multinomial assumption better
model = nb.MultinomialNB()
model.fit(tfidf, train_y)
# test
test_data = [
    'The curveballs of right handed pitchers tend to curve to the left.',
    'Caesar cipher is an ancient form of encryption.',
    'This two-wheeler is really good on slippery roads.']
test_bow = cv.transform(test_data)
test_tfidf = tt.transform(test_bow)
pred_test_y = model.predict(test_tfidf)
for sent, index in zip(test_data, pred_test_y):
    print(sent, '---->', categories[index])
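To see why MultinomialNB suits count-like features, here is a toy
from-scratch sketch of multinomial naive Bayes (all names and data are
illustrative, not part of the course code): it accumulates per-class word
counts with Laplace smoothing and scores a document by
log P(c) + sum_j x_j * log P(word_j | c).

```python
import math

# Toy training data: word-count vectors and their class labels (0 or 1).
X = [[3, 0, 1],   # class-0 documents mostly use word 0
     [2, 1, 0],
     [0, 3, 1],   # class-1 documents mostly use word 1
     [1, 2, 0]]
y = [0, 0, 1, 1]

def train_mnb(X, y, alpha=1.0):
    """Return per-class log priors and per-word log probabilities."""
    classes = sorted(set(y))
    log_prior, log_prob = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Sum word counts over the class, with Laplace smoothing alpha.
        totals = [sum(col) + alpha for col in zip(*rows)]
        denom = sum(totals)
        log_prob[c] = [math.log(t / denom) for t in totals]
    return log_prior, log_prob

def predict_mnb(x, log_prior, log_prob):
    """Pick the class maximizing log P(c) + sum_j x_j * log P(word_j | c)."""
    scores = {c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_prob[c]))
              for c in log_prior}
    return max(scores, key=scores.get)

log_prior, log_prob = train_mnb(X, y)
print(predict_mnb([4, 0, 1], log_prior, log_prob))  # heavy on word 0 -> class 0
```

The per-word log-probability term is exactly a multinomial log-likelihood
over counts, which is why count-derived features such as bag-of-words or
TF-IDF matrices pair naturally with this model.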