Natural language processing and examples

'''
    Natural Language Processing (NLP) 
            Bag-of-words model: the semantics of a sentence depend to a large extent on how often
                    each word appears. So every word that may appear can be taken as a feature name,
                    every sentence is a sample, and the number of times a word appears in the
                    sentence is its feature value. The mathematical model built this way is called
                    the bag-of-words model.
                    For example:
                        1. The brown dog is running
                        2. The black dog is in the black room
                        3. Running in the room is forbidden

                           the  brown  dog  is  running  black  in  room  forbidden
                        1   1     1     1    1     1       0     0    0       0
                        2   2     0     1    1     0       2     1    1       0
                        3   1     0     0    1     1       0     1    1       1
            Bag-of-words model API:
                import sklearn.feature_extraction.text as ft
                # construct a bag-of-words model object
                cv = ft.CountVectorizer()
                # train the model: every word that may appear in the sentences becomes a feature
                # name, every sentence is a sample, and the number of times a word appears in a
                # sentence is its feature value
                bow = cv.fit_transform(sentences).toarray()
                print(bow)
                # get all the feature names
                words = cv.get_feature_names()
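
                For the three example sentences above this yields (CountVectorizer lowercases
                words and sorts feature names alphabetically, so the column order differs from
                the table above; values worked out by hand for illustration):
                    ['black', 'brown', 'dog', 'forbidden', 'in', 'is', 'room', 'running', 'the']
                    [[0 1 1 0 0 1 0 1 1]
                     [2 0 1 0 1 1 1 0 2]
                     [0 0 0 1 1 1 1 1 1]]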

            Term frequency (TF): the number of times a word appears in a sentence divided by the
                        total number of words in the sentence, i.e. the frequency with which the
                        word occurs in that sentence. Compared with the raw count, term frequency
                        evaluates a word's semantic contribution more objectively: the higher the
                        term frequency, the greater the semantic contribution. Normalizing each row
                        of the bag-of-words matrix yields the term frequencies.
            Document frequency (DF): the number of sample documents containing a word divided by
                        the total number of sample documents. The higher the document frequency,
                        the lower the word's semantic contribution.
            Inverse document frequency (IDF): the total number of samples divided by the number of
                        samples containing the word. The higher the inverse document frequency, the
                        higher the word's semantic contribution.
            Term frequency-inverse document frequency (TF-IDF): each element of the term frequency
                        matrix multiplied by the corresponding word's inverse document frequency.
                        The larger the value, the greater the word's contribution to the sample's
                        semantics, so a learning model can be built according to each word's
                        contribution strength. A small worked sketch follows below.
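
            A minimal numpy sketch of these definitions, using the example matrix above (note:
            sklearn's TfidfTransformer uses a smoothed, logarithmic IDF and l2 normalization by
            default, so its values differ from this naive version):
                import numpy as np
                bow = np.array([[1, 1, 1, 1, 1, 0, 0, 0, 0],
                                [2, 0, 1, 1, 0, 2, 1, 1, 0],
                                [1, 0, 0, 1, 1, 0, 1, 1, 1]])
                tf = bow / bow.sum(axis=1, keepdims=True)  # term frequency: row-wise normalization
                df = (bow > 0).sum(axis=0)                 # number of samples containing each word
                idf = len(bow) / df                        # total samples / samples with the word
                tfidf = tf * idf                           # naive TF-IDF matrix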
            API for the term frequency-inverse document frequency (TF-IDF) matrix:
                            # get the bag-of-words model
                            cv = ft.CountVectorizer()
                            bow = cv.fit_transform(sentences).toarray()
                            # get the TF-IDF model trainer
                            tt = ft.TfidfTransformer()
                            tfidf = tt.fit_transform(bow).toarray()

'''
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
import numpy as np

doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'

sents = tk.sent_tokenize(doc)
print(sents)

# construct the bag-of-words model
cv = ft.CountVectorizer()
bow = cv.fit_transform(sents)  # sparse matrix
bow = bow.toarray()  # dense matrix
print(bow)
# print the feature names (newer sklearn versions use get_feature_names_out())
print(cv.get_feature_names())

# get the TF-IDF matrix
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)
print(np.round(tfidf.toarray(), 2))
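
# The two steps above can also be combined: a minimal sketch using sklearn's
# TfidfVectorizer, which performs CountVectorizer + TfidfTransformer in one
# pass and, with default settings, should give the same matrix as above:
tv = ft.TfidfVectorizer()
tfidf2 = tv.fit_transform(sents)
print(np.round(tfidf2.toarray(), 2))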

'''
    Text classification (topic identification): use a given text dataset to train a model that
    recognizes topics, then test the model's accuracy with a custom test set.
'''
import sklearn.datasets as sd
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb
import numpy as np

train = sd.load_files('./ml_data/20news', encoding='latin1',
                      shuffle=True, random_state=7)
# get the training samples
train_data = train.data
# get the sample outputs (labels)
train_y = train.target
# get the category label names
categories = train.target_names
print(np.array(train_data).shape)
print(np.array(train_y).shape)
print(categories)
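
# Note: load_files expects one subdirectory per category under the given path;
# each file is one sample and the folder names become target_names. A
# hypothetical layout, following the 20 newsgroups naming convention:
#   ml_data/20news/rec.sport.baseball/0001.txt
#   ml_data/20news/sci.crypt/0002.txt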

# construct the TF-IDF matrix
cv = ft.CountVectorizer()
bow = cv.fit_transform(train_data).toarray()
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)

# train the model; MultinomialNB is used because a multinomial distribution
# better matches the sample distribution of the TF-IDF matrix
model = nb.MultinomialNB()
model.fit(tfidf, train_y)
# test 
test_data = [
    'The curveballs of right handed pitchers tend to curve to the left.',
     'Caesar cipher is an ancient form of encryption.',
    'This two-wheeler is really good on slippery roads'
]
test_bow = cv.transform(test_data)
test_tfidf = tt.transform(test_bow)
pred_test_y = model.predict(test_tfidf)
for sent, index in zip(test_data, pred_test_y):
    print(sent, '---->', categories[index])
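
# A minimal sketch of checking model accuracy with cross-validation
# (assumption: reusing tfidf and train_y from above), since the test
# sentences have no ground-truth labels:
import sklearn.model_selection as ms
scores = ms.cross_val_score(nb.MultinomialNB(), tfidf, train_y,
                            cv=5, scoring='f1_weighted')
print('cross-validated f1:', round(scores.mean(), 3))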
