Text feature extraction: Chinese word segmentation and the bag-of-words model
1. Using CountVectorizer for text feature extraction
# Import the CountVectorizer vectorization tool
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# Fit CountVectorizer to English text data
en = ['The quick brown fox jumps over a lazy dog']
vect.fit(en)
# Print the results
print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Words: {}'.format(vect.vocabulary_))
Number of words: 8
Words: {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumps': 3, 'over': 5, 'lazy': 4, 'dog': 1}
# Try the same with Chinese text
cn = ['那只敏捷的棕色狐狸跳过了一只懒惰的狗']
# Fit to the Chinese text data
vect.fit(cn)
# Print the results
print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Words: {}'.format(vect.vocabulary_))
Number of words: 1
Words: {'那只敏捷的棕色狐狸跳过了一只懒惰的狗': 0}
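Why does the Chinese sentence survive as a single "word"? CountVectorizer's default analyzer splits on whitespace and punctuation and keeps only tokens of two or more word characters, and unsegmented Chinese offers no such boundaries. A minimal sketch (the sample sentences are just illustrations):

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
# build_analyzer() exposes the tokenization CountVectorizer applies
analyze = vect.build_analyzer()

# English is naturally space-separated, so it splits into words;
# single-letter tokens such as 'a' are discarded by the default pattern.
print(analyze('The quick brown fox jumps over a lazy dog'))

# A Chinese sentence written without spaces has no word boundaries
# for the tokenizer to find, so it comes back as a single token.
print(analyze('那只敏捷的棕色狐狸跳过了一只懒惰的狗'))
```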
2. Using a word segmentation tool on Chinese text
# Import the jieba word segmentation module
import jieba
# Segment the Chinese text with jieba
cn = jieba.cut('那只敏捷的棕色狐狸跳过了一只懒惰的狗')
# Use spaces as separators between the words
cn = [' '.join(cn)]
# Print the result
print(cn)
['那 只 敏捷 的 棕色 狐狸 跳过 了 一只 懒惰 的 狗']
# Vectorize the segmented Chinese text with CountVectorizer
vect.fit(cn)
# Print the results
print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Words: {}'.format(vect.vocabulary_))
Number of words: 6
Words: {'敏捷': 2, '棕色': 3, '狐狸': 4, '跳过': 5, '一只': 0, '懒惰': 1}
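Notice that only six words made it into the vocabulary even though the segmented sentence contains more: the default token_pattern, r'(?u)\b\w\w+\b', silently drops single-character tokens such as 的, 了 and 狗. A sketch of how to keep them, assuming the space-separated segmentation shown above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Pre-segmented text (the assumed jieba output from above)
cn = ['那 只 敏捷 的 棕色 狐狸 跳过 了 一只 懒惰 的 狗']

# Default pattern keeps only tokens of 2+ word characters,
# so 的, 了, 狗 and other single characters vanish.
vect = CountVectorizer()
print(sorted(vect.fit(cn).vocabulary_))

# A relaxed pattern keeps single-character words as well.
vect_all = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
print(sorted(vect_all.fit(cn).vocabulary_))
```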
3. Using the bag-of-words model to turn text data into an array
# Build the bag-of-words model
bag_of_words = vect.transform(cn)
# Print the bag-of-words model's feature data
print('Bag-of-words features:\n{}'.format(repr(bag_of_words)))
Bag-of-words features:
<1x6 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>
# Print the dense representation of the bag-of-words model
print('Dense bag-of-words representation:\n{}'.format(bag_of_words.toarray()))
Dense bag-of-words representation:
[[1 1 1 1 1 1]]
# Feed in a new Chinese sentence
cn_1 = jieba.cut('懒惰的狐狸不如敏捷的狐狸敏捷，敏捷的狐狸不如懒惰的狐狸懒惰')
# Separate the words with spaces
cn2 = [' '.join(cn_1)]
# Print the result
print(cn2)
['懒惰 的 狐狸 不如 敏捷 的 狐狸 敏捷 ， 敏捷 的 狐狸 不如 懒惰 的 狐狸 懒惰']
# Build a new bag-of-words model
new_bag = vect.transform(cn2)
# Print the results
print('Bag-of-words features:\n{}'.format(repr(new_bag)))
print('Dense bag-of-words representation:\n{}'.format(new_bag.toarray()))
Bag-of-words features:
<1x6 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Row format>
Dense bag-of-words representation:
[[0 3 3 0 4 0]]
Further optimizing text data processing
1. Improving the bag-of-words model with n-Grams
# Improve the bag-of-words model with n-Grams
# First write down a little joke
joke = jieba.cut('小明看见了李小明骑了一辆夏利')
# Insert spaces
joke = [' '.join(joke)]
# Turn it into a vector
vect.fit(joke)
joke_feature = vect.transform(joke)
# Print the text's feature data
print('Feature representation of this sentence:\n{}'.format(joke_feature.toarray()))
Feature representation of this sentence:
[[1 1 1 1 1]]
# Now scramble the word order of the sentence
joke2 = jieba.cut('看见了一辆夏利骑的李小明')
# Insert spaces
joke2 = [' '.join(joke2)]
# Extract the features
joke2_feature = vect.transform(joke2)
# Print the text features
print('Feature representation of this sentence:\n{}'.format(joke2_feature.toarray()))
Feature representation of this sentence:
[[1 1 0 1 1]]
# Modify CountVectorizer's ngram parameter
vect = CountVectorizer(ngram_range=(2, 2))
# Re-extract features from the text data
cv = vect.fit(joke)
joke_feature = cv.transform(joke)
# Print the new results
print('Dictionary after adjusting the n-Gram parameter: {}'.format(cv.get_feature_names()))
print('New feature representation: {}'.format(joke_feature.toarray()))
Dictionary after adjusting the n-Gram parameter: ['一辆 夏利', '小明 看见', '李小明 一辆', '看见 李小明']
New feature representation: [[1 1 1 1]]
# The sentence with scrambled word order again
joke2 = jieba.cut('看见了一辆夏利骑的李小明')
# Insert spaces
joke2 = [' '.join(joke2)]
# Extract features from the text data
joke2_feature = vect.transform(joke2)
print('New feature representation: {}'.format(joke2_feature.toarray()))
New feature representation: [[0 0 0 0]]
- After adjusting CountVectorizer's ngram_range parameter, the machine no longer considers the two sentences identical, so the n-Gram model gives text feature extraction a useful boost.
2. Using the tf-idf model to process text data
# Show the folder tree of the aclImdb dataset
!tree aclImdb
Folder PATH listing for volume Data
Volume serial number is 06B1-81F6
D:\JUPYTERNOTEBOOK\ACLIMDB
├─test
│  ├─neg
│  └─pos
└─train
    ├─neg
    ├─pos
    └─unsup
# Import the CountVectorizer vectorization tool
from sklearn.feature_extraction.text import CountVectorizer
# Import the file loading tool
from sklearn.datasets import load_files
# Define the training dataset
train_set = load_files('aclImdb/train')
X_train, y_train = train_set.data, train_set.target
# Print the number of files in the training set
print('Number of training set files: {}'.format(len(X_train)))
# Print one review picked at random
print('A random sample:', X_train[22])
Number of training set files: 75000
A random sample: b"Okay, once you get past the fact that Mitchell and Petrillo are Dean and Jerry knockoffs, you could do worse than this film. Charlita as Princess Nona is great eye candy, Lugosi does his best with the material he's given, and the production values, music especially (except for the vocals) are better than you'd think for the $50k cost of production. The final glimpses of the characters are a hoot. Written by Tim Ryan, a minor actor in late Charlie Chan films, and husband of Grannie on the Beverly Hillbillies. All in all, WAY better than many late Lugosi cheapies."
# Load the test set
test = load_files('aclImdb/test/')
X_test, y_test = test.data, test.target
# Print the number of files in the test set
print(len(X_test))
25000
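The training count of 75,000 is no accident: aclImdb/train also contains an unsup folder holding 50,000 unlabeled reviews, and load_files treats every subfolder as a class, which likely depresses the classification scores later on. load_files accepts a categories parameter to restrict what is read; a self-contained sketch using a miniature stand-in for the directory layout:

```python
import os
import tempfile
from sklearn.datasets import load_files

# Build a tiny stand-in for the aclImdb/train layout
root = tempfile.mkdtemp()
for sub, text in [('neg', 'bad movie'), ('pos', 'great movie'),
                  ('unsup', 'no label here')]:
    os.makedirs(os.path.join(root, sub))
    with open(os.path.join(root, sub, 'r0.txt'), 'w') as f:
        f.write(text)

# By default every subfolder becomes a class...
all_set = load_files(root)
print(len(all_set.data), all_set.target_names)

# ...but categories=[...] restricts loading to the labeled folders.
train_set = load_files(root, categories=['neg', 'pos'])
print(len(train_set.data), train_set.target_names)
```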
# Fit CountVectorizer to the training dataset
vect = CountVectorizer().fit(X_train)
# Turn the text into vectors
X_train_vect = vect.transform(X_train)
# Turn the test dataset into vectors
X_test_vect = vect.transform(X_test)
# Print the number of training set features
print('Number of training set features: {}'.format(len(vect.get_feature_names())))
# Print the last 10 training set features
print('Last 10 training set features: {}'.format(vect.get_feature_names()[-10:]))
Number of training set features: 124255
Last 10 training set features: ['üvegtigris', 'üwe', 'ÿou', 'ıslam', 'ōtomo', 'şey', 'дом', 'книги', '彩铃', '摇滚']
# Import the tf-idf transformation tool
from sklearn.feature_extraction.text import TfidfTransformer
# Transform the training and test sets with the tf-idf tool
tfidf = TfidfTransformer(smooth_idf=False)
tfidf.fit(X_train_vect)
X_train_tfidf = tfidf.transform(X_train_vect)
X_test_tfidf = tfidf.transform(X_test_vect)
# Compare the features before and after the transformation
print('Features without tf-idf processing:\n', X_train_vect[:5, :5].toarray())
print('Features after tf-idf processing:\n', X_train_tfidf[:5, :5].toarray())
Features without tf-idf processing:
 [[0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]
Features after tf-idf processing:
 [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
# Import the LinearSVC linear classification model
from sklearn.svm import LinearSVC
# Import the cross-validation tool
from sklearn.model_selection import cross_val_score
# Score the model with cross-validation on the raw counts
scores = cross_val_score(LinearSVC(), X_train_vect, y_train, cv=3)
# Train a LinearSVC model on the tf-idf features
clf = LinearSVC().fit(X_train_tfidf, y_train)
# Cross-validate again with the transformed data
scores2 = cross_val_score(LinearSVC(), X_train_tfidf, y_train, cv=3)
# Print the new scores for comparison
print('Cross-validation score on the tf-idf processed training set: {:.3f}'.format(scores2.mean()))
print('Score on the tf-idf processed test set: {:.3f}'.format(clf.score(X_test_tfidf, y_test)))
Cross-validation score on the tf-idf processed training set: 0.660
Score on the tf-idf processed test set: 0.144
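With smooth_idf=False, scikit-learn computes idf(t) = ln(n / df(t)) + 1 and then, with the default norm='l2', scales each document row to unit length. A small sketch that reproduces TfidfTransformer by hand on a toy two-document corpus (docs is purely illustrative):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['the fox jumps', 'the dog sleeps']
counts = CountVectorizer().fit_transform(docs).toarray()

# idf(t) = ln(n / df(t)) + 1, where df(t) is the number of
# documents containing term t and n is the document count.
n = counts.shape[0]
df = (counts > 0).sum(axis=0)
idf = np.log(n / df) + 1

# Multiply counts by idf, then l2-normalize each row.
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

tfidf_sklearn = TfidfTransformer(smooth_idf=False).fit_transform(counts).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))  # → True
```

Here 'the' appears in both documents, so its idf is ln(2/2) + 1 = 1, the smallest weight; rare terms get ln(2/1) + 1 ≈ 1.69, which is exactly the "reward distinctive words" behavior tf-idf is meant to provide.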
3. Removing stop words from the text
# Import the built-in stop-word list
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# Print the number of stop words
print('Number of stop words:', len(ENGLISH_STOP_WORDS))
# Print the first 20 and last 20 stop words
print('First 20 and last 20:\n', list(ENGLISH_STOP_WORDS)[:20], list(ENGLISH_STOP_WORDS)[-20:])
Number of stop words: 318
First 20 and last 20:
 ['interest', 'meanwhile', 'do', 'thereupon', 'can', 'cry', 'upon', 'then', 'first', 'six', 'except', 'our', 'noone', 'being', 'done', 'afterwards', 'any', 'even', 'after', 'otherwise'] ['seemed', 'top', 'as', 'all', 'found', 'very', 'nor', 'seem', 'via', 'these', 'been', 'beforehand', 'behind', 'becomes', 'un', 'ten', 'onto', 'ourselves', 'an', 'keep']
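The built-in list is activated through the stop_words='english' parameter of CountVectorizer (and TfidfVectorizer). A quick sketch on the fox sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

text = ['The quick brown fox jumps over a lazy dog']

# stop_words='english' drops frequent but uninformative words
# ('the', 'a', 'over', ...) from the vocabulary.
vect = CountVectorizer(stop_words='english')
print(sorted(vect.fit(text).vocabulary_))
```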
# Import the TfidfVectorizer model
from sklearn.feature_extraction.text import TfidfVectorizer
# Activate the English stop-word parameter
tfidf = TfidfVectorizer(smooth_idf=False, stop_words='english')
# Fit the training dataset
tfidf.fit(X_train)
# Turn the training dataset text into vectors
X_train_tfidf = tfidf.transform(X_train)
# Cross-validation score
scores3 = cross_val_score(LinearSVC(), X_train_tfidf, y_train, cv=3)
clf.fit(X_train_tfidf, y_train)
# Turn the test dataset into vectors
X_test_tfidf = tfidf.transform(X_test)
# Print the cross-validation and test set scores
print('Mean cross-validation score after removing stop words: {:.6f}'.format(scores3.mean()))
print('Test set model score after removing stop words: {:.6f}'.format(clf.score(X_test_tfidf, y_test)))
Mean cross-validation score after removing stop words: 0.723933
Test set model score after removing stop words: 0.150920
Summary:
scikit-learn provides two classes that apply the tf-idf method: TfidfTransformer, which transforms the sparse matrix of counts that CountVectorizer extracts from text, and TfidfVectorizer, whose usage is the same as CountVectorizer's and which is equivalent to combining the work of CountVectorizer and TfidfTransformer in a single step.
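The equivalence can be checked directly: on the same corpus, CountVectorizer followed by TfidfTransformer and a single TfidfVectorizer produce the same matrix. A sketch on a toy corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

docs = ['the fox jumps', 'the dog sleeps']

# Two-step style: counts first, tf-idf weighting second
counts = CountVectorizer().fit_transform(docs)
via_transformer = TfidfTransformer(smooth_idf=False).fit_transform(counts)

# One-step style: TfidfVectorizer does both jobs at once
via_vectorizer = TfidfVectorizer(smooth_idf=False).fit_transform(docs)

print(np.allclose(via_transformer.toarray(), via_vectorizer.toarray()))  # → True
```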
The most commonly used Python toolkit in the field of natural language processing is NLTK. It can likewise perform word segmentation, part-of-speech tagging of text and other tasks, and it can also be used for stemming and lemmatization.
For deeper insight, look into topic modeling (Topic Modeling) and document clustering (Document Clustering).
For deep learning in the field of natural language processing, the word2vec library is used most often; it is worth exploring if you are interested.
Reference: 《深入浅出Python机器学习》 ("Python Machine Learning in Plain Language")