Text Data Processing (NLP Basics)

Extracting text features: Chinese word segmentation and the bag-of-words model

1. Using CountVectorizer for text feature extraction

# Import the CountVectorizer vectorization tool
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# Fit CountVectorizer on English text data
en = ['The quick brown fox jumps over a lazy dog']
vect.fit(en)
# Print the results
print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Words: {}'.format(vect.vocabulary_))
Number of words: 8
Words: {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumps': 3, 'over': 5, 'lazy': 4, 'dog': 1}
# Try the same thing with Chinese text
cn = ['敏捷的棕色狐狸跳过了一只懒惰的狗']  # "The agile brown fox jumps over a lazy dog"
# Fit on the Chinese text data
vect.fit(cn)
# Print the results
print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Words: {}'.format(vect.vocabulary_))
Number of words: 1
Words: {'敏捷的棕色狐狸跳过了一只懒惰的狗': 0}
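CountVectorizer treats the whole Chinese sentence as a single feature because its default tokenizer only splits on whitespace and punctuation and keeps runs of two or more word characters; an unsegmented Chinese sentence contains no spaces, so nothing gets split. A quick way to see the rule being applied (this simply inspects the scikit-learn default, nothing is configured here):

# The default token_pattern keeps any run of two or more word characters,
# so a Chinese sentence written without spaces is swallowed as one "word"
from sklearn.feature_extraction.text import CountVectorizer
print(CountVectorizer().token_pattern)   # (?u)\b\w\w+\b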

2. Using a word segmentation tool on Chinese text

# Import the jieba word segmentation module
import jieba
# Use jieba to segment the Chinese text
cn = jieba.cut('敏捷的棕色狐狸跳过了一只懒惰的狗')
# Join the segmented words with spaces as separators
cn = [' '.join(cn)]
# Print the result
print(cn)
['敏捷 的 棕色 狐狸 跳过 了 一只 懒惰 的 狗']
# Vectorize the segmented Chinese text with CountVectorizer
vect.fit(cn)
# Print the results
print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Words: {}'.format(vect.vocabulary_))
Number of words: 6
Words: {'敏捷': 2, '棕色': 3, '狐狸': 4, '跳过': 5, '一只': 0, '懒惰': 1}
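Besides the default precise mode used above, jieba also offers a full mode and a search-engine mode; a small sketch on the same sentence:

import jieba
text = '敏捷的棕色狐狸跳过了一只懒惰的狗'
# Precise mode (default): each character belongs to exactly one word
print('/'.join(jieba.cut(text)))
# Full mode: lists every word the dictionary can find, with overlaps
print('/'.join(jieba.cut(text, cut_all=True)))
# Search-engine mode: additionally re-splits long words into shorter ones
print('/'.join(jieba.cut_for_search(text)))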

3. Using the bag-of-words model to turn text data into an array

# Build the bag-of-words model
bag_of_words = vect.transform(cn)
# Print the bag-of-words feature matrix
print('Bag-of-words features:\n{}'.format(repr(bag_of_words)))
Bag-of-words features:
<1x6 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>
# Print the dense representation of the bag-of-words model
print('Dense bag-of-words representation:\n{}'.format(bag_of_words.toarray()))
Dense bag-of-words representation:
[[1 1 1 1 1 1]]
# Enter a new Chinese sentence
cn_1 = jieba.cut('懒惰的狐狸不如敏捷的狐狸敏捷，敏捷的狐狸不如懒惰的狐狸懒惰')
# "The lazy fox is not as agile as the agile fox; the agile fox is not as lazy as the lazy fox"
# Join with spaces
cn2 = [' '.join(cn_1)]
# Print the result
print(cn2)
['懒惰 的 狐狸 不如 敏捷 的 狐狸 敏捷 ， 敏捷 的 狐狸 不如 懒惰 的 狐狸 懒惰']
# Build a new bag-of-words model
new_bag = vect.transform(cn2)
# Print the results
print('Bag-of-words features:\n{}'.format(repr(new_bag)))
print('Dense bag-of-words representation:\n{}'.format(new_bag.toarray()))
Bag-of-words features:
<1x6 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Row format>
Dense bag-of-words representation:
[[0 3 3 0 4 0]]
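To see which count in the dense vector belongs to which word, the vocabulary can be lined up against the array. A small sketch that continues with the vect and new_bag objects defined above (get_feature_names() is renamed get_feature_names_out() in newer scikit-learn versions):

# Pair each vocabulary entry with its count in the new sentence
for word, count in zip(vect.get_feature_names(), new_bag.toarray()[0]):
    print(word, count)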

Further optimizing text data processing

1. Improving the bag-of-words model with n-Grams

# Use n-Grams to improve the bag-of-words model
# First, another short sentence
joke = jieba.cut('小明看见了骑着自行车的小李')  # "Xiao Ming saw Xiao Li riding a bicycle"
# Insert spaces
joke = [' '.join(joke)]
# Vectorize
vect.fit(joke)
joke_feature = vect.transform(joke)
# Print the text features
print('Feature representation of this sentence:\n{}'.format(joke_feature.toarray()))
Feature representation of this sentence:
[[1 1 1 1 1]]
# Now rearrange the sentence
joke2 = jieba.cut('小李骑着小明的自行车')  # "Xiao Li is riding Xiao Ming's bicycle"
# Insert spaces
joke2 = [' '.join(joke2)]
# Extract the features
joke2_feature = vect.transform(joke2)
# Print the text features
print('Feature representation of this sentence:\n{}'.format(joke2_feature.toarray()))
Feature representation of this sentence:
[[1 1 0 1 1]]
# Change CountVectorizer's ngram_range parameter
vect = CountVectorizer(ngram_range=(2, 2))
# Re-extract the text features
cv = vect.fit(joke)
joke_feature = cv.transform(joke)
# Print the new results
print('Dictionary after adjusting the n-Gram parameter: {}'.format(cv.get_feature_names()))
print('New feature representation: {}'.format(joke_feature.toarray()))
Dictionary after adjusting the n-Gram parameter: ['小明 看见', '看见 骑着', '自行车 小李', '骑着 自行车']
New feature representation: [[1 1 1 1]]
# Segment the rearranged sentence again
joke2 = jieba.cut('小李骑着小明的自行车')
# Insert spaces
joke2 = [' '.join(joke2)]
# Extract the text features
joke2_feature = vect.transform(joke2)
print('New feature representation: {}'.format(joke2_feature.toarray()))
New feature representation: [[0 0 0 0]]
  • After adjusting CountVectorizer's ngram_range parameter, the model no longer treats the two sentences as meaning the same thing, so the n-Gram technique gives bag-of-words feature extraction a real improvement (a short sketch of a middle-ground setting follows).
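With ngram_range=(2, 2) only word pairs are kept, which throws away the single-word counts; a common middle ground is to keep both unigrams and bigrams. A minimal sketch, reusing the joke variable defined above:

# Keep both single words and adjacent word pairs as features
vect_12 = CountVectorizer(ngram_range=(1, 2))
vect_12.fit(joke)
print(vect_12.get_feature_names())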

2. Processing text data with tf-idf

# Show the folder tree of the aclImdb dataset
!tree aclImdb
Folder PATH listing for volume Data
Volume serial number is 06B1-81F6
D:\JUPYTERNOTEBOOK\ACLIMDB
├─test
│ ├─neg
│ └─pos
└─train
    ├─neg
    ├─pos
    └─unsup
# Import the CountVectorizer vectorization tool
from sklearn.feature_extraction.text import CountVectorizer
# Import the file loading tool
from sklearn.datasets import load_files
# Define the training dataset
train_set = load_files('aclImdb/train')
X_train, y_train = train_set.data, train_set.target
# Print the number of files in the training set
print('Number of files in the training set: {}'.format(len(X_train)))
# Pull out one review and print it
print('A random sample:', X_train[22])
Number of files in the training set: 75000
A random sample: b"Okay, once you get past the fact that Mitchell and Petrillo are Dean and Jerry knockoffs, you could do worse than this film. Charlita as Princess Nona is great eye candy, Lugosi does his best with the material he's given, and the production values, music especially (except for the vocals) are better than you'd think for the $50k cost of production. The final glimpses of the characters are a hoot. Written by Tim Ryan, a minor actor in late Charlie Chan films, and husband of Grannie on the Beverly Hillbillies. All in all, WAY better than many late Lugosi cheapies."
# Load the test set
test = load_files('aclImdb/test/')
X_test, y_test = test.data, test.target
# Print the number of files in the test set
print(len(X_test))
25000
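Note that load_files treats every subfolder as a separate class, and the aclImdb train folder also contains an unsup subfolder with 50,000 unlabeled reviews, which is why the training set reports 75,000 files while the test set has only 25,000. If only the labeled reviews are wanted, the categories parameter of load_files can restrict the load; a small sketch (the rest of this article keeps the full 75,000-file load shown above):

# Load only the pos/neg subfolders, leaving out the unlabeled 'unsup' reviews
from sklearn.datasets import load_files
labeled_train = load_files('aclImdb/train', categories=['neg', 'pos'])
print(len(labeled_train.data))   # 25000 labeled reviews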
# Fit CountVectorizer on the training dataset
vect = CountVectorizer().fit(X_train)
# Turn the training text into vectors
X_train_vect = vect.transform(X_train)
# Turn the test text into vectors
X_test_vect = vect.transform(X_test)
# Print the number of features in the training set
print('Number of features in the training set: {}'.format(len(vect.get_feature_names())))
# Print the last 10 features of the training set
print('Last 10 features of the training set: {}'.format(vect.get_feature_names()[-10:]))
Number of features in the training set: 124255
Last 10 features of the training set: ['üvegtigris', 'üwe', 'ÿou', 'ıslam', 'ōtomo', 'şey', 'дом', 'книги', '彩铃', '摇滚']
# Import the tf-idf transformation tool
from sklearn.feature_extraction.text import TfidfTransformer
# Transform the training and test sets with the tf-idf tool
tfidf = TfidfTransformer(smooth_idf=False)
tfidf.fit(X_train_vect)
X_train_tfidf = tfidf.transform(X_train_vect)
X_test_tfidf = tfidf.transform(X_test_vect)
# Compare the features before and after tf-idf processing
print('Features before tf-idf processing:\n', X_train_vect[:5, :5].toarray())
print('Features after tf-idf processing:\n', X_train_tfidf[:5, :5].toarray())
Features before tf-idf processing:
 [[0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]
Features after tf-idf processing:
 [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
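As a rough sketch of what TfidfTransformer computes: with smooth_idf=False, scikit-learn uses idf(t) = ln(n / df(t)) + 1, multiplies it by the raw term count, and by default L2-normalizes every row. The toy count matrix below is made up purely for illustration:

# Recompute tf-idf by hand and compare with TfidfTransformer(smooth_idf=False)
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])                       # 6 documents, 3 terms (made-up counts)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                        # document frequency of each term
idf = np.log(n_docs / df) + 1                        # idf with smooth_idf=False
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)  # L2 norm per row

tfidf_sklearn = TfidfTransformer(smooth_idf=False).fit_transform(counts).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))      # expected: True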
# Import the LinearSVC classification model
from sklearn.svm import LinearSVC
# Import the cross-validation tool
from sklearn.model_selection import cross_val_score
# Score the model on the raw count vectors with cross-validation
scores = cross_val_score(LinearSVC(), X_train_vect, y_train, cv=3)
# Retrain a LinearSVC model on the tf-idf features
clf = LinearSVC().fit(X_train_tfidf, y_train)
# Cross-validate again with the transformed data
scores2 = cross_val_score(LinearSVC(), X_train_tfidf, y_train, cv=3)
# Print the new scores for comparison
print('Cross-validation score on the tf-idf processed training set: {:.3f}'.format(scores2.mean()))
print('Score on the tf-idf processed test set: {:.3f}'.format(clf.score(X_test_tfidf, y_test)))
Cross-validation score on the tf-idf processed training set: 0.660
Score on the tf-idf processed test set: 0.144

3. Removing stop words from the text

# Import the built-in list of English stop words
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# Print the number of stop words
print('Number of stop words:', len(ENGLISH_STOP_WORDS))
# Print the first 20 and the last 20 stop words
print('First 20 and last 20:\n', list(ENGLISH_STOP_WORDS)[:20], list(ENGLISH_STOP_WORDS)[-20:])
Number of stop words: 318
First 20 and last 20:
 ['interest', 'meanwhile', 'do', 'thereupon', 'can', 'cry', 'upon', 'then', 'first', 'six', 'except', 'our', 'noone', 'being', 'done', 'afterwards', 'any', 'even', 'after', 'otherwise'] ['seemed', 'top', 'as', 'all', 'found', 'very', 'nor', 'seem', 'via', 'these', 'been', 'beforehand', 'behind', 'becomes', 'un', 'ten', 'onto', 'ourselves', 'an', 'keep']
# Import the TfidfVectorizer tool
from sklearn.feature_extraction.text import TfidfVectorizer
# Activate the English stop word parameter
tfidf = TfidfVectorizer(smooth_idf=False, stop_words='english')
# Fit the training dataset
tfidf.fit(X_train)
# Turn the training text into vectors
X_train_tfidf = tfidf.transform(X_train)
# Cross-validation score
scores3 = cross_val_score(LinearSVC(), X_train_tfidf, y_train, cv=3)
clf.fit(X_train_tfidf, y_train)
# Turn the test text into vectors
X_test_tfidf = tfidf.transform(X_test)
# Print the cross-validation score and the test set score
print('Mean cross-validation score on the training set after removing stop words: {:.6f}'.format(scores3.mean()))
print('Test set score after removing stop words: {:.6f}'.format(clf.score(X_test_tfidf, y_test)))
Mean cross-validation score on the training set after removing stop words: 0.723933
Test set score after removing stop words: 0.150920
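Stop-word removal is one way of pruning uninformative features; CountVectorizer and TfidfVectorizer can also drop terms by document frequency through their max_df and min_df parameters. A minimal sketch on a made-up corpus, with arbitrary thresholds chosen for illustration (get_feature_names() is get_feature_names_out() in newer scikit-learn versions):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the movie was great',
        'the movie was awful',
        'a great movie with an awful plot']
# max_df=0.9 drops words appearing in more than 90% of documents (here: 'movie'),
# min_df=1 keeps everything else; 'the', 'was', 'a', ... are removed as stop words
vec = TfidfVectorizer(stop_words='english', max_df=0.9, min_df=1)
X = vec.fit_transform(docs)
print(vec.get_feature_names())   # expected: ['awful', 'great', 'plot']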

 

To sum up:

  In scikit-learn, two classes implement the tf-idf method. One is TfidfTransformer, which re-weights the count matrix that CountVectorizer extracts from text; the other is TfidfVectorizer, which is used in the same way as CountVectorizer and is equivalent to doing the work of CountVectorizer and TfidfTransformer in a single step.
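A minimal sketch of that equivalence, on a made-up pair of sentences:

# The two-step route (CountVectorizer + TfidfTransformer) and the one-step route
# (TfidfVectorizer) produce the same matrix when both use default parameters
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ['the quick brown fox', 'the lazy dog']
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)
one_step = TfidfVectorizer().fit_transform(docs)
print(np.allclose(two_step.toarray(), one_step.toarray()))   # expected: True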

  The most commonly used Python toolkit in the natural language field is NLTK. It can also tokenize text and add part-of-speech tags, and it can be used for stemming and lemmatization.
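A small sketch of those NLTK features on a made-up sentence (the nltk.download() calls fetch the required data the first time, and the exact resource names can vary between NLTK versions):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

sentence = 'The foxes were jumping over the lazy dogs'
tokens = nltk.word_tokenize(sentence)                       # tokenization
print(nltk.pos_tag(tokens))                                 # part-of-speech tagging
print([PorterStemmer().stem(t) for t in tokens])            # stemming
print([WordNetLemmatizer().lemmatize(t) for t in tokens])   # lemmatization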

  If you want to go further, it is worth looking into topic modeling (Topic Modeling) and document clustering (Document Clustering).
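For example, scikit-learn ships a latent Dirichlet allocation implementation that can be pointed at a bag-of-words matrix like the ones built earlier; a minimal sketch on a made-up corpus:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ['the fox chased the dog',
        'the dog chased the fox',
        'stock prices fell sharply',
        'the market and stock prices recovered']
counts = CountVectorizer(stop_words='english').fit_transform(docs)
# Ask for two topics; every document gets a distribution over the topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics.shape)   # (4, 2)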

  In deep learning for natural language processing, the word2vec family of word embeddings is the tool that comes up most often; it is worth learning about if you are interested.
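One common way of trying word2vec from Python is the gensim package; a minimal sketch on a made-up toy corpus (real embeddings need far more text, and the parameter names shown are for gensim 4.x, where older versions used size and iter instead):

from gensim.models import Word2Vec

# Each training sample is a list of tokens; a real corpus has many thousands of sentences
sentences = [['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog'],
             ['lazy', 'fox', 'sleeps', 'all', 'day'],
             ['quick', 'dog', 'chases', 'the', 'fox']]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200)
print(model.wv['fox'][:5])                    # first few dimensions of the 'fox' vector
print(model.wv.most_similar('fox', topn=3))   # nearest neighbours in the embedding space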

 

Reference: "深入浅出Python机器学习" (Python Machine Learning in Layman's Terms)

Original article: www.cnblogs.com/weijiazheng/p/10972708.html