As mentioned earlier, scikit-learn is a lightweight library that is well suited to study, so that is what we will learn here.
Installation is simple, so it is not covered in detail. Note that scikit-learn depends on NumPy and SciPy (pip installs them automatically); pandas is not required, although it is often used alongside it.
If you installed Anaconda, scikit-learn is already included.
----------------------------------------------------------------------------------------------------------
1. Dictionary feature extraction
Purpose: extract feature values from dictionary data.
API: sklearn.feature_extraction.DictVectorizer
Steps: 1. Instantiate the DictVectorizer class.
2. Pass in the data and call the conversion method fit_transform.
The code:
    from sklearn.feature_extraction import DictVectorizer

    def dictvec():
        '''
        Extract features from dictionary data
        :return: None
        '''
        # Instantiate the vectorizer (named so it does not shadow the built-in dict)
        vectorizer = DictVectorizer()

        # Call fit_transform on a list of dictionaries
        data = vectorizer.fit_transform([{'name': 'X', 'score': 80},
                                         {'name': 'Y', 'score': 90},
                                         {'name': 'Z', 'score': 100}])

        print(data)

        return None

    if __name__ == '__main__':
        dictvec()
The output is a sparse matrix. Each printed line gives a coordinate in parentheses followed by the value stored at that coordinate; for example, (0, 0) 1.0 means that the entry at row 0, column 0 is 1.0.
Coordinates that are not listed, such as (0, 1) and (0, 2), hold the default value 0.
To get a result that is easier to read, set the sparse parameter of DictVectorizer() to False, which returns an ordinary two-dimensional array instead.
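As a minimal sketch of the two output forms described above, the snippet below runs the same data through DictVectorizer twice: once with the default sparse output, listing its stored coordinates, and once with sparse=False. The variable names (records, entries, dense) are only illustrative, and the getattr fallback assumes that older scikit-learn releases expose get_feature_names() while 1.0 and later expose get_feature_names_out().

```python
from sklearn.feature_extraction import DictVectorizer

records = [
    {'name': 'X', 'score': 80},
    {'name': 'Y', 'score': 90},
    {'name': 'Z', 'score': 100},
]

# Default: fit_transform returns a SciPy sparse matrix.
sparse_vec = DictVectorizer()
sparse_data = sparse_vec.fit_transform(records)

# List only the stored (row, column, value) entries; everything else is 0.
coo = sparse_data.tocoo()
entries = [(int(r), int(c), float(v)) for r, c, v in zip(coo.row, coo.col, coo.data)]
for row, col, value in entries:
    print((row, col), value)

# sparse=False: fit_transform returns a dense NumPy array instead.
dense_vec = DictVectorizer(sparse=False)
dense = dense_vec.fit_transform(records)

# Column labels; the method name depends on the scikit-learn version.
get_names = getattr(dense_vec, 'get_feature_names_out', None) or dense_vec.get_feature_names
feature_names = [str(name) for name in get_names()]

print(feature_names)
print(dense)
```

The string-valued key name is one-hot encoded into one column per distinct value (name=X, name=Y, name=Z), while the numeric key score is passed through as a single column.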
2. Text feature extraction
Purpose: extract feature values from text data.
API: sklearn.feature_extraction.text.CountVectorizer
The code: suppose there are two articles, 'life is short, i like python' and 'life is too long, i dislike python'.
    from sklearn.feature_extraction.text import CountVectorizer

    def countvec():
        '''
        Extract feature values from text
        :return: None
        '''
        # Instantiate the vectorizer
        cv = CountVectorizer()

        # Call fit_transform on a list of documents
        data = cv.fit_transform(['life is short, i like python',
                                 'life is too long, i dislike python'])

        print(data)

        return None

    if __name__ == '__main__':
        countvec()
The result has the same form as the one from dictionary extraction: a sparse matrix. Note that CountVectorizer has no sparse parameter, so to convert this matrix into an easier-to-read two-dimensional array you call toarray() on the result.
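A minimal sketch of that conversion, using the two example articles; the names articles, sparse_counts and dense_counts are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

articles = [
    'life is short, i like python',
    'life is too long, i dislike python',
]

cv = CountVectorizer()

# fit_transform returns a SciPy sparse matrix of word counts.
sparse_counts = cv.fit_transform(articles)

# toarray() converts it to a dense NumPy array: one row per article,
# one column per extracted word.
dense_counts = sparse_counts.toarray()

print(type(sparse_counts))
print(dense_counts)
```

For large corpora the dense array can take far more memory than the sparse matrix, so toarray() is best kept for inspecting small examples like this one.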
get_feature_names() returns a list of all the extracted features (in this example eight words are extracted; single letters are not counted). In scikit-learn 1.0 and later the method is called get_feature_names_out(), and get_feature_names() was removed in version 1.2.
The result contains two lists, one for each article. The first 0 in the first list means that "dislike" does not appear in the first article, the first 1 in that list means that "is" does appear, and so on.
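To make the correspondence between the feature list and the two count lists easier to see, the sketch below pairs each word with its count for each article. The names articles and per_article are only illustrative, and the getattr fallback assumes that older scikit-learn releases expose get_feature_names() while 1.0 and later expose get_feature_names_out().

```python
from sklearn.feature_extraction.text import CountVectorizer

articles = [
    'life is short, i like python',
    'life is too long, i dislike python',
]

cv = CountVectorizer()
counts = cv.fit_transform(articles).toarray()

# Column labels; the method name depends on the scikit-learn version.
get_names = getattr(cv, 'get_feature_names_out', None) or cv.get_feature_names
feature_names = [str(name) for name in get_names()]
print(feature_names)

# Pair every feature name with its count, one dictionary per article.
per_article = []
for article, row in zip(articles, counts):
    word_counts = {name: int(n) for name, n in zip(feature_names, row)}
    per_article.append(word_counts)
    print(article)
    print(word_counts)
```

The single letter "i" is missing from the feature list because the default tokenizer of CountVectorizer only keeps tokens of two or more characters.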