04 _ Definition of feature engineering

1. Feature extraction: converting data such as text strings and dictionaries into numeric features.

2. Feature extraction API: sklearn.feature_extraction

3. Dictionary feature extraction: use sklearn.feature_extraction.DictVectorizer to convert dictionary data into feature values.

 DictVectorizer syntax:

          DictVectorizer.fit_transform(X): X is a dictionary or an iterator over dictionaries; returns a sparse matrix by default

          DictVectorizer.inverse_transform(X): X is an array or a sparse matrix; returns the data in the format it had before conversion

          DictVectorizer.get_feature_names(): returns the feature names (renamed get_feature_names_out() in scikit-learn 1.0 and later)

          DictVectorizer.transform(X): converts X using the features learned during the original fit

 Dictionary data extraction: each categorical value in the dictionary data is converted into its own feature; numeric values are kept as they are.
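The methods listed above can be tried together in a short sketch. It assumes scikit-learn is installed; the sample data and the variable names (samples, encoded, new_encoded, restored) are illustrative only and are not part of the original notes.

```python
from sklearn.feature_extraction import DictVectorizer

samples = [
    {"city": "Beijing", "temperature": 100},
    {"city": "Shanghai", "temperature": 60},
]

# sparse=False makes fit_transform return a dense NumPy array
vectorizer = DictVectorizer(sparse=False)
encoded = vectorizer.fit_transform(samples)

# transform() reuses the columns learned by fit_transform(); a city that
# was not seen during fitting has no column, so only temperature is set
new_encoded = vectorizer.transform([{"city": "Shenzhen", "temperature": 30}])

# inverse_transform() maps each row back to a dict of its non-zero features
restored = vectorizer.inverse_transform(encoded)

print(encoded)
print(new_encoded)
print(restored)
```

Note that inverse_transform returns keys in the encoded form (for example 'city=Beijing' with value 1.0), not the original {"city": "Beijing"} pair.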

4. One-hot encoding: a Boolean column is generated for each category (each distinct value of a column in the data table); for each sample, exactly one of these columns has the value 1.

   In short, one-hot encoding prevents the numeric size of category codes from implying a priority among categories, which makes data analysis and machine learning easier.
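To make the idea concrete, here is a minimal pure-Python sketch of one-hot encoding. The function name one_hot_encode and the sample values are hypothetical and chosen only for illustration; this is not how scikit-learn implements it internally.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one column is 1 for each sample
        rows.append(row)
    return categories, rows


categories, rows = one_hot_encode(["Beijing", "Shanghai", "Shenzhen", "Beijing"])
print(categories)
for row in rows:
    print(row)
```

If the cities were instead coded as 0, 1, 2 in a single column, a model could read Shenzhen as "larger" than Beijing; separate 0/1 columns avoid that.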

 

 Dictionary feature extraction example data:

        [{"city": "Beijing", "temperature": 100},
         {"city": "Shanghai", "temperature": 60},
         {"city": "Shenzhen", "temperature": 30}]

 

from sklearn.feature_extraction import DictVectorizer


def dictvec():
    """Extract features from dictionary data."""
    # The default is sparse=True; pass sparse=False to get a dense matrix
    vectorizer = DictVectorizer()

    # Call fit_transform to learn the feature names and convert the data
    data = vectorizer.fit_transform([
        {"city": "Beijing", "temperature": 100},
        {"city": "Shanghai", "temperature": 60},
        {"city": "Shenzhen", "temperature": 30},
    ])

    # On scikit-learn 1.0 and later, use get_feature_names_out() instead
    print(vectorizer.get_feature_names())
    print(data)

    return None


if __name__ == '__main__':
    dictvec()

Results: # sparse matrix, printed as (row, column) value for each non-zero entry

    (0, 0)    1.0
    (0, 3)    100.0
    (1, 1)    1.0
    (1, 3)    60.0
    (2, 2)    1.0
    (2, 3)    30.0

    # With sparse=False, a dense matrix is obtained

    ['city=Beijing', 'city=Shanghai', 'city=Shenzhen', 'temperature']
    [[  1.   0.   0. 100.]
     [  0.   1.   0.  60.]
     [  0.   0.   1.  30.]]
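The relationship between the two output forms can be sketched in plain Python: each (row, column) value entry of the sparse form fills one cell of the dense matrix, and every other cell is zero. The helper name to_dense and the entries below are illustrative values for a 3 x 4 matrix, not output copied from scikit-learn.

```python
def to_dense(entries, n_rows, n_cols):
    """Build a dense matrix from {(row, col): value} non-zero entries."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for (row, col), value in entries.items():
        dense[row][col] = value
    return dense


# Non-zero entries in (row, column): value form
entries = {
    (0, 0): 1.0, (0, 3): 100.0,
    (1, 1): 1.0, (1, 3): 60.0,
    (2, 2): 1.0, (2, 3): 30.0,
}

dense = to_dense(entries, n_rows=3, n_cols=4)
for row in dense:
    print(row)
```

The sparse form stores only 6 of the 12 cells here; the saving grows as more categories, and therefore more zero columns, are added.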

 

NOTE: the purpose of feature extraction is to help the computer understand the data better.


Glossary: feature = a characteristic of the data; extraction = pulling values out of the data; Vectorizer = converts data into vectors; sparse = mostly zero entries; fit = to fit (adapt to) the data

 





Origin www.cnblogs.com/cwj2019/p/11715623.html