Preprocessing: A Summary of Feature Encoding Methods

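(1) One-hot encoding:
One-hot encoding, also known as one-bit-effective encoding, uses an N-bit state register to encode N states. Each state has its own dedicated register bit, and at any given time exactly one of those bits is set.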
Suppose there are three features, which take the following values:
feature1=["male", "female"]
feature2=["from Europe", "from US", "from Asia"]
feature3=["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]
After one-hot encoding (codes listed in the same order as the values above):
feature1=[01,10]
feature2=[001,010,100]
feature3=[0001,0010,0100,1000]
So the sample ["male", "from Asia", "uses Chrome"] is encoded by concatenating the codes of its three values (01, 100 and 0010), which gives the 9-dimensional vector:
[0,1, 1,0,0, 0,0,1,0]
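The encoding above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the names `FEATURES`, `one_hot` and `encode_sample` are made up for illustration, and the bit order simply follows the convention used in the example (the first value in each list maps to the rightmost bit).

```python
# Category lists copied from the example; their order decides which bit each value owns.
FEATURES = [
    ["male", "female"],
    ["from Europe", "from US", "from Asia"],
    ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"],
]


def one_hot(value, categories):
    """Return the one-hot code of `value` as a list of 0/1 bits.

    The i-th category sets the i-th bit counted from the right,
    so the first category maps to 0...01, as in the example.
    """
    bits = [0] * len(categories)
    bits[len(categories) - 1 - categories.index(value)] = 1
    return bits


def encode_sample(sample, features=FEATURES):
    """Concatenate the one-hot codes of every feature value in `sample`."""
    encoded = []
    for value, categories in zip(sample, features):
        encoded.extend(one_hot(value, categories))
    return encoded


print(encode_sample(["male", "from Asia", "uses Chrome"]))
# prints [0, 1, 1, 0, 0, 0, 0, 1, 0]
```

Each sample always contains exactly as many 1s as there are features (three here), which is a quick sanity check on any one-hot encoded row.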

(2) DictVectorizer in sklearn

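from sklearn.feature_extraction import DictVectorizer

# Each sample is a dict: string values are one-hot encoded,
# numeric values are passed through unchanged.
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

vec = DictVectorizer()
# fit_transform returns a SciPy sparse matrix by default; toarray() makes it dense.
print(vec.fit_transform(measurements).toarray())
"""
Output:
[[ 1.  0.  0. 33.]
 [ 0.  1.  0. 12.]
 [ 0.  0.  1. 18.]]
"""
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from version 1.0 onward.
print(vec.get_feature_names_out())
"""
Output:
['city=Dubai' 'city=London' 'city=San Francisco' 'temperature']
"""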

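To make the column layout of that output easier to follow, here is a rough pure-Python sketch of the same idea. It is not scikit-learn's implementation, and `dict_vectorize` is a hypothetical helper name: string values become a `key=value` indicator column, numeric values keep a single column named after the key, and columns are sorted by name.

```python
def dict_vectorize(records):
    """Return (feature_names, rows) for a list of flat dicts.

    Hypothetical helper for illustration only; it handles just
    string and numeric values.
    """
    def column(key, value):
        # Strings get one indicator column per distinct value.
        return f"{key}={value}" if isinstance(value, str) else key

    names = sorted({column(k, v) for rec in records for k, v in rec.items()})
    index = {name: i for i, name in enumerate(names)}

    rows = []
    for rec in records:
        row = [0.0] * len(names)
        for k, v in rec.items():
            row[index[column(k, v)]] = 1.0 if isinstance(v, str) else float(v)
        rows.append(row)
    return names, rows


measurements = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
    {"city": "San Francisco", "temperature": 18.0},
]
names, rows = dict_vectorize(measurements)
print(names)
for row in rows:
    print(row)
```

The sketch builds dense lists for readability; with many distinct string values most entries are zero, which is why a sparse representation is the usual choice in practice.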
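(3) The Python machine learning library SKLearn: dataset transformations and feature extraction
(4) Large-scale feature encoding problems and engineering practice
(5) Feature extraction: feature dictionary vectorization and feature hashing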
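
Feature hashing, mentioned in item (5), avoids storing a feature dictionary by hashing each feature name directly to a column index. The following is only a minimal sketch of that idea under several assumptions: `hash_features` is a hypothetical helper, MD5 is used as a convenient deterministic stand-in hash, and the signed hashing that some implementations use to reduce collision bias is left out.

```python
import hashlib


def hash_features(record, n_features=8):
    """Map a flat dict to a fixed-length vector with the hashing trick.

    Hypothetical helper for illustration; different feature names may
    collide in the same bucket, in which case their values are summed.
    """
    vec = [0.0] * n_features
    for key, value in record.items():
        if isinstance(value, str):
            name, weight = f"{key}={value}", 1.0
        else:
            name, weight = key, float(value)
        # MD5 is an assumption made for determinism; Python's built-in
        # hash() is salted per process and would not be reproducible.
        digest = hashlib.md5(name.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "little") % n_features
        vec[bucket] += weight
    return vec


print(hash_features({"city": "Dubai", "temperature": 33.0}))
```

The output length depends only on `n_features`, not on how many distinct values appear in the data, at the cost of occasional collisions and of losing the mapping from columns back to feature names.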

Reposted from blog.csdn.net/j904538808/article/details/80731702