Binarization
Binarization sets a threshold and converts continuous data into two categories. For example, age can be split into greater than 30 and not greater than 30.
from sklearn.preprocessing import Binarizer as Ber
x = data_2.iloc[:,0].values.reshape(-1,1) # extract the data as a 2-D column
trans = Ber(threshold = 30).fit_transform(x)
trans
Values of x greater than 30 are set to 1, and all other values are set to 0.
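The thresholding rule can be checked on a small array. The following is a minimal sketch, assuming a handful of made-up ages in place of the first column of data_2; the variable names ages and binarized are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical ages standing in for the first column of data_2.
ages = np.array([18, 30, 31, 45, 29], dtype=float).reshape(-1, 1)

# Values strictly greater than the threshold become 1, the rest become 0.
binarized = Binarizer(threshold=30).fit_transform(ages)
print(binarized.ravel())
```

Note that a value exactly equal to the threshold (30 here) is mapped to 0, because the comparison is strict.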
Label encoding
Sometimes the data needs to be binned, or different values need to be given different labels.
from sklearn.preprocessing import LabelEncoder as le
l = le()
l = l.fit(y)
label = l.transform(y)
The fitted encoder l has a classes_ attribute that shows which classes the label contains, and therefore how many there are.
l.classes_
array(['No', 'Unknown', 'Yes'], dtype=object)
label now holds the encoded data. The whole process can also be written directly in one line:
from sklearn.preprocessing import LabelEncoder
data.iloc[:,-1]=LabelEncoder().fit_transform(data.iloc[:,-1])
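To see how the integer codes relate to classes_, here is a small self-contained sketch. The label array y is made up for illustration (its values mirror the classes shown above), and the names encoder, codes and restored are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical label column; the values mirror the classes shown above.
y = np.array(["Yes", "No", "Unknown", "No", "Yes"])

encoder = LabelEncoder().fit(y)
codes = encoder.transform(y)

# classes_ is sorted, so each integer code is an index into classes_.
print(encoder.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original labels.
restored = encoder.inverse_transform(codes)
print(restored)
```

Because the codes are just positions in the sorted class list, they carry no meaningful order or distance, which is one reason the next section uses one-hot encoding for features.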
One-hot encoding
Some data is ordered but cannot be used in arithmetic, for example primary school, secondary school and university. If these are replaced by 1, 2 and 3, a calculation may treat 2 as 1 + 1, as if two primary schools added up to one secondary school. Such data therefore needs to be split into separate columns, one per category:
stu_id | Primary school | Secondary school | University |
---|---|---|---|
1234 | 1 | | |
1235 | 1 | | |
1236 | 1 | | |
This method is called one-hot encoding.
from sklearn.preprocessing import OneHotEncoder
enc=OneHotEncoder(categories='auto').fit(x)
Use get_feature_names_out() to view the generated column names (in scikit-learn versions before 1.0 the method is get_feature_names()):
enc.get_feature_names_out()
The transformed result is a sparse matrix, so call its toarray() method to convert it to a dense array.
result=OneHotEncoder(categories='auto').fit_transform(x).toarray()
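The sparse-to-dense step can be tried on a toy column. This is only a sketch: the education values are invented, and it reads the learned categories from the categories_ attribute rather than from a feature-name method, since that attribute is available across scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical education levels, one categorical feature per row.
x = np.array(["primary", "secondary", "university", "secondary"]).reshape(-1, 1)

enc = OneHotEncoder(categories="auto").fit(x)
sparse_result = enc.transform(x)   # SciPy sparse matrix by default
result = sparse_result.toarray()   # dense NumPy array

# One column per learned category, in sorted order.
print(enc.categories_[0])
print(result)
```

Each row contains exactly one 1, in the column of the category that row belongs to.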
Finally, concatenate the result with the original data, then select again the columns that are needed.
newdata=pd.concat([data, pd.DataFrame(result)],axis=1)
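One way to make the concatenated frame easier to read is to name the new columns after the learned categories and drop the column that was encoded. The sketch below assumes a made-up frame with a hypothetical education column; the names data, onehot and newdata are for illustration and are not the original dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame; "education" stands in for the categorical column.
data = pd.DataFrame({
    "stu_id": [1234, 1235, 1236],
    "education": ["primary", "secondary", "university"],
})

enc = OneHotEncoder(categories="auto").fit(data[["education"]])
result = enc.transform(data[["education"]]).toarray()

# Name the new columns after the learned categories and reuse the
# original index so that concat aligns the rows correctly.
onehot = pd.DataFrame(result, columns=enc.categories_[0], index=data.index)

# Join side by side, then drop the column that has been encoded.
newdata = pd.concat([data, onehot], axis=1).drop(columns="education")
print(newdata)
```

Passing index=data.index matters when the original frame does not have a default 0..n-1 index, because pd.concat with axis=1 aligns rows by index label.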