Discretizing continuous data with sklearn

Binarization

Set a threshold to split continuous data into two categories. For example, age can be split into greater than 30 and not greater than 30.

from sklearn.preprocessing import Binarizer as Ber
x = data_2.iloc[:,0].values.reshape(-1,1) # extract the data as a single column
trans = Ber(threshold = 30).fit_transform(x)
trans

Values of x greater than 30 are set to 1; all other values are set to 0.
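As a minimal, self-contained sketch of the same idea, the snippet below uses a small made-up array of ages instead of data_2 (the values are assumptions for illustration only). It shows that the comparison is strict, so an age of exactly 30 maps to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical ages, reshaped to one column because Binarizer expects 2-D input.
ages = np.array([18, 30, 31, 45, 29]).reshape(-1, 1)

# Values strictly greater than the threshold become 1, everything else becomes 0.
binarized = Binarizer(threshold=30).fit_transform(ages)

# Flatten back to a plain list for easy inspection.
flags = binarized.ravel().tolist()
print(flags)
```

Binarizer has nothing to learn from the data, so fit_transform here only applies the threshold.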

Label encoding

Sometimes the data needs to be binned, or different values need to be given different labels.

from sklearn.preprocessing import LabelEncoder as le
l = le()
l = l.fit(y) # learn the classes present in y
label = l.transform(y) # replace each class with its integer code

The classes_ attribute of l lists the classes that were found in the target:

l.classes_

array(['No', 'Unknown', 'Yes'], dtype=object)

label now holds the encoded data. The whole step can also be written directly in one line:

from sklearn.preprocessing import LabelEncoder
data.iloc[:,-1]=LabelEncoder().fit_transform(data.iloc[:,-1])
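To make the mapping concrete, here is a small sketch with a made-up target list (the values in y are assumptions, chosen to match the classes shown above). It also uses inverse_transform to recover the original labels from the integer codes.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target column with three string classes.
y = ["Yes", "No", "Unknown", "Yes", "No"]

encoder = LabelEncoder().fit(y)

# classes_ is sorted, and each class is encoded as its position in that order.
classes = encoder.classes_.tolist()
codes = encoder.transform(y).tolist()

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(codes).tolist()

print(classes)
print(codes)
print(decoded)
```

Because the codes follow the sorted order of the classes, they carry no meaning beyond identity, which is why LabelEncoder is intended for the target rather than for features.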

One-hot encoding

Some data is categorical and cannot be used in arithmetic, even when the categories have an order. For example: primary school, high school and university. If these are replaced by 1, 2 and 3, a model may treat 2 as 1 + 1, as if two primary schools added up to one high school. Such data therefore needs to be split into separate columns, one per category:

stu_id    primary school    high school    university
1234 1
1235 1
1236 1

This method is called one-hot encoding.

from sklearn.preprocessing import OneHotEncoder
enc=OneHotEncoder(categories='auto').fit(x)

Use get_feature_names_out() to view the names of the generated columns (scikit-learn versions before 1.0 used get_feature_names() instead):

enc.get_feature_names_out()

The transformed result is a sparse matrix; call its toarray() method to convert it to a dense array.

result=OneHotEncoder(categories='auto').fit_transform(x).toarray()

Finally, concatenate the result with the original data and select the columns you need again.

newdata=pd.concat([data, pd.DataFrame(result)],axis=1)
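The whole one-hot workflow can be sketched end to end as follows. The DataFrame, its column names and its values are made up for illustration, and dropping the original column after encoding is one possible way to do the final re-selection, not necessarily the author's.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: column names and values are assumptions for this sketch.
data = pd.DataFrame({
    "stu_id": [1234, 1235, 1236],
    "school": ["primary", "high", "university"],
})

enc = OneHotEncoder(categories="auto").fit(data[["school"]])

# transform returns a sparse matrix by default; toarray() makes it dense.
result = enc.transform(data[["school"]]).toarray()

# get_feature_names_out() exists in scikit-learn 1.0+; older releases used get_feature_names().
if hasattr(enc, "get_feature_names_out"):
    columns = enc.get_feature_names_out()
else:
    columns = enc.get_feature_names()

# Attach the encoded columns and drop the original categorical column.
encoded = pd.DataFrame(result, columns=columns, index=data.index)
newdata = pd.concat([data.drop(columns="school"), encoded], axis=1)
print(newdata)
```

Passing the column names and the original index to pd.DataFrame keeps the encoded columns readable and aligned with data when they are concatenated.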

Origin www.cnblogs.com/heenhui2016/p/10988059.html