Feature engineering notes

Selecting features whose variance exceeds a threshold: VarianceThreshold
Suppose a feature takes only the values 0 and 1, and in 95% of all input samples its value is 1; such a feature has little effect. If the value is 1 in 100% of the samples, the feature carries no information at all. This method only applies when the feature values are discrete; continuous variables must be discretized before it can be used. Moreover, in practice it is rare for more than 95% of samples to share a single value of a feature, so the method is simple but not very practical. It works best as a pre-selection step: first remove the features whose values barely vary, then choose a suitable method from the feature selection methods mentioned below for further selection. (A sketch of this 0/1 case follows the example below.)
So how do we judge whether a feature's values discriminate well enough to matter in the analysis? VarianceThreshold does exactly this for us; it reduces the complexity of the analysis and makes the results more effective.
 
For example:
from sklearn.feature_selection import VarianceThreshold

# Keep only the columns whose variance exceeds 1 (Python 3 print syntax)
print(VarianceThreshold(threshold=1).fit_transform([[1, 2, 3, 5], [1, 4, 3, 6], [1, 8, 2, 9]]))
Here threshold is the cutoff: of the four feature columns in the data, those whose variance does not exceed 1 are removed, and only the columns with variance greater than 1 remain (here the second and fourth columns).
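To make the 95% example above concrete, here is a minimal sketch (the data is a toy construction of mine, following the pattern in the scikit-learn documentation): a 0/1 feature follows a Bernoulli distribution, so its variance is p(1 - p), and a column that is 1 in more than 95% of samples falls below the threshold .95 * (1 - .95).

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: a 0/1 feature is Bernoulli-distributed, so its variance is
# p * (1 - p). A column that is 1 in 96% of samples has variance
# 0.96 * 0.04 = 0.0384, below the 95% cutoff of 0.95 * 0.05 = 0.0475.
X = np.column_stack([
    np.array([1] * 96 + [0] * 4),  # nearly constant: variance 0.0384
    np.tile([0, 1], 50),           # balanced 0/1:    variance 0.25
])

# Drop any feature that is (almost) constant in more than 95% of samples.
selector = VarianceThreshold(threshold=.95 * (1 - .95))
print(selector.fit_transform(X).shape)  # (100, 1): only the balanced column survives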
 
 
Dummy coding of qualitative features: OneHotEncoder
Some machine learning algorithms and models only accept quantitative features as input, so qualitative features must be converted into quantitative ones. The simplest approach is to assign a number to each qualitative value, but this is too arbitrary and increases the parameter tuning workload. Dummy coding is normally used instead to turn a qualitative feature into quantitative ones: if the feature has N possible values, it is expanded into N features; when the original value is the i-th one, the i-th expanded feature is set to 1 and all the other expanded features are set to 0. Compared with directly assigning numbers, dummy coding does not add tuning work, and for a linear model it lets the model capture a nonlinear effect (a sketch of this follows the worked example below).
If a qualitative feature has very many possible values, it is called a high-cardinality categorical feature; conversely, with few values it is low-cardinality. Dummy coding does not work well for high-cardinality features, since it produces one new feature per category.
from sklearn.preprocessing import OneHotEncoder

# n_values=[2, 3, 4] was the pre-0.20 scikit-learn way to state how many
# categories each column has; current versions take explicit category lists
# (or infer them from the data during fit).
enc = OneHotEncoder(categories=[[0, 1], [0, 1, 2], [0, 1, 2, 3]])
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
ans = enc.transform([[0, 1, 3]]).toarray()
print(ans)
Output: [[1. 0. 0. 1. 0. 0. 0. 0. 1.]]
The first feature is the first column [0, 1, 0, 1]. It takes two values, 0 and 1, so one-hot encoding represents it with two positions: [1, 0] means 0 and [0, 1] means 1. The first two entries of the output, [1. 0. ...], say this feature is 0.
The second feature is the second column [0, 1, 2, 0]. It takes three values, so three positions are used: [1, 0, 0] means 0, [0, 1, 0] means 1, and [0, 0, 1] means 2. The third through fifth entries of the output, [... 0. 1. 0. ...], say this feature is 1.
The third feature is the third column [3, 0, 1, 2]. It takes four values, so four positions are used: [1, 0, 0, 0] means 0, [0, 1, 0, 0] means 1, [0, 0, 1, 0] means 2, and [0, 0, 0, 1] means 3. The last four entries of the output, [... 0. 0. 0. 1.], say this feature is 3.
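The claim above that dummy coding gives a linear model nonlinear power can be checked directly. Below is a minimal sketch with made-up data (the values and variable names are my own, not from the original post): a linear regression on the raw category code can only fit a straight line, while the same model on the dummy-coded feature recovers each category's mean exactly.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Made-up data: one qualitative feature coded 0/1/2 whose target means
# are non-monotonic (1.0, 5.0, 2.0), repeated twice.
X = np.array([[0], [1], [2], [0], [1], [2]])
y = np.array([1.0, 5.0, 2.0, 1.0, 5.0, 2.0])

# Treating the code as a quantity forces a straight line through the
# categories, which cannot follow the 1 -> 5 -> 2 pattern.
raw = LinearRegression().fit(X, y)
print(raw.predict([[0], [1], [2]]))  # approx [2.17 2.67 3.17], a poor fit

# After dummy coding, each category gets its own coefficient, so the same
# linear model reproduces every category mean.
enc = OneHotEncoder()
X_dummy = enc.fit_transform(X).toarray()
dummy = LinearRegression().fit(X_dummy, y)
print(dummy.predict(enc.transform([[0], [1], [2]]).toarray()))  # approx [1. 5. 2.]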
 
 
