14. How feature engineering is done

  Feature engineering covers three parts: feature selection, feature expression, and feature preprocessing.

1. Feature selection

  Feature selection, also known as variable selection or attribute selection, automatically picks the attributes in the data most relevant to the target problem; it is the process of choosing a subset of relevant features when building a model.

  Feature selection is different from dimensionality reduction. Both reduce the number of features in the data set, but dimensionality reduction effectively recombines all features into new ones, while feature selection only keeps or discards certain features without changing the features themselves.
Common dimensionality-reduction methods are PCA, SVD, and Sammon mapping; feature selection instead simply discards features with little effect.

  Why do feature selection?

  With a limited number of samples, designing a classifier with a large number of features is computationally expensive and tends to classify poorly. Feature selection removes duplicate, redundant, and irrelevant features, further reducing the dimensionality, so that even with few samples a model built on the smallest possible feature subset can reach higher prediction accuracy. Removing redundant and irrelevant features also helps build simpler, more interpretable models.

  To sum up: 1) removing redundant and irrelevant features reduces the feature dimension, helps the model's predictive ability, and makes model building more efficient; 2) it gives a better understanding of how the data were generated and makes the model more interpretable.

  There are many feature selection methods, generally divided into three categories:

  (1) The first category, Filter (filtering) methods, is relatively simple: each feature is scored individually by a divergence or correlation metric, i.e. each feature dimension gets a weight representing its importance, the features are sorted by weight, and then a score threshold (or a target number of features) is set to pick the appropriate ones.

  Variance selection: first compute the variance of each feature, then keep only the features whose variance exceeds a threshold, i.e. drop features whose values barely change.
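  As a minimal sketch of variance filtering (assuming scikit-learn and a toy matrix X):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy feature matrix: the first column is constant and carries no information.
X = np.array([[0, 2.0, 0.1],
              [0, 1.0, 0.2],
              [0, 3.0, 0.1],
              [0, 2.5, 0.3]])

# Keep only features whose variance exceeds the threshold.
selector = VarianceThreshold(threshold=1e-3)
X_reduced = selector.fit_transform(X)
print(selector.variances_)  # per-feature variances
print(X_reduced.shape)      # (4, 2): the constant column was dropped
```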

  Chi-square test: the classic chi-square test measures the dependence between a categorical feature and a categorical target. Suppose the feature takes N values and the target takes M values; for every combination (feature = i, target = j) we compare the observed sample frequency with the expected frequency, build the chi-square statistic from the gaps, and use the observed result to judge whether the two are related. The statistic measures the difference between observed frequencies and theoretical frequencies.
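  A hedged sketch of chi-square filtering with scikit-learn's SelectKBest (the iris data and k=2 are only illustrative; chi2 requires non-negative feature values such as counts):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the k features with the largest chi-square statistic w.r.t. the label.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)
print(selector.scores_)  # chi-square score of each feature
print(X_new.shape)       # (150, 2)
```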

  My personal experience: when you have no better idea, reach first for the chi-square test, mutual information, or information gain to do the feature selection.

  (2) The second category, Wrapper (wrapper) methods, selects or excludes groups of features according to an objective function, usually a prediction-performance score.

  Choosing the subset is treated as a search-and-optimization problem: generate different feature combinations, evaluate each one, and compare it with the others. Since subset selection is an optimization problem, many optimization algorithms apply, especially heuristic ones. Typically we build the model on different subsets and score each subset by the model's prediction accuracy. The search can be random, such as random hill climbing, or heuristic, such as forward and backward stepwise selection.
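  One common wrapper is recursive feature elimination (RFE): refit a model on shrinking subsets and drop the weakest feature each round. A minimal sketch (the dataset, estimator, and target of 10 features are all illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scale so the logistic regression converges easily

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10, step=1)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the selected features
print(rfe.ranking_)  # 1 = selected; larger numbers were eliminated earlier
```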

  (3) The third category, Embedded (embedded) methods, is slightly more complex: train a machine-learning model, obtain the weight coefficient of each feature, and select features in descending order of those coefficients. This is similar to filtering, except that the quality of a feature is judged by training a model rather than directly from a statistical indicator of the feature itself.

  Common choices: regularized regression models, SVM, decision trees, and random forests.

  In a learned regression model, the more important a feature is, the larger its coefficient tends to be, while features that are nearly independent of the output get coefficients close to 0.

  Regularization: L1 regularization adds the L1 norm of the coefficient vector w to the loss function as a penalty term. Because the penalty pushes coefficients toward zero, the coefficients of weak features are forced to exactly zero. L1 regularization therefore tends to learn sparse models (many coefficients equal to 0), which makes it a good feature-selection method. Like non-regularized linear models, however, L1-regularized models can be unstable when the feature set contains correlated features: a subtle change in the data may produce a very different model.
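  A sketch of embedded selection via an L1-penalized model with SelectFromModel (the dataset and the C value are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Smaller C means stronger L1 regularization and therefore a sparser model.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model).fit(X, y)
print(selector.get_support().sum(), "features kept out of", X.shape[1])
```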

  L2 regularization adds the L2 norm of the coefficient vector to the loss function. Because the penalty is quadratic in the coefficients, L2 behaves quite differently from L1: most noticeably, L2 regularization spreads the coefficients out more evenly. For correlated features this means they receive similar coefficients. L2 regularization also yields a stable model for feature selection: unlike with L1, the coefficients do not swing with small fluctuations in the data. So L1 and L2 regularization provide different value; L2 is more useful for understanding the features, because a feature with strong predictive ability gets a non-zero, stable coefficient.
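  A small illustration of this difference on two nearly duplicated features (the data are synthetic and the alpha values arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
x1 = rng.randn(200)
x2 = x1 + 0.01 * rng.randn(200)        # nearly a copy of x1
y = 3 * x1 + 3 * x2 + rng.randn(200)
X = np.column_stack([x1, x2])

print("ridge:", Ridge(alpha=1.0).fit(X, y).coef_)  # roughly equal coefficients
print("lasso:", Lasso(alpha=0.5).fit(X, y).coef_)  # one coefficient driven to ~0
```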

  Random forests are a very popular feature-selection method: they are easy to use, generally need little feature engineering or parameter tuning, and many toolkits provide the mean decrease in impurity as an importance score. The method has two main problems: 1) important features may receive low scores when features are correlated (the correlation problem), and 2) it favors features with more categories or levels (the bias problem).
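  A minimal sketch of impurity-based importances from a random forest (dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by mean decrease in impurity; note that correlated features
# share this score, which is exactly the first caveat mentioned above.
top5 = forest.feature_importances_.argsort()[::-1][:5]
for rank, idx in enumerate(top5, start=1):
    print(rank, idx, forest.feature_importances_[idx])
```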

  GBDT, i.e. gradient boosted decision trees, can be used in the same way.

  In Kaggle-style competitions, apart from ensemble learning algorithms, the main thing winning teams work on is high-level (derived) features. Looking for high-level features is therefore a necessary step of model optimization. Of course, we do not need high-level features for the very first model; once we have a baseline model we can then look for high-level features to optimize it.

    The most common ways to construct high-level features are (a minimal sketch follows the list):

    The sum of several features: suppose you want weekly sales based on daily sales; you can obtain it by adding up the most recent seven days of sales.
    The difference of several features: suppose you already have weekly sales and monthly sales; you can derive the sales of the month excluding that week.
    The product of several features: suppose you have the unit price and the number of units sold; their product gives the sales revenue.
    The quotient of several features: suppose you have per-user sales revenue and the number of items each user purchased; dividing them gives the average price per item for that user.
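  A minimal pandas sketch of the four combinations above; the table and column names ("daily_sales", "weekly_sales", and so on) are purely hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "daily_sales":   [10, 12, 9, 11, 13, 8, 10, 12, 11, 9],
    "weekly_sales":  [70, 72, 69, 71, 73, 68, 70, 72, 71, 69],
    "monthly_sales": [300, 310, 295, 305, 315, 290, 300, 310, 305, 295],
    "price":         [2.5] * 10,
    "units_sold":    [4, 5, 4, 4, 5, 3, 4, 5, 4, 4],
    "user_sales":    [100, 120, 90, 110, 130, 80, 100, 120, 110, 90],
    "user_items":    [10, 12, 9, 10, 13, 8, 10, 11, 10, 9],
})

df["sales_last_7_days"]  = df["daily_sales"].rolling(7).sum()        # sum of features
df["month_excl_week"]    = df["monthly_sales"] - df["weekly_sales"]  # difference
df["revenue"]            = df["price"] * df["units_sold"]            # product
df["avg_spend_per_item"] = df["user_sales"] / df["user_items"]       # quotient
```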

  Of course, there are far more ways to build high-level features than these; they should come from your data and your business needs, not from blindly combining features pairwise, which easily causes a feature explosion without producing a better model. My personal experience: keep high-level features to a minimum when clustering, and use a moderately larger number for classification and regression.

2. Feature expression

  Feature expression is about how to process features whose values take particular forms: handling missing values; handling special features such as times and locations; turning discrete features into continuous ones; and discretizing continuous features.

3. Feature preprocessing

  (1) Standardization and normalization:

  First, why do some machine learning models need the data to be normalized?

  1) Normalization speeds up gradient descent's convergence to the optimal solution; 2) normalization may improve accuracy.

  Common methods include:

  z-score standardization: this is the most common feature preprocessing; essentially all linear models standardize with z-scores before fitting. The method: compute the mean and standard deviation std of the sample feature x, then replace the original feature with (x - mean) / std. The feature then has zero mean and unit variance.

  max-min normalization: also called min-max or deviation normalization, it maps the preprocessed feature values into [0, 1]. The method: find the maximum max and minimum min of the sample feature x, then replace the original feature with (x - min) / (max - min). If we want to map the data into an arbitrary range [a, b] instead of [0, 1], that is just as simple: replace the original feature with (x - min) * (b - a) / (max - min) + a.
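  A minimal sketch of both scalings with scikit-learn; the formulas match (x - mean) / std and (x - min) / (max - min) above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [4.0, 500.0]])

X_z  = StandardScaler().fit_transform(X)                      # zero mean, unit variance
X_01 = MinMaxScaler().fit_transform(X)                        # mapped to [0, 1]
X_ab = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)   # mapped to [a, b] = [-1, 1]
```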

  We also often use centering, mainly before PCA dimensionality reduction: compute the mean of feature x and replace the original feature with x - mean, so the mean of the feature becomes 0 while its variance is unchanged. This makes sense because PCA relies on the variances of the features; if we applied full z-score standardization, every feature would have variance 1 and PCA would lose the variance information it needs to reduce the dimension.

  Although most machine learning models need standardization or normalization, many models can do without it, mainly models based on probability distributions, such as the CART family of decision trees and random forests. Of course, standardizing anyway is still fine in those cases, and in most situations the model's generalization ability does not suffer and may even improve.

  (2) Cleaning abnormal (outlier) samples.

  (3) Handling imbalanced data: generally two approaches, weighting or sampling (a sketch follows).
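  A hedged sketch of both options on a synthetic dataset with roughly 5% positive labels (the data and model are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (rng.rand(1000) < 0.05).astype(int)  # roughly 5% positive labels

# Option 1 - weighting: make mistakes on the rare class cost more.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2 - sampling: oversample the minority class (undersampling the
# majority class works in the same spirit).
X_pos, X_neg = X[y == 1], X[y == 0]
X_pos_up = resample(X_pos, replace=True, n_samples=len(X_neg), random_state=0)
X_bal = np.vstack([X_neg, X_pos_up])
y_bal = np.array([0] * len(X_neg) + [1] * len(X_pos_up))
```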

 

1. In data cleaning, there are two families of methods for handling missing values:
Deletion: 1) delete the observation (sample);
        2) delete the variable: when a variable has many missing values and matters little to the research goal, the whole variable can be dropped;
        3) analyze only the complete original data: when the current data have many gaps but the original data are complete, the original data can be used for the analysis instead;
        4) reweight: when deleting records with missing values would change the data structure, weighting the complete records differently can reduce the bias introduced by the deletion.
Imputation: mean imputation, regression imputation, sampling-based imputation, etc.
  Pairwise deletion and reweighting form one group;
  estimation and imputation form another.
2. The conventional processing methods are: estimation, casewise deletion, variable deletion, and pairwise deletion.
    Because of errors in surveying, coding, and data entry, the data may contain invalid and missing values that need appropriate treatment.
   Estimation. The simplest approach is to replace an invalid or missing value with the sample mean, median, or mode of that variable. This is simple but does not make full use of the information already in the data, so the error may be larger. Another approach is to estimate the missing value from the answers to other questions, via correlation analysis between variables or logical inference. For example, ownership of a certain product may be related to household income, so the probability of owning the product can be predicted from the household income reported in the survey.
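  A minimal sketch of the simplest estimation approach, filling each column's missing entries with its mean (SimpleImputer also accepts "median" and "most_frequent"):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))  # NaNs replaced by the per-column mean
```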
   Casewise deletion removes any sample that contains missing values. Since many questionnaires may contain missing values, this can greatly shrink the effective sample and waste data that have already been collected. It is therefore suitable only when the missing values are in key variables, or when the proportion of samples containing invalid or missing values is small.
   Variable deletion. If a variable has many invalid or missing values and is not particularly important to the research question, consider deleting it. This reduces the number of variables available for analysis but does not change the sample size.
   Pairwise deletion uses a special code (usually 9, 99, 999, etc.) to mark invalid or missing values while keeping all variables and samples in the dataset. Each specific calculation then uses only the samples with complete answers, so the effective sample size differs depending on the variables involved. This is a conservative approach that retains as much of the information in the dataset as possible.
  Different treatments can lead to different analysis results, especially when the missing values do not occur at random and the variables are clearly correlated. Therefore the survey itself should try to avoid invalid and missing values so as to ensure data integrity.
 