Data Mining Learning | Task 3: Feature Engineering

  1. Feature Engineering Overview
    There is a well-known saying in the field: "Data and features determine the upper limit of data mining, and models and algorithms merely approximate that limit." The "data" here refers to the features obtained after processing the raw project data. So what is feature engineering? Feature engineering is the process of transforming raw data into the training data a model needs; its goal is to obtain better training features, so that the learned model can get closer to that upper limit. Clearly, feature engineering plays a vital role in data mining: good features can raise a model's performance, and sometimes even a simple model trained on them achieves strong results.
  2. What Feature Engineering Covers
  • Outlier handling: use box-plot (or 3-sigma) analysis to detect and remove outliers; apply a Box-Cox transformation to correct skewed distributions; truncate long tails. A short sketch follows below.
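A minimal sketch of the two techniques above, using pandas and SciPy (the `price` column is a made-up example):

```python
import numpy as np
import pandas as pd
from scipy import stats

def remove_outliers_iqr(df, col, k=1.5):
    """Drop rows whose value falls outside the box-plot whiskers [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    return df[(df[col] >= q1 - k * iqr) & (df[col] <= q3 + k * iqr)]

# toy right-skewed feature; Box-Cox requires strictly positive values
df = pd.DataFrame({"price": np.random.lognormal(mean=3, sigma=1, size=1000)})
df = remove_outliers_iqr(df, "price")
df["price_boxcox"], lam = stats.boxcox(df["price"])
print(f"fitted Box-Cox lambda: {lam:.3f}")
```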
  • Normalization / standardization:
    z-score standardization (transforms toward a standard normal distribution): z = (x − μ) / σ; min-max normalization (scales to [0, 1]): x' = (x − min) / (max − min); for a power-law distribution, the transform x' = log((1 + x) / (1 + median(x))) may be used. A sketch of all three follows below.
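A minimal sketch of the three scalings on a synthetic heavy-tailed series (the Pareto sample is a made-up stand-in for a power-law feature):

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.pareto(a=2.0, size=1000) + 1)  # toy heavy-tailed feature

z_score   = (x - x.mean()) / x.std()                    # standard-normal-like scaling
min_max   = (x - x.min()) / (x.max() - x.min())         # scaled into [0, 1]
power_law = np.log((1 + x) / (1 + x.median()))          # log-ratio transform for power-law data
```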
  • Data binning: equal-frequency binning; equal-width (equidistant) binning; Best-KS binning (similar to binary splitting with the Gini index); chi-square binning. See the sketch below.
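A minimal sketch of equal-width and equal-frequency binning with pandas (the `power` values are synthetic; Best-KS and chi-square binning need dedicated implementations and are not shown):

```python
import numpy as np
import pandas as pd

power = pd.Series(np.random.exponential(scale=100, size=1000))  # toy numeric feature

equal_width = pd.cut(power, bins=10)    # equidistant bins over the value range
equal_freq  = pd.qcut(power, q=10)      # bins holding roughly equal row counts
custom      = pd.cut(power, bins=[0, 50, 100, 200, np.inf],
                     labels=["low", "mid", "high", "very_high"])  # hand-picked cut points
```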
  • Missing values: leave them untouched (tree models such as XGBoost handle missing values natively); delete the rows or columns (when too much data is missing); impute them, e.g. with the mean / median / mode, a predictive model, multiple imputation, compressed sensing, or matrix completion; or bin the data and give missing values their own bin. A short sketch follows.
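A minimal sketch of the two most common imputations (column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"mileage": [1.2, np.nan, 3.4, np.nan, 5.0],
                   "fuel":    ["gas", "diesel", None, "gas", "gas"]})

df["mileage"] = df["mileage"].fillna(df["mileage"].median())  # numeric: median imputation
df["fuel"]    = df["fuel"].fillna(df["fuel"].mode()[0])       # categorical: mode imputation
# A tree model such as XGBoost could instead be fed the NaNs directly, with no imputation.
```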
  • Feature construction: build statistical features such as counts, sums, ratios, and standard deviations; time features, including relative and absolute time, holidays, and weekends; geographic features, including binning and distribution encoding; nonlinear transforms such as log / square / square root; and feature combinations (feature crosses). See the sketch below.
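A minimal sketch of a few constructed features (all column names here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"reg_date":  pd.to_datetime(["2014-03-01", "2016-07-15"]),
                   "sale_date": pd.to_datetime(["2016-04-04", "2016-08-06"]),
                   "power":     [75, 150],
                   "brand":     ["A", "B"]})

df["used_days"]  = (df["sale_date"] - df["reg_date"]).dt.days       # relative time
df["reg_month"]  = df["reg_date"].dt.month                          # absolute time
df["is_weekend"] = (df["sale_date"].dt.dayofweek >= 5).astype(int)  # calendar flag

# a simple feature cross: brand x binned power
power_bin = pd.cut(df["power"], bins=[0, 100, 300], labels=["low", "high"])
df["brand_power"] = df["brand"] + "_" + power_bin.astype(str)
```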
  • Feature selection
    1. Filter methods: select features on the data first, then train the learner; common methods include Relief, variance thresholding, correlation coefficients, the chi-square test, and mutual information;
    2. Wrapper methods: use the performance of the final learner directly as the evaluation criterion for a feature subset; a common method is LVW (Las Vegas Wrapper);
    3. Embedded methods: combine the filter and wrapper ideas, performing feature selection automatically as part of learner training; a common example is Lasso regression. A combined sketch follows below.
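A minimal sketch of all three flavors with scikit-learn. LVW itself is not in scikit-learn, so recursive feature elimination (RFE) stands in as the wrapper-style example:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_regression, RFE
from sklearn.linear_model import Lasso, LinearRegression

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

# filter: drop constant features, then keep the 5 most correlated with the target
X_filter = VarianceThreshold(threshold=0.0).fit_transform(X)
X_filter = SelectKBest(f_regression, k=5).fit_transform(X_filter, y)

# wrapper-style: recursively drop features based on a learner's fitted coefficients
X_wrap = RFE(LinearRegression(), n_features_to_select=5).fit_transform(X, y)

# embedded: Lasso zeroes out the coefficients of uninformative features during training
lasso = Lasso(alpha=1.0).fit(X, y)
print("features kept by Lasso:", np.flatnonzero(lasso.coef_))
```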
  • Dimensionality reduction: PCA / LDA / ICA; feature selection is itself a form of dimensionality reduction. A sketch follows below.
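A minimal sketch of the three reducers on the iris dataset (chosen only because it ships with scikit-learn):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, FastICA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)                            # unsupervised, max variance
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised, max class separation
X_ica = FastICA(n_components=2, random_state=0).fit_transform(X)        # statistically independent components
```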