Data Mining from Scratch: Feature Engineering for Used Car Transaction Price Prediction

0 Introduction

  What is feature engineering? As the name suggests, it is a series of engineering steps applied to the raw data, **refining the raw data into features that serve as inputs to algorithms and models.** In essence, feature engineering is a process of representing and presenting data; in practical work, its purpose is to remove impurities and redundancy from the raw data and to design features that capture the relationship between the problem to be solved and the prediction model more efficiently.
The importance of feature engineering shows in the following ways:

  1. **Better features mean greater flexibility.** The flexibility of good features is that they let you choose simple models that also run faster and are easier to maintain.
  2. **Better features mean simpler models.** With good features, a model can perform well even with less-than-optimal hyperparameters, which reduces the workload and time of parameter tuning and can greatly reduce model complexity.
  3. **Better features mean better model performance.** The purpose of feature engineering is, after all, to improve model performance.

Feature engineering work is generally considered to consist of three parts:

  • Feature Extraction
  • Feature Selection
  • Feature Construction

  In this article, data preprocessing and feature handling are both treated as part of feature engineering.
  In fact, feature engineering is a skill that takes considerable time to master; reading the theory alone does not take you deep enough, and only by applying it in real projects or competitions do you gain a deeper understanding.

1 Data Preprocessing

First, data preprocessing is needed. There are two commonly encountered types of data:

  1. Structured data. Structured data can be viewed as a table in a relational database: each column has a clear definition and is of one of two basic types, numerical or categorical; each row of data represents the information of one sample (see the small example after this list).

  2. Unstructured data. This is mainly text, image, audio, and video data; the information it contains cannot be represented by a simple numerical value, there is no clear category definition, and each piece of data differs in size.
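
As a small illustration, structured data maps naturally onto a table; the columns below are hypothetical, loosely modeled on a used-car dataset:

```python
import pandas as pd

# A tiny structured table: each row is one used-car sample, each column has
# a clear definition and is either numeric or categorical.
df = pd.DataFrame({
    "brand": ["audi", "bmw", "toyota"],   # categorical
    "power": [150.0, 190.0, 110.0],       # numeric
    "price": [25000, 31000, 9000],        # numeric target
})
print(df.dtypes)
```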

1.1 Handling Missing Values

Missing data includes both missing records and missing field values within records; both can cause inaccurate results.

Causes of missing values:

  • The information cannot be obtained, or obtaining it would be too costly.
  • The information was omitted, either through human input error or through failure of the data acquisition device.
  • The attribute does not exist. In some cases a missing value is not an error: for some objects certain attribute values simply do not exist, such as the spouse's name of an unmarried person, or the fixed income of a child.

Impact of missing values:

  • Data mining modeling loses a large amount of useful information.
  • The model exhibits much greater uncertainty, and the patterns it contains become harder to grasp.
  • Null values entering the modeling process cause confusion and lead to unreliable output.

Methods for handling missing values:

  • Use the feature containing missing values directly: worth trying when only a small fraction of samples are missing this feature;
  • Remove the feature containing missing values: generally applicable when most samples lack this feature and only a few valid values remain;
  • Fill in missing values by imputation
    This third practice is the most commonly used, and it in turn offers a variety of completion methods (a code sketch follows this list).
    • Mean / median / mode completion
      If distances between samples on this attribute are measurable, use the mean of the attribute's valid values to fill in; if the distances are not measurable, use the mode or the median instead.

    • Within-class mean / median / mode completion
      Classify the samples, then fill a sample's missing value with the mean of that attribute over the other samples in the same class; as with the previous method, if the mean is not applicable, try the mode or the median instead.

    • Fixed-value completion
      Fill the missing values of the attribute with a fixed value.

    • Model-based prediction
      Use a machine learning method to predict the missing attribute as the target. Specifically, split the samples into a training set and a test set according to whether the attribute is present, train a model such as a regression or a decision tree, and then use the trained model to predict the attribute values of the test-set samples.
      This approach has a fundamental flaw: if the other attributes are uncorrelated with the missing attribute, the predictions are meaningless; yet if the predictions are highly accurate, the attribute arguably need not be in the dataset at all. The usual situation lies somewhere in between.

    • High-dimensional mapping
      Map the attribute to a high-dimensional space using one-hot encoding. An attribute whose value range contains K discrete values is expanded into K + 1 attribute values; if the value is missing, the (K + 1)-th expanded attribute is set to 1.
      This is the most precise approach: it retains all of the information and adds none. However, if every variable is pretreated this way, the dimensionality of the data grows greatly. The benefit is that all information in the raw data, including the missingness itself, is preserved intact; the drawbacks are that the computational cost rises sharply, and it only works well when the sample size is very large.

    • Multiple imputation
      Multiple imputation treats the value to be imputed as random: in practice it estimates the value to be imputed, adds different noise to form several sets of candidate imputed values, and then selects the most appropriate one according to some criterion.

    • Compressed sensing and matrix completion
      Compressed sensing exploits the inherent sparsity of a signal to recover the original signal from partial observations. It is divided into two stages: perception measurement and reconstruction recovery.

      • Perception measurement: this stage processes the original signal to obtain a sparse representation of the samples. Common means include the Fourier transform, the wavelet transform, dictionary learning, and sparse coding.
      • Reconstruction recovery: this stage recovers the original signal from a small number of observations based on its sparsity. This is the core of compressed sensing.
        For matrix completion, see this Zhihu question: https://www.zhihu.com/question/47716840
    • Manual completion
      Apart from manual completion, the other imputation methods fill in unknown values with our subjective estimates, which may not fully match the objective facts. In many cases, manually imputing missing values based on your understanding of the domain works better. However, this method demands deep knowledge of the problem domain, and if many values are missing it is quite time-consuming.

    • Nearest-neighbor completion
      Find the sample closest to the one with the missing value, and use that sample's attribute value to fill in.
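
A minimal sketch of several of the completion methods above, using pandas and scikit-learn; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy data with missing values; column names are hypothetical.
df = pd.DataFrame({
    "power": [150.0, np.nan, 190.0, 75.0],
    "mileage": [12.5, 8.0, np.nan, 20.3],
    "gearbox": ["manual", np.nan, "auto", "auto"],
})

# High-dimensional mapping: one-hot encode the raw categorical column,
# with an extra indicator column for "missing" (the K+1-th value).
onehot = pd.get_dummies(df["gearbox"], dummy_na=True, prefix="gearbox")

# Median completion for numeric columns (mean works the same way).
num_cols = ["power", "mileage"]
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])

# Mode (most frequent value) completion for the categorical column.
df[["gearbox"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["gearbox"]])

# Nearest-neighbor completion, shown on a fresh copy of the numeric data:
# each missing entry is filled in from its 2 closest samples.
raw = pd.DataFrame({"power": [150.0, np.nan, 190.0, 75.0],
                    "mileage": [12.5, 8.0, np.nan, 20.3]})
knn_filled = KNNImputer(n_neighbors=2).fit_transform(raw)

print(df, onehot, knn_filled, sep="\n")
```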

1.2 Handling Outliers

  Outlier analysis checks the data for entry errors and abnormal values. Ignoring the existence of outliers is very dangerous: letting abnormal values enter the analysis without excluding them has a negative effect on the results. An outlier is an individual sample value that deviates significantly from the rest of the observations. Outliers are also called abnormal values, and outlier analysis is also called anomaly analysis.

  • Detect and remove outliers via boxplot (or the 3σ rule)
    The 3σ rule rests on one precondition: the data must be normally distributed. Under the 3σ rule, a value more than three standard deviations from the mean can be treated as an outlier. The probability of falling within ±3σ is 99.7%, so the probability of a value lying more than 3σ from the mean, P(|x − μ| > 3σ) ≤ 0.003, makes such values extremely rare small-probability events. If the data do not follow a normal distribution, outliers can still be described by how many standard deviations they lie from the mean.
  • Box-Cox transformation (handles skewed distributions)
  • Long-tail truncation
    For the concrete treatment, see the feature engineering steps for this dataset; a code sketch of these rules follows this list.
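
A minimal sketch of the 3σ rule, the boxplot (IQR) rule, a Box-Cox transform, and long-tail truncation, assuming a one-dimensional, strictly positive numeric array:

```python
import numpy as np
from scipy import stats

x = np.array([10.0, 12.0, 11.5, 9.8, 10.3, 55.0, 11.1])  # 55.0 looks suspicious

# 3-sigma rule: flag values more than 3 standard deviations from the mean
# (only well-founded if the data are roughly normally distributed).
mu, sigma = x.mean(), x.std()
sigma_outliers = np.abs(x - mu) > 3 * sigma

# Boxplot (IQR) rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
box_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Box-Cox transform to reduce skew (requires strictly positive values).
x_bc, lam = stats.boxcox(x)

# Long-tail truncation: clip to the 1st and 99th percentiles.
x_clipped = np.clip(x, *np.percentile(x, [1, 99]))

print(sigma_outliers, box_outliers, lam, x_clipped, sep="\n")
```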

2 Feature Selection

Definition: the process of selecting a subset of relevant features from a given feature set is called feature selection.

  • For a learning task, given a set of attributes, some attributes may be critical to learning while others may not mean much.
    • Attributes or features useful for the current learning task are called relevant features;
    • Attributes or features useless for the current learning task are called irrelevant features.
  • Feature selection may reduce the predictive power of the model, because the removed features may still contain valid information, and discarding that information lowers performance to some extent. This is a trade-off between computational complexity and model performance:
    • Keeping as many features as possible improves model performance, but makes the model more complex and raises the computational cost;
    • Removing as many features as possible lowers model performance, but makes the model simpler and reduces the computational cost.
  • Common feature selection methods fall into three categories (a code sketch follows this list):
    • Filter: select features on the data first, then train the learner; common methods include Relief, variance selection, correlation coefficients, the chi-square test, and mutual information;
    • Wrapper: use the performance of the final learner directly as the evaluation criterion for a feature subset; a common method is LVW (Las Vegas Wrapper);
    • Embedded: a combination of filter and wrapper, in which feature selection is performed automatically during training; a common example is Lasso regression.
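
A minimal sketch of the three categories on synthetic data; scikit-learn does not ship LVW, so RFE stands in here for the wrapper idea:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import (RFE, SelectKBest, VarianceThreshold,
                                       mutual_info_regression)
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic regression data: 20 features, only 5 of them informative.
X, y = make_regression(n_samples=200, n_features=20,
                       n_informative=5, random_state=0)

# Filter: drop zero-variance features, then keep the top 5 by mutual information.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)
X_filter = SelectKBest(mutual_info_regression, k=5).fit_transform(X_var, y)

# Wrapper-style: recursive feature elimination driven by a learner's fit.
rfe = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)

# Embedded: Lasso zeroes out coefficients of irrelevant features while training.
lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)

print(X_filter.shape, rfe.support_.nonzero()[0], kept)
```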

3 Feature Extraction

Feature extraction generally comes before feature selection. Its input is the raw data, and its purpose is to automatically construct new features, converting the raw data into a set of features with clear physical meaning (for example Gabor, geometric, or texture features) or statistical meaning.
Commonly used methods include dimensionality reduction (PCA, ICA, LDA, etc.), SIFT, Gabor, and HOG in the image domain, and bag-of-words and word-embedding models in the text domain.
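
A minimal dimensionality-reduction sketch with PCA (one of the methods named above) on a built-in image dataset, keeping enough components to explain 95% of the variance:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 64-dimensional image features

# A float n_components selects the number of principal components
# needed to explain that fraction of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("explained variance ratio:", pca.explained_variance_ratio_.sum())
```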

4 Feature Construction

Feature construction refers to manually building new features from the raw data. It takes time to observe the raw data and think about the potential forms and structure of the problem; sensitivity to the data and hands-on machine learning experience both help with feature construction.

Feature construction requires strong insight and analytical ability; it asks us to find features with physical meaning in the raw data. Assuming the raw data is tabular, new features can generally be created by mixing or combining attributes, or by decomposing or splitting original features.

Feature construction relies heavily on domain knowledge or practical experience to build genuinely useful new features. Compared with feature extraction, which converts the raw data into features through a number of ready-made extraction methods, feature construction requires us to build features by hand, such as combining two features into one, or decomposing one feature into several new ones. Some common directions (a code sketch follows this list):

  • Statistical features: counts, sums, ratios, standard deviations;
  • Time features, including relative and absolute time, holidays, weekends, and the like;
  • Geographical information, including binning and distribution encoding;
  • Non-linear transformations, including log, square, square root, and the like;
  • Feature combinations and feature crosses;
  • Beyond these, different people will see different possibilities.
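
A minimal sketch of a few of these constructions on toy used-car records; all column names here are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy used-car records; column names are hypothetical.
df = pd.DataFrame({
    "regDate": pd.to_datetime(["2015-01-15", "2012-06-30", "2018-03-20"]),
    "creatDate": pd.to_datetime(["2020-03-01", "2020-03-07", "2020-03-05"]),
    "brand": ["audi", "toyota", "audi"],
    "power": [150.0, 75.0, 190.0],
    "price": [25000, 9000, 31000],
})

# Time feature: days between registration and listing (relative time).
df["used_days"] = (df["creatDate"] - df["regDate"]).dt.days

# Statistical feature: mean price of each brand, attached to every row.
df["brand_price_mean"] = df.groupby("brand")["price"].transform("mean")

# Non-linear transformation: log1p compresses a long-tailed target.
df["log_price"] = np.log1p(df["price"])

# Binning: discretize power into coarse intervals.
df["power_bin"] = pd.cut(df["power"], bins=[0, 100, 200, np.inf],
                         labels=["low", "mid", "high"])

# Feature cross: combine brand with the power bin.
df["brand_power"] = df["brand"] + "_" + df["power_bin"].astype(str)
print(df)
```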

For the full code, see my GitHub.