[Data Mining] (3) Feature Engineering

1. Feature Engineering Overview

The data and its features determine the upper limit of what machine learning can achieve; models and algorithms merely approach this limit as closely as possible. Feature engineering therefore plays an important role in any machine learning project.

Feature engineering: finding all the information relevant to the problem and converting it into numerical values that form the feature matrix.


2. Applying Feature Engineering in This Project

2.1 Outlier Handling

When pre-processing the data, whether outliers should be removed depends on the situation, because some outliers may also contain useful information.

Common outlier handling methods:

  • Delete the record: directly delete the entire record that contains the outlier.
  • Treat as a missing value: regard the outlier as a missing value and handle it with missing-value methods.
  • Mean correction: replace the outlier with the average of the two observations before and after it.
  • No treatment: mine and model the data set directly with the outliers left in.

When handling outliers, first analyse the likely causes of their appearance, and then decide whether they should be discarded; if the data are correct, the data set can be mined and modelled directly with the outliers included.
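A minimal sketch of the four treatment options above, assuming a hypothetical price column and using the box-plot (IQR) rule simply as one common way to flag outliers (the post itself does not prescribe a detection rule):

```python
import numpy as np
import pandas as pd

# hypothetical numeric column with one obvious outlier
df = pd.DataFrame({"price": [12.0, 13.5, 11.8, 250.0, 12.9, 13.1, 12.2]})

# flag outliers with the box-plot (IQR) rule -- one common choice, not prescribed above
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)

# 1) delete the whole record
dropped = df[~is_outlier]

# 2) treat the outlier as a missing value and deal with it later
as_missing = df.copy()
as_missing.loc[is_outlier, "price"] = np.nan

# 3) mean correction: average of the observations just before and just after
#    (assumes the outlier is neither the first nor the last row)
corrected = df.copy()
for i in corrected.index[is_outlier]:
    corrected.loc[i, "price"] = (df["price"].iloc[i - 1] + df["price"].iloc[i + 1]) / 2

# 4) no treatment: keep df unchanged and model on it directly
```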

2.2 Data Normalization

Also called data standardization, it is a fundamental task in data mining. Because different evaluation indicators often have different scales, their values can differ greatly, and leaving them untreated may distort the results of the analysis. The data are therefore scaled so that they fall within a specific range, allowing the indicators to be analysed together.

1) Min-max normalization

Also known as deviation standardization, it is a linear transformation of the original data that maps the values into $[0, 1]$. Conversion formula:
$$x^* = \frac{x - \min}{\max - \min}$$
where $\max$ is the maximum of the sample data, $\min$ is the minimum, and $\max - \min$ is the range.

Drawbacks: if the values are concentrated and one value is very large, then after normalization most values will be close to 0 and overly concentrated; and if new data fall outside the original $[\min, \max]$ range, an error occurs and $\min$ and $\max$ have to be re-determined.

2) Standard deviation normalization

Also called zero-mean normalization: after the transformation the data have mean $0$ and standard deviation $1$. Conversion formula:
$$x^* = \frac{x - \bar{x}}{\sigma}$$
where $\bar{x}$ is the mean of the original data and $\sigma$ is the standard deviation of the original data. This is the most commonly used standardization method.
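A short sketch of the two formulas on a toy NumPy array (scikit-learn's MinMaxScaler and StandardScaler implement the same transformations):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 50.0, 100.0])

# min-max normalization: x* = (x - min) / (max - min), mapped into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# zero-mean normalization: x* = (x - mean) / std, giving mean 0 and std 1
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)   # [0.    0.111 0.222 0.444 1.   ] approximately
print(x_zscore)   # mean ~0, standard deviation ~1
```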

2.3 Data Binning

Also known as discretization of continuous data, it transforms a continuous attribute into a categorical one. Some data mining algorithms (ID3, Apriori, etc.) require the data to be in the form of categorical attributes.

Discretizing continuous data means placing several cut points within the value range of the attribute, dividing the range into a number of discrete intervals, and finally representing the data values falling into each sub-interval with a distinct symbol or integer label. Discretization therefore involves two sub-tasks: deciding the number of categories, and deciding how to map the continuous attribute values onto them.

Common binning methods: equal-width binning, equal-frequency binning, clustering-based binning, and so on.

1) Equal-width binning

The value range of the attribute is divided into intervals of equal width; the number of intervals is determined by the data or specified by the user, similar to building a frequency distribution table.

Disadvantages: 1. the intervals have to be planned manually; 2. it is sensitive to outliers, and the attribute values tend to be distributed unevenly across the intervals, which can seriously harm the resulting model.

2) Equal-frequency binning

Each interval receives the same number of records.

Advantages: avoids the drawbacks of equal-width binning.

Disadvantages: 1. the intervals still have to be planned manually; 2. identical data values may end up in different intervals in order to keep the number of records per interval fixed.

3) Clustering-based binning

This method involves two steps: first, cluster the values of the continuous attribute with a clustering algorithm (such as K-Means); then process the resulting clusters, merging the continuous attribute values that fall into the same cluster and giving them the same label.
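A minimal sketch of the three binning methods on a toy one-dimensional attribute, using pandas.cut, pandas.qcut and scikit-learn's KMeans as one possible implementation (not the original project's code):

```python
import pandas as pd
from sklearn.cluster import KMeans

values = pd.Series([1, 2, 3, 4, 5, 6, 20, 21, 22, 50, 51, 52], dtype=float)

# 1) equal-width binning: 3 intervals of identical width
equal_width = pd.cut(values, bins=3, labels=[0, 1, 2])

# 2) equal-frequency binning: 3 intervals with the same number of records
equal_freq = pd.qcut(values, q=3, labels=[0, 1, 2])

# 3) clustering-based binning: K-Means on the 1-D values, cluster id as the bin label
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_bins = kmeans.fit_predict(values.to_numpy().reshape(-1, 1))

print(pd.DataFrame({"value": values, "width": equal_width,
                    "freq": equal_freq, "cluster": cluster_bins}))
```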

2.4 Missing Value Handling

There are four ways to handle missing values: deleting the record, leaving it untreated, data interpolation, and data binning.

1) Deleting records

If only a small proportion of the samples have missing values, deleting those samples is clearly the simplest and most effective method.

Drawback: it wastes data.

2) No treatment

Some models allow the training data to contain missing values.

3) Data interpolation

Common interpolation methods:

  • Fixed value: replace the missing value with a fixed constant.
  • Mean / median / mode: fill with the statistic that matches the attribute's data type.
  • Nearest-neighbour interpolation: find the sample closest to the record with the missing value and use its attribute value.
  • Regression: build a fitted model from the existing data and other related variables to estimate the missing value.
  • Interpolation functions: build a suitable interpolation function through several known points and fill the missing value with the function value at the corresponding point.

4) Data binning

Place the records that contain missing values into the same bin.
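A minimal sketch of several of these options on a hypothetical numeric column, using pandas and scikit-learn's SimpleImputer; the column name and fill constant are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"mileage": [3.2, np.nan, 5.1, 4.8, np.nan, 6.0]})

# delete records with missing values
dropped = df.dropna()

# fixed-value imputation
fixed = df.fillna(0.0)

# mean imputation (median / mode work the same way via strategy=...)
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)

# interpolation: fill from an interpolating function through the known points
interpolated = df.interpolate(method="linear")
```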

2.5 Feature Construction

1) Statistical features

2) Time features

3) Geographic information features

4) Non-linear transformations

5) Feature combinations (a small sketch of several of these follows)
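As a hypothetical illustration of the statistical, time, non-linear and combination feature types listed above, the pandas sketch below constructs them on a made-up used-car table; every column name is an assumption, and this is not the original project's code:

```python
import numpy as np
import pandas as pd

# hypothetical used-car records; every column name here is made up
df = pd.DataFrame({
    "brand": ["A", "A", "B", "B"],
    "price": [5.0, 7.0, 3.0, 4.0],
    "reg_date": pd.to_datetime(["2014-03-01", "2015-06-10", "2016-01-20", "2013-11-05"]),
    "sale_date": pd.to_datetime(["2019-03-01", "2019-06-10", "2019-01-20", "2019-11-05"]),
})

# statistical features: per-brand aggregates joined back onto each record
brand_stats = (df.groupby("brand")["price"]
                 .agg(["mean", "max", "std"])
                 .add_prefix("brand_price_")
                 .reset_index())
df = df.merge(brand_stats, on="brand", how="left")

# time features: how long the car was in use before the sale
df["used_days"] = (df["sale_date"] - df["reg_date"]).dt.days

# non-linear transformation and feature combination
df["price_log"] = np.log1p(df["price"])
df["price_per_day"] = df["price"] / df["used_days"]
```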

2.6 Feature Selection

Principle: obtain the smallest possible feature subset that does not significantly reduce classification accuracy and does not distort the class distribution; the selected subset should also be stable and adaptable.

1) Filter methods

In this approach, features are selected first and the learner is trained afterwards, so the feature selection process is independent of the learner. It amounts to filtering the features first and then training the learner on the resulting subset.

Idea: give each feature dimension a "score", i.e. assign a weight to every dimension, and then rank the features by these weights.

Methods :

  • Chi-squared test
  • Information gain
  • Correlation coefficient scores

Advantages: fast to run; a very commonly used feature selection approach.

Disadvantages: 1. it provides no feedback: the selection criteria and the feature search are carried out independently of the learner, so the learning algorithm cannot pass its needs back to the feature search; 2. a feature may be judged unimportant on its own for some reason, yet become important when combined with other features.
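A minimal sketch of filter-style selection with scikit-learn, scoring each feature independently with the chi-squared test and keeping the top k; the built-in iris data set stands in for a real problem:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# score every feature against the target, then keep the 2 best-scoring ones
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)   # per-feature chi-squared scores
print(X_selected.shape)   # (150, 2)
```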

2) Wrapper methods

Also known as the wrap-around approach. It directly uses the final classifier as the evaluation function for feature selection, choosing the optimal feature subset for that particular classifier.

Idea: treat subset selection as a search and optimization problem: generate different feature combinations, evaluate each combination, and compare it against the others. Since this is an optimization problem, optimization algorithms can be applied, in particular heuristic ones such as GA, PSO, DE and ABC.

Method: recursive feature elimination (RFE).

Advantages: 1. the feature search is built around the learning algorithm, so the selection criteria follow the needs of that algorithm; 2. it can take the learning bias of the algorithm into account and find the feature subset that is genuinely best for the learning problem itself; 3. because the learning algorithm must be run on every candidate subset, the wrapper can exploit that algorithm's bias, which is where wrappers add the most value.

Drawback: runs much more slowly than filter methods, so it is not widely used in practice.
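A minimal sketch of wrapper-style selection: recursive feature elimination wrapped around a logistic regression classifier, again on the iris data; the choice of estimator and of n_features_to_select is illustrative only:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# repeatedly fit the classifier and drop the weakest feature until 2 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # rank 1 = selected
```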

3) Embedded methods

Feature selection is embedded in model training. The model used for selection may be the same as the final model, but once feature selection is finished, the selected features and tuned hyperparameters can be used to train the model again for further optimization.

Idea: learn which features contribute most to the model's accuracy, i.e. while the model is being fitted, identify the features that matter most for training it.

Methods: feature selection with an L1 regularization term (an L2 penalty can be combined with it for optimization), the random forest mean-decrease-in-impurity method, and the mean-decrease-in-accuracy method.

Advantages: 1. the feature search is built around the learning algorithm and can take its learning bias into account; 2. it needs fewer training runs than wrapper methods, which saves time.

Disadvantage: still slow compared with filter methods.
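A minimal sketch of embedded selection with scikit-learn: an L1-penalized linear model and a random forest's impurity-based importances, each combined with SelectFromModel; the parameters are illustrative and not tuned for any real data set:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# L1 regularization shrinks unimportant coefficients to zero
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
X_l1 = SelectFromModel(l1_model).fit_transform(X, y)

# random forest: mean decrease in impurity as the importance score
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
X_rf = SelectFromModel(forest, prefit=True, threshold="median").transform(X)

print(X_l1.shape, X_rf.shape)
```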

2.7 Dimensionality Reduction

1) Principal component analysis

Principal Component Analysis (PCA) uses an orthogonal transformation to project a set of observations of possibly correlated variables onto a set of linearly uncorrelated variables, called principal components. It is a very basic dimensionality reduction algorithm.

Note: PCA is sensitive to how the raw data are normalized and pre-processed.

2) Linear discriminant analysis

Linear Discriminant Analysis (LDA) projects the data into a lower-dimensional space so that samples of the same class are as compact as possible while samples of different classes are as spread out as possible; it is a supervised machine learning algorithm.

3) Independent component analysis

Independent Component Analysis (ICA) is a method for uncovering the underlying factors or components of multidimensional statistical data.
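A minimal sketch of the three methods with scikit-learn on the iris data; the features are standardized first because, as noted above, PCA is sensitive to scaling:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, FastICA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

X_pca = PCA(n_components=2).fit_transform(X_std)                             # unsupervised
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_std, y)   # supervised
X_ica = FastICA(n_components=2, random_state=0).fit_transform(X_std)         # independent components

print(X_pca.shape, X_lda.shape, X_ica.shape)   # (150, 2) each
```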

To be continued

References:

  1. Data mining in action: used-car price prediction
  2. "Python Data Analysis and Mining in Action", China Machine Press
  3. Feature engineering series: principles and implementation of feature selection
  4. Wikipedia: Principal component analysis
  5. Zhihu column: Machine Learning - LDA