Data preprocessing for data mining in Python



The main Python functions for data preprocessing

In data mining, raw data typically contains large amounts of incomplete (missing values), inconsistent, and abnormal data, which seriously reduces the efficiency of data mining modeling and may even bias the mining results, so data cleaning is particularly important. It is followed by (or carried out together with) a series of processing steps such as data integration, data transformation, and data reduction on the cleaned data; this whole process is data preprocessing. Data preprocessing improves the quality of the data on the one hand, and on the other hand adapts the data to the particular mining technique or tool. Statistics show that data preprocessing accounts for about 60% of the workload of the entire data mining process.

The main contents of data preprocessing are: data cleaning, data integration, data transformation, and data reduction.

| Function name | Function | Library |
| --- | --- | --- |
| interpolate | One-dimensional and high-dimensional data interpolation | SciPy |
| unique | Removes duplicate values to obtain a list of unique elements; also a method of Series objects | Pandas/NumPy |
| isnull | Determines whether each value is null | Pandas |
| notnull | Determines whether each value is non-null | Pandas |
| PCA | Principal component analysis on a matrix of indicator variables | Scikit-Learn |
| random | Generates random matrices | NumPy |

1、interpolate

Function: interpolate is a sub-library of SciPy that contains a large number of interpolation functions, such as Lagrange interpolation, spline interpolation, and high-dimensional interpolation. Before use, import the corresponding interpolation function with from scipy.interpolate import *; the available function names can be looked up on the official website as needed.

Using the format:

f = scipy.interpolate.lagrange(x,y)

This performs one-dimensional Lagrange interpolation, where x and y are the data for the independent and dependent variables. Once the interpolation is fitted, the interpolated result at a new point a can be computed as f(a). Spline interpolation and multi-dimensional data interpolation are used similarly and are not shown one by one here.
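For example, a minimal sketch of computing interpolated values with scipy.interpolate.lagrange; the sample points x and y below are made up for illustration:

import numpy as np
from scipy.interpolate import lagrange

x = np.array([1, 2, 3, 4, 5])              # independent variable
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])    # dependent variable
f = lagrange(x, y)                         # fit the Lagrange polynomial
print(f(2.5))                              # interpolated value at x = 2.5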

2、unique

Function: removes duplicate elements from the data to obtain a list of unique values. It is both a function in the NumPy library (np.unique()) and a method of Pandas Series objects.

Using the format:

  • np.unique(D), where D is one-dimensional data and may be a list, array, or Series
  • D.unique(), where D is a Pandas Series object

Example:

Find the unique elements in a vector (np.unique can additionally return the associated indices via its return_index argument):

import numpy as np
import pandas as pd

D = pd.Series([1, 1, 2, 3, 5])
print(D.unique())
print(np.unique(D))

result:

[1 2 3 5]
[1 2 3 5]

Process finished with exit code 0

3、isnull / notnull

Function: determine whether each element is a null / non-null value.

Using the format: D.isnull() / D.notnull(). Here D is required to be a Pandas Series object, and a Boolean Series is returned. The null / non-null values of D can then be selected with D[D.isnull()] or D[D.notnull()].
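For example, a minimal sketch of selecting null and non-null values; the sample Series (with one np.nan) is made up for illustration:

import numpy as np
import pandas as pd

D = pd.Series([1.0, np.nan, 3.0])
print(D.isnull())        # Boolean Series: True where the value is missing
print(D[D.notnull()])    # keep only the non-null values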

4、random

Function: random is a sub-library of NumPy (Python has a built-in random module, but NumPy's is more powerful). The functions in this library can generate random matrices that follow various distributions, which is useful for sampling.

Using the format:

  • np.random.rand(k, m, n, ...) generates a k * m * n * ... random matrix whose elements are uniformly distributed on the interval (0, 1)
  • np.random.randn(k, m, n, ...) generates a k * m * n * ... random matrix whose elements follow the standard normal distribution
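For example, a minimal sketch of both generators; the shapes are chosen arbitrarily for illustration:

import numpy as np

U = np.random.rand(2, 3)     # 2 * 3 matrix, uniform on (0, 1)
N = np.random.randn(2, 3)    # 2 * 3 matrix, standard normal
print(U)
print(N)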

5、PCA

Function: principal component analysis on a matrix of indicator variables. Before use, the function needs to be imported with from sklearn.decomposition import PCA.

Using the format: model = PCA(). Note that PCA under Scikit-Learn builds a model object; that is, the general workflow is to build the model and then train it with model.fit(D), where D is the data matrix to be analyzed. After training, the model's parameters can be read: .components_ gives the eigenvectors (principal components), and .explained_variance_ratio_ gives the variance contribution ratio of each component.

Example:

Use PCA() to perform principal component analysis on a random 10 * 4 matrix:

import numpy as np
from sklearn.decomposition import PCA

D = np.random.randn(10, 4)
pca = PCA()
pca.fit(D)  # returns PCA(copy=True, n_components=None, whiten=False)
print(pca.components_)  # the eigenvectors (principal components) of the model
print("*" * 50)
print(pca.explained_variance_ratio_)  # variance ratio explained by each component

result:

[[-0.73391691  0.22922579 -0.13039917  0.62595332]
 [-0.41771778  0.57241446 -0.02724733 -0.70506108]
 [ 0.22012336  0.49807219  0.80277934  0.24293029]
 [-0.48828633 -0.60968952  0.58120475 -0.22815825]]
**************************************************
[0.50297117 0.28709267 0.14575757 0.06417859]

Process finished with exit code 0

Source: blog.csdn.net/weixin_43656359/article/details/104689096