Main Python Functions for Data Preprocessing
In data mining, raw data typically contains a large amount of incomplete (missing values), inconsistent, and abnormal data, which seriously affects the efficiency of data mining modeling and may even bias the mining results, so data cleaning is particularly important. Data cleaning is followed by (or accompanied by) a series of processing steps such as data integration, data transformation, and data reduction; this whole process is data preprocessing. On the one hand, data preprocessing improves the quality of the data; on the other hand, it makes the data better suited to a particular mining technique or tool. Statistics show that in the data mining process, data preprocessing accounts for about 60% of the total workload.
The main content of data preprocessing comprises: data cleaning, data integration, data transformation, and data reduction.
Function name | Function | Extension library |
---|---|---|
interpolate | One-dimensional and high-dimensional data interpolation | Scipy |
unique | Removes duplicate values to give a list of unique elements; also a method of the Series object | Pandas/Numpy |
isnull | Determines whether each value is null | Pandas |
notnull | Determines whether each value is non-null | Pandas |
PCA | Principal component analysis of an indicator variable matrix | Scikit-Learn |
random | Generates random matrices | Numpy |
1、interpolate
Function: interpolate is a sub-library of Scipy that contains a large number of interpolation functions, such as Lagrange interpolation, spline interpolation, and high-dimensional interpolation. Before use, import the corresponding interpolation function with from scipy.interpolate import *; the function names can be looked up on the official website as needed.
Using the format:
f = scipy.interpolate.lagrange(x,y)
This performs one-dimensional Lagrange interpolation on the data, where x and y are the data for the independent and dependent variables, respectively. After the interpolation is built, the interpolated result at a new point a can be computed as f(a). Spline interpolation and multi-dimensional data interpolation are similar and are not shown one by one here.
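As a small illustration (the sample points below are made up for the sketch), Lagrange interpolation can estimate a value at a new point:

```python
import numpy as np
from scipy.interpolate import lagrange

# Hypothetical sample points lying on y = x**2
x = np.array([1, 2, 3])
y = np.array([1, 4, 9])

f = lagrange(x, y)  # build the interpolating polynomial through (x, y)
print(f(2.5))       # value at a new point, approximately 6.25
```

Since three points determine a unique quadratic, the interpolant here reproduces y = x**2, so f(2.5) is approximately 6.25.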
2、unique
Function: removes duplicate elements to give a list of the unique element values. It is both a function in the Numpy library (np.unique()) and a method of the Pandas Series object.
Using the format:
- np.unique(D), where D is one-dimensional data and may be a list, array, or Series
- D.unique(), where D is a Pandas Series object
Example:
Find the unique element values in a vector
import numpy as np
import pandas as pd

D = pd.Series([1, 1, 2, 3, 5])
print(D.unique())
print(np.unique(D))
result:
[1 2 3 5]
[1 2 3 5]
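np.unique can also report where each unique value first appears, via the return_index parameter (a minimal sketch):

```python
import numpy as np

D = [1, 1, 2, 3, 5]
# return_index=True additionally returns the index of each
# value's first occurrence in the original data
values, idx = np.unique(D, return_index=True)
print(values)  # [1 2 3 5]
print(idx)     # [0 2 3 4]
```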
3、isnull / notnull
Function: determines whether each element is a null / non-null value.
Using the format: D.isnull() / D.notnull(). Here D is required to be a Series object, and a Boolean Series is returned. The null / non-null values of D can then be found with D[D.isnull()] or D[D.notnull()].
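A minimal sketch of both forms on a Series with one missing value:

```python
import numpy as np
import pandas as pd

D = pd.Series([1.0, np.nan, 3.0])

print(D.isnull())       # Boolean Series: False, True, False
print(D[D.notnull()])   # keeps only the non-null values 1.0 and 3.0
```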
4、random
Function: random is a sub-library of Numpy (Python itself has a built-in random module, but Numpy's is more powerful) that can generate random matrices subject to certain distributions; the functions in this library can be used for sampling.
Using the format:
- np.random.rand(k, m, n, ...) generates a k * m * n * ... random matrix whose elements are uniformly distributed on the interval (0, 1)
- np.random.randn(k, m, n, ...) generates a k * m * n * ... random matrix whose elements follow the standard normal distribution
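A small sketch of both generators (the shape 2 * 3 is arbitrary):

```python
import numpy as np

U = np.random.rand(2, 3)   # 2x3 matrix, uniform on (0, 1)
N = np.random.randn(2, 3)  # 2x3 matrix, standard normal

print(U.shape, N.shape)
```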
5、PCA
Function: principal component analysis of an indicator variable matrix. Before use, the function must be imported with from sklearn.decomposition import PCA.
Using the format: model = PCA(). Note that PCA in Scikit-Learn builds a model object; that is, the general process is to build the model and then train it with model.fit(D), where D is the data matrix to be analyzed. After training, the eigenvectors can be obtained from the model's .components_ attribute, and the variance contribution ratio of each component from .explained_variance_ratio_, among other attributes.
Example:
Use PCA() to perform principal component analysis on a random 10 * 4 matrix
from sklearn.decomposition import PCA
import numpy as np

D = np.random.randn(10, 4)
pca = PCA()
pca.fit(D)
print(pca.components_)  # return the model's eigenvectors
print("*" * 50)
print(pca.explained_variance_ratio_)  # return the variance percentage of each component
result:
[[-0.73391691 0.22922579 -0.13039917 0.62595332]
[-0.41771778 0.57241446 -0.02724733 -0.70506108]
[ 0.22012336 0.49807219 0.80277934 0.24293029]
[-0.48828633 -0.60968952 0.58120475 -0.22815825]]
**************************************************
[0.50297117 0.28709267 0.14575757 0.06417859]
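Beyond inspecting the components, PCA is usually used to actually reduce dimensionality by passing n_components; a minimal sketch (the seed is set only so the run is reproducible):

```python
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(0)
D = np.random.randn(10, 4)

# keep only the first 2 principal components
pca = PCA(n_components=2)
low_d = pca.fit_transform(D)  # project the data onto those components

print(low_d.shape)  # (10, 2)
```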