Data preprocessing in Python: dimensionality reduction

Today I'd like to share a Python data preprocessing method, dimensionality reduction. It makes a good reference, and I hope it helps you. Come and follow along.

Why reduce the dimensionality of data

Dimensionality reduction can reduce the amount of computation a model needs and shorten its running time, lessen the influence of noisy variables on model results, make the data easier to display visually by reducing the number of dimensions, and save storage space. Therefore, in most cases, when we are faced with high-dimensional data, we need to reduce its dimensionality.

There are two ways to reduce dimensionality: feature selection and dimension transformation.

Feature Selection

Feature selection means directly selecting, according to certain rules or experience, a subset of the original dimensions to take part in the modeling process, instead of using all of the features. Feature selection does not change the features themselves, and no new feature values are generated.

The benefit of feature selection as a dimensionality reduction approach is that it reduces dimensionality while keeping the original feature dimensions, which both satisfies the requirements of subsequent data processing and modeling and preserves the business meaning of the original dimensions, making the results easier for the business side to understand and apply. For business-facing analytical applications, the understandability and usability of the model often outweigh technical indicators such as accuracy and efficiency. For example, the rules obtained from a decision tree can be used directly as conditions for selecting samples, because those rules are generated from the input feature dimensions. A rough sketch of this kind of selection appears below.
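As a minimal sketch of feature selection in code (not the article's own example), here is one way to do it with scikit-learn's SelectFromModel and a decision tree; the synthetic data and the "mean" threshold are assumptions for demonstration only:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=1)
# Keep only the features whose importance is above the mean importance
selector = SelectFromModel(DecisionTreeClassifier(random_state=1), threshold="mean")
X_kept = selector.fit_transform(X_demo, y_demo)
print(X_demo.shape, X_kept.shape)  # fewer columns; the kept feature values are unchanged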

Dimension transformation

This method applies certain mathematical transformations: a mathematical model maps the data points of a given set of related variables (dimensions) from a high-dimensional space to a low-dimensional space, and the mapped feature variables are then used to represent the essential characteristics of the original variables. This approach produces new dimensions; the transformed dimensions are not the original features, so the new features lose the business meaning that the data expressed before the transformation. Reducing dimensionality by transforming the data dimensions is a very important class of methods. These methods are divided into linear and nonlinear dimensionality reduction; commonly used representatives include Independent Component Analysis (ICA), Principal Component Analysis (PCA), Factor Analysis (FA), Linear Discriminant Analysis (LDA), Locally Linear Embedding (LLE), Kernel Principal Component Analysis (Kernel PCA), and the like.
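As a quick sketch of what dimension transformation looks like in code, here are two of the algorithms listed above via scikit-learn; the digits dataset and the choice of three components are assumptions for illustration only:

from sklearn.datasets import load_digits
from sklearn.decomposition import FastICA, KernelPCA

X_digits, _ = load_digits(return_X_y=True)
# Map the 64 original dimensions to 3 new, artificial dimensions
X_ica = FastICA(n_components=3, random_state=1).fit_transform(X_digits)
X_kpca = KernelPCA(n_components=3, kernel="rbf").fit_transform(X_digits)
print(X_ica.shape, X_kpca.shape)  # (1797, 3) (1797, 3)

Note that the three output columns are newly constructed variables, not any of the original 64 pixels, which is exactly why the business meaning of the original dimensions is lost.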

Doing dimensionality reduction with Python

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA
 
# Load the data
df = pd.read_csv('https://raw.githubusercontent.com/ffzs/dataset/master/glass.csv')
 
# Take a look at the data
df.head()
 
 
        RI     Na    Mg    Al     Si     K    Ca   Ba   Fe  Type
0  1.52101  13.64  4.49  1.10  71.78  0.06  8.75  0.0  0.0     1
1  1.51761  13.89  3.60  1.36  72.73  0.48  7.83  0.0  0.0     1
2  1.51618  13.53  3.55  1.54  72.99  0.39  7.78  0.0  0.0     1
3  1.51766  13.21  3.69  1.29  72.61  0.57  8.22  0.0  0.0     1
4  1.51742  13.27  3.62  1.24  73.08  0.55  8.07  0.0  0.0     1
 
# Check for missing values
df.isna().values.any()
# False: there are no missing values
 
# Get the feature values
X = df.iloc[:, :-1].values
# Get the label values (as a 1-D array, which is what sklearn expects)
Y = df.iloc[:, -1].values
# Use sklearn's DecisionTreeClassifier to assess variable importance
# Build a classification decision tree model object
dt_model = DecisionTreeClassifier(random_state=1)
# Fit the model with the dataset's dimensions and the target variable
dt_model.fit(X, Y)
# Get the importance of all variables
feature_importance = dt_model.feature_importances_
feature_importance
# The result is as follows
# array([0.20462132, 0.06426227, 0.16799114, 0.15372793, 0.07410088, 0.02786222, 0.09301948, 0.16519298, 0.04922178])
# Visualize
import matplotlib.pyplot as plt
 
x = range(len(df.columns[:-1])) 
plt.bar(x, feature_importance)  # newer matplotlib takes the bar positions as the first argument, not 'left='
plt.xticks(x, df.columns[:-1])
plt.show()

As the chart shows, RI, Mg, Al, and Ba have relatively high importance; together these four variables account for roughly 69% of the total importance score, so they already explain most of the variation in the features.
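To turn these importance scores into an actual selection step (not shown in the original article), here is a minimal sketch reusing df and feature_importance from above; the 0.1 cutoff is an illustrative assumption:

# Keep only the columns whose importance exceeds an assumed 0.1 cutoff
feature_names = df.columns[:-1]
selected = feature_names[feature_importance > 0.1]
print(selected.tolist())          # ['RI', 'Mg', 'Al', 'Ba']
X_selected = df[selected].values  # reduced feature matrix, original values intact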

PCA dimensionality reduction

# Use sklearn's PCA for dimension transformation
# Build a PCA model object; n_components controls the number of output features
pca_model = PCA(n_components=3)
# Fit the model on the dataset
pca_model.fit(X)
# Transform (map) the dataset
X_pca = pca_model.transform(X)
# Get the principal components after the transformation
components = pca_model.components_
# Get the variance of each principal component
components_var = pca_model.explained_variance_
# Get the proportion of variance explained by each principal component
components_var_ratio = pca_model.explained_variance_ratio_
# Print the variances
print(np.round(components_var, 3))
# [3.002 1.659 0.68 ]
# Print the variance ratios
print(np.round(components_var_ratio, 3))
# [0.476 0.263 0.108]
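If you are not sure how many components to keep, scikit-learn's PCA also accepts a target share of explained variance instead of a fixed count; here is a minimal sketch reusing X from above, where the 0.85 target is an illustrative assumption:

# Let PCA pick the smallest number of components reaching the variance target
pca_auto = PCA(n_components=0.85)
X_reduced = pca_auto.fit_transform(X)
print(pca_auto.n_components_)  # number of components actually kept
print(np.round(np.cumsum(pca_auto.explained_variance_ratio_), 3))  # cumulative variance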

That is the whole of this Python data preprocessing method, dimensionality reduction, that I wanted to share.



Origin blog.csdn.net/haoxun11/article/details/105082185