06 Data dimensionality reduction
- Dimensionality reduction: reducing the number of features
- Feature selection
- Principal component analysis
Feature selection
Why feature selection?
- Redundancy: highly correlated features carry duplicate information and waste computing resources
- Noise: some features can distort the model's results
What is feature selection?
Feature selection simply means picking a subset of all the features to use as the training set. The feature values may or may not change after selection, but the number of dimensions after selection is definitely smaller than before, because we keep only some of the features.
- Main methods:
- Filter: VarianceThreshold (variance filtering)
- Embedded: regularization, decision trees
- Wrapper
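As a small sketch of the embedded approach (not from the notes; the synthetic data and threshold are illustrative assumptions), a decision tree's feature importances can drive selection via scikit-learn's SelectFromModel:

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# The label depends only on features 0 and 1; the other three are noise
y = (X[:, 0] + X[:, 1] > 0).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# Keep only features whose importance is above the mean importance
selector = SelectFromModel(tree, prefit=True)
X_new = selector.transform(X)
print(X_new.shape)  # fewer than 5 columns remain
```

The tree learns during training which features matter, so the selection is "embedded" in the model itself rather than applied as a separate filtering step.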
VarianceThreshold module

from sklearn.feature_selection import VarianceThreshold

def var():
    """
    Feature selection: remove low-variance features
    :return: None
    """
    # The threshold depends on the data; 0.0 drops only constant features
    var = VarianceThreshold(threshold=0.0)
    data = var.fit_transform([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]])
    print(data)
    return None

if __name__ == '__main__':
    var()
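To see why two columns disappear in the example above, the fitted transformer exposes the per-feature variances and a mask of the kept features (a minimal sketch using the same toy matrix):

```python
from sklearn.feature_selection import VarianceThreshold

X = [[0, 2, 0, 3],
     [0, 1, 4, 3],
     [0, 1, 1, 3]]

vt = VarianceThreshold(threshold=0.0)
X_new = vt.fit_transform(X)

# Per-feature variances computed during fit
print(vt.variances_)      # columns 0 and 3 are constant, so their variance is 0
# Boolean mask of the features that passed the threshold
print(vt.get_support())   # [False  True  True False]
print(X_new.shape)        # (3, 2): the two constant columns were removed
```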
Principal component analysis (PCA)
What is PCA?
- Essence: an analysis technique that simplifies a data set
- Goal: compress the data, reducing the dimensionality (complexity) of the original data as much as possible while losing as little information as possible
- Role: can reduce the number of features (possibly hundreds) going into a regression or cluster analysis
Problems with high-dimensional data
- Features are often correlated with each other
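As a small illustration of that correlation problem (synthetic data, not from the notes), two nearly linearly dependent features can be captured by a single principal component:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=100)
# The second feature is almost an exact multiple of the first (high correlation)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.01, size=100)])

print(np.corrcoef(X.T)[0, 1])   # close to 1.0

pca = PCA(n_components=0.95)    # keep 95% of the variance
X_r = pca.fit_transform(X)
print(X_r.shape)                # (100, 1): one component suffices
```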
PCA syntax
- sklearn.decomposition
- PCA(n_components=None)
- n_components as a decimal: the fraction of variance (information) to retain, typically 0.90 to 0.95
- n_components as an integer: the number of dimensions to keep
from sklearn.decomposition import PCA

def pca():
    """
    Use principal component analysis to reduce feature dimensionality
    :return: None
    """
    # Keep enough components to retain 90% of the variance
    pca = PCA(n_components=0.9)
    data = pca.fit_transform([[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]])
    print(data)
    return None

if __name__ == '__main__':
    pca()
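To check how much information the retained components actually carry, the fitted estimator's `explained_variance_ratio_` and `n_components_` attributes can be inspected (a minimal sketch on the same toy matrix):

```python
from sklearn.decomposition import PCA

X = [[2, 8, 4, 5],
     [6, 3, 0, 8],
     [5, 4, 9, 1]]

pca = PCA(n_components=0.9)
X_r = pca.fit_transform(X)

# Fraction of the total variance captured by each retained component
print(pca.explained_variance_ratio_)
# Number of components that were needed to reach 90% of the variance
print(pca.n_components_)
```

With only 3 samples, at most 2 meaningful components exist after centering, so the cumulative ratio always reaches 1.0 within two components.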
Dimensionality reduction case
Segmenting instacart users into categories
- Question: reduce the dimensionality of user preferences for product categories
- products.csv: product information
- order_products__prior.csv: order and product information
- orders.csv: customer order information
- aisles.csv: which category (aisle) each product belongs to
Merge the information into one table
- Then apply principal component analysis
The instacart data set is available on Kaggle
import pandas as pd
from sklearn.decomposition import PCA

# Read the four tables
prior = pd.read_csv('./order_products__prior.csv')
products = pd.read_csv('./products.csv')
orders = pd.read_csv('./orders.csv')
aisles = pd.read_csv('./aisles.csv')

# Merge the four tables into one (user-item)
_mg = pd.merge(prior, products, on='product_id')
_mg = pd.merge(_mg, orders, on='order_id')
mt = pd.merge(_mg, aisles, on='aisle_id')

# Cross table (a special grouping tool): users as rows, aisle counts as columns
cross = pd.crosstab(mt['user_id'], mt['aisle'])

# Apply principal component analysis, keeping 90% of the variance
pca = PCA(n_components=0.9)
data = pca.fit_transform(cross)
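Since the CSVs above require the Kaggle download, the same crosstab-then-PCA pipeline can be sketched on a few synthetic rows (the user IDs and aisle names below are made up for illustration):

```python
import pandas as pd
from sklearn.decomposition import PCA

# Tiny made-up purchase records standing in for the merged table `mt`
mt = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3, 3, 3],
    'aisle':   ['fresh fruits', 'yogurt', 'yogurt',
                'fresh fruits', 'chips', 'yogurt', 'chips', 'chips'],
})

# One row per user, one column per aisle, purchase counts in the cells
cross = pd.crosstab(mt['user_id'], mt['aisle'])
print(cross)

# Reduce the aisle dimensions while keeping 90% of the variance
pca = PCA(n_components=0.9)
data = pca.fit_transform(cross)
print(cross.shape, '->', data.shape)
```

On the real data set the crosstab has one column per aisle (134 in the Kaggle data), and PCA compresses those columns to the few dozen components that explain 90% of the variance.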