04 - Dimensionality reduction

06 Data dimensionality reduction

  • Dimensionality reduction: reducing the number of features
  • Feature selection
  • Principal component analysis

Feature selection:

Why do we need feature selection?

  • Redundancy: highly correlated features carry overlapping information and waste computing resources
  • Noise: some features can hurt, rather than help, the prediction results

What is feature selection?

  1. Feature selection simply means choosing a subset of all the extracted features to serve as the training-set features. The feature values may or may not change during selection, but the feature dimension after selection is definitely smaller than before, since we keep only a part of the original features.

  2. Main methods:
    • Filter (filtering): VarianceThreshold (variance filtering)
    • Embedded (embedding): regularization, decision trees (see the sketch after the example below)
    • Wrapper (wrapping)
  3. The VarianceThreshold module

from sklearn.feature_selection import VarianceThreshold


def var():
    """
    Feature selection: remove low-variance features
    :return: None
    """
    # The threshold should be chosen according to the actual data
    var = VarianceThreshold(threshold=0.0)
    data = var.fit_transform([[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]])
    # Columns 0 and 3 have zero variance and are dropped, leaving:
    # [[2 0]
    #  [1 4]
    #  [1 1]]
    print(data)
    return None


if __name__ == '__main__':
    var()
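
The methods list above also names embedded selection, where the model picks features itself while it trains. Below is a minimal sketch using sklearn's SelectFromModel with a decision tree; the toy data, labels, and threshold are illustrative assumptions, not from the original post:

from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Toy data: 3 samples, 4 features, with made-up class labels
X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
y = [0, 1, 1]

# The tree assigns an importance to each feature during training;
# SelectFromModel keeps the features whose importance clears the threshold
selector = SelectFromModel(DecisionTreeClassifier(random_state=0), threshold="mean")
data = selector.fit_transform(X, y)
print(data)                    # only the columns the tree found useful
print(selector.get_support())  # boolean mask over the 4 original features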

Principal component analysis (PCA)

What is PCA?

  1. Essence: a technique for analyzing and simplifying a dataset
  2. Objective: compress the data's dimensions, reducing the dimensionality (complexity) of the original data as much as possible while losing only a small amount of information
  3. Role: can cut down the number of features in regression analysis or cluster analysis (useful when the feature count reaches the hundreds)

High-dimensional data is prone to problems

  • Features are often correlated with one another (for example, a product's weight in kilograms and in pounds carry the same information)

PCA syntax

  1. The sklearn.decomposition module
  2. PCA(n_components=None)
  • n_components as a decimal: the fraction of information (variance) to retain, typically 90% to 95%
  • n_components as an integer: the number of dimensions to keep

from sklearn.decomposition import PCA

def pca():
    """
    Dimensionality reduction with principal component analysis
    :return: None
    """
    # Keep enough components to retain 90% of the variance
    pca = PCA(n_components=0.9)
    data = pca.fit_transform([[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]])
    print(data)

    return None


if __name__ == '__main__':
    pca()
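
To see how much information the retained components actually carry, the fitted PCA object exposes the standard sklearn attribute explained_variance_ratio_. A small sketch on the same toy data as above:

from sklearn.decomposition import PCA

pca = PCA(n_components=0.9)
data = pca.fit_transform([[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]])

# Fraction of the total variance carried by each retained component;
# the values sum to at least 0.9 because n_components=0.9 was requested
print(pca.explained_variance_ratio_)
print(data.shape)  # fewer columns than the 4 original features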

Dimensionality reduction case study

Segmenting Instacart users into categories

  1. Question: reduce the dimensionality of users' preferences for product categories, so that the users can be segmented
  • products.csv: product information
  • order_products__prior.csv: order and product information
  • orders.csv: customer order information
  • aisles.csv: which specific category (aisle) each product belongs to
  2. Merge the four tables into one (user-item)
  3. Run principal component analysis
    Kaggle Instacart dataset link
import pandas as pd
from sklearn.decomposition import PCA

# Read the four tables
prior = pd.read_csv('./order_products__prior.csv')
products = pd.read_csv('./products.csv')
orders = pd.read_csv('./orders.csv')
aisles = pd.read_csv('./aisles.csv')

# Merge the four tables into one (user-item)
_mg = pd.merge(prior, products, on='product_id')
_mg = pd.merge(_mg, orders, on='order_id')
mt = pd.merge(_mg, aisles, on='aisle_id')

# Cross table (a special grouping tool): one row per user, one column
# per aisle, each cell counting how often that user bought from that aisle
cross = pd.crosstab(mt['user_id'], mt['aisle'])

# Run principal component analysis, keeping 90% of the variance
pca = PCA(n_components=0.9)
data = pca.fit_transform(cross)
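
A quick sanity check, run right after the script above; the exact column counts depend on the Kaggle files, so treat the comments as approximate:

# Before the reduction: one column per aisle (on the order of a hundred columns)
print(cross.shape)
# After: far fewer principal components, while still retaining ~90% of the variance
print(data.shape)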

Origin www.cnblogs.com/hp-lake/p/11831224.html