Sklearn User Manual Study Notes -- Dataset Transformation

4. Dataset Transformation

A library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised dimensionality reduction), expand (see Kernel approximation) or generate (see Feature extraction) feature representations.

Like other estimators, these are represented by classes with:

  • fit method: learns model parameters from a training set
  • transform method: applies the learned transformation to unseen data
  • fit_transform method: fits and transforms the training data in one step, which can be more convenient and efficient
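
A minimal sketch of this interface, using StandardScaler (covered in 4.3) as the transformer; the arrays here are made up purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_test = np.array([[2.0, 3.0]])

scaler = StandardScaler()
scaler.fit(X_train)                      # learn column means and variances from the training set
X_train_t = scaler.transform(X_train)    # apply the learned scaling
X_test_t = scaler.transform(X_test)      # same learned parameters applied to unseen data
X_train_t2 = StandardScaler().fit_transform(X_train)   # fit and transform in one call
```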

4.1 Pipelines and composite estimators

This module is used to combine such transformers, either in parallel or in series.

4.1.1 Pipeline: chaining estimators

Pipeline: used to chain multiple estimators into one. The advantages of using a pipeline are as follows:

  1. Convenience & encapsulation: only have to call fit and predict once on your data to fit a whole sequence of estimators
  2. Joint parameter selection: grid search over parameters of all estimators in the pipeline at once
  3. Safety: cross-validating the whole pipeline helps avoid leaking statistics from the test data into the trained model
  • 4.1.1.1 Usage
    • Pipeline: the constructor takes a list of (name, estimator) tuples (see the sketch after this list)
    • make_pipeline:  a shorthand for constructing pipelines.  it takes a variable number of estimators and returns a pipeline, filling in the names automatically:
      • named_steps: Bunch object, a dictionary with attribute access. Read-only attribute to access any step by its user-given name; keys are step names, and values are the step estimators
      • steps: list, list of (name,transform) tuples
  • 4.1.1.2 Notes:
    • Calling fit on the pipeline is the same as calling fit on each estimator in turn, transforming the input and passing it on to the next step.
  • 4.1.1.3 Caching transformers: avoid repeated computation
    • Side effect of caching transformers
      • Using a pipeline without caching, the transformer instances passed to it can be inspected directly after fitting. With caching enabled, the transformers are cloned before fitting, so the original instances cannot be inspected; use the pipeline's named_steps attribute instead.
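
The sketch below puts these pieces together; the choice of scaler, PCA, and classifier is arbitrary, the data is just scikit-learn's iris sample, and the cache directory from mkdtemp is purely for illustration:

```python
from tempfile import mkdtemp
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Explicit constructor: a list of (name, estimator) tuples
pipe = Pipeline([("scale", StandardScaler()),
                 ("pca", PCA(n_components=2)),
                 ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X, y)                                   # fit/transform each step in turn, then fit the classifier
print(pipe.named_steps["pca"].explained_variance_ratio_)

# make_pipeline fills in lower-cased step names automatically
pipe2 = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression(max_iter=1000))

# Caching: fitted transformers are stored in the memory directory; they are cloned
# before fitting, so inspect them through named_steps rather than the original objects
cached_pipe = Pipeline(pipe.steps, memory=mkdtemp())
```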

4.1.2 Transforming target in regression

4.1.3 FeatureUnion: composite feature spaces

  1. FeatureUnion combines several transformer objects into a new transformer that combines their output.
  2. FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. The transformers are applied in parallel, and the feature matrices they output are concatenated side-by-side into a larger matrix.
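
A minimal sketch on the iris data; combining PCA with SelectKBest is an arbitrary choice for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

# Each transformer is fit to the data independently; their outputs are concatenated column-wise
union = FeatureUnion([("pca", PCA(n_components=2)),
                      ("kbest", SelectKBest(k=1))])
X_combined = union.fit_transform(X, y)   # shape: (n_samples, 2 + 1)
```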

4.1.4 ColumnTransformer for heterogeneous data

ColumnTransformer helps perform different transformations on different columns of the data within a pipeline, in a way that is safe from data leakage and can be parametrized. ColumnTransformer works on arrays, sparse matrices, and pandas DataFrames.
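
A minimal sketch on a made-up DataFrame, applying one-hot encoding to a categorical column and scaling to a numeric one:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data purely for illustration
df = pd.DataFrame({"city": ["London", "Paris", "London"],
                   "temperature": [20.0, 25.0, 18.0]})

# Different transformers are applied to different columns, selected by name
ct = ColumnTransformer([("onehot", OneHotEncoder(), ["city"]),
                        ("scale", StandardScaler(), ["temperature"])])
X_t = ct.fit_transform(df)
```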

4.2 Feature extraction

The sklearn.feature_extraction module can be used to extract features, in a format supported by machine learning algorithms, from datasets consisting of formats such as text and images.

4.2.1 Loading features from dicts
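
This section of the scikit-learn docs is about DictVectorizer; a minimal sketch, with the measurement dicts made up for illustration:

```python
from sklearn.feature_extraction import DictVectorizer

measurements = [{"city": "Dubai", "temperature": 33.0},
                {"city": "London", "temperature": 12.0}]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(measurements)   # categorical "city" is one-hot encoded, numbers pass through
print(vec.get_feature_names_out())    # requires a recent scikit-learn (>= 1.0)
```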

4.3 Preprocessing Data

The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

4.3.1 Standardization, or mean removal and variance scaling

  1. Standardization: scaled data has zero mean and unit variance  --scale//StandardScaler
  2. Scaling features to a range: an alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size.   --MinMaxScaler, MaxAbsScaler
  3. Scaling sparse data: centering sparse data would destroy its sparsity, so MaxAbsScaler and maxabs_scale are the recommended way to scale it
  4. Scaling data with outliers  --robust_scale//RobustScaler
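
A minimal sketch of these scalers on a small made-up array:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0, -1.0], [2.0, 0.0], [0.0, 1.0]])   # toy data

X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per column
X_01  = MinMaxScaler().fit_transform(X)     # each column scaled to [0, 1]
X_abs = MaxAbsScaler().fit_transform(X)     # maximum absolute value per column becomes 1
X_rob = RobustScaler().fit_transform(X)     # centers on the median and scales by the IQR
```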

4.3.2 Non-linear transformation

  1. Mapping to a uniform distribution: QuantileTransformer:
  2. Mapping to a Gaussian distribution: PowerTransformer
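
A minimal sketch on skewed made-up data; the n_quantiles value and the yeo-johnson method are arbitrary choices:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

X = np.random.RandomState(0).lognormal(size=(100, 1))   # skewed toy data

X_uniform  = QuantileTransformer(n_quantiles=50).fit_transform(X)      # maps to a uniform distribution
X_gaussian = PowerTransformer(method="yeo-johnson").fit_transform(X)   # maps towards a Gaussian
```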

4.3.3 Normalization

Normalization is the process of scaling individual samples to have unit norm.  --normalize//Normalizer
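
A minimal sketch on a made-up array:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

X = np.array([[4.0, 3.0], [1.0, 1.0]])

X_rows  = normalize(X, norm="l2")                  # each row rescaled to unit L2 norm
X_rows2 = Normalizer(norm="l2").fit_transform(X)   # the same operation as a transformer class
```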

4.3.4 Encoding categorical features

  1. OrdinalEncoder: transforms each categorical feature to one new feature of integers (0 to n_categories - 1)
  2. OneHotEncoder: transforms each categorical feature with n_categories possible values into n_categories binary features, one of which is 1 and the rest 0
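
A minimal sketch on made-up categorical data:

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = [["male", "from US"], ["female", "from Europe"]]   # toy categorical data

X_ord = OrdinalEncoder().fit_transform(X)            # integers 0..n_categories-1 per column
X_hot = OneHotEncoder().fit_transform(X).toarray()   # one binary column per category
```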

4.3.5 Discretization

Provides a way to partition continuous features into discrete values.

  1. K-bins discretization: KBinsDiscretizer discretizes features into k bins (equal-width, quantile-based, or k-means, depending on the strategy parameter)
  2. Feature binarization: is the process of thresholding numerical features to get boolean values. //Binarizer
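
A minimal sketch on a made-up array; the bin count, encoding, strategy, and threshold are arbitrary choices:

```python
import numpy as np
from sklearn.preprocessing import Binarizer, KBinsDiscretizer

X = np.array([[-3.0, 5.0], [0.0, 6.0], [6.0, 3.0]])

disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")   # 3 equal-width bins per feature
X_binned = disc.fit_transform(X)

X_bool = Binarizer(threshold=1.0).fit_transform(X)   # 1 where value > threshold, else 0
```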

4.3.6 Imputation of missing values

  1. SimpleImputer: this class provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using the statistics (mean, median or most frequent) of each column in which the missing values are located. This class also allows for different missing-value encodings.
  2. MissingIndicator: this transformer turns a dataset into the corresponding binary matrix indicating the presence of missing values. This transformation is useful in conjunction with imputation, since preserving the information about which values had been missing can be informative.
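
A minimal sketch on a made-up array containing NaNs:

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

X_imputed = SimpleImputer(strategy="mean").fit_transform(X)   # NaNs replaced by the column means
mask = MissingIndicator().fit_transform(X)                    # boolean matrix marking the missing entries
```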

4.3.7 Generating polynomial features: PolynomialFeatures
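
A minimal sketch on a small made-up matrix:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)                           # [[0, 1], [2, 3], [4, 5]]
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
# output columns: 1, x1, x2, x1^2, x1*x2, x2^2
```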

4.3.8 Custom transformers

FunctionTransformer can be used to build a transformer from an arbitrary callable.
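
A minimal sketch wrapping np.log1p (an arbitrary choice of function) as a transformer:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

transformer = FunctionTransformer(np.log1p)   # stateless transformer usable inside pipelines
X = np.array([[0.0, 1.0], [2.0, 3.0]])
X_t = transformer.fit_transform(X)
```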

4.4 Imputation of missing values --> see 4.3.6 above

4.5 Unsupervised dimensionality reduction

  1. PCA (principal component analysis): sklearn.decomposition.PCA can be used for feature reduction
  2. Random projections: random_projection module
  3. Feature agglomeration:  FeatureAgglomeration
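
A minimal sketch of PCA-based reduction on the iris data (the target dimension of 2 is an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_reduced = PCA(n_components=2).fit_transform(X)   # 4 features projected onto 2 components
```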

4.6 Random projection

https://scikit-learn.org/stable/modules/random_projection.html

  1. Implements a simple and computationally efficient way to reduce the dimensionality of the data by trading a controlled amount of accuracy for faster processing times and smaller model sizes.

  2. Gaussian random matrix
  3. Sparse random matrix

4.6.1 The Johnson-Lindenstrauss lemma

The lemma states that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that the distances between the points are nearly preserved.

4.6.2 Gaussian random projection

Reduces the dimensionality by projecting the original input space onto a randomly generated matrix whose components are drawn from N(0, 1/n_components).

4.6.3 Sparse random projection

Reduces the dimensionality by projecting the original input space using a sparse random matrix
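
A minimal sketch on made-up high-dimensional data; the shapes, the target dimension of 500, and eps=0.1 are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.random_projection import (GaussianRandomProjection,
                                        SparseRandomProjection,
                                        johnson_lindenstrauss_min_dim)

X = np.random.RandomState(0).rand(100, 5000)   # toy high-dimensional data

# Lower bound on the target dimension that keeps pairwise distances within a factor (1 +/- eps)
print(johnson_lindenstrauss_min_dim(n_samples=100, eps=0.1))

X_gauss  = GaussianRandomProjection(n_components=500, random_state=0).fit_transform(X)
X_sparse = SparseRandomProjection(n_components=500, random_state=0).fit_transform(X)
```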

4.7 Kernel approximation

This submodule contains functions that approximate the feature mappings that correspond to certain kernels, as they are used for example in SVM.

  1. Nystroem method for kernel approximation: a general method for low-rank approximations of kernels
  2. Radial basis function kernel: RBFSampler constructs an approximate mapping for the RBF kernel
  3. Additive chi-squared kernel: a kernel on histograms, often used in computer vision
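
A minimal sketch combining an approximate RBF feature map with a linear classifier on the digits data; the gamma and n_components values are arbitrary choices:

```python
from sklearn.datasets import load_digits
from sklearn.kernel_approximation import Nystroem, RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X = X / 16.0                               # scale pixel values to [0, 1]

# Map the data through an approximate RBF kernel feature space, then fit a linear model
nystroem_model = make_pipeline(Nystroem(gamma=0.2, n_components=300), SGDClassifier(max_iter=1000))
rbf_model = make_pipeline(RBFSampler(gamma=0.2, n_components=300), SGDClassifier(max_iter=1000))
nystroem_model.fit(X, y)
rbf_model.fit(X, y)
```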

4.8 Pairwise metrics, affinities, and kernels

This submodule implements utilities to evaluate pairwise distances or affinity of sets of samples.

  • Cosine similarity: computes the L2-normalized dot product of vectors
  • Linear kernel:
  • Polynomial kernel
  • Sigmoid kernel
  • Rbf kernel
  • Laplacian kernel
  • Chi-squared kernel
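
A minimal sketch evaluating a few of these pairwise functions on made-up vectors; the degree and gamma values are arbitrary:

```python
import numpy as np
from sklearn.metrics.pairwise import (cosine_similarity, linear_kernel,
                                      polynomial_kernel, rbf_kernel)

X = np.array([[2.0, 3.0], [3.0, 5.0], [5.0, 8.0]])
Y = np.array([[1.0, 0.0], [2.0, 1.0]])

S_cos  = cosine_similarity(X, Y)             # (3, 2) matrix of cosine similarities
K_lin  = linear_kernel(X, Y)                 # plain dot products
K_poly = polynomial_kernel(X, Y, degree=2)
K_rbf  = rbf_kernel(X, Y, gamma=0.5)
```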

4.9 Transforming the prediction target (y)

Reposted from blog.csdn.net/Emma_Love/article/details/84862028