sklearn datasets and feature engineering

scikit-learn dataset

Notes organized from the "dark horse" (Heima) machine learning tutorial

Available datasets in the learning phase

  • learn
  • Kaggle
  • scikit-learn

sklearn datasets

load

  • datasets.load_*(): get the small-scale datasets that ship with sklearn
  • datasets.fetch_*(data_home=None): get large-scale datasets that must be downloaded from the Internet; data_home is the download directory for the dataset
from sklearn import datasets

iris = datasets.load_iris()
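
The fetch_* functions download data on first use. A minimal sketch, assuming an internet connection (fetch_20newsgroups is used here only as an illustrative large dataset, not one the tutorial itself runs):

from sklearn import datasets

# downloads to data_home on first use; data_home=None means the default
# directory (~/scikit_learn_data)
news = datasets.fetch_20newsgroups(data_home=None, subset="train")
print(len(news.data))       # number of documents
print(news.target_names)    # category names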

return value

Both return data of type datasets.base.Bunch (which inherits from dict):

  • data: feature data array, a two-dimensional numpy.ndarray
  • target: label array
  • DESCR: description of the dataset
  • feature_names: feature names
  • target_names: label names
print(iris)
print(iris["DESCR"])                 # dataset description
print(iris.feature_names)            # feature names
print(iris.data, iris.data.shape)    # feature data and its shape

Data set partitioning

Generally divided into 2 parts:

  • Training data: used for training and building the model, generally 70%~80% of the data
  • Test data: used during model testing to evaluate whether the model is reasonable

sklearn divides the dataset

sklearn.model_selection.train_test_split()

Parameters and return value:

  • x: feature values of the dataset
  • y: label (target) values of the dataset
  • test_size: size of the test set, generally a float (the proportion of samples used for testing)
  • random_state: random number seed; splits produced with the same seed are identical
  • return: training set features, test set features, training set targets, test set targets, in that order
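
A minimal usage sketch with the iris data loaded earlier (test_size=0.2 and random_state=22 are just illustrative values):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
# 80% of the samples go to training, 20% to the test set
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=22)
print(x_train.shape, x_test.shape)   # (120, 4) (30, 4)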

feature engineering

Processing the data with domain-specific knowledge and techniques; the quality of feature engineering determines the upper limit of the final model's performance.

Feature extraction

Convert images, text, etc. into numerical features that machine learning algorithms can use.

For example, for article classification the text is converted into numerical values and then processed by an algorithm. Dictionary feature extraction and text feature extraction are described below.

Dictionary feature extraction

sklearn.feature_extraction

from sklearn import feature_extraction

data = [{'city': 'bj', 'temp': 100},
        {'city': 'sh', 'temp': 200},
        {'city': 'sz', 'temp': 30}]
transfer = feature_extraction.DictVectorizer()   # instantiate a transformer
data_new = transfer.fit_transform(data)
print(data_new)                                  # sparse matrix by default
print(transfer.get_feature_names())              # feature names

By default a sparse matrix is returned (it stores only the positions and values of the non-zero entries, which saves memory and improves loading efficiency). To get a regular dense array instead, instantiate the transformer as:
transfer = feature_extraction.DictVectorizer(sparse=False)

The first (default, sparse) code block prints:

(0, 0)	1.0
(0, 3)	100.0
(1, 1)	1.0
(1, 3)	200.0
(2, 2)	1.0
(2, 3)	30.0
['city=bj', 'city=sh', 'city=sz', 'temp']

As can be seen, the categorical feature is converted into one-hot encoded columns, and the sparse representation records only the position of each non-zero value in the matrix, which saves memory and improves loading efficiency.

Application scenarios:

  1. The dataset contains many categorical features (see the sketch after this list)
    1. Convert the dataset features to dictionary type
    2. Transform them with DictVectorizer
  2. The data itself is of dictionary type
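
For scenario 1, a common workflow is to convert each row of a tabular dataset into a dict and then apply DictVectorizer. A minimal sketch, assuming a small hypothetical pandas DataFrame built here only for illustration:

import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# hypothetical toy table with one categorical column
df = pd.DataFrame({"city": ["bj", "sh", "sz"], "temp": [100, 200, 30]})

# step 1: convert each row of the DataFrame into a dictionary
records = df.to_dict(orient="records")

# step 2: one-hot encode the categorical features
transfer = DictVectorizer(sparse=False)
print(transfer.fit_transform(records))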

Text Feature Extraction

For example, to analyze the type of an article, words, phrases, sentences, etc. can be used as features.

data = ["life is short", "life is too long"]
transfer = feature_extraction.text.CountVectorizer()
data_new = transfer.fit_transform(data)
print(data_new)
print(data_new.toarray())
print(transfer.get_feature_names())

The output is:

(0, 1)	1
(0, 0)	1
(0, 5)	1
(0, 2)	1
(0, 4)	1
(1, 1)	1
(1, 0)	1
(1, 6)	1
(1, 3)	1
[[1 1 1 0 1 1 0]
 [1 1 0 1 0 0 1]]
['is', 'life', 'like', 'long', 'python', 'short', 'too']

The upper part is the sparse matrix form and the lower part is the dense array form (note that, unlike DictVectorizer, converting to a dense array here requires calling toarray()); the matrix stores the number of occurrences of each word.

Feature preprocessing

The process of converting feature data, through certain transformation functions, into a form better suited to the algorithm and model.

Normalization

By transforming the original data, the values are mapped (by default) into [0, 1]: $\displaystyle x' = \frac{x - \min}{\max - \min}$. The sklearn API:

from sklearn import preprocessing

transfer = preprocessing.MinMaxScaler()   # default feature_range is (0, 1)
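
A minimal usage sketch (the toy numbers are purely illustrative):

from sklearn import preprocessing

# toy data: two features with very different ranges (illustrative only)
data = [[90, 2], [60, 4], [75, 3]]
transfer = preprocessing.MinMaxScaler()
data_new = transfer.fit_transform(data)
print(data_new)   # each column's minimum maps to 0 and its maximum maps to 1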

Normalization has some problems: because it relies on the minimum and maximum values, it is easily affected by outliers and has poor robustness, so it is only suitable for traditional small-scale, clean data scenarios.

Standardization

The processing is: $\displaystyle x' = \frac{x - E(x)}{\sigma}$, where $E(x)$ is the sample mean and $\sigma$ is the sample standard deviation.
sklearn's API:

from sklearn import preprocessing

transfer = preprocessing.StandardScaler()
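
A usage sketch parallel to the normalization example (the toy numbers are illustrative):

from sklearn import preprocessing

# illustrative toy data
data = [[1, 2], [3, 4], [5, 6]]
transfer = preprocessing.StandardScaler()
data_new = transfer.fit_transform(data)
print(data_new)         # each column now has mean 0 and standard deviation 1
print(transfer.mean_)   # the per-column means learned from the data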

It is suitable for modern, noisy, large-scale data scenarios where there are enough samples.

Feature dimensionality reduction

Dimensionality reduction

Dimensionality reduction: reduce the number of random variables (features, i.e. the number of matrix columns) so that the remaining features are mutually uncorrelated (a concept from probability theory).
There are mainly two approaches:

  1. Filter: mainly examines the characteristics of the features themselves and the correlation between features, as well as between features and the target value
    1. Variance selection method: filter out low-variance features
    2. Correlation coefficient: measures the degree of correlation between features
  2. Embedded: the algorithm selects features automatically (based on the association between features and the target value)
    1. Decision tree: information entropy, information gain
    2. Regularization: L1, L2
    3. Deep learning

First, a closer look at the two filter methods:

  1. Variance selection method:

from sklearn import feature_selection

# instantiate a filter that removes features whose variance is below the threshold
filters = feature_selection.VarianceThreshold(threshold=0.2)

  2. Correlation coefficient (Pearson):

    $\displaystyle r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$

    You can also call the corresponding APIs directly (see the sketch below).
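
To make both filters concrete, a minimal sketch; the toy matrix is illustrative, and the Pearson correlation API used here is scipy.stats.pearsonr (not named in the tutorial itself):

import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import VarianceThreshold

# illustrative toy feature matrix: the middle column is constant (zero variance)
X = np.array([[0.0, 2.0, 1.0],
              [1.0, 2.0, 2.0],
              [0.0, 2.0, 3.0],
              [1.0, 2.0, 4.0]])

# 1. variance selection: drop features whose variance falls below the threshold
X_filtered = VarianceThreshold(threshold=0.2).fit_transform(X)
print(X_filtered.shape)   # the constant column is removed

# 2. correlation coefficient between the first and third features
r, p_value = pearsonr(X[:, 0], X[:, 2])
print(r)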

Principal Component Analysis (PCA)

The process of converting high-dimensional data into low-dimensional data; along the way, original features may be discarded and new variables created.
It compresses the dimensionality of the data, reducing the dimension (complexity) of the original data as much as possible while losing as little information as possible.

from sklearn import decomposition
pca = decomposition.PCA(n_components=0.5)

n_components:

  • Decimal: the fraction of information (variance) to keep
  • Integer: the number of features to reduce to
  • 'mle': let the MLE (maximum likelihood estimation) algorithm choose the dimension automatically
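
A minimal sketch of PCA on a toy matrix (the data values and the n_components settings are illustrative):

from sklearn import decomposition

# illustrative toy data: 3 samples with 4 features
data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]

# decimal: keep roughly 95% of the information (variance)
pca = decomposition.PCA(n_components=0.95)
print(pca.fit_transform(data))

# integer: reduce to exactly 2 features
pca2 = decomposition.PCA(n_components=2)
print(pca2.fit_transform(data))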

When I ran this locally, PyCharm and Jupyter Notebook gave different results: PyCharm reduced the data to 1 dimension, while the Jupyter Notebook result was as expected.


Origin blog.csdn.net/qq_43550173/article/details/116407116