scikit-learn dataset
Notes organized from the Dark Horse machine learning tutorial
Available datasets in the learning phase
- learn
- kaggle
- scikit-learn
sklearn datasets
load
datasets.load_*()
: Get small-scale datasets that ship with scikit-learn.
datasets.fetch_*(data_home=None)
: Get large-scale datasets, which need to be downloaded from the Internet. data_home is the dataset download directory (by default ~/scikit_learn_data).
from sklearn import datasets
iris = datasets.load_iris()
return value
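Both load_* and fetch_* return a sklearn.utils.Bunch object (older versions expose it as datasets.base.Bunch). It inherits from the dictionary type, so its keys can also be read as attributes:
- data: feature data array, a two-dimensional numpy.ndarray
- target: array of labels, a numpy.ndarray
- DESCR: data description
- feature_names: feature names
- target_names: label names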
print(iris)
print(iris["DESCR"])  # dataset description
print(iris.feature_names)
print(iris.data, iris.data.shape)
Data set partitioning
Generally divided into 2 parts:
- Training data: used for training and building the model, generally 70%~80% of the data
- Test data: used to evaluate whether the model is effective, generally the remaining 20%~30%
sklearn divides the dataset
sklearn.model_selection.train_test_split()
parameter | meaning |
---|---|
x | Feature values of the dataset |
y | Label (target) values of the dataset |
test_size | Size of the test set, generally a float (the fraction of samples) |
random_state | Random seed; the same seed produces the same split |
return | In order: training set features, test set features, training set targets, test set targets |
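A minimal sketch of the split described in the table, using the iris dataset loaded earlier. The 20% test size and the seed value 22 are arbitrary choices for illustration, not values from the tutorial.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Hold out 20% of the samples as the test set; random_state fixes the shuffle.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=22
)

print(x_train.shape, x_test.shape)  # (120, 4) (30, 4)
print(y_train.shape, y_test.shape)  # (120,) (30,)
```

Running the split twice with the same random_state should give identical subsets, which makes experiments reproducible.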
feature engineering
Use professional background knowledge and skills to process the data so that the features work better in machine learning algorithms. The quality of the features determines the upper limit of the final result.
Feature extraction
Convert images, text, etc. into numerical features that can be used for machine learning.
For example, to classify articles, the text is first converted into numerical values and then processed by an algorithm. Dictionary and text feature extraction are described in turn below.
Dictionary feature extraction
sklearn.feature_extraction
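from sklearn import feature_extraction
data = [
    {'city': 'bj', 'temp': 100},
    {'city': 'sh', 'temp': 200},
    {'city': 'sz', 'temp': 30},
]
transfer = feature_extraction.DictVectorizer()  # instantiate a transformer
data_new = transfer.fit_transform(data)
print(data_new)  # sparse matrix: (row, column) value
# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
print(transfer.get_feature_names_out().tolist())  # feature names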
By default, a sparse matrix is returned, which stores only the positions and values of the non-zero entries; this saves memory and improves loading efficiency. To get a dense array instead, create the transformer with sparse=False:
transfer = feature_extraction.DictVectorizer(sparse=False)
The above code returns:
(0, 0) 1.0
(0, 3) 100.0
(1, 1) 1.0
(1, 3) 200.0
(2, 2) 1.0
(2, 3) 30.0
['city=bj', 'city=sh', 'city=sz', 'temp']
The categorical feature city is converted by one-hot encoding into one column per category, and the sparse format records only the position and value of each non-zero entry in the matrix, which improves loading efficiency.
Application scenarios:
- There are many categorical features in the dataset
  - Convert the dataset features to dictionary type
  - Apply the DictVectorizer transformation
- The data itself is a dictionary type
Text Feature Extraction
For example, to analyze the type of an article, words, phrases, sentences, etc. can be used as features.
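data = ["life is short, i like python", "life is too long"]
transfer = feature_extraction.text.CountVectorizer()
data_new = transfer.fit_transform(data)
print(data_new)  # sparse matrix
print(data_new.toarray())  # dense array of word counts
# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
print(transfer.get_feature_names_out().tolist())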
The output is:
(0, 1) 1
(0, 0) 1
(0, 5) 1
(0, 2) 1
(0, 4) 1
(1, 1) 1
(1, 0) 1
(1, 6) 1
(1, 3) 1
[[1 1 1 0 1 1 0]
[1 1 0 1 0 0 1]]
['is', 'life', 'like', 'long', 'python', 'short', 'too']
The upper part is the sparse matrix form and the lower part is the dense array form. Note that the conversion differs from DictVectorizer: CountVectorizer has no sparse parameter, so toarray() is called on the result. Each entry in the matrix stores the number of times a word occurs in the corresponding text.
Feature preprocessing
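The process of converting feature data, through some transformation functions, into feature data better suited to the algorithm model.
Normalization
By transforming the original data, the data is mapped into a fixed range, [0, 1] by default.
sklearn API:
from sklearn import preprocessing
# MinMaxScaler is the transformer class; minmax_scale is a function that needs the data as an argument
transfer = preprocessing.MinMaxScaler(feature_range=(0, 1))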
Normalization has a drawback: the maximum and minimum values are easily affected by outliers, so the method is not robust and is only suitable for traditional small, precise datasets.
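The sensitivity to outliers can be seen in a short sketch. Min-max scaling computes x' = (x - min) / (max - min) per column, so a single extreme value stretches the range. The data below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature column; the last value is an artificial outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(x)
print(scaled.ravel())
```

The four ordinary values end up squeezed into a narrow band near 0, while the outlier alone occupies the top of the range.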
Standardization
The processing method is: $x' = \frac{x - E(x)}{\sigma}$
E(x) is the sample mean, and $\sigma$ is the sample standard deviation
sklearn's api:
from sklearn import preprocessing
transfer = preprocessing.StandardScaler()
Given enough samples, a few outliers have little effect on the mean and standard deviation, so standardization is suitable for modern noisy big data scenarios.
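A minimal sketch of standardization on made-up data, comparing StandardScaler with the formula computed by hand. Note that StandardScaler divides by the population standard deviation (NumPy's default, ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two feature columns on very different scales (illustrative values).
x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

transfer = StandardScaler()
x_std = transfer.fit_transform(x)

# The same transformation written out from the formula.
x_manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(transfer.mean_)   # per-column mean E(x)
print(transfer.scale_)  # per-column standard deviation
print(x_std)
```

After the transformation each column has mean 0 and standard deviation 1, regardless of its original scale.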
Feature dimensionality reduction
Dimensionality reduction
Dimensionality reduction: reduce the number of random variables (features, that is, the number of matrix columns) to obtain a set of features that are not correlated with each other (this draws on probability theory).
There are mainly two ways:
- Filter: mainly examines the characteristics of each feature itself, the relationships between features, and the relationship between features and the target value
  - Variance selection method: low-variance feature filtering
  - Correlation coefficient: the degree of correlation between features
- Embedded: the algorithm selects features automatically (based on the association between features and target values)
  - Decision tree: information entropy, information gain
  - Regularization: L1, L2
  - Deep learning
First understand the two types of filters:
- Variance selection method:
from sklearn import feature_selection
# instantiate the transformer
filters = feature_selection.VarianceThreshold(threshold=0.2)
- Correlation coefficient (Pearson correlation coefficient):
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$
You can also call an existing API, for example scipy.stats.pearsonr, instead of computing it by hand.
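Both filter methods can be tried on a toy matrix. This is a sketch under stated assumptions: the data is made up, pearson_r is a hypothetical helper that follows the correlation coefficient formula in the text term by term, and np.corrcoef is used only as a cross-check.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: column 0 is constant, column 2 is exactly 2 * column 1.
data = np.array([
    [1.0, 1.0, 2.0],
    [1.0, 2.0, 4.0],
    [1.0, 3.0, 6.0],
    [1.0, 4.0, 8.0],
])

# Keep only the features whose variance is greater than the threshold.
filtered = VarianceThreshold(threshold=0.2).fit_transform(data)
print(filtered.shape)


def pearson_r(x, y):
    """Pearson correlation coefficient computed from raw sums."""
    n = len(x)
    numerator = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    denominator = (np.sqrt(n * np.sum(x ** 2) - np.sum(x) ** 2)
                   * np.sqrt(n * np.sum(y ** 2) - np.sum(y) ** 2))
    return numerator / denominator


r = pearson_r(data[:, 1], data[:, 2])
print(r, np.corrcoef(data[:, 1], data[:, 2])[0, 1])
```

The constant column is dropped by the variance filter, and the two remaining columns have a correlation close to 1, which suggests that one of them is redundant.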
Principal Component Analysis (PCA)
The process of converting high-dimensional data into low-dimensional data; in the process, some of the original variables may be discarded and new variables created.
It compresses the dimensionality of the data, reducing the dimension (complexity) of the original data as much as possible while losing only a small amount of information.
from sklearn import decomposition
pca = decomposition.PCA(n_components=0.5)
n_components:
- Float between 0 and 1: the fraction of information (variance) to keep
- Integer: the number of features to reduce to
- 'mle': the dimension is selected automatically by maximum likelihood estimation (MLE)
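The float and integer forms of n_components can be compared on the iris dataset. This is a minimal sketch; the values 0.95 and 2 are arbitrary choices for illustration.

```python
from sklearn import datasets, decomposition

iris = datasets.load_iris()

# Float in (0, 1): keep enough components to explain at least 95% of the variance.
pca_ratio = decomposition.PCA(n_components=0.95)
reduced_ratio = pca_ratio.fit_transform(iris.data)
print(reduced_ratio.shape, pca_ratio.explained_variance_ratio_.sum())

# Integer: reduce to exactly this many features.
pca_int = decomposition.PCA(n_components=2)
reduced_int = pca_int.fit_transform(iris.data)
print(reduced_int.shape)
```

The explained_variance_ratio_ attribute shows how much of the variance each kept component accounts for, which helps when choosing between the two forms.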
When I ran it locally, PyCharm and Jupyter Notebook gave different results: PyCharm reduced the data to 1 dimension, while the Jupyter Notebook result was as expected.