Machine Learning - Basic Concepts - scikit-learn - Data Preprocessing - 1

Table of contents

Install scikit-learn

Remember to install inside a virtual environment; Anaconda3 is recommended here

pip install scikit-learn

Link: Windows 10 - Python's virtual environment Virtualenv - global python environment switching problem

Note that the importable core module of the scikit-learn framework is sklearn, not scikit:

import sklearn

Test environment: (please note that this is a virtual environment Virtualenv)

OS: Windows 10
IDE: PyCharm
Python: 3.7
scikit-learn: 1.0.2
numpy: 1.21.6
scipy: 1.7.3
threadpoolctl: 3.1.0
joblib: 1.1.0

Terminology understanding

1. What is the difference between a feature and a sample?

  1. A sample consists of multiple features; a feature is one element of a sample.
  2. In data processing, setting the axis parameter to 0 or 1 selects whether an operation runs down the columns (per feature, vertical) or across the rows (per sample, horizontal).
  3. Samples are the horizontal (row) elements; features are the vertical (column) elements. That is, with axis = 0 an operation walks the vertical feature elements, and with axis = 1 it walks the horizontal sample elements. For example, given samples A and B that both have features a, b, and c: with axis = 0 the values are taken per feature as [Aa, Ba], [Ab, Bb], [Ac, Bc]; with axis = 1 they are taken per sample as [Aa, Ab, Ac], then [Ba, Bb, Bc].

Specific demonstration:

0|1		a		b		c
Sample A	Aa		Ab		Ac
Sample B	Ba		Bb		Bc
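The axis convention above can be checked with a small NumPy sketch (the numeric values below are made up purely for illustration):

```python
import numpy as np

# Rows are samples (A, B); columns are features (a, b, c)
X = np.array([[1.0, 2.0, 3.0],   # sample A: [Aa, Ab, Ac]
              [4.0, 5.0, 6.0]])  # sample B: [Ba, Bb, Bc]

# axis=0 walks down each column: one result per feature (vertical)
col_min = X.min(axis=0)
# axis=1 walks across each row: one result per sample (horizontal)
row_sum = X.sum(axis=1)

print(col_min)  # [1. 2. 3.]
print(row_sum)  # [ 6. 15.]
```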

  1. All input values x in a matrix are ultimately transformed by an algorithm into the output values y.

2. About the concept of model

The so-called machine learning model is essentially a function whose job is to realize the mapping $f(x) \rightarrow y$ from a sample $X$ to its label value $y$.
In general terms: a function that can be learned from data and realizes a specific function (mapping).
A more formal summary: a model is the input-to-output mapping learned by fixing a learning strategy within a specified hypothesis space and training with an optimization algorithm.

In real life we can see plastic models of characters and machines; each is effectively a mapping from an idea x in someone's mind to a physical model y, and 3D models are the same. Is a machine learning model like that too? Naturally: it maps known data x to unknown data y to construct a predictive model, and the model is built through supervised, unsupervised, reinforcement, and other learning strategies, with algorithms as the tool.

Practical understanding: like a function $f(x)$ in a programming language, the input is the matrix $X$, i.e. the samples, and the return value is the model's transformation of the samples, $Y$, a predicted value.

1. The concept of machine learning

Machine learning is divided into three types of learning:

  1. supervised learning
  2. unsupervised learning
  3. reinforcement learning

1. Supervised learning

The task of supervised learning (Supervised Learning) is to learn a model that, for any given input, makes a good prediction of the corresponding output.
That is: use the training data set to learn a model, then use that model to make predictions on the test sample set.

Popular understanding: Each data point is labeled or associated with a category or score.

Example (category): Input a picture and judge whether the animal in the picture is a cat or a dog;

Example (score): Predict the sale price of a used car through a large amount of data;

The purpose of supervised learning is to learn a large number of samples (called training data) to make predictions about future data points (called test data).

Classification and regression: fundamentally, classification predicts a label, and regression predicts a quantity.

  • Classification is the problem of predicting a discrete class label for a sample.

  • Regression is the problem of predicting a continuous output quantity given a sample.

The author's personal understanding of this quote: the model has input values $X$ and output values $Y$; once the general pattern inside the model has been estimated, that statistical pattern can be used to predict the likely output $Y$ for other input values $X$. Of course this is actually quite tricky: in the real world, outcomes cannot be predicted from the pattern of a single model alone, so one can only say the more models the better.

For classification, the author thinks of it as point prediction: predicted point by point, not like a line;
regression is line-like prediction; for example, the roughly linear movement of a stock can be predicted. That is the author's personal view.

Summarize:

In supervised learning, people must find data to feed the model and keep observing how accurate it is; that is, they supervise the performance and accuracy of the algorithm. It is like children who need adults to supervise their study: under supervision they study hard and their grades improve, where "grades" here means the performance of the algorithm and the accuracy of the model.
So tasks that require labeled data to train the model are supervised learning.

2. Unsupervised learning

Unsupervised learning (Unsupervised Learning) models the data directly: no pre-labeled training examples are given, the data carries no attribute or label concept, and the output corresponding to the input data is not known in advance.

Automatically classify or group the input data to find patterns and rules of the data.

Example: Clustering

Summarize:

In unsupervised learning, the children must teach themselves; they should not need us to supervise their study, so that they become independent. Because real life keeps changing, it is unlikely we will ever have a complete model of reality, or that we can supervise their learning in every situation. So the system needs the ability to learn on its own: it autonomously collects sample features from reality, feeds itself input variables, obtains model after model, and then evaluates each model's performance or accuracy.

3. Reinforcement Learning

Reinforcement learning (Reinforcement Learning) is a field of machine learning that emphasizes how to act based on the environment so as to maximize the expected benefit. Its inspiration comes from behaviorism in psychology: under the stimulus of rewards or punishments given by the environment, organisms gradually form expectations about stimuli and develop habitual behaviors that obtain the greatest benefit.

Summarize:

In reinforcement learning, once the children have learned to study independently, we give them an incentive: a learning reward scheme. They then form a conditioned behavior and readily do things that benefit them. For example, for an agent that keeps hitting a wall, a reward scheme that penalizes wall hits will push it to choose dodging instead; we can use this idea to design a machine learning algorithm that fits it, and that is reinforcement learning.

Summary of the characteristics of the three learning

Supervised learning, unsupervised learning, and reinforcement learning have different characteristics:

Supervised learning has labels; the label tells the algorithm which output corresponds to which input. Common algorithms are classification, regression, and so on.
Unsupervised learning has no labels; a common algorithm is clustering.
Reinforcement learning emphasizes how to act based on the environment to maximize the expected benefit.

scikit-learn instructions

The main functionality of the scikit-learn library is divided into six parts: classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.

Classification, regression --> supervised learning
Clustering --> unsupervised learning

2. The basic practical logic of machine learning

1. Collect data

Data collection will not be introduced here, as the author has not yet dabbled in it.

2. Data preprocessing (Preprocessing)

In the real world we often have to deal with a large amount of raw data that machine learning algorithms cannot understand directly. For the algorithms to make sense of it, the data must be preprocessed.

The so-called preprocessing, often loosely called normalization, is really about extracting the valuable content from messy raw data. Normalization or standardization is used here, along with:

  • Data splitting: divide the original data into training data and test data, where the test data is a portion of the original data held out for testing (this is very common in supervised learning)

Preprocessing algorithm:

Normalization:

1. normalize()
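A minimal sketch of what normalize() does by default: it rescales each sample (each row) to unit L2 norm. The matrix values here are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# Each row is divided by its own L2 norm (5.0 and 1.0 here),
# so the rows become [0.6, 0.8] and [1.0, 0.0]
X_l2 = normalize(X, norm="l2")
print(X_l2)
```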

3. Dimensionality reduction

Generally speaking, in machine learning, the more data you have, the more dimensions and factors you take into account, and the more accurate your classification and regression predictions will be. But the more you consider, the slower the computation becomes, so you must weigh prediction accuracy against computation speed: reduce the dimensionality, and thus the computation time, while preserving as much information as possible.

Reducing dimensionality also allows for better visualization. Beyond three dimensions, data is hard for people to grasp, so dimensionality reduction makes the data easier to visualize while improving computational efficiency (a fundamental concern of machine learning).

Dimensionality reduction algorithm:

4. Choose one of the three processing methods of Classification, Regression, and Clustering

Classification algorithm:

Regression algorithm:

Clustering Algorithm:

5. Model selection

3. Data preprocessing - data analysis

Detailed data normalization - normalization / standardization

Data normalization is basic work in data mining. Different evaluation indicators often have different dimensions (units), and their numeric values may differ greatly; left unprocessed, this can distort the results of data analysis. To eliminate the influence of differing dimensions and value ranges between indicators, the data must be standardized (preprocessed) and scaled proportionally (normalized) so that it falls into a specific range for comprehensive analysis. For example, the salary-income attribute can be mapped into $[-1, 1]$ or $[0, 1]$ (this is an example of normalization).

Data normalization is especially important for distance-based mining algorithms (algorithms that measure distances between data points).

Normalization refers to normalization or standardization

How to distinguish between normalization and standardization

Normalization and standardization are both methods of transforming data: they convert a column of original data into a certain range or form. Specifically:

Normalization: used when feature values need to be adjusted so that every feature is scaled to the same value range; a column of data is mapped into a fixed interval. Usually this interval is $[0, 1]$, but broadly speaking it can be any interval: data mapped to $[0, 1]$ can be mapped onward to other ranges, e.g. images to $[0, 255]$, and in other cases to $[-1, 1]$.
Standardization: transform the data into a distribution with mean 0 and standard deviation 1; remember, the result is not necessarily normally distributed.
Centering: there is also a processing step called centering, or zero-mean processing, which subtracts the mean of the data from each original value (in fact, the first step of the standardization above).

Sometimes you will see the term "standard normalization"; it is much the same thing. It is said to be standardization, but that definition has long been covered by the broader concept of normalization. Either name can be used, but you must be clear about the specific formula being implemented: the names are arbitrary, and when you implement them you have to check what is actually being done.
One is proportional scaling; the other is scaling after removing the mean.

Data normalization - range scaling (scale) and mapping

(scale: as a noun, a balance, ruler, or graduated range; as a verb, to resize or zoom proportionally)

Broadly speaking, standardization and normalization are both linear transformations of the data, so there is no need to be rigid about it: normalization does not have to map to $[0, 1]$. Map to $[0, 1]$ and then multiply by 255, and what do you get? So do not be bound by the concepts. The common methods are as follows:

1. Min-max normalization (Min-Max Normalization), range [0, 1] / range scaling (Scaling)

Function:

Min-max normalization, also called linear normalization or discrete normalization, is the most common form of normalization: a linear transformation of the original data that maps values into $[0, 1]$.

The conversion formula is as follows:

$$X_{new}=\frac{X_{i}-X_{min}}{X_{max}-X_{min}}, \quad \text{range } [0, 1]$$

  • $X_{i}$: the data to be normalized, usually a two-dimensional matrix
  • $X_{max}$: row vector of the maximum value in each column
  • $X_{min}$: row vector of the minimum value in each column
  • $X_{new}$: the proportional result; this is not yet the final value, see $X_{scaled}$ in the formula below

or

$$X_{std}=\frac{X - X.min(axis=0)}{X.max(axis=0) - X.min(axis=0)}$$

$$X_{scaled}=X_{std}\times(max-min)+min, \quad \text{range } [0, 1]$$

It sounds confusing at first, so let me explain:

  • $X$: the data to be normalized, usually a two-dimensional matrix, such as
[[4, 2, 3]
 [1, 5, 6]]
  • $X.min(axis=0)$: row vector of the minimum value in each column; in the example above, [1, 2, 3]

  • $X.max(axis=0)$: row vector of the maximum value in each column; in the example above, [4, 5, 6]

  • $max$: the maximum of the interval to map onto, default 1; change it as needed, don't be bound by it

  • $min$: the minimum of the interval to map onto, default 0; change it as needed, don't be bound by it

  • $X_{std}$: the proportional result

  • $X_{scaled}$: the final normalized result; to map onto the range $[0, 1]$, apply the last step to $X_{std}$ with $min = 0$

Let's describe what the formulas above do in plain language:

  • Step 1: for each element, compute the ratio of its distance from the column minimum to the distance between the column maximum and minimum; this scales the data onto the interval $[0, 1]$.
  • Step 2: proportionally map the resulting ratios onto the specified $[min, max]$ interval.
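The two steps can be sketched with NumPy and checked against scikit-learn's MinMaxScaler, reusing the example matrix from the bullets above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[4.0, 2.0, 3.0],
              [1.0, 5.0, 6.0]])

# Step 1: per-column ratio, landing on [0, 1]
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# Step 2: map onto the target interval [min, max]; with the
# default (0, 1) this step changes nothing
lo, hi = 0, 1
X_scaled = X_std * (hi - lo) + lo

# MinMaxScaler performs the same two steps
X_sk = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X)
print(X_scaled)  # [[1. 0. 0.]
                 #  [0. 1. 1.]]
```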

2. Mean normalization (mean normalization), range [-1, 1]:

To convert to $[-1, 1]$:

$$X_{std}=\frac{X-X_{mean}}{X_{max}-X_{min}}$$

$$X_{scaled}=X_{std}\times(max-min)+min, \quad \text{range } [-1, 1]$$

  • $X_{std}$: the mean-removed proportional result
  • $X_{mean}$: the mean of each column of $X$
  • $X.min(axis=0)$: row vector of the minimum value in each column
  • $X.max(axis=0)$: row vector of the maximum value in each column
  • $X_{scaled}$: the final normalized result; the mapping essentially rescales the values of $X_{std}$ onto the range $[-1, 1]$
  • $max$: the maximum of the interval to map onto, default 1; change it as needed, don't be bound by it
  • $min$: the minimum of the interval to map onto, default 0; change it as needed, don't be bound by it
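scikit-learn has no dedicated mean-normalization transformer, so here is a minimal NumPy sketch of the $X_{std}$ formula above, on the same example matrix. Note that the ratio itself already falls strictly inside $(-1, 1)$, since no value can be further from the column mean than the full column range:

```python
import numpy as np

X = np.array([[4.0, 2.0, 3.0],
              [1.0, 5.0, 6.0]])

# Subtract each column's mean, divide by the column range (max - min)
X_std = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_std)  # [[ 0.5 -0.5 -0.5]
              #  [-0.5  0.5  0.5]]
```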

Application scenarios of the first two normalization methods:

  • These methods (and other normalization methods, excluding the Z-score method) can be used when no distance measures or covariance computations are involved and the data does not follow a normal distribution. For example, in image processing, after converting an RGB image to grayscale, the values are confined to the range $[0, 255]$.

Scenarios where the first two normalization methods do not apply:

  • When the original data contains a few very large or very small values, most of the data will end up close to 0 or 1 after normalization, with little discrimination; for example, the group 1, 1.2, 1.3, 1.4, 1.5, 1.6, 8.4 normalizes to a set of values mostly close to 0. Also, if future values fall outside the attribute's current $[min, max]$ range, the mapping breaks down and $min$ and $max$ must be re-determined.

3. Normalization by decimal scaling (normalization by decimal scaling)

Function:

By moving the decimal point of the attribute values, the values are mapped into [-1, 1]; how many places the decimal point moves depends on the maximum absolute value of the attribute.

The conversion formula is:

$$X_{new}=\frac{X}{10^k} \quad (\text{original value} / 10^k)$$

  • $k$ depends on the maximum absolute value of the attribute values in $X$
  • Decimal-scaling normalization normalizes by moving the position of the decimal point.
  • How many places to move the decimal point depends on the maximum absolute value among the attribute values in $X$.

Here, the attributes of $X$ mean attributes of the sample instances, such as length, width, quantity, etc.
In other words, find the element of the matrix with the largest absolute value, then take the base-10 logarithm of that maximum: $log_{10}\, max(|X|) = k$. Note that $k$ must be rounded up; one way to do that is the ceil() function in the numpy module. Full formula: $k = ceil(log_{10}\, max(abs(X)))$, where abs() is the absolute-value function.
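A minimal NumPy sketch of decimal scaling, using the small group of values mentioned earlier:

```python
import numpy as np

X = np.array([1.0, 1.2, 1.3, 1.4, 1.5, 1.6, 8.4])

# k = ceil(log10(max |x|)): how many places to shift the decimal point
k = int(np.ceil(np.log10(np.max(np.abs(X)))))
X_new = X / 10 ** k

print(k)      # 1
print(X_new)  # all values now fall inside [-1, 1]
```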

When to use normalization?

  • If there is a requirement for the range of output results , use normalization.
  • If the data is relatively stable and there are no extreme maximum and minimum values , use normalization.

Data standardization (std)

1. Zero-mean standardization (z-score standardization) / mean removal (Mean removal)

Usually we remove the mean of each feature so that each feature has mean 0 (i.e., standardization). Doing so removes the bias between features.

Function:

Zero-mean standardization is also called standard-deviation standardization. The processed data has mean 0 and standard deviation 1. It is currently the most commonly used data standardization method.

The conversion formula is: (original value - mean) / standard deviation

$$X_{new}=\frac{X-X_{mean}}{X_{std}}$$

Explanation of symbols:

  • $X_{new}$: the standardized value
  • $X_{mean}$: the mean of $X$
  • $X_{std}$: the standard deviation of $X$

Significance:

  • The transformed data has a mean of 0 and a variance of 1
  • The results have no practical significance and are only used for comparison
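A minimal sketch of the z-score formula on the earlier example matrix, checked against scikit-learn's StandardScaler (which, note, uses the population standard deviation, ddof=0):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[4.0, 2.0, 3.0],
              [1.0, 5.0, 6.0]])

# (original value - column mean) / column standard deviation
X_new = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler computes the same z-score column by column
X_sk = StandardScaler().fit_transform(X)

print(X_new.mean(axis=0))  # each column now has mean 0
print(X_new.std(axis=0))   # and standard deviation 1
```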

Application scenario:

  • In classification and clustering algorithms, when distance is needed to measure similarity, or when PCA is used for dimensionality reduction, Z-score standardization performs better.

When to use standardization?

  • If the data contains outliers and a lot of noise, standardization, via centering, can indirectly avoid the influence of outliers and extreme values.

Links on normalization and standardization:

How to understand normalization?

max-min normalization

Standardization and normalization, do not confuse them, thoroughly understand data transformation

Commonly used data normalization methods: min-max normalization, zero-mean normalization, etc.

Data Mining Experiment (1) Data Normalization [Minimum-Maximum Normalization, Zero-Mean Normalization, Decimal Scaling Normalization]

[Machine Learning] Data Normalization - MinMaxScaler Understanding

The understanding of axis=0 axis=1 in python

Reference link:

6_Python Machine Learning Library Scikit-Learn Introduction

Next chapter jump link

Machine learning - scikit-learn - data preprocessing - 2

Origin blog.csdn.net/qq_42701659/article/details/124446453