[Machine Learning] Classification Algorithm Series ①: First Concept

Table of contents

1. Concept

2. Dataset introduction and division

2.1. The division of datasets

2.2. Introduction to the sklearn datasets

2.2.1. API

2.2.2. Classification and regression datasets

Classification datasets

Regression datasets

Return type

3. sklearn transformers and estimators

3.1. Transformer

The difference between the three methods

3.2. Estimator

3.2.1. Introduction

3.2.2. API

3.3. Workflow


1. Concept

The content to be mastered across the entire series:

  • Know that datasets are divided into training and test sets
  • Know the sklearn transformer and estimator workflow
  • Understand sklearn's classification and regression datasets
  • Explain the distance formula of the K-nearest neighbors algorithm
  • Explain the hyperparameter K of the K-nearest neighbors algorithm and how its value is chosen
  • Explain the advantages and disadvantages of the K-nearest neighbors algorithm
  • Apply KNeighborsClassifier to implement classification
  • Understand accuracy, the evaluation metric for classification algorithms
  • Explain the principle of the Naive Bayes algorithm
  • Explain the advantages and disadvantages of the Naive Bayes algorithm
  • Apply MultinomialNB to implement text classification
  • Apply model selection and tuning
  • Explain the principle of the decision tree algorithm
  • Explain the advantages and disadvantages of the decision tree algorithm
  • Apply DecisionTreeClassifier to implement classification
  • Explain the principle of the random forest algorithm
  • Explain the advantages and disadvantages of the random forest algorithm
  • Apply RandomForestClassifier to implement classification

When it comes to categorizing machine learning algorithms, we can generally divide them into the following main categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Each category has its own characteristics and applicable scenarios.

1. Supervised Learning: In supervised learning, the model learns from labeled training data; the goal is to predict the output label from the input features. The most common supervised learning algorithms include:

  • Regression: used to predict continuous-valued outputs, such as linear regression, ridge regression, Lasso regression, etc.
  • Classification: used to predict discrete category outputs, such as logistic regression, decision trees, support vector machines, random forests, etc.

2. Unsupervised Learning: In unsupervised learning, the model finds patterns and structure in unlabeled data, helping us understand the internal relationships of the data. Common unsupervised learning algorithms include:

  • Clustering: divides the data into groups, such as K-means clustering and hierarchical clustering.
  • Dimensionality Reduction: maps high-dimensional data into a low-dimensional space, such as principal component analysis (PCA) and independent component analysis (ICA).

3. Semi-Supervised Learning: Semi-supervised learning combines supervised and unsupervised learning, using both labeled and unlabeled data to train the model. This can be useful when labeling data is difficult.

4. Reinforcement Learning: In reinforcement learning, the model learns by interacting with an environment so as to maximize cumulative reward. It applies to problems that require a sequence of decisions, and mainly involves an agent, an environment, actions, and reward signals.

2. Dataset introduction and division

Learning targets:

  • Know that a dataset is divided into a training set and a test set
  • Know sklearn's classification and regression datasets

Is all of the data we obtain used to train the model?

2.1. The division of datasets

A machine learning dataset is generally divided into two parts:

  • Training data: used to train and build the model
  • Test data: used to evaluate whether the model is valid

Common division ratios:

  • Training set: 70%, 75%, or 80%
  • Test set: 30%, 25%, or 20%

API:

sklearn.model_selection.train_test_split(arrays, *options)

  1. x: the feature values of the dataset
  2. y: the label values of the dataset
  3. test_size: the size of the test set, usually a float
  4. random_state: the random seed; different seeds produce different random samples, while the same seed reproduces the same split
  5. Returns: training set features, test set features, training set labels, test set labels (sampled randomly by default)

The API is illustrated below, in combination with the datasets introduced in the next section.
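A minimal sketch of train_test_split, using the iris dataset (introduced in section 2.2); the random_state value 22 is arbitrary:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold out 25% of the samples as the test set; a fixed random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=22)

print(x_train.shape)  # (112, 4)
print(x_test.shape)   # (38, 4)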

2.2. Introduction to the sklearn datasets

2.2.1. API

sklearn.datasets:

Load and fetch popular datasets:

  • datasets.load_*(): obtains small-scale datasets; the data is bundled with the package
  • datasets.fetch_*(data_home=None): obtains large-scale datasets that must be downloaded from the Internet; the data_home parameter indicates the download directory and defaults to ~/scikit_learn_data/
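A minimal sketch of the two access patterns (fetch_* downloads on first use, so that call is shown commented out):

from sklearn.datasets import load_digits, fetch_20newsgroups

digits = load_digits()    # small dataset, bundled with scikit-learn
print(digits.data.shape)  # (1797, 64): 1797 samples, 64 features

# Large datasets are downloaded on first use and cached under data_home:
# news = fetch_20newsgroups(data_home=None, subset='train')  # defaults to ~/scikit_learn_data/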

2.2.2. Classification and regression datasets

Classification datasets

sklearn.datasets.load_iris(): load and return the iris dataset

sklearn.datasets.load_digits(): load and return the digits (handwritten digit) dataset

sklearn.datasets.fetch_20newsgroups(data_home=None, subset='train')

subset: 'train', 'test', or 'all', optional; selects which part of the dataset to load:

'train' for the training set, 'test' for the test set, 'all' for both

Regression datasets

sklearn.datasets.load_boston(): load and return the Boston house price dataset (note: deprecated in scikit-learn 1.0 and removed in 1.2)

sklearn.datasets.load_diabetes(): load and return the diabetes dataset

Return type

Both load and fetch return a datasets.base.Bunch (a dictionary-like object) with the following fields:

  1. data: the feature data array, a two-dimensional numpy.ndarray of shape [n_samples, n_features]
  2. target: the label array, a one-dimensional numpy.ndarray of length n_samples
  3. DESCR: the dataset description
  4. feature_names: the feature names (the news data, handwritten digits, and regression datasets do not have this)
  5. target_names: the label names
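A minimal sketch of accessing the Bunch fields, using the iris dataset:

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # (150, 4): feature array, 150 samples x 4 features
print(iris.target.shape)   # (150,): one label per sample
print(iris.feature_names)  # names of the four features
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
print(iris.DESCR[:100])    # start of the dataset description

# Bunch is dictionary-like, so key access works as well:
print(iris['data'] is iris.data)  # True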

3. sklearn transformers and estimators

3.1. Transformer

Think back to the steps of the feature engineering we did earlier:

  1. Instantiate a transformer class (Transformer)
  2. Call fit_transform (for example, to build a term-frequency matrix for documents; fit and transform need not happen in a single call)

We call these feature engineering interfaces transformers. A transformer can be called in several forms:

  1. fit_transform
  2. fit
  3. transform

What is the difference between these methods?

The difference between the three methods

StandardScaler is a scikit-learn class for standardizing data. It has three main methods: fit, transform, and fit_transform. They differ as follows:

1. The fit method:

  • The fit method computes the mean and standard deviation of the data.
  • When fit is called, StandardScaler analyzes the data, computes the mean and standard deviation of each feature, and stores these values in the internal state of the StandardScaler object.
  • This method is usually called once, on the training data, to compute the parameters used for standardization.
  • Example: std_scaler.fit(X_train), where X_train is the training data.

2. The transform method:

  • The transform method standardizes data by applying the previously computed mean and standard deviation.
  • When transform is called, StandardScaler standardizes the incoming data using the mean and standard deviation stored inside the object.
  • This method is usually called separately on the training data and the test data, ensuring both are standardized on the same scale.
  • Example: X_train_scaled = std_scaler.transform(X_train)

3. The fit_transform method:

  • The fit_transform method is a combined method, equivalent to calling fit first and then transform.
  • It computes the mean and standard deviation of the data, then uses those results to standardize the same data.
  • This method is usually called once, on the training data, to learn the standardization parameters and return the standardized training data.
  • Example: X_train_scaled = std_scaler.fit_transform(X_train)

Normally, fit should be called once on the training data, and transform should then be used to standardize both the training and the test data (calling fit_transform once on the training data is equivalent). This ensures that all data is standardized with the same mean and standard deviation, avoiding data leakage and inconsistencies.
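A minimal sketch of this recommended usage, on made-up toy data:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # toy training data (made up)
X_test = np.array([[1.5, 15.0]])                             # toy test data (made up)

std_scaler = StandardScaler()
X_train_scaled = std_scaler.fit_transform(X_train)  # fit on the training data only
X_test_scaled = std_scaler.transform(X_test)        # reuse the training mean/std; no refitting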

fit_transform is thus equivalent to fit followed by transform. But why provide a separate fit at all?

Although in many cases fit_transform does the same thing as calling fit and transform separately, a separate fit method is provided for flexibility and wider applicability.

Here are some reasons why a separate fit method is provided:

  1. Step-by-step inspection: Sometimes you may want to check the computed mean and standard deviation before standardizing. A stand-alone fit lets you examine these parameters first, to better understand the data.
  2. Use across datasets: In practice, you may apply the same standardization parameters to several different datasets. For example, if you train a model and save it for production use, you will want production data standardized with the same parameters as the training data. A separate fit lets you store the learned parameters and reuse them on other data.
  3. Controlling the parameters: Sometimes you may wish to adjust the standardization parameters manually, for example by adding an offset or a scaling factor. A separate fit lets you tune the parameters before standardizing.
  4. Customized processing: An independent fit gives developers greater freedom to perform customized processing according to specific needs.

While fit_transform is more convenient in most cases, the stand-alone fit method keeps the library flexible and adaptable to a wider range of use cases. This design philosophy lets developers choose the method that best fits their needs, as sketched below.
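For example, a separate fit makes it possible to inspect the learned statistics and to persist the fitted scaler for later reuse; a sketch on the same toy data (the file name is illustrative):

import joblib  # installed alongside scikit-learn
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # toy data (made up)

std_scaler = StandardScaler()
std_scaler.fit(X_train)
print(std_scaler.mean_, std_scaler.scale_)  # inspect the learned mean and scale

joblib.dump(std_scaler, 'scaler.joblib')    # persist the fitted scaler
scaler = joblib.load('scaler.joblib')       # e.g., later, in production
print(scaler.transform([[2.5, 25.0]]))      # the same parameters applied to new data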

3.2. Estimator

3.2.1 Introduction

"Estimator" (Estimator) is an important concept in scikit-learn, which is a general interface for machine learning models. The goal of an estimator is to encapsulate the training and prediction process of a model so that it can use similar methods uniformly, whether it is classification, regression, or some other type of task.

Estimators have two basic roles in scikit-learn:

  1. Transformer: A transformer is an estimator that computes features from the input data, filtering or transforming it. For example, StandardScaler is a transformer that standardizes data. Transformers typically have a fit method for learning the parameters needed for the transformation, and a transform method for applying the learned transformation.
  2. Predictor: A predictor is an estimator that makes predictions from input data. For example, a linear regression model is a predictor that predicts a target variable from input features. Predictors typically have a fit method for training the model, and a predict method for making predictions.

The general steps for using an estimator include:

  1. Create an estimator object: by instantiating an estimator class such as LinearRegression() or RandomForestClassifier() .
  2. Using the fit method: call the fit method with the training data to train the model (for predictors) or compute transformation parameters (for transformers).
  3. Using an estimator object: Use other methods of the estimator, such as predict (for predictors) or transform (for transformers), to make predictions or transforms as needed.
  4. Evaluate and optimize: Evaluation is performed based on model performance, and model parameters may need to be adjusted to optimize performance.
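A minimal sketch of these steps with a predictor, on made-up toy data:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0]])  # toy features (made up)
y = np.array([2.0, 4.0, 6.0])        # toy targets: y = 2x

model = LinearRegression()     # 1. create the estimator object
model.fit(X, y)                # 2. fit the model to the training data
print(model.predict([[4.0]]))  # 3. predict: approximately [8.]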

This unified interface makes it easy to switch between different estimators in scikit-learn and combine them together to build complex machine learning pipelines. At the same time, it also helps to keep the code clean and consistent, making it easier to compare and experiment with different algorithms.

3.2.2. API

In sklearn, the estimator plays a central role: it is the class of API that implements an algorithm.

1. Estimator for classification:

  • sklearn.neighbors k-nearest neighbor algorithm
  • sklearn.naive_bayes Naive Bayes
  • sklearn.linear_model.LogisticRegression Logistic regression
  • sklearn.tree decision tree
  • sklearn.ensemble random forest

2. Estimator for regression:

  • sklearn.linear_model.LinearRegression Linear regression
  • sklearn.linear_model.Ridge ridge regression

3. Estimators for unsupervised learning

  • sklearn.cluster.KMeans clustering
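All of the classes above are used through the same interface; a quick sketch (note that RandomForestClassifier is imported from sklearn.ensemble):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

# Every estimator is used the same way: instantiate, fit, then predict/transform
knn = KNeighborsClassifier(n_neighbors=5)
ridge = Ridge(alpha=1.0)
kmeans = KMeans(n_clusters=3)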

3.3. Workflow

Estimator is a unified interface in scikit-learn for training models and making predictions. The following is the basic workflow of an estimator:

  1. Choose an estimator class: First, choose an estimator class appropriate to your task. The choice depends on the problem you want to solve, such as classification, regression, or clustering. You can pick a suitable class from scikit-learn's list of estimators, such as LinearRegression or RandomForestClassifier.
  2. Instantiate an estimator object: Create an estimator object by instantiating the selected class. This object holds the parameters and methods of the model.
  3. Fit (train) the model: For a predictor class (Predictor), call the estimator object's fit method with the training data to fit the model. This process learns the model parameters so that it can predict a target value from input features.
  4. Make predictions: For a trained predictor, use the predict method to make predictions. Pass the input features to predict, and it returns the model's predictions for those features.
  5. Transform data (for transformer classes): For a transformer class (Transformer), call the estimator object's fit method with the training data to learn the parameters needed for the transformation. Then use the transform method to apply the learned transformation to new data.
  6. Evaluate and tune: Measure the quality of the model by evaluating its performance on the test data. You can use various metrics, such as accuracy or mean squared error. If necessary, tune the estimator's parameters to optimize performance.

To sum up, the workflow of an estimator involves choosing an appropriate class, instantiating an estimator object, fitting (training) the model, making predictions or transforming the data, and making adjustments based on the evaluation results. This unified interface makes it easy to use different estimators in scikit-learn, build complex machine learning pipelines, and perform model selection and performance optimization.
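Putting the whole workflow together, a minimal end-to-end sketch on the iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=22)

# Transformer: fit the scaler on the training set, apply it to both sets
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

estimator = KNeighborsClassifier(n_neighbors=5)  # choose and instantiate (steps 1-2)
estimator.fit(x_train, y_train)                  # fit / train (step 3)
y_pred = estimator.predict(x_test)               # predict (step 4)
print(estimator.score(x_test, y_test))           # evaluate: test-set accuracy (step 6)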


Origin blog.csdn.net/qq_60735796/article/details/132593659