Python machine learning getting-started notes (1): Scikit-learn and feature engineering

Table of contents

Classification of Machine Learning Algorithms

Dataset tool Scikit-learn

Scikit-learn installation

Introduction to scikit-learn dataset API

Bunch object

datasets module

Dataset partitioning

train_test_split

Code example

Feature engineering

Feature extraction

sklearn.feature_extraction API

Example of dictionary feature extraction

Text feature extraction case

Chinese word segmentation with jieba

TF-IDF text feature extraction

Feature preprocessing

Normalization

Standardization

Data dimensionality reduction

Low variance feature filtering

Principal Component Analysis (PCA)

Correlation coefficient


Classification of Machine Learning Algorithms

  • Supervised learning (prediction)
    • Definition: the input data consists of input feature values and target values. The output of the learned function can be a continuous value (regression) or a finite number of discrete values (classification).
    • Classification: k-nearest neighbors, naive Bayes, decision trees and random forests, logistic regression, neural networks
    • Regression: linear regression, ridge regression
  • Unsupervised learning
    • Definition: the input data consists of input feature values only (no target values).
    • Clustering: k-means

Dataset tool Scikit-learn

        Datasets are an important part of machine learning, data mining, and other data analysis fields. They contain a set of data samples that are used as the basis for training, testing, and evaluating models.

        Scikit-learn is an excellent machine learning library, featuring ease of use, versatility, high performance, open source and free, integration with the Python ecosystem, and rich community support. It is widely used in academia and industry.

Scikit-learn installation

  1. Open a terminal and update the package index:

    sudo apt update

  2. Make sure you have installed Python3. Enter the following command in the terminal:

    python3 --version

    If Python3 is not installed, you can enter the following command to install it:

    sudo apt install python3

  3. Make sure you have installed pip3, the package manager for Python3. Enter the following command in the terminal:

    pip3 --version

    If it is not installed, you can enter the following command to install it:

    sudo apt install python3-pip

  4. Install scikit-learn via pip3:

    pip3 install -U scikit-learn

    This will install the latest version of the scikit-learn package.

  5. After the installation is complete, you can verify that scikit-learn was successfully installed by:

    python3 -c "import sklearn; print(sklearn.__version__)"

    If the installation succeeded, the Scikit-learn version number will be displayed.

Introduction to scikit-learn dataset API

Bunch object

In Scikit-learn, Bunch is a dictionary-like object used to store datasets and related information in machine learning. The structure of a Bunch object usually consists of the following three attributes:

  • data: Feature data, which is a matrix of n_samples * n_features.

  • target: Target data, which is an array of n_samples, usually used for supervised learning.

  • feature_names: The names of the features, a list of strings of length n_features.

In addition, Bunch objects may also include the following properties:

  • target_names: The names of the target classes, a list of strings of length n_classes, usually used for classification problems.

  • DESCR: The description information of the dataset.

  • filename: The filename of the dataset, if the dataset comes from a file.

  • Other custom attributes: For example, you can save some other information in the dataset by setting custom attributes.

Bunch objects are a common format for datasets in Scikit-learn, which facilitates data reading and processing, as well as feature and target data access and conversion.
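
For example, a minimal sketch of accessing these attributes on the Bunch object returned by load_iris() (both attribute-style and dictionary-style access work):

from sklearn.datasets import load_iris

# Load the iris dataset; the return value is a Bunch object
iris = load_iris()

# Attribute-style access
print(iris.data.shape)        # (150, 4): n_samples x n_features
print(iris.target.shape)      # (150,): one target value per sample
print(iris.feature_names)     # list of 4 feature name strings
print(iris.target_names)      # ['setosa' 'versicolor' 'virginica']

# Dictionary-style access to the same fields
print(iris["DESCR"][:100])    # first 100 characters of the dataset description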

datasets module

  1. Load a standard dataset:

    from sklearn.datasets import load_iris

    iris = load_iris()

    The load_iris() function loads the iris dataset and returns a dictionary-like Bunch object that contains the data, labels, feature names, and other information about the dataset.

  2. Load a small dataset:

    from sklearn.datasets import load_boston

    boston = load_boston()

    The load_boston() function loads the Boston housing price dataset and returns a Bunch object containing the data, labels, and other information. Note that load_boston was removed in scikit-learn 1.2, so on recent versions an alternative such as fetch_california_housing() should be used instead.

  3. Load large datasets:

    from sklearn.datasets import fetch_openml

    mnist = fetch_openml('mnist_784')

    The fetch_openml() function loads the MNIST handwritten digit dataset and returns a Bunch object containing the data, labels, and other information. Because fetch_openml() downloads datasets from a remote server, it can also be used to load larger datasets.

  4. Create an artificial dataset:

    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

    The make_classification() function generates a random binary classification dataset and returns the data and labels; it can be used to test and validate machine learning models.

By loading and generating datasets, we can directly use these datasets for model training and testing, and also learn and explore the performance and characteristics of machine learning algorithms through these datasets.

Dataset partitioning

A machine learning dataset is generally divided into two parts:

  • Training data: used to train and build the model
  • Test data: used to check the model and evaluate whether it is effective

Common split ratios:

  • Training set: 70%, 80%, or 75%
  • Test set: 30%, 20%, or 25%

train_test_split

train_test_split is a commonly used data preprocessing utility in the scikit-learn library that randomly splits the original dataset into a training set and a test set for model training and evaluation. A typical call looks like this:

from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Here, X is the feature matrix of the original data and y is the corresponding label array. The test_size parameter specifies the proportion of the test set, and the random_state parameter sets a random seed so that the split is reproducible.

train_test_split randomly divides the original dataset into two subsets: one is used as the training set to train the model, and the other as the test set to evaluate its performance. Holding out a test set in this way makes it possible to detect overfitting and to assess the model's generalization ability.

Code example

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()

# Print basic information about the dataset
print("Dataset features:\n", iris.data[:5])
print("Dataset labels:\n", iris.target[:5])
print("Dataset feature names:\n", iris.feature_names)
print("Dataset label names:\n", iris.target_names)

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Print the shapes of the splits
print("Training set feature shape:", X_train.shape)
print("Training set label shape:", y_train.shape)
print("Test set feature shape:", X_test.shape)
print("Test set label shape:", y_test.shape)

Feature engineering

Feature engineering refers to the process of constructing a more representative and discriminative feature set through techniques such as data preprocessing, feature extraction, feature selection, and feature transformation, according to the task and data characteristics of a machine learning problem. Feature engineering is an important part of machine learning and has a large impact on model performance. Specifically, its significance includes the following points:

  1. Improve the accuracy of the model: Feature engineering can improve the accuracy of the model. Building more representative and discriminative features can help the model better distinguish different types of data, thereby improving the accuracy of the model.

  2. Reduce the complexity of the model: Feature engineering can reduce model complexity. Techniques such as feature selection and feature transformation remove redundant and useless features, which lowers the complexity of the model and improves its generalization ability and robustness.

  3. Improve the interpretability of the model: Feature engineering can improve the interpretability of the model. Selecting and constructing features with practical significance can help explain the prediction results of the model and improve the interpretability and credibility of the model.

  4. Accelerate model training: Feature engineering can accelerate model training. Through techniques such as feature extraction and feature transformation, raw data can be converted into more efficient and compact feature representations, thereby accelerating model training and prediction.

Feature extraction

Feature extraction refers to extracting useful features from raw data for building machine learning models. In practice, raw data is often high-dimensional and contains noise and redundant information, so features need to be extracted from it to reduce dimensionality, remove noise and redundancy, and improve the accuracy and efficiency of machine learning models.

sklearn.feature_extraction API

The sklearn.feature_extraction module is a feature extraction module in scikit-learn, which provides some common feature extraction methods. The classes and functions in this module can help us convert raw data such as text, images, and audio into feature representations suitable for machine learning models.

This module mainly includes the following components:

  • sklearn.feature_extraction.text: Used to extract features from text, such as word counts, n-grams, and TF-IDF.
  • sklearn.feature_extraction.image: Used to extract features from images, such as pixel patches.
  • DictVectorizer: Used to extract features from lists of dictionaries, applying one-hot encoding to categorical values.
  • FeatureHasher: Used to vectorize features with the hashing trick.

Classes such as CountVectorizer, TfidfVectorizer, and DictVectorizer share a unified API (fit, transform, fit_transform), which makes them easy to use and combine. Using these classes helps us extract features quickly and efficiently and improves the effectiveness of machine learning models.

Example of dictionary feature extraction

from sklearn.feature_extraction import DictVectorizer

# Define a list of dictionaries, one per sample
data = [
    {'age': 25, 'sex': 'male', 'location': 'New York'},
    {'age': 30, 'sex': 'female', 'location': 'Chicago'},
    {'age': 35, 'sex': 'male', 'location': 'San Francisco'},
    {'age': 40, 'sex': 'female', 'location': 'Los Angeles'}
]

# Instantiate DictVectorizer
vec = DictVectorizer()

# Convert the list of dictionaries into a feature matrix
features = vec.fit_transform(data)

# Inspect the feature matrix
print(features.toarray())

# Inspect the feature names
print(vec.get_feature_names_out())

In this case, we first define a list of dictionaries, each dictionary representing one data sample. We then instantiate the DictVectorizer class and call fit_transform() with the list as input to convert it into a feature matrix. Finally, we inspect the feature matrix and the feature names. The feature matrix is a sparse matrix: each row represents a data sample, each column represents a feature, and each value is that feature's value for that sample. get_feature_names_out() returns the list of feature names. Categorical features are one-hot encoded: with the default alphabetical ordering, the first column is age, columns 2 to 5 are the four city categories, and columns 6 and 7 are the two gender categories.

Text feature extraction case

from sklearn.feature_extraction.text import CountVectorizer

# Define a set of text documents
text_data = [
    "hello world",
    "hello python",
    "python is great",
    "hello scikit-learn"
]

# Instantiate the CountVectorizer class
vectorizer = CountVectorizer()

# Convert the text into a feature matrix
features = vectorizer.fit_transform(text_data)

# Inspect the feature matrix
print(features.toarray())

# Inspect the feature names
print(vectorizer.get_feature_names_out())

The feature matrix is a sparse matrix: each row represents one text document, each column represents a word, and each value is the number of times that word appears in the document. The feature names are the list of all words in the vocabulary. This method converts text into a feature representation based on word counts, which can then be used to train machine learning models and make predictions.

Chinese word segmentation with jieba

Jieba is a commonly used Chinese word segmentation tool written in Python that performs segmentation quickly and efficiently. It uses a prefix-dictionary-based matching algorithm that can quickly find the possible words in a text: jieba first scans the Chinese characters in the text and then groups them into words based on dictionary matching.

import jieba

data  = "阳光下有两个影子,一个是我的,一个也是我的。 鲁迅"

print(list(jieba.cut(data)))

Segment the Chinese text with jieba first, and then extract word-count features from the result:

from sklearn.feature_extraction.text import CountVectorizer
import jieba

data  = "阳光下有两个影子,一个是我的,一个也是我的。 鲁迅"

feature = CountVectorizer();

data = feature.fit_transform(list(jieba.cut(data)))
print(data.toarray())
print(feature.get_feature_names_out())

 

TF-IDF text feature extraction

TF-IDF (Term Frequency-Inverse Document Frequency) is a commonly used text feature extraction method that measures the importance of a word to a document within a collection of documents. The core idea is that the more often a word appears in the current document and the less often it appears across all documents, the more important it is to the current document.

The TF-IDF method usually consists of two parts: TF and IDF.

  1. TF (Term Frequency): Indicates the frequency of a word appearing in the current document, that is, the number of times the word appears in the current document divided by the total number of all words in the current document.

  2. IDF (Inverse Document Frequency): Measures how rare a term is across the corpus. It is the logarithm of the total number of documents divided by the number of documents that contain the term.

The calculation formula of TF-IDF is as follows:

TF-IDF = TF * IDF
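
As a quick worked example (with made-up numbers): suppose a term appears 3 times in a 100-word document, and the corpus contains 1,000 documents, 10 of which contain the term.

import math

tf = 3 / 100                   # term frequency in the current document
idf = math.log10(1000 / 10)    # inverse document frequency (base-10 logarithm) = 2.0
print(tf * idf)                # 0.03 * 2 = 0.06

Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each document vector by default, so its output differs slightly from this hand calculation.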

In Python, you can use the sklearn library for TF-IDF feature extraction. The TfidfVectorizer class in sklearn can extract features from text and convert the result into a sparse matrix form, which is convenient for machine learning model training.

The following is a simple code example for TF-IDF feature extraction using the sklearn library:

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

def cut_word(text):
    return " ".join(list(jieba.cut(text)))
# Define the sample texts (Chinese sentences)
corpus = ['我喜欢用Python进行自然语言处理',
          'Python非常好用',
          '我正在学习Python自然语言处理']

# Segment each sentence with jieba and join the tokens with spaces
text_list = []
for text in corpus:
    text_list.append(cut_word(text))

# Extract TF-IDF features with TfidfVectorizer
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(text_list)

# Print the extracted feature vectors and the feature names
print(tfidf.toarray())
print(vectorizer.get_feature_names_out())

Feature preprocessing

Feature preprocessing is a very important step in machine learning. It usually refers to a series of processing operations on the original features to make them more suitable for training machine learning models. Feature preprocessing includes data cleaning, feature scaling, and feature encoding.

Normalization

Normalization refers to scaling features to a specific range, usually between [0,1] or [-1,1]. Normalization can avoid the impact of differences between different features on model training and improve the accuracy and stability of the model.

MinMaxScaler is a preprocessing tool provided in the scikit-learn library that scales data to a specified range, usually between 0 and 1. It uses the minimum and maximum values of each feature, so that the minimum value in the original data is mapped to 0, the maximum value is mapped to 1, and all other values fall somewhere in between.

MinMaxScaler is suitable for situations where the value ranges of features differ greatly, for example when some features have a much larger range than others, which can affect model training. Scaling the data makes training more stable and accurate and can also speed up model convergence.

Using MinMaxScaler is very simple; the steps are as follows:

  1. Import the class: from sklearn.preprocessing import MinMaxScaler

  2. Create a MinMaxScaler object: scaler = MinMaxScaler()

  3. Use the fit_transform() method to scale the data: scaled_data = scaler.fit_transform(data)

The default value of the feature_range parameter is (0, 1), i.e. the scaled feature values lie between 0 and 1. If you need to scale features to a different range, you can change the feature_range parameter, as in the example below.

from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled_data = scaler.fit_transform(data)
print(scaled_data)
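
As a sanity check (a small sketch, not part of the original example), the same result can be computed directly from the min-max formula:

import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Min-max scaling by hand: first map each column to [0, 1],
# then stretch the result to the target range (-1, 1)
col_min = data.min(axis=0)
col_max = data.max(axis=0)
scaled_01 = (data - col_min) / (col_max - col_min)
manual = scaled_01 * (1 - (-1)) + (-1)
print(manual)  # matches the MinMaxScaler output above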

Standardization

Standardization refers to scaling the data so that it falls into a small, specific range; most commonly the data is transformed to have a mean of 0 and a variance of 1. Standardization eliminates differences in units and magnitude between features, making the data more comparable and interpretable, and it also benefits many machine learning algorithms.

Standardization usually centers the data first: the mean of each feature is subtracted from that feature's values, so the resulting mean is 0. The data is then scaled so that each feature has a variance of 1. This keeps the scale of the features consistent, so the model is not adversely affected by features having different value ranges.

The mean of a feature is the average of its values over all samples, and the variance measures how dispersed those values are. During standardization, each value has the feature's mean subtracted from it and is then divided by the feature's standard deviation, so that the feature ends up with mean 0 and variance 1. This removes scale differences between features and makes it easier for the model to converge.

from sklearn.preprocessing import StandardScaler
import numpy as np

# Construct the data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Create a StandardScaler object
scaler = StandardScaler()
# Standardize the data
scaled_data = scaler.fit_transform(data)
# Print the standardized data
print(scaled_data)

print("Feature means:", scaler.mean_)
print("Feature variances:", scaler.var_)

Data dimensionality reduction 

Dimensionality reduction is used when the data contains redundant or irrelevant variables (also called features, attributes, or indicators); its aim is to find the main features among the original features.

Two ways of dimensionality reduction

  • Feature selection
  • Principal component analysis (which can be understood as a form of feature extraction)

Feature selection methods

  • Filter: mainly examines the characteristics of the features themselves and the relationships between features and between features and the target value
    • Variance selection: low-variance feature filtering
    • Correlation coefficient
  • Embedded: the algorithm selects features automatically, based on the association between features and the target value
    • Decision trees: information entropy, information gain
    • Regularization: L1, L2
    • Deep learning: convolutions and more

Low variance feature filtering

Low-variance feature filtering deletes features whose variance is low (the meaning of variance was discussed earlier). The idea is to judge a feature from the size of its variance:

  • Small feature variance: most samples have similar values for this feature
  • Large feature variance: the values of this feature differ widely across samples

Variance is a statistic that describes the range of data distribution and is used to measure the degree of dispersion of data. In feature selection, variance can be used as an indicator to measure the importance of features, because the larger the variance, the wider the range of data distribution and the greater the difference between samples, so this feature is more important.

Simply put, the greater the variance, the greater the differences in the data, and these differences can be used to distinguish different types of data, so they have a greater degree of discrimination and can be used as an important feature.

The following is a case of using sklearn for feature selection, where the method used is variance filtering.

from sklearn.feature_selection import VarianceThreshold
import numpy as np

# Original data with 3 features
X = np.array([[0, 2, 0], [0, 1, 4], [1, 1, 0], [0, 1, 1]])

# Create a variance filter with threshold 1
selector = VarianceThreshold(threshold=1)

# Apply feature selection to the data
X_new = selector.fit_transform(X)

# Print the data after feature selection
print("Data after feature selection:\n", X_new)

In this example, we used the VarianceThreshold class from sklearn for feature selection. This class removes features whose variance does not exceed a given threshold, because features with small variance usually carry less information.

In the code, we first define a 4×3 array X (4 samples, 3 features) as the original data, then create a VarianceThreshold object with the threshold set to 1. Finally, the fit_transform method performs feature selection on the original data and returns a new feature matrix X_new.

After feature selection, only the third feature remains: the first two features each have a variance of about 0.19, which is below the threshold of 1, so they are removed, while the third feature has a variance of about 2.69 and is retained.
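
To see exactly which features were kept, the fitted selector can be inspected (a short illustrative addition to the example above):

# Per-feature variances computed by the selector
print("Feature variances:", selector.variances_)      # approx. [0.1875, 0.1875, 2.6875]

# Boolean mask of the retained features
print("Retained features:", selector.get_support())   # [False False  True]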

Principal Component Analysis (PCA)

  • Definition: the process of transforming high-dimensional data into low-dimensional data; in the process some of the original information may be discarded and new variables created

  • Purpose: data compression that reduces the dimensionality (complexity) of the original data as much as possible while losing only a small amount of information

  • Application: regression analysis or cluster analysis

API: sklearn.decomposition.PCA

  • sklearn.decomposition.PCA(n_components=None)
    • Decomposes the data into a lower-dimensional space
    • n_components:
      • Float: the fraction of variance (information) to retain
      • Integer: the number of dimensions to reduce to
    • PCA.fit_transform(X), where X is data in numpy array format [n_samples, n_features]
    • Return value: the transformed array with the specified number of dimensions

from sklearn.decomposition import PCA
import numpy as np

# Construct the dataset
X = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10], [10, 11, 12, 13]])

# Reduce the dimensionality with PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Print the reduced dataset
print('Dataset after dimensionality reduction:\n', X_pca)

It can be seen that PCA reduces the 4-dimensional data to 2 dimensions: the transformed dataset has only 2 columns, corresponding to the two principal components found by PCA. The original features are not simply dropped; the data is projected onto the principal components.
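
n_components can also be given as a float, in which case PCA keeps as many components as are needed to retain that fraction of the variance (a small sketch reusing the X defined above):

# Keep enough components to retain 95% of the variance
pca_95 = PCA(n_components=0.95)
X_pca_95 = pca_95.fit_transform(X)

print(X_pca_95.shape)                    # number of retained components
print(pca_95.explained_variance_ratio_)  # fraction of variance explained by each component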

Correlation coefficient

The correlation coefficient is an indicator used to measure the strength of the correlation between two variables. Common correlation coefficients include Pearson correlation coefficient, Spearman rank correlation coefficient, Kendall rank correlation coefficient and so on.

The Pearson correlation coefficient is the most commonly used correlation coefficient and measures the degree of linear correlation between two continuous variables. Its value ranges from -1 to 1: 0 means no linear correlation, a positive value indicates a positive correlation, and a negative value indicates a negative correlation.

The following is an example of computing the Pearson correlation coefficient with scipy.stats.pearsonr:

from scipy.stats import pearsonr
import numpy as np

# Construct the dataset
X = np.array([[-1, 2, 5, -4], [2, -1, -3, 6], [5, -3, -2, 1], [-4, 6, 1, 0]])

# Compute the Pearson correlation coefficient between the first and third columns
r, p = pearsonr(X[:, 0], X[:, 2])
print('Pearson correlation coefficient:', r)

It can be seen that the Pearson correlation coefficient computed by pearsonr for the first and third columns is about -0.61, indicating a moderate negative correlation between them. The closer the value is to -1 or 1, the stronger the correlation between the two variables; the closer it is to 0, the weaker the correlation.
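
Spearman's rank correlation coefficient, mentioned above, can be computed in a similar way (a short illustrative sketch using the same X):

from scipy.stats import spearmanr

# Spearman rank correlation between the first and third columns
rho, p_value = spearmanr(X[:, 0], X[:, 2])
print('Spearman rank correlation coefficient:', rho)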
