Data preprocessing and model evaluation in machine learning

In the field of machine learning, data preprocessing and model evaluation are two crucial steps. They ensure that the machine learning models we build can effectively learn from the data and make accurate predictions. This article will introduce the concepts of data preprocessing and model evaluation in detail and illustrate their close relationship through real-life examples.

Data preprocessing

What is data preprocessing?

Data preprocessing is an indispensable step in machine learning. It consists of two main aspects: data cleaning and feature engineering.

Data cleaning

Data cleaning involves identifying and handling errors, anomalies, or missing values in the data. These issues can make model training unstable or produce inaccurate predictions. Key steps in data cleaning include:

  • Missing value handling: Identify missing values and either impute them or delete the samples that contain them. For example, in sales data, if the price of a product is missing, we can fill it in with the mean or median.

  • Outlier detection and handling: Detect and handle outliers to prevent them from affecting model performance. Outliers may stem from data collection errors or other causes. For example, a negative value in weight data is clearly an anomaly and needs to be corrected or removed.

Feature engineering

Feature engineering involves selecting, transforming, and creating features for use by machine learning models. Good feature engineering can significantly improve model performance. Key steps in feature engineering include:

  • Feature selection: Select features relevant to the problem and remove redundant or irrelevant features. This helps reduce model complexity and improve generalization capabilities.

  • Feature transformation: Transform features to better fit the model. For example, a logarithmic transformation can transform right-skewed data into a nearly normal distribution, which is beneficial for linear models.

Example: Medical dataset preprocessing

Let’s take an example of a medical dataset that includes the patient’s age, gender, weight, blood pressure, and disease status. Before data preprocessing, we may encounter the following problems:

  1. Missing values: Weight data is missing for some patients. We can fill in these missing values with the average weight to maintain data integrity.

  2. Outliers: The data contains a record of a patient aged 200, which is obviously an outlier. We need to remove or correct it.

  3. Feature selection: Gender may be irrelevant for predicting disease status. We can choose to remove it from the dataset.

  4. Feature transformation: If the blood pressure data is right-skewed, we can apply a logarithmic transformation so it better meets the model's assumptions.

Through these preprocessing steps, we can prepare data that is more suitable for training machine learning models.

Below is a code example using NumPy and Pandas that demonstrates these preprocessing steps in more detail.

import numpy as np
import pandas as pd

# Create a sample dataset (a 'Gender' column is included so the
# feature-selection step below has something to drop)
data = {
    'Age': [25, 30, 35, 40, 45],
    'Gender': ['M', 'F', 'M', 'F', 'M'],
    'Weight': [70, 75, np.nan, 80, 85],
    'BloodPressure': [120, 130, 140, 150, 160],
    'DiseaseStatus': [0, 1, 0, 1, 1],
}

df = pd.DataFrame(data)

# Handle missing values: fill the missing weight with the column mean
mean_weight = df['Weight'].mean()
df['Weight'] = df['Weight'].fillna(mean_weight)

# Handle outliers: keep only plausible ages
df = df[df['Age'] < 100]

# Feature selection: suppose we decide not to use gender when modeling,
# so we drop it from the dataset
df = df.drop('Gender', axis=1)

# Feature transformation: log-transform blood pressure
df['BloodPressure'] = np.log(df['BloodPressure'])

# Print the preprocessed dataset
print(df)

The code above first creates a sample dataset, then uses Pandas to handle missing values and outliers, and finally performs feature selection and feature transformation. These steps are all part of data preprocessing and ensure that the data is suitable for training machine learning models.

Model evaluation and selection

What is model evaluation?

In the machine learning journey, once we train a model, we need to evaluate its performance thoroughly. This process is called model evaluation, and it is a critical step in ensuring that our models are robust enough for real-world applications.

Cross-validation

To evaluate a model's performance and its ability to generalize, we use a widely recognized technique called cross-validation. In k-fold cross-validation, the dataset is divided into k non-overlapping subsets (folds); in each round, one fold is held out for validation while the model is trained on the remaining folds. Because training and validation are repeated across all k folds, this approach estimates the model's performance more reliably than a single train/test split.
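
As a minimal sketch of how k-fold cross-validation looks in code, here is an example using scikit-learn's cross_val_score; the synthetic dataset and logistic regression model are illustrative assumptions, not part of the article's running examples.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative synthetic classification dataset (an assumption for this sketch)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# 5-fold cross-validation: each fold serves as the validation set once
# while the model is trained on the remaining four folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f'Fold accuracies: {scores}')
print(f'Mean accuracy: {scores.mean():.3f}')

Reporting the mean (and spread) of the per-fold scores gives a more stable picture of performance than a single train/test split.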

Selecting evaluation metrics

However, to gain insight into the model's performance, we need to choose evaluation metrics appropriate for the problem and task. Different problems call for different metrics to measure a model's effectiveness. Here are some common evaluation metrics (a short code sketch follows the list):

  • Accuracy: A common metric for binary and multi-class classification problems. It measures the proportion of samples the model classifies correctly. But be careful: when classes are imbalanced, accuracy can be misleading.

  • Precision and Recall: These metrics are especially important for imbalanced class problems. Precision measures what fraction of predicted positives are actually positive, while recall measures what fraction of actual positives the model finds. The trade-off between them depends on the specific application scenario.

  • Mean Squared Error (MSE): In regression problems, we usually use MSE to measure model performance. It is the average squared difference between the model's predictions and the actual values; a smaller MSE indicates predictions closer to the actual values.
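
To make these metrics concrete, here is a small sketch using scikit-learn's metric functions; the label and prediction arrays are made-up values for illustration only.

from sklearn.metrics import accuracy_score, precision_score, recall_score, mean_squared_error

# Made-up binary classification labels and predictions (illustrative only)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(f'Accuracy:  {accuracy_score(y_true, y_pred):.2f}')   # correct / total
print(f'Precision: {precision_score(y_true, y_pred):.2f}')  # TP / (TP + FP)
print(f'Recall:    {recall_score(y_true, y_pred):.2f}')     # TP / (TP + FN)

# Made-up regression targets and predictions (illustrative only)
y_true_reg = [3.0, 2.5, 4.0]
y_pred_reg = [2.8, 2.7, 3.6]
print(f'MSE: {mean_squared_error(y_true_reg, y_pred_reg):.3f}')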

By choosing appropriate evaluation metrics, we are able to better understand how the model performs in different situations and make adjustments and improvements as needed. This process is an integral part of model development and helps ensure that our models perform well in real-world applications.

Solving overfitting and underfitting

Overfitting

Overfitting is when a model performs well on training data but poorly on unseen test data. This happens because the model is too complex and fits the noise in the training data.

Underfitting

Underfitting means that the model cannot fit the training data well, resulting in poor performance on both training and test data. Often this is because the model is too simple and cannot capture the complex relationships in the data.

How to solve overfitting and underfitting?

  • To solve overfitting: Reduce model complexity, increase the amount of training data, or use regularization methods (such as L1 or L2 regularization).

  • To solve underfitting: Increase model complexity, improve feature engineering, or increase training time.

Through data preprocessing and model evaluation, we can better understand and utilize data to build high-performing machine learning models. These steps are key factors in the success of real-world machine learning projects, helping to avoid common problems such as overfitting and underfitting, as well as improving the reliability and generalization ability of the model.

Practical example: house price prediction

Suppose we are working on a machine learning project for house price prediction. We have a dataset that includes house features and corresponding prices, and our goal is to build a model that can predict the price of a house based on the input features. In this scenario, data preprocessing and model evaluation are very critical.

Data preprocessing

First, we need to preprocess the data to ensure it is suitable for training the model. Here are some data preprocessing steps we may need to perform:

  1. Missing value handling: Check the data for missing values, such as missing house size or number of bedrooms. We can fill them in with the mean, median, or another statistic.

  2. Outlier handling: Find and handle outliers, such as extremely high or low prices, to prevent them from distorting the model's performance. Outliers can be handled by truncation or replacement (a clipping sketch follows this list).

  3. Feature engineering: Select appropriate features based on domain knowledge or feature importance. New features can also be created, such as the total square footage of a home, to better capture price variation.

  4. Data normalization: For some machine learning algorithms, such as linear regression, standardizing (normalizing) the data can aid model training. This is done by subtracting the mean and dividing by the standard deviation.
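
Item 2 above mentions truncation as an alternative to deleting records. Here is a minimal, self-contained sketch using pandas; the price values and percentile bounds are made-up illustrative choices.

import pandas as pd

# Made-up prices (illustrative only), including one extreme outlier
prices = pd.Series([250000, 310000, 280000, 9900000, 265000])

# Truncate (winsorize): clip values outside the 5th-95th percentile range
# to the nearest boundary instead of dropping the record
lower, upper = prices.quantile(0.05), prices.quantile(0.95)
prices = prices.clip(lower=lower, upper=upper)
print(prices)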

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load the dataset
data = pd.read_csv('house_prices.csv')

# Handle missing values (numeric_only avoids errors on non-numeric columns)
data.fillna(data.mean(numeric_only=True), inplace=True)

# Handle outliers (for example, drop records priced below 1000)
data = data[data['Price'] >= 1000]

# Feature engineering: create a total-area feature
data['TotalArea'] = data['LivingArea'] + data['GarageArea']

# Standardize the numeric features
scaler = StandardScaler()
data[['TotalArea', 'Bedrooms']] = scaler.fit_transform(data[['TotalArea', 'Bedrooms']])

# Split the data into training and test sets
X = data[['TotalArea', 'Bedrooms']]
y = data['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Model evaluation

Once we have completed data preprocessing, we can start training and evaluating the model. In this example, we use linear regression as the model and choose root mean square error (RMSE) as the evaluation metric.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Compute the root mean squared error (RMSE) to evaluate model performance
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')

In this example, we evaluate the model's performance using the root mean squared error (RMSE). A lower RMSE indicates that the model's predictions are closer to the actual house prices, making it an important evaluation metric.

When it comes to overfitting and underfitting of machine learning models, we can illustrate both problems, and how to deal with them, with some example code and solutions.

Overfitting problem

Overfitting is when a model performs well on training data but poorly on unseen test data. This usually happens when the model is too complex and tries to capture the noise and nuances in the training data. The following example shows how overfitting manifests when a high-order polynomial is fit to noisy data, and how to address it:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt

# Create a noisy sample dataset
np.random.seed(0)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - np.random.rand(16))

# Fit a high-order polynomial model
degree = 15
model = LinearRegression()
X_poly = np.vander(X.ravel(), degree)
model.fit(X_poly, y)
y_pred = model.predict(X_poly)

# Compute the root mean squared error (RMSE) on the training set
rmse_train = np.sqrt(mean_squared_error(y, y_pred))

# Plot the data and the fitted curve
plt.scatter(X, y, s=20, label='Data')
plt.plot(X, y_pred, color='r',
         label=f'Polynomial Degree {degree}\nTrain RMSE: {rmse_train:.2f}')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()

In this example, we use a high-order polynomial model to fit the noisy data. As the plot shows, the model fits the training data almost perfectly, but it would likely perform poorly on new data, which is the typical overfitting pattern.

Methods to solve the overfitting problem:
  1. Reduce model complexity : You can try to reduce the complexity of the model, such as reducing the order of the polynomial or reducing the number of layers of the neural network.

  2. Increase the amount of training data : More data can help the model generalize better.

  3. Use regularization methods : Regularization techniques such as L1 or L2 regularization can limit the complexity of the model (a Ridge-regression sketch follows this list).
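
As a minimal sketch of the third option, the noisy dataset from the example above can be refit with an L2-penalized (Ridge) polynomial model. This assumes X, y, np, and mean_squared_error are still in scope from that example; alpha=1.0 is an illustrative regularization strength, not a tuned value.

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same polynomial capacity as before, but with an L2 penalty on the
# coefficients; scaling the features first keeps the penalty well-behaved
ridge_model = make_pipeline(
    PolynomialFeatures(degree=15),
    StandardScaler(),
    Ridge(alpha=1.0),  # illustrative value, not tuned
)
ridge_model.fit(X, y)
y_pred_ridge = ridge_model.predict(X)
rmse_ridge = np.sqrt(mean_squared_error(y, y_pred_ridge))
print(f'Ridge train RMSE: {rmse_ridge:.2f}')

The training RMSE typically rises slightly compared with the unregularized fit, but the fitted curve is much smoother, which is exactly the trade-off regularization is meant to make.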

Underfitting problem

Underfitting is when the model fails to fit the training data well, usually because the model is too simple to capture the complex relationships in the data. The following example illustrates this with a linear fit to noisy sinusoidal data:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt

# Create a noisy sample dataset
np.random.seed(0)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Fit a linear model
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Compute the root mean squared error (RMSE) on the training set
rmse_train = np.sqrt(mean_squared_error(y, y_pred))

# Plot the data and the fitted line
plt.scatter(X, y, s=20, label='Data')
plt.plot(X, y_pred, color='r',
         label=f'Linear Model\nTrain RMSE: {rmse_train:.2f}')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()

In this example, we use a linear model to fit noisy sinusoidal data. As the plot shows, the linear model cannot capture the nonlinear relationship in the data, which is a typical manifestation of underfitting.

Methods to solve the underfitting problem:
  1. Increase model complexity : You can try using more complex models, such as polynomial regression or deep neural networks.

  2. Improve feature engineering : add more relevant features or perform feature transformations.

  3. Increase training time : Increase the training time of the model, allowing it to better fit the data.

  4. Ensemble learning : Use ensemble learning methods, such as random forests or gradient-boosted trees, to improve model performance (a random-forest sketch follows this list).
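
As a minimal sketch of the fourth option, the noisy sine data from the example above can be refit with a random forest. This assumes X, y, np, and mean_squared_error are still in scope from that example; the hyperparameters are illustrative, not tuned.

from sklearn.ensemble import RandomForestRegressor

# A more flexible model that can capture the nonlinear relationship
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
y_pred_forest = forest.predict(X)
rmse_forest = np.sqrt(mean_squared_error(y, y_pred_forest))
print(f'Random forest train RMSE: {rmse_forest:.2f}')

A much lower training RMSE shows the model can now follow the nonlinear pattern, but a low training error alone says nothing about generalization, so performance should still be checked with cross-validation as described earlier.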
