Prostate cancer prediction based on linear regression

In this article, we will study how to solve the regression problem on a prostate cancer dataset using linear regression and stochastic gradient descent methods. We'll use Python along with some popular machine learning libraries like scikit-learn, pandas, and NumPy. We will also compare the performance of different methods on this problem to find the best model.

Dataset

We use the Prostate dataset, which contains clinical data from patients with prostate cancer. The dataset has 97 samples, each with eight clinical features, a train/test indicator, and a target value. Our goal is to predict the level of log prostate-specific antigen (lpsa) for each patient.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("prostate.data", delimiter="\t")
data = data.drop("Unnamed: 0", axis=1)
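
The column names assumed below are those of the standard prostate dataset (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45, lpsa, plus the train indicator); a quick inspection is a useful sanity check that the loaded data matches the description above.

# Quick look at the loaded data (a sketch; expects 97 rows and 10 columns for the standard prostate data)
print(data.shape)
print(data.columns.tolist())
print(data.head())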

Methods

We will use the following methods for prediction:

  • Ordinary Least Squares (OLS)
  • Gradient Descent (GD)
  • Stochastic Gradient Descent (SGD)

To evaluate the performance of these methods, we will use Mean Squared Error (MSE) as the metric. We will also use cross-validation to select the best hyperparameters.
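
As a reminder, MSE is the average of the squared differences between predictions and true values. The following minimal sketch, using made-up numbers, shows that computing it by hand gives the same result as scikit-learn's mean_squared_error:

# Minimal illustration of the MSE metric on made-up values
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([2.5, 0.0, 2.1, 1.6])
y_pred = np.array([3.0, -0.5, 2.0, 1.5])

mse_manual = np.mean((y_true - y_pred) ** 2)     # (1/n) * sum of squared errors
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)                   # both print 0.13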

Data preprocessing

First, we separate the training and test sets using the dataset's predefined train/test indicator. Then we separate the features and target values. To give the model better performance on features of different scales, we standardize the data, i.e. we scale each feature to have a mean of 0 and a standard deviation of 1.

# Split into training and test sets using the predefined train indicator
train_data = data[data["train"] == "T"].drop("train", axis=1)
test_data = data[data["train"] == "F"].drop("train", axis=1)

# Separate features and target
X_train = train_data.drop("lpsa", axis=1)
y_train = train_data["lpsa"]
X_test = test_data.drop("lpsa", axis=1)
y_test = test_data["lpsa"]

# Standardize the features (zero mean, unit variance)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Model training

Next, we train models using ordinary least squares, gradient descent, and stochastic gradient descent. To evaluate the performance of these models on the training and test sets, we calculate the mean squared error for each model. The results show that ordinary least squares has slightly lower errors on both the training and test sets, which indicates that it performs somewhat better on this problem.

# Ordinary least squares
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_train_pred_lr = lr.predict(X_train_scaled)
y_test_pred_lr = lr.predict(X_test_scaled)

# Gradient descent
sgd = SGDRegressor(max_iter=1000, tol=1e-3)
sgd.fit(X_train_scaled, y_train)
y_train_pred_sgd = sgd.predict(X_train_scaled)
y_test_pred_sgd = sgd.predict(X_test_scaled)

# Stochastic gradient descent
sgd_rand = SGDRegressor(max_iter=1000, tol=1e-3, learning_rate="invscaling", eta0=0.01)
sgd_rand.fit(X_train_scaled, y_train)
y_train_pred_sgd_rand = sgd_rand.predict(X_train_scaled)
y_test_pred_sgd_rand = sgd_rand.predict(X_test_scaled)

# Training-set errors
mse_train_lr = mean_squared_error(y_train, y_train_pred_lr)
mse_train_sgd = mean_squared_error(y_train, y_train_pred_sgd)
mse_train_sgd_rand = mean_squared_error(y_train, y_train_pred_sgd_rand)

# Test-set errors
mse_test_lr = mean_squared_error(y_test, y_test_pred_lr)
mse_test_sgd = mean_squared_error(y_test, y_test_pred_sgd)
mse_test_sgd_rand = mean_squared_error(y_test, y_test_pred_sgd_rand)

# Print the errors
print("Train MSE: OLS:", mse_train_lr, "GD:", mse_train_sgd, "SGD:", mse_train_sgd_rand)
print("Test MSE: OLS:", mse_test_lr, "GD:", mse_test_sgd, "SGD:", mse_test_sgd_rand)

Model evaluation

To further evaluate the performance of the model, we plotted the residuals. Residual plots show the difference between predicted and actual values. As can be seen from the residual plot, our model performs similarly on the training set and test set, which indicates that the model does not suffer from overfitting or underfitting.

Residual plot

# Plot residuals against predicted values
plt.scatter(y_train_pred_lr, y_train_pred_lr - y_train, c="blue", marker="o", label="Training data")
plt.scatter(y_test_pred_lr, y_test_pred_lr - y_test, c="orange", marker="s", label="Test data")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.legend(loc="upper left")
plt.hlines(y=0, xmin=-2, xmax=5, lw=2, color="red")
plt.show()

Cross-validation and hyperparameter tuning

To find the best model, we performed hyperparameter tuning of stochastic gradient descent using cross-validation. Cross-validation allows a more accurate assessment of a model's performance on unseen data by repeatedly training and validating the model on different subsets of data. We considered two hyperparameters in this experiment: learning rate and number of iterations.

# Cross-validation
from sklearn.model_selection import cross_val_score


# Compute the cross-validation score (average MSE across folds)
def evaluate_model(model, X, y, cv=5):
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    return -np.mean(scores)


# Combine the original training and test sets
X_full = np.concatenate((X_train_scaled, X_test_scaled), axis=0)
y_full = np.concatenate((y_train, y_test), axis=0)

# Re-split the combined data for this experiment
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X_full, y_full, test_size=0.3, random_state=42)

# Candidate hyperparameter values
learning_rates = [0.001, 0.01, 0.1]
n_iters = [500, 1000, 2000]

# Grid search for the best parameter combination
best_mse = float("inf")
best_lr = None
best_n_iter = None

# Use a distinct loop variable name ("eta") to avoid shadowing the fitted LinearRegression model "lr"
for eta in learning_rates:
    for n_iter in n_iters:
        sgd = SGDRegressor(max_iter=n_iter, tol=1e-3, learning_rate="invscaling", eta0=eta)
        mse = evaluate_model(sgd, X_train_full, y_train_full)
        if mse < best_mse:
            best_mse = mse
            best_lr = eta
            best_n_iter = n_iter

# Print the best parameters
print("Best learning rate:", best_lr, "Best number of iterations:", best_n_iter)

# Train a model with the best parameters
sgd_best = SGDRegressor(max_iter=best_n_iter, tol=1e-3, learning_rate="invscaling", eta0=best_lr)
sgd_best.fit(X_train_full, y_train_full)
y_test_pred_sgd_best = sgd_best.predict(X_test_full)

# Test-set error of the best model
mse_test_sgd_best = mean_squared_error(y_test_full, y_test_pred_sgd_best)
print("Test MSE with best parameters:", mse_test_sgd_best)

We set several candidate values for the learning rate and the number of iterations, and used a grid search to iterate over all possible parameter combinations. We selected the combination that minimized the cross-validation error as the optimal parameters and trained a new stochastic gradient descent model with them.
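
For reference, the same search can be written more compactly with scikit-learn's GridSearchCV, which runs the cross-validation loop internally. The snippet below is only a sketch of the equivalent call using the same parameter ranges, not what was run above:

# Equivalent grid search using GridSearchCV (sketch)
from sklearn.model_selection import GridSearchCV

param_grid = {
    "eta0": [0.001, 0.01, 0.1],
    "max_iter": [500, 1000, 2000],
}
grid = GridSearchCV(
    SGDRegressor(tol=1e-3, learning_rate="invscaling"),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X_train_full, y_train_full)
print(grid.best_params_, -grid.best_score_)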

Results

The stochastic gradient descent model trained with the optimal parameters performs similarly to ordinary least squares and gradient descent on the test set. This shows that, after hyperparameter tuning, stochastic gradient descent is comparable to the other methods.
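
As an additional sanity check (a sketch, not part of the experiment above), the coefficients learned by OLS and by the tuned SGD model can be placed side by side. Since the two models were fit on different splits, only the overall pattern is meaningful, but the signs and rough magnitudes of the coefficients should agree:

# Compare learned coefficients side by side (sketch; the models were fit on different splits)
coef_comparison = pd.DataFrame(
    {"OLS": lr.coef_, "SGD (tuned)": sgd_best.coef_},
    index=X_train.columns,
)
print(coef_comparison)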

Summary

In this blog, we explore the application of linear regression and stochastic gradient descent methods to regression problems using the prostate cancer dataset. We used libraries such as scikit-learn, pandas, and NumPy for data preprocessing, model training, and evaluation. We compared the performance of different methods by calculating the mean squared error and plotting the residuals. Finally, we tuned the hyperparameters of the stochastic gradient descent method using cross-validation and grid search methods to improve the model's performance on the test set.

This experiment shows how to use Python and machine learning libraries to solve a real regression problem, and offers some suggestions for choosing appropriate methods and tuning hyperparameters. I hope this blog helps you solve similar problems in practical applications.
