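Table of contents
1. Experiment introduction
2. Experimental environment
   1. Configure the virtual environment
   2. Library version introduction
   3. IDE
3. Experimental content
   0. Import necessary tools
   1. Read data
   2. Divide training set and test set
   3. Perform HSIC LASSO feature selection
   4. Feature extraction
   5. Classification using Random Forest (using all features)
   6. Classification using Random Forest (features selected using HSIC)
   7. Code integration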
1. Experiment introduction
This experiment implements feature selection with HSIC LASSO (Hilbert-Schmidt Independence Criterion LASSO) and then uses a random forest classifier to classify samples using the selected feature subset.
Feature selection is an important task in machine learning: it can improve model performance, reduce computational overhead, and help us understand the key features of the data.
HSIC LASSO is a kernel-based feature selection method. It uses the HSIC independence measure to find non-redundant features that have strong statistical dependence on the output value.
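To make the independence measure concrete, here is a minimal NumPy sketch of the biased empirical HSIC estimate, trace(KHLH)/(n-1)^2, with Gaussian kernels. It is an illustration only: the function names, the fixed kernel width sigma=1.0, and the synthetic data are assumptions of this sketch, and the computation inside pyHSICLasso may differ.

```python
import numpy as np


def gaussian_kernel(v, sigma=1.0):
    """Gaussian (RBF) kernel matrix for a 1-D sample vector v."""
    diff = v[:, None] - v[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))


def empirical_hsic(x, y, sigma=1.0):
    """Biased empirical HSIC estimate: trace(K H L H) / (n - 1)^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    K = gaussian_kernel(x, sigma)
    L = gaussian_kernel(y, sigma)
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2


rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_dependent = x ** 2 + 0.1 * rng.normal(size=200)  # nonlinear function of x
y_independent = rng.normal(size=200)               # unrelated to x

hsic_dep = empirical_hsic(x, y_dependent)
hsic_ind = empirical_hsic(x, y_independent)
print("dependent:  ", hsic_dep)
print("independent:", hsic_ind)
```

A variable that depends on x, even nonlinearly, should give a clearly larger value than an unrelated one, which is the property HSIC LASSO relies on when ranking features.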
2. Experimental environment
This series of experiments uses the PyTorch deep learning framework. The setup steps below build on the environment from the deep learning series of articles:
1. Configure the virtual environment
# Environment from the deep learning series of articles
conda create -n DL python=3.7
conda activate DL
pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
conda install matplotlib
conda install scikit-learn
# New additions for this experiment
conda install pandas
conda install seaborn
conda install networkx
conda install statsmodels
pip install pyHSICLasso
Note: In my environment the libraries were installed in the order shown above. Installing them all at once is untested and may cause problems.
2. Library version introduction
Package | Version used in this experiment | Latest version at the time of writing |
matplotlib | 3.5.3 | 3.8.0 |
numpy | 1.21.6 | 1.26.0 |
python | 3.7.16 | |
scikit-learn | 0.22.1 | 1.3.0 |
torch | 1.8.1+cu102 | 2.0.1 |
torchaudio | 0.8.1 | 2.0.2 |
torchvision | 0.9.1+cu102 | 0.15.2 |
New
networkx | 2.6.3 | 3.1 |
pandas | 1.2.3 | 2.1.1 |
pyHSICLasso | 1.4.2 | 1.4.2 |
seaborn | 0.12.2 | 0.13.0 |
statsmodels | 0.13.5 | 0.14.0 |
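To compare your own environment with the table above, a small helper along the following lines can list the installed versions. This is a sketch using only the standard library; the helper name installed_versions is made up for illustration, and the fallback branch is there because importlib.metadata is only available from Python 3.8, while this experiment uses Python 3.7.

```python
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:  # Python 3.7, as used in this experiment
    from pkg_resources import DistributionNotFound, get_distribution

    PackageNotFoundError = DistributionNotFound

    def version(name):
        return get_distribution(name).version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


packages = [
    "matplotlib", "numpy", "scikit-learn", "torch", "torchaudio", "torchvision",
    "networkx", "pandas", "pyHSICLasso", "seaborn", "statsmodels",
]
for name, ver in installed_versions(packages).items():
    print(f"{name:15s} {ver or 'not installed'}")
```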
3. IDE
PyCharm is recommended. (In my setup the pyHSICLasso library raises an error in VS Code, and I have not yet found a solution.)
3. Experimental content
0. Import necessary tools
import random
import pandas as pd
from pyHSICLasso import HSICLasso
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
1. Read data
data = pd.read_csv("cancer_subtype.csv")
x = data.iloc[:, :-1]
y = data.iloc[:, -1]
2. Divide training set and test set
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10)
Split the data set into a training set (X_train and y_train) and a test set (X_test and y_test). The test set accounts for 30% of the total data, and random_state=10 makes the split reproducible.
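The effect of test_size and random_state can be checked on a small synthetic array. The data below is a made-up stand-in for cancer_subtype.csv, so only the behaviour of train_test_split carries over, not the numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100 samples, 5 features, 2 classes.
X_demo = np.arange(500).reshape(100, 5)
y_demo = np.array([0, 1] * 50)

split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=10)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=10)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 70 training rows and 30 test rows

# Calling the function again with the same random_state gives the same partition.
print(np.array_equal(split_a[1], split_b[1]))
```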
3. Perform HSIC LASSO feature selection
random.seed(1)
le = LabelEncoder()
hsic = HSICLasso()
y_hsic = le.fit_transform(y_train)
x_hsic, fea_n = X_train.to_numpy(), X_train.columns.tolist()
hsic.input(x_hsic, y_hsic, featname=fea_n)
hsic.classification(200)
genes = hsic.get_features()
score = hsic.get_index_score()
res = pd.DataFrame([genes, score]).T
- Set a random seed to make the random process repeatable.
- Use LabelEncoder to convert the target variable into numeric form.
- Feature selection is performed by feeding the training set data X_train and the encoded labels y_hsic into the HSIC LASSO model.
  - hsic.input sets the input data and feature names.
  - hsic.classification runs the HSIC LASSO algorithm for feature selection.
  - The selected features are saved in genes.
  - The corresponding feature scores are saved in score.
- genes and score are stored in the DataFrame res.
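The label-encoding step can be tried on its own. The sketch below uses a few subtype names borrowed from the comment in the integrated code; the short list itself is invented for illustration and is not taken from the data set.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["GS", "CIN", "normal", "CIN", "EBV"]  # illustrative labels only

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

print(le_demo.classes_.tolist())  # classes are sorted: ['CIN', 'EBV', 'GS', 'normal']
print(encoded.tolist())           # [2, 0, 3, 0, 1]

# inverse_transform maps the integer codes back to the original names.
decoded = le_demo.inverse_transform(encoded).tolist()
print(decoded)
```

Keeping the fitted encoder around is useful if the numeric codes ever need to be mapped back to subtype names.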
4. Feature extraction
hsic_x_train = X_train[res[0]]
hsic_x_test = X_test[res[0]]
Using the feature names selected by HSIC LASSO (column 0 of res), the corresponding feature subsets are extracted from the original training set X_train and test set X_test, and stored in hsic_x_train and hsic_x_test, respectively.
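The column selection works because res[0] holds feature names that match the DataFrame's column labels. A toy version, with made-up gene names and scores standing in for the output of get_features() and get_index_score(), looks like this:

```python
import pandas as pd

# Toy expression table; gene names and values are invented for illustration.
X_demo = pd.DataFrame({
    "geneA": [1.0, 2.0],
    "geneB": [3.0, 4.0],
    "geneC": [5.0, 6.0],
})

# Hypothetical stand-ins for the selected feature names and their scores.
genes_demo = ["geneC", "geneA"]
score_demo = [1.0, 0.4]
res_demo = pd.DataFrame([genes_demo, score_demo]).T

# Indexing with the name column keeps only the selected features, in that order.
subset = X_demo[res_demo[0]]
print(subset.columns.tolist())  # ['geneC', 'geneA']
print(subset.shape)             # (2, 2)
```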
5. Classification using Random Forest (using all features)
rf_model = RandomForestClassifier(n_estimators=20)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
print("RF all feature")
print(confusion_matrix(y_test, rf_pred))
print(classification_report(y_test, rf_pred, digits=5))
Use a random forest classifier (RandomForestClassifier) to train on the training set with all features and make predictions on the test set. The prediction results are stored in rf_pred, and the confusion matrix and classification report are printed.
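For readers unfamiliar with the printed matrix: in scikit-learn, rows correspond to the true classes and columns to the predicted classes. The tiny example below uses invented labels, not results from this experiment.

```python
from sklearn.metrics import confusion_matrix

y_true = ["CIN", "CIN", "GS", "GS", "GS"]  # invented ground truth
y_pred = ["CIN", "GS", "GS", "GS", "CIN"]  # invented predictions

cm = confusion_matrix(y_true, y_pred, labels=["CIN", "GS"])
print(cm)
# [[1 1]   true CIN: 1 predicted as CIN, 1 predicted as GS
#  [1 2]]  true GS:  1 predicted as CIN, 2 predicted as GS
```

The diagonal holds the correctly classified samples, so a good classifier concentrates its counts there.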
6. Classification using Random Forest (features selected using HSIC)
rf_hsic_model = RandomForestClassifier(n_estimators=20)
rf_hsic_model.fit(hsic_x_train, y_train)
rf_hsic_pred = rf_hsic_model.predict(hsic_x_test)
print("RF HSIC feature")
print(confusion_matrix(y_test, rf_hsic_pred))
print(classification_report(y_test, rf_hsic_pred, digits=5))
A random forest classifier is trained on the feature subset hsic_x_train selected using HSIC LASSO, and predictions are made on the corresponding feature subset hsic_x_test of the test set. The prediction results are stored in rf_hsic_pred, and the confusion matrix and classification report are printed.
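Because both runs train a random forest without a fixed seed, their scores can change from run to run; random.seed only seeds Python's built-in generator, not scikit-learn's. One possible way to make the all-features versus subset comparison repeatable is to pass random_state to the classifier, as in the sketch below. The helper fit_and_score, the synthetic data, and the use of the first five columns as the "selected" subset are all assumptions for illustration, not part of the original experiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def fit_and_score(X_tr, y_tr, X_te, y_te, seed=0):
    """Train a 20-tree random forest with a fixed seed and return test accuracy."""
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


# With shuffle=False the informative features occupy the first columns.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=30, n_informative=5, shuffle=False, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=10
)

acc_all = fit_and_score(X_tr, y_tr, X_te, y_te)
acc_sub = fit_and_score(X_tr[:, :5], y_tr, X_te[:, :5], y_te)
print("all features:    ", acc_all)
print("first 5 features:", acc_sub)
```

With the seed fixed, any difference between the two scores comes from the feature set rather than from the forest's randomness.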
7. Code integration
# HSIC LASSO
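# HSIC stands for "Hilbert-Schmidt independence criterion". Like mutual information, it can be used to measure the independence between two variables.
# With a suitable choice of kernel function, a kernel-based independence measure such as HSIC can find non-redundant features that have strong statistical dependence on the output value.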
# CIN 107 EBV 23 GS 50 MSI 47 normal 33
import random
import pandas as pd
from pyHSICLasso import HSICLasso
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
data = pd.read_csv("cancer_subtype.csv")
x = data.iloc[:, :-1]
y = data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10)
random.seed(1)
le = LabelEncoder()
hsic = HSICLasso()
y_hsic = le.fit_transform(y_train)
x_hsic, fea_n = X_train.to_numpy(), X_train.columns.tolist()
hsic.input(x_hsic, y_hsic, featname=fea_n)
hsic.classification(200)
genes = hsic.get_features()
score = hsic.get_index_score()
res = pd.DataFrame([genes, score]).T
hsic_x_train = X_train[res[0]]
hsic_x_test = X_test[res[0]]
rf_model = RandomForestClassifier(n_estimators=20)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
print("RF all feature")
print(confusion_matrix(y_test, rf_pred))
print(classification_report(y_test, rf_pred, digits=5))
rf_hsic_model = RandomForestClassifier(n_estimators=20)
rf_hsic_model.fit(hsic_x_train, y_train)
rf_hsic_pred = rf_hsic_model.predict(hsic_x_test)
print("RF HSIC feature")
print(confusion_matrix(y_test, rf_hsic_pred))
print(classification_report(y_test, rf_hsic_pred, digits=5))