[Bioinformatics] Feature selection using HSIC LASSO method

Table of contents

1. Experiment introduction

2. Experimental environment

1. Configure the virtual environment

2. Library version introduction

3. IDE

3. Experimental content

0. Import necessary tools

1. Read data

2. Divide training set and test set

3. Perform HSIC LASSO feature selection

4. Feature extraction

5. Classification using Random Forest (using all features)

6. Classification using Random Forest (features selected using HSIC)

7. Code integration


1. Experiment introduction

        This experiment implements feature selection with the HSIC LASSO (Hilbert-Schmidt independence criterion LASSO) method, and uses a random forest classifier to classify the selected feature subset.

        Feature selection is one of the important tasks in machine learning: it can improve model performance, reduce computational overhead, and help us understand which features of the data actually matter.

        HSIC LASSO is a kernel-based method that uses an independence measure to find non-redundant features with strong statistical dependence on the output values.
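        To make the criterion concrete, the following is a minimal illustrative sketch of the empirical HSIC statistic between two variables using Gaussian kernels. pyHSICLasso computes this internally; the snippet is not part of the experiment and is only meant to convey the idea:

import numpy as np

def gaussian_kernel(a, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel matrix for a 1-D sample vector
    d = a[:, None] - a[None, :]
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    # Empirical HSIC: trace(K H L H) / (n - 1)^2, where H centers the kernel matrices
    n = len(x)
    K, L = gaussian_kernel(x, sigma), gaussian_kernel(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=200)
print(hsic(x, x ** 2))                 # dependent pair -> clearly positive
print(hsic(x, rng.normal(size=200)))   # independent pair -> close to zero

        A large HSIC value indicates strong statistical dependence between a candidate feature and the output, which is the quantity HSIC LASSO tries to keep while penalizing redundant features.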

2. Experimental environment

    This series of experiments uses the PyTorch deep learning framework; the relevant setup is as follows (building on the environment from the deep learning series of articles):

1. Configure the virtual environment

Base environment (from the deep learning series of articles)

conda create -n DL python=3.7 
conda activate DL
pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
conda install matplotlib
conda install scikit-learn

New addition

conda install pandas
conda install seaborn
conda install networkx
conda install statsmodels
pip install pyHSICLasso

Note: In my environment the libraries were installed in the order above. You can try installing them all at once, but I have not verified that this works without conflicts.

2. Library version introduction

Package        This experiment    Latest at time of writing
matplotlib     3.5.3              3.8.0
numpy          1.21.6             1.26.0
python         3.7.16             -
scikit-learn   0.22.1             1.3.0
torch          1.8.1+cu102        2.0.1
torchaudio     0.8.1              2.0.2
torchvision    0.9.1+cu102        0.15.2

New

Package        This experiment    Latest at time of writing
networkx       2.6.3              3.1
pandas         1.2.3              2.1.1
pyHSICLasso    1.4.2              1.4.2
seaborn        0.12.2             0.13.0
statsmodels    0.13.5             0.14.0
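
After installation, a quick check can confirm that the versions match the tables above (a minimal sketch; pkg_resources ships with setuptools):

import matplotlib, numpy, sklearn, torch, pandas, seaborn, networkx, statsmodels
import pkg_resources

# Print the installed version of each library used in this experiment
for m in (matplotlib, numpy, sklearn, torch, pandas, seaborn, networkx, statsmodels):
    print(m.__name__, m.__version__)
print("pyHSICLasso", pkg_resources.get_distribution("pyHSICLasso").version)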

3. IDE

        PyCharm is recommended (the pyHSICLasso library throws an error under VS Code, and I have not yet found a solution...).

Win11: install Anaconda (2022.10) + PyCharm (2022.3/2023.1.4) and configure the virtual environment - QomolangmaH's blog (CSDN): https://blog.csdn.net/m0_63834988/article/details/128693741

3. Experimental content

0. Import necessary tools

import random
import pandas as pd
from pyHSICLasso import HSICLasso
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier

1. Read data

data = pd.read_csv("cancer_subtype.csv")
x = data.iloc[:, :-1]  # all columns except the last: gene expression features
y = data.iloc[:, -1]   # last column: the cancer subtype label
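
        A quick look at the loaded data confirms the expected layout (samples as rows, gene features in all but the last column, the subtype label in the last column); this check is an optional addition:

print(data.shape)        # (number of samples, number of genes + 1 label column)
print(y.value_counts())  # samples per subtype (CIN, EBV, GS, MSI, normal)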

2. Divide training set and test set

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10)

        Divide the dataset into a training set (X_train, y_train) and a test set (X_test, y_test), with the test set making up 30% of the data.
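
        Since the subtype classes are imbalanced (see the class counts noted in the integrated code below), a stratified split can keep the class proportions equal in both sets. A possible variant, not used in the original experiment:

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=10, stratify=y
)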

3. Perform HSIC LASSO feature selection

random.seed(1)
le = LabelEncoder()
y_hsic = le.fit_transform(y_train)                            # encode subtype labels as integers
x_hsic, fea_n = X_train.to_numpy(), X_train.columns.tolist()  # data matrix and feature names
hsic = HSICLasso()                                            # create the HSIC LASSO model
hsic.input(x_hsic, y_hsic, featname=fea_n)
hsic.classification(200)                                      # select the top 200 features
genes = hsic.get_features()
score = hsic.get_index_score()
res = pd.DataFrame([genes, score]).T
  • Set a random seed to make the random steps reproducible.
  • Use LabelEncoder to convert the target variable into numeric form.
  • Feature selection is performed by feeding the training data X_train and the encoded labels y_hsic into the HSIC LASSO model:
    • hsic.input sets the input data and feature names;
    • hsic.classification runs the HSIC LASSO algorithm for feature selection;
      • the selected features are saved in genes;
      • the corresponding feature scores are saved in score;
    • genes and score are then stored in the DataFrame res (an inspection example follows this list).
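
        To inspect the selection, pyHSICLasso's dump() method prints the ranked features with their scores, and the DataFrame can be previewed directly (an optional addition to the original code):

hsic.dump()          # print the ranked features and their normalized scores
print(res.head(10))  # first ten selected features with their HSIC scores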

4. Feature extraction

hsic_x_train = X_train[res[0]]
hsic_x_test = X_test[res[0]]

        Using the features selected by HSIC LASSO (the names stored in res[0]), the corresponding feature subsets are extracted from the original training set X_train and test set X_test and stored in hsic_x_train and hsic_x_test respectively.

5. Classification using Random Forest (using all features)

rf_model = RandomForestClassifier(n_estimators=20)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
print("RF all feature")
print(confusion_matrix(y_test, rf_pred))
print(classification_report(y_test, rf_pred, digits=5))

        A random forest classifier (RandomForestClassifier) is trained on the training set with all features and makes predictions on the test set. The predictions are stored in rf_pred, and the confusion matrix and classification report are printed.
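
        For a single summary number to compare against the HSIC-selected model in the next step, the accuracy can also be printed (a small addition, not in the original code):

from sklearn.metrics import accuracy_score

print("RF all-feature accuracy:", accuracy_score(y_test, rf_pred))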

6. Classification using Random Forest (features selected using HSIC)

rf_hsic_model = RandomForestClassifier(n_estimators=20)
rf_hsic_model.fit(hsic_x_train, y_train)
rf_hsic_pred = rf_hsic_model.predict(hsic_x_test)
print("RF HSIC feature")
print(confusion_matrix(y_test, rf_hsic_pred))
print(classification_report(y_test, rf_hsic_pred, digits=5))

        A random forest classifier is trained on the HSIC LASSO feature subset hsic_x_train and makes predictions on the corresponding test subset hsic_x_test. The predictions are stored in rf_hsic_pred, and the confusion matrix and classification report are printed.
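
        Putting the two models side by side makes the effect of feature selection explicit; a minimal sketch, assuming both models from the previous steps are still in scope:

from sklearn.metrics import accuracy_score

print("all features :", accuracy_score(y_test, rf_pred))
print("HSIC features:", accuracy_score(y_test, rf_hsic_pred))
print("feature count:", X_train.shape[1], "->", hsic_x_train.shape[1])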

7. Code integration

# HSIC LASSO
# HSIC stands for "Hilbert-Schmidt independence criterion". Like mutual information,
# it can be used to measure the independence between two variables.
# With a suitable choice of kernel, kernel-based independence measures such as HSIC
# can find non-redundant features with strong statistical dependence on the output values.
# Class counts: CIN 107, EBV 23, GS 50, MSI 47, normal 33
import random
import pandas as pd
from pyHSICLasso import HSICLasso
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv("cancer_subtype.csv")
x = data.iloc[:, :-1]
y = data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10)

random.seed(1)

le = LabelEncoder()
hsic = HSICLasso()
y_hsic = le.fit_transform(y_train)
x_hsic, fea_n = X_train.to_numpy(), X_train.columns.tolist()


hsic.input(x_hsic, y_hsic, featname=fea_n)
hsic.classification(200)
genes = hsic.get_features()
score = hsic.get_index_score()
res = pd.DataFrame([genes, score]).T

hsic_x_train = X_train[res[0]]
hsic_x_test = X_test[res[0]]


rf_model = RandomForestClassifier(n_estimators=20)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
print("RF all feature")
print(confusion_matrix(y_test, rf_pred))
print(classification_report(y_test, rf_pred, digits=5))


rf_hsic_model = RandomForestClassifier(n_estimators=20)
rf_hsic_model.fit(hsic_x_train, y_train)
rf_hsic_pred = rf_hsic_model.predict(hsic_x_test)
print("RF HSIC feature")
print(confusion_matrix(y_test, rf_hsic_pred))
print(classification_report(y_test, rf_hsic_pred, digits=5))
