The First World Scientific Intelligence Competition: Life Science Track - Biological Age Assessment and Age-Related Disease Risk Prediction (First Notes)

1. Background introduction

This note refers to Datawhale AI Summer Camp (Third Period) - AI for Science Life Science Track Study Manual

The full title of this competition is [The First World Science and Intelligence Competition: Life Science Track - Biological Age Assessment and Age-Related Disease Risk Prediction] Organizer:

[Contest title background]
Biological age assessment is a method of assessing an individual’s physiological age and health status by measuring and analyzing the status of specific indicators or physiological processes in organisms. Biological age provides a more accurate assessment of health and prediction of disease risk than traditional calendar age. With the continuous development of AI technology, the integration and development of computing science and life science will provide cutting-edge ideas and methods for the development of health management applications, research on aging mechanisms, and research and development of anti-aging drugs. Based on the above background, a competition for predicting biological age based on methylation measurement data is held.

[Event Arrangement]
https://tianchi.aliyun.com/s/6a1351ecd2a3987995a7bda7f62542d2

[Competition Tasks]
This competition provides methylation data for healthy people and patients with age-related diseases. Participants build prediction models by analyzing the patterns and characteristics of the methylation data, and then predict a person's biological age from their methylation data.

[Competition Question Data Set]
The public data contains 10,296 samples, of which 7,833 samples are healthy samples. Each sample provides methylation data, age and disease status of 485,512 sites, that is, 485,512 features.

Training strategy: Extract 80% as training samples and 20% as test samples.

Taking the training set as an example, it includes a total of 8233 samples, of which 6266 are healthy samples, and the rest are diseased samples, involving Alzheimer's disease, schizophrenia, Parkinson's disease, rheumatoid arthritis, stroke, Huntington's disease, Graves' disease, type 2 diabetes, Sjogren's syndrome and other types.

[Evaluation Indicators]
This task uses multiple indicators for evaluation. There are differences in the evaluation indicators between the preliminary round and the semi-finals. In the preliminary round, two indicators (MAE of healthy samples and MAE of diseased samples) are calculated and averaged to obtain the final score.
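
The preliminary-round score described above can be sketched as follows. This is a minimal sketch based only on the text description (the official formula is not reproduced in this note); the function name preliminary_score and the boolean is_healthy flag are assumptions made for illustration, and the ages are made-up toy values.

```python
import numpy as np

def preliminary_score(y_true, y_pred, is_healthy):
    """Average of the MAE on healthy samples and the MAE on diseased samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    is_healthy = np.asarray(is_healthy, dtype=bool)
    abs_err = np.abs(y_true - y_pred)
    mae_healthy = abs_err[is_healthy].mean()    # MAE over healthy samples only
    mae_diseased = abs_err[~is_healthy].mean()  # MAE over diseased samples only
    return (mae_healthy + mae_diseased) / 2.0

# Toy example with made-up ages (not competition data)
y_true = [30.0, 40.0, 50.0, 60.0]
y_pred = [32.0, 38.0, 55.0, 60.0]
is_healthy = [True, True, False, False]
score = preliminary_score(y_true, y_pred, is_healthy)
print(score)
```

Because the two groups are averaged with equal weight, the smaller diseased group counts as much as the larger healthy group, which differs from a plain MAE over all samples.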

[Problem-solving ideas]
The preliminary task of this competition is to predict the age of each sample, which is a typical regression problem.

  • The input data is the methylation data and disease status of 485512 sites corresponding to each sample.
  • The output is the age corresponding to the sample.

The feature dimensionality is very high (485,512 dimensions), so basic feature selection is worth considering, for example by coverage, correlation, or feature importance. Alternatively, use only a subset of the features to get the whole pipeline running first, and then consider how to add more features later, which is also a good choice.
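
A minimal sketch of the coverage and correlation ideas, assuming the features are held in a pandas DataFrame and age in a Series. The function name select_features, the thresholds, and the site_* column names are hypothetical; the data is synthetic.

```python
import numpy as np
import pandas as pd

def select_features(X, y, max_missing=0.2, top_k=2):
    """Keep well-covered features, then rank them by |Pearson correlation| with y."""
    missing_rate = X.isnull().mean()                       # fraction of missing values per feature
    covered = missing_rate[missing_rate <= max_missing].index
    corr = X[covered].corrwith(y).abs()                    # absolute correlation with the target
    return corr.sort_values(ascending=False).head(top_k).index.tolist()

# Synthetic data for illustration only
rng = np.random.default_rng(0)
age = pd.Series(rng.uniform(20, 80, size=200))
X = pd.DataFrame({
    'site_a': age / 100 + rng.normal(0, 0.01, 200),   # strongly age-related
    'site_b': rng.uniform(0, 1, 200),                 # pure noise
    'site_c': -age / 100 + rng.normal(0, 0.05, 200),  # age-related, but noisier
    'site_d': age / 100,                              # informative, but mostly missing
})
X.loc[X.index[:150], 'site_d'] = np.nan               # 75% missing -> fails the coverage filter
selected = select_features(X, age)
print(selected)
```

Selecting features on the full training set and then cross-validating on the same data can make validation scores look better than they are, so in practice the selection step is usually repeated inside each fold.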

As for the model, you can choose either a machine learning model or a deep learning model. Tree-based machine learning models such as XGBoost handle missing values natively and are insensitive to feature scaling, so missing-value filling and standardization are not required, and the results are relatively stable. A deep learning model does require missing-value filling and standardization, and building the network also takes more experimentation.
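
For the deep learning route, the two preprocessing steps mentioned above could look like the sketch below. It is only an illustration with NumPy on toy arrays; the function name impute_and_standardize and the choice of median imputation are assumptions, not part of the baseline.

```python
import numpy as np

def impute_and_standardize(train, test):
    """Median-impute missing values and z-score both sets using training statistics only."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    medians = np.nanmedian(train, axis=0)                # per-feature median, ignoring NaN
    train_f = np.where(np.isnan(train), medians, train)
    test_f = np.where(np.isnan(test), medians, test)     # test uses the training medians
    mean = train_f.mean(axis=0)
    std = train_f.std(axis=0)
    std[std == 0] = 1.0                                  # avoid division by zero for constant features
    return (train_f - mean) / std, (test_f - mean) / std

train = np.array([[0.1, 0.5], [0.3, np.nan], [0.5, 0.9]])
test = np.array([[np.nan, 0.7]])
train_s, test_s = impute_and_standardize(train, test)
print(train_s)
print(test_s)
```

The statistics are computed on the training data and reused for the test data so that no information from the test set leaks into preprocessing.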

In view of the above comparison, our Baseline chooses to use machine learning methods. When solving machine learning problems, we generally follow the following process:

2. Run through the baseline

2.1 Environment configuration

Because the data set is very large, we use the Alibaba Cloud environment. The deployment tutorial is here:

https://datawhaler.feishu.cn/docx/GIr5dWijEoGWRJxzSeCcZFmgnAe?from=from_copylink

2.2 baseline analysis

Import library
import numpy as np
import pandas as pd
import polars as pl
from collections import defaultdict, Counter
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor
from sklearn.model_selection import StratifiedKFold, KFold, GroupKFold
from sklearn.metrics import mean_squared_log_error, mean_absolute_error  # mean_absolute_error is used in catboost_model
import sys, os, gc, argparse, warnings

This part of the code imports the libraries used for data processing, model training and evaluation: NumPy, Pandas, Polars, collections, XGBoost, LightGBM, CatBoost, the cross-validation and metrics modules of Scikit-learn, and standard-library modules for handling command-line arguments and warning messages.

This is a function used to reduce the memory usage of DataFrame. It traverses each column of the DataFrame and converts the data type of the column according to the numeric type, thereby reducing memory usage
# Reduce the memory usage of a DataFrame by downcasting its numeric columns
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64'] # numeric dtypes to consider
    start_mem = df.memory_usage().sum() / 1024**2    # initial memory usage in MB
    for col in df.columns: # iterate over every column
        col_type = df[col].dtypes # dtype of the current column
        if col_type in numerics: # only numeric columns are downcast
            c_min = df[col].min() # minimum value of the column
            c_max = df[col].max() # maximum value of the column
            if str(col_type)[:3] == 'int': # integer column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max: # values fit in int8
                    df[col] = df[col].astype(np.int8) # downcast to int8
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max: # values fit in int16
                    df[col] = df[col].astype(np.int16) # downcast to int16
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max: # values fit in int32
                    df[col] = df[col].astype(np.int32) # downcast to int32
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max: # values fit in int64
                    df[col] = df[col].astype(np.int64) # keep as int64
            else: # floating-point column
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max: # range fits in float16
                    df[col] = df[col].astype(np.float16) # downcast to float16 (range is checked, precision is reduced)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max: # range fits in float32
                    df[col] = df[col].astype(np.float32) # downcast to float32
                else: # range fits neither float16 nor float32
                    df[col] = df[col].astype(np.float64) # keep as float64
    end_mem = df.memory_usage().sum() / 1024**2 # memory usage in MB after downcasting
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem)) # report the reduction
    return df # return the downcast DataFrame
1. Read data

Data reading and preprocessing are crucial steps when it comes to data science projects, as the quality and format of the data have a significant impact on subsequent feature engineering and model training. The following is a detailed analysis of the code and knowledge points of the data reading and preprocessing part:

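# Read the data
path = 'ai4bio' # path to the data set
# Memory limits may make the full files hard to load; you can drop some features by reading only part of the rows. The baseline reads only the first 10000 rows.
# Read as much data as your compute resources allow
traindata = pd.read_csv(f'{path}/traindata.csv', nrows=10000) # training data
trainmap = pd.read_csv(f'{path}/trainmap.csv') # mapping information for the training data

testdata = pd.read_csv(f'{path}/ai4bio_testset_final/testdata.csv', nrows=10000) # test data
testmap = pd.read_csv(f'{path}/ai4bio_testset_final/testmap.csv') # mapping information for the test data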

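The nrows parameter limits the number of rows read to 10,000. Each row of the raw file is one CpG site (the 'cpgsite' column used in the preprocessing step), so this keeps all samples but only the first 10,000 features. This is done for quick testing with limited resources.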
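
A small sketch of how partial reading behaves, using an in-memory CSV in place of the real files. The sample column names s1 to s3 and the site IDs are made up; nrows and usecols are standard pd.read_csv parameters.

```python
import io
import pandas as pd

# Tiny stand-in for traindata.csv: one row per CpG site, one column per sample
csv_text = (
    "cpgsite,s1,s2,s3\n"
    "cg0001,0.10,0.20,0.30\n"
    "cg0002,0.40,0.50,0.60\n"
    "cg0003,0.70,0.80,0.90\n"
)

# nrows limits the number of CpG sites (features) that are read
head = pd.read_csv(io.StringIO(csv_text), nrows=2)

# usecols additionally limits which samples (columns) are read
subset = pd.read_csv(io.StringIO(csv_text), nrows=2, usecols=['cpgsite', 's1', 's3'])

print(head.shape, subset.shape)
```

Reading the first rows is the quickest option, but those sites are not necessarily the most informative ones; it is a shortcut for getting the pipeline running rather than a feature selection method.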

Data compression (optional)
# traindata = reduce_mem_usage(traindata)
# testdata = reduce_mem_usage(testdata)

This part of the code shows how to use the previously defined reduce_mem_usage function to reduce the memory usage of the DataFrame. By converting data types to appropriate types, the memory footprint of the data set can be reduced, thereby improving processing efficiency.
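
To get a feel for the saving, the sketch below downcasts a synthetic float64 table to float32 and measures both the memory and the rounding error. The table size and values are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(0, 1, size=(1000, 20)))    # float64 by default

before = df.memory_usage(index=False).sum()              # bytes used by the data columns
df32 = df.astype(np.float32)
after = df32.memory_usage(index=False).sum()

# Largest absolute change introduced by the downcast
max_err = float((df - df32.astype(np.float64)).abs().to_numpy().max())
print(before, after, max_err)
```

Note that reduce_mem_usage checks only the value range, so values between 0 and 1 end up as float16, which keeps roughly three significant decimal digits. If that precision loss matters for your model, downcasting to float32 instead is a safer compromise.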

2. Data preprocessing
traindata = traindata.set_index('cpgsite') # use the 'cpgsite' column as the index
traindata = traindata.T # transpose so that each row is a sample
traindata = traindata.reset_index() # move the sample IDs from the index into a column
traindata = traindata.rename(columns={'index':'sample_id'}) # name that column 'sample_id'
traindata.columns = ['sample_id'] + [i for i in range(10000)] # relabel the feature columns as 0..9999
traindata.to_pickle(f'{path}/traindata.pkl') # save the processed training data as a pickle file

testdata = testdata.set_index('cpgsite') # use the 'cpgsite' column as the index
testdata = testdata.T # transpose so that each row is a sample
testdata = testdata.reset_index() # move the sample IDs from the index into a column
testdata = testdata.rename(columns={'index':'sample_id'}) # name that column 'sample_id'
testdata.columns = ['sample_id'] + [i for i in range(10000)] # relabel the feature columns as 0..9999
testdata.to_pickle(f'{path}/testdata.pkl') # save the processed test data as a pickle file

First, set the index of the training data to the 'cpgsite' column and transpose the data, so that each row is a sample and each column is a CpG site.
Then reset the index, which turns the sample IDs into a regular column, and rename that column to 'sample_id'.
Finally, relabel the columns as 'sample_id' followed by the integers 0 to 9999, one for each of the first 10,000 features. The test data is processed in the same way.
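
The same reshaping steps on a toy table make the effect easier to see. The site and sample names are made up, and the number of feature columns is taken from the data rather than hard-coded, which is a small deviation from the baseline.

```python
import pandas as pd

# Raw layout: one row per CpG site, one column per sample
raw = pd.DataFrame({
    'cpgsite': ['cg0001', 'cg0002', 'cg0003'],
    'sample_A': [0.1, 0.4, 0.7],
    'sample_B': [0.2, 0.5, 0.8],
})

# Transposed layout: one row per sample, one column per CpG site
wide = raw.set_index('cpgsite').T.reset_index().rename(columns={'index': 'sample_id'})
wide.columns = ['sample_id'] + list(range(raw.shape[0]))  # features relabelled 0, 1, 2
print(wide)
```

Using raw.shape[0] instead of a fixed 10000 keeps the relabelling correct if a different number of rows is read.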

Data quality analysis
for i in range(10):
    null_cnt = traindata[i].isnull().sum() / traindata.shape[0]
    print(f'Feature {i}: missing rate {null_cnt}')

This code is used to analyze the missing rate of the first 10 column features. By looping through each feature, counting the number of null values, and then dividing by the number of rows in the dataset, we get the feature's missing rate.
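
The same statistic can be computed for all columns at once, since the mean of a boolean mask is the fraction of True values. A minimal sketch on a toy frame; the 0.3 threshold is an arbitrary illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    0: [0.1, np.nan, 0.3, np.nan],
    1: [0.5, 0.6, 0.7, 0.8],
    2: [np.nan, 0.2, 0.3, 0.4],
})

missing_rate = df.isnull().mean()                                 # missing rate of every column
high_missing = missing_rate[missing_rate > 0.3].index.tolist()    # columns above the threshold
print(missing_rate)
print(high_missing)
```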

Correlation coefficient matrix
traindata[[i for i in range(1000)]].corr()

This code calculates the correlation coefficient matrix between the first 1000 columns of features in the dataset. The correlation coefficient matrix can help analyze the relationship between features, thereby providing guidance for feature selection and feature engineering.
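
One common way to use such a matrix for feature selection is to flag features that are almost duplicates of an earlier feature. The sketch below is an illustration on synthetic data, not part of the baseline; the 0.95 threshold is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.uniform(0, 1, 300)
df = pd.DataFrame({
    0: base,
    1: base + rng.normal(0, 0.01, 300),   # near-duplicate of feature 0
    2: rng.uniform(0, 1, 300),            # independent feature
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(redundant)
```

For 1000 features the matrix has a million entries, which is manageable; for all 485,512 sites a full matrix would not fit in memory, so this kind of check is usually run on a pre-filtered subset.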

3. Data cleaning
Data splicing
traindata = traindata.merge(trainmap[['sample_id', 'age', 'gender', 'sample_type', 'disease']],on='sample_id',how='left')
testdata = testdata.merge(testmap[['sample_id', 'gender']],on='sample_id',how='left')

In this part of the code, the training and test data are merged with the mapping information. Use Pandas' merge function to merge based on the 'sample_id' column and add the features from the mapping information to the dataset.
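
With a left merge, samples that have no matching row in the mapping table silently receive missing values. A minimal sketch of how to check for that, using the validate and indicator options of DataFrame.merge on made-up data:

```python
import pandas as pd

data = pd.DataFrame({'sample_id': ['s1', 's2', 's3'], 0: [0.1, 0.2, 0.3]})
meta = pd.DataFrame({'sample_id': ['s1', 's2'], 'age': [35.0, 62.0]})

merged = data.merge(
    meta,
    on='sample_id',
    how='left',
    validate='one_to_one',   # raise if sample_id is duplicated on either side
    indicator=True,          # add a '_merge' column describing where each row came from
)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'sample_id'].tolist()
print(unmatched)
```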

# disease_mapping maps each disease name to an integer code, e.g. "Alzheimer's disease" -> 1 and "Parkinson's disease" -> 4.
# This kind of mapping encodes text category labels as numbers that can be fed to a model.
disease_mapping = {
    'control': 0,
    "Alzheimer's disease": 1,
    "Graves' disease": 2,
    "Huntington's disease": 3,
    "Parkinson's disease": 4,
    'rheumatoid arthritis': 5,
    'schizophrenia': 6,
    "Sjogren's syndrome": 7,
    'stroke': 8,
    'type 2 diabetes': 9
}
sample_type_mapping = {'control': 0, 'disease tissue': 1}
gender_mapping = {'F': 0, 'M': 1}

traindata['disease'] = traindata['disease'].map(disease_mapping)
traindata['sample_type'] = traindata['sample_type'].map(sample_type_mapping)
traindata['gender'] = traindata['gender'].map(gender_mapping)
testdata['gender'] = testdata['gender'].map(gender_mapping)

By creating a dictionary, text-based classification features (such as disease names, sample types, gender, etc.) are mapped into numerical labels to facilitate model training and processing.
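
One thing to keep in mind with Series.map and a dictionary is that any label missing from the dictionary becomes NaN without a warning. A quick check on toy data (the lowercase 'f' is a deliberately inconsistent label for illustration):

```python
import pandas as pd

gender_mapping = {'F': 0, 'M': 1}
gender = pd.Series(['F', 'M', 'f', 'M'])     # 'f' is not a key of the mapping

encoded = gender.map(gender_mapping)          # unmapped labels become NaN
unmapped = gender[encoded.isnull()].unique().tolist()
print(unmapped)
```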

4. Build features
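# Feature engineering
# For each sample (row) in traindata and testdata, compute the maximum, minimum, standard deviation, variance, skewness, mean and median across the first 10000 features, and store them in the 'max', 'min', 'std', 'var', 'skew', 'mean' and 'median' columns. These statistics summarize the distribution of each sample's methylation values.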
traindata['max'] = traindata[[i for i in range(10000)]].max(axis=1)
traindata['min'] = traindata[[i for i in range(10000)]].min(axis=1)
traindata['std'] = traindata[[i for i in range(10000)]].std(axis=1)
traindata['var'] = traindata[[i for i in range(10000)]].var(axis=1)
traindata['skew'] = traindata[[i for i in range(10000)]].skew(axis=1)
traindata['mean'] = traindata[[i for i in range(10000)]].mean(axis=1)
traindata['median'] = traindata[[i for i in range(10000)]].median(axis=1)

testdata['max'] = testdata[[i for i in range(10000)]].max(axis=1)
testdata['min'] = testdata[[i for i in range(10000)]].min(axis=1)
testdata['std'] = testdata[[i for i in range(10000)]].std(axis=1)
testdata['var'] = testdata[[i for i in range(10000)]].var(axis=1)
testdata['skew'] = testdata[[i for i in range(10000)]].skew(axis=1)
testdata['mean'] = testdata[[i for i in range(10000)]].mean(axis=1)
testdata['median'] = testdata[[i for i in range(10000)]].median(axis=1)

# Features fed to the model
cols = [i for i in range(10000)] + ['gender','max','min','std','var','skew','mean','median']
5. Model training and verification
# catboost_model trains CatBoost with K-fold cross-validation; it takes train_x, train_y, test_x and an optional seed
def catboost_model(train_x, train_y, test_x, seed = 2023):
    folds = 5  # number of cross-validation folds
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed) # shuffled K-fold splitter with a fixed random seed
    oof = np.zeros(train_x.shape[0]) # out-of-fold predictions, one per training sample
    test_predict = np.zeros(test_x.shape[0]) # fold-averaged predictions for the test set
    cv_scores = [] # validation score of each fold
    # Loop over the folds
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        # Print the number of the current fold
        print('************************************ {} ************************************'.format(str(i+1)))
        # Select this fold's training and validation features and labels by position
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y.iloc[train_index], train_x.iloc[valid_index], train_y.iloc[valid_index]
        # Parameters for CatBoostRegressor
        params = {
          'learning_rate': 0.1, # step size of each boosting iteration; larger values train faster but may overshoot, smaller values need more iterations
          'depth': 5,  # tree depth; deeper trees are more complex and may overfit, shallower trees may underfit
          'bootstrap_type':'Bernoulli', # sampling scheme; 'Bernoulli' picks each training object independently with a fixed probability
          'random_seed':2023, # random seed, so that repeated runs give reproducible results
          'od_type': 'Iter',  # overfitting detector type; 'Iter' stops training based on the number of iterations without improvement
          'od_wait': 100,  # number of iterations to continue after the best validation metric before stopping
          'allow_writing_files': False, # do not write log and snapshot files during training
          'task_type':"GPU",  # train on the GPU; set to "CPU" if no GPU is available
          'devices':'0:1' # GPU device IDs; in CatBoost's format a colon-separated list such as '0:1' selects devices 0 and 1, so use '0' for a single GPU
        }
        
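        # Create the CatBoostRegressor instance
        # Adjust iterations to your compute budget; on a V100, iterations=500 takes about 10 minutes
        model = CatBoostRegressor(iterations=2000, **params)
        # Fit on the training fold and evaluate on the validation fold
        model.fit(trn_x, trn_y, # features and labels of the training fold
                  eval_set=(val_x, val_y), # validation fold used to monitor performance during training
                  metric_period=500, # compute the evaluation metric every 500 iterations
                  use_best_model=True, # keep the model from the iteration with the best validation metric
                  cat_features=[], # columns to treat as categorical; none here, so the list is empty
                  verbose=1 # enable verbose logging
                 )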
                  
        # Predict on the validation fold and on the test set
        val_pred  = model.predict(val_x)
        test_pred = model.predict(test_x)
        # Store the validation predictions in the oof array
        oof[valid_index] = val_pred
        # Add this fold's share of the test prediction, so the total is the average over all folds
        test_predict += test_pred / kf.n_splits
        
        # For now, ignore the separate MAE for healthy and diseased samples and use the ordinary MAE
        # Mean absolute error (MAE) between the validation predictions and the true values
        score = mean_absolute_error(val_y, val_pred)
        # Record the MAE of this fold
        cv_scores.append(score)
        print(cv_scores) # print the scores collected so far
        
        # Save feature importance scores from the first fold to help evaluate features
        if i == 0:
            # Collect feature names and scores in a DataFrame
            fea_ = model.feature_importances_
            fea_name = model.feature_names_
            fea_score = pd.DataFrame({'fea_name':fea_name, 'score':fea_})
            # Sort by score in descending order
            fea_score = fea_score.sort_values('score', ascending=False)
            # Save the sorted table as feature_importances.csv
            fea_score.to_csv('feature_importances.csv', index=False)
        
    return oof, test_predict # return the out-of-fold and test predictions

# Call catboost_model to train the model and predict on the test set
cat_oof, cat_test = catboost_model(traindata[cols], traindata['age'], testdata[cols])


  • In the catboost_model function, the folds of K-fold cross-validation are first set and the KFold object kf is created.
  • Use kf.split(train_x, train_y) to get the training set and validation set index for each fold.
  • In each fold, create a CatBoostRegressor model instance and fit it using the training and validation sets.
  • During the fitting process, parameters such as the calculation cycle of the evaluation index, the use of the best-performing model parameters on the validation set, and whether to use the GPU are set.
  • After each fold, the validation set prediction results are stored in the oof array, and the test set prediction results are accumulated in the test_predict array.
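
The out-of-fold logic described in the list above can be sketched without CatBoost. In the example below a plain least-squares fit stands in for the boosted model so that the code runs with NumPy alone; the function name kfold_oof and the synthetic data are assumptions for illustration.

```python
import numpy as np

def kfold_oof(X, y, X_test, folds=5, seed=2023):
    """K-fold loop that collects out-of-fold predictions and fold-averaged test predictions."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))                # shuffle once, then split into folds
    oof = np.zeros(len(X))
    test_pred = np.zeros(len(X_test))
    scores = []

    def with_bias(M):
        return np.column_stack([M, np.ones(len(M))])  # append an intercept column

    for valid_idx in np.array_split(indices, folds):
        train_idx = np.setdiff1d(indices, valid_idx)  # everything not in the validation fold
        # Stand-in model: ordinary least squares fitted on the training fold
        coef, *_ = np.linalg.lstsq(with_bias(X[train_idx]), y[train_idx], rcond=None)
        oof[valid_idx] = with_bias(X[valid_idx]) @ coef        # each sample is predicted exactly once
        test_pred += with_bias(X_test) @ coef / folds          # average the test prediction over folds
        scores.append(float(np.mean(np.abs(y[valid_idx] - oof[valid_idx]))))  # fold MAE
    return oof, test_pred, scores

# Synthetic regression problem for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 3))
y = 20 + 50 * X[:, 0] + rng.normal(0, 1, 100)
X_test = rng.uniform(0, 1, size=(10, 3))
oof, test_pred, scores = kfold_oof(X, y, X_test)
print(np.mean(scores))
```

Because every training sample receives a prediction from a model that never saw it, the oof array gives an honest estimate of generalization error and can later be reused, for example to compute the healthy and diseased MAE separately.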
Result output
# Write the results in the competition's submission format
testdata['age'] = cat_test # assign the predictions to the 'age' column
testdata['age'] = testdata['age'].astype(float) # convert the column to float
testdata['age'] = testdata['age'].apply(lambda x: x if x>0 else 0.0) # replace non-positive predictions with 0.0
testdata['age'] = testdata['age'].apply(lambda x: '%.2f' % x) # format each value as a string with two decimal places
testdata['age'] = testdata['age'].astype(str) # make sure the column is stored as strings
testdata[['sample_id','age']].to_csv('submit.txt',index=False) # save sample_id and age to submit.txt without the index

Once the submission file has been generated, upload it to the official website to view your score.
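
The clipping and formatting steps of the result output can also be written more compactly. The sketch below is an equivalent alternative on made-up predictions, writing to a string instead of submit.txt so that the result can be inspected.

```python
import numpy as np
import pandas as pd

pred = np.array([-1.234, 45.678, 0.0])   # made-up predictions, including a negative one
submit = pd.DataFrame({'sample_id': ['s1', 's2', 's3'], 'age': pred})

# Clip negative ages to 0, then format with two decimal places
submit['age'] = submit['age'].clip(lower=0).map('{:.2f}'.format)
csv_text = submit.to_csv(index=False)
print(csv_text)
```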

2.3 Summary

  1. Data processing and preprocessing:

Import third-party libraries (NumPy, Pandas, Polars) for data processing and analysis; the baseline code itself relies on NumPy and Pandas.
Use the reduce_mem_usage function to reduce the memory usage of the DataFrame.
Read data files in CSV format through pd.read_csv.
Use the basic operations of DataFrame (transpose, reset index, rename column names) to organize the data.
Use loops and list comprehensions to calculate missingness rates and correlation coefficient matrices for features.
Use the merge function to merge data sets based on specific columns.

  2. Feature engineering:

Statistical features (maximum, minimum, standard deviation, variance, skewness, mean, median) are calculated and added to the data set.

  3. Model training and validation:

Three gradient boosting libraries (XGBoost, LightGBM, CatBoost) are imported, but the baseline trains and predicts with CatBoost only.
Use KFold cross-validation to divide the training set and validation set, and perform model training and evaluation during the process.
Use the relevant parameters of CatBoostRegressor (learning rate, tree depth, bootstrap type, etc.) to create a model.
During the training process, the MAE score on the validation set is calculated and output, as well as the feature importance.

  4. Result output:

Output the prediction results of the model on the test set into the competition submission format.
Use the apply and lambda functions to process the prediction results to ensure that the results meet the submission requirements.

Origin blog.csdn.net/qq_42859625/article/details/132360716