The First World Scientific Intelligence Competition: Life Science Track - Biological Age Assessment and Age-Related Disease Risk Prediction (Second Notes)

For background introduction, see the first notes: http://t.csdn.cn/ylDRW

The baseline code for Task 1 provides a basic framework, but it needs further optimization to improve model performance. Here are some possible directions for improvement:

Feature engineering: You can try more complex feature engineering, including feature construction based on domain knowledge, and using dimensionality reduction techniques to reduce the number of features.

Model tuning: The hyperparameters of the model can be tuned in more detail, including learning rate, tree depth, number of iterations, etc. You can also consider trying different gradient boosting tree models, such as XGBoost, LightGBM, etc.

Model fusion: Consider using model fusion techniques to combine the predictions of multiple models and improve prediction performance.

Feature selection: By analyzing feature importance, the most important features can be selected to reduce model complexity and training time.

Handling missing values: You can try different methods to handle missing values other than simply replacing them with 0.0. For example, you can use interpolation to estimate missing values.

Model interpretation: Consider using model interpretation techniques to better understand your model's decision-making process and feature importance.

Data augmentation: If additional data is available, consider merging it with the training dataset to increase the amount of training data for your model.

Among these, the directions worth considering in this competition are feature engineering, model tuning, model fusion, feature selection, and handling missing values. We will make some attempts and see whether they improve the performance of the model.

1. Further optimize the features

When dealing with large datasets, it is very important to perform feature selection to reduce computational and memory requirements and improve the efficiency of the model. There are many methods of feature selection. Here are some commonly used feature selection techniques:

  1. Correlation filtering: By calculating the correlation between each feature and the target variable (for example, Pearson correlation coefficient or Spearman rank correlation coefficient), you can identify features that have a strong relationship with the target variable. You can set a correlation threshold and select features with correlations above the threshold.

  2. Variance filtering: Removes features with very low variance, as they may not be of much help to the model’s predictions. You can calculate the variance of each feature and select features with variance above a threshold.

  3. Recursive Feature Elimination (RFE): This is a recursive method that first trains the model using all features, then excludes the least important features, and then trains the model again. This process is repeated until the desired number of features remains or a performance target is met.

  4. L1 regularization (Lasso): Linear models (such as Lasso regression) using L1 regularization can push the coefficients of some features to zero, thereby achieving feature selection. Features with coefficients of zero can be removed.

  5. Tree model feature importance: Tree models such as Random Forest, XGBoost, LightGBM, etc. can provide the importance score of each feature. You can select top-ranked features based on importance scores.

  6. Feature selection libraries: There are some Python libraries dedicated to feature selection, such as SelectKBest and SelectFromModel in scikit-learn, as well as the boruta library. These libraries provide convenient tools for performing various feature selection methods.

  7. Feature dimensionality reduction techniques: In addition to feature selection, feature dimensionality reduction techniques such as principal component analysis (PCA), linear discriminant analysis (LDA), and singular value decomposition (SVD) can also help reduce the number of features.

In practical applications, multiple feature selection methods are usually combined to find the optimal feature subset. The specific method of feature selection depends on the characteristics of the data set and the requirements of the machine learning task. You can try different methods, evaluate their impact on model performance, and choose the feature selection strategy that best suits your situation.
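As a small illustration of the variance-filtering and library-based methods above, here is a minimal scikit-learn sketch. The data here is randomly generated as a placeholder for a numeric feature matrix, and the thresholds are arbitrary:

import numpy as np
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_regression

# placeholder data: 100 samples, 50 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = rng.normal(size=100)

# variance filtering: drop near-constant features
X_var = VarianceThreshold(threshold=0.01).fit_transform(X)

# correlation-style filtering: keep the 20 features most related to the target
X_kbest = SelectKBest(score_func=f_regression, k=20).fit_transform(X, y)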

1. Perform feature selection on the original features, filtering out single-valued (constant), high-missing-rate, and highly correlated features;

import numpy as np
import pandas as pd

drop_cols = []

print('Filtering abnormal features...')
for col in traindata.columns[1:]:  # start from the 2nd column; the 1st is 'sample_id'
    if traindata[col].nunique() == 1:  # single-valued feature
        drop_cols.append(col)
    if traindata[col].isnull().sum() / traindata.shape[0] > 0.6:  # missing rate above 0.6
        drop_cols.append(col)

print('Filtering highly correlated features...')
def correlation(data, threshold):
    col_corr = set()
    numeric_cols = data.select_dtypes(include=[np.number]).columns  # numeric features only
    corr_matrix = data[numeric_cols].corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                colname = corr_matrix.columns[i]
                col_corr.add(colname)
    return list(col_corr)


selected_cols = [col for col in traindata.columns if col not in drop_cols]

drop_cols += correlation(traindata[selected_cols], 0.98)

# remove the unwanted columns from the feature list
selected_cols = [col for col in selected_cols if col not in drop_cols]

Note: When filtering highly correlated features, computing the correlation matrix over too many features consumes a lot of memory and can crash the kernel.

2. Extract dimensionality-reduction features with PCA, LDA, and SVD. Because the feature dimensionality is very high, reading all features or training on them directly requires a lot of memory. Instead, you can reduce the dimensionality group by group: for example, divide the features into 10 groups, run PCA, LDA, or SVD on each group, and then concatenate the 10 reduced results as the model features.

# feature extraction: group-wise PCA
from sklearn.decomposition import PCA
import numpy as np
import pandas as pd

n_components = 16  # target dimensionality of each PCA
n_groups = 10  # split the features into 10 groups

# reduce each group separately, then merge the results
pca_features = []

for i in range(n_groups):
    # start and end indices of the current group
    start_idx = i * (len(selected_cols) // n_groups)
    end_idx = (i + 1) * (len(selected_cols) // n_groups)
    
    # extract the features of the current group
    group_data = traindata[['sample_id'] + selected_cols[start_idx:end_idx]]
    
    # reduce the dimensionality with PCA
    pca = PCA(n_components=n_components)
    pca_result = pca.fit_transform(group_data.drop(columns=['sample_id']))
    
    # collect the reduced result for this group
    pca_features.append(pca_result)

# stack the reduced results horizontally
pca_features = np.hstack(pca_features)

# build a new DataFrame holding the reduced features
pca_columns = [f'pca_{i}' for i in range(n_components * n_groups)]
pca_df = pd.DataFrame(data=pca_features, columns=pca_columns)

# append the reduced features to the training data
traindata = pd.concat([traindata, pca_df], axis=1)

Note that PCA cannot handle missing values, so the NaNs must be dealt with before this step; for this data there is currently no ideal way to do that.
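One simple workaround, sketched below with scikit-learn's SimpleImputer, is to impute each group before running PCA. Mean imputation here is an arbitrary choice, not necessarily the right one for methylation data; the variable names follow the block above:

from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

# fill NaNs with the per-column mean before PCA (inside the group loop above)
imputer = SimpleImputer(strategy='mean')
group_filled = imputer.fit_transform(group_data.drop(columns=['sample_id']))
pca_result = PCA(n_components=n_components).fit_transform(group_filled)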

2. Missing values

To deal with missing values, we first need to understand: what is the biological meaning of NaN values in methylation data?

In biology, NaN values in methylation data usually indicate that no methylation information was detected or measured at that specific site. This can happen for a variety of reasons, including limitations of the experimental technique, measurement error, or simply that certain sites are not methylated in a particular sample.

The meaning of NaN values in methylation data can be summarized as follows:

  1. No methylation detected: In some cases, experimental techniques may not accurately detect the methylation status of certain sites, causing values at these sites to be marked as NaN. This may be due to insufficient technical sensitivity or the presence of interference factors.

  2. Measurement error: In other cases, NaN values may be due to measurement error, i.e., random or systematic errors in measuring methylation. These errors may originate from various factors during the experimental process, including sample preparation, sequencing, etc.

  3. Missing sites: In some cases, the sites themselves may indeed not be methylated in some samples. This may be because a particular DNA sequence has no methylation sites in those samples, so the data there is marked as NaN.

When analyzing methylation data, it is very important to handle NaN values. Typical methods include data imputation (replacing NaN values with the mean, median, etc.), deleting samples or sites containing NaN values, and using statistical methods to estimate NaN values, depending on the goals of the study and the nature of the data. Make sure to choose an appropriate NaN-handling strategy based on the research needs and data characteristics, so that the analysis stays accurate and reliable.

Here I count the missing rate of the first 10 features, as follows:

Feature 0, missing rate: 0.5364994534191667
Feature 1, missing rate: 0.570144540264788
Feature 2, missing rate: 0.9562735333414308
Feature 3, missing rate: 0.5799829952629662
Feature 4, missing rate: 0.6345196161787927
Feature 5, missing rate: 0.42791206121705333
Feature 6, missing rate: 0.4956880845378355
Feature 7, missing rate: 0.530547795457306
Feature 8, missing rate: 0.5468237580468844
Feature 9, missing rate: 0.5464593708247297
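For reference, a minimal sketch of how these rates could be computed, assuming the traindata DataFrame from the snippets above with 'sample_id' as the first column:

# missing rate of the first 10 feature columns
feature_cols = traindata.columns[1:]  # skip 'sample_id'
for i, col in enumerate(feature_cols[:10]):
    print(f'Feature {i}, missing rate: {traindata[col].isnull().mean()}')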

As can be seen, the missing rate of the methylation data is high. So when processing it, you can first evaluate whether these NaN values carry useful information. If the missing values are relevant to the research question and cannot simply be filled with statistics, consider retaining the NaN values and treating them as a meaningful category.

If the NaN values are not important to the research question, you can delete the samples or features that contain them, or fill them with statistical values. The final decision should be based on domain knowledge, analysis needs, and research goals. Whatever you choose, it is recommended to always record and document the treatment for future data analysis and interpretation.

When dealing with NaN values in methylation data, the appropriate treatment should be selected based on the nature of the data and the research goals. Here are some common ways to handle NaN values, which you can evaluate and choose according to your situation (a short pandas sketch follows the list):

  1. Remove samples or features containing NaN:

     • Delete samples containing NaN values: if some samples contain too many NaN values, or the NaNs cannot be reasonably filled, you can drop those samples. This may reduce the sample size but keeps the analysis reliable.
     • Delete features containing NaN values: if the proportion of NaN values in a feature is too high and the feature is not important to the problem, you can drop it.

  2. Fill NaN values:

     • Fill with statistical values such as the mean, median, or mode: for continuous data, NaN values can be replaced with the mean, median, or another statistic of the feature.
     • Fill with a fixed value: sometimes a suitable constant can be chosen based on domain knowledge or the experimental design.
     • Fill by interpolation: for data such as time series, interpolation methods (e.g., linear interpolation) can be used.

  3. Build a model to fill:

     • If the distribution of the NaN values is related to other features, consider using a machine learning model (such as a regression or classification model) to predict them.
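As a small illustration, here is a sketch of the filling strategies above using pandas; `df` is a tiny placeholder DataFrame, not a variable from the competition code:

import pandas as pd
import numpy as np

# placeholder DataFrame standing in for a numeric feature table
df = pd.DataFrame({'f0': [0.1, np.nan, 0.3], 'f1': [np.nan, 0.5, 0.7]})

df_mean   = df.fillna(df.mean())              # fill with the per-column mean
df_median = df.fillna(df.median())            # fill with the per-column median
df_const  = df.fillna(0.0)                    # fill with a fixed value
df_interp = df.interpolate(method='linear')   # linear interpolation (for ordered data)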

Since it is difficult to determine the type of missingness in this data, we will not treat the missing values specially here.

3. Model fusion

Combining predictions from multiple better-performing models can work here!

There are two main strategies for model fusion (ensemble learning):

1. Bagging

The idea of Bagging is that all base models are treated equally and each gets exactly one vote; the final result is then decided by majority vote.

In most cases, the variance of the results obtained through bagging is smaller.

We built the cv_model function, which can use the lightgbm, xgboost, and catboost models internally. We can run these three models in turn and then average their results for fusion.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor

def cv_model(clf, train_x, train_y, test_x, clf_name, seed=2023):
    '''
    clf: the model library / class to use
    train_x: training data
    train_y: labels of the training data
    test_x: test data
    clf_name: name of the chosen model
    seed: random seed
    '''
    folds = 5
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    oof = np.zeros(train_x.shape[0])
    test_predict = np.zeros(test_x.shape[0])
    cv_scores = []
    
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]
        
        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'regression',
                'metric': 'mae',
                'min_child_weight': 6,
                'num_leaves': 2 ** 6,
                'lambda_l2': 10,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.1,
                'seed': 2023,
                'nthread': 16,
                'verbose': -1,
            }
            model = clf.train(params, train_matrix, 2000, valid_sets=[train_matrix, valid_matrix],
                              categorical_feature=[], verbose_eval=200, early_stopping_rounds=100)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
        
        if clf_name == "xgb":
            xgb_params = {
                'booster': 'gbtree',
                'objective': 'reg:squarederror',
                'eval_metric': 'mae',
                'max_depth': 5,
                'lambda': 10,
                'subsample': 0.7,
                'colsample_bytree': 0.7,
                'colsample_bylevel': 0.7,
                'eta': 0.1,
                'tree_method': 'hist',
                'seed': 520,
                'nthread': 16
            }
            train_matrix = clf.DMatrix(trn_x, label=trn_y)
            valid_matrix = clf.DMatrix(val_x, label=val_y)
            test_matrix = clf.DMatrix(test_x)
            
            watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]
            
            model = clf.train(xgb_params, train_matrix, num_boost_round=2000, evals=watchlist, verbose_eval=200, early_stopping_rounds=100)
            val_pred = model.predict(valid_matrix)
            test_pred = model.predict(test_matrix)
            
        if clf_name == "cat":
            params = {'learning_rate': 0.1, 'depth': 5, 'bootstrap_type': 'Bernoulli',
                      'od_type': 'Iter', 'od_wait': 100, 'random_seed': 2023, 'allow_writing_files': False}
            
            model = clf(iterations=2000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                      metric_period=200,
                      use_best_model=True, 
                      cat_features=[],
                      verbose=1)
            
            val_pred = model.predict(val_x)
            test_pred = model.predict(test_x)
        
        oof[valid_index] = val_pred
        test_predict += test_pred / kf.n_splits
        
        score = mean_absolute_error(val_y, val_pred)
        cv_scores.append(score)
        print(cv_scores)
        
    return oof, test_predict

# run the lightgbm model
lgb_oof, lgb_test = cv_model(lgb, traindata[cols], traindata['label'], testdata[cols], 'lgb')
# run the xgboost model
xgb_oof, xgb_test = cv_model(xgb, traindata[cols], traindata['label'], testdata[cols], 'xgb')
# run the catboost model
cat_oof, cat_test = cv_model(CatBoostRegressor, traindata[cols], traindata['label'], testdata[cols], 'cat')

# fuse by simple averaging
final_test = (lgb_test + xgb_test + cat_test) / 3
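A possible refinement of the simple average, sketched under the assumption that the oof arrays returned by cv_model align with traindata['label']: weight each model by the inverse of its out-of-fold MAE, so better models get a larger vote.

from sklearn.metrics import mean_absolute_error

# inverse-MAE weights computed from the out-of-fold predictions
maes = [mean_absolute_error(traindata['label'], oof) for oof in (lgb_oof, xgb_oof, cat_oof)]
weights = [1.0 / m for m in maes]
weights = [w / sum(weights) for w in weights]

final_test = weights[0] * lgb_test + weights[1] * xgb_test + weights[2] * cat_test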

2. Boosting

The essential difference between Boosting and Bagging is that Boosting does not treat the base models equally. Instead, through continual testing and screening it selects the "elites" and gives them more voting weight, while poorly performing base models get less weight; everyone's votes are then combined for the final result.

In most cases, the bias of the results obtained through boosting is smaller.
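For intuition only, here is a tiny self-contained boosting example using scikit-learn's AdaBoostRegressor; it illustrates the idea and is not part of the competition pipeline:

import numpy as np
from sklearn.ensemble import AdaBoostRegressor

# toy data standing in for features and labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 2 + rng.normal(scale=0.1, size=200)

# AdaBoost trains base models sequentially and gives
# better-performing ones a larger weight in the final prediction
booster = AdaBoostRegressor(n_estimators=100, learning_rate=0.1, random_state=2023)
booster.fit(X, y)
pred = booster.predict(X[:5])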

4. Appropriately increasing the number of features read and the number of iterations can further improve performance. Be careful not to read in too many features, which can crash the kernel.

In practice, simply adjusting these two parameters can bring the score down below 3.5.
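One way to control how many features are read, sketched under the assumption of a CSV file named 'traindata.csv' with 'sample_id' as the first column (the file name and counts are placeholders):

import pandas as pd

# read only the header to get all column names
all_cols = pd.read_csv('traindata.csv', nrows=0).columns.tolist()

# load 'sample_id' plus the first 20,000 feature columns
use_cols = ['sample_id'] + all_cols[1:20001]
traindata = pd.read_csv('traindata.csv', usecols=use_cols)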

Summary:

  • Reading 100,000 features at once is not feasible; the kernel always crashes.
  • How should the missing values be handled?
  • Is catboost necessarily better than xgboost and lightgbm?

Origin: blog.csdn.net/qq_42859625/article/details/132433012