A few days ago, Kaggle launched the ICR - Identifying Age-Related Conditions competition. It is a binary classification task: you use machine learning methods to determine whether each patient has a related condition, giving doctors a sound basis for diagnosis.
The competition provides four data files: train, test, sample_submission, and greeks. Specifically:
The train file contains the features and the label for each patient.
test and sample_submission are used when submitting answers.
greeks is supplementary metadata and only applies to the training set.
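To illustrate how greeks relates to train, here is a minimal sketch on tiny made-up frames. It assumes that both files share an Id column and that greeks holds categorical metadata columns such as Alpha; all values below are invented for illustration only.

```python
import pandas as pd

# Toy stand-ins for train.csv and greeks.csv (values are invented).
train = pd.DataFrame({
    "Id": ["a1", "b2", "c3"],
    "AB": [0.21, 0.15, 0.47],
    "Class": [0, 0, 1],
})
greeks = pd.DataFrame({
    "Id": ["a1", "b2", "c3"],
    "Alpha": ["A", "A", "B"],  # assumed metadata column
})

# Left-join so every training row is kept and gains its metadata;
# validate= raises an error if an Id is duplicated on either side.
train_meta = train.merge(greeks, on="Id", how="left", validate="one_to_one")

# Share of positive labels per metadata group.
print(train_meta.groupby("Alpha")["Class"].mean())
```

Because the metadata exists only for training rows, it can guide analysis or fold stratification, but it cannot be fed to the model as an input at prediction time.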
To help students climb the leaderboard and win medals, I have prepared some great perks:
Watch the competition lecture (originally priced at 198 yuan) for free!
Get the complete high-scoring baseline for free!
A copy of "Classic Introduction to Algorithm Competition (2nd Edition)" with free shipping! (Details at the end of the article)
Scan the QR code to watch the lectures, get the baseline, and pick up books for free!
Competition Lecture
Training Data Analysis
Number of rows in train data: 617
Number of columns in train data: 58
Data sample:
Data distribution:
Correlation analysis:
Build training data
Read the data (see the baseline code for a more detailed walkthrough):
import pandas as pd
train = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/train.csv')
test = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/test.csv')
greeks = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/greeks.csv')
sample_submission = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/sample_submission.csv')
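The baseline below relies on two lists, num_cols and cat_cols, that this article does not define. One plausible way to derive them is sketched here on a toy frame. It assumes that Id is an identifier, that Class is the label, and that every other column is a feature whose dtype separates numeric from categorical; the column values are made up.

```python
import pandas as pd

# Toy stand-in for the training frame (values are invented).
toy_train = pd.DataFrame({
    "Id": ["a1", "b2", "c3"],
    "AB": [0.21, None, 0.47],
    "AF": [3109.0, 978.8, 2635.1],
    "EJ": ["A", "B", "B"],
    "Class": [0, 0, 1],
})

# Everything except the identifier and the label is treated as a feature.
feature_df = toy_train.drop(columns=["Id", "Class"])

# Split the feature columns by dtype.
num_cols = feature_df.select_dtypes(include="number").columns.tolist()
cat_cols = feature_df.select_dtypes(exclude="number").columns.tolist()
print(num_cols, cat_cols)
```

Splitting by dtype avoids hard-coding column names, but it is worth checking the result by eye, since a categorical column stored as integers would end up in num_cols.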
Baseline process
Load data, feature processing:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
# Combine numeric and categorical features
FEATURES = num_cols + cat_cols
# Fill missing values with mean for numeric variables
imputer = SimpleImputer(strategy='mean')
numeric_df = pd.DataFrame(imputer.fit_transform(train[num_cols]), columns=num_cols)
# Scale numeric variables using min-max scaling
scaler = MinMaxScaler()
scaled_numeric_df = pd.DataFrame(scaler.fit_transform(numeric_df), columns=num_cols)
# Encode categorical variables using one-hot encoding
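# sparse_output replaces the deprecated sparse argument (use sparse=False on scikit-learn < 1.2)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')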
encoded_cat_df = pd.DataFrame(encoder.fit_transform(train[cat_cols]), columns=encoder.get_feature_names_out(cat_cols))
# Concatenate the scaled numeric and encoded categorical variables
processed_df = pd.concat([scaled_numeric_df, encoded_cat_df], axis=1)
Define the training loop:
import numpy as np
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import class_weight
from xgboost import XGBClassifier
FOLDS = 10
SEED = 1004
xgb_models = []
xgb_oof = []
f_imp = []
X = processed_df
y = train['Class']
# Calculate the sample weights
weights = class_weight.compute_sample_weight('balanced', y)
skf = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    if (fold + 1) % 5 == 0 or (fold + 1) == 1:
        print(f'{"#"*24} Training FOLD {fold+1} {"#"*24}')
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_valid, y_valid = X.iloc[val_idx], y.iloc[val_idx]
    watchlist = [(X_train, y_train), (X_valid, y_valid)]
    # early_stopping_rounds is set in the constructor (xgboost >= 1.6); older versions take it in fit()
    model = XGBClassifier(n_estimators=1000, n_jobs=-1, max_depth=4, eta=0.2, colsample_bytree=0.67, early_stopping_rounds=300)
    # Apply the sample weights when fitting the XGBClassifier
    model.fit(X_train, y_train, sample_weight=weights[train_idx], eval_set=watchlist, verbose=0)
    val_preds = model.predict_proba(X_valid)[:, 1]
    # Apply weights in the log_loss
    val_score = log_loss(y_valid, val_preds, sample_weight=weights[val_idx])
    best_iter = model.best_iteration
    # Keep (row index, predicted probability, true label) for out-of-fold scoring
    idx_pred_target = np.vstack([val_idx, val_preds, y_valid]).T
    f_imp.append({i: j for i, j in zip(X.columns, model.feature_importances_)})
    print(f'{" "*20} Log-loss: {val_score:.5f} {" "*6} best iteration: {best_iter}')
    xgb_oof.append(idx_pred_target)
    xgb_models.append(model)
print('*'*45)
print(f'Mean Log-loss: {np.mean([log_loss(item[:, 2], item[:, 1], sample_weight=weights[item[:, 0].astype(int)]) for item in xgb_oof]):.5f}')
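Passing 'balanced' sample weights to log_loss makes each class contribute equally to the score, regardless of how many samples it has. The sketch below, on made-up labels and probabilities, checks that this weighted log-loss matches the average of the per-class mean log-losses. The helper name balanced_log_loss is hypothetical and is not part of the baseline.

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.utils import class_weight


def balanced_log_loss(y_true, p1, eps=1e-15):
    """Average of the mean log-loss computed separately for class 0 and class 1."""
    y_true = np.asarray(y_true)
    p1 = np.clip(np.asarray(p1, dtype=float), eps, 1 - eps)
    loss_0 = -np.mean(np.log(1 - p1[y_true == 0]))
    loss_1 = -np.mean(np.log(p1[y_true == 1]))
    return (loss_0 + loss_1) / 2


# Made-up imbalanced labels and predicted probabilities of class 1.
y_toy = np.array([0, 0, 0, 0, 1, 1])
p_toy = np.array([0.1, 0.2, 0.3, 0.4, 0.7, 0.9])

# 'balanced' weights are n_samples / (n_classes * class_count) for each sample.
w_toy = class_weight.compute_sample_weight("balanced", y_toy)
weighted = log_loss(y_toy, p_toy, sample_weight=w_toy)
manual = balanced_log_loss(y_toy, p_toy)
print(f"weighted: {weighted:.6f}  manual: {manual:.6f}")
```

In the loop above, the weights are computed once on the full label vector and then sliced per fold, so each per-fold score is only approximately class-balanced; stratified folds keep the difference small.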
Feature importance view:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
# Confusion Matrix for the last fold
cm = confusion_matrix(y_valid, model.predict(X_valid))
# Feature Importance for the last model
feature_imp = pd.DataFrame({'Value':xgb_models[-1].feature_importances_, 'Feature':X.columns})
feature_imp = feature_imp.sort_values(by="Value", ascending=False)
feature_imp_top20 = feature_imp.iloc[:20]
fig, ax = plt.subplots(1, 2, figsize=(14, 4))
# Subplot 1: Confusion Matrix
sns.heatmap(cm, annot=True, fmt='d', ax=ax[0], cmap='YlOrRd')
ax[0].set_title('Confusion Matrix')
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('True')
# Subplot 2: Feature Importance
sns.barplot(x="Value", y="Feature", data=feature_imp_top20, ax=ax[1], palette='YlOrRd_r')
ax[1].set_title('Feature Importance')
plt.tight_layout()
plt.show()
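The plot above uses only the last fold's model, although the training loop also stores one importance dictionary per fold in f_imp. A small sketch of averaging those dictionaries across folds follows, using an invented two-fold list named f_imp_toy for illustration.

```python
import pandas as pd

# Invented stand-in for the f_imp list built in the training loop:
# one {feature: importance} dict per fold.
f_imp_toy = [
    {"AB": 0.50, "AF": 0.30, "EJ_A": 0.20},
    {"AB": 0.25, "AF": 0.50, "EJ_A": 0.25},
]

# Rows are folds and columns are features; average over folds, then rank.
mean_imp = (
    pd.DataFrame(f_imp_toy)
    .mean(axis=0)
    .sort_values(ascending=False)
    .rename("Value")
    .rename_axis("Feature")
    .reset_index()
)
print(mean_imp)
```

The resulting frame has the same Value and Feature columns as feature_imp, so it could be passed to the same barplot call to show a ranking that is less sensitive to a single fold.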
Free book giveaway
Add our customer service contact and send a screenshot of this article to enter the book giveaway. 50 students will be selected to receive "Introduction to Algorithm Competition (Second Edition)" with free shipping!
Books will be mailed at the end of the month, thank you for your patience~