[Baseline included] A guide to Kaggle's new machine learning competition! Book giveaway at the end of the article~


A few days ago, Kaggle released the ICR - Identifying Age-Related Conditions competition. It is a binary classification task in machine learning: given a patient's health measurements, you build a model that predicts whether the patient has an age-related condition, giving doctors a basis for a reasonable diagnosis.

This competition provides four data files: train, test, sample_submission, and greeks. Specifically:

The train file contains each patient's features and label.

test and sample_submission are used when submitting predictions.

greeks is supplementary metadata and applies only to the training set (a quick peek at each file is sketched below).
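As that quick peek, here is a minimal sketch that loads each file and prints its shape; the /kaggle/input path is the same one the baseline code uses later in this article:

import pandas as pd

base = '/kaggle/input/icr-identify-age-related-conditions/'
for name in ['train', 'test', 'sample_submission', 'greeks']:
    df = pd.read_csv(f'{base}{name}.csv')
    print(f'{name}: {df.shape[0]} rows, {df.shape[1]} columns')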

To help students score well and win medals, I've brought you some great benefits:

The competition lecture, originally priced at 198 yuan, free to watch!

A complete high-scoring baseline, free!

"Classic Introduction to Algorithm Competition (2nd Edition)" with free shipping! (Details at the end of the article)


Scan the QR code to watch the lectures, get the baseline, and pick up books for free!

Competition Lecture


Training Data Analysis

Number of rows in train data: 617
Number of columns in train data: 58

Data sample:


Data distribution:


Correlation analysis:

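The statistics and figures above (row/column counts, data sample, distribution, and correlation analysis) can be reproduced with a sketch like the following. It assumes train has already been read with pd.read_csv as shown in the next section; the plotting choices are my own, not necessarily the original notebook's:

import matplotlib.pyplot as plt
import seaborn as sns

print(f'Number of rows in train data: {train.shape[0]}')
print(f'Number of columns in train data: {train.shape[1]}')

# Data sample
print(train.head())

# Class distribution of the binary label
print(train['Class'].value_counts())

# Correlation analysis over the numeric features
corr = train.select_dtypes('number').corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap='YlOrRd', center=0)
plt.title('Feature correlation')
plt.show()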

Build training data

Read the data (the baseline code has a more detailed walkthrough):

import pandas as pd

train             = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/train.csv')
test              = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/test.csv')
greeks            = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/greeks.csv')
sample_submission = pd.read_csv('/kaggle/input/icr-identify-age-related-conditions/sample_submission.csv')

Baseline process

Data loading and feature processing:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


# Numeric and categorical feature columns: in ICR, 'EJ' is the only
# categorical feature; everything except 'Id' and the 'Class' label is numeric
cat_cols = ['EJ']
num_cols = [c for c in train.columns if c not in ['Id', 'Class'] + cat_cols]


# Combine numeric and categorical features
FEATURES = num_cols + cat_cols


# Fill missing values with the column mean for numeric variables
imputer = SimpleImputer(strategy='mean')
numeric_df = pd.DataFrame(imputer.fit_transform(train[num_cols]), columns=num_cols)


# Scale numeric variables to [0, 1] using min-max scaling
scaler = MinMaxScaler()
scaled_numeric_df = pd.DataFrame(scaler.fit_transform(numeric_df), columns=num_cols)


# Encode categorical variables using one-hot encoding
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
encoded_cat_df = pd.DataFrame(encoder.fit_transform(train[cat_cols]), columns=encoder.get_feature_names_out(cat_cols))


# Concatenate the scaled numeric and encoded categorical variables
processed_df = pd.concat([scaled_numeric_df, encoded_cat_df], axis=1)
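One step the snippet above doesn't show: when the test set is prepared, the already-fitted imputer, scaler, and encoder should be reused with transform (not fit_transform), so the test data is processed with statistics learned on the training data. A minimal sketch:

# Apply the fitted transformers to the test set (transform only, no refitting)
test_numeric_df = pd.DataFrame(imputer.transform(test[num_cols]), columns=num_cols)
test_scaled_df  = pd.DataFrame(scaler.transform(test_numeric_df), columns=num_cols)
test_encoded_df = pd.DataFrame(encoder.transform(test[cat_cols]), columns=encoder.get_feature_names_out(cat_cols))
processed_test_df = pd.concat([test_scaled_df, test_encoded_df], axis=1)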

Define the training loop:

import numpy as np
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import class_weight
from xgboost import XGBClassifier


FOLDS = 10
SEED = 1004
xgb_models = []
xgb_oof = []
f_imp = []


X = processed_df
y = train['Class']


# Calculate balanced sample weights to compensate for class imbalance
weights = class_weight.compute_sample_weight('balanced', y)


skf = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    if (fold + 1) % 5 == 0 or (fold + 1) == 1:
        print(f'{"#"*24} Training FOLD {fold+1} {"#"*24}')

    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_valid, y_valid = X.iloc[val_idx], y.iloc[val_idx]
    watchlist = [(X_train, y_train), (X_valid, y_valid)]

    # Apply the sample weights in the XGBClassifier
    model = XGBClassifier(n_estimators=1000, n_jobs=-1, max_depth=4, eta=0.2, colsample_bytree=0.67)
    model.fit(X_train, y_train, sample_weight=weights[train_idx], eval_set=watchlist, early_stopping_rounds=300, verbose=0)

    val_preds = model.predict_proba(X_valid)[:, 1]

    # Apply the sample weights in the log_loss
    val_score = log_loss(y_valid, val_preds, sample_weight=weights[val_idx])
    best_iter = model.best_iteration

    # Store (validation index, prediction, target) rows for out-of-fold analysis
    idx_pred_target = np.vstack([val_idx, val_preds, y_valid]).T
    f_imp.append({i: j for i, j in zip(X.columns, model.feature_importances_)})
    print(f'{" "*20} Log-loss: {val_score:.5f} {" "*6} best iteration: {best_iter}')

    xgb_oof.append(idx_pred_target)
    xgb_models.append(model)

print('*'*45)
print(f'Mean Log-loss: {np.mean([log_loss(item[:, 2], item[:, 1], sample_weight=weights[item[:, 0].astype(int)]) for item in xgb_oof]):.5f}')
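Note that the competition itself is scored with a balanced logarithmic loss, which averages the log loss of each class so both classes count equally. A minimal sketch of that metric (my own implementation, not the official scoring code):

import numpy as np

def balanced_log_loss(y_true, y_pred, eps=1e-15):
    # Clip predictions away from 0 and 1 to keep the logs finite
    y_true = np.asarray(y_true)
    y_pred = np.clip(np.asarray(y_pred), eps, 1 - eps)
    n0, n1 = np.sum(y_true == 0), np.sum(y_true == 1)
    # Average log loss within each class, then average the two classes
    loss_0 = -np.sum((1 - y_true) * np.log(1 - y_pred)) / n0
    loss_1 = -np.sum(y_true * np.log(y_pred)) / n1
    return (loss_0 + loss_1) / 2

Scoring the out-of-fold predictions with this metric should track the leaderboard score more closely than the weighted log loss printed above.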

Confusion matrix and feature importance view:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix


# Confusion matrix for the last fold
cm = confusion_matrix(y_valid, model.predict(X_valid))


# Feature importance for the last model
feature_imp = pd.DataFrame({'Value': xgb_models[-1].feature_importances_, 'Feature': X.columns})
feature_imp = feature_imp.sort_values(by="Value", ascending=False)
feature_imp_top20 = feature_imp.iloc[:20]


fig, ax = plt.subplots(1, 2, figsize=(14, 4))


# Subplot 1: confusion matrix
sns.heatmap(cm, annot=True, fmt='d', ax=ax[0], cmap='YlOrRd')
ax[0].set_title('Confusion Matrix')
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('True')


# Subplot 2: top-20 feature importance
sns.barplot(x="Value", y="Feature", data=feature_imp_top20, ax=ax[1], palette='YlOrRd_r')
ax[1].set_title('Feature Importance')


plt.tight_layout()
plt.show()
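To finish the pipeline, the fold models' predictions can be averaged on the processed test set and written into sample_submission. A minimal sketch, assuming processed_test_df was built with the fitted transformers as shown earlier (the class_0/class_1 column names come from the competition's sample_submission file):

# Average the positive-class probability across all fold models
test_preds = np.mean([m.predict_proba(processed_test_df)[:, 1] for m in xgb_models], axis=0)

sample_submission['class_1'] = test_preds
sample_submission['class_0'] = 1 - test_preds
sample_submission.to_csv('submission.csv', index=False)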



Book giveaway with free shipping


Add the customer service contact and send a screenshot of this article to enter the book lottery. 50 students will be selected to receive "Introduction to Algorithm Competition (2nd Edition)" with free shipping!

Books will be mailed at the end of the month; thank you for your patience~




Source: blog.csdn.net/lgzlgz3102/article/details/131672064