RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]

# 2. Data preprocessing

# 2.1 Missing-value imputation

# 2.2 Feature construction

after fillna and FE
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    int64  
 1   Pclass      891 non-null    int64  
 2   Sex         891 non-null    object 
 3   Age         891 non-null    float64
 4   SibSp       891 non-null    int64  
 5   Parch       891 non-null    int64  
 6   Fare        891 non-null    float64
 7   Embarked    891 non-null    object 
 8   FamilySize  891 non-null    int64  
 9   IsAlone     891 non-null    int32  
dtypes: float64(2), int32(1), int64(5), object(2)
memory usage: 66.3+ KB
None
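
The summary above shows ten columns with no missing values and two new features. A minimal sketch of a preprocessing step that would produce a frame of this shape follows. The fill strategies (median for `Age`, most frequent value for `Embarked`) and the feature definitions (`FamilySize = SibSp + Parch + 1`, `IsAlone = 1` when `FamilySize` is 1) are assumptions, since the original code is not shown, and the small inline frame is a hypothetical stand-in for the real training file.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the Titanic training frame.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Name": ["a", "b", "c", "d"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 2],
    "Ticket": ["t1", "t2", "t3", "t4"],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Assumed imputation: median for Age, most frequent value for Embarked.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Assumed derived features.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Drop the identifier-like and mostly-empty columns that are absent
# from the ten-column summary.
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
print(df)
```

`Cabin` is dropped rather than filled because only 204 of 891 rows have a value.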

# 2.3 Feature encoding

after LabelEncoder
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    int64  
 1   Pclass      891 non-null    int64  
 2   Sex         891 non-null    int32  
 3   Age         891 non-null    float64
 4   SibSp       891 non-null    int64  
 5   Parch       891 non-null    int64  
 6   Fare        891 non-null    float64
 7   Embarked    891 non-null    int32  
 8   FamilySize  891 non-null    int64  
 9   IsAlone     891 non-null    int32  
dtypes: float64(2), int32(3), int64(5)
memory usage: 59.3 KB
None
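
After encoding, `Sex` and `Embarked` change from `object` to integer columns. A minimal sketch of that step with scikit-learn's `LabelEncoder` is shown here on a hypothetical four-row frame. Keeping the fitted encoders in a dictionary (the `encoders` name is illustrative) is an assumption about good practice, not something the original output confirms.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows covering the categories in the two object columns.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Keep one fitted encoder per column so the same mapping can be reused later.
encoders = {}
for col in ["Sex", "Embarked"]:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc

print(df)
# classes_ lists the categories in the order of their integer codes.
print({col: list(enc.classes_) for col, enc in encoders.items()})
```

`LabelEncoder` assigns codes in sorted order, so here `female` becomes 0, `male` becomes 1, and `C`, `Q`, `S` become 0, 1, 2.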

# 2.4 Separating features and labels

# 3. Model training and evaluation

# 3.1 Splitting the dataset into training and test sets
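
A minimal sketch of separating the label from the features and holding out a test set. The `test_size=0.2` and `random_state=42` values are assumptions. The 20% figure is at least consistent with the reported accuracy, since 0.8435754189944135 equals 151/179 and 179 is the test-set size a 20% split of 891 rows produces. The frame is synthetic and only mirrors the row count.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the training frame.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Survived": rng.integers(0, 2, size=891),
    "Pclass": rng.integers(1, 4, size=891),
    "Fare": rng.uniform(0, 100, size=891),
})

# Features are every column except the label.
X = df.drop(columns=["Survived"])
y = df["Survived"]

# test_size and random_state are assumed values, not taken from the source.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```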

# 3.2 Model training and evaluation

XGBoost
Accuracy:  0.8435754189944135
F1:  0.7812500000000001
AUC:  0.8275978407557355
XGBoost 0.8435754189944135 0.7812500000000001 0.8275978407557355
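
As a sanity check, the three numbers above can be reproduced from a single confusion matrix. The counts below are inferred from the reported metrics under the assumption of a 179-row test set (20% of 891); they are not printed anywhere in the original output.

```python
# Confusion-matrix counts inferred from the reported metrics (an assumption,
# not values printed by the original run).
tp, fn, fp, tn = 50, 15, 13, 101

n = tp + fn + fp + tn                 # total number of test rows
accuracy = (tp + tn) / n              # fraction of correct predictions
f1 = 2 * tp / (2 * tp + fp + fn)      # harmonic mean of precision and recall
recall = tp / (tp + fn)               # true positive rate
specificity = tn / (tn + fp)          # true negative rate

# With hard 0/1 predictions the ROC curve has a single interior point,
# so its area is the mean of recall and specificity.
auc_from_labels = (recall + specificity) / 2

print(accuracy, f1, auc_from_labels)
```

That the AUC matches the mean of recall and specificity suggests it was computed from predicted labels rather than from predicted probabilities.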

Model                       ACC          F1        AUC
XGBoost                     0.832402235  0.765625  0.815519568
XGBoost+FamilySize          0.843575419  0.78125   0.827597841
XGBoost+FamilySize+IsAlone  0.843575419  0.78125   0.827597841

# 3.3 Exporting the model to a JSON file

# Get the model's parameters

model.json {'objective': 'binary:logistic', 'use_label_encoder': None, 'base_score': None, 'booster': None, 'callbacks': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': None, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': None, 'feature_types': None, 'gamma': None, 'gpu_id': None, 'grow_policy': None, 'importance_type': None, 'interaction_constraints': None, 'learning_rate': None, 'max_bin': None, 'max_cat_threshold': None, 'max_cat_to_onehot': None, 'max_delta_step': None, 'max_depth': None, 'max_leaves': None, 'min_child_weight': None, 'missing': nan, 'monotone_constraints': None, 'n_estimators': 100, 'n_jobs': None, 'num_parallel_tree': None, 'predictor': None, 'random_state': None, 'reg_alpha': None, 'reg_lambda': None, 'sampling_method': None, 'scale_pos_weight': None, 'subsample': None, 'tree_method': None, 'validate_parameters': None, 'verbosity': None}

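# 4. Model inference

# 4.1 Loading the model file

# 4.2 Creating the model and loading the parameters from the JSON file

# 4.3 Model inference

# 4.3.1 Loading a new sample

# 4.3.2 Preprocessing the new sample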

raw test data
   Pclass   Sex  Age  SibSp  Parch  Fare Embarked  FamilySize  IsAlone
0       3  male   25      1      0  7.25        S           2        0
test data after LabelEncoder
   Pclass  Sex  Age  SibSp  Parch  Fare  Embarked  FamilySize  IsAlone
0       3    0   25      1      0  7.25         0           2        0
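
In the output above, `male` and `S` both become 0, which is what a `LabelEncoder` fitted on the single new row would produce. An encoder fitted on training data containing both sexes and all three ports would map `male` to 1 and `S` to 2, so refitting at inference time can silently change what a feature value means. A minimal sketch of reusing training-time encoders instead, with hypothetical training categories and the sample values from the output above:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training rows covering the categories seen during training.
train = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})
encoders = {col: LabelEncoder().fit(train[col]) for col in ["Sex", "Embarked"]}

# The new sample, with the values shown in the raw test data above.
sample = pd.DataFrame({
    "Pclass": [3], "Sex": ["male"], "Age": [25], "SibSp": [1],
    "Parch": [0], "Fare": [7.25], "Embarked": ["S"],
})
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1
sample["IsAlone"] = (sample["FamilySize"] == 1).astype(int)

# Reuse the training-time mapping instead of fitting a new encoder.
for col, enc in encoders.items():
    sample[col] = enc.transform(sample[col])

# For contrast: an encoder fitted on the single row maps its only value to 0.
refit_sex = LabelEncoder().fit_transform(pd.Series(["male"]))

print(sample)
print(refit_sex)
```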

# 4.3.3 Retraining the model from the JSON parameters, then predicting

Model Reasoning 
    Pclass  Sex  Age  SibSp  Parch  Fare  Embarked  FamilySize  IsAlone
0       3    0   25      1      0  7.25         0           2        0
Inference result: [0]

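ML with XGBoost: a binary-classification case study on the Titanic dataset (imputation / label encoding / reprocessing of inference data) using the XGBoost algorithm, with model export to a JSON file and loading for inference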

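A binary-classification case study on the Titanic dataset (one-hot encoding / label encoding) using the XGBoost algorithm, with model export to a JSON file and loading for inference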

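# 1. Defining the dataset

# 2. Data preprocessing

# 2.1 Missing-value imputation

# 2.2 Feature construction

# 2.3 Feature encoding

# 2.4 Separating features and labels

# 3. Model training and evaluation

# 3.1 Splitting the dataset into training and test sets

# 3.2 Model training and evaluation

# 3.3 Exporting the model to a JSON file

# Get the model's parameters

# 4. Model inference

# 4.1 Loading the model file

# 4.2 Creating the model and loading the parameters from the JSON file

# 4.3 Model inference

# 4.3.1 Loading a new sample

# 4.3.2 Preprocessing the new sample

# 4.3.3 Retraining the model from the JSON parameters, then predicting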
