# 1. Define the dataset
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
   PassengerId  Survived  Pclass  ...     Fare  Cabin  Embarked
0            1         0       3  ...   7.2500    NaN         S
1            2         1       1  ...  71.2833    C85         C
2            3         1       3  ...   7.9250    NaN         S
3            4         1       1  ...  53.1000   C123         S
4            5         0       3  ...   8.0500    NaN         S

[5 rows x 12 columns]
5
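The loading step itself is not shown, so the following is only a minimal sketch of how output like the above could be produced. The function name `load_and_inspect` and its argument are hypothetical. The stray `None` in the output is consistent with wrapping `df.info()` in `print()`, since `info()` prints the summary itself and returns `None`.

```python
import pandas as pd


def load_and_inspect(csv_source):
    """Read a Titanic-style CSV and print a column summary plus the first rows.

    `csv_source` is a hypothetical path or file-like object; the source
    does not say where the data comes from.
    """
    df = pd.read_csv(csv_source)
    df.info()         # prints the summary directly; its return value is None
    print(df.head())  # first five rows
    return df
```

Calling `df.info()` without `print()` avoids the extra `None` line.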
# 2. Data preprocessing
# 2.1 Fill missing values
# 2.2 Feature construction
after fillna and FE
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Survived    891 non-null    int64
 1   Pclass      891 non-null    int64
 2   Sex         891 non-null    object
 3   Age         891 non-null    float64
 4   SibSp       891 non-null    int64
 5   Parch       891 non-null    int64
 6   Fare        891 non-null    float64
 7   Embarked    891 non-null    object
 8   FamilySize  891 non-null    int64
 9   IsAlone     891 non-null    int32
dtypes: float64(2), int32(1), int64(5), object(2)
memory usage: 66.3+ KB
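The summary above shows `Age` and `Embarked` fully populated, four columns gone, and two new columns added. A minimal sketch of a preprocessing step that produces this shape follows. The fill strategy (median age, most frequent port) is an assumption, as is the function name `preprocess`. The `FamilySize` formula is inferred from the sample later in this document, where `SibSp=1` and `Parch=0` give `FamilySize=2`.

```python
import pandas as pd


def preprocess(df):
    """Fill missing values and add family features (fill strategy is assumed)."""
    df = df.copy()
    # Assumption: median for Age, most frequent value for Embarked.
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    # The passenger plus siblings/spouses and parents/children aboard.
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
    # These four columns are absent from the 10-column summary above.
    return df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"],
                   errors="ignore")
```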
# 2.3 Feature encoding
after LabelEncoder
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Survived    891 non-null    int64
 1   Pclass      891 non-null    int64
 2   Sex         891 non-null    int32
 3   Age         891 non-null    float64
 4   SibSp       891 non-null    int64
 5   Parch       891 non-null    int64
 6   Fare        891 non-null    float64
 7   Embarked    891 non-null    int32
 8   FamilySize  891 non-null    int64
 9   IsAlone     891 non-null    int32
dtypes: float64(2), int32(3), int64(5)
memory usage: 59.3 KB
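After encoding, `Sex` and `Embarked` are integers. A sketch of that step with scikit-learn's `LabelEncoder` is shown here. The function name `encode_categoricals` is hypothetical, and returning the fitted encoders is a design choice of this sketch so that new samples can later be encoded with the same mapping.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_categoricals(df, columns=("Sex", "Embarked")):
    """Label-encode the given columns; return the frame and the fitted encoders."""
    df = df.copy()
    encoders = {}
    for col in columns:
        enc = LabelEncoder()
        # Classes are sorted, so e.g. female -> 0 and male -> 1.
        df[col] = enc.fit_transform(df[col])
        encoders[col] = enc  # keep for encoding new samples consistently
    return df, encoders
```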
# 2.4 Separate features and labels
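A minimal sketch of this step, assuming `Survived` is the label and every other remaining column is a feature (the helper name is hypothetical):

```python
import pandas as pd


def split_features_label(df, label="Survived"):
    """Return (X, y): all columns except the label, and the label column."""
    X = df.drop(columns=[label])
    y = df[label]
    return X, y
```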
# 3. Model training and evaluation
# 3.1 Split the dataset into training and test sets
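The accuracy reported below equals 151/179, and 179 rows is what a 20% hold-out of 891 rows gives, so the split was probably `test_size=0.2`. The sketch below makes that assumption. The seed value is purely illustrative, since the source does not state one.

```python
from sklearn.model_selection import train_test_split


def make_split(X, y, test_size=0.2, seed=42):
    """Hold out a test fraction; `seed` is an assumed value for reproducibility."""
    return train_test_split(X, y, test_size=test_size, random_state=seed)
```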
# 3.2 Train and evaluate the model
Accuracy: 0.8435754189944135
F1: 0.7812500000000001
AUC: 0.8275978407557355
XGBoost 0.8435754189944135 0.7812500000000001 0.8275978407557355
| Model | ACC | F1 | AUC |
| --- | --- | --- | --- |
| XGBoost | 0.832402235 | 0.765625 | 0.815519568 |
| XGBoost+FamilySize | 0.843575419 | 0.78125 | 0.827597841 |
| XGBoost+FamilySize+IsAlone | 0.843575419 | 0.78125 | 0.827597841 |
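The three metrics in each row can be cross-checked by hand. The confusion-matrix counts used below are inferred from the reported numbers (a 179-row test set with 65 survivors) and are not stated in the source. With those counts, the AUC column equals the mean of the true-positive and true-negative rates, which is what AUC reduces to when it is computed from hard 0/1 predictions.

```python
def metrics_from_counts(tp, fn, fp, tn):
    """Accuracy, F1 and hard-label AUC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    # With 0/1 predictions the ROC curve has a single interior point,
    # so AUC is the average of TPR and TNR.
    auc = (tp / (tp + fn) + tn / (tn + fp)) / 2
    return acc, f1, auc


# Inferred counts (assumptions, not source data):
baseline = metrics_from_counts(tp=49, fn=16, fp=14, tn=100)     # XGBoost
with_family = metrics_from_counts(tp=50, fn=15, fp=13, tn=101)  # +FamilySize
```

Under this reading, adding `FamilySize` fixes one false negative and one false positive, and adding `IsAlone` on top changes nothing on this test set.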
# 3.3 Export the model to a JSON file
# Get the model's parameters
model.json {'objective': 'binary:logistic', 'use_label_encoder': None, 'base_score': None, 'booster': None, 'callbacks': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': None, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': None, 'feature_types': None, 'gamma': None, 'gpu_id': None, 'grow_policy': None, 'importance_type': None, 'interaction_constraints': None, 'learning_rate': None, 'max_bin': None, 'max_cat_threshold': None, 'max_cat_to_onehot': None, 'max_delta_step': None, 'max_depth': None, 'max_leaves': None, 'min_child_weight': None, 'missing': nan, 'monotone_constraints': None, 'n_estimators': 100, 'n_jobs': None, 'num_parallel_tree': None, 'predictor': None, 'random_state': None, 'reg_alpha': None, 'reg_lambda': None, 'sampling_method': None, 'scale_pos_weight': None, 'subsample': None, 'tree_method': None, 'validate_parameters': None, 'verbosity': None}
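The dump above contains hyperparameters only, with no trained trees. A sketch of writing and reading such a dictionary with the standard `json` module follows. The helper names are hypothetical; in the original workflow the dictionary presumably comes from `model.get_params()`.

```python
import json


def export_params(params, path):
    """Write a hyperparameter dict to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # Python writes float('nan') as the bare token NaN, which is not
        # strict JSON but is read back correctly by json.load.
        json.dump(params, f, indent=2)


def load_params(path):
    """Read the hyperparameter dict back from the JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

A typical call would be `export_params(model.get_params(), "model.json")`. Entries that are not JSON-serializable, such as a non-empty `callbacks` list, would need to be removed first.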
# 4. Model inference
# 4.1 Load the model file
# 4.2 Create the model and load the JSON parameters
# 4.3 Model inference
# 4.3.1 Load one new sample
# 4.3.2 Preprocess the new sample
raw test data
Pclass Sex Age SibSp Parch Fare Embarked FamilySize IsAlone
0 3 male 25 1 0 7.25 S 2 0
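A sketch of building this one-row sample, using the values printed above. The helper name `make_sample` is hypothetical, and the derived columns follow the same `SibSp + Parch + 1` rule assumed for training.

```python
import pandas as pd


def make_sample(pclass, sex, age, sibsp, parch, fare, embarked):
    """Build a one-row frame with the same feature columns as the training data."""
    sample = pd.DataFrame([{
        "Pclass": pclass, "Sex": sex, "Age": age, "SibSp": sibsp,
        "Parch": parch, "Fare": fare, "Embarked": embarked,
    }])
    # Derived features, computed the same way as during training.
    sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1
    sample["IsAlone"] = (sample["FamilySize"] == 1).astype(int)
    return sample


sample = make_sample(3, "male", 25, 1, 0, 7.25, "S")
```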
test data after LabelEncoder
Pclass Sex Age SibSp Parch Fare Embarked FamilySize IsAlone
0 3 0 25 1 0 7.25 0 2 0
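In the encoded row above, `male` and `S` both became 0. That is what a `LabelEncoder` fit on the single new row produces. An encoder fit on the full training columns sorts its classes, so it would map `male` to 1 and `S` to 2. If training used full-column encoders, the new sample should go through those same encoders. The sketch below illustrates the difference. The class lists are assumptions based on the usual values in this dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumption: these are the category values seen during training.
sex_encoder = LabelEncoder().fit(["female", "male"])
embarked_encoder = LabelEncoder().fit(["C", "Q", "S"])


def encode_sample(sample):
    """Encode a new sample with encoders fit on the training categories."""
    sample = sample.copy()
    sample["Sex"] = sex_encoder.transform(sample["Sex"])
    sample["Embarked"] = embarked_encoder.transform(sample["Embarked"])
    return sample


# Refitting on the lone sample collapses every category to 0:
refit_code = LabelEncoder().fit_transform(["male"])[0]
```

Here `encode_sample` gives `Sex=1` and `Embarked=2` for the sample, while `refit_code` is 0, matching the output shown above.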
# 4.3.3 Retrain the model from the JSON parameters, then run inference
Model Reasoning
Pclass Sex Age SibSp Parch Fare Embarked FamilySize IsAlone
0 3 0 25 1 0 7.25 0 2 0
Inference result: [0]