XGBoost Classification Model Optimization: A Combined Guide to Hyperparameter Tuning and Performance Improvement

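🧾 1. Extreme Gradient Boosting (XGBoost)

XGBoost is an open-source software library that provides a regularized gradient boosting framework for C++, Java, Python, R, Julia, Perl, and Scala. It runs on Linux, Windows, and macOS. According to the project description, it aims to provide a "scalable, portable and distributed gradient boosting library".

The extreme gradient boosting algorithm. Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems. The ensembles are built from decision tree models.

XGBoost is a more regularized form of gradient boosting. It uses advanced regularization (L1 and L2), which improves the model's ability to generalize. Compared with plain gradient boosting, XGBoost delivers high performance: training is very fast and can be parallelized across a cluster.

Extreme Gradient Boosting (xgboost) is similar to the gradient boosting framework but more efficient. It includes both a linear model solver and tree learning algorithms. Part of what makes it fast is its ability to do parallel computation on a single machine……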

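Introduction to hyperparameters

🛠 2. Hyperparameters

What does hyperparameter tuning mean?

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned. Hyperparameter tuning is a tightrope walk that balances underfitting against overfitting. Underfitting is when a model fails to reduce the error on either the training set or the test set. Overfitting is the opposite case: the training error is low but the test error stays high.

What are some examples of hyperparameters?

Some examples of model hyperparameters include: the penalty in a logistic regression classifier, i.e. L1 or L2 regularization; the learning rate used to train a neural network; and the C and sigma hyperparameters of a support vector machine.

Why do we use hyperparameter tuning?

Hyperparameter tuning is an essential part of controlling the behavior of a machine learning model. If we do not tune the hyperparameters correctly, the estimated model parameters produce suboptimal results because they do not minimize the loss function. This means the model makes more errors.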

XGBoost parameters: classification

⌛️ 3. XGBoost hyperparameters for classification

XGBoost Boosters

-> gbtree :

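gbtree is used for tree-based models.

-> gblinear :

gblinear is used for linear models; it fits a linear model at each boosting iteration.

-> dart :

dart is also a tree-based model.
XGBoost mostly combines a huge number of regression trees with a small learning rate. In this situation, trees added early are significant and trees added late are unimportant. Vinayak and Gilad-Bachrach proposed a new method that brings the dropout technique from the deep neural network community to boosted trees, and reported better results in some situations; this method is called DART. It drops trees in order to counter overfitting, which can prevent trivial trees (trees that only correct trivial errors). Because of the randomness introduced during training, expect the following differences: training can be slower than gbtree because random dropout prevents use of the prediction buffer, and early stopping might not be stable because of the randomness.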
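The dropout step described above can be sketched in a few lines. This is a simplified illustration, not the library's implementation: the helper names `choose_dropped_trees` and `tree_normalization` are hypothetical, while `booster`, `rate_drop`, `skip_drop` and `normalize_type` are XGBoost parameter names. The scaling follows the rule the XGBoost documentation gives for `normalize_type="tree"`.

```python
import random

# Parameter dictionary one might pass to XGBoost to enable DART.
dart_params = {
    "booster": "dart",
    "objective": "binary:logistic",
    "rate_drop": 0.1,          # fraction of existing trees dropped per round
    "skip_drop": 0.5,          # probability of skipping dropout in a round
    "normalize_type": "tree",
}


def choose_dropped_trees(num_trees, rate_drop, rng):
    """Pick indices of existing trees to drop, each with probability rate_drop."""
    return [i for i in range(num_trees) if rng.random() < rate_drop]


def tree_normalization(num_dropped, learning_rate):
    """Return (new_tree_weight, dropped_tree_scale) for normalize_type='tree'."""
    k = num_dropped
    new_tree_weight = 1.0 / (k + learning_rate)
    dropped_tree_scale = k / (k + learning_rate)
    return new_tree_weight, dropped_tree_scale


rng = random.Random(0)
dropped = choose_dropped_trees(num_trees=50, rate_drop=dart_params["rate_drop"], rng=rng)
print("dropped trees:", dropped)
print("weights:", tree_normalization(len(dropped), learning_rate=0.3))
```

Because a different subset of trees is dropped in each round, the ensemble's predictions have to be recomputed instead of cached, which is why training tends to be slower than with gbtree.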

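General parameters

These relate to which booster we use for boosting, commonly a tree model or a linear model.

1. booster [default=gbtree]

gbtree is used for tree-based models; gblinear is used for linear models. The chosen booster is the type of model fitted at each iteration.

2. silent [default=0]:

To activate silent mode, set it to 1; no running messages will be printed.

It is generally a good idea to keep it at 0, because the messages can help in understanding the model and how the metrics are progressing. Recent XGBoost releases replace silent with the verbosity parameter.

3. nthread [defaults to the maximum number of threads available if not set]

This is used for parallel processing and should be set to the number of cores in the system. If you want to run on all cores, leave the value unset and the algorithm will detect it automatically.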

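Booster parameters

These depend on which booster you have chosen. Here we discuss the tree-based booster.

1. eta [default=0.3]

Analogous to learning_rate; it shrinks the weights at each step so that the model generalizes better.
Typical final values to use: 0.01-0.2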
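A toy sketch can show why a smaller eta needs more boosting rounds. It assumes an idealized learner that fits the current residual exactly, so each round removes a fraction eta of what is left; the function name `rounds_to_converge` and the tolerance are illustrative assumptions, not part of XGBoost.

```python
def rounds_to_converge(eta, tolerance=0.01, max_rounds=10_000):
    """Count boosting rounds until the residual falls below `tolerance`.

    Toy setting: one target value of 1.0, an initial prediction of 0.0,
    and a learner that fits the residual exactly. Each round adds
    eta * residual, so the residual shrinks by a factor of (1 - eta).
    """
    prediction, target = 0.0, 1.0
    for round_number in range(1, max_rounds + 1):
        residual = target - prediction
        prediction += eta * residual  # shrunken update
        if abs(target - prediction) < tolerance:
            return round_number
    return max_rounds


for eta in (0.3, 0.1, 0.01):
    print(f"eta={eta}: {rounds_to_converge(eta)} rounds")
```

Real trees never fit the residual exactly, so the counts are only indicative, but the trend is the reason the learning rate and the number of boosting rounds are usually tuned together.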

2. min_child_weight [default=1]

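Defines the minimum sum of weights of all observations required in a child. It is used to control overfitting: higher values prevent the model from learning relations that may be highly specific to the particular sample selected for a tree.
Values that are too high can lead to underfitting, so it should be tuned using CV.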
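In XGBoost the "weight" of an observation here is its hessian. For the binary logistic objective that is p * (1 - p), so confidently classified rows contribute very little. The sketch below illustrates the check under that assumption; the function names are hypothetical.

```python
def hessian_sum_logistic(probabilities):
    """Sum of logistic-loss hessians p * (1 - p) over the rows in a child."""
    return sum(p * (1.0 - p) for p in probabilities)


def child_is_allowed(probabilities, min_child_weight=1.0):
    """A child node is kept only if its hessian sum reaches min_child_weight."""
    return hessian_sum_logistic(probabilities) >= min_child_weight


uncertain_rows = [0.5] * 10   # ten rows the model is unsure about
confident_rows = [0.99] * 10  # ten rows already predicted with confidence

print(hessian_sum_logistic(uncertain_rows), child_is_allowed(uncertain_rows))
print(hessian_sum_logistic(confident_rows), child_is_allowed(confident_rows))
```

With the default of 1, ten uncertain rows are enough to form a child, while ten confidently predicted rows are not, which stops the tree from carving out tiny, already well-fitted groups.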

3. max_depth [default=6]

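The maximum depth of a tree. It is also used to control overfitting, because a greater depth lets the model learn relations that are very specific to a particular sample. It should be tuned using CV.
Typical values: 3-10

4. max_leaf_nodes

The maximum number of terminal nodes, or leaves, in a tree.
Because binary trees are created, a depth of "n" produces at most 2^n leaves. If this is defined, GBM ignores max_depth.

5. gamma [default=0]

A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
It makes the algorithm more conservative. Suitable values vary with the loss function and should be tuned. It can be used to control overfitting.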
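The role of gamma can be made concrete with the split-gain formula from the XGBoost paper, where G and H are the sums of gradients and hessians in the left and right children. The sketch below is an illustration of that formula only; the function name is hypothetical, and the library's internal scaling of the gain may differ.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a candidate split, minus the gamma penalty."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    return 0.5 * (children - parent) - gamma


# The same candidate split evaluated under two gamma settings.
for gamma in (0.0, 5.0):
    gain = split_gain(-4.0, 3.0, 5.0, 4.0, reg_lambda=1.0, gamma=gamma)
    print(f"gamma={gamma}: gain={gain:.4f}, split={'yes' if gain > 0 else 'no'}")
```

The split is worth 4.4375 before the penalty, so it is accepted with gamma=0 and rejected with gamma=5.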

6. max_delta_step [default=0]

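The maximum delta step we allow each tree's weight estimate to be. A value of 0 means there is no constraint. A positive value helps make the update step more conservative.
This is generally not used.

7. subsample [default=1]

The fraction of observations randomly sampled for each tree.
Lower values make the algorithm more conservative and prevent overfitting, but values that are too small can lead to underfitting.
Typical values: 0.5-1

8. colsample_bytree [default=1]

The fraction of columns randomly sampled for each tree. A smaller colsample_bytree provides additional regularization.
Typical values: 0.5-1

9. colsample_bylevel [default=1]

The subsample ratio of columns for each split, in each level.

10. alpha [default=0]

L1 regularization term.
L1 regularization subtracts a small amount from the weights at each iteration, which eventually drives the weights of uninformative features to exactly zero. It is sometimes called regularization for sparsity. In XGBoost it applies to the leaf weights (not feature weights); larger values mean more regularization.

11. lambda [default=1]

L2 regularization term.
This handles the regularization part of XGBoost and should be explored to reduce overfitting.
L2 regularization acts like a force that removes a small percentage of each weight at every iteration, so the weights shrink but never reach exactly zero. The L2 penalty, (weight)², is scaled by an additional parameter called the regularization rate (lambda).
It is smoother than alpha. It also applies to the leaf weights.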
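The different behavior of alpha and lambda shows up in the optimal leaf weight. A minimal sketch, assuming the usual closed form in which alpha soft-thresholds the gradient sum G and lambda is added to the hessian sum H; the function names are illustrative.

```python
def soft_threshold(value, alpha):
    """Shrink `value` toward zero by `alpha`, clamping to zero inside [-alpha, alpha]."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def leaf_weight(grad_sum, hess_sum, reg_alpha=0.0, reg_lambda=1.0):
    """Leaf weight w = -soft_threshold(G, alpha) / (H + lambda)."""
    return -soft_threshold(grad_sum, reg_alpha) / (hess_sum + reg_lambda)


G, H = -3.0, 4.0
print(leaf_weight(G, H, reg_alpha=0.0, reg_lambda=0.0))  # no regularization
print(leaf_weight(G, H, reg_alpha=0.0, reg_lambda=1.0))  # L2 only: smaller, non-zero
print(leaf_weight(G, H, reg_alpha=1.0, reg_lambda=1.0))  # L1 + L2: smaller again
print(leaf_weight(G, H, reg_alpha=5.0, reg_lambda=1.0))  # strong L1: weight is zero
```

Lambda only rescales the weight, so it never reaches zero, whereas a large enough alpha zeroes the leaf outright.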

12. scale_pos_weight [default=1]

When the classes are highly imbalanced, a value greater than 0 should be used, because it helps convergence.
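A common starting value, suggested in the XGBoost documentation, is the ratio of negative to positive instances. The sketch below computes it for a made-up label list; the helper name and the data are illustrative only.

```python
def suggested_scale_pos_weight(labels):
    """Return count(negative) / count(positive) for 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives


toy_labels = [0] * 90 + [1] * 10  # 10% positive class
print(suggested_scale_pos_weight(toy_labels))
```

For a roughly balanced target the ratio is close to 1, in which case the default is usually adequate.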

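Learning task parameters

These decide the learning scenario. For example, a regression task may use different parameters from a ranking task.

1. objective [default=reg:linear]

This defines the loss function to be minimized.
The most commonly used values are:
reg:linear - for regression (named reg:squarederror in current XGBoost releases)
reg:logistic - logistic regression
binary:logistic - logistic regression for binary classification; returns the predicted probability (not the class)
multi:softmax - multiclass classification using the softmax objective; returns the predicted class (not probabilities); you also need to set the additional num_class parameter to the number of unique classes
multi:softprob - same as softmax, but returns the predicted probability of each data point belonging to each class.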
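The difference between the two multiclass objectives is only in what is returned. A small sketch with made-up raw scores for three classes: the probability vector corresponds to what multi:softprob reports, and its argmax to what multi:softmax reports.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


raw_scores = [2.0, 1.0, 0.1]                          # illustrative scores
probabilities = softmax(raw_scores)                   # multi:softprob style
predicted_class = probabilities.index(max(probabilities))  # multi:softmax style

print([round(p, 3) for p in probabilities])
print(predicted_class)
```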

2. eval_metric [ default according to objective ]

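The metric used on the validation data.
The default is rmse for regression and error for classification.
Typical values are:
rmse – root mean squared error
mae – mean absolute error
logloss – negative log-likelihood
error – binary classification error rate (0.5 threshold)
merror – multiclass classification error rate
mlogloss – multiclass log loss
auc – area under the curve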
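Two of these metrics are easy to compute by hand, which helps when reading the evaluation log. The sketch below uses made-up labels and probabilities; the function names are illustrative, and the error metric treats predictions above 0.5 as positive.

```python
import math


def logloss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def error_rate(y_true, y_prob, threshold=0.5):
    """Fraction of rows misclassified at the given probability threshold."""
    wrong = sum(1 for y, p in zip(y_true, y_prob) if (p > threshold) != (y == 1))
    return wrong / len(y_true)


y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
print(round(logloss(y_true, y_prob), 6))
print(error_rate(y_true, y_prob))
```

The error rate only looks at which side of the threshold a prediction falls on, while logloss also penalizes how far a probability is from the true label.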

3. seed [default=0]

Used to make results reproducible.

Command-line parameters

- These relate to the behavior of the CLI version of XGBoost.

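🧮 4. Understanding the bias-variance trade-off

If you have taken a machine learning or statistics course, this is probably one of the most important concepts. When we allow the model to become more complex (for example, deeper), it is better able to fit the training data, which gives a less biased model. However, such a complex model needs more data to fit.

Most parameters in XGBoost are about the bias-variance trade-off. The best model should carefully trade model complexity against predictive power. The parameter documentation tells you whether each parameter makes the model more conservative or not, which helps you turn the knob between a complex model and a simple one.

XGBoost hyperparameter tuning approach

📑 6. The best approach to hyperparameter tuning

Import the required modules and functions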

import numpy as np 
import pandas as pd 
import os
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import warnings
import json
from sklearn import manifold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import confusion_matrix
from xgboost import plot_tree

warnings.filterwarnings('ignore')

def process_df(df):
    # Drop identifier columns and rows with missing values
    df = df.drop(['Name', 'PassengerId'], axis=1)
    df = df.dropna()

    # Separate the target and convert it to 0/1
    target = df['Transported']
    df = df.drop(['Transported'], axis=1)
    target = target.astype(int)

    # Take single characters of the Cabin string as separate features
    df['Cabin_1'] = df['Cabin'].str[0]
    df['Cabin_2'] = df['Cabin'].str[2]
    df['Cabin_3'] = df['Cabin'].str[5]
    df = df.drop(['Cabin'], axis=1)

    # Create the training and test datasets
    X_train, X_test, y_train, y_test = train_test_split(
        df, target, test_size=0.2, random_state=100, stratify=target)

    numeric_columns = list(df.select_dtypes(include=np.number).columns)
    print("Numaric columns (" + str(len(numeric_columns)) + ") :", ", ".join(numeric_columns))
    cat_columns = df.select_dtypes(include=['object']).columns.tolist()
    print("Categorical columns (" + str(len(cat_columns)) + ") :", ", ".join(cat_columns))

    # Copy the numeric parts so that new columns can be added safely
    X_train_n = X_train[numeric_columns].copy()
    X_test_n = X_test[numeric_columns].copy()
    X_train_c = X_train[cat_columns]
    X_test_c = X_test[cat_columns]

    # Ordinal-encode the categorical columns (fit on train only).
    # Reuse the original row index; a default RangeIndex would misalign
    # these columns with X_train_n / X_test_n on assignment.
    encoder = OrdinalEncoder()
    X_train_c = pd.DataFrame(encoder.fit_transform(X_train_c), index=X_train.index)
    X_test_c = pd.DataFrame(encoder.transform(X_test_c), index=X_test.index)

    # Append the encoded columns as cat_1, cat_2, ...
    i = 1
    for column in X_train_c:
        X_train_n["cat_" + str(i)] = X_train_c[column]
        X_test_n["cat_" + str(i)] = X_test_c[column]
        i = i + 1
    #X_train=pd.concat([X_train_n,X_train_c],axis=1,ignore_index=True)
    #X_test=pd.concat([X_test_n,X_test_c],axis=1,ignore_index=True)

    # Fill any remaining missing values with the column mean
    X_train_n = X_train_n.fillna(X_train_n.mean())
    X_test_n = X_test_n.fillna(X_test_n.mean())
    return X_train_n, X_test_n, y_train, y_test

Prepare the data

df= pd.read_csv("./spaceship-titanic/train.csv")
df.head()

X_train, X_test, y_train, y_test = process_df(df)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Numaric columns (6) : Age, RoomService, FoodCourt, ShoppingMall, Spa, VRDeck
Categorical columns (7) : HomePlanet, CryoSleep, Destination, VIP, Cabin_1, Cabin_2, Cabin_3
(5411, 13) (1353, 13) (5411,) (1353,)

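Prepare a helper function to make the later steps easier

Basic parameters:

booster="gbtree", the default tree-based booster
objective="binary:logistic", to match the problem (binary classification)
tree_method="gpu_hist" to use the GPU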

def xgb_helper(PARAMETERS, V_PARAM_NAME=False, V_PARAM_VALUES=False, BR=10):
    # Build the DMatrix from the global training split
    temp_dmatrix = xgb.DMatrix(data=X_train, label=y_train)

    if V_PARAM_VALUES is False:
        # No sweep requested: run one 5-fold CV and return the per-round results
        cv_results = xgb.cv(dtrain=temp_dmatrix, nfold=5, num_boost_round=BR,
                            params=PARAMETERS, as_pandas=True, seed=123)
        return cv_results
    else:
        # Sweep V_PARAM_NAME over V_PARAM_VALUES and record the final-round AUCs
        results = []
        for v_param_value in V_PARAM_VALUES:
            PARAMETERS[V_PARAM_NAME] = v_param_value
            cv_results = xgb.cv(dtrain=temp_dmatrix, nfold=5, num_boost_round=BR,
                                params=PARAMETERS, as_pandas=True, seed=123)
            results.append((cv_results["train-auc-mean"].tail().values[-1],
                            cv_results["test-auc-mean"].tail().values[-1]))
        # Each "auc" cell holds a (train AUC, test AUC) tuple
        data = list(zip(V_PARAM_VALUES, results))
        print(pd.DataFrame(data, columns=[V_PARAM_NAME, "auc"]))
        return cv_results

Create a generic baseline model and evaluate its performance

PARAMETERS={"objective":'binary:logistic',"eval_metric":"auc"}
xgb_helper(PARAMETERS)

	train-auc-mean	train-auc-std	test-auc-mean	test-auc-std
0	0.858413	0.002187	0.821862	0.005628
1	0.870282	0.001689	0.828690	0.004939
2	0.876322	0.000856	0.835879	0.002703
3	0.880106	0.001456	0.838758	0.004261
4	0.884137	0.001624	0.838674	0.003967
5	0.887057	0.001697	0.841544	0.003510
6	0.889695	0.002341	0.842214	0.005057
7	0.891214	0.002497	0.842708	0.005743
8	0.892886	0.002178	0.843644	0.005887
9	0.893928	0.002417	0.844092	0.005543

Tune the number of boosting rounds (using the DMatrix from xgb)

# Create the DMatrix: train_dmatrix
train_dmatrix = xgb.DMatrix(data=X_train, label=y_train)

# Parameter dictionary shared by every run
params = {"objective": "binary:logistic", "max_depth": 5}

# Numbers of boosting rounds to try
num_rounds = [5, 10, 15, 20, 25]

# Final-round test AUC of each XGBoost model
final_auc_per_round = []

# Build one cross-validated model per num_boost_round value
for curr_num_rounds in num_rounds:
    # Run 5-fold cross-validation: cv_results
    cv_results = xgb.cv(dtrain=train_dmatrix, params=params, nfold=5,
                        num_boost_round=curr_num_rounds, metrics="auc",
                        as_pandas=True, seed=123)
    # Append the final-round test AUC
    final_auc_per_round.append(cv_results["test-auc-mean"].tail().values[-1])

# Print the resulting DataFrame
num_rounds_aucs = list(zip(num_rounds, final_auc_per_round))
print(pd.DataFrame(num_rounds_aucs, columns=["num_boosting_rounds", "auc"]))

   num_boosting_rounds       auc
0                    5  0.840033
1                   10  0.843594
2                   15  0.845747
3                   20  0.845400
4                   25  0.844094

Pointers:

Taking num_boosting_rounds = 10, to avoid overfitting.
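One way to make this kind of choice explicit is to prefer the smallest number of rounds whose score is within a small tolerance of the best one. The sketch below applies that rule to the AUC values printed above; the function names and the 0.003 tolerance are assumptions chosen for illustration, not a rule taken from XGBoost.

```python
rounds = [5, 10, 15, 20, 25]
test_auc = [0.840033, 0.843594, 0.845747, 0.845400, 0.844094]


def best_round(rounds, scores):
    """Number of rounds with the highest score."""
    return rounds[scores.index(max(scores))]


def smallest_round_within(rounds, scores, tolerance):
    """Smallest number of rounds scoring within `tolerance` of the best."""
    best = max(scores)
    for r, s in zip(rounds, scores):
        if best - s <= tolerance:
            return r
    return rounds[-1]


print(best_round(rounds, test_auc))
print(smallest_round_within(rounds, test_auc, tolerance=0.003))
```

The highest test AUC is at 15 rounds, but 10 rounds is within 0.003 of it, so a preference for the simpler model leads to 10.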

1. Choose the learning rate. You can start with a fairly high value; 0.1-0.5 is a reasonable starting range in most cases.

PARAMETERS={"objective":'binary:logistic',"eval_metric":"auc","learning_rate": 0.5}
xgb_helper(PARAMETERS)

	train-auc-mean	train-auc-std	test-auc-mean	test-auc-std
0	0.858413	0.002187	0.821862	0.005628
1	0.872958	0.002187	0.831666	0.004109
2	0.879817	0.001506	0.836776	0.005321
3	0.884087	0.002211	0.838199	0.005545
4	0.888099	0.002316	0.840862	0.006667
5	0.890920	0.002014	0.841520	0.005769
6	0.893914	0.001330	0.842223	0.005199
7	0.896395	0.002359	0.842609	0.005717
8	0.897477	0.002852	0.842670	0.005631
9	0.900422	0.002209	0.842716	0.005787

Pointers:

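A good starting classifier, with a train-auc mean of 0.900422 and a test-auc mean of 0.842716. Let's keep tuning and reduce the overfitting.

2. Using CV, tune max_depth and min_child_weight next.

2.1. Tune max_depth.

Tips: keep it in the range of about 3-10.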

PARAMETERS={"objective":'binary:logistic',"eval_metric":"auc","learning_rate": 0.5}
V_PARAM_NAME="max_depth"
V_PARAM_VALUES=range(3,10,1)
data=xgb_helper(PARAMETERS,V_PARAM_NAME=V_PARAM_NAME,V_PARAM_VALUES=V_PARAM_VALUES)

   max_depth                                       auc
0          3  (0.8643899441624237, 0.8434407018878254)
1          4  (0.8762391828253827, 0.8432457875673736)
2          5  (0.8898517716460612, 0.8449580853400329)
3          6  (0.9004222967227931, 0.8427158084686912)
4          7  (0.9107993432039982, 0.8379593667718235)
5          8  (0.9233976130022082, 0.8309442810154544)
6          9  (0.9331601626845524, 0.8328975608830239)

Pointers:

Taking max_depth 5, as per the score.
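Reading the best value off a sweep can be automated. The sketch below copies the (train AUC, test AUC) pairs from the max_depth sweep, rounded to six decimals, picks the depth with the highest test AUC, and reports the train-test gap per depth. The function names are illustrative.

```python
# max_depth -> (train AUC, test AUC), rounded from the sweep output
sweep = {
    3: (0.864390, 0.843441),
    4: (0.876239, 0.843246),
    5: (0.889852, 0.844958),
    6: (0.900422, 0.842716),
    7: (0.910799, 0.837959),
    8: (0.923398, 0.830944),
    9: (0.933160, 0.832898),
}


def best_value(sweep):
    """Parameter value with the highest test AUC."""
    return max(sweep, key=lambda value: sweep[value][1])


def overfit_gaps(sweep):
    """Train AUC minus test AUC for each parameter value."""
    return {value: round(train - test, 6) for value, (train, test) in sweep.items()}


print("best max_depth:", best_value(sweep))
for depth, gap in overfit_gaps(sweep).items():
    print(depth, gap)
```

The gap grows from about 0.02 at depth 3 to about 0.10 at depth 9 while the test AUC falls, which is the overfitting pattern that deeper trees tend to produce.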

2.2. Tune min_child_weight.

Tips: keep it small for imbalanced datasets; larger values suit balanced ones.

PARAMETERS={"objective":'binary:logistic',"eval_metric":"auc","learning_rate": 0.5,"max_depth":5}
V_PARAM_NAME="min_child_weight"
V_PARAM_VALUES=range(0,5,1)
data=xgb_helper(PARAMETERS,V_PARAM_NAME=V_PARAM_NAME,V_PARAM_VALUES=V_PARAM_VALUES)

   min_child_weight                                       auc
0                 0  (0.8923932136315347, 0.8400307515925263)
1                 1  (0.8898517716460612, 0.8449580853400329)
2                 2  (0.8878844356167308, 0.8430174554684318)
3                 3   (0.8842988914848681, 0.842868768786811)
4                 4  (0.8841976959835126, 0.8426672020116197)

Pointers:

Taking min_child_weight 1, as per the score.

3. gamma.

Tips: keep it small, around 0.1-0.2. It will be adjusted later.

PARAMETERS={"objective":'binary:logistic',"eval_metric":"auc","learning_rate": 0.5,"max_depth":5,"min_child_weight":1}
V_PARAM_NAME = "gamma"
V_PARAM_VALUES = [0.1,0.2,0.5,1,1.5,2]
data=xgb_helper(PARAMETERS,V_PARAM_NAME=V_PARAM_NAME,V_PARAM_VALUES=V_PARAM_VALUES)

   gamma                                       auc
0    0.1   (0.889994877151358, 0.8443115370456209)
1    0.2  (0.8908020107961383, 0.8447684831269007)
2    0.5   (0.888948000423167, 0.8450077335411604)
3    1.0   (0.888184172336776, 0.8451616022595031)
4    1.5  (0.8875863399894248, 0.8435783340359313)
5    2.0  (0.8861811009373095, 0.8449147712912233)

Pointers:

Taking gamma 1, as per the score.

4. Tune subsample and colsample_bytree.

4.1. Tune subsample.

Tips: keep it in a small range, 0.5-0.9.

PARAMETERS={"objective":'binary:logistic',"eval_metric":"auc","learning_rate": 0.5,"max_depth":5,"min_child_weight":1,"gamma":1}
V_PARAM_NAME = "subsample"
V_PARAM_VALUES = [.4,.5,.6,.7,.8,.9]
data=xgb_helper(PARAMETERS,V_PARAM_NAME=V_PARAM_NAME,V_PARAM_VALUES=V_PARAM_VALUES)

   subsample                                       auc
0        0.4    (0.8749532101660904, 0.83281692487631)
1        0.5  (0.8789542430190034, 0.8356286366243391)
2        0.6  (0.8804439995005579, 0.8371190653296372)
3        0.7  (0.8852418637174774, 0.8388110573107215)
4        0.8  (0.8868771320489373, 0.8385871061415084)
5        0.9  (0.8880784719598278, 0.8414592031099557)

Pointers:

Taking 0.7

4.2. Tune colsample_bytree.

Tips: Keep it small in range 0.5-0.9.

PARAMETERS={"objective":'binary:logistic',"eval_metric":"auc","learning_rate": 0.5,"max_depth":5,"min_child_weight":1,"gamma":1,"subsample":0.7}
V_PARAM_NAME = "colsample_bytree"
V_PARAM_VALUES = [.4,.5,.6,.7,.8,.9]
data=xgb_helper(PARAMETERS,V_PARAM_NAME=V_PARAM_NAME,V_PARAM_VALUES=V_PARAM_VALUES)

   colsample_bytree                                       auc
0               0.4   (0.8737438808751801, 0.831927418798438)
1               0.5  (0.8781175336774438, 0.8371252843166564)
2               0.6  (0.8793282404547682, 0.8362558750372221)
3               0.7  (0.8815593124156965, 0.8382064256388843)
4               0.8  (0.8818068187259996, 0.8414608580897104)
5               0.9  (0.8824219339976163, 0.8391185793280422)

Pointers:

Taking 0.8

4.3. Tune scale_pos_weight.

Tips: based on class imbalance.

PARAMETERS={"objective":'binary:logistic',"eval_metric":"auc","learning_rate": 0.5,"max_depth":5,"min_child_weight":1,"gamma":1,"subsample":0.7,"colsample_bytree":.8}
V_PARAM_NAME = "scale_pos_weight"
V_PARAM_VALUES = [.5,1,2]
data=xgb_helper(PARAMETERS,V_PARAM_NAME=V_PARAM_NAME,V_PARAM_VALUES=V_PARAM_VALUES)

   scale_pos_weight                                       auc
0               0.5  (0.8793958985004761, 0.8381736971719818)
1               1.0  (0.8818068187259996, 0.8414608580897104)
2               2.0  (0.8810854054804358, 0.8366761184232949)

Pointers:

Taking 1

5. Tune the regularization parameters (alpha, lambda)

5.1. Vary alpha.

Tips: larger values mean stronger L1 regularization of the leaf weights.

PARAMETERS={"objective":'binary:logistic',"eval_metric":"auc","learning_rate": 0.5,"max_depth":5,"min_child_weight":1,"gamma":1,"subsample":0.7,"colsample_bytree":.8, "scale_pos_weight":1}
V_PARAM_NAME = "reg_alpha"
V_PARAM_VALUES = np.linspace(start=0.001, stop=1, num=20).tolist()
data=xgb_helper(PARAMETERS,V_PARAM_NAME=V_PARAM_NAME,V_PARAM_VALUES=V_PARAM_VALUES)

    reg_alpha                                       auc
0    0.001000  (0.8818047689235573, 0.8414642805838721)
1    0.053579  (0.8822512519204411, 0.8400495473072034)
2    0.106158  (0.8816533897741998, 0.8393925267300872)
3    0.158737  (0.8817818509762845, 0.8406296490672798)
4    0.211316  (0.8801161063692018, 0.8397169942954633)
5    0.263895  (0.8812037168173367, 0.8392127901963959)
6    0.316474  (0.8801615034837067, 0.8427679621158068)
7    0.369053  (0.8805653368213827, 0.8404140823263765)
8    0.421632  (0.8802277851530302, 0.8402395041940075)
9    0.474211  (0.8803759440342087, 0.8421274755689788)
10   0.526789  (0.8803912729575127, 0.8423146821519907)
11   0.579368  (0.8801605635521235, 0.8414003162278721)
12   0.631947  (0.8819615487131747, 0.8424679296448012)
13   0.684526  (0.8812525217771652, 0.8419542782384447)
14   0.737105  (0.8800798281136639, 0.8411308776060921)
15   0.789684   (0.8791173070125374, 0.843057560946075)
16   0.842263  (0.8800692484855718, 0.8442545484482243)
17   0.894842  (0.8790444602851833, 0.8393354812370287)
18   0.947421  (0.8796063533992223, 0.8406622971346293)
19   1.000000   (0.8784791031845056, 0.841360884610965)

Pointers:

Taking 0.15

5.2. Vary lambda.

Tips: larger values mean stronger L2 regularization of the leaf weights.

PARAMETERS={"objective":'binary:logistic',"eval_metric":"auc","learning_rate": 0.5,"max_depth":5,"min_child_weight":1,"gamma":1,"subsample":0.7,"colsample_bytree":.8, "scale_pos_weight":1,"reg_alpha":0.15}
V_PARAM_NAME = "reg_lambda"
V_PARAM_VALUES = np.linspace(start=0.001, stop=1, num=20).tolist()
data=xgb_helper(PARAMETERS,V_PARAM_NAME=V_PARAM_NAME,V_PARAM_VALUES=V_PARAM_VALUES)

    reg_lambda                                       auc
0     0.001000  (0.8842321498802038, 0.8369315944875698)
1     0.053579  (0.8846942202574949, 0.8367595558776119)
2     0.106158  (0.8842139683915015, 0.8376755628888233)
3     0.158737  (0.8841648054675174, 0.8374257380594093)
4     0.211316  (0.8846029132722426, 0.8409101483580509)
5     0.263895  (0.8849316913321706, 0.8349999831449061)
6     0.316474  (0.8827466914234432, 0.8407336856742689)
7     0.369053  (0.8835788031788671, 0.8399715195787192)
8     0.421632    (0.883240239698053, 0.839815382235226)
9     0.474211  (0.8833454760444799, 0.8381444821333328)
10    0.526789  (0.8817612864701729, 0.8376539639984177)
11    0.579368  (0.8820436530763995, 0.8367212403795014)
12    0.631947  (0.8814519611406787, 0.8389461863625524)
13    0.684526  (0.8808375865994101, 0.8382765202241252)
14    0.737105  (0.8809840506591808, 0.8395892670674611)
15    0.789684  (0.8814380192185137, 0.8398002164520746)
16    0.842263   (0.8822586609494353, 0.840774913104777)
17    0.894842  (0.8812572158281702, 0.8421072055782199)
18    0.947421  (0.8815423976980001, 0.8404889892591069)
19    1.000000   (0.881010486305841, 0.8407738594877234)

Pointers:

Taking 1

6. Finally, lower the learning rate and add more trees

6.1. Lower the learning rate.

PARAMETERS={"objective":'binary:logistic',"eval_metric":"auc","max_depth":5,"min_child_weight":1,"gamma":1,"subsample":0.7,"colsample_bytree":.8, "scale_pos_weight":1,"reg_alpha":0.15,"reg_lambda":1}
V_PARAM_NAME = "learning_rate"
V_PARAM_VALUES = np.linspace(start=0.01, stop=0.3, num=10).tolist()
data=xgb_helper(PARAMETERS,V_PARAM_NAME=V_PARAM_NAME,V_PARAM_VALUES=V_PARAM_VALUES)

   learning_rate                                       auc
0       0.010000   (0.855713530270808, 0.8334207445339894)
1       0.042222  (0.8587092932455096, 0.8348147563720609)
2       0.074444   (0.862042487494276, 0.8357581120451331)
3       0.106667  (0.8658326130454131, 0.8380194134948816)
4       0.138889   (0.867456218936832, 0.8397551989588203)
5       0.171111  (0.8710396787562106, 0.8411854139672925)
6       0.203333   (0.873634746455022, 0.8417298506166315)
7       0.235556  (0.8742498286647391, 0.8442379624397564)
8       0.267778  (0.8758428845033487, 0.8421168040274616)
9       0.300000  (0.8763744967034954, 0.8431341804372676)

Pointers:

Taking 0.3

Full model

PARAMETERS={"objective":'binary:logistic',"eval_metric":"auc","max_depth":5,"min_child_weight":1,"gamma":1,"subsample":0.7,"colsample_bytree":.8, "scale_pos_weight":1,"reg_alpha":0.15,"reg_lambda":1,"learning_rate": 0.3}

# Train the final classifier with the tuned values
clf = xgb.XGBClassifier(
    tree_method="gpu_hist",
    objective="binary:logistic",
    eval_metric="auc",
    max_depth=5,
    min_child_weight=1,
    gamma=1,
    subsample=0.7,
    colsample_bytree=.8,
    scale_pos_weight=1,
    reg_alpha=0.15,
    reg_lambda=1,
    learning_rate=0.3,
    n_estimators=800,
)
clf.fit(X_train, y_train)

# Save the trained model to disk
clf.save_model("categorical-model.json")

pred = clf.predict(X_test)

from sklearn.metrics import classification_report
print(classification_report(y_test, pred, target_names=["0","1"]))

              precision    recall  f1-score   support

           0       0.81      0.72      0.76       673
           1       0.75      0.83      0.79       680

    accuracy                           0.78      1353
   macro avg       0.78      0.78      0.78      1353
weighted avg       0.78      0.78      0.78      1353

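# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay is its replacement
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(clf, X_test, y_test)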

<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x7822c299bf10>

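from sklearn.metrics import roc_auc_score
# pred holds hard 0/1 labels, so this scores a single operating point;
# pass clf.predict_proba(X_test)[:, 1] instead to score the ranking
roc_auc_score(y_test,pred)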

0.7765033650904642
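An AUC computed from hard 0/1 predictions is usually lower than one computed from predicted probabilities, because thresholding throws away the ranking within each predicted class. The sketch below demonstrates this on made-up data with a small pairwise AUC function; with the classifier above, the probabilities would come from `clf.predict_proba(X_test)[:, 1]`.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_true = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.6, 0.45, 0.55, 0.3, 0.1]          # illustrative scores
hard_labels = [1 if p > 0.5 else 0 for p in probabilities]

print(round(roc_auc(y_true, probabilities), 4))  # AUC from probabilities
print(round(roc_auc(y_true, hard_labels), 4))    # AUC from thresholded labels
```

This also explains why the value above is lower than the cross-validated AUC figures from tuning, which xgb.cv computes from predicted probabilities.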

# Get a graph
graph = xgb.to_graphviz(clf, num_trees=1)
# Or get a matplotlib axis
ax = xgb.plot_tree(clf, num_trees=1)
# Render the tree plot
plt.show()

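📒 5. Which parameter does what?

Controlling overfitting

When you observe high training accuracy but low test accuracy, you have most likely run into an overfitting problem.
There are generally two ways to control overfitting in XGBoost:

The first way is to directly control model complexity.
- This includes max_depth, min_child_weight and gamma.
The second way is to add randomness to make training robust to noise.
- This includes subsample and colsample_bytree.
- You can also reduce the step size eta. Remember to increase num_round when you do so.

Controlling overfitting - code example: approach 1

Baseline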

PARAMETERS={"objective":'binary:logistic',"eval_metric":"auc"}
xgb_helper(PARAMETERS)

	train-auc-mean	train-auc-std	test-auc-mean	test-auc-std
0	0.858413	0.002187	0.821862	0.005628
1	0.870282	0.001689	0.828690	0.004939
2	0.876322	0.000856	0.835879	0.002703
3	0.880106	0.001456	0.838758	0.004261
4	0.884137	0.001624	0.838674	0.003967
5	0.887057	0.001697	0.841544	0.003510
6	0.889695	0.002341	0.842214	0.005057
7	0.891214	0.002497	0.842708	0.005743
8	0.892886	0.002178	0.843644	0.005887
9	0.893928	0.002417	0.844092	0.005543

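So the gap between the train and test AUC scores is about 0.05, i.e. 5 percentage points, which is fairly high.

Let's regularize.

With overfitting control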

PARAMETERS={"objective":'binary:logistic',"eval_metric":"auc", "max_depth":2 , "min_child_weight":3, "gamma":2}
xgb_helper(PARAMETERS)

	train-auc-mean	train-auc-std	test-auc-mean	test-auc-std
0	0.756095	0.006712	0.745616	0.012709
1	0.794688	0.003016	0.787055	0.012553
2	0.807168	0.003095	0.800497	0.013173
3	0.813550	0.003732	0.807774	0.011470
4	0.816639	0.003505	0.810455	0.011466
5	0.817854	0.003483	0.812003	0.011679
6	0.821096	0.004441	0.814993	0.011331
7	0.825188	0.002974	0.818432	0.010964
8	0.828995	0.002206	0.821516	0.010546
9	0.830942	0.001293	0.824446	0.009938

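So the gap between the train and test AUC scores is below 0.01, i.e. under 1 percentage point, which is quite good.

Controlling overfitting - code example: approach 2

PARAMETERS={"objective":'binary:logistic',"eval_metric":"auc", "subsample":0.3,"colsample_bytree":0.3,"eta":.05}
xgb_helper(PARAMETERS,25) # note: 25 is bound positionally to V_PARAM_NAME, so BR keeps its default of 10; pass BR=25 to run 25 rounds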

	train-auc-mean	train-auc-std	test-auc-mean	test-auc-std
0	0.795186	0.003369	0.777080	0.012016
1	0.802556	0.013406	0.784662	0.018255
2	0.805264	0.012572	0.784439	0.015186
3	0.811922	0.010274	0.788479	0.010984
4	0.814276	0.009995	0.788180	0.011646
5	0.817306	0.009768	0.787911	0.007396
6	0.817554	0.011856	0.788386	0.007400
7	0.819985	0.013085	0.791502	0.012313
8	0.821693	0.013683	0.794037	0.013249
9	0.827255	0.004718	0.799101	0.009761

So the gap between the train and test AUC scores is now below 0.035, i.e. 3.5 percentage points, which is better than in the baseline run.
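The three gaps quoted in this section can be recomputed from the final-round rows of the tables above. The run labels in the sketch are informal names chosen here; the numbers are copied from the cross-validation output.

```python
# Final-round (train AUC, test AUC) for each run, copied from the tables above
runs = {
    "baseline": (0.893928, 0.844092),
    "complexity control": (0.830942, 0.824446),
    "added randomness": (0.827255, 0.799101),
}

gaps = {name: round(train - test, 4) for name, (train, test) in runs.items()}

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap}")
```

Both approaches narrow the gap, but in these runs they also lower the test AUC (0.844 for the baseline versus 0.824 and 0.799), so a smaller gap on its own does not mean a better model.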

Faster training performance

There is a parameter called tree_method; set it to hist or gpu_hist for faster computation.
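Starting with XGBoost 2.0, the GPU is selected with a separate device parameter and gpu_hist is deprecated. A small sketch of how the parameters might be chosen by version; the helper `fast_training_params` is hypothetical, while tree_method and device are XGBoost parameter names.

```python
def fast_training_params(use_gpu, xgboost_major_version):
    """Return histogram-based training parameters for the given setup."""
    if not use_gpu:
        return {"tree_method": "hist"}
    if xgboost_major_version >= 2:
        # XGBoost 2.0+: histogram method plus an explicit device
        return {"tree_method": "hist", "device": "cuda"}
    # Older releases: GPU histogram method
    return {"tree_method": "gpu_hist"}


print(fast_training_params(use_gpu=False, xgboost_major_version=2))
print(fast_training_params(use_gpu=True, xgboost_major_version=2))
print(fast_training_params(use_gpu=True, xgboost_major_version=1))
```

The returned dictionary can be merged into the other training parameters, for example with `{**PARAMETERS, **fast_training_params(True, 2)}`.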

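Handling imbalanced datasets

For common cases such as ad click-through logs, the dataset is extremely imbalanced. This can affect the training of an XGBoost model, and there are two ways to improve it.

If you only care about the overall performance metric (AUC) of your predictions
- Balance the positive and negative weights via scale_pos_weight
- Use AUC for evaluation
If you care about predicting the right probability
- In this case, you cannot re-balance the dataset
- Set the parameter max_delta_step to a finite number (say 1) to help convergence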