Parameter Configuration #

Parameter Overview #

XGBoost parameters fall into three broad categories:

text

┌─────────────────────────────────────────────────────────────┐
│                XGBoost Parameter Categories                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. General Parameters                                      │
│     - booster                                               │
│     - nthread                                               │
│     - verbosity                                             │
│                                                             │
│  2. Booster Parameters                                      │
│     - Tree: max_depth, min_child_weight, gamma              │
│     - Learning: eta                                         │
│     - Sampling: subsample, colsample_*                      │
│     - Regularization: lambda, alpha                         │
│                                                             │
│  3. Learning Task Parameters                                │
│     - objective                                             │
│     - eval_metric                                           │
│     - seed                                                  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
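
As a rough illustration of how the three groups fit together, the sketch below keeps one dictionary per category and merges them into the single flat dictionary that XGBoost expects. The variable names general, booster_params and task are made up for this example; XGBoost itself does not know about the grouping.

```python
# Illustrative grouping only: XGBoost receives one flat dict of parameters.
general = {"booster": "gbtree", "nthread": 4, "verbosity": 1}
booster_params = {"max_depth": 6, "eta": 0.3, "subsample": 1.0, "lambda": 1.0}
task = {"objective": "binary:logistic", "eval_metric": "logloss", "seed": 42}

# Merge the groups; on a key collision the later dict would silently win.
params = {**general, **booster_params, **task}

# Guard against a key that was accidentally defined in two groups.
if len(params) != len(general) + len(booster_params) + len(task):
    raise ValueError("a parameter appears in more than one group")

print(sorted(params))
```

Keeping the groups separate can make it easier to swap only the task-specific part when moving from classification to regression.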

General Parameters #

booster #

Selects the booster type:

python

params = {
    'booster': 'gbtree',  # tree-based model (default)
    # 'booster': 'gblinear',  # linear model
    # 'booster': 'dart',  # DART booster (trees with dropout)
}

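Booster	Description	Typical use
gbtree	Tree-based model	Most scenarios
gblinear	Linear model	High-dimensional sparse data
dart	Tree boosting with dropout (DART)	When better generalization is needed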

nthread / n_jobs #

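Sets the number of parallel threads. nthread is the native-API name; the scikit-learn wrapper calls it n_jobs:

python

params = {
    'nthread': 4,  # use 4 threads
    # 'nthread': -1,  # use all available threads (also the default when nthread is unset)
}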

verbosity #

Controls how much XGBoost logs:

python

params = {
    'verbosity': 0,  # silent
    # 'verbosity': 1,  # warnings (default)
    # 'verbosity': 2,  # info
    # 'verbosity': 3,  # debug
}

Tree Parameters (Booster Parameters) #

max_depth #

Maximum depth of each tree:

python

params = {
    'max_depth': 6,  # default
}

text

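┌─────────────────────────────────────────────────────────────┐
│                     Choosing max_depth                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Smaller values (3-5):                                      │
│  - Simpler model, less prone to overfitting                 │
│  - Suits small datasets with few features                   │
│  - Faster training                                          │
│                                                             │
│  Larger values (7-10):                                      │
│  - More complex model that can capture richer patterns      │
│  - Suits large datasets with many features                  │
│  - Higher risk of overfitting                               │
│                                                             │
│  Tip: start at 6 and adjust based on validation results     │
│                                                             │
└─────────────────────────────────────────────────────────────┘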
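
To make the complexity trade-off in the guide more concrete, here is a small arithmetic sketch: a binary tree of depth d can have at most 2^d leaves, so each extra level doubles the tree's capacity. The helper name max_leaves_at_depth is hypothetical and is not part of XGBoost.

```python
def max_leaves_at_depth(max_depth: int) -> int:
    """Upper bound on the number of leaves of a binary tree with this depth."""
    return 2 ** max_depth


# Capacity grows exponentially, which is why deep trees overfit more easily.
for depth in (3, 6, 10):
    print(f"max_depth={depth:>2} -> at most {max_leaves_at_depth(depth)} leaves")
```

Real trees usually stay well below this bound because gamma, min_child_weight and the data itself stop many splits early.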

min_child_weight #

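Minimum sum of instance weights (hessians) required in a child node. Larger values make the model more conservative:

python

params = {
    'min_child_weight': 1,  # default
}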
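
The "weight" that min_child_weight is compared against is the sum of second-order gradients (hessians) of the rows in a child. The sketch below, using hypothetical helper names, shows why the same threshold behaves differently across objectives: with squared error every row contributes 1, so the threshold acts like a minimum row count, while with logistic loss each row contributes p(1 - p), which is at most 0.25.

```python
def hessian_squared_error(pred: float) -> float:
    # Second derivative of 0.5 * (pred - y)^2 with respect to pred.
    return 1.0


def hessian_logistic(prob: float) -> float:
    # Second derivative of the log loss with respect to the raw margin.
    return prob * (1.0 - prob)


def child_passes(hessians, min_child_weight: float) -> bool:
    """True if the child holds enough hessian mass to be kept."""
    return sum(hessians) >= min_child_weight


probs = [0.5, 0.9, 0.99, 0.01]  # made-up predicted probabilities for four rows
squared_h = [hessian_squared_error(p) for p in probs]
logistic_h = [hessian_logistic(p) for p in probs]

print(sum(squared_h), child_passes(squared_h, 1))    # one unit per row
print(sum(logistic_h), child_passes(logistic_h, 1))  # far less than the row count
```

This suggests that a min_child_weight grid tuned for regression may be too strict for a classifier whose predictions are already confident.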

python

import xgboost as xgb
import numpy as np
from sklearn.model_selection import GridSearchCV

# Grid search for the best max_depth and min_child_weight.
# X_train and y_train are assumed to be defined elsewhere.
param_grid = {
    'max_depth': [3, 5, 7, 9],
    'min_child_weight': [1, 3, 5, 7]
}

clf = xgb.XGBClassifier(objective='binary:logistic', n_estimators=100)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")

gamma (min_split_loss) #

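Minimum loss reduction required to split a node further. Larger values make the model more conservative:

python

params = {
    'gamma': 0,  # default
    # 'gamma': 0.1,  # conservative
    # 'gamma': 1.0,  # more conservative
}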
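
To see what "loss reduction" means here, the sketch below evaluates the split gain formula from the XGBoost paper for a single candidate split: half the improvement in the regularized score of the two children over the parent, minus gamma. The function names and the gradient sums are made up for illustration, L1 regularization is assumed to be 0, and the library's internal implementation may differ in its details.

```python
def leaf_score(grad_sum: float, hess_sum: float, reg_lambda: float) -> float:
    # Regularized quality of a leaf: G^2 / (H + lambda).
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of a candidate split; a non-positive value means the split is not worth keeping."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


# Made-up gradient and hessian sums for the left and right child.
print(split_gain(-4.0, 3.0, 5.0, 4.0, gamma=0.0))  # positive: split is kept
print(split_gain(-4.0, 3.0, 5.0, 4.0, gamma=5.0))  # negative: split is rejected
```

Raising gamma therefore removes the weakest splits first, which is why it is often tuned after the tree-shape parameters.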

max_leaves #

Maximum number of leaf nodes per tree:

python

params = {
    'max_leaves': 0,  # no limit (default)
    # 'max_leaves': 32,  # cap the number of leaves
}

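Learning Parameters #

eta (learning_rate) #

Learning rate: the shrinkage applied to each new tree's contribution:

python

params = {
    'eta': 0.3,  # default
    # 'eta': 0.1,  # common choice
    # 'eta': 0.01,  # needs more boosting rounds
}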

text

┌─────────────────────────────────────────────────────────────┐
│                    eta vs. n_estimators                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  eta = 0.3  →  n_estimators ≈ 100-300                       │
│  eta = 0.1  →  n_estimators ≈ 300-500                       │
│  eta = 0.05 →  n_estimators ≈ 500-1000                      │
│  eta = 0.01 →  n_estimators ≈ 1000-2000                     │
│                                                             │
│  Rough rule of thumb: eta × n_estimators ≈ constant         │
│                                                             │
│  A smaller eta often generalizes better                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
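
One way to build intuition for the inverse relationship in the table is an idealized model, which is an assumption of this sketch and not how real boosting behaves: if every tree fitted the current residual perfectly, a fraction (1 - eta) of the residual would survive each round. The hypothetical helper below counts how many rounds are needed to shrink the residual to 1% of its starting size.

```python
import math


def rounds_to_shrink(eta: float, target_fraction: float = 0.01) -> int:
    """Smallest n such that (1 - eta) ** n <= target_fraction."""
    return math.ceil(math.log(target_fraction) / math.log(1.0 - eta))


# Under this toy model, eta * rounds stays in a similar range for small eta values.
for eta in (0.3, 0.1, 0.05, 0.01):
    n = rounds_to_shrink(eta)
    print(f"eta={eta:<4} rounds={n:>4} eta*rounds={eta * n:.2f}")
```

In practice it is usually safer to set a generous number of rounds and let early stopping on a validation set choose the actual count.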

grow_policy #

Tree growth strategy:

python

params = {
    'grow_policy': 'depthwise',  # grow level by level (default)
    # 'grow_policy': 'lossguide',  # grow leaf by leaf, splitting where the loss drops most
}

max_bin #

Maximum number of bins used to bucket each continuous feature:

python

params = {
    'max_bin': 256,  # default
    # 'max_bin': 512,  # finer splits, but slower
    # 'max_bin': 64,  # faster, but may lose accuracy
}

Sampling Parameters #

subsample #

Fraction of training rows sampled for each tree:

python

params = {
    'subsample': 1.0,  # default (use every row)
    # 'subsample': 0.8,  # use 80% of the rows
}

colsample_bytree / colsample_bylevel / colsample_bynode #

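Fraction of features (columns) sampled:

python

params = {
    'colsample_bytree': 1.0,   # fraction of features sampled per tree
    'colsample_bylevel': 1.0,  # fraction of features sampled per depth level
    'colsample_bynode': 1.0,   # fraction of features sampled per node (split)
}

# A common combination
params = {
    'colsample_bytree': 0.8,
    'colsample_bylevel': 0.8,
}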
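
The three colsample_* parameters act cumulatively: each one samples from what the previous stage left over. The sketch below estimates how many features remain available at a single split. The helper name is hypothetical, and truncating to an integer at each stage is an assumption of this sketch, so the library's exact rounding may differ.

```python
def features_per_split(n_features, bytree=1.0, bylevel=1.0, bynode=1.0):
    """Approximate number of candidate features left at one split."""
    remaining = n_features
    for fraction in (bytree, bylevel, bynode):
        # Each stage subsamples the features that survived the previous one.
        remaining = max(1, int(remaining * fraction))
    return remaining


print(features_per_split(64, bytree=0.5, bylevel=0.5, bynode=0.5))  # 64 -> 32 -> 16 -> 8
print(features_per_split(64))                                       # no sampling: all 64
```

Because the fractions multiply, setting all three to moderate values can leave far fewer features per split than any single value suggests.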

python

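# Example: tuning the sampling parameters
# (X_train and y_train are assumed to be defined elsewhere)
import xgboost as xgb
from sklearn.model_selection import GridSearchCV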

param_grid = {
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'colsample_bylevel': [0.6, 0.8, 1.0]
}

clf = xgb.XGBClassifier(
    objective='binary:logistic',
    max_depth=6,
    n_estimators=100,
    learning_rate=0.1
)

grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)

Regularization Parameters #

lambda (reg_lambda) #

L2 regularization weight:

python

params = {
    'lambda': 1.0,  # default
    # 'lambda': 10.0,  # stronger regularization
}

alpha (reg_alpha) #

L1 regularization weight:

python

params = {
    'alpha': 0,  # default
    # 'alpha': 1.0,  # enable L1 regularization
}
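
A small sketch can show how the two penalties act on a single leaf. With gradient sum G and hessian sum H, the optimal leaf value is -G / (H + lambda); alpha additionally soft-thresholds G toward zero before the division. The function name and the numbers are made up, and the sketch ignores eta and max_delta_step, which also scale or clip the final leaf value.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf value under L1 (alpha) and L2 (lambda) regularization."""
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha   # L1 pulls the gradient sum toward zero
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        return 0.0                      # L1 zeroes out weak leaves entirely
    return -shrunk / (hess_sum + reg_lambda)  # L2 shrinks by enlarging the denominator


G, H = -6.0, 5.0  # made-up gradient and hessian sums for one leaf
print(leaf_weight(G, H))                   # default lambda=1, alpha=0
print(leaf_weight(G, H, reg_lambda=10.0))  # stronger L2: smaller leaf value
print(leaf_weight(G, H, reg_alpha=1.0))    # L1 shaves a constant off |G|
print(leaf_weight(G, H, reg_alpha=10.0))   # L1 larger than |G|: leaf becomes 0
```

lambda shrinks every leaf proportionally, whereas alpha can switch small leaves off completely, which is one reason the two are often tuned together.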

python

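# Tuning the regularization parameters (X_train and y_train are assumed to exist)
import xgboost as xgb
from sklearn.model_selection import GridSearchCV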
param_grid = {
    'reg_alpha': [0, 0.001, 0.01, 0.1, 1, 10],
    'reg_lambda': [0.1, 1, 10, 100]
}

clf = xgb.XGBClassifier(
    objective='binary:logistic',
    max_depth=6,
    n_estimators=100
)

grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)

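Learning Task Parameters #

objective #

Objective (loss) function:

python

# Binary classification
params = {'objective': 'binary:logistic'}

# Multiclass classification
params = {
    'objective': 'multi:softprob',  # outputs per-class probabilities
    'num_class': 3
}

# Regression (pick one)
params = {'objective': 'reg:squarederror'}  # squared error (MSE)
params = {'objective': 'reg:absoluteerror'}  # absolute error (MAE)
params = {'objective': 'reg:logistic'}  # logistic regression, targets in [0, 1]

# Ranking (pick one)
params = {'objective': 'rank:pairwise'}
params = {'objective': 'rank:ndcg'}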

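objective	Description	Typical use
binary:logistic	Logistic regression, outputs probabilities	Binary classification
binary:hinge	Hinge loss, outputs 0/1 labels	Binary classification
multi:softmax	Softmax, outputs class labels	Multiclass classification
multi:softprob	Softmax, outputs per-class probabilities	Multiclass classification
reg:squarederror	Squared error	Regression
reg:absoluteerror	Absolute error	Regression
reg:logistic	Logistic regression	Regression with targets in [0, 1]
rank:pairwise	Pairwise ranking	Ranking
rank:ndcg	NDCG-based ranking	Ranking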
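
The difference between multi:softprob and multi:softmax in the table is only in what gets returned. The sketch below mimics it by hand on made-up raw scores: softprob corresponds to the full row of class probabilities, and softmax to the index of the largest one. This is a conceptual illustration in plain Python, not the library's prediction code.

```python
import math


def softmax(margins):
    """Turn raw per-class scores into probabilities that sum to 1."""
    top = max(margins)
    exps = [math.exp(m - top) for m in margins]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw scores for 2 rows and 3 classes.
raw_margins = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]

probabilities = [softmax(row) for row in raw_margins]    # like multi:softprob
labels = [row.index(max(row)) for row in probabilities]  # like multi:softmax

print(probabilities)
print(labels)
```

Choosing softprob keeps more information, since labels can always be derived from probabilities but not the other way round.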

eval_metric #

Evaluation metric(s):

python

# A single metric
params = {'eval_metric': 'logloss'}

# Several metrics (with early stopping, the last one listed is monitored)
params = {'eval_metric': ['logloss', 'auc', 'error']}

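eval_metric	Description	Typical use
rmse	Root mean squared error	Regression
mae	Mean absolute error	Regression
logloss	Log loss (negative log-likelihood)	Binary classification
error	Error rate (threshold 0.5)	Binary classification
auc	Area under the ROC curve	Binary classification
aucpr	Area under the precision-recall curve	Binary classification
mlogloss	Multiclass log loss	Multiclass classification
merror	Multiclass error rate	Multiclass classification
ndcg	Normalized discounted cumulative gain	Ranking
map	Mean average precision	Ranking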
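
To clarify what two of the binary metrics in the table measure, here is a hand-rolled sketch of logloss and error on four made-up predictions. The function names and the clipping constant eps are assumptions of this example; XGBoost computes these metrics internally.

```python
import math


def logloss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def error_rate(labels, probs, threshold=0.5):
    """Share of rows whose thresholded prediction disagrees with the label."""
    wrong = sum(1 for y, p in zip(labels, probs) if (p > threshold) != bool(y))
    return wrong / len(labels)


labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]  # made-up predicted probabilities

print(f"logloss = {logloss(labels, probs):.4f}")
print(f"error   = {error_rate(labels, probs):.2f}")
```

logloss rewards well-calibrated probabilities, while error only looks at which side of the threshold a prediction falls on, so the two can rank models differently.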

seed #

Random seed:

python

params = {
    'seed': 42,  # fix the seed for reproducible runs
}

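Class Imbalance Parameters #

scale_pos_weight #

Weight applied to the positive class:

python

# Ratio of negative to positive samples (y_train is assumed to be a NumPy array of 0/1 labels)
scale_pos_weight = len(y_train[y_train==0]) / len(y_train[y_train==1])

params = {
    'objective': 'binary:logistic',
    'scale_pos_weight': scale_pos_weight
}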
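
The same ratio can be checked on a toy label list without NumPy. The sketch below uses made-up labels with 5% positives and a hypothetical helper that also guards against a training set with no positive samples, where the ratio would be undefined.

```python
from collections import Counter


def negative_to_positive_ratio(labels):
    """Count of negatives divided by count of positives, for 0/1 labels."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive samples: scale_pos_weight is undefined")
    return counts[0] / counts[1]


y_toy = [0] * 950 + [1] * 50  # toy labels: 950 negatives, 50 positives
scale_pos_weight = negative_to_positive_ratio(y_toy)

params = {"objective": "binary:logistic", "scale_pos_weight": scale_pos_weight}
print(params)
```

A value of 19 here means each positive row counts roughly as much as 19 negative rows during training.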

max_delta_step #

Maximum delta step allowed for each leaf's output:

python

params = {
    'max_delta_step': 0,  # default (no constraint)
    # 'max_delta_step': 1,  # can help with heavily imbalanced classes
}

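GPU Parameters #

tree_method #

Tree construction algorithm:

python

# CPU (pick one)
params = {'tree_method': 'auto'}  # let XGBoost choose
params = {'tree_method': 'exact'}  # exact greedy algorithm
params = {'tree_method': 'approx'}  # approximate algorithm
params = {'tree_method': 'hist'}  # histogram-based algorithm

# GPU (the device parameter requires XGBoost 2.0 or later)
params = {'tree_method': 'hist', 'device': 'cuda'}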

device #

Device selection:

python

params = {
    'tree_method': 'hist',
    'device': 'cuda',  # use the GPU
    # 'device': 'cuda:0',  # use GPU 0
    # 'device': 'cuda:1',  # use GPU 1
}

Configuration Examples #

Complete Binary Classification Configuration #

python

import xgboost as xgb

params = {
    # General parameters
    'booster': 'gbtree',
    'nthread': 4,
    'verbosity': 1,

    # Tree parameters
    'max_depth': 6,
    'min_child_weight': 1,
    'gamma': 0,

    # Learning parameters
    'eta': 0.1,
    'grow_policy': 'depthwise',

    # Sampling parameters
    'subsample': 0.8,
    'colsample_bytree': 0.8,

    # Regularization parameters
    'lambda': 1,
    'alpha': 0,

    # Task parameters
    'objective': 'binary:logistic',
    'eval_metric': ['logloss', 'auc'],
    'seed': 42
}

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

model = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dtrain, 'train'), (dtest, 'eval')],
    early_stopping_rounds=50,
    verbose_eval=10
)

Complete Regression Configuration #

python

params = {
    'booster': 'gbtree',
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    
    'max_depth': 6,
    'min_child_weight': 1,
    'eta': 0.05,
    
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    
    'lambda': 1,
    'alpha': 0,
    
    'seed': 42
}

model = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,
    evals=[(dtrain, 'train'), (dtest, 'eval')],
    early_stopping_rounds=50
)

Complete Multiclass Configuration #

python

params = {
    'objective': 'multi:softprob',
    'num_class': 3,
    'eval_metric': 'mlogloss',
    
    'max_depth': 6,
    'eta': 0.1,
    
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    
    'seed': 42
}

model = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dtrain, 'train'), (dtest, 'eval')],
    early_stopping_rounds=50
)

Parameter Tuning Strategy #

text

┌─────────────────────────────────────────────────────────────┐
│                   Parameter Tuning Order                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Step 1: Fix the learning rate with a large n_estimators    │
│          eta=0.1, n_estimators=1000                         │
│                                                             │
│  Step 2: Tune the tree parameters                           │
│          max_depth, min_child_weight                        │
│                                                             │
│  Step 3: Tune gamma                                         │
│          gamma: [0, 0.1, 0.2, 0.5, 1]                       │
│                                                             │
│  Step 4: Tune the sampling parameters                       │
│          subsample, colsample_bytree                        │
│                                                             │
│  Step 5: Tune the regularization parameters                 │
│          lambda, alpha                                      │
│                                                             │
│  Step 6: Lower the learning rate and add more rounds        │
│          eta=0.01, n_estimators=5000                        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
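
The staged order above can be sketched as a small loop that tunes one group of parameters at a time and carries the best values forward. Everything here is hypothetical scaffolding: staged_search is not an XGBoost or scikit-learn function, and toy_score is a stand-in for a real cross-validated metric, which you would replace with your own evaluation.

```python
from itertools import product


def staged_search(base_params, stages, score_fn):
    """Tune one parameter group per stage, keeping the best values found so far."""
    best = dict(base_params)
    best_score = score_fn(best)
    for grid in stages:
        names = list(grid)
        for values in product(*(grid[name] for name in names)):
            candidate = {**best, **dict(zip(names, values))}
            score = score_fn(candidate)
            if score > best_score:  # higher is better
                best, best_score = candidate, score
    return best, best_score


def toy_score(p):
    # Hypothetical stand-in for a cross-validated metric; peaks at known values.
    return -((p["max_depth"] - 5) ** 2 + (p["min_child_weight"] - 3) ** 2
             + (p["gamma"] - 0.2) ** 2 + (p["subsample"] - 0.8) ** 2)


base = {"eta": 0.1, "max_depth": 6, "min_child_weight": 1, "gamma": 0, "subsample": 1.0}
stages = [
    {"max_depth": [3, 5, 7, 9], "min_child_weight": [1, 3, 5, 7]},  # Step 2
    {"gamma": [0, 0.1, 0.2, 0.5, 1]},                               # Step 3
    {"subsample": [0.6, 0.8, 1.0]},                                 # Step 4
]

best, best_score = staged_search(base, stages, toy_score)
print(best)
```

Tuning in stages evaluates far fewer combinations than one joint grid, at the cost of possibly missing interactions between groups.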

Next Steps #

Now that you know how to configure parameters, move on to Training and Evaluation to learn how to train and evaluate models.