Training Configuration #

Training Parameter Overview #

text

┌─────────────────────────────────────────────────────────────┐
│                   Core Training Parameters                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Optimization:                                              │
│  ├── Learning Rate                                          │
│  ├── Batch Size                                             │
│  ├── Epochs                                                 │
│  └── Optimizer                                              │
│                                                             │
│  Regularization:                                            │
│  ├── Weight Decay                                           │
│  ├── Dropout                                                │
│  └── Gradient Clipping                                      │
│                                                             │
│  Scheduling:                                                │
│  ├── LR Scheduler                                           │
│  ├── Warmup Steps                                           │
│  └── LR Decay                                               │
│                                                             │
│  Efficiency:                                                │
│  ├── Gradient Accumulation                                  │
│  ├── Mixed Precision                                        │
│  └── Gradient Checkpointing                                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

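Learning Rate #

Choosing a Learning Rate #

text

Effect of the learning rate:
├── Too high: unstable training, oscillating loss, divergence
├── Too low: slow convergence, prone to stalling in poor local optima
└── Well chosen: fast, stable convergence

Recommended ranges (starting points; tune per model and dataset):
├── Full fine-tuning: 1e-5 to 5e-5
├── LoRA: 1e-4 to 5e-4
├── QLoRA: 2e-4 to 1e-3
└── Adapter: 1e-4 to 5e-4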

Setting the Learning Rate #

python

from transformers import TrainingArguments

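training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-4,          # peak learning rate, reached after warmup
    weight_decay=0.01,
    warmup_ratio=0.1,            # warm up over the first 10% of training steps
    lr_scheduler_type="cosine",  # decay schedule applied after warmup
)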

Learning Rate Schedulers #

text

┌─────────────────────────────────────────────────────────────┐
│                  LR Scheduler Strategies                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Linear (linear decay)                                      │
│  ├── Decays linearly from the initial value to 0            │
│  ├── Simple and effective                                   │
│  └── Works for most scenarios                               │
│                                                             │
│  Cosine (cosine decay)                                      │
│  ├── Decays along a cosine curve                            │
│  ├── Flattens out late in training                          │
│  └── Recommended for large models                           │
│                                                             │
│  Constant                                                   │
│  ├── Learning rate stays fixed                              │
│  ├── Simple and direct                                      │
│  └── Suits short training runs                              │
│                                                             │
│  Constant with Warmup                                       │
│  ├── Holds constant after warmup                            │
│  ├── Suits PEFT                                             │
│  └── Recommended for LoRA                                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

python

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=3,
)
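
To make the schedules concrete, the following is a minimal pure-Python sketch of how the learning rate could evolve per step under linear warmup followed by linear, cosine, or constant behaviour. The function name `lr_at_step` and its arguments are illustrative assumptions, not part of any library, and the library schedulers may differ in small details such as rounding.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.1, kind="cosine"):
    """Return the learning rate at a given optimizer step (0-indexed)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    # Linear warmup from 0 up to base_lr
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase already completed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if kind == "linear":
        return base_lr * max(0.0, 1.0 - progress)
    if kind == "cosine":
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    if kind == "constant_with_warmup":
        return base_lr
    raise ValueError(f"unknown schedule: {kind}")


# Learning rate at a few points of a 1000-step run with 10% warmup
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, 1000, 2e-4):.2e}")
```

With `warmup_ratio=0.1` and 1000 steps, the rate climbs for the first 100 steps, peaks at `base_lr`, and then follows the chosen decay.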

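Batch Size #

Choosing a Batch Size #

text

Effect of batch size:
├── Too large: out-of-memory errors, may hurt generalization
├── Too small: noisy, unstable training and low throughput
└── Well chosen: stable training that makes full use of GPU memory

Recommended settings (starting points; they also depend on sequence length and fine-tuning method):
├── 7B model: batch_size=4-16
├── 13B model: batch_size=2-8
├── 70B model: batch_size=1-4
└── Adjust to the available GPU memory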

Gradient Accumulation #

python

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
)

# Effective batch size = per_device_train_batch_size × gradient_accumulation_steps × number of devices
# Example on a single GPU: 4 × 4 × 1 = 16
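
The arithmetic can be written down as a small helper. This is only a sketch: the function names are made up for illustration, and the update count is an approximation, since the exact number of steps a trainer reports can differ slightly depending on how it handles the last partial batch.

```python
import math


def effective_batch_size(per_device_batch_size, grad_accum_steps, num_devices=1):
    """Samples consumed per optimizer update."""
    return per_device_batch_size * grad_accum_steps * num_devices


def updates_per_epoch(num_samples, per_device_batch_size, grad_accum_steps,
                      num_devices=1):
    """Approximate optimizer updates in one pass over the data."""
    per_update = effective_batch_size(
        per_device_batch_size, grad_accum_steps, num_devices)
    return math.ceil(num_samples / per_update)


print(effective_batch_size(4, 4))                 # 16
print(effective_batch_size(4, 4, num_devices=8))  # 128
print(updates_per_epoch(10_000, 4, 4))            # 625
```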

text

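Benefits of gradient accumulation:
├── Simulates large-batch training
├── Works around GPU memory limits
├── Improves training stability
└── Does not change model quality, at the cost of more forward/backward passes per update

When to use it:
├── Not enough GPU memory
├── A large batch is needed
├── Multi-GPU training
└── Distributed training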
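
Why accumulation reproduces a large batch can be checked on a toy problem. The sketch below uses a one-parameter model `y = w * x` with mean squared error and assumes equal-sized micro-batches; all names are illustrative. Scaling each micro-batch gradient by `1 / accum_steps` and summing gives the same gradient as the full batch.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Sum micro-batch gradients, each scaled by 1 / number of micro-batches."""
    micro_batches = [batch[i:i + micro_batch_size]
                     for i in range(0, len(batch), micro_batch_size)]
    accum_steps = len(micro_batches)
    grad = 0.0
    for micro in micro_batches:
        grad += mse_grad(w, micro) / accum_steps  # same as scaling the loss
    return grad


data = [(float(x), 3.0 * x + 1.0) for x in range(16)]
full = mse_grad(0.5, data)
accum = accumulated_grad(0.5, data, micro_batch_size=4)
print(full, accum)  # the two gradients agree
```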

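Training Epochs #

Choosing the Number of Epochs #

text

Effect of the number of epochs:
├── Too few: underfitting, the model has not learned enough
├── Too many: overfitting, generalization degrades
└── Well chosen: best performance

Recommended settings:
├── Large dataset (>100K): 1-2 epochs
├── Medium dataset (10K-100K): 2-3 epochs
├── Small dataset (<10K): 3-5 epochs
└── Use early stopping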

Early Stopping #

python

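from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

# Assumes `model`, `train_dataset`, and `eval_dataset` are already defined.
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=10,                # upper bound; early stopping usually ends sooner
    eval_strategy="epoch",
    save_strategy="epoch",              # must match eval_strategy when load_best_model_at_end=True
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,            # lower eval_loss is better
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    # Stop after 3 consecutive evaluations without improvement
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)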
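
The patience rule itself is simple enough to sketch without any framework. The helper below is a simplified illustration (its name is made up, and it ignores the optional improvement threshold that a real callback may apply): it stops once the evaluation loss has failed to improve for `patience` evaluations in a row and reports which epoch held the best loss.

```python
def early_stopping_epoch(eval_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-based."""
    best_loss = float("inf")
    best_epoch = -1
    bad_evals = 0
    for epoch, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every epoch
    return len(eval_losses) - 1, best_epoch


losses = [1.20, 0.95, 0.90, 0.92, 0.91, 0.93, 0.89]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

In this trace training halts at epoch 5 and the checkpoint from epoch 2 would be restored, so the later improvement at epoch 6 is never seen. That is the trade-off a small patience value makes.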

Optimizers #

Choosing an Optimizer #

text

┌─────────────────────────────────────────────────────────────┐
│                    Optimizer Comparison                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  AdamW                                                      │
│  ├── Standard optimizer                                     │
│  ├── Decoupled weight decay                                 │
│  ├── High memory use (2 states per parameter)               │
│  └── Recommended for full fine-tuning                       │
│                                                             │
│  AdamW 8-bit                                                │
│  ├── Quantized AdamW                                        │
│  ├── Low memory use                                         │
│  ├── Requires bitsandbytes                                  │
│  └── Recommended for LoRA/QLoRA                             │
│                                                             │
│  Paged AdamW 8-bit                                          │
│  ├── Paged optimizer states                                 │
│  ├── Absorbs memory spikes                                  │
│  ├── More stable                                            │
│  └── Recommended for large models                           │
│                                                             │
│  Lion                                                       │
│  ├── Newer optimizer                                        │
│  ├── Memory-efficient                                       │
│  ├── Often converges fast                                   │
│  └── Suits large models                                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
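
The memory differences in the comparison can be estimated with back-of-the-envelope arithmetic. The sketch below assumes simplified per-parameter state sizes (two FP32 states for AdamW, two 8-bit states for 8-bit AdamW, one FP32 state for Lion) and ignores quantization metadata and framework overhead; the 20M-parameter adapter size is a hypothetical figure chosen only for illustration.

```python
# Bytes of optimizer state per trainable parameter (assumed, simplified):
#   adamw      -> two FP32 states (momentum + variance) = 8 bytes
#   adamw_8bit -> two 8-bit states                      = 2 bytes
#   lion       -> one FP32 state (momentum)             = 4 bytes
STATE_BYTES_PER_PARAM = {"adamw": 8, "adamw_8bit": 2, "lion": 4}


def optimizer_state_gb(num_trainable_params, optimizer):
    """Approximate optimizer-state memory in GB (1 GB = 1e9 bytes)."""
    return num_trainable_params * STATE_BYTES_PER_PARAM[optimizer] / 1e9


full_ft = 7_000_000_000   # full fine-tuning of a 7B model
lora = 20_000_000         # hypothetical LoRA adapter size
for name in STATE_BYTES_PER_PARAM:
    print(f"{name:>10}: full={optimizer_state_gb(full_ft, name):6.1f} GB  "
          f"lora={optimizer_state_gb(lora, name):5.2f} GB")
```

Under these assumptions the optimizer choice matters a great deal for full fine-tuning and very little for a small adapter, because the state scales with the number of trainable parameters.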

Optimizer Configuration #

python

# Standard AdamW (PyTorch implementation) with the usual Adam hyperparameters
training_args = TrainingArguments(
    output_dir="./results",
    optim="adamw_torch",
    learning_rate=2e-4,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)

# 8-bit AdamW; requires the bitsandbytes package
training_args_8bit = TrainingArguments(
    output_dir="./results",
    optim="adamw_8bit",
    learning_rate=2e-4,
    weight_decay=0.01,
)

Regularization #

Weight Decay #

python

training_args = TrainingArguments(
    output_dir="./results",
    weight_decay=0.01,
)

text

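What weight decay does:
├── Reduces overfitting
├── Improves generalization
├── Shrinks weights toward zero (smaller magnitudes, not true sparsity)
└── Recommended range: 0.01-0.1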
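
How the decay term acts on a weight can be seen in a single AdamW update. The following is a scalar sketch under the usual AdamW formulation, with decay applied directly to the weight rather than added to the gradient; the function name and signature are illustrative and not a library API.

```python
import math


def adamw_step(w, grad, m, v, step, lr=2e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar weight; returns (w, m, v)."""
    # Decoupled weight decay: shrink the weight directly, independent of grad
    w = w * (1.0 - lr * weight_decay)
    # Standard Adam moment estimates with bias correction
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# With a zero gradient only the decay term acts on the weight
w, m, v = adamw_step(w=1.0, grad=0.0, m=0.0, v=0.0, step=1)
print(w)
```

Each step multiplies the weight by `1 - lr * weight_decay`, so the effective shrinkage depends on the learning rate as well as on `weight_decay`.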

Dropout #

python

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)

text

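What dropout does:
├── Reduces overfitting
├── Improves generalization
├── Randomly zeroes activations during training (for LoRA, on the adapter input)
└── Recommended range: 0.05-0.2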
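
The mechanism can be illustrated with the common "inverted dropout" formulation, in which surviving values are rescaled so that their expected value is unchanged. This is a minimal pure-Python sketch with an illustrative function name, not the implementation used by any particular framework.

```python
import random


def dropout(values, p=0.1, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


rng = random.Random(0)
activations = [1.0] * 10
print(dropout(activations, p=0.2, rng=rng))         # zeros mixed with 1.25
print(dropout(activations, p=0.2, training=False))  # unchanged at inference
```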

Gradient Clipping #

python

training_args = TrainingArguments(
    output_dir="./results",
    max_grad_norm=1.0,
)

text

What gradient clipping does:
├── Prevents exploding gradients
├── Stabilizes training
├── Caps the global gradient norm
└── Recommended value: 0.5-1.0
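
Clipping by global norm, which is what a `max_grad_norm` setting typically refers to, can be sketched in a few lines. The helper below treats all gradients as one flat list and is an illustration only; the small `eps` term is an assumption added to avoid division by zero.

```python
import math


def clip_grad_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale gradients so their global L2 norm is at most max_norm.

    Returns (clipped_grads, original_norm).
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm = clip_grad_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # roughly [0.6, 0.8]
```

The direction of the gradient is preserved and only its length changes. Gradients whose norm is already below the limit pass through untouched.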

Efficiency Optimizations #

Mixed-Precision Training #

python

# FP16 mixed precision
training_args = TrainingArguments(
    output_dir="./results",
    fp16=True,
)

# Or BF16 mixed precision (set only one of fp16 / bf16)
training_args = TrainingArguments(
    output_dir="./results",
    bf16=True,
)

text

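Benefits of mixed precision:
├── Lower memory use (up to about 50%, mostly from activations)
├── Faster training (up to about 2x on GPUs with Tensor Cores)
├── Accuracy is usually preserved
└── Hardware support:
    ├── FP16: most CUDA GPUs (Tensor Core acceleration from Volta onward)
    └── BF16: Ampere or newer GPUs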
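
The practical difference between FP16 and BF16 is numeric range, which the standard library can demonstrate. In this sketch `to_fp16` round-trips a value through IEEE half precision, while `to_bf16` is a simplified stand-in that truncates a float32 to its top 16 bits (real hardware rounds rather than truncates). Both helper names are illustrative.

```python
import struct


def to_fp16(x):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Simplified bfloat16: keep the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


tiny_grad = 1e-8
print(to_fp16(tiny_grad))          # underflows to 0.0 in FP16
print(to_fp16(tiny_grad * 1024))   # survives once scaled up
print(to_bf16(tiny_grad))          # BF16 keeps FP32's exponent range

try:
    to_fp16(70000.0)               # above FP16's maximum of 65504
except OverflowError as err:
    print("FP16 overflow:", err)
print(to_bf16(70000.0))            # finite, but with reduced precision
```

This is why FP16 training relies on loss scaling to keep small gradients representable, while BF16 usually does not need it and gives up mantissa precision instead.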

Gradient Checkpointing #

python

training_args = TrainingArguments(
    output_dir="./results",
    gradient_checkpointing=True,
)

text

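Benefits of gradient checkpointing:
├── Large memory savings (about 30-50%)
├── Trades compute for memory
├── Suits large models
└── Slower training, since activations are recomputed in the backward pass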

Flash Attention #

python

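import torch
from transformers import AutoModelForCausalLM

# Assumes `model_name` holds a Hub ID or local path, and that the
# flash-attn package is installed.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,                # FlashAttention-2 needs fp16 or bf16
    device_map="auto",
    attn_implementation="flash_attention_2",  # replaces the deprecated use_flash_attention_2=True
)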

text

Benefits of Flash Attention:
├── Lower memory use
├── Faster attention computation
├── Supports longer sequences
└── FlashAttention-2 requires an Ampere or newer GPU

Complete Configuration Examples #

LoRA Fine-Tuning Configuration #

python

from transformers import TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

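# Assumes `model` is an already-loaded base model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # report how many parameters LoRA trains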

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",
    fp16=True,
    gradient_checkpointing=True,
    optim="adamw_8bit",
    max_grad_norm=1.0,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
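
To get a feel for what `r` and `lora_alpha` imply, the sketch below counts the parameters LoRA adds and computes the scaling factor of the low-rank update. It assumes a hypothetical model with hidden size 4096 and 32 layers in which all four attention projections are square (models with grouped-query attention have smaller `k_proj` and `v_proj`), and it uses the standard `alpha / r` scaling.

```python
def lora_added_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    return r * (d_in + d_out)  # A is r x d_in, B is d_out x r


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A (standard LoRA: alpha / r)."""
    return lora_alpha / r


hidden = 4096      # hypothetical hidden size
num_layers = 32    # hypothetical number of transformer layers
num_targets = 4    # q_proj, k_proj, v_proj, o_proj

per_layer = num_targets * lora_added_params(hidden, hidden, r=16)
print(per_layer * num_layers)  # 16777216 trainable parameters
print(lora_scaling(32, 16))    # 2.0
```

Doubling `r` doubles the adapter size, while the ratio `lora_alpha / r` controls how strongly the adapter output is weighted.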

QLoRA Fine-Tuning Configuration #

python

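import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization; compute runs in FP16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Assumes `model_name` holds a Hub ID or local path
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare the quantized model for training before attaching adapters
model = prepare_model_for_kbit_training(model)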

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="constant_with_warmup",  # plain "constant" would ignore warmup_ratio
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
)

Training Monitoring #

TensorBoard #

python

training_args = TrainingArguments(
    output_dir="./results",
    logging_dir="./logs",
    logging_steps=10,
    report_to="tensorboard",
)

# Launch TensorBoard from a shell:
#   tensorboard --logdir ./logs

Weights & Biases #

python

training_args = TrainingArguments(
    output_dir="./results",
    report_to="wandb",
    run_name="my-finetune-run",
)

# Optionally initialize W&B yourself before training starts
import wandb
wandb.init(project="my-project", name="my-run")

MLflow #

python

training_args = TrainingArguments(
    output_dir="./results",
    report_to="mlflow",
)

# Optionally open the MLflow run yourself before calling trainer.train()
import mlflow
mlflow.start_run()

Hyperparameter Tuning #

Grid Search #

python

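from itertools import product

from transformers import Trainer, TrainingArguments

# Assumes `train_dataset` and `eval_dataset` are defined, and that `model_init`
# is a user-defined function (hypothetical here) returning a freshly loaded
# model, so that no weights carry over between runs.
learning_rates = [1e-4, 2e-4, 5e-4]
batch_sizes = [4, 8, 16]
epochs = [2, 3, 5]

best_loss = float("inf")
best_params = None

# 3 x 3 x 3 = 27 full training runs
for lr, bs, ep in product(learning_rates, batch_sizes, epochs):
    training_args = TrainingArguments(
        output_dir=f"./results/lr{lr}_bs{bs}_ep{ep}",
        learning_rate=lr,
        per_device_train_batch_size=bs,
        num_train_epochs=ep,
    )

    trainer = Trainer(
        model_init=model_init,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    trainer.train()

    # Compare runs on held-out loss rather than training loss
    eval_loss = trainer.evaluate()["eval_loss"]
    if eval_loss < best_loss:
        best_loss = eval_loss
        best_params = {"lr": lr, "batch_size": bs, "epochs": ep}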
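
The search loop can be separated from the training code, which makes it easy to dry-run on a cheap stand-in before spending GPU time. The sketch below is framework-free; `grid_search` and `toy_eval_loss` are illustrative names, and the toy loss is invented purely so the example runs.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return (best_params, best_score).

    `evaluate` maps a dict of hyperparameters to a score where lower is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_eval_loss(params):
    """Stand-in for a real train-and-evaluate run (purely illustrative)."""
    return abs(params["lr"] - 2e-4) * 1e3 + abs(params["epochs"] - 3) * 0.1


grid = {"lr": [1e-4, 2e-4, 5e-4], "batch_size": [4, 8, 16], "epochs": [2, 3, 5]}
best, score = grid_search(grid, toy_eval_loss)
print(best, score)
```

The cost grows multiplicatively with each added hyperparameter: this 3 x 3 x 3 grid already means 27 evaluations.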

Optuna #

python

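import optuna
from transformers import Trainer, TrainingArguments

# Assumes `train_dataset` and `eval_dataset` are defined, and that `model_init`
# is a user-defined function (hypothetical here) returning a freshly loaded model.
def objective(trial):
    # Sample the learning rate on a log scale, the others from discrete sets
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [4, 8, 16])
    epochs = trial.suggest_int("epochs", 1, 5)

    training_args = TrainingArguments(
        output_dir=f"./results/trial_{trial.number}",  # one directory per trial
        learning_rate=lr,
        per_device_train_batch_size=batch_size,
        num_train_epochs=epochs,
    )

    trainer = Trainer(
        model_init=model_init,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    trainer.train()

    # Optimize held-out loss rather than training loss
    return trainer.evaluate()["eval_loss"]

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)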
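
Sampling the learning rate with `log=True` means the draw is uniform in log space, so each order of magnitude is equally likely. The standard-library sketch below illustrates that idea; the helper name is made up and this is not how Optuna is implemented internally.

```python
import math
import random


def sample_log_uniform(low, high, rng=random):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
samples = [sample_log_uniform(1e-5, 1e-3, rng) for _ in range(10_000)]
below_1e4 = sum(s < 1e-4 for s in samples) / len(samples)
print(f"share of samples below 1e-4: {below_1e4:.2f}")  # close to 0.5
```

About half the samples land below 1e-4, the geometric midpoint of the range. A plain uniform draw over the same interval would put only about 9% of samples there.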

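Common Issues #

Unstable Training #

text

Problem: loss oscillates or fails to converge

Solutions:
1. Lower the learning rate
   learning_rate=1e-5

2. Lengthen the warmup
   warmup_ratio=0.2

3. Use gradient clipping
   max_grad_norm=0.5

4. Increase the batch size
   per_device_train_batch_size=8

5. Use a more stable optimizer
   optim="adamw_torch"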

Out of Memory #

text

Problem: CUDA out of memory

Solutions:
1. Reduce the batch size
   per_device_train_batch_size=1

2. Increase gradient accumulation
   gradient_accumulation_steps=16

3. Enable gradient checkpointing
   gradient_checkpointing=True

4. Use mixed precision
   fp16=True or bf16=True

5. Use QLoRA
   load_in_4bit=True

6. Use an 8-bit optimizer
   optim="adamw_8bit"
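
A quick estimate of weight memory helps decide which of these measures is needed. The sketch below counts only the model weights, using assumed bytes per parameter for each format; activations, gradients, optimizer states, and quantization metadata come on top of these figures.

```python
# Assumed storage cost per parameter for each weight format
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "nf4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for size_b in (7, 13, 70):
    row = ", ".join(f"{dtype}={weight_memory_gb(size_b * 1e9, dtype):.1f}"
                    for dtype in ("fp16", "int8", "nf4"))
    print(f"{size_b}B: {row} GB")
```

For example, a 7B model needs about 14 GB for FP16 weights but about 3.5 GB in 4-bit form, which is the main reason QLoRA fits on smaller GPUs.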

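Overfitting #

text

Problem: good results on the training set, poor results on the test set

Solutions:
1. Add more data
2. Augment the data
3. Train for fewer epochs
4. Increase weight decay
   weight_decay=0.1
5. Increase dropout
   lora_dropout=0.2
6. Use early stopping
   early_stopping_patience=3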

Next Steps #

Now that you know the core of training configuration, move on to LoRA for a closer look at parameter-efficient fine-tuning.