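Performance Optimization #

Optimization Goals #

text

┌─────────────────────────────────────────────────────────────┐
│             Performance Optimization Dimensions             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Memory optimization:                                       │
│  ├── Lower GPU memory usage                                 │
│  ├── Train larger models                                    │
│  └── Use larger batch sizes                                 │
│                                                             │
│  Speed optimization:                                        │
│  ├── Faster training                                        │
│  ├── Shorter experiment cycles                              │
│  └── Faster iteration                                       │
│                                                             │
│  Inference optimization:                                    │
│  ├── Faster inference                                       │
│  ├── Lower latency                                          │
│  └── Higher throughput                                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘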

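Memory Optimization #

Memory Analysis #

text

What occupies GPU memory during training:
├── Model parameters
│   ├── FP32: parameter count × 4 bytes
│   └── FP16: parameter count × 2 bytes
│
├── Gradients
│   └── Same size as the parameters
│
├── Optimizer states
│   ├── Adam (FP32 states): parameter count × 8 bytes
│   └── 8-bit Adam: parameter count × 2 bytes
│
├── Activations
│   └── Scale with batch size and sequence length
│
└── Temporary buffers
    └── Intermediate results of computation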
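
The per-parameter costs above can be turned into a rough estimate. The sketch below is a minimal illustration, not a measurement tool: the function name `training_memory_gb` and its arguments are hypothetical, it assumes 1 GB = 1e9 bytes, and it ignores activations and temporary buffers, which depend on batch size and sequence length.

```python
def training_memory_gb(num_params, param_bytes=2, grad_bytes=2, optim_bytes_per_param=8):
    """Estimate memory for model states in GB (1 GB = 1e9 bytes), excluding activations."""
    gb = 1e9
    breakdown = {
        "parameters": num_params * param_bytes / gb,
        "gradients": num_params * grad_bytes / gb,
        "optimizer_states": num_params * optim_bytes_per_param / gb,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown

# 7B parameters, FP16 weights and gradients (assumed), two optimizer variants
fp32_adam = training_memory_gb(7e9, optim_bytes_per_param=8)  # Adam with FP32 states
adam_8bit = training_memory_gb(7e9, optim_bytes_per_param=2)  # 8-bit Adam

for name, result in (("Adam FP32", fp32_adam), ("Adam 8-bit", adam_8bit)):
    print(name, result)
```

Under these assumptions a 7B model needs 14 GB for weights, 14 GB for gradients, and either 56 GB or 14 GB for optimizer states, so the optimizer choice dominates the total.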

Quantized Training #

python

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "Qwen/Qwen2-7B"  # example checkpoint; replace with your own

# 4-bit quantization (NF4 with double quantization)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# 8-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

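text

Weight memory by precision (weights only; activations are extra):

FP32 (no quantization):
├── 7B model: 28 GB
├── 13B model: 52 GB
└── 70B model: 280 GB

FP16 (half precision):
├── 7B model: 14 GB
├── 13B model: 26 GB
└── 70B model: 140 GB

8-bit quantization:
├── 7B model: 7 GB
├── 13B model: 13 GB
└── 70B model: 70 GB

4-bit quantization:
├── 7B model: 3.5 GB
├── 13B model: 6.5 GB
└── 70B model: 35 GB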
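
The figures above follow directly from bits per parameter. As a quick check, here is a small sketch; the helper name `weight_memory_gb` is hypothetical, and it assumes 1 GB = 1e9 bytes and ignores quantization metadata such as scaling constants.

```python
def weight_memory_gb(params_billion, bits):
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 parameters) * (bits / 8 bytes) / 1e9 bytes per GB
    return params_billion * bits / 8

for size in (7, 13, 70):
    row = {bits: weight_memory_gb(size, bits) for bits in (32, 16, 8, 4)}
    print(f"{size}B:", row)
```

Real quantized checkpoints are slightly larger than this estimate because some layers are usually kept in higher precision.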

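Gradient Checkpointing #

python

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    gradient_checkpointing=True,
)

# Or enable it directly on the model:
model.gradient_checkpointing_enable()

text

How gradient checkpointing works:
├── Keeps only a subset of activations during the forward pass
├── Recomputes the rest when the backward pass needs them
├── Trades compute time for memory
└── Typically cuts memory use by about 30-50%

When to use it:
├── GPU memory is insufficient
├── Training large models
├── Training on long sequences
└── Single-GPU training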
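
To make the store-less, recompute-later idea concrete, here is a toy sketch in plain Python. It is not how PyTorch implements checkpointing: the scalar `tanh(w * x)` layers, the function names, and the segment size of 4 are all assumptions chosen for illustration. It shows that recomputing activations from sparse checkpoints yields the same gradients as storing every activation.

```python
import math

def layer(w, x):
    return math.tanh(w * x)

def backward_layer(w, x, upstream):
    """Return (grad wrt w, grad wrt x) for y = tanh(w * x)."""
    y = math.tanh(w * x)
    local = upstream * (1.0 - y * y)
    return local * x, local * w

def grads_full(weights, x0):
    """Baseline: store the input of every layer."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grads, upstream = [0.0] * len(weights), 1.0
    for i in reversed(range(len(weights))):
        grads[i], upstream = backward_layer(weights[i], inputs[i], upstream)
    return grads, len(inputs)

def grads_checkpointed(weights, x0, segment=4):
    """Store one input per segment; recompute the others during backward."""
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer(w, x)
    grads, upstream = [0.0] * len(weights), 1.0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(weights))
        # Recompute the inputs inside this segment from its checkpoint.
        inputs, x = [], checkpoints[start]
        for i in range(start, end):
            inputs.append(x)
            x = layer(weights[i], x)
        for i in reversed(range(start, end)):
            grads[i], upstream = backward_layer(weights[i], inputs[i - start], upstream)
    return grads, len(checkpoints)

weights = [0.5 + 0.1 * i for i in range(16)]
full, stored_full = grads_full(weights, 0.3)
ckpt, stored_ckpt = grads_checkpointed(weights, 0.3)
print(stored_full, stored_ckpt)  # activations kept after the forward pass
```

The checkpointed version keeps 4 values after the forward pass instead of 16, plus at most one segment of recomputed values at a time, at the cost of running each layer's forward computation a second time.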

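Gradient Accumulation #

python

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch size per device: 1 × 16 = 16
)

text

Effects of gradient accumulation:
├── Lowers peak memory compared with running the full batch at once
├── Simulates large-batch training
├── Leaves model quality essentially unchanged
└── Slightly increases training time

Recommended settings:
├── Small memory: batch_size=1, accumulation=16-32
├── Medium memory: batch_size=4, accumulation=4-8
└── Large memory: batch_size=8, accumulation=2-4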
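
The claim that accumulation simulates a large batch can be checked on a toy problem. The sketch below is an illustration under simple assumptions (a one-parameter linear model with mean-squared-error loss and made-up data); it compares the gradient of one 16-sample batch with 16 accumulated single-sample gradients.

```python
def mse_grad(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(float(i), 3.0 * i + 1.0) for i in range(1, 17)]  # 16 toy samples
w = 0.5

# One large batch of 16 samples.
full_grad = mse_grad(w, data)

# Sixteen micro-batches of size 1, accumulated before a single update.
accumulation_steps = 16
micro_batch_size = 1
accumulated = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro_batch = data[start:start + micro_batch_size]
    # Scale each micro-batch gradient by 1/accumulation_steps so the sum is a mean.
    accumulated += mse_grad(w, micro_batch) / accumulation_steps

effective_batch_size = micro_batch_size * accumulation_steps
print(full_grad, accumulated, effective_batch_size)
```

The two gradients agree, which is why the setting above (batch size 1 with 16 accumulation steps) behaves like a batch of 16 while only one sample's activations are in memory at a time.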

Choosing an Optimizer #

python

from transformers import TrainingArguments

# Standard AdamW
training_args = TrainingArguments(
    output_dir="./results",
    optim="adamw_torch",
)

# 8-bit AdamW (requires bitsandbytes)
training_args = TrainingArguments(
    output_dir="./results",
    optim="adamw_8bit",
)

# Paged 8-bit AdamW (requires bitsandbytes)
training_args = TrainingArguments(
    output_dir="./results",
    optim="paged_adamw_8bit",
)

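text

Optimizer memory comparison (7B model, FP16 weights and gradients):

AdamW (FP32 states):
├── Parameters: 14 GB
├── Gradients: 14 GB
├── Optimizer states: 56 GB (7B × 8 bytes)
└── Total: 84 GB

AdamW 8-bit:
├── Parameters: 14 GB
├── Gradients: 14 GB
├── Optimizer states: 14 GB (7B × 2 bytes)
└── Total: 42 GB

Savings: 42 GB (50%)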

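Mixed-Precision Training #

python

from transformers import TrainingArguments

# FP16 mixed precision
training_args = TrainingArguments(
    output_dir="./results",
    fp16=True,
)

# BF16 mixed precision
training_args = TrainingArguments(
    output_dir="./results",
    bf16=True,
)

text

Benefits of mixed precision:
├── Up to about 50% lower memory use
├── Up to about 2x faster training on GPUs with Tensor Cores
├── Accuracy close to FP32 training
└── Hardware requirements:
    ├── FP16: most CUDA GPUs (Tensor Core acceleration from Volta onward)
    └── BF16: Ampere or newer GPUs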
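
The hardware rule above can be expressed as a small helper. This is a sketch only: the function `mixed_precision_flags` is hypothetical, and it takes the CUDA compute capability as an argument so that it runs without a GPU. In a real script the tuple would typically come from `torch.cuda.get_device_capability()`.

```python
def mixed_precision_flags(compute_capability):
    """Map a CUDA compute capability (major, minor) to TrainingArguments flags."""
    major, _minor = compute_capability
    if major >= 8:  # Ampere (8.x) and newer support BF16
        return {"bf16": True, "fp16": False}
    return {"bf16": False, "fp16": True}

print(mixed_precision_flags((8, 0)))  # A100-class GPU
print(mixed_precision_flags((7, 5)))  # T4-class GPU
```

The returned dictionary can be unpacked into the arguments, for example `TrainingArguments(output_dir="./results", **flags)`.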

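Training Acceleration #

Flash Attention #

python

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    # Requires the flash-attn package (older releases used use_flash_attention_2=True)
    attn_implementation="flash_attention_2",
)

text

Benefits of Flash Attention:
├── Speeds up attention computation by roughly 2-4x
├── Reduces memory use by not materializing the full attention matrix
├── Supports longer sequences
└── FlashAttention-2 requires an Ampere or newer GPU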
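
To see why avoiding the full attention matrix matters for long sequences, the sketch below estimates how much memory standard attention would need for the score matrices of a single layer. The helper name and the example shape (batch 1, 32 heads, FP16) are assumptions for illustration.

```python
def attention_scores_gib(batch_size, num_heads, seq_len, bytes_per_element=2):
    """Size of the full (seq_len x seq_len) attention score matrices for one layer, in GiB."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element / 1024**3

for seq_len in (2048, 4096, 8192):
    print(seq_len, attention_scores_gib(1, 32, seq_len))
```

Doubling the sequence length quadruples this cost, which is the quadratic term that Flash Attention avoids storing.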

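DeepSpeed #

python

from transformers import TrainingArguments

# "auto" lets the Trainer fill in values from TrainingArguments,
# which avoids mismatches between the two configurations.
ds_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
        },
    },
    "fp16": {
        "enabled": "auto",
    },
    "zero_optimization": {
        "stage": 2,
    },
}

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    deepspeed=ds_config,
)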

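text

DeepSpeed ZeRO stages:

ZeRO-1:
├── Shards optimizer states across GPUs
├── Up to about 4x lower model-state memory per GPU
└── Low communication overhead

ZeRO-2:
├── Shards optimizer states and gradients
├── Up to about 8x lower model-state memory per GPU
└── Moderate communication overhead

ZeRO-3:
├── Shards optimizer states, gradients, and parameters
├── Model-state memory drops roughly N-fold (N = number of GPUs)
└── High communication overhead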
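
The reduction factors above come from how much of the model state each stage shards. The sketch below uses the accounting from the ZeRO paper for mixed-precision Adam: 2 bytes of FP16 weights, 2 bytes of FP16 gradients, and 12 bytes of optimizer state per parameter. The 12 bytes include FP32 master weights, so the figure differs from the 8-byte count used in the memory analysis section. The function name and the 7.5B-parameter, 64-GPU example are chosen for illustration.

```python
def zero_model_state_gb(num_params, num_gpus, stage, k=12):
    """Per-GPU memory for model states in GB (1e9 bytes) under mixed-precision Adam.

    Per parameter: 2 bytes FP16 weights + 2 bytes FP16 gradients + k bytes of
    optimizer state (k=12: FP32 master weights, momentum, and variance).
    """
    params, grads, optim = 2 * num_params, 2 * num_params, k * num_params
    if stage >= 1:
        optim /= num_gpus   # ZeRO-1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus   # ZeRO-2 also shards gradients
    if stage >= 3:
        params /= num_gpus  # ZeRO-3 also shards parameters
    return (params + grads + optim) / 1e9

for stage in range(4):
    print(stage, round(zero_model_state_gb(7.5e9, 64, stage), 1))
```

With 64 GPUs the per-GPU model state falls from 120 GB without ZeRO to roughly 31.4 GB, 16.6 GB, and 1.9 GB for stages 1 to 3, which is where the approximate 4x and 8x factors come from. With fewer GPUs the savings are smaller.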

FSDP #

python

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    fsdp="full_shard auto_wrap",
    fsdp_config={
        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
        "fsdp_sharding_strategy": "FULL_SHARD",
    },
)

text

Benefits of FSDP:
├── Native PyTorch support
├── Fully sharded data parallelism
├── High memory efficiency
└── Well suited to very large models

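Multi-GPU Training #

python

from transformers import TrainingArguments

# Single node, multiple GPUs: the script is unchanged; launch it with
# `torchrun --nproc_per_node=<num_gpus> train.py` so the Trainer uses DDP.
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    num_train_epochs=3,
)

bash

# Multiple nodes, multiple GPUs: run this on every node and change
# --node_rank on each one (0 on the master node, then 1, 2, 3).
torchrun --nproc_per_node=8 \
         --nnodes=4 \
         --node_rank=0 \
         --master_addr="10.0.0.1" \
         --master_port=12345 \
         train.py

text

Multi-GPU training strategies:

Data parallel (DP):
├── Replicates the model on every GPU
├── Splits each batch across GPUs
├── Simple to use
└── Low memory efficiency

Distributed data parallel (DDP):
├── One process and one model replica per GPU
├── Synchronizes gradients across processes
├── High efficiency
└── Recommended over DP

Fully sharded data parallel (FSDP):
├── Shards model parameters across GPUs
├── Highest memory efficiency
└── Well suited to very large models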
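
When scaling from one node to several, it helps to work out how many processes are launched and how large the global batch becomes. The helpers below are hypothetical and assume a homogeneous cluster launched with a fixed `--node_rank` per node, as in the command above.

```python
def world_size(nproc_per_node, nnodes):
    """Total number of worker processes launched across all nodes."""
    return nproc_per_node * nnodes

def global_rank(node_rank, local_rank, nproc_per_node):
    """Global rank of the process with the given local rank on a given node."""
    return node_rank * nproc_per_node + local_rank

def global_batch_size(per_device_batch_size, grad_accum_steps, nproc_per_node, nnodes):
    """Samples contributing to one optimizer step across all GPUs."""
    return per_device_batch_size * grad_accum_steps * world_size(nproc_per_node, nnodes)

print(world_size(8, 4))               # 4 nodes with 8 GPUs each
print(global_rank(3, 7, 8))           # last GPU on the last node
print(global_batch_size(4, 1, 8, 4))  # per-device batch 4, no accumulation
```

Because the global batch grows with the number of GPUs, the learning rate or accumulation steps often need to be revisited when moving to more nodes.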

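Inference Optimization #

Model Quantization #

python

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit weight quantization at load time (bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# Dynamic quantization with ONNX Runtime (is_static=False).
# model_name must point to a model already exported to ONNX;
# static quantization would additionally need a calibration dataset.
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

quantizer = ORTQuantizer.from_pretrained(model_name)
qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)
quantizer.quantize(save_dir="./quantized-model", quantization_config=qconfig)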
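
To show what 8-bit quantization does to the numbers themselves, here is a simplified sketch of symmetric absmax quantization in plain Python. It is not the algorithm used by bitsandbytes or ONNX Runtime, which add refinements such as per-channel scales and outlier handling; the function names and sample weights are made up for illustration.

```python
def quantize_absmax(values):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 0.9, -1.27]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one signed byte plus a shared scale, and the rounding error per weight is at most half of the scale.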

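Model Pruning #

python

import torch
import torch.nn.utils.prune as prune

def prune_model(model, amount=0.3):
    """Zero out the smallest-magnitude weights in every Linear layer."""
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            # Masks the fraction `amount` of weights with the lowest absolute value.
            # The tensors keep their shape, so dense kernels do not get faster by themselves.
            prune.l1_unstructured(module, name="weight", amount=amount)

    return model

model = prune_model(model, amount=0.3)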
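
The selection rule behind L1 unstructured pruning can be illustrated without PyTorch. The sketch below is a simplified stand-in that works on a flat list of made-up weights; the function name `magnitude_prune` is hypothetical.

```python
def magnitude_prune(weights, amount=0.3):
    """Zero out the fraction `amount` of weights with the smallest absolute value."""
    n_prune = round(amount * len(weights))
    # Indices sorted by magnitude, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]

weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4, -0.2, 0.6, -0.7, 0.1]
pruned = magnitude_prune(weights, amount=0.3)
sparsity = pruned.count(0.0) / len(pruned)
print(pruned, sparsity)
```

With `amount=0.3`, the three weights closest to zero are removed and the rest are left unchanged.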

Knowledge Distillation #

python

from torch.nn.functional import kl_div, log_softmax, softmax
from transformers import AutoModelForCausalLM, AutoTokenizer

# Teacher and student must share a vocabulary so that their logits line up
teacher_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-72B")
student_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B")

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    teacher_probs = softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = log_softmax(student_logits / temperature, dim=-1)

    # Multiply by T^2 to keep gradient magnitudes comparable across temperatures
    return kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (temperature ** 2)
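
The same loss can be traced by hand for a single token position. The sketch below reimplements it in plain Python on made-up three-class logits, so the effect of the temperature and the KL term can be inspected without loading any model.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

teacher = [2.0, 1.0, 0.1]
same = distillation_loss(teacher, teacher)
different = distillation_loss(teacher, [0.5, 0.5, 0.5])
print(same, different)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge.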

Inference Engines #

text

┌─────────────────────────────────────────────────────────────┐
│                 Inference Engine Comparison                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  vLLM                                                       │
│  ├── High-throughput serving                                │
│  ├── PagedAttention                                         │
│  ├── Continuous batching                                    │
│  └── Common choice for production                           │
│                                                             │
│  TGI (Text Generation Inference)                            │
│  ├── Built by Hugging Face                                  │
│  ├── Full-featured                                          │
│  ├── Easy to deploy                                         │
│  └── Production-grade                                       │
│                                                             │
│  TensorRT-LLM                                               │
│  ├── Built by NVIDIA                                        │
│  ├── Heavily optimized kernels                              │
│  ├── Requires TensorRT                                      │
│  └── Peak performance on NVIDIA GPUs                        │
│                                                             │
│  llama.cpp                                                  │
│  ├── C++ implementation                                     │
│  ├── CPU inference (GPU optional)                           │
│  ├── Cross-platform                                         │
│  └── Suited to edge devices                                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

python

# vLLM example
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-7B")
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=100)

outputs = llm.generate(["Hello, please introduce yourself."], sampling_params)
print(outputs[0].outputs[0].text)

Performance Monitoring #

Training Monitoring #

python

import torch

def monitor_training():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"GPU memory reserved: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
    print(f"GPU peak memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

# Call it inside the training loop:
for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()

    if step % 100 == 0:
        monitor_training()

Profiling #

python

import torch.profiler as profiler

with profiler.profile(
    activities=[
        profiler.ProfilerActivity.CPU,
        profiler.ProfilerActivity.CUDA,
    ],
    schedule=profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=profiler.tensorboard_trace_handler('./logs'),
    record_shapes=True,
    profile_memory=True,
) as prof:
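    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()  # clear gradients so they do not accumulate across steps
        prof.step()  # advance the profiler schedule once per training step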

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
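
The `schedule` arguments decide which training steps are actually recorded. The sketch below is a simplified model of that schedule, written only to show the step pattern. It is not the PyTorch implementation, it ignores the `skip_first` option, and the function name `schedule_actions` is hypothetical.

```python
def schedule_actions(wait, warmup, active, repeat, total_steps):
    """Label each step as the schedule would: wait, warmup, record, or idle."""
    cycle = wait + warmup + active
    actions = []
    for step in range(total_steps):
        if step >= cycle * repeat:
            actions.append("idle")  # all cycles have finished
            continue
        pos = step % cycle
        if pos < wait:
            actions.append("wait")
        elif pos < wait + warmup:
            actions.append("warmup")
        else:
            actions.append("record")
    return actions

actions = schedule_actions(wait=1, warmup=1, active=3, repeat=2, total_steps=12)
recorded_steps = [i for i, a in enumerate(actions) if a == "record"]
print(actions)
print(recorded_steps)
```

With `wait=1, warmup=1, active=3, repeat=2`, only six of the first ten steps are recorded and nothing is recorded afterwards, so a short run of about ten steps is enough to collect the trace.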

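Optimization Strategies #

Choosing by Scenario #

text

Single GPU, small memory (<16 GB):
├── QLoRA (4-bit quantization)
├── Gradient checkpointing
├── Small batches with gradient accumulation
├── 8-bit optimizer
└── Mixed precision

Single GPU, large memory (>32 GB):
├── LoRA (large rank)
├── Flash Attention
├── Medium batch sizes
└── Mixed precision

Multi-GPU training:
├── DDP (2-4 GPUs)
├── FSDP (4+ GPUs)
├── DeepSpeed ZeRO
└── Gradient checkpointing (optional)

Very large models (>70B):
├── DeepSpeed ZeRO-3
├── FSDP
├── Model parallelism
└── CPU offloading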

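Optimization Priorities #

text

1. Memory optimization (required):
   ├── Quantized training
   ├── Gradient checkpointing
   └── Gradient accumulation

2. Speed optimization (recommended):
   ├── Mixed precision
   ├── Flash Attention
   └── Optimizer choice

3. Distributed optimization (optional):
   ├── DDP
   ├── FSDP
   └── DeepSpeed

4. Inference optimization (at deployment):
   ├── Model quantization
   ├── Inference engine
   └── Batching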

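Common Issues #

Still Running Out of Memory #

text

Solutions:
1. Quantize further
   - Use 4-bit quantization
   - Enable double quantization

2. Shrink the model
   - Use a smaller base model
   - Reduce the LoRA rank

3. Reduce the batch size
   - batch_size=1
   - Increase gradient_accumulation_steps

4. Use CPU offloading
   - DeepSpeed ZeRO-Offload
   - FSDP CPU offload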
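Slow Training #

text

Solutions:
1. Enable mixed precision
   fp16=True or bf16=True

2. Use Flash Attention
   attn_implementation="flash_attention_2"

3. Increase the batch size
   As far as GPU memory allows

4. Speed up data loading
   - Prefetch data
   - Load with multiple workers (dataloader_num_workers)

5. Log less often
   logging_steps=100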

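Next Steps #

Now that you have covered these performance optimization techniques, move on to the hands-on text classification tutorial and start a real project!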