Variant Methods #

Variant Overview #

text

┌─────────────────────────────────────────────────────────────┐
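│                     LoRA Variant Family                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Quantization variants:                                     │
│  ├── QLoRA: 4-bit quantization + LoRA                       │
│  └── LoftQ: quantization-aware LoRA                         │
│                                                             │
│  Adaptive variants:                                         │
│  ├── AdaLoRA: adaptive rank allocation                      │
│  └── SoRA: sparse LoRA                                      │
│                                                             │
│  Optimization variants:                                     │
│  ├── LoRA+: asymmetric learning rates                       │
│  ├── DoRA: weight decomposition                             │
│  └── rsLoRA: rank-stabilized scaling                        │
│                                                             │
│  Structural variants:                                       │
│  ├── LoRA-FA: frozen A matrix                               │
│  ├── VeRA: vector-based random matrices                     │
│  └── Delta-LoRA: incremental updates                        │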
│                                                             │
└─────────────────────────────────────────────────────────────┘
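
The overview lists rsLoRA only by name. As a small illustration of what "rank-stabilized scaling" changes, the sketch below compares the standard LoRA scaling factor α/r with the rsLoRA factor α/√r. The function names are illustrative and not taken from any library.

```python
import math

def lora_scaling(alpha: float, r: int) -> float:
    """Standard LoRA scaling factor: alpha / r."""
    return alpha / r

def rslora_scaling(alpha: float, r: int) -> float:
    """Rank-stabilized scaling factor: alpha / sqrt(r)."""
    return alpha / math.sqrt(r)

# With alpha fixed, the standard factor shrinks much faster as the rank grows.
for r in (4, 16, 64):
    print(f"r={r:>2}  LoRA: {lora_scaling(16, r):.3f}  rsLoRA: {rslora_scaling(16, r):.3f}")
```

With a fixed α, the standard factor falls linearly in r while the rsLoRA factor falls only with √r, which is why rsLoRA is usually discussed in the context of larger ranks.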

QLoRA #

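Introduction #

QLoRA (Quantized Low-Rank Adaptation) combines quantization of the frozen base model with LoRA, sharply reducing GPU memory requirements while keeping quality close to standard LoRA.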

text

┌─────────────────────────────────────────────────────────────┐
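│                    QLoRA Core Techniques                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. 4-bit NormalFloat (NF4) quantization                    │
│     ├── Optimized for normally distributed weights          │
│     └── More accurate than standard 4-bit formats           │
│                                                             │
│  2. Double quantization                                     │
│     ├── Quantizes the quantization constants                │
│     └── Saves additional memory                             │
│                                                             │
│  3. Paged optimizers                                        │
│     ├── Pages optimizer state to CPU RAM when GPU is full   │
│     └── Avoids out-of-memory errors                         │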
│                                                             │
└─────────────────────────────────────────────────────────────┘
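
To make the first technique more concrete, here is a simplified sketch of block-wise 4-bit quantization with a codebook built from normal-distribution quantiles. It is not the exact NF4 table used by bitsandbytes (the real one is asymmetric and contains an exact zero); the function names and the midpoint-quantile construction are assumptions made for illustration.

```python
from statistics import NormalDist

def normal_quantile_codebook(bits: int = 4) -> list[float]:
    """16 levels at the midpoints of equal-probability bins of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    nd = NormalDist()
    levels = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    max_abs = max(abs(v) for v in levels)
    return [v / max_abs for v in levels]

def quantize_block(block: list[float], codebook: list[float]):
    """Scale the block by its absolute maximum and store the nearest codebook index per weight."""
    absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero for an all-zero block
    idx = [min(range(len(codebook)), key=lambda j: abs(codebook[j] - w / absmax))
           for w in block]
    return idx, absmax

def dequantize_block(idx: list[int], absmax: float, codebook: list[float]) -> list[float]:
    """Look up each index and undo the absmax scaling."""
    return [codebook[i] * absmax for i in idx]

codebook = normal_quantile_codebook()
indices, scale = quantize_block([0.1, -0.2, 0.4, -0.8], codebook)
print(indices, scale)
print(dequantize_block(indices, scale, codebook))
```

Each block stores one 4-bit index per weight plus a single scale (`absmax`). Those per-block scales are the "quantization constants" that double quantization compresses further.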

QLoRA Implementation #

python

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"

# Store weights as 4-bit NF4 with double quantization; run matmuls in float16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load the frozen base model in 4-bit precision.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training before adding adapters.
model = prepare_model_for_kbit_training(model)

# LoRA adapters on all attention and MLP projection layers.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

QLoRA vs LoRA #

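Property	LoRA	QLoRA
GPU memory (7B model)	~15 GB	~6 GB
Base model precision	FP16/BF16	4-bit NF4
Training speed	Fast	Slightly slower
Quality	Baseline	Close to baseline
Best for	Ample GPU memory	Limited GPU memory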
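
Most of the memory gap in the table comes from the base weights. The back-of-the-envelope sketch below estimates the weight memory only, ignoring activations, adapter weights, and optimizer state. The block sizes (64 weights per block, 256 constants per second-level block) follow the setup described in the QLoRA paper; the helper name is illustrative.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

BLOCK = 64     # weights sharing one quantization constant
BLOCK2 = 256   # first-level constants sharing one second-level constant

# Extra bits per parameter spent on quantization constants.
overhead_plain = 32 / BLOCK                           # one fp32 constant per block
overhead_double = 8 / BLOCK + 32 / (BLOCK * BLOCK2)   # 8-bit constants + fp32 second level

n = 7e9  # 7B parameters
fp16_gb = weight_memory_gb(n, 16)
nf4_gb = weight_memory_gb(n, 4 + overhead_double)

print(f"FP16 weights:           {fp16_gb:.2f} GB")
print(f"NF4 + double quant:     {nf4_gb:.2f} GB")
print(f"Constant overhead bits: {overhead_plain:.3f} -> {overhead_double:.3f} per parameter")
```

The totals in the table are higher than these figures because training also needs memory for activations, gradients of the adapter weights, and optimizer state.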

AdaLoRA #

Introduction #

AdaLoRA (Adaptive Budget Allocation for Low-Rank Adaptation) dynamically adjusts the rank of each adapted weight matrix according to its importance.

text

┌─────────────────────────────────────────────────────────────┐
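│                     AdaLoRA Core Ideas                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Dynamic rank allocation:                                   │
│  ├── Important layers: higher rank                          │
│  ├── Less important layers: lower rank                      │
│  └── Allocation is learned during training                  │
│                                                             │
│  SVD-style parameterization:                                │
│  ├── ΔW = PΛQ                                               │
│  ├── Λ: learnable singular values                           │
│  └── Prune singular values by importance score              │
│                                                             │
│  Advantages:                                                │
│  ├── Automatic parameter budget allocation                  │
│  ├── More efficient use of parameters                       │
│  └── Adapts to different task requirements                  │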
│                                                             │
└─────────────────────────────────────────────────────────────┘
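
The pruning step can be pictured as a global top-k selection: every singular value in every layer gets an importance score, and only the highest-scoring ones survive. The sketch below uses made-up scores and a hypothetical `allocate_ranks` helper; the actual method derives importance from smoothed gradient-based sensitivity, which is not reproduced here.

```python
def allocate_ranks(importance: dict[str, list[float]], budget: int) -> dict[str, int]:
    """Keep the `budget` highest-scoring singular values across all layers.

    Returns the resulting rank (number of kept singular values) per layer.
    """
    scored = [(score, layer) for layer, scores in importance.items() for score in scores]
    scored.sort(reverse=True)
    kept = {layer: 0 for layer in importance}
    for _, layer in scored[:budget]:
        kept[layer] += 1
    return kept

# Illustrative importance scores, one per singular value (initial rank 4 everywhere).
importance = {
    "layer0.q_proj": [0.9, 0.7, 0.1, 0.05],
    "layer0.v_proj": [0.8, 0.6, 0.5, 0.4],
    "layer1.q_proj": [0.2, 0.1, 0.05, 0.01],
}
ranks = allocate_ranks(importance, budget=6)
print(ranks)
```

With a total budget of 6, the matrix with consistently high scores keeps its full rank while the least important one is pruned away entirely.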

AdaLoRA Implementation #

python

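from peft import AdaLoraConfig, get_peft_model

# Assumes `model` is a freshly loaded base model that is not yet wrapped by PEFT.
adalora_config = AdaLoraConfig(
    init_r=12,        # initial rank of every adapted matrix
    target_r=8,       # target average rank after pruning
    beta1=0.85,       # EMA coefficient for sensitivity smoothing
    beta2=0.85,       # EMA coefficient for uncertainty quantification
    tinit=200,        # warmup steps before rank pruning starts
    tfinal=1000,      # final steps trained with the rank budget fixed
    deltaT=10,        # steps between two budget reallocations
    total_step=3000,  # total training steps (placeholder); recent PEFT releases require it
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, adalora_config)
model.print_trainable_parameters()
# In a custom training loop, the rank budget is only updated when
# model.base_model.update_and_allocate(global_step) is called after each optimizer step.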

AdaLoRA Parameter Reference #

python

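adalora_params = {
    "init_r": "initial rank of each adapted matrix",
    "target_r": "target average rank after pruning",
    "beta1": "EMA coefficient for smoothing the sensitivity (importance) scores",
    "beta2": "EMA coefficient for the uncertainty estimate of those scores",
    "tinit": "number of initial warmup steps before pruning starts",
    "tfinal": "number of final steps trained with the budget fixed at target_r",
    "deltaT": "interval, in steps, between budget reallocations",
}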
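
How `init_r`, `target_r`, `tinit`, and `tfinal` interact is easier to see as a schedule. The sketch below follows the cubic budget schedule described in the AdaLoRA paper, expressed as an average rank per matrix. It assumes that `tfinal` counts the final steps trained at the target budget and that `total_step` (3000 here) is a placeholder for the length of the training run; `rank_budget` is a hypothetical helper, not a PEFT function.

```python
def rank_budget(step: int, total_step: int, init_r: int = 12, target_r: int = 8,
                tinit: int = 200, tfinal: int = 1000) -> float:
    """Average rank budget at a given training step (cubic decay from init_r to target_r)."""
    if step <= tinit:
        return float(init_r)            # warmup: keep the full initial rank
    if step >= total_step - tfinal:
        return float(target_r)          # final phase: budget is fixed
    progress = (step - tinit) / (total_step - tfinal - tinit)
    return target_r + (init_r - target_r) * (1 - progress) ** 3

for step in (0, 200, 650, 1100, 1550, 2000, 3000):
    print(step, round(rank_budget(step, total_step=3000), 3))
```

The budget drops quickly right after warmup and flattens out as it approaches `target_r`; `deltaT` controls how often the pruning mask is actually recomputed along this curve.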

LoRA+ #

Introduction #

LoRA+ improves training by using different learning rates for the A and B matrices, with a larger rate for B.

text

┌─────────────────────────────────────────────────────────────┐
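│                   LoRA+ Core Improvement                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Asymmetric learning rates:                                 │
│  ├── η_A = η                                                │
│  ├── η_B = λ × η                                            │
│  └── η_B / η_A = λ ≫ 1 (e.g. λ = 16)                        │
│                                                             │
│  Rationale:                                                 │
│  ├── Gradients of A and B differ in scale                   │
│  ├── A larger rate for B balances the updates               │
│  └── Faster convergence                                     │
│                                                             │
│  Reported effects:                                          │
│  ├── Up to ~2x faster training                              │
│  ├── Slightly better final quality                          │
│  └── No extra compute cost                                  │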
│                                                             │
└─────────────────────────────────────────────────────────────┘

LoRA+ Implementation #

python

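import torch
from torch.optim import AdamW

def create_lora_plus_optimizer(model, base_lr=2e-4, lr_ratio=16.0):
    """Build AdamW with a larger learning rate for the LoRA B matrices.

    lr_ratio is the ratio η_B / η_A; 16 is a common starting point.
    """
    lora_a_params = []
    lora_b_params = []
    other_params = []

    # Split trainable parameters into A matrices, B matrices, and everything else.
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue

        if "lora_A" in name:
            lora_a_params.append(param)
        elif "lora_B" in name:
            lora_b_params.append(param)
        else:
            other_params.append(param)

    param_groups = [
        {"params": lora_a_params, "lr": base_lr},
        {"params": lora_b_params, "lr": base_lr * lr_ratio},
        {"params": other_params, "lr": base_lr},
    ]

    # Skip groups that ended up empty (e.g. no trainable non-LoRA parameters).
    optimizer = AdamW([g for g in param_groups if g["params"]])

    return optimizer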

DoRA #

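Introduction #

DoRA (Weight-Decomposed Low-Rank Adaptation) decomposes each weight matrix into a magnitude component and a direction component, and applies the low-rank update to the direction.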

text

┌─────────────────────────────────────────────────────────────┐
│                       DoRA Core Idea                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Weight decomposition:                                      │
│  W = m × V                                                  │
│                                                             │
│  where:                                                     │
│  ├── m: magnitude vector                                    │
│  ├── V: direction matrix                                    │
│  └── ||V||_col = 1 (column-normalized)                      │
│                                                             │
│  LoRA update:                                               │
│  V' = (V₀ + BA) / ||V₀ + BA||                               │
│  m' = m₀ + Δm (trained directly)                            │
│                                                             │
│  Advantages:                                                │
│  ├── Magnitude and direction are optimized separately       │
│  ├── More stable training                                   │
│  └── Often better results                                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
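
The decomposition itself is plain arithmetic and can be checked on a tiny matrix. The sketch below splits each row of a weight matrix into a norm and a unit vector, then recombines them. It uses one magnitude per output unit (per row), matching the `magnitude` vector of length `out_features` in the `DoRALinear` class in this section; `decompose` and `recompose` are illustrative helper names.

```python
import math

def decompose(weight: list[list[float]]):
    """Split each row of `weight` into a magnitude and a unit-norm direction."""
    magnitudes = [math.sqrt(sum(w * w for w in row)) for row in weight]
    directions = [[w / m for w in row] for row, m in zip(weight, magnitudes)]
    return magnitudes, directions

def recompose(magnitudes: list[float], directions: list[list[float]]):
    """Rebuild the weight matrix as magnitude times direction, row by row."""
    return [[m * d for d in row] for row, m in zip(directions, magnitudes)]

W = [[3.0, 4.0], [0.0, 2.0]]
m, V = decompose(W)
print("magnitudes:", m)
print("directions:", V)
print("recomposed:", recompose(m, V))
```

During fine-tuning the magnitudes are trained directly, while the directions change only through the low-rank update followed by renormalization.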

DoRA Implementation #

python

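import torch
import torch.nn as nn
import torch.nn.functional as F

class DoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.r = r
        self.alpha = alpha
        self.scaling = alpha / r

        # Frozen pretrained weight. The random init is only a stand-in;
        # in practice, copy the pretrained weight in before computing the magnitude.
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        nn.init.kaiming_uniform_(self.weight, a=5**0.5)

        self.lora_A = nn.Parameter(torch.empty(r, in_features))
        self.lora_B = nn.Parameter(torch.empty(out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=5**0.5)
        nn.init.zeros_(self.lora_B)  # BA = 0, so training starts from the base weight

        # One magnitude per output unit, initialized to the row norms of the
        # base weight so that the layer initially reproduces the base layer.
        self.magnitude = nn.Parameter(self.weight.norm(p=2, dim=1).detach().clone())

    def forward(self, x):
        # Low-rank update in weight space: (out, r) @ (r, in) -> (out, in).
        delta_w = self.lora_B @ self.lora_A

        # Updated direction, normalized to unit norm per output unit.
        V = self.weight + delta_w * self.scaling
        V_norm = F.normalize(V, p=2, dim=1)

        # Rescale each direction by its learned magnitude.
        W_dora = self.magnitude.unsqueeze(1) * V_norm

        return F.linear(x, W_dora)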

Other Variants #

LoRA-FA #

python

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAFA(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.r = r
        self.alpha = alpha
        self.scaling = alpha / r

        # Frozen pretrained weight (uninitialized placeholder; load real weights in practice).
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.weight.requires_grad = False

        self.lora_A = nn.Parameter(torch.empty(r, in_features))
        self.lora_B = nn.Parameter(torch.empty(out_features, r))

        # A is randomly initialized and then frozen; only B is trained.
        nn.init.kaiming_uniform_(self.lora_A, a=5**0.5)
        self.lora_A.requires_grad = False

        nn.init.zeros_(self.lora_B)

    def forward(self, x):
        result = F.linear(x, self.weight)
        lora_out = x @ self.lora_A.T @ self.lora_B.T
        return result + lora_out * self.scaling

VeRA #

python

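import torch
import torch.nn as nn
import torch.nn.functional as F

class VeRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.r = r
        self.alpha = alpha
        self.scaling = alpha / r

        # Frozen pretrained weight (uninitialized placeholder; load real weights in practice).
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)

        # Frozen random projections. VeRA shares one pair across layers;
        # this sketch keeps a copy per layer for simplicity.
        self.shared_A = nn.Parameter(torch.empty(r, in_features), requires_grad=False)
        self.shared_B = nn.Parameter(torch.empty(out_features, r), requires_grad=False)
        nn.init.normal_(self.shared_A, std=0.1)
        nn.init.normal_(self.shared_B, std=0.1)

        # Only the two scaling vectors are trained.
        self.vector_a = nn.Parameter(torch.full((r,), 0.1))      # scales the rank dimension
        self.vector_b = nn.Parameter(torch.zeros(out_features))  # zero init: no change at start

    def forward(self, x):
        result = F.linear(x, self.weight)

        A = self.shared_A * self.vector_a.unsqueeze(1)  # (r, in): scale each rank row
        B = self.shared_B * self.vector_b.unsqueeze(1)  # (out, r): scale each output row

        lora_out = x @ A.T @ B.T
        return result + lora_out * self.scaling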

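Variant Comparison #

Variant	Core change	GPU memory	Quality	Best for
LoRA	Baseline	Medium	Baseline	General use
QLoRA	Quantization	Low	Close to baseline	Limited GPU memory
AdaLoRA	Adaptive rank	Medium	Better	Complex tasks
LoRA+	Asymmetric learning rates	Medium	Better	Faster training
DoRA	Weight decomposition	Medium	Better	High-quality requirements
LoRA-FA	Frozen A	Low	Comparable	Fewer trainable parameters
VeRA	Shared weights	Lowest	Comparable	Multi-task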
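
Part of the difference between the structural variants is simply how many parameters they train per adapted layer. The sketch below counts them for one linear layer, assuming LoRA trains both A and B, LoRA-FA trains only B, and VeRA trains one vector of length r plus one of length d_out while keeping its random matrices frozen (as in the VeRA paper). The helper name is illustrative.

```python
def trainable_params(variant: str, d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to a single d_out x d_in linear layer."""
    if variant == "lora":
        return r * d_in + d_out * r   # A (r x d_in) and B (d_out x r)
    if variant == "lora_fa":
        return d_out * r              # only B; A stays frozen
    if variant == "vera":
        return r + d_out              # two scaling vectors
    raise ValueError(f"unknown variant: {variant}")

for name in ("lora", "lora_fa", "vera"):
    print(name, trainable_params(name, d_in=4096, d_out=4096, r=8))
```

For a 4096 x 4096 layer at rank 8, LoRA-FA halves the trainable parameters of LoRA, and VeRA reduces them by more than an order of magnitude.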

Selection Guide #

text

┌─────────────────────────────────────────────────────────────┐
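│                   Variant Selection Guide                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Limited GPU memory (< 16 GB):                              │
│  └── Recommended: QLoRA                                     │
│                                                             │
│  Best possible quality:                                     │
│  └── Recommended: DoRA or AdaLoRA                           │
│                                                             │
│  Faster training:                                           │
│  └── Recommended: LoRA+                                     │
│                                                             │
│  Multi-task scenarios:                                      │
│  └── Recommended: VeRA                                      │
│                                                             │
│  General use:                                               │
│  └── Recommended: standard LoRA                             │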
│                                                             │
└─────────────────────────────────────────────────────────────┘
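
The guide can also be written as a small lookup, which is handy when a training script has to pick a default. This is only a direct encoding of the rules above; the function name, the goal labels, and the choice to check memory first are assumptions.

```python
def recommend_variant(vram_gb: float, goal: str = "general") -> str:
    """Return the variant suggested by the selection guide.

    goal is one of "quality", "speed", "multi_task", or "general".
    """
    if vram_gb < 16:
        return "QLoRA"  # the memory constraint takes priority over other goals
    by_goal = {
        "quality": "DoRA or AdaLoRA",
        "speed": "LoRA+",
        "multi_task": "VeRA",
        "general": "LoRA",
    }
    return by_goal.get(goal, "LoRA")

print(recommend_variant(12))
print(recommend_variant(24, "quality"))
print(recommend_variant(24, "speed"))
```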

Next Steps #

Now that you know the main LoRA variants, the next step is the ecosystem tooling, which covers the complete LoRA toolchain.