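Fine-Tuning Optimization #

Fine-Tuning Overview #

Fine-tuning continues training a pretrained model on a task-specific dataset so that it adapts to a particular scenario or voice.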

text

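┌─────────────────────────────────────────────────────────────┐
│            Fine-tuning vs. training from scratch            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Training from scratch:                                     │
│  ├── Needs a lot of data (10+ hours)                        │
│  ├── Long training time (days to weeks)                     │
│  ├── Needs substantial compute                              │
│  └── Best for: large-scale production projects              │
│                                                             │
│  Fine-tuning:                                               │
│  ├── Needs less data (30 minutes to 2 hours)                │
│  ├── Short training time (hours)                            │
│  ├── Low compute requirements                               │
│  └── Best for: personal projects, a specific voice          │
│                                                             │
└─────────────────────────────────────────────────────────────┘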

Preparing for Fine-Tuning #

Choosing a Base Model #

python

from TTS.api import TTS

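# List the available models.
# Note: in some newer Coqui TTS releases list_models is an instance method,
# so this call may need to be written as TTS().list_models().
models = TTS.list_models()

# Models recommended as starting points for fine-tuning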
recommended_models = [
    "tts_models/en/ljspeech/vits",
    "tts_models/en/vctk/vits",
    "tts_models/multilingual/multi-dataset/xtts_v2",
]

for model in recommended_models:
    print(model)

Data Requirements #

text

┌─────────────────────────────────────────────────────────────┐
│                Fine-tuning data requirements                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Minimum:                                                   │
│  ├── Audio duration: 30 minutes                             │
│  ├── Audio quality: clear, no noise                         │
│  └── Single speaker                                         │
│                                                             │
│  Recommended:                                               │
│  ├── Audio duration: 1-2 hours                              │
│  ├── Sample rate: 22050 Hz                                  │
│  ├── Format: WAV                                            │
│  └── Recording environment: quiet, consistent               │
│                                                             │
│  Best practice:                                             │
│  ├── Audio duration: 5+ hours                               │
│  ├── Professional recording                                 │
│  ├── Varied content                                         │
│  └── Consistent volume and speaking rate                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
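The requirements above can be checked before training starts. The following is a minimal sketch using only the standard library; the helper names `summarize_dataset` and `requirement_tier` and the folder `my_dataset/wavs` are hypothetical, and the `wave` module only reads uncompressed PCM WAV files.

```python
import wave
from pathlib import Path

def summarize_dataset(wav_dir, expected_rate=22050):
    """Return total duration in seconds and the files with another sample rate."""
    total_seconds = 0.0
    wrong_rate = []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            total_seconds += wav.getnframes() / rate
            if rate != expected_rate:
                wrong_rate.append(path.name)
    return total_seconds, wrong_rate

def requirement_tier(total_seconds):
    """Map a total duration onto the tiers from the requirements table."""
    hours = total_seconds / 3600
    if hours >= 5:
        return "best practice"
    if hours >= 1:
        return "recommended"
    if hours >= 0.5:
        return "minimum"
    return "insufficient"

# Hypothetical dataset folder; adjust to your own layout
total, bad_files = summarize_dataset("my_dataset/wavs")
print(requirement_tier(total), bad_files)
```

Files reported in `bad_files` would need resampling before they match the recommended 22050 Hz.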

Fine-Tuning a VITS Model #

Preparing the Configuration File #

json

{
    "model": "vits",
    "run_name": "finetuned_vits",
    "run_description": "Fine-tuned VITS model",
    
    "audio": {
        "sample_rate": 22050,
        "fft_size": 1024,
        "win_length": 1024,
        "hop_length": 256,
        "num_mels": 80
    },
    
    "datasets": [
        {
            "name": "my_voice",
            "path": "my_dataset/",
            "meta_file_train": "metadata.csv",
            "language": "en"
        }
    ],
    
    "training": {
        "batch_size": 16,
        "epochs": 100,
        "learning_rate": 0.0001,
        "save_step": 500,
        "output_path": "finetune_output/"
    }
}
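Before launching a run it can help to sanity-check a configuration like the one above. This is a rough sketch, not part of Coqui TTS: the function `check_finetune_config` is hypothetical, and the rules it applies (window no longer than the FFT size, hop no longer than the window, a positive learning rate) are common STFT and training conventions assumed here rather than taken from the library.

```python
import json

def check_finetune_config(config):
    """Return a list of human-readable problems found in the config dict."""
    problems = []
    audio = config.get("audio", {})
    training = config.get("training", {})

    if audio.get("win_length", 0) > audio.get("fft_size", 0):
        problems.append("win_length must not exceed fft_size")
    if not 0 < audio.get("hop_length", 0) <= audio.get("win_length", 0):
        problems.append("hop_length must be positive and at most win_length")
    if not config.get("datasets"):
        problems.append("at least one dataset entry is required")
    if training.get("learning_rate", 0) <= 0:
        problems.append("learning_rate must be positive")
    return problems

# A trimmed copy of the configuration shown above
example = json.loads("""
{
    "audio": {"sample_rate": 22050, "fft_size": 1024, "win_length": 1024, "hop_length": 256},
    "datasets": [{"name": "my_voice", "path": "my_dataset/"}],
    "training": {"batch_size": 16, "learning_rate": 0.0001}
}
""")
print(check_finetune_config(example))  # prints []
```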

Running the Fine-Tuning #

bash

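# Fine-tuning command. Training runs through the train_tts entry point (the plain
# `tts` command only synthesizes speech). The --coqpit.* override syntax is assumed
# from Coqui's training docs; check --help for your installed version.
python -m TTS.bin.train_tts \
    --config_path finetune_config.json \
    --restore_path ~/.local/share/tts/tts_models--en--ljspeech--vits/model_file.pth \
    --coqpit.datasets.0.path my_dataset/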

Python Fine-Tuning Script #

python

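import subprocess
import sys
from pathlib import Path

def finetune_vits(
    base_model="tts_models/en/ljspeech/vits",
    dataset_path="my_dataset",
    config_path="finetune_config.json",
    output_dir="finetune_output",
    epochs=100
):
    # Download the base model (cached after the first call)
    from TTS.api import TTS
    tts = TTS(base_model)

    # Locate the cached checkpoint (default cache directory on Linux)
    model_path = Path.home() / ".local/share/tts"
    model_path = model_path / base_model.replace("/", "--")

    # Launch training. The `tts` command only synthesizes speech, so the
    # train_tts entry point is used instead; the --coqpit.* overrides are
    # assumed from Coqui's training docs and may differ between versions.
    cmd = [
        sys.executable, "-m", "TTS.bin.train_tts",
        "--config_path", config_path,
        "--restore_path", str(model_path / "model_file.pth"),
        "--coqpit.datasets.0.path", dataset_path,
        "--coqpit.output_path", output_dir,
        "--coqpit.epochs", str(epochs),
    ]

    # check=True raises CalledProcessError if training fails
    subprocess.run(cmd, check=True)
    print(f"Fine-tuning finished; model saved in: {output_dir}")

# Usage
finetune_vits(
    base_model="tts_models/en/ljspeech/vits",
    dataset_path="my_dataset",
    config_path="finetune_config.json"
)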
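The script above hard-codes the Linux cache location. A slightly more portable lookup might look like the sketch below. The helper `cached_model_dir` is hypothetical, and the `TTS_HOME` / `XDG_DATA_HOME` handling and the macOS path are assumptions about how Coqui TTS resolves its data directory, so verify them against your installed version. Windows is not handled.

```python
import os
import sys
from pathlib import Path

def cached_model_dir(model_name, env=None, platform=None):
    """Guess where a downloaded model is cached (hypothetical helper)."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform

    if env.get("TTS_HOME"):            # assumed override variable
        base = Path(env["TTS_HOME"]).expanduser()
    elif env.get("XDG_DATA_HOME"):     # assumed XDG fallback
        base = Path(env["XDG_DATA_HOME"]).expanduser()
    elif platform == "darwin":         # assumed macOS location
        base = Path.home() / "Library" / "Application Support"
    else:                              # Linux default used in this guide
        base = Path.home() / ".local" / "share"

    # Model names map to folder names by replacing "/" with "--"
    return base / "tts" / model_name.replace("/", "--")

checkpoint = cached_model_dir("tts_models/en/ljspeech/vits") / "model_file.pth"
print(checkpoint)
```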

Fine-Tuning an XTTS Model #

XTTS Fine-Tuning Configuration #

json

{
    "model": "xtts_v2",
    "run_name": "finetuned_xtts",
    
    "audio": {
        "sample_rate": 22050,
        "output_path": "output"
    },
    
    "datasets": [
        {
            "name": "custom_voice",
            "path": "my_dataset/",
            "meta_file_train": "metadata.csv",
            "language": "en"
        }
    ],
    
    "training": {
        "batch_size": 4,
        "epochs": 10,
        "learning_rate": 0.00001,
        "save_step": 100,
        "output_path": "xtts_finetune/"
    }
}

XTTS Fine-Tuning Script #

python

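from trainer import Trainer, TrainerArgs
from TTS.config.shared_configs import BaseDatasetConfig
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.xtts import Xtts

def finetune_xtts(
    dataset_path,
    output_path,
    base_model_path=None,
    epochs=10,
    batch_size=4,
    learning_rate=1e-5
):
    # NOTE: simplified sketch. Coqui's own XTTS recipe trains through GPTTrainer
    # (TTS.tts.layers.xtts.trainer.gpt_trainer); the plain Xtts class may refuse
    # to train under the generic Trainer, so check the XTTS docs first.

    # Build the configuration
    config = XttsConfig()
    config.audio.sample_rate = 22050
    config.batch_size = batch_size
    config.epochs = epochs
    config.lr = learning_rate  # the config field is `lr`, not `learning_rate`
    config.output_path = output_path

    # Create the model
    model = Xtts.init_from_config(config)

    if base_model_path:
        # base_model_path is the directory that holds the XTTS checkpoint files
        model.load_checkpoint(config, checkpoint_dir=base_model_path)

    # Trainer expects lists of samples, not a dataset path.
    # An LJSpeech-style layout with metadata.csv is assumed here.
    dataset_config = BaseDatasetConfig(
        formatter="ljspeech",
        path=dataset_path,
        meta_file_train="metadata.csv",
        language="en",
    )
    train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

    # Initialize the trainer
    trainer = Trainer(
        TrainerArgs(),
        config,
        output_path,
        model=model,
        train_samples=train_samples,
        eval_samples=eval_samples,
    )

    # Start training
    trainer.fit()

# Usage
finetune_xtts(
    dataset_path="my_dataset/",
    output_path="xtts_finetune/",
    epochs=10
)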

Fine-Tuning Strategies #

Learning-Rate Strategy #

python

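# Suggested learning rates for fine-tuning
learning_rate_strategies = {
    "conservative": 1e-5,  # minimal change; preserves the base model's character
    "moderate": 1e-4,      # balances adaptation against preservation
    "aggressive": 1e-3,    # larger change; needs more data
}

# A small learning rate is recommended; this is the "moderate" setting
recommended_lr = 1e-4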

Layer-Freezing Strategy #

python

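import torch

def freeze_encoder(model, freeze=True, attr="encoder"):
    """Freeze (or unfreeze) the encoder submodule stored at model.<attr>."""
    for param in getattr(model, attr).parameters():
        param.requires_grad = not freeze
    print(f"Encoder {'frozen' if freeze else 'unfrozen'}")

def freeze_decoder(model, freeze=True, attr="decoder"):
    """Freeze (or unfreeze) the decoder submodule stored at model.<attr>."""
    for param in getattr(model, attr).parameters():
        param.requires_grad = not freeze
    print(f"Decoder {'frozen' if freeze else 'unfrozen'}")

# Usage: fine-tune only the decoder
from TTS.api import TTS
tts = TTS("tts_models/en/ljspeech/vits")
model = tts.synthesizer.tts_model

# Submodule names vary by architecture. The names below are assumed for Coqui's
# VITS; list the real ones with [name for name, _ in model.named_children()].
freeze_encoder(model, freeze=True, attr="text_encoder")
freeze_decoder(model, freeze=False, attr="waveform_decoder")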
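After freezing, it is worth confirming how many parameters are still trainable. The sketch below works on any object that exposes `named_parameters()` the way a PyTorch `nn.Module` does; the helper names `set_trainable` and `trainable_summary` and the prefix in the usage comment are illustrative assumptions, not Coqui TTS APIs.

```python
def set_trainable(model, prefixes, trainable):
    """Set requires_grad on every parameter whose name starts with a prefix."""
    prefixes = tuple(prefixes)
    changed = 0
    for name, param in model.named_parameters():
        if name.startswith(prefixes):
            param.requires_grad = trainable
            changed += 1
    return changed

def trainable_summary(model):
    """Return (trainable, total) parameter counts."""
    trainable = total = 0
    for _, param in model.named_parameters():
        count = param.numel()
        total += count
        if param.requires_grad:
            trainable += count
    return trainable, total

# Example with a PyTorch model (the prefix is a placeholder):
#   set_trainable(model, ["text_encoder."], trainable=False)
#   print(trainable_summary(model))
```

Selecting by name prefix avoids depending on a particular attribute such as `model.encoder`, which not every architecture defines.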

Progressive Fine-Tuning #

python

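def progressive_finetune(model, epochs_per_stage=20):
    """Progressive fine-tuning strategy.

    The commented-out train(...) calls are placeholders for your own training call.
    """
    
    # Stage 1: train only the last layers
    print("Stage 1: training the output layers")
    freeze_encoder(model, freeze=True)
    freeze_decoder(model, freeze=False)
    # train(epochs=epochs_per_stage)
    
    # Stage 2: unfreeze more layers
    print("Stage 2: training more layers")
    freeze_encoder(model, freeze=False)
    # train(epochs=epochs_per_stage)
    
    # Stage 3: fine-tune the whole model
    print("Stage 3: full-model fine-tuning")
    # Use a smaller learning rate
    # train(epochs=epochs_per_stage, lr=1e-5)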

Performance Optimization #

Data Optimization #

python

import librosa
import soundfile as sf
import numpy as np
from pathlib import Path

def optimize_dataset(input_dir, output_dir):
    """Clean up a dataset to improve training results."""
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    
    for audio_file in input_dir.glob("*.wav"):
        audio, sr = librosa.load(str(audio_file), sr=None)
        
        # 1. Peak-normalize the volume (skip silent files to avoid dividing by zero)
        peak = np.max(np.abs(audio)) if len(audio) else 0.0
        if peak == 0:
            continue
        audio = audio / peak * 0.95
        
        # 2. Trim leading and trailing silence
        audio, _ = librosa.effects.trim(audio, top_db=25)
        
        # 3. Enforce a minimum length
        min_length = sr * 1  # at least 1 second
        if len(audio) < min_length:
            continue
        
        # 4. Resample (recent librosa versions require keyword arguments here)
        if sr != 22050:
            audio = librosa.resample(audio, orig_sr=sr, target_sr=22050)
            sr = 22050
        
        # Save
        output_path = output_dir / audio_file.name
        sf.write(str(output_path), audio, sr)
        print(f"Optimized: {audio_file.name}")

# Usage
optimize_dataset("raw_audio", "optimized_dataset/wavs")

Training Optimization #

json

{
    "training": {
        "batch_size": 16,
        "gradient_accumulation_steps": 2,
        "mixed_precision": true,
        "num_loader_workers": 8,
        "pin_memory": true,
        
        "optimizer": "AdamW",
        "optimizer_params": {
            "betas": [0.9, 0.999],
            "weight_decay": 0.01
        },
        
        "lr_scheduler": "OneCycleLR",
        "lr_scheduler_params": {
            "max_lr": 0.001,
            "pct_start": 0.1
        }
    }
}
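To make these settings concrete, here is a small plain-Python sketch. It assumes that gradient accumulation multiplies the effective batch size, and that `OneCycleLR` behaves like PyTorch's scheduler with its defaults (`div_factor=25`, `final_div_factor=1e4`, cosine annealing, two phases). The function `one_cycle_lr` and the step count of 1000 are illustrative, not part of any library or of this configuration.

```python
import math

def one_cycle_lr(step, total_steps, max_lr=1e-3, pct_start=0.1,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` for a two-phase cosine one-cycle schedule."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    peak_step = pct_start * total_steps - 1  # last step of the warm-up phase
    last_step = total_steps - 1

    def cos_anneal(start, end, pct):
        # Moves smoothly from `start` (pct=0) to `end` (pct=1)
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    if step <= peak_step:
        return cos_anneal(initial_lr, max_lr, step / peak_step)
    return cos_anneal(max_lr, min_lr, (step - peak_step) / (last_step - peak_step))

# Values taken from the configuration above
batch_size = 16
gradient_accumulation_steps = 2
effective_batch_size = batch_size * gradient_accumulation_steps
print("effective batch size:", effective_batch_size)

total_steps = 1000  # assumed run length for illustration
for step in (0, 99, 500, 999):
    print(step, one_cycle_lr(step, total_steps))
```

With `pct_start` at 0.1 the rate climbs to `max_lr` during the first tenth of the run and decays for the rest. Note that a peak of 0.001 is higher than the 1e-5 to 1e-4 range suggested elsewhere in this guide, so a lower `max_lr` may be the safer choice for small datasets.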

Evaluating the Fine-Tuned Model #

Side-by-Side Comparison #

python

from pathlib import Path

from TTS.api import TTS

def compare_models(
    original_model,
    finetuned_model_path,
    config_path,
    test_texts,
    output_dir="comparison"
):
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Load the original model
    tts_original = TTS(original_model)
    
    # Load the fine-tuned model
    tts_finetuned = TTS(
        model_path=finetuned_model_path,
        config_path=config_path
    )
    
    for i, text in enumerate(test_texts):
        # Synthesize with the original model
        original_path = output_dir / f"original_{i}.wav"
        tts_original.tts_to_file(text=text, file_path=str(original_path))
        
        # Synthesize with the fine-tuned model
        finetuned_path = output_dir / f"finetuned_{i}.wav"
        tts_finetuned.tts_to_file(text=text, file_path=str(finetuned_path))
        
        print(f"Comparison {i}:")
        print(f"  Original:   {original_path}")
        print(f"  Fine-tuned: {finetuned_path}")

# Usage
compare_models(
    original_model="tts_models/en/ljspeech/vits",
    finetuned_model_path="finetune_output/best_model.pth",
    config_path="finetune_config.json",
    test_texts=[
        "Hello, this is a test.",
        "The quick brown fox jumps over the lazy dog."
    ]
)

Automatic Evaluation #

python

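import numpy as np
import soundfile as sf
from pathlib import Path
from scipy import signal

def calculate_spectral_distance(audio1_path, audio2_path):
    """Compute the RMS difference between the Welch power spectra of two files."""
    audio1, sr1 = sf.read(audio1_path)
    audio2, sr2 = sf.read(audio2_path)
    
    # The spectra are only comparable bin by bin at the same sample rate
    if sr1 != sr2:
        raise ValueError(f"Sample rates differ: {sr1} vs {sr2}")
    
    # Truncate both signals to the same length
    min_len = min(len(audio1), len(audio2))
    audio1 = audio1[:min_len]
    audio2 = audio2[:min_len]
    
    # Estimate the power spectral density of each signal
    f1, P1 = signal.welch(audio1, sr1)
    f2, P2 = signal.welch(audio2, sr2)
    
    # Root-mean-square difference between the two spectra
    distance = np.sqrt(np.mean((P1 - P2) ** 2))
    return distance

def evaluate_finetuning(original_dir, finetuned_dir):
    """Evaluate fine-tuning by comparing files that share a name in both directories."""
    original_dir = Path(original_dir)
    finetuned_dir = Path(finetuned_dir)
    
    distances = []
    for orig_file in original_dir.glob("*.wav"):
        finetuned_file = finetuned_dir / orig_file.name
        if finetuned_file.exists():
            dist = calculate_spectral_distance(str(orig_file), str(finetuned_file))
            distances.append(dist)
            print(f"{orig_file.name}: distance = {dist:.4f}")
    
    # Avoid averaging an empty list when no file names match
    if not distances:
        print("No matching file pairs found")
        return None
    
    avg_distance = np.mean(distances)
    print(f"\nAverage spectral distance: {avg_distance:.4f}")
    return avg_distance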

Fine-Tuning Best Practices #

Data Preparation #

text

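┌─────────────────────────────────────────────────────────────┐
│                 Fine-tuning best practices                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Data:                                                      │
│  ├── Use high-quality recordings                            │
│  ├── Keep the recording environment consistent              │
│  ├── Cover varied content                                   │
│  └── Make sure transcripts are accurate                     │
│                                                             │
│  Training:                                                  │
│  ├── Use a small learning rate (1e-5 to 1e-4)               │
│  ├── Monitor validation loss to avoid overfitting           │
│  ├── Save checkpoints regularly                             │
│  └── Use early stopping                                     │
│                                                             │
│  Evaluation:                                                │
│  ├── Subjective listening tests                             │
│  ├── Compare against the base model                         │
│  ├── Use multiple evaluators                                │
│  └── Record problems and improvements                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘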

Avoiding Overfitting #

json

{
    "training": {
        "epochs": 50,
        "early_stopping": true,
        "early_stopping_patience": 5,
        "early_stopping_metric": "loss",
        "early_stopping_threshold": 0.001,
        
        "weight_decay": 0.01,
        "dropout": 0.1
    }
}
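The early-stopping keys above describe a simple rule: stop once the monitored loss has failed to improve by at least the threshold for a given number of consecutive checks. How a particular trainer interprets these keys is an assumption here; the class `EarlyStopping` below is a hypothetical, framework-independent sketch of that rule, and the loss values are made up for illustration.

```python
class EarlyStopping:
    """Stop when the loss has not improved by `threshold` for `patience` checks."""

    def __init__(self, patience=5, threshold=0.001):
        self.patience = patience
        self.threshold = threshold
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, loss):
        """Record one evaluation loss; return True when training should stop."""
        if loss < self.best - self.threshold:
            self.best = loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience

# Made-up validation losses that plateau around 0.70
stopper = EarlyStopping(patience=5, threshold=0.001)
losses = [1.0, 0.8, 0.7, 0.6995, 0.6999, 0.7001, 0.6998, 0.6996, 0.65]
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        print(f"Stopping at epoch {epoch}, best loss {stopper.best}")
        break
```

In a real run the checkpoint saved at the best loss would be the one to keep, since later epochs did not improve on it.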

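Next Steps #

Now that you have covered fine-tuning optimization, continue with advanced configuration to learn about distributed training and custom models.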