Models in Detail #

Model Overview #

Whisper comes in five model sizes, from tiny to large, each trading speed against accuracy.

text

┌────────────────────────────────────────────────────────────┐
│                    Whisper model family                    │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  tiny  →  base  →  small  →  medium  →  large              │
│  fastest  faster   medium    slower     slowest            │
│  lowest   lower    medium    higher     highest            │
│                                                            │
│  Speed:    fast  ←─────────────────────────→ slow          │
│  Accuracy: low   ←─────────────────────────→ high          │
│  Memory:   small ←─────────────────────────→ large         │
│                                                            │
└────────────────────────────────────────────────────────────┘

Model Specifications #

Detailed Parameter Comparison #

Model	Parameters	Encoder layers	Decoder layers	Attention heads	Width
tiny	39 M	4	4	6	384
base	74 M	6	6	8	512
small	244 M	12	12	12	768
medium	769 M	24	24	16	1024
large	1550 M	32	32	20	1280
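
The parameter counts in the table follow mostly from the layer count and width. The sketch below is a rough back-of-the-envelope estimate, not the library's own accounting. It assumes standard Transformer blocks (about 12·d² weights per encoder block, 16·d² per decoder block because of the extra cross-attention) and a multilingual vocabulary of 51865 tokens. Both the formula and `VOCAB_SIZE` are assumptions introduced here; biases, convolutions, layer norms, and positional embeddings are ignored.

```python
# Rough parameter-count estimate from layer count and width.
# Assumed per-block weights: 4*d*d for attention projections plus 8*d*d for
# the MLP; decoder blocks add one cross-attention (another 4*d*d).
VOCAB_SIZE = 51865  # assumed multilingual vocabulary size

# (layers per stack, width), copied from the table above
MODEL_SPECS = {
    "tiny": (4, 384),
    "base": (6, 512),
    "small": (12, 768),
    "medium": (24, 1024),
    "large": (32, 1280),
}


def estimate_params(n_layers: int, width: int) -> int:
    """Approximate total weight count for one encoder/decoder pair."""
    encoder = n_layers * 12 * width * width
    decoder = n_layers * 16 * width * width
    embedding = VOCAB_SIZE * width
    return encoder + decoder + embedding


for name, (n_layers, width) in MODEL_SPECS.items():
    print(f"{name}: ~{estimate_params(n_layers, width) / 1e6:.0f} M")
```

Under these assumptions, the estimates land within about 10% of the listed figures, which is enough to see that the count grows roughly with layers × width².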

Performance Metrics #

Model	Relative speed (English)	Relative speed (multilingual)	VRAM required	English WER	Multilingual WER
tiny	~32x	~32x	~1 GB	5.4%	12.2%
base	~16x	~16x	~1 GB	4.2%	10.5%
small	~6x	~6x	~2 GB	3.4%	8.5%
medium	~2x	~2x	~5 GB	2.8%	6.8%
large	1x	1x	~10 GB	2.4%	5.4%
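
The relative-speed column can be turned into a rough time estimate. This minimal sketch assumes you have measured how long the large model takes on your hardware and that the ratios in the table hold there, which they may not. The function name and the 600-second figure are hypothetical.

```python
# Relative speeds copied from the table above (large = 1x).
RELATIVE_SPEED = {"tiny": 32, "base": 16, "small": 6, "medium": 2, "large": 1}


def estimate_seconds(model_name: str, large_seconds: float) -> float:
    """Scale a measured large-model run time by the table's relative speed."""
    return large_seconds / RELATIVE_SPEED[model_name]


# Hypothetical example: large needs 600 s for some file.
for name in RELATIVE_SPEED:
    print(f"{name}: ~{estimate_seconds(name, 600.0):.0f} s")
```

Treat the result as an order-of-magnitude guide; the speed test further down gives real numbers for your machine.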

Model Versions #

large-v1 #

python

import whisper

model = whisper.load_model("large-v1")

The originally released version of the large model.

large-v2 #

python

import whisper

model = whisper.load_model("large-v2")

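An improved version with the same architecture, trained for 2.5x more epochs with added regularization, which gives better accuracy.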

large-v3 #

python

import whisper

model = whisper.load_model("large-v3")

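The newest of the three large checkpoints, with further gains in language identification and transcription. Unlike the earlier versions, it takes a 128-bin Mel spectrogram as input instead of 80 bins.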

Model Selection Guide #

Choosing by Scenario #

text

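Real-time transcription:
├── tiny: live captions, voice assistants
├── base: near-real-time transcription
└── small: low latency with higher quality

Batch processing:
├── small: general batch processing
├── medium: high-quality batch processing
└── large: highest quality requirements

Resource-constrained environments:
├── tiny: mobile and embedded devices
├── base: low-spec servers
└── small: mid-range servers

High-accuracy work:
├── medium: professional transcription
└── large: highest accuracy requirements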

Choosing by Language #

text

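English audio:
├── tiny/base: good enough for most uses
├── small: high quality
└── medium/large: highest quality

Chinese audio:
├── base: basic needs
├── small: recommended
└── medium/large: highest quality

Low-resource languages:
├── small: minimum recommended
├── medium: recommended
└── large: highest quality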

Choosing by Hardware #

text

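CPU only:
├── tiny: usable
├── base: usable (slower)
└── small and larger: not recommended

GPU (4 GB VRAM):
├── tiny: fast
├── base: fast
├── small: usable
└── medium and larger: not recommended

GPU (8 GB VRAM):
├── tiny to medium: fast
└── large: may not fit (~10 GB listed above)

GPU (12 GB+ VRAM):
├── all models: usable
└── large: recommended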
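
The hardware guidance above can be approximated in code. The sketch below picks the largest model whose VRAM figure from the performance table fits a given budget. The function name and the optional `headroom_gb` parameter are hypothetical, and the numbers are the table's approximations rather than measured values.

```python
from typing import Optional

# Approximate VRAM needs in GB, copied from the performance table,
# ordered from the largest model to the smallest.
VRAM_GB = [("large", 10), ("medium", 5), ("small", 2), ("base", 1), ("tiny", 1)]


def largest_model_that_fits(available_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the largest model whose listed VRAM need fits the budget, or None."""
    budget = available_gb - headroom_gb
    for name, needed in VRAM_GB:
        if needed <= budget:
            return name
    return None


for gb in (4, 8, 12):
    print(f"{gb} GB -> {largest_model_that_fits(gb)}")
```

Leaving some headroom is a cautious choice, since other processes on the GPU and long audio can push real usage above the listed figures.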

Loading a Model #

Basic Loading #

python

import whisper

model = whisper.load_model("base")

Specifying a Device #

python

import whisper

# Load onto the GPU
model = whisper.load_model("base", device="cuda")

# Or load onto the CPU
model = whisper.load_model("base", device="cpu")

Automatic Device Selection #

python

import whisper
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)

Specifying the Download Directory #

python

import whisper

model = whisper.load_model("base", download_root="/path/to/models")

Loading a Specific Version #

python

import whisper

model_v1 = whisper.load_model("large-v1")
model_v2 = whisper.load_model("large-v2")
model_v3 = whisper.load_model("large-v3")

Downloading Models #

Automatic Download #

On first use, a model is downloaded automatically to the cache directory:

text

~/.cache/whisper/
├── tiny.pt
├── base.pt
├── small.pt
├── medium.pt
└── large-v3.pt
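
To see which checkpoints are already cached, something like the following may help. It is a sketch that assumes the default location shown above and, as an assumption about the library's behavior, that `XDG_CACHE_HOME` overrides `~/.cache` when set. Both function names are hypothetical, and a custom `download_root` would not be covered.

```python
import os
from pathlib import Path


def whisper_cache_dir() -> Path:
    """Assumed default cache location: $XDG_CACHE_HOME/whisper, else ~/.cache/whisper."""
    base = os.getenv("XDG_CACHE_HOME", os.path.join(os.path.expanduser("~"), ".cache"))
    return Path(base) / "whisper"


def cached_models(cache_dir: Path) -> list:
    """Return the names of the .pt checkpoints found in cache_dir, sorted."""
    if not cache_dir.is_dir():
        return []
    return sorted(p.stem for p in cache_dir.glob("*.pt"))


print(whisper_cache_dir())
print(cached_models(whisper_cache_dir()))
```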

Manual Download #

bash

mkdir -p ~/.cache/whisper

cd ~/.cache/whisper

curl -O https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt

curl -O https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt

Download Links #

Model	Download link
tiny	https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt
base	https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt
small	https://openaipublic.azureedge.net/main/whisper/models/9ecf779972d90ba49c06d968637d720dd632c55bbf19d441fb42bf17a411e794/small.pt
medium	https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt
large-v3	https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt
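
Each link above contains a 64-character hexadecimal path component just before the file name. Assuming this is the SHA-256 checksum of the checkpoint, a manually downloaded file can be checked against it. The helper names below are hypothetical, and the sketch uses only the standard library.

```python
import hashlib
from pathlib import Path


def expected_sha256(url: str) -> str:
    """Take the path component just before the file name (assumed checksum)."""
    return url.split("/")[-2]


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, url: str) -> bool:
    """True if the file's SHA-256 matches the checksum embedded in the URL."""
    return file_sha256(path) == expected_sha256(url)
```

For example, `is_intact(Path.home() / ".cache/whisper/tiny.pt", tiny_url)` would return `False` for a truncated download, where `tiny_url` is the tiny link from the table.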

Performance Testing #

Speed Test #

python

import whisper
import time

def benchmark_models(audio_path, models=("tiny", "base", "small", "medium", "large")):
    """Time transcription of one file with each model (load time excluded)."""
    results = {}

    for model_name in models:
        print(f"Testing {model_name}...")

        # Loading (and any first-time download) stays outside the timed region.
        model = whisper.load_model(model_name)

        # perf_counter is a monotonic clock suited to measuring intervals.
        start_time = time.perf_counter()
        result = model.transcribe(audio_path)
        elapsed_time = time.perf_counter() - start_time

        results[model_name] = {
            "time": elapsed_time,
            "text_length": len(result["text"]),
            "segments": len(result["segments"]),
        }

        print(f"  Elapsed: {elapsed_time:.2f}s")

    return results

results = benchmark_models("test.mp3")
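
Elapsed time is easier to compare across files when divided by the audio length. The sketch below computes a real-time factor (RTF), a common convention that this guide does not otherwise define. The function name and the example numbers are hypothetical.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical numbers: 12 s to transcribe a 60 s clip.
print(f"RTF = {real_time_factor(12.0, 60.0):.2f}")
```

A model is a candidate for the real-time scenarios listed earlier only if its RTF on your hardware stays well below 1.0.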

Accuracy Test #

python

import whisper

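def compare_models(audio_path, reference_text, models=("tiny", "base", "small", "medium")):
    """Transcribe one file with each model and report WER against a reference."""
    results = {}

    for model_name in models:
        model = whisper.load_model(model_name)
        result = model.transcribe(audio_path)

        wer = calculate_wer(reference_text, result["text"])

        results[model_name] = {
            "text": result["text"],
            "wer": wer
        }

        print(f"{model_name}: WER = {wer:.2%}")

    return results

def calculate_wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length.

    Words are split on whitespace with no case or punctuation normalization,
    so this does not work as-is for languages written without spaces.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()

    if not ref_words:
        raise ValueError("reference text must contain at least one word")

    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]

    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j

    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            if ref_words[i-1] == hyp_words[j-1]:
                d[i][j] = d[i-1][j-1]
            else:
                d[i][j] = min(
                    d[i-1][j] + 1,    # deletion
                    d[i][j-1] + 1,    # insertion
                    d[i-1][j-1] + 1   # substitution
                )

    return d[len(ref_words)][len(hyp_words)] / len(ref_words)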
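
Splitting on whitespace does not work for text written without spaces, such as Chinese. For those cases a character error rate (CER) is a common substitute. The sketch below is one possible variant, not part of the original test: it applies the same edit-distance idea to characters and ignores whitespace. The names `edit_distance` and `calculate_cer` are hypothetical.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences, keeping two rows in memory."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def calculate_cer(reference: str, hypothesis: str) -> float:
    """Character error rate, with whitespace removed from both strings."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference text must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


print(f"CER = {calculate_cer('今天天气很好', '今天天汽很好'):.2%}")
```

In practice, punctuation and full-width/half-width differences would also need normalizing before the comparison; that step is left out here.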

Memory Usage Test #

python

import whisper
import torch

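def check_memory_usage(model_name):
    """Report peak GPU memory while loading a model and transcribing one file."""
    if torch.cuda.is_available():
        # Release cached blocks and reset the counter so the peak
        # reflects this model only.
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()

        model = whisper.load_model(model_name, device="cuda")

        model.transcribe("test.mp3")

        peak_memory = torch.cuda.max_memory_allocated() / 1024**3
        print(f"{model_name}: peak GPU memory {peak_memory:.2f} GB")
    else:
        print("CUDA is not available")

for model_name in ["tiny", "base", "small", "medium", "large"]:
    check_memory_usage(model_name)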

Model Structure #

Architecture Diagram #

text

┌────────────────────────────────────────────────────────────┐
│                 Whisper model architecture                 │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Input audio (16000 Hz)                                    │
│         │                                                  │
│         ▼                                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │           Log-Mel spectrogram (80 x 3000)           │   │
│  └─────────────────────────────────────────────────────┘   │
│         │                                                  │
│         ▼                                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Encoder                                │   │
│  │  ┌─────────────────────────────────────────────┐    │   │
│  │  │ Conv1 (3x1, 384/512/768/1024/1280)          │    │   │
│  │  │ Conv2 (3x1, stride 2)                       │    │   │
│  │  │ Positional Embedding                        │    │   │
│  │  │ N x Transformer Block                       │    │   │
│  │  │   - Multi-Head Self-Attention               │    │   │
│  │  │   - MLP                                     │    │   │
│  │  └─────────────────────────────────────────────┘    │   │
│  └─────────────────────────────────────────────────────┘   │
│         │                                                  │
│         ▼                                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Decoder                                │   │
│  │  ┌─────────────────────────────────────────────┐    │   │
│  │  │ Token Embedding                             │    │   │
│  │  │ Positional Embedding                        │    │   │
│  │  │ N x Transformer Block                       │    │   │
│  │  │   - Masked Multi-Head Self-Attention        │    │   │
│  │  │   - Multi-Head Cross-Attention              │    │   │
│  │  │   - MLP                                     │    │   │
│  │  └─────────────────────────────────────────────┘    │   │
│  └─────────────────────────────────────────────────────┘   │
│         │                                                  │
│         ▼                                                  │
│  Output token sequence                                     │
│                                                            │
└────────────────────────────────────────────────────────────┘
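
The 3000-frame spectrogram length in the diagram follows from simple arithmetic on the audio front end. The sketch below works through it. Only the 16000 Hz sample rate and the 3000 frames come from the diagram; the 30-second window, the hop of 160 samples, and the stride of 2 in the second convolution are assumptions about Whisper's preprocessing stated here for illustration.

```python
SAMPLE_RATE = 16000   # Hz, from the diagram
CHUNK_SECONDS = 30    # assumed length of one input window
HOP_LENGTH = 160      # assumed STFT hop, in samples (10 ms)
CONV2_STRIDE = 2      # assumed stride of the second convolution

n_samples = SAMPLE_RATE * CHUNK_SECONDS          # samples per window
n_frames = n_samples // HOP_LENGTH               # spectrogram frames
n_encoder_positions = n_frames // CONV2_STRIDE   # sequence length inside the encoder

print(n_samples, n_frames, n_encoder_positions)  # 480000 3000 1500
```

Under these assumptions, every Transformer block in the encoder attends over 1500 positions regardless of model size; only the width and the number of blocks change.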

Model Parameter Details #

python

import whisper

model = whisper.load_model("base")

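print("Encoder layers:", len(model.encoder.blocks))
print("Decoder layers:", len(model.decoder.blocks))
print("Attention heads:", model.encoder.blocks[0].attn.n_head)
# dims stores separate widths: n_audio_state (encoder) and n_text_state (decoder)
print("Model width:", model.dims.n_audio_state)
print("Vocabulary size:", model.dims.n_vocab)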

Fine-Tuning #

Freezing Parameters #

python

import whisper

model = whisper.load_model("base")

# Freeze every parameter first
for param in model.parameters():
    param.requires_grad = False

# Then unfreeze the decoder so that only it is trained
for param in model.decoder.parameters():
    param.requires_grad = True

Adding a Custom Head #

python

import whisper
import torch.nn as nn

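class WhisperWithClassifier(nn.Module):
    """Whisper encoder followed by a linear classification head."""

    def __init__(self, whisper_model, num_classes):
        super().__init__()
        self.whisper = whisper_model
        # The encoder output width is dims.n_audio_state
        self.classifier = nn.Linear(whisper_model.dims.n_audio_state, num_classes)

    def forward(self, mel):
        # mel: (batch, n_mels, 3000) -> features: (batch, 1500, n_audio_state)
        features = self.whisper.encoder(mel)
        # Mean-pool over time to get one vector per clip
        pooled = features.mean(dim=1)
        logits = self.classifier(pooled)
        return logits

model = whisper.load_model("base")
classifier_model = WhisperWithClassifier(model, num_classes=10)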

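Next Steps #

Now that you know how to choose a model, continue with parameter tuning to learn how to get better transcription results.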