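Parameter Tuning #

Parameter Overview #

Whisper exposes a range of parameters that control transcription behavior. Tuning them sensibly can noticeably improve transcription quality.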

text

┌─────────────────────────────────────────────────────────────┐
│                    Parameter categories                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Decoding parameters:                                       │
│  ├── temperature (sampling temperature)                     │
│  ├── beam_size (beam search width)                          │
│  ├── best_of (number of candidates)                         │
│  └── patience (beam search patience)                        │
│                                                             │
│  Quality thresholds:                                        │
│  ├── compression_ratio_threshold (compression ratio)        │
│  ├── logprob_threshold (log probability)                    │
│  └── no_speech_threshold (no-speech probability)            │
│                                                             │
│  Context parameters:                                        │
│  ├── initial_prompt (initial prompt)                        │
│  ├── condition_on_previous_text (use previous text)         │
│  └── suppress_tokens (suppressed tokens)                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘

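Temperature #

What Is Temperature? #

Temperature controls the randomness of the model's output. Lower temperatures produce more deterministic output; higher temperatures increase diversity.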

text

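Temperature = 0.0:
├── Fully deterministic
├── Always picks the most likely token
└── Suitable for most scenarios

Temperature = 0.2 to 0.8:
├── Moderate randomness
├── May produce more natural output
└── Suitable for ambiguous audio

Temperature = 1.0:
├── High randomness
├── Highly diverse output
└── May produce unstable results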
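
To make the effect of temperature concrete, here is a minimal sketch in plain Python (no Whisper required) of how dividing scores by the temperature sharpens or flattens the resulting probabilities. The three scores are hypothetical, and treating temperature 0 as "always pick the top score" is an assumption that mirrors greedy decoding rather than a description of Whisper's internals.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Temperature 0 is treated as greedy decoding: all mass on the top score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]
for t in (0.0, 0.2, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The lower the temperature, the more probability mass lands on the top-scoring token, which is why low values give stable, repeatable transcripts.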

Usage Example #

python

import whisper

model = whisper.load_model("base")

# A single float disables fallback: every segment is decoded at that temperature
result = model.transcribe("audio.mp3", temperature=0.0)
print(f"Temperature 0.0: {result['text']}")

result = model.transcribe("audio.mp3", temperature=0.5)
print(f"Temperature 0.5: {result['text']}")

result = model.transcribe("audio.mp3", temperature=1.0)
print(f"Temperature 1.0: {result['text']}")

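Temperature Fallback #

Whisper supports a temperature fallback strategy: when decoding at a low temperature fails the quality checks (the compression ratio is too high or the average log probability is too low), it automatically retries at the next higher temperature. In the Python API, the tuple shown here is also the default value of temperature: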

python

import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "audio.mp3",
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
)

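# The temperature that was actually used is recorded per segment, not at the top level
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s] temperature used: {segment['temperature']}")
print(f"Transcription: {result['text']}")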
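
The fallback loop itself can be sketched without Whisper. The version below is a simplified illustration, not the library's code: `decode_fn` and `fake_decode` are hypothetical stand-ins for a real decoder, and the real implementation is understood to also take the no-speech probability into account.

```python
def decode_with_fallback(decode_fn, temperatures,
                         compression_ratio_threshold=2.4,
                         logprob_threshold=-1.0):
    """Try each temperature in order and stop at the first acceptable result.

    decode_fn is a hypothetical callable: temperature -> dict with
    'text', 'compression_ratio' and 'avg_logprob' keys.
    """
    result = None
    for t in temperatures:
        result = decode_fn(t)
        too_repetitive = result["compression_ratio"] > compression_ratio_threshold
        too_unlikely = result["avg_logprob"] < logprob_threshold
        if not (too_repetitive or too_unlikely):
            break  # good enough, no need for a higher temperature
    return result


def fake_decode(temperature):
    # Stand-in decoder: pretends the output loops until temperature reaches 0.4.
    looping = temperature < 0.4
    return {
        "text": "la la la la" if looping else "a plausible sentence",
        "compression_ratio": 3.1 if looping else 1.4,
        "avg_logprob": -0.4,
        "temperature": temperature,
    }

best = decode_with_fallback(fake_decode, (0.0, 0.2, 0.4, 0.6, 0.8, 1.0))
print(best["temperature"], best["text"])
```

If every temperature fails the checks, this sketch returns the last attempt, so a caller still gets some text back.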

Choosing a Temperature #

python

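import whisper

def get_temperature_for_scenario(scenario):
    # Suggested starting points; a tuple enables fallback to higher temperatures
    scenarios = {
        "clear_audio": 0.0,
        "noisy_audio": 0.2,
        "multiple_speakers": 0.3,
        "ambiguous_content": (0.0, 0.2, 0.4),
        "creative_content": (0.0, 0.4, 0.8)
    }
    # Unknown scenarios fall back to deterministic decoding
    return scenarios.get(scenario, 0.0)

model = whisper.load_model("base")
temperature = get_temperature_for_scenario("noisy_audio")
result = model.transcribe("audio.mp3", temperature=temperature)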

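Beam Search #

What Is Beam Search? #

Beam search is a decoding strategy that keeps several candidate sequences at each step and finally returns the highest-scoring one.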

text

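Greedy search (beam_size=1):
├── Keeps only the single best candidate at each step
├── Fast
└── May miss the global optimum

Beam search (beam_size=5):
├── Keeps 5 candidates at each step
├── More likely to find the best sequence
└── Slower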
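
The difference between the two strategies shows up even on a toy problem. The sketch below uses a hypothetical two-step probability table (not Whisper's model) in which the locally best first token leads to a worse overall sequence, so a beam of width 1 and a beam of width 2 reach different answers.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
NEXT_TOKEN_PROBS = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(beam_size, steps=2):
    """Keep the beam_size best prefixes at every step; return the best one."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in NEXT_TOKEN_PROBS[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    best_prefix, best_score = beams[0]
    return best_prefix, math.exp(best_score)

print(beam_search(beam_size=1))  # greedy: commits to "A" early
print(beam_search(beam_size=2))  # wider beam: keeps "B" alive
```

With a width of 1 the search commits to "A" and cannot recover; with a width of 2 it keeps "B" around long enough to discover the more probable sequence, at the cost of scoring more candidates per step.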

Usage Example #

python

import whisper

model = whisper.load_model("base")

# beam_size=1 keeps a single candidate per step, which behaves like greedy search
result_greedy = model.transcribe("audio.mp3", beam_size=1)
result_beam5 = model.transcribe("audio.mp3", beam_size=5)
result_beam10 = model.transcribe("audio.mp3", beam_size=10)

print(f"Greedy search: {result_greedy['text']}")
print(f"Beam search (5): {result_beam5['text']}")
print(f"Beam search (10): {result_beam10['text']}")

The best_of Parameter #

python

import whisper

model = whisper.load_model("base")

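result = model.transcribe(
    "audio.mp3",
    beam_size=5,  # used when decoding at temperature 0 (beam search)
    best_of=5     # used when sampling at temperature > 0: number of sampled candidates to choose from
)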

print(result["text"])

The patience Parameter #

python

import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "audio.mp3",
    beam_size=5,
    patience=1.0  # only applies to beam search; 1.0 matches conventional beam search
)

print(result["text"])

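Quality Thresholds #

Compression Ratio Threshold #

The compression ratio detects repetitive or looping output. Text that compresses unusually well is probably stuck in a loop, so a segment whose ratio exceeds the threshold is treated as a failed decode: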

python

import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "audio.mp3",
    compression_ratio_threshold=2.4
)

print("Compression ratio threshold: 2.4")
print(f"Transcription: {result['text']}")

for segment in result["segments"]:
    print(f"Compression ratio: {segment['compression_ratio']:.2f}")
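
To see why repetitive text scores high, the ratio can be computed by hand with the standard library. This sketch divides the UTF-8 byte length by the zlib-compressed length, which is believed to match how Whisper derives the value; the two sample strings are made up for illustration.

```python
import zlib

def compression_ratio(text):
    """Ratio of raw UTF-8 size to zlib-compressed size; higher means more repetitive."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

normal = "The quick brown fox jumps over the lazy dog."
looping = "thank you for watching " * 30

for label, text in (("normal", normal), ("looping", looping)):
    ratio = compression_ratio(text)
    flag = "retry at higher temperature" if ratio > 2.4 else "ok"
    print(f"{label}: {ratio:.2f} ({flag})")
```

Ordinary sentences barely compress, while a phrase repeated dozens of times shrinks dramatically, pushing the ratio well past the 2.4 threshold.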

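Log Probability Threshold #

The average log probability of the decoded tokens serves as a confidence score. A segment whose average falls below the threshold is treated as a failed decode: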

python

import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "audio.mp3",
    logprob_threshold=-1.0
)

for segment in result["segments"]:
    avg_logprob = segment["avg_logprob"]
    quality = "good" if avg_logprob > -1.0 else "poor"
    print(f"[{quality}] average log probability: {avg_logprob:.2f}")
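
Log probabilities are easier to judge once converted back to probabilities. The helper below is a small sketch that flags low-confidence segments and reports exp(avg_logprob), roughly the typical per-token probability; the segment dictionaries are hypothetical but shaped like the entries of result["segments"].

```python
import math

def flag_low_confidence(segments, logprob_threshold=-1.0):
    """Return (start, end, per-token probability) for segments below the threshold."""
    flagged = []
    for seg in segments:
        if seg["avg_logprob"] < logprob_threshold:
            # exp(avg_logprob) is roughly the geometric-mean probability per token.
            flagged.append((seg["start"], seg["end"], math.exp(seg["avg_logprob"])))
    return flagged

# Hypothetical segments shaped like entries of result["segments"].
segments = [
    {"start": 0.0, "end": 4.2, "avg_logprob": -0.25},
    {"start": 4.2, "end": 9.0, "avg_logprob": -1.40},
]
for start, end, prob in flag_low_confidence(segments):
    print(f"{start:.1f}-{end:.1f}s: about {prob:.0%} per token")
```

Seen this way, the default threshold of -1.0 corresponds to an average per-token probability of about 37%.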

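No-Speech Threshold #

The no-speech probability indicates whether an audio segment contains speech. Whisper treats a segment as silent only when this probability exceeds no_speech_threshold and the average log probability is also below logprob_threshold: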

python

import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "audio.mp3",
    no_speech_threshold=0.6
)

for segment in result["segments"]:
    no_speech_prob = segment["no_speech_prob"]
    has_speech = no_speech_prob < 0.6
    print(f"Speech detected: {'yes' if has_speech else 'no'} (no-speech probability: {no_speech_prob:.2f})")
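
The interaction between the two thresholds is easy to get wrong, so here is a sketch of the decision as a pure function. It reflects how the rule is understood to work in the reference implementation and should be read as an approximation; `should_skip_segment` is a hypothetical name, not a Whisper function.

```python
def should_skip_segment(no_speech_prob, avg_logprob,
                        no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Treat a window as silence only if it looks silent AND the decode is unconvincing."""
    if no_speech_threshold is None:
        return False
    skip = no_speech_prob > no_speech_threshold
    if logprob_threshold is not None and avg_logprob > logprob_threshold:
        skip = False  # confident text overrides the silence detector
    return skip

print(should_skip_segment(0.9, -1.5))  # looks silent, low confidence
print(should_skip_segment(0.9, -0.2))  # looks silent, but the text is confident
print(should_skip_segment(0.1, -1.5))  # clearly speech
```

Only the first case is skipped. A high no-speech probability alone is not enough when the decoded text is confident, which protects quiet but genuine speech from being dropped.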

Context Parameters #

Initial Prompt #

Provide context up front to improve transcription accuracy:

python

import whisper

model = whisper.load_model("base")

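# Prompts stay in the language of the audio (Chinese here, matching language="zh")
prompts = [
    "这是一段关于人工智能的技术演讲。",  # "This is a technical talk about artificial intelligence."
    "演讲者正在讨论机器学习和深度学习。",  # "The speaker is discussing machine learning and deep learning."
    "内容涉及神经网络、自然语言处理和计算机视觉。"  # "Topics include neural networks, NLP and computer vision."
]

for prompt in prompts:
    result = model.transcribe(
        "technical_speech.mp3",
        initial_prompt=prompt,
        language="zh"
    )
    print(f"Prompt: {prompt}")
    print(f"Result: {result['text'][:100]}...\n")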

Domain-Specific Prompts #

python

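import whisper

# Prompts stay in the language of the audio (Chinese here, matching language="zh")
domain_prompts = {
    # "A professional medical discussion covering diagnosis, treatment and medication."
    "medical": "这是一段医学领域的专业讨论，涉及诊断、治疗和药物。",
    # "A legal conversation covering contracts, regulations and litigation."
    "legal": "这是一段法律相关的对话，涉及合同、法规和诉讼。",
    # "A technical discussion covering programming, algorithms and system architecture."
    "technical": "这是一段技术讨论，涉及编程、算法和系统架构。",
    # "A financial discussion covering investment, stocks and market analysis."
    "financial": "这是一段金融领域的讨论，涉及投资、股票和市场分析。"
}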

def transcribe_with_domain(audio_path, domain):
    model = whisper.load_model("base")
    prompt = domain_prompts.get(domain, "")
    
    result = model.transcribe(
        audio_path,
        initial_prompt=prompt,
        language="zh"
    )
    return result["text"]

text = transcribe_with_domain("medical_audio.mp3", "medical")
print(text)

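Conditioning on Previous Text #

Controls whether the text decoded so far is fed back as context for the next window. Enabling it keeps wording consistent across windows; disabling it makes the model less likely to get stuck repeating an earlier mistake: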

python

import whisper

model = whisper.load_model("base")

result_with_context = model.transcribe(
    "long_audio.mp3",
    condition_on_previous_text=True
)

result_no_context = model.transcribe(
    "long_audio.mp3",
    condition_on_previous_text=False
)

print("With context:")
print(result_with_context["text"][:200])

print("\nWithout context:")
print(result_no_context["text"][:200])

Suppressing Tokens #

Prevents the model from generating specific tokens:

python

import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "audio.mp3",
    # -1 is a special value: suppress the default set of non-speech symbols
    suppress_tokens=[-1]
)

print(result["text"])

Complete Configuration Examples #

High-Quality Configuration #

python

import whisper

model = whisper.load_model("medium")

result = model.transcribe(
    "audio.mp3",
    language="zh",
    task="transcribe",
    temperature=0.0,
    beam_size=5,
    best_of=5,
    patience=1.0,
    compression_ratio_threshold=2.4,
    logprob_threshold=-1.0,
    no_speech_threshold=0.6,
    condition_on_previous_text=True,
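    # Prompt kept in Chinese to match language="zh": "This is a high-quality audio transcription task."
    initial_prompt="这是一段高质量的音频转录任务。",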
    word_timestamps=True
)

print(result["text"])

Fast Configuration #

python

import whisper

model = whisper.load_model("base")

result = model.transcribe(
    "audio.mp3",
    language="zh",
    temperature=0.0,
    beam_size=1,
    best_of=1,
    condition_on_previous_text=False
)

print(result["text"])

Noisy-Audio Configuration #

python

import whisper

model = whisper.load_model("small")

result = model.transcribe(
    "noisy_audio.mp3",
    language="zh",
    temperature=(0.0, 0.2, 0.4),
    beam_size=5,
    compression_ratio_threshold=2.4,
    logprob_threshold=-1.0,
    no_speech_threshold=0.6
)

print(result["text"])

Parameter Debugging #

Parameter Grid Search #

python

import whisper
from itertools import product

def grid_search_params(audio_path, param_grid):
    model = whisper.load_model("base")
    results = []
    
    for temp, beam in product(
        param_grid["temperature"],
        param_grid["beam_size"]
    ):
        result = model.transcribe(
            audio_path,
            temperature=temp,
            beam_size=beam
        )
        
        results.append({
            "temperature": temp,
            "beam_size": beam,
            "text": result["text"],
            "segments": len(result["segments"])
        })
    
    return results

param_grid = {
    "temperature": [0.0, 0.2, 0.4],
    "beam_size": [1, 3, 5]
}

results = grid_search_params("audio.mp3", param_grid)

for r in results:
    print(f"temp={r['temperature']}, beam={r['beam_size']}: {r['text'][:50]}...")
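
A grid search only pays off if each result can be scored. One common option, assuming a hand-checked reference transcript is available, is the error rate based on edit distance. The sketch below is self-contained; `error_rate` and the sample results are hypothetical and not part of Whisper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(reference, hypothesis, unit="word"):
    """Word error rate by default; unit='char' suits unsegmented text such as Chinese."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        ref, hyp = list(reference.replace(" ", "")), list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)

# Hypothetical grid-search output and a hand-checked reference transcript.
reference = "the model transcribes speech"
results = [
    {"temperature": 0.0, "beam_size": 1, "text": "the model transcribe speech"},
    {"temperature": 0.0, "beam_size": 5, "text": "the model transcribes speech"},
]
best = min(results, key=lambda r: error_rate(reference, r["text"]))
print(best["beam_size"], error_rate(reference, best["text"]))
```

For Chinese audio, passing unit="char" avoids depending on word segmentation.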

Comparing Parameter Effects #

python

import whisper

def compare_params(audio_path, configs):
    model = whisper.load_model("base")
    
    for name, config in configs.items():
        result = model.transcribe(audio_path, **config)
        
        print(f"\nConfiguration: {name}")
        print(f"Parameters: {config}")
        print(f"Result: {result['text'][:100]}...")
        
        # Guard against audio in which no segments were detected
        segments = result["segments"]
        if segments:
            avg_logprob = sum(s["avg_logprob"] for s in segments) / len(segments)
            print(f"Average log probability: {avg_logprob:.2f}")

configs = {
    "default": {},
    "high_quality": {"beam_size": 5, "temperature": 0.0},
    "fast": {"beam_size": 1, "temperature": 0.0},
    "noisy_audio": {"temperature": (0.0, 0.2, 0.4), "beam_size": 5}
}

compare_params("audio.mp3", configs)

DecodingOptions in Detail #

Using DecodingOptions #

python

import whisper

model = whisper.load_model("base")

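# DecodingOptions configures the decoding of a single 30-second window.
# Arguments that belong to transcribe() (initial_prompt, condition_on_previous_text,
# the quality thresholds) are not accepted here; the prompt is passed as `prompt`.
options = whisper.DecodingOptions(
    language="zh",
    task="transcribe",
    temperature=0.0,
    beam_size=5,  # beam search at temperature 0; cannot be combined with best_of
    patience=1.0,
    length_penalty=1.0,
    suppress_tokens=[-1],
    prompt="这是初始提示",  # "This is the initial prompt"
    fp16=model.device.type == "cuda"  # FP16 generally fails on CPU
)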

audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

result = whisper.decode(model, mel, options)
print(result.text)

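Full Parameter List #

Defaults below are for the Python API; the command-line tool sets beam_size and best_of to 5.

Parameter	Type	Default	Description
task	str	transcribe	Task type: transcribe or translate
language	str	None	Language code; auto-detected when None
temperature	float/tuple	0.0 in DecodingOptions; (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) in transcribe()	Sampling temperature, or fallback sequence
beam_size	int	None	Beam search width (used at temperature 0)
best_of	int	None	Number of sampled candidates (used at temperature > 0)
patience	float	None (equivalent to 1.0)	Beam search patience
length_penalty	float	None	Length penalty
suppress_tokens	str/list	"-1"	Tokens to suppress; -1 selects the default non-speech set
initial_prompt	str	None	Initial prompt (transcribe() only)
condition_on_previous_text	bool	True	Use previous output as context (transcribe() only)
fp16	bool	True	Use FP16
compression_ratio_threshold	float	2.4	Compression ratio threshold (transcribe() only)
logprob_threshold	float	-1.0	Log probability threshold (transcribe() only)
no_speech_threshold	float	0.6	No-speech threshold (transcribe() only)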

Next Steps #

With parameter tuning covered, continue to Performance Optimization to learn how to speed up processing!