Quick Start #
Prerequisites #
Prepare a test audio file #
You can use any audio file for testing. The following formats are supported:
text
Supported audio formats:
├── MP3 (.mp3)
├── WAV (.wav)
├── M4A (.m4a)
├── FLAC (.flac)
├── OGG (.ogg)
├── WebM (.webm)
└── MP4 (.mp4) - the audio track is extracted
Download a sample audio file #
bash
curl -o sample.mp3 https://example.com/sample.mp3
Or use an audio file of your own.
Command-Line Usage #
Basic transcription #
bash
whisper sample.mp3
Specify a model #
bash
whisper sample.mp3 --model base
whisper sample.mp3 --model small
whisper sample.mp3 --model medium
whisper sample.mp3 --model large
Specify a language #
bash
whisper sample.mp3 --language Chinese
whisper sample.mp3 --language English
whisper sample.mp3 --language Japanese
Output formats #
bash
whisper sample.mp3 --output_format txt
whisper sample.mp3 --output_format srt
whisper sample.mp3 --output_format vtt
whisper sample.mp3 --output_format json
whisper sample.mp3 --output_format tsv
Specify an output directory #
bash
whisper sample.mp3 --output_dir ./output
Translation mode #
bash
whisper sample.mp3 --task translate
Full command example #
bash
whisper sample.mp3 \
--model base \
--language Chinese \
--output_format srt \
--output_dir ./subtitles
Python API Usage #
Basic transcription #
python
import whisper
model = whisper.load_model("base")
result = model.transcribe("sample.mp3")
print(result["text"])
Specify a language #
python
import whisper
model = whisper.load_model("base")
result = model.transcribe("sample.mp3", language="zh")
print(result["text"])
Translation mode #
python
import whisper
model = whisper.load_model("base")
# There is no model.translate() method; translation is requested
# by passing task="translate" to transcribe()
result = model.transcribe("sample.mp3", task="translate")
print(result["text"])
Get detailed results #
python
import whisper
model = whisper.load_model("base")
result = model.transcribe("sample.mp3")
print("Full text:")
print(result["text"])
print("\nSegments:")
for segment in result["segments"]:
    print(f"[{segment['start']:.2f} - {segment['end']:.2f}] {segment['text']}")
Output structure #
python
result = {
    "text": "the full transcribed text",
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 0.0,
            "end": 5.0,
            "text": "segment text",
            "tokens": [50364, ...],
            "temperature": 0.0,
            "avg_logprob": -0.5,
            "compression_ratio": 1.2,
            "no_speech_prob": 0.1
        },
        ...
    ],
    "language": "zh"
}
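The per-segment quality fields above can be used to drop unreliable segments before further processing. A minimal sketch (the thresholds 0.6 and -1.0 mirror the CLI defaults listed later; `filter_segments` is an illustrative helper, not part of the Whisper API):

```python
def filter_segments(segments, no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Keep segments that are likely real speech with a plausible decoding."""
    return [
        s for s in segments
        if s["no_speech_prob"] < no_speech_threshold
        and s["avg_logprob"] > logprob_threshold
    ]

# Toy segments illustrating the three cases
segments = [
    {"text": "hello", "no_speech_prob": 0.05, "avg_logprob": -0.3},   # kept
    {"text": "(noise)", "no_speech_prob": 0.92, "avg_logprob": -0.4}, # likely silence/noise
    {"text": "???", "no_speech_prob": 0.10, "avg_logprob": -1.8},     # low-confidence decode
]
print([s["text"] for s in filter_segments(segments)])  # ['hello']
```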
Practical Examples #
Example 1: Batch transcription #
python
import whisper
import os
model = whisper.load_model("base")
audio_dir = "./audio_files"
output_dir = "./transcripts"
os.makedirs(output_dir, exist_ok=True)
for filename in os.listdir(audio_dir):
    if filename.endswith((".mp3", ".wav", ".m4a")):
        audio_path = os.path.join(audio_dir, filename)
        result = model.transcribe(audio_path)
        output_path = os.path.join(
            output_dir,
            f"{os.path.splitext(filename)[0]}.txt"
        )
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(result["text"])
        print(f"Processed: {filename}")
Example 2: Generate a subtitle file #
python
import whisper
model = whisper.load_model("base")
result = model.transcribe("video.mp3")
def write_srt(segments, output_path):
    with open(output_path, "w", encoding="utf-8") as f:
        for i, segment in enumerate(segments, 1):
            start = format_timestamp(segment["start"])
            end = format_timestamp(segment["end"])
            text = segment["text"].strip()
            f.write(f"{i}\n")
            f.write(f"{start} --> {end}\n")
            f.write(f"{text}\n\n")

def format_timestamp(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

write_srt(result["segments"], "output.srt")
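The output format list earlier also includes vtt. A WebVTT file differs from SRT only in its `WEBVTT` header, the `.` millisecond separator, and the optional cue numbers. A sketch using the same segment structure (`write_vtt` and `format_vtt_timestamp` are illustrative helpers, not Whisper functions):

```python
def format_vtt_timestamp(seconds):
    """HH:MM:SS.mmm as required by WebVTT (note '.' instead of SRT's ',')."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

def write_vtt(segments, output_path):
    with open(output_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")  # mandatory file header
        for segment in segments:
            start = format_vtt_timestamp(segment["start"])
            end = format_vtt_timestamp(segment["end"])
            f.write(f"{start} --> {end}\n")
            f.write(f"{segment['text'].strip()}\n\n")

write_vtt([{"start": 0.0, "end": 5.0, "text": " Hello"}], "output.vtt")
```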
Example 3: Language detection #
python
import whisper
model = whisper.load_model("base")
audio = whisper.load_audio("sample.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
detected_language = max(probs, key=probs.get)
print(f"Detected language: {detected_language}")
print(f"Confidence: {probs[detected_language]:.2%}")
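Since `detect_language` returns a probability for every supported language, you can also inspect the runners-up, which is useful for ambiguous audio. A sketch on a hypothetical `probs` dict (`top_languages` is an illustrative helper):

```python
def top_languages(probs, n=3):
    """Return the n most probable language codes, highest probability first."""
    return sorted(probs, key=probs.get, reverse=True)[:n]

# Hypothetical probabilities, shaped like detect_language's output
probs = {"zh": 0.81, "en": 0.12, "ja": 0.05, "ko": 0.02}
print(top_languages(probs))  # ['zh', 'en', 'ja']
```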
Example 4: Processing long audio #
python
import whisper
model = whisper.load_model("base")
result = model.transcribe(
    "long_audio.mp3",
    language="zh",
    verbose=True
)
print(f"Total duration: {result['segments'][-1]['end']:.2f} seconds")
print(f"Number of segments: {len(result['segments'])}")
Example 5: Real-time transcription #
python
import whisper
import pyaudio
import wave
import threading
import queue
model = whisper.load_model("base")
audio_queue = queue.Queue()
def record_audio():
    CHUNK = 1024
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 16000
    RECORD_SECONDS = 5
    p = pyaudio.PyAudio()
    while True:
        stream = p.open(
            format=FORMAT,
            channels=CHANNELS,
            rate=RATE,
            input=True,
            frames_per_buffer=CHUNK
        )
        frames = []
        for _ in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
            data = stream.read(CHUNK)
            frames.append(data)
        stream.stop_stream()
        stream.close()
        audio_queue.put(b''.join(frames))

def transcribe_audio():
    import tempfile
    while True:
        audio_data = audio_queue.get()
        # Write the raw frames to a temporary WAV file for transcription
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
            wf = wave.open(f.name, 'wb')
            wf.setnchannels(1)
            wf.setsampwidth(2)
            wf.setframerate(16000)
            wf.writeframes(audio_data)
            wf.close()
            result = model.transcribe(f.name)
            print(result["text"])

record_thread = threading.Thread(target=record_audio)
transcribe_thread = threading.Thread(target=transcribe_audio)
record_thread.start()
transcribe_thread.start()
Command-Line Options #
| Option | Description | Default |
|---|---|---|
| --model | Model size | small |
| --model_dir | Model download directory | ~/.cache/whisper |
| --device | Device (cuda/cpu) | cuda |
| --output_dir | Output directory | current directory |
| --output_format | Output format | all |
| --verbose | Verbose output | False |
| --task | Task type (transcribe/translate) | transcribe |
| --language | Language code | auto-detect |
| --temperature | Sampling temperature | 0 |
| --best_of | Number of candidates when sampling | 5 |
| --beam_size | Beam size for beam search | 5 |
| --patience | Beam search patience | 1.0 |
| --length_penalty | Length penalty | 1.0 |
| --suppress_tokens | Token IDs to suppress | -1 |
| --initial_prompt | Initial prompt | None |
| --condition_on_previous_text | Condition on previous text | True |
| --fp16 | Use FP16 inference | True |
| --temperature_increment_on_fallback | Temperature increment on fallback | 0.2 |
| --compression_ratio_threshold | Compression ratio threshold | 2.4 |
| --logprob_threshold | Log-probability threshold | -1.0 |
| --no_speech_threshold | No-speech threshold | 0.6 |
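The last few options work together: decoding starts at --temperature and, whenever a segment fails the compression-ratio or log-probability checks, it is retried at successively higher temperatures up to 1.0, stepping by --temperature_increment_on_fallback. A sketch of how that retry schedule is built (an illustration of the behavior, not the library's internal code):

```python
def temperature_schedule(temperature=0.0, increment=0.2):
    """List of temperatures tried on fallback, from the start value up to 1.0."""
    temps = []
    t = temperature
    while t <= 1.0 + 1e-9:  # small epsilon guards against float drift
        temps.append(round(t, 2))
        t += increment
    return temps

print(temperature_schedule())  # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
```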
Performance Comparison #
Speed of different models #
python
import whisper
import time
models = ["tiny", "base", "small", "medium"]
audio_file = "sample.mp3"
for model_name in models:
    model = whisper.load_model(model_name)
    start_time = time.time()
    result = model.transcribe(audio_file)
    elapsed_time = time.time() - start_time
    print(f"{model_name}: {elapsed_time:.2f} seconds")
GPU vs CPU #
python
import whisper
import time
audio_file = "sample.mp3"
for device in ["cuda", "cpu"]:
    model = whisper.load_model("base", device=device)
    start_time = time.time()
    result = model.transcribe(audio_file)
    elapsed_time = time.time() - start_time
    print(f"{device}: {elapsed_time:.2f} seconds")
Next Steps #
You now know the basics of using Whisper. Continue with Speech Transcription to learn about more advanced features!
Last updated: 2026-04-05