Commit cf35b43

feat: api
1 parent 4b298c0 commit cf35b43

12 files changed, 1611 insertions(+), 1 deletion(-)

Cargo.lock

Lines changed: 357 additions & 0 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 2 additions & 1 deletion
```diff
@@ -2,7 +2,7 @@
 version = "0.0.1"
 authors = ["Project AIRI"]
 edition = "2024"
-rust-version = "1.80"
+rust-version = "1.85"
 readme = "README.md"
 homepage = "https://github.com/proj-airi/candle-examples"
 repository = "https://github.com/proj-airi/candle-examples"
@@ -16,6 +16,7 @@ members = [
   "apps/silero-vad-realtime",
   "apps/silero-vad-realtime-minimum",
   "apps/silero-vad-whisper-realtime",
+  "apps/silero-vad-whisper-realtime-api",
   "apps/whisper-realtime",
 ]
```

apps/silero-vad-whisper-realtime-api/Cargo.toml

Lines changed: 39 additions & 0 deletions

```toml
[package]
name = "silero-vad-whisper-realtime-api"
version = "0.1.0"
edition = "2024"

[dependencies]
anyhow = "1.0.98"
byteorder = "1.5.0"
candle-core = { version = "0.9.1" }
candle-nn = { version = "0.9.1" }
candle-transformers = { version = "0.9.1" }
candle-onnx = { version = "0.9.1" }
clap = { version = "4.5.38", features = ["derive"] }
cpal = "0.15.3"
hf-hub = "0.4.2"
rand = "0.9.1"
rubato = "0.16.2"
serde_json = "1.0.140"
symphonia = "0.5.4"
tokenizers = "0.21.1"
tracing-chrome = "0.7.2"
tracing-subscriber = "0.3.19"
tracing = "0.1.41"
tokio = "1.45.1"
crossbeam-channel = "0.5.15"
axum = { version = "0.8.4", features = ["multipart"] }
serde = { version = "1.0.219", features = ["derive"] }

# Server-sent events and HTTP utilities
futures = "0.3.31"
tokio-stream = "0.1.17"
axum-extra = { version = "0.9.5", features = ["typed-header"] }
tower = "0.5.1"
tower-http = { version = "0.6.2", features = ["fs", "cors"] }

[features]
default = []
metal = ["candle-core/metal", "candle-nn/metal", "candle-transformers/metal"]
cuda = ["candle-core/cuda", "candle-nn/cuda", "candle-transformers/cuda"]
```
Lines changed: 225 additions & 0 deletions
# ASR API - OpenAI Compatible Audio Transcription Service

🎤 An OpenAI-compatible audio transcription API service with real-time streaming responses (SSE), built on the Silero VAD and Whisper models.

## ✨ Features

- 🔄 **OpenAI API compatible**: fully compatible with the OpenAI `/v1/audio/transcriptions` endpoint format
- 📡 **Server-Sent Events (SSE)**: streaming responses deliver transcription results in real time
- 🎯 **Voice activity detection**: integrated Silero VAD intelligently detects speech segments
- 🧠 **Whisper transcription**: an efficient Whisper model implemented with the Candle framework
- 🚀 **High performance**: GPU acceleration (CUDA/Metal)
- 🌐 **Modern web UI**: includes a complete test page

## 🚀 Quick Start

### 1. Start the server

```bash
# Enter the project directory
cd apps/asr-api

# Fetch dependencies and run
cargo run --release
```

The server starts at `http://localhost:3000`.

### 2. Test the API

Open the test page in a browser:
```
http://localhost:3000/test.html
```

Or use curl:

```bash
# Basic transcription
curl -X POST http://localhost:3000/v1/audio/transcriptions \
  -F "file=@your_audio.wav" \
  -F "model=whisper-1"

# Streaming transcription
curl -X POST "http://localhost:3000/v1/audio/transcriptions?stream=true" \
  -F "file=@your_audio.wav" \
  -F "model=whisper-1" \
  --no-buffer
```
## 📋 API Documentation

### POST `/v1/audio/transcriptions`

Transcribes an audio file to text.

#### Request parameters

| Parameter | Type | Required | Description |
|------|------|------|------|
| `file` | File | Yes | The audio file to transcribe |
| `model` | String | No | Model name (default: "whisper-1") |
| `language` | String | No | Language of the input audio |
| `prompt` | String | No | Prompt text |
| `response_format` | String | No | Response format (default: "json") |
| `temperature` | Float | No | Sampling temperature (default: 0.0) |
| `stream` | Boolean | No | Enable streaming responses (query parameter) |

#### Supported audio formats

- WAV
- MP3
- FLAC
- M4A
- any other format supported by Symphonia

#### Response formats

**Standard response (JSON)**:
```json
{
  "text": "The transcribed text"
}
```

**Streaming response (SSE)**:
```
data: {"text": "Processing audio chunk 1 of 4...", "timestamp": 0.5}

data: {"text": "Processing audio chunk 2 of 4...", "timestamp": 1.0}

data: {"text": "The finished transcription", "timestamp": 2.5}
```

**Error response**:
```json
{
  "error": {
    "message": "Error description",
    "type": "invalid_request_error",
    "param": "file",
    "code": null
  }
}
```
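Each streaming event is a `data:` line terminated by a blank line. As a minimal illustration of that framing (not the server's actual code, which serializes a chunk struct with serde_json to get correct string escaping), a frame can be produced like this:

```rust
/// Format one transcription chunk as a Server-Sent Events frame.
/// Illustrative sketch only: `text` is assumed to need no JSON escaping,
/// and the real server would serialize with serde_json instead.
fn sse_frame(text: &str, timestamp: f64) -> String {
    // An SSE event is a `data:` line followed by a blank line.
    format!("data: {{\"text\": \"{}\", \"timestamp\": {}}}\n\n", text, timestamp)
}

fn main() {
    print!("{}", sse_frame("Processing audio chunk 1 of 4...", 0.5));
}
```

The trailing blank line is what lets SSE clients (such as the browser's `EventSource`) detect the end of each event.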
## 🛠️ Development Guide

### Project structure

```
apps/asr-api/
├── src/
│   ├── main.rs           # Main server file
│   ├── vad.rs            # VAD processor
│   ├── whisper.rs        # Whisper processor
│   └── audio_manager.rs  # Audio buffer management
├── melfilters.bytes      # Mel filterbank data
├── melfilters128.bytes   # 128-bin mel filterbank data
├── test.html             # Test page
├── Cargo.toml            # Dependency configuration
└── README.md             # Documentation
```

### Core components

1. **VAD processor**: detects voice activity with the Silero VAD model
2. **Whisper processor**: transcribes speech with the Candle-based Whisper model
3. **Audio manager**: handles audio buffering and format conversion
4. **Web server**: a high-performance HTTP server built on Axum

### Custom configuration

The following parameters can be adjusted in `AppState::new()`:

- VAD threshold (default: 0.3)
- Whisper model (default: Tiny)
- Device selection (GPU/CPU chosen automatically)
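To make the threshold concrete, here is a minimal std-only sketch (not the project's VAD code, which obtains per-frame probabilities from the Silero ONNX model) of gating speech probabilities against the default threshold of 0.3:

```rust
/// Classify each frame as speech when its VAD probability exceeds the
/// threshold. Sketch only; real probabilities come from Silero VAD.
fn speech_frames(probs: &[f32], threshold: f32) -> Vec<bool> {
    probs.iter().map(|&p| p > threshold).collect()
}

fn main() {
    // With the default threshold of 0.3, only the middle frames count as speech.
    let flags = speech_frames(&[0.1, 0.4, 0.9, 0.2], 0.3);
    println!("{:?}", flags); // [false, true, true, false]
}
```

Raising the threshold trades missed quiet speech for fewer false detections, which is the tuning knob the FAQ below refers to.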
### Adding new features

1. **More audio formats**: extend the `convert_audio_to_pcm` function
2. **Custom VAD parameters**: adjust the parameters in `VADProcessor::new`
3. **Larger Whisper models**: select a different model in `WhisperProcessor::new`
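Whatever the input container, decoded audio ultimately ends up as floating-point PCM for Whisper. As a hedged illustration of that final conversion step (the actual `convert_audio_to_pcm` also decodes and resamples via Symphonia), signed 16-bit samples map to `f32` like this:

```rust
/// Convert signed 16-bit PCM samples to f32 in [-1.0, 1.0], the form a
/// Whisper-style pipeline consumes. Sketch only: decoding and
/// resampling are handled elsewhere in the real pipeline.
fn i16_to_f32(samples: &[i16]) -> Vec<f32> {
    samples.iter().map(|&s| s as f32 / 32768.0).collect()
}

fn main() {
    let pcm = i16_to_f32(&[0, 16384, -32768]);
    println!("{:?}", pcm); // [0.0, 0.5, -1.0]
}
```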
## 🔧 Advanced Configuration

### Environment variables

```bash
# Set the log level
export RUST_LOG=debug

# Force CPU execution
export CANDLE_FORCE_CPU=1
```

### GPU acceleration

#### CUDA
```bash
cargo run --release --features cuda
```

#### Metal (macOS)
```bash
cargo run --release --features metal
```
## 📊 Performance

### Recommended setup

- **Memory**: at least 8 GB RAM
- **GPU**: NVIDIA GTX 1060 6GB+ or Apple M1+
- **Storage**: SSD recommended for faster model loading

### Batch processing tips

When processing many files:

1. Use a larger Whisper model for better quality
2. Enable GPU acceleration
3. Tune the VAD parameters to reduce false detections
## 🚨 FAQ

### Q: Transcription accuracy is poor. What can I do?
A: Try the following:
- Use a larger Whisper model (medium/large)
- Make sure the audio quality is good (16 kHz sample rate)
- Adjust the VAD threshold
- Provide the `language` parameter

### Q: Why is server startup slow?
A: The first launch downloads the model files, which is expected. The models are cached locally afterwards.

### Q: Is live microphone input supported?
A: Only file uploads are supported for now. For live speech input, see the `silero-vad-whisper-realtime` project.

### Q: How do I batch-process files?
A: Script calls to the API, or extend the code with a dedicated batch endpoint.
## 🤝 Contributing

Issues and pull requests are welcome!

1. Fork the project
2. Create a feature branch
3. Commit your changes
4. Open a pull request

## 📄 License

This project is licensed under the same terms as the parent project.

## 🙏 Acknowledgements

- [Candle](https://github.com/huggingface/candle) - high-performance ML framework
- [Axum](https://github.com/tokio-rs/axum) - modern web framework
- [OpenAI](https://openai.com/) - API design reference
- [Silero VAD](https://github.com/snakers4/silero-vad) - VAD model

---

🎯 **Tip**: the model files are downloaded automatically on first run, so make sure you have a working network connection.
Binary file (62.8 KB) not shown.

Binary file (101 KB) not shown.
Lines changed: 67 additions & 0 deletions
```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize)]
pub struct TranscriptionResponse {
    pub text: String,
}

#[derive(Debug, Serialize)]
pub struct StreamChunk {
    pub text: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub timestamp: Option<f64>,
}

#[derive(Debug, Serialize)]
pub struct ErrorResponse {
    pub error: ErrorDetail,
}

#[derive(Debug, Serialize)]
pub struct ErrorDetail {
    pub message: String,
    #[serde(rename = "type")]
    pub error_type: String,
    pub param: Option<String>,
    pub code: Option<String>,
}

pub fn default_model() -> String {
    "whisper-1".to_string()
}

pub fn default_response_format() -> String {
    "json".to_string()
}

pub fn default_temperature() -> f32 {
    0.0
}

#[derive(Debug, Deserialize)]
pub struct TranscriptionRequest {
    // The audio file itself arrives as the `file` field of the multipart
    // form, so it has no counterpart in this deserialized struct.

    /// ID of the model to use. Only whisper-1 is currently available.
    #[serde(default = "default_model")]
    pub model: String,

    /// The language of the input audio
    pub language: Option<String>,

    /// An optional text to guide the model's style or continue a previous audio segment
    pub prompt: Option<String>,

    /// The format of the transcript output
    #[serde(default = "default_response_format")]
    pub response_format: String,

    /// The sampling temperature, between 0 and 1
    #[serde(default = "default_temperature")]
    pub temperature: f32,

    /// Enable streaming response
    #[serde(default)]
    pub stream: bool,
}
```
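The `skip_serializing_if` attribute on `StreamChunk` drops the `timestamp` key entirely when the value is `None`. A std-only sketch of the two resulting JSON shapes (serde_json produces these automatically; the manual formatting here is purely illustrative and assumes `text` needs no escaping):

```rust
/// Mimic serde_json's output for StreamChunk: the timestamp field is
/// emitted only when present. For illustration only.
fn stream_chunk_json(text: &str, timestamp: Option<f64>) -> String {
    match timestamp {
        Some(t) => format!("{{\"text\":\"{}\",\"timestamp\":{}}}", text, t),
        None => format!("{{\"text\":\"{}\"}}", text),
    }
}

fn main() {
    println!("{}", stream_chunk_json("done", Some(2.5))); // {"text":"done","timestamp":2.5}
    println!("{}", stream_chunk_json("done", None));      // {"text":"done"}
}
```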
