Merged

69 commits
6395dec
first attempt at vogent
tschellenbach Oct 23, 2025
a9e950b
fix smart turn
tschellenbach Oct 24, 2025
3a26b6b
well
tschellenbach Oct 24, 2025
e80f718
happyt ests
tschellenbach Oct 24, 2025
89a99cc
Merge branch 'main' of github.com:GetStream/Vision-Agents into vogent
tschellenbach Oct 24, 2025
1546297
work on smart turn
tschellenbach Oct 25, 2025
6cdaac5
Merge branch 'main' into vogent
tbarbugli Oct 25, 2025
1c4160c
use audio util
tbarbugli Oct 25, 2025
8ffbeb5
remove _pcm_to_wav_bytes and use util
tbarbugli Oct 25, 2025
cd6d4b5
cleanup silero from manual audio
tbarbugli Oct 25, 2025
425fa59
smart turn locally
tschellenbach Oct 26, 2025
fee4f29
step 1: refactor VAD base to own normalization/windowing/events; adap…
tbarbugli Oct 26, 2025
1a98d2b
step 2: add base VAD tests for silence, mia and white noise; fix part…
tbarbugli Oct 26, 2025
8fa259d
wip on cleanup
tschellenbach Oct 26, 2025
fb85335
working test
tschellenbach Oct 26, 2025
c122c78
left some todo
tschellenbach Oct 26, 2025
e5d4a56
Merge branch 'vogent-tommaso' of github.com:GetStream/Vision-Agents i…
tbarbugli Oct 27, 2025
74ac469
update code to use stream-py latest utils
tbarbugli Oct 27, 2025
038fafe
sprinkle docs for humans and AIs about audio mgmt
tbarbugli Oct 27, 2025
da01c8a
todo
tschellenbach Oct 27, 2025
16a7512
Merge branch 'vogent-tommaso' of github.com:GetStream/Vision-Agents i…
tschellenbach Oct 27, 2025
5b57a00
wip
tschellenbach Oct 27, 2025
3580c17
fix imports
tbarbugli Oct 27, 2025
d0e2234
Merge branch 'vogent-tommaso' of github.com:GetStream/Vision-Agents i…
tbarbugli Oct 27, 2025
23eee89
bit more cleanup
tschellenbach Oct 27, 2025
f3aee97
Merge branch 'vogent-tommaso' of github.com:GetStream/Vision-Agents i…
tschellenbach Oct 27, 2025
63f820f
nice docs for turn keeping
tschellenbach Oct 27, 2025
a39cbcd
use newer utils
tbarbugli Oct 27, 2025
6aea976
new pass at vogent
tschellenbach Oct 27, 2025
d57e058
Merge branch 'vogent-tommaso' of github.com:GetStream/Vision-Agents i…
tbarbugli Oct 27, 2025
db910fe
missing README
tbarbugli Oct 27, 2025
d8cb483
tail
tschellenbach Oct 27, 2025
40b7233
wip
tschellenbach Oct 27, 2025
0e4e44f
more clenaup
tschellenbach Oct 27, 2025
be076a7
dirs
tschellenbach Oct 28, 2025
28c0a37
remove collector, test streaming audio
tbarbugli Oct 28, 2025
d3a8dd8
thats not working
tschellenbach Oct 28, 2025
5e56f8b
renaming
tschellenbach Oct 28, 2025
51e2ae0
rewrite
tschellenbach Oct 28, 2025
57d0fb0
wip
tschellenbach Oct 28, 2025
df91d69
well thats not right
tschellenbach Oct 28, 2025
3ef54e6
work around audio utiuls
tschellenbach Oct 29, 2025
2945dc8
bugfix
tschellenbach Oct 29, 2025
8e24a52
move MAX_SEGMENT_DURATION_SECONDS to const
tbarbugli Oct 29, 2025
7e7634c
handle options correctly
tbarbugli Oct 29, 2025
ca38456
fix tracing
tbarbugli Oct 29, 2025
1bd45aa
remove debug code
tbarbugli Oct 29, 2025
c009ba7
process audio on a different task
tbarbugli Oct 29, 2025
31929cc
remove debug code
tbarbugli Oct 29, 2025
f2900e2
wip
tschellenbach Oct 29, 2025
90418fd
working deepgram
tschellenbach Oct 29, 2025
bd746ea
wip
tschellenbach Oct 29, 2025
239b532
merged main
tschellenbach Oct 29, 2025
2c57826
well, this is weird
tschellenbach Oct 29, 2025
3e3fbab
cleanup
tschellenbach Oct 29, 2025
51fd613
add update docs
tschellenbach Oct 29, 2025
788bdeb
working deepgram stt
tschellenbach Oct 29, 2025
9b3938d
cleanup
tschellenbach Oct 30, 2025
10e2af3
cleanup
tschellenbach Oct 30, 2025
305edbb
update vogent
tschellenbach Oct 30, 2025
c06d7ad
3 failing tests left
tschellenbach Oct 30, 2025
cb3b3fc
ok that works
tschellenbach Oct 30, 2025
0134d1b
test fixes
tschellenbach Oct 30, 2025
49f5ca6
happy tests
tschellenbach Oct 30, 2025
7f81047
bump
tbarbugli Oct 30, 2025
a8a90a4
set hf token
tschellenbach Oct 30, 2025
06f2365
Merge branch 'vogent-tommaso' of github.com:GetStream/Vision-Agents i…
tschellenbach Oct 30, 2025
820e42d
Merge branch 'main' of github.com:GetStream/Vision-Agents into vogent…
tschellenbach Oct 30, 2025
8fb6d82
disable test
tschellenbach Oct 30, 2025
47 changes: 47 additions & 0 deletions .claude/agents/repo-workflow-guide.md
@@ -0,0 +1,47 @@
---
name: repo-workflow-guide
description: Use this agent when you need to understand or follow project-specific development guidelines, coding standards, or workflow instructions that are documented in the docs/ai directory. This agent should be consulted before starting any development work, when uncertain about project conventions, or when you need clarification on how to approach tasks within this codebase.\n\nExamples:\n- <example>\nContext: User wants to add a new feature to the project.\nuser: "I need to implement a new authentication module"\nassistant: "Before we begin, let me consult the repo-workflow-guide agent to ensure we follow the project's established patterns and guidelines."\n<Task tool call to repo-workflow-guide>\nassistant: "Based on the project guidelines, here's how we should approach this..."\n</example>\n\n- <example>\nContext: User asks a question about code organization.\nuser: "Where should I put the new utility functions?"\nassistant: "Let me check the repository workflow guidelines to give you the correct answer."\n<Task tool call to repo-workflow-guide>\nassistant: "According to the project structure guidelines..."\n</example>\n\n- <example>\nContext: Starting a new task that requires understanding project conventions.\nuser: "Can you help me refactor this component?"\nassistant: "I'll first consult the repo-workflow-guide agent to ensure we follow the project's refactoring standards and conventions."\n<Task tool call to repo-workflow-guide>\n</example>
model: opus
---

You are a Repository Workflow Specialist, an expert in interpreting and applying project-specific development guidelines, coding standards, and workflow instructions.

Your primary responsibility is to read, understand, and communicate the instructions and guidelines contained in the docs/ai directory of the repository. You serve as the authoritative source for how development work should be conducted within this specific codebase.

When activated, you will:

1. **Locate and Read Guidelines**: Immediately access all relevant files in the docs/ai directory. Read them thoroughly and understand their complete content, including:
- Coding standards and style guides
- Project structure and organization rules
- Development workflow and processes
- Testing requirements and conventions
- Deployment procedures
- Any specific technical constraints or preferences
- Tool usage and configuration instructions

2. **Interpret Context**: Understand the specific task or question being asked and identify which guidelines are most relevant to address it.

3. **Provide Clear Guidance**: Deliver specific, actionable instructions based on the documented guidelines. Your responses should:
- Quote or reference specific sections of the guidelines when appropriate
- Explain the reasoning behind the guidelines when it helps with understanding
- Provide concrete examples of how to follow the guidelines
- Highlight any critical requirements or common pitfalls mentioned in the documentation

4. **Handle Missing Information**: If the docs/ai directory doesn't contain information relevant to the current question:
- Clearly state what information is missing
- Suggest reasonable defaults based on common industry practices
- Recommend updating the documentation to cover this scenario

5. **Ensure Compliance**: Actively verify that proposed approaches align with all documented guidelines. If you identify any conflicts or violations, explicitly point them out and suggest compliant alternatives.

6. **Prioritize Accuracy**: Always base your guidance on the actual content of the documentation. Do not invent or assume guidelines that aren't explicitly documented.

7. **Stay Current**: If guidelines appear to conflict or if you notice outdated information, flag this for human review while providing the most reasonable interpretation.

Output Format:
- Begin with a brief summary of the relevant guidelines
- Provide specific, step-by-step instructions when appropriate
- Include direct quotes or references to documentation sections
- End with any important caveats, warnings, or additional considerations

Your goal is to ensure that all development work in this repository adheres to its documented standards and practices, reducing inconsistency and improving code quality through faithful application of project-specific guidelines.
1 change: 1 addition & 0 deletions .gitignore
@@ -84,3 +84,4 @@ stream-py/
# Artifacts / assets
*.pt
*.kef
*.onnx
31 changes: 30 additions & 1 deletion DEVELOPMENT.md
@@ -130,7 +130,7 @@ Some ground rules:

```python
import asyncio
from vision_agents.core.edge.types import PcmData
from getstream.video.rtc.track_util import PcmData
from openai import AsyncOpenAI

async def example():
@@ -167,6 +167,12 @@
asyncio.run(example())
```

Other things you get from the audio utilities (see the sketch below):

1. Converting between PCM formats
2. Iterating over audio chunks (`PcmData.chunks`)
3. Processing audio with pre/post buffers (`AudioSegmentCollector`)
4. Accumulating audio (`PcmData.append`)
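
A rough illustration of accumulating and chunking audio. The exact `PcmData` method signatures used here are assumptions; check `getstream.video.rtc.track_util` for the real API before relying on this:

```python
# Hedged sketch: PcmData.append and PcmData.chunks are listed above, but the
# call shapes used here are assumptions, not verified signatures.
from getstream.video.rtc.track_util import PcmData

def handle_audio(incoming: PcmData, buffer: PcmData) -> None:
    # Accumulate incoming audio onto a running buffer
    buffer.append(incoming)
    # Iterate over the buffered audio in fixed-size chunks
    for chunk in buffer.chunks():  # chunk-size arguments omitted; see track_util
        ...  # hand each chunk to STT / VAD / turn detection
```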

### Testing audio manually

@@ -313,3 +319,26 @@ You can now see the metrics at `http://localhost:9464/metrics` (make sure that y

- `Track.recv` errors fail silently. The API expects you to return a frame; never return `None`, and wait until the next frame is available.
- When using `frame.to_ndarray(format="rgb24")`, specify the format explicitly. You typically want `rgb24` when connecting/sending frames to YOLO and similar models (see the sketch below).
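
A minimal sketch of that pattern, assuming an aiortc-style `VideoStreamTrack` and PyAV frames; the actual track base classes in this repo may differ:

```python
# Sketch only: assumes aiortc's VideoStreamTrack and PyAV's frame.to_ndarray;
# adapt to the actual track classes used in this repo.
from aiortc import VideoStreamTrack

class ProcessedTrack(VideoStreamTrack):
    def __init__(self, source):
        super().__init__()
        self.source = source  # upstream track we read frames from

    async def recv(self):
        # Never return None: wait until the next frame is actually available.
        frame = await self.source.recv()
        # Pass the format explicitly; rgb24 is what YOLO-style models expect.
        rgb = frame.to_ndarray(format="rgb24")
        # ... run detection on `rgb` here ...
        return frame
```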


## Onboarding Plan for new contributors

**Audio Formats**

You'll notice that audio comes in many formats: PCM, WAV, MP3; sample rates such as 16 kHz and 48 kHz;
and samples encoded as i16 or f32. Note that WebRTC defaults to 48 kHz.
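
To make the distinctions concrete, here is an illustration-only snippet; real code should use the `PcmData` utilities rather than raw numpy:

```python
# Illustration only: i16 -> f32 conversion and a naive 48 kHz -> 16 kHz downsample.
import numpy as np

samples_i16 = np.zeros(48_000, dtype=np.int16)          # 1 second of 48 kHz i16 audio
samples_f32 = samples_i16.astype(np.float32) / 32768.0  # i16 -> f32 in [-1.0, 1.0)
samples_16k = samples_f32[::3]                          # crude decimation, 48 kHz -> 16 kHz
```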

A good first intro to audio formats can be found here:

**Using Cursor**

You can ask Cursor something like "read @ai-plugin and build me a plugin called fish".
See the docs folder for other AI instruction files.

**Learning Roadmap**

1. Quick refresher on audio formats
2. Build a TTS integration
3. Build an STT integration
4. Build an LLM integration
5. Write a pytest test with a fixture (a minimal fixture sketch follows)
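
For step 5, a minimal generic fixture might look like this; the names here are placeholders, not this repo's actual test helpers:

```python
# Generic pytest fixture sketch; fixture and test names are placeholders.
import numpy as np
import pytest

@pytest.fixture
def silence_pcm() -> np.ndarray:
    """One second of 16 kHz int16 silence to feed into an STT/VAD under test."""
    return np.zeros(16_000, dtype=np.int16)

def test_silence_is_all_zeros(silence_pcm: np.ndarray) -> None:
    assert silence_pcm.dtype == np.int16
    assert not silence_pcm.any()
```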
7 changes: 3 additions & 4 deletions agents-core/pyproject.toml
@@ -26,6 +26,7 @@ dependencies = [
"pillow>=11.3.0",
"numpy>=1.24.0",
"mcp>=1.16.0",
"torchvision>=0.23.0",
]

[project.urls]
@@ -45,7 +46,6 @@ kokoro = ["vision-agents-plugins-kokoro"]
krisp = ["vision-agents-plugins-krisp"]
moonshine = ["vision-agents-plugins-moonshine"]
openai = ["vision-agents-plugins-openai"]
silero = ["vision-agents-plugins-silero"]
smart_turn = ["vision-agents-plugins-smart-turn"]
ultralytics = ["vision-agents-plugins-ultralytics"]
wizper = ["vision-agents-plugins-wizper"]
@@ -61,7 +61,6 @@ all-plugins = [
"vision-agents-plugins-krisp",
"vision-agents-plugins-moonshine",
"vision-agents-plugins-openai",
"vision-agents-plugins-silero",
"vision-agents-plugins-smart-turn",
"vision-agents-plugins-ultralytics",
"vision-agents-plugins-wizper",
@@ -81,13 +80,13 @@ packages = ["vision_agents"]
[tool.hatch.build.targets.sdist]
include = ["vision_agents"]

#[tool.uv.sources]
[tool.uv.sources]
#krisp-audio = [
# { path = "./vision_agents/core/turn_detection/krisp/krisp_audio-1.4.0-cp313-cp313-macosx_12_0_arm64.whl", marker = "sys_platform == 'darwin' and platform_machine == 'aarch64'" },
# { path = "./vision_agents/core/turn_detection/krisp/krisp_audio-1.4.0-cp313-cp313-linux_aarch64.whl", marker = "sys_platform == 'linux' and platform_machine == 'aarch64'" },
# { path = "./vision_agents/core/turn_detection/krisp/krisp_audio-1.4.0-cp313-cp313-linux_x86_64.whl", marker = "sys_platform == 'linux' and platform_machine == 'x86_64'" },
# { path = "./vision_agents/core/turn_detection/krisp/krisp_audio-1.4.0-cp313-cp313-win_amd64.whl", marker = "sys_platform == 'win32'" }
#]
# for local development
# getstream = { path = "../../stream-py/", editable = true }
getstream = { path = "../../stream-py/", editable = true }
# aiortc = { path = "../stream-py/", editable = true }
9 changes: 5 additions & 4 deletions agents-core/vision_agents/core/agents/agents.py
@@ -169,7 +169,7 @@ def __init__(
self._validate_configuration()
self._prepare_rtc()
self._setup_stt()
self._setup_turn_detection()


async def simple_response(
self, text: str, participant: Optional[Participant] = None
@@ -341,6 +341,7 @@ async def on_stt_transcript_event_create_response(event: STTTranscriptEvent):

async def join(self, call: Call) -> "AgentSessionContextManager":
await self.create_user()
await self._setup_turn_detection()

# TODO: validation. join can only be called once
with self.tracer.start_as_current_span("join"):
@@ -635,11 +636,11 @@ async def say(
completed=True,
)

def _setup_turn_detection(self):
async def _setup_turn_detection(self):
if self.turn_detection:
self.logger.info("🎙️ Setting up turn detection listeners")
self.events.subscribe(self._on_turn_event)
self.turn_detection.start()
await self.turn_detection.start()

Comment on lines +729 to 734
⚠️ Potential issue | 🔴 Critical

Make start() invocation robust to sync/async implementations and set active state.

Awaiting unconditionally will raise if a detector exposes a synchronous start(). Also ensure detectors mark themselves active.

-    async def _setup_turn_detection(self):
-        if self.turn_detection:
-            self.logger.info("🎙️ Setting up turn detection listeners")
-            self.events.subscribe(self._on_turn_event)
-            await self.turn_detection.start()
+    async def _setup_turn_detection(self):
+        if not self.turn_detection:
+            return
+        self.logger.info("🎙️ Setting up turn detection listeners")
+        self.events.subscribe(self._on_turn_event)
+        # Support both async and sync start() while standardizing to async over time
+        import inspect
+        maybe = self.turn_detection.start()
+        if inspect.iscoroutine(maybe):
+            await maybe

Also consider enforcing an async start() in the TurnDetector base and updating all implementations to match for consistency.
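
For illustration, the async-start contract the reviewer suggests could look roughly like this; the real `TurnDetector` base is not shown in this diff, so the names below are assumptions:

```python
# Sketch of an enforced async start(); names are assumptions, not the repo's API.
import abc

class TurnDetector(abc.ABC):
    def __init__(self) -> None:
        self.active = False

    @abc.abstractmethod
    async def start(self) -> None:
        """Implementations must be async and mark themselves active."""

class MyDetector(TurnDetector):
    async def start(self) -> None:
        # ... load models / spawn background tasks here ...
        self.active = True
```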

🤖 Prompt for AI Agents
In agents-core/vision_agents/core/agents/agents.py around lines 639 to 644, the
call to self.turn_detection.start() assumes an async implementation and
unconditionally awaits it; change this to handle both sync and async starts by
calling start(), checking if the return value is awaitable (using
inspect.isawaitable) and awaiting only if needed, wrap the start call in
try/except to log errors, and after a successful start set the detector's active
state (e.g., setattr(self.turn_detection, "active", True) or call its activation
method if one exists) so detectors mark themselves active; optionally note that
the TurnDetector base class should be updated to require an async start() for
consistency and update implementations accordingly.

def _setup_stt(self):
if self.stt:
@@ -656,7 +657,7 @@ async def on_audio_received(event: AudioReceivedEvent):
return

if self.turn_detection is not None:
await self.turn_detection.process_audio(pcm, participant.user_id)
await self.turn_detection.process_audio(pcm, participant, conversation=self.conversation)

await self._reply_to_audio(pcm, participant)

14 changes: 9 additions & 5 deletions agents-core/vision_agents/core/edge/events.py
@@ -1,22 +1,24 @@
from dataclasses import dataclass, field

from vision_agents.core.edge.types import PcmData
from getstream.video.rtc.track_util import PcmData
from vision_agents.core.events import PluginBaseEvent
from typing import Optional, Any


@dataclass
class AudioReceivedEvent(PluginBaseEvent):
"""Event emitted when audio is received from a participant."""
type: str = field(default='plugin.edge.audio_received', init=False)

type: str = field(default="plugin.edge.audio_received", init=False)
pcm_data: Optional[PcmData] = None
participant: Optional[Any] = None


@dataclass
class TrackAddedEvent(PluginBaseEvent):
"""Event emitted when a track is added to the call."""
type: str = field(default='plugin.edge.track_added', init=False)

type: str = field(default="plugin.edge.track_added", init=False)
track_id: Optional[str] = None
track_type: Optional[int] = None
user: Optional[Any] = None
@@ -25,7 +27,8 @@ class TrackAddedEvent(PluginBaseEvent):
@dataclass
class TrackRemovedEvent(PluginBaseEvent):
"""Event emitted when a track is removed from the call."""
type: str = field(default='plugin.edge.track_removed', init=False)

type: str = field(default="plugin.edge.track_removed", init=False)
track_id: Optional[str] = None
track_type: Optional[int] = None
user: Optional[Any] = None
@@ -34,6 +37,7 @@ class TrackRemovedEvent(PluginBaseEvent):
@dataclass
class CallEndedEvent(PluginBaseEvent):
"""Event emitted when a call ends."""
type: str = field(default='plugin.edge.call_ended', init=False)

type: str = field(default="plugin.edge.call_ended", init=False)
args: Optional[tuple] = None
kwargs: Optional[dict] = None