Merged

69 commits
6395dec
first attempt at vogent
tschellenbach Oct 23, 2025
a9e950b
fix smart turn
tschellenbach Oct 24, 2025
3a26b6b
well
tschellenbach Oct 24, 2025
e80f718
happyt ests
tschellenbach Oct 24, 2025
89a99cc
Merge branch 'main' of github.com:GetStream/Vision-Agents into vogent
tschellenbach Oct 24, 2025
1546297
work on smart turn
tschellenbach Oct 25, 2025
6cdaac5
Merge branch 'main' into vogent
tbarbugli Oct 25, 2025
1c4160c
use audio util
tbarbugli Oct 25, 2025
8ffbeb5
remove _pcm_to_wav_bytes and use util
tbarbugli Oct 25, 2025
cd6d4b5
cleanup silero from manual audio
tbarbugli Oct 25, 2025
425fa59
smart turn locally
tschellenbach Oct 26, 2025
fee4f29
step 1: refactor VAD base to own normalization/windowing/events; adap…
tbarbugli Oct 26, 2025
1a98d2b
step 2: add base VAD tests for silence, mia and white noise; fix part…
tbarbugli Oct 26, 2025
8fa259d
wip on cleanup
tschellenbach Oct 26, 2025
fb85335
working test
tschellenbach Oct 26, 2025
c122c78
left some todo
tschellenbach Oct 26, 2025
e5d4a56
Merge branch 'vogent-tommaso' of github.com:GetStream/Vision-Agents i…
tbarbugli Oct 27, 2025
74ac469
update code to use stream-py latest utils
tbarbugli Oct 27, 2025
038fafe
sprinkle docs for humans and AIs about audio mgmt
tbarbugli Oct 27, 2025
da01c8a
todo
tschellenbach Oct 27, 2025
16a7512
Merge branch 'vogent-tommaso' of github.com:GetStream/Vision-Agents i…
tschellenbach Oct 27, 2025
5b57a00
wip
tschellenbach Oct 27, 2025
3580c17
fix imports
tbarbugli Oct 27, 2025
d0e2234
Merge branch 'vogent-tommaso' of github.com:GetStream/Vision-Agents i…
tbarbugli Oct 27, 2025
23eee89
bit more cleanup
tschellenbach Oct 27, 2025
f3aee97
Merge branch 'vogent-tommaso' of github.com:GetStream/Vision-Agents i…
tschellenbach Oct 27, 2025
63f820f
nice docs for turn keeping
tschellenbach Oct 27, 2025
a39cbcd
use newer utils
tbarbugli Oct 27, 2025
6aea976
new pass at vogent
tschellenbach Oct 27, 2025
d57e058
Merge branch 'vogent-tommaso' of github.com:GetStream/Vision-Agents i…
tbarbugli Oct 27, 2025
db910fe
missing README
tbarbugli Oct 27, 2025
d8cb483
tail
tschellenbach Oct 27, 2025
40b7233
wip
tschellenbach Oct 27, 2025
0e4e44f
more clenaup
tschellenbach Oct 27, 2025
be076a7
dirs
tschellenbach Oct 28, 2025
28c0a37
remove collector, test streaming audio
tbarbugli Oct 28, 2025
d3a8dd8
thats not working
tschellenbach Oct 28, 2025
5e56f8b
renaming
tschellenbach Oct 28, 2025
51e2ae0
rewrite
tschellenbach Oct 28, 2025
57d0fb0
wip
tschellenbach Oct 28, 2025
df91d69
well thats not right
tschellenbach Oct 28, 2025
3ef54e6
work around audio utiuls
tschellenbach Oct 29, 2025
2945dc8
bugfix
tschellenbach Oct 29, 2025
8e24a52
move MAX_SEGMENT_DURATION_SECONDS to const
tbarbugli Oct 29, 2025
7e7634c
handle options correctly
tbarbugli Oct 29, 2025
ca38456
fix tracing
tbarbugli Oct 29, 2025
1bd45aa
remove debug code
tbarbugli Oct 29, 2025
c009ba7
process audio on a different task
tbarbugli Oct 29, 2025
31929cc
remove debug code
tbarbugli Oct 29, 2025
f2900e2
wip
tschellenbach Oct 29, 2025
90418fd
working deepgram
tschellenbach Oct 29, 2025
bd746ea
wip
tschellenbach Oct 29, 2025
239b532
merged main
tschellenbach Oct 29, 2025
2c57826
well, this is weird
tschellenbach Oct 29, 2025
3e3fbab
cleanup
tschellenbach Oct 29, 2025
51fd613
add update docs
tschellenbach Oct 29, 2025
788bdeb
working deepgram stt
tschellenbach Oct 29, 2025
9b3938d
cleanup
tschellenbach Oct 30, 2025
10e2af3
cleanup
tschellenbach Oct 30, 2025
305edbb
update vogent
tschellenbach Oct 30, 2025
c06d7ad
3 failing tests left
tschellenbach Oct 30, 2025
cb3b3fc
ok that works
tschellenbach Oct 30, 2025
0134d1b
test fixes
tschellenbach Oct 30, 2025
49f5ca6
happy tests
tschellenbach Oct 30, 2025
7f81047
bump
tbarbugli Oct 30, 2025
a8a90a4
set hf token
tschellenbach Oct 30, 2025
06f2365
Merge branch 'vogent-tommaso' of github.com:GetStream/Vision-Agents i…
tschellenbach Oct 30, 2025
820e42d
Merge branch 'main' of github.com:GetStream/Vision-Agents into vogent…
tschellenbach Oct 30, 2025
8fb6d82
disable test
tschellenbach Oct 30, 2025
47 changes: 47 additions & 0 deletions .claude/agents/repo-workflow-guide.md
@@ -0,0 +1,47 @@
---
name: repo-workflow-guide
description: Use this agent when you need to understand or follow project-specific development guidelines, coding standards, or workflow instructions that are documented in the docs/ai directory. This agent should be consulted before starting any development work, when uncertain about project conventions, or when you need clarification on how to approach tasks within this codebase.\n\nExamples:\n- <example>\nContext: User wants to add a new feature to the project.\nuser: "I need to implement a new authentication module"\nassistant: "Before we begin, let me consult the repo-workflow-guide agent to ensure we follow the project's established patterns and guidelines."\n<Task tool call to repo-workflow-guide>\nassistant: "Based on the project guidelines, here's how we should approach this..."\n</example>\n\n- <example>\nContext: User asks a question about code organization.\nuser: "Where should I put the new utility functions?"\nassistant: "Let me check the repository workflow guidelines to give you the correct answer."\n<Task tool call to repo-workflow-guide>\nassistant: "According to the project structure guidelines..."\n</example>\n\n- <example>\nContext: Starting a new task that requires understanding project conventions.\nuser: "Can you help me refactor this component?"\nassistant: "I'll first consult the repo-workflow-guide agent to ensure we follow the project's refactoring standards and conventions."\n<Task tool call to repo-workflow-guide>\n</example>
model: opus
---

You are a Repository Workflow Specialist, an expert in interpreting and applying project-specific development guidelines, coding standards, and workflow instructions.

Your primary responsibility is to read, understand, and communicate the instructions and guidelines contained in the docs/ai directory of the repository. You serve as the authoritative source for how development work should be conducted within this specific codebase.

When activated, you will:

1. **Locate and Read Guidelines**: Immediately access all relevant files in the docs/ai directory. Read them thoroughly and understand their complete content, including:
- Coding standards and style guides
- Project structure and organization rules
- Development workflow and processes
- Testing requirements and conventions
- Deployment procedures
- Any specific technical constraints or preferences
- Tool usage and configuration instructions

2. **Interpret Context**: Understand the specific task or question being asked and identify which guidelines are most relevant to address it.

3. **Provide Clear Guidance**: Deliver specific, actionable instructions based on the documented guidelines. Your responses should:
- Quote or reference specific sections of the guidelines when appropriate
- Explain the reasoning behind the guidelines when it helps with understanding
- Provide concrete examples of how to follow the guidelines
- Highlight any critical requirements or common pitfalls mentioned in the documentation

4. **Handle Missing Information**: If the docs/ai directory doesn't contain information relevant to the current question:
- Clearly state what information is missing
- Suggest reasonable defaults based on common industry practices
- Recommend updating the documentation to cover this scenario

5. **Ensure Compliance**: Actively verify that proposed approaches align with all documented guidelines. If you identify any conflicts or violations, explicitly point them out and suggest compliant alternatives.

6. **Prioritize Accuracy**: Always base your guidance on the actual content of the documentation. Do not invent or assume guidelines that aren't explicitly documented.

7. **Stay Current**: If guidelines appear to conflict or if you notice outdated information, flag this for human review while providing the most reasonable interpretation.

Output Format:
- Begin with a brief summary of the relevant guidelines
- Provide specific, step-by-step instructions when appropriate
- Include direct quotes or references to documentation sections
- End with any important caveats, warnings, or additional considerations

Your goal is to ensure that all development work in this repository adheres to its documented standards and practices, reducing inconsistency and improving code quality through faithful application of project-specific guidelines.
1 change: 1 addition & 0 deletions .gitignore
@@ -84,3 +84,4 @@ stream-py/
# Artifacts / assets
*.pt
*.kef
*.onnx
31 changes: 30 additions & 1 deletion DEVELOPMENT.md
@@ -130,7 +130,7 @@ Some ground rules:

```python
import asyncio
from vision_agents.core.edge.types import PcmData
from getstream.video.rtc.track_util import PcmData
from openai import AsyncOpenAI

async def example():
@@ -167,6 +167,12 @@
asyncio.run(example())
```

Other things you get from the audio utilities (see the sketch below):

1. Converting between PCM formats
2. Iterating over audio chunks (`PcmData.chunks`)
3. Processing audio with pre/post buffers (`AudioSegmentCollector`)
4. Accumulating audio (`PcmData.append`)
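
A rough illustration of accumulating and chunking audio. The exact `PcmData` method signatures used here are assumptions; check `getstream.video.rtc.track_util` for the real API before relying on this:

```python
# Hedged sketch: PcmData.append and PcmData.chunks are listed above, but the
# call shapes used here are assumptions, not verified signatures.
from getstream.video.rtc.track_util import PcmData

def handle_audio(incoming: PcmData, buffer: PcmData) -> None:
    # Accumulate incoming audio onto a running buffer
    buffer.append(incoming)
    # Iterate over the buffered audio in fixed-size chunks
    for chunk in buffer.chunks():  # chunk-size arguments omitted; see track_util
        ...  # hand each chunk to STT / VAD / turn detection
```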

### Testing audio manually

@@ -313,3 +319,26 @@ You can now see the metrics at `http://localhost:9464/metrics` (make sure that y

- `Track.recv` errors fail silently. The API expects you to return a frame; never return `None`, and wait until the next frame is available.
- When using `frame.to_ndarray(format="rgb24")`, specify the format explicitly. You typically want `rgb24` when connecting/sending frames to YOLO and similar models (see the sketch below).
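
A minimal sketch of that pattern, assuming an aiortc-style `VideoStreamTrack` and PyAV frames; the actual track base classes in this repo may differ:

```python
# Sketch only: assumes aiortc's VideoStreamTrack and PyAV's frame.to_ndarray;
# adapt to the actual track classes used in this repo.
from aiortc import VideoStreamTrack

class ProcessedTrack(VideoStreamTrack):
    def __init__(self, source):
        super().__init__()
        self.source = source  # upstream track we read frames from

    async def recv(self):
        # Never return None: wait until the next frame is actually available.
        frame = await self.source.recv()
        # Pass the format explicitly; rgb24 is what YOLO-style models expect.
        rgb = frame.to_ndarray(format="rgb24")
        # ... run detection on `rgb` here ...
        return frame
```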


## Onboarding Plan for new contributors

**Audio Formats**

You'll notice that audio comes in many formats: PCM, WAV, MP3; sample rates such as 16 kHz and 48 kHz;
and samples encoded as i16 or f32. Note that WebRTC defaults to 48 kHz.
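
To make the distinctions concrete, here is an illustration-only snippet; real code should use the `PcmData` utilities rather than raw numpy:

```python
# Illustration only: i16 -> f32 conversion and a naive 48 kHz -> 16 kHz downsample.
import numpy as np

samples_i16 = np.zeros(48_000, dtype=np.int16)          # 1 second of 48 kHz i16 audio
samples_f32 = samples_i16.astype(np.float32) / 32768.0  # i16 -> f32 in [-1.0, 1.0)
samples_16k = samples_f32[::3]                          # crude decimation, 48 kHz -> 16 kHz
```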

A good first intro to audio formats can be found here:

**Using Cursor**

You can ask Cursor something like "read @ai-plugin and build me a plugin called fish".
See the docs folder for other AI instruction files.

**Learning Roadmap**

1. Quick refresher on audio formats
2. Build a TTS integration
3. Build an STT integration
4. Build an LLM integration
5. Write a pytest test with a fixture (a minimal fixture sketch follows)
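
For step 5, a minimal generic fixture might look like this; the names here are placeholders, not this repo's actual test helpers:

```python
# Generic pytest fixture sketch; fixture and test names are placeholders.
import numpy as np
import pytest

@pytest.fixture
def silence_pcm() -> np.ndarray:
    """One second of 16 kHz int16 silence to feed into an STT/VAD under test."""
    return np.zeros(16_000, dtype=np.int16)

def test_silence_is_all_zeros(silence_pcm: np.ndarray) -> None:
    assert silence_pcm.dtype == np.int16
    assert not silence_pcm.any()
```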
7 changes: 3 additions & 4 deletions agents-core/pyproject.toml
@@ -26,6 +26,7 @@ dependencies = [
"pillow>=11.3.0",
"numpy>=1.24.0",
"mcp>=1.16.0",
"torchvision>=0.23.0",
]

[project.urls]
@@ -45,7 +46,6 @@ kokoro = ["vision-agents-plugins-kokoro"]
krisp = ["vision-agents-plugins-krisp"]
moonshine = ["vision-agents-plugins-moonshine"]
openai = ["vision-agents-plugins-openai"]
silero = ["vision-agents-plugins-silero"]
smart_turn = ["vision-agents-plugins-smart-turn"]
ultralytics = ["vision-agents-plugins-ultralytics"]
wizper = ["vision-agents-plugins-wizper"]
@@ -61,7 +61,6 @@ all-plugins = [
"vision-agents-plugins-krisp",
"vision-agents-plugins-moonshine",
"vision-agents-plugins-openai",
"vision-agents-plugins-silero",
"vision-agents-plugins-smart-turn",
"vision-agents-plugins-ultralytics",
"vision-agents-plugins-wizper",
@@ -81,13 +80,13 @@ packages = ["vision_agents"]
[tool.hatch.build.targets.sdist]
include = ["vision_agents"]

#[tool.uv.sources]
[tool.uv.sources]
#krisp-audio = [
# { path = "./vision_agents/core/turn_detection/krisp/krisp_audio-1.4.0-cp313-cp313-macosx_12_0_arm64.whl", marker = "sys_platform == 'darwin' and platform_machine == 'aarch64'" },
# { path = "./vision_agents/core/turn_detection/krisp/krisp_audio-1.4.0-cp313-cp313-linux_aarch64.whl", marker = "sys_platform == 'linux' and platform_machine == 'aarch64'" },
# { path = "./vision_agents/core/turn_detection/krisp/krisp_audio-1.4.0-cp313-cp313-linux_x86_64.whl", marker = "sys_platform == 'linux' and platform_machine == 'x86_64'" },
# { path = "./vision_agents/core/turn_detection/krisp/krisp_audio-1.4.0-cp313-cp313-win_amd64.whl", marker = "sys_platform == 'win32'" }
#]
# for local development
# getstream = { path = "../../stream-py/", editable = true }
getstream = { path = "../../stream-py/", editable = true }
# aiortc = { path = "../stream-py/", editable = true }
9 changes: 5 additions & 4 deletions agents-core/vision_agents/core/agents/agents.py
@@ -169,7 +169,7 @@ def __init__(
self._validate_configuration()
self._prepare_rtc()
self._setup_stt()
self._setup_turn_detection()


async def simple_response(
self, text: str, participant: Optional[Participant] = None
@@ -341,6 +341,7 @@ async def on_stt_transcript_event_create_response(event: STTTranscriptEvent):

async def join(self, call: Call) -> "AgentSessionContextManager":
await self.create_user()
await self._setup_turn_detection()

# TODO: validation. join can only be called once
with self.tracer.start_as_current_span("join"):
@@ -635,11 +636,11 @@ async def say(
completed=True,
)

def _setup_turn_detection(self):
async def _setup_turn_detection(self):
if self.turn_detection:
self.logger.info("🎙️ Setting up turn detection listeners")
self.events.subscribe(self._on_turn_event)
self.turn_detection.start()
await self.turn_detection.start()

Comment on lines +729 to 734
⚠️ Potential issue | 🔴 Critical

Make start() invocation robust to sync/async implementations and set active state.

Awaiting unconditionally will raise if a detector exposes a synchronous start(). Also ensure detectors mark themselves active.

-    async def _setup_turn_detection(self):
-        if self.turn_detection:
-            self.logger.info("🎙️ Setting up turn detection listeners")
-            self.events.subscribe(self._on_turn_event)
-            await self.turn_detection.start()
+    async def _setup_turn_detection(self):
+        if not self.turn_detection:
+            return
+        self.logger.info("🎙️ Setting up turn detection listeners")
+        self.events.subscribe(self._on_turn_event)
+        # Support both async and sync start() while standardizing to async over time
+        import inspect
+        maybe = self.turn_detection.start()
+        if inspect.iscoroutine(maybe):
+            await maybe

Also consider enforcing an async start() in the TurnDetector base and updating all implementations to match for consistency.
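
For illustration, the async-start contract the reviewer suggests could look roughly like this; the real `TurnDetector` base is not shown in this diff, so the names below are assumptions:

```python
# Sketch of an enforced async start(); names are assumptions, not the repo's API.
import abc

class TurnDetector(abc.ABC):
    def __init__(self) -> None:
        self.active = False

    @abc.abstractmethod
    async def start(self) -> None:
        """Implementations must be async and mark themselves active."""

class MyDetector(TurnDetector):
    async def start(self) -> None:
        # ... load models / spawn background tasks here ...
        self.active = True
```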

🤖 Prompt for AI Agents
In agents-core/vision_agents/core/agents/agents.py around lines 639 to 644, the
call to self.turn_detection.start() assumes an async implementation and
unconditionally awaits it; change this to handle both sync and async starts by
calling start(), checking if the return value is awaitable (using
inspect.isawaitable) and awaiting only if needed, wrap the start call in
try/except to log errors, and after a successful start set the detector's active
state (e.g., setattr(self.turn_detection, "active", True) or call its activation
method if one exists) so detectors mark themselves active; optionally note that
the TurnDetector base class should be updated to require an async start() for
consistency and update implementations accordingly.

def _setup_stt(self):
if self.stt:
@@ -656,7 +657,7 @@ async def on_audio_received(event: AudioReceivedEvent):
return

if self.turn_detection is not None:
await self.turn_detection.process_audio(pcm, participant.user_id)
await self.turn_detection.process_audio(pcm, participant, conversation=self.conversation)

await self._reply_to_audio(pcm, participant)

14 changes: 9 additions & 5 deletions agents-core/vision_agents/core/edge/events.py
@@ -1,22 +1,24 @@
from dataclasses import dataclass, field

from vision_agents.core.edge.types import PcmData
from getstream.video.rtc.track_util import PcmData
from vision_agents.core.events import PluginBaseEvent
from typing import Optional, Any


@dataclass
class AudioReceivedEvent(PluginBaseEvent):
"""Event emitted when audio is received from a participant."""
type: str = field(default='plugin.edge.audio_received', init=False)

type: str = field(default="plugin.edge.audio_received", init=False)
pcm_data: Optional[PcmData] = None
participant: Optional[Any] = None


@dataclass
class TrackAddedEvent(PluginBaseEvent):
"""Event emitted when a track is added to the call."""
type: str = field(default='plugin.edge.track_added', init=False)

type: str = field(default="plugin.edge.track_added", init=False)
track_id: Optional[str] = None
track_type: Optional[int] = None
user: Optional[Any] = None
@@ -25,7 +27,8 @@ class TrackAddedEvent(PluginBaseEvent):
@dataclass
class TrackRemovedEvent(PluginBaseEvent):
"""Event emitted when a track is removed from the call."""
type: str = field(default='plugin.edge.track_removed', init=False)

type: str = field(default="plugin.edge.track_removed", init=False)
track_id: Optional[str] = None
track_type: Optional[int] = None
user: Optional[Any] = None
@@ -34,6 +37,7 @@ class TrackRemovedEvent(PluginBaseEvent):
@dataclass
class CallEndedEvent(PluginBaseEvent):
"""Event emitted when a call ends."""
type: str = field(default='plugin.edge.call_ended', init=False)

type: str = field(default="plugin.edge.call_ended", init=False)
args: Optional[tuple] = None
kwargs: Optional[dict] = None