Engineering Roadmap: 4 Critical Fixes for Production-Ready Deployment #421
Overview
This issue tracks four engineering improvements needed before MiroFish can be reliably deployed in multi-user or commercial contexts. Each is independent and can be tackled in any order.
Fix 1 — Replace OASIS Subprocess with an Observable Async Execution Layer
Problem
simulation_runner.py launches simulation scripts via subprocess.Popen and communicates with them through simulation_ipc.py, which implements IPC by writing/reading JSON files in ipc_commands/ and ipc_responses/ directories. This file-polling pattern has three concrete failure modes:
| Failure | Impact |
|---|---|
| No real-time stdout streaming | Debugging a stuck simulation requires grepping log files; the UI cannot show live progress |
| File-polling race conditions | Two concurrent simulations targeting the same directory can consume each other's command files |
| Silent crash blindness | If the subprocess exits unexpectedly, the polling loop waits silently until timeout (up to 60 s) before surfacing an error |
Fix
Replace subprocess.Popen and the filesystem IPC with asyncio.create_subprocess_exec and per-simulation asyncio.Queue pairs:
```python
# Before (simulation_runner.py)
process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# After (simulation_runner_async.py)
process = await asyncio.create_subprocess_exec(
    *cmd,
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.STDOUT,
)
async for line in process.stdout:
    await dispatch_log_event(simulation_id, line.decode())
```

Replace SimulationIPCClient / SimulationIPCServer with an AsyncSimulationBridge holding one queue pair per simulation. Crash detection becomes immediate: when the subprocess exits, process.stdout closes in the same event-loop tick, automatically rejecting all pending IPC futures.
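The queue-pair idea can be sketched in a few lines. This is an illustrative shape for the proposed AsyncSimulationBridge, not existing MiroFish code — the class and method names are assumptions:

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class AsyncSimulationBridge:
    # One (command, response) queue pair per simulation.
    commands: asyncio.Queue = field(default_factory=asyncio.Queue)
    responses: asyncio.Queue = field(default_factory=asyncio.Queue)
    closed: bool = False

    async def send_command(self, payload: dict) -> dict:
        if self.closed:
            # Set as soon as process.stdout closes, so callers fail fast
            raise RuntimeError("simulation has exited")
        await self.commands.put(payload)
        return await self.responses.get()

async def demo():
    bridge = AsyncSimulationBridge()

    async def fake_simulation():
        # Stands in for the subprocess side of the bridge
        cmd = await bridge.commands.get()
        await bridge.responses.put({"echo": cmd})

    asyncio.create_task(fake_simulation())
    return await bridge.send_command({"action": "step"})

result = asyncio.run(demo())
print(result)  # {'echo': {'action': 'step'}}
```

A real implementation would additionally drain `process.stdout` in a background task and flip `closed` (rejecting pending futures) the moment the stream ends.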
Impact
- Real-time log streaming to the frontend (SSE or WebSocket)
- Eliminates filesystem polling race conditions
- Crash detection in < 1 event-loop iteration instead of up to 60 s
- Cleaner test surface: mock a queue instead of a directory
Fix 2 — Add Auth + Input Validation Before Any Multi-User Deployment
Problem
All API endpoints in simulation.py, graph.py, and report.py have zero authentication. Any caller who knows the URL can create unlimited simulations, read any user's data by guessing a simulation_id, and inject arbitrary prompts through the /interview endpoint.
Three specific issues in config.py:
```python
SECRET_KEY = os.environ.get('SECRET_KEY', 'mirofish-secret-key')  # hardcoded fallback
DEBUG = os.environ.get('FLASK_DEBUG', 'True').lower() == 'true'   # True by default (leaks tracebacks)
# API_KEY does not exist
```

Fix
1. API key middleware using constant-time comparison to prevent timing attacks:
```python
def require_api_key(f):
    @functools.wraps(f)
    def decorated(*args, **kwargs):
        key = request.headers.get('X-API-Key') or \
              request.headers.get('Authorization', '').removeprefix('Bearer ')
        if not key or not hmac.compare_digest(
            hashlib.sha256(key.encode()).digest(),
            hashlib.sha256(Config.API_KEY.encode()).digest(),
        ):
            return jsonify({"success": False, "error": "Unauthorized"}), 401
        return f(*args, **kwargs)
    return decorated
```

2. Input validators for every parameter that arrives from HTTP:
```python
_ID_RE = re.compile(r'^[a-zA-Z0-9_-]{1,128}$')

def validate_id(value, field_name="id") -> str:
    if not _ID_RE.match(str(value)):
        raise ValueError(f"Invalid {field_name} format")
    return value

def validate_prompt(value, max_len=4000) -> str:
    if len(value.strip()) > max_len:
        raise ValueError(f"prompt exceeds {max_len} characters")
    return value.strip()
```

3. Secure config defaults — no hardcoded SECRET_KEY, DEBUG=False, API_KEY required, validate() raises on startup rather than returning a list:
```python
SECRET_KEY = os.environ.get('SECRET_KEY', '')  # no hardcoded fallback
DEBUG = os.environ.get('FLASK_DEBUG', 'false').lower() == 'true'
API_KEY = os.environ.get('API_KEY', '')

@classmethod
def validate(cls):
    missing = [k for k in ('SECRET_KEY', 'API_KEY', 'LLM_API_KEY', 'ZEP_API_KEY')
               if not getattr(cls, k)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {missing}")
```

Impact
- Prevents unauthorized access and resource exhaustion
- Eliminates prompt-injection risk via the interview endpoint
- No refactoring of business logic needed — pure middleware layer
- Required before any multi-user or internet-facing deployment
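The hash-then-compare trick in the middleware can be exercised in isolation with only the standard library; the function name here is illustrative, not part of the proposed API:

```python
import hashlib
import hmac

def key_matches(presented: str, expected: str) -> bool:
    # Hashing both sides first means compare_digest always receives
    # equal-length inputs, so comparison time leaks nothing about
    # where the presented key diverges from the expected one.
    return hmac.compare_digest(
        hashlib.sha256(presented.encode()).digest(),
        hashlib.sha256(expected.encode()).digest(),
    )

print(key_matches("secret-123", "secret-123"))  # True
print(key_matches("guess", "secret-123"))       # False
```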
Fix 3 — Resolve AGPL Exposure Before Any Commercial Wrapper is Built
Problem
MiroFish depends on OASIS (camel-ai/oasis), which is AGPL-3.0 licensed. AGPL copyleft extends to network use: any service that exposes AGPL-covered functionality over a network must publish its complete source under AGPL-3.0. Because MiroFish imports OASIS directly as a library, a closed-source SaaS wrapper built on this stack would violate AGPL unless its source is also published.
Fix Options
| Option | Cost | Source disclosure | Timeline |
|---|---|---|---|
| A — Comply with AGPL | Free | Full stack public | Immediate |
| B — Commercial license from CAMEL-AI | Negotiated | None required | Weeks |
| C — Process isolation | Engineering | OASIS service only | Weeks |
| D — Re-implement simulation layer | High | None required | Months |
Recommended immediate action regardless of option chosen:
- Add a `LICENSING.md` documenting the AGPL dependency and deployer obligations
- Add a licensing section to `README-EN.md`
Option C architecture (process isolation cleanly separates AGPL from non-AGPL code):
```
[Commercial product — no OASIS code, not AGPL-covered]
            |
      REST/gRPC API
            |
[MiroFish OASIS service — AGPL, source published]
```
Impact
- Removes legal risk before commercial development starts
- Establishes a clear architectural boundary
- Protects future investors and acquirers from open-source license liability
Fix 4 — Prompt Layer Improvements (Fastest ROI, No Refactoring)
Problem
The interview prompt prefix in simulation.py is minimal and produces shallow agent responses:
```python
INTERVIEW_PROMPT_PREFIX = "结合你的人设、所有的过往记忆与行动,不调用任何工具直接用文本回复我:"
# Roughly: "Based on your persona and all past memories and actions,
# reply to me in text directly without calling any tools:"
```

Profile generation prompts do not request behavioral anchors (posting style, active hours, opinion drift), so all agents behave with roughly uniform patterns, reducing simulation realism.
Fix
Drop-in interview prefix replacement:
```python
INTERVIEW_PROMPT_PREFIX = (
    "你是一个拥有独特背景、价值观和社交媒体历史的真实人物。"
    "请结合你的完整人设描述、所有过往记忆与社交媒体行动历史,"
    "以第一人称、自然口语化的方式直接回答下面的问题。"
    "不要调用任何工具,不要解释推理过程,"
    "不要以“作为AI”开头,直接表达你真实的想法、情绪和立场:"
)
# Roughly: "You are a real person with a unique background, values, and
# social-media history. Drawing on your full persona, all past memories,
# and your social-media action history, answer the question below in the
# first person, in a natural conversational tone. Do not call any tools,
# do not explain your reasoning, do not open with 'As an AI' — express
# your genuine thoughts, emotions, and stance directly:"
```

Add behavioral anchors to the profile generation prompt:
```yaml
behavioral_anchors:
  posting_style: terse | verbose | meme-heavy | data-driven | emotional
  active_hours: list of 0-23 integers (at least 4 hours)
  stance: supportive | opposing | neutral | observer | amplifier
  opinion_drift_rate: 0.0-1.0 (how easily they shift position under social pressure)
  influence_weight: 0.5-3.0 (opinion leaders > 2.0, lurkers < 0.8)
```
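Downstream code can enforce this schema before a generated profile enters the simulation. A minimal sketch — field names come from the schema above, but the validator itself is hypothetical:

```python
STYLES = {"terse", "verbose", "meme-heavy", "data-driven", "emotional"}
STANCES = {"supportive", "opposing", "neutral", "observer", "amplifier"}

def validate_anchors(a: dict) -> dict:
    # Reject LLM outputs that drift outside the declared ranges
    if a["posting_style"] not in STYLES:
        raise ValueError("unknown posting_style")
    hours = a["active_hours"]
    if len(hours) < 4 or not all(h in range(24) for h in hours):
        raise ValueError("active_hours needs at least 4 values in 0-23")
    if a["stance"] not in STANCES:
        raise ValueError("unknown stance")
    if not 0.0 <= a["opinion_drift_rate"] <= 1.0:
        raise ValueError("opinion_drift_rate out of range")
    if not 0.5 <= a["influence_weight"] <= 3.0:
        raise ValueError("influence_weight out of range")
    return a

ok = validate_anchors({
    "posting_style": "terse",
    "active_hours": [8, 12, 20, 22],
    "stance": "observer",
    "opinion_drift_rate": 0.2,
    "influence_weight": 2.5,
})
```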
Add archetype diversity instruction to simulation config prompt:
"Agent archetypes must be diverse: include opinion leaders (high influence, low drift), lurkers (low activity, observer stance), reactors (high drift, emotional style), and amplifiers (repost-heavy, high activity). Avoid uniform parameters — homogeneity kills simulation validity."
Impact
- Prompt-only change — zero refactoring risk
- Directly increases simulation output quality and interview response depth
- Behavioral anchors enable downstream analysis (filter by stance, segment by activity pattern)
- Archetype diversity produces richer emergent dynamics in the multi-agent simulation
Priority Ordering
| Priority | Fix | Effort | Risk if skipped |
|---|---|---|---|
| 1 | Fix 4 — Prompt improvements | Hours | Poor simulation quality |
| 2 | Fix 2 — Auth + validation | 1-2 days | Security breach on any shared deployment |
| 3 | Fix 3 — AGPL resolution | Low to high, depending on option chosen | Legal liability for commercial wrapper |
| 4 | Fix 1 — Async execution layer | 3-5 days | Fragile subprocess management at scale |
Fixes 2 and 3 are blockers for any production or commercial deployment. Fix 4 can be shipped immediately with no risk.