
Engineering Roadmap: 4 Critical Fixes for Production-Ready Deployment #421

@Insider77Circle

Description


Overview

This issue tracks four engineering improvements needed before MiroFish can be reliably deployed in multi-user or commercial contexts. Each is independent and can be tackled in any order.


Fix 1 — Replace OASIS Subprocess with an Observable Async Execution Layer

Problem

simulation_runner.py launches simulation scripts via subprocess.Popen and communicates with them through simulation_ipc.py, which implements IPC by writing/reading JSON files in ipc_commands/ and ipc_responses/ directories. This file-polling pattern has three concrete failure modes:

| Failure | Impact |
|---|---|
| No real-time stdout streaming | Debugging a stuck simulation requires grepping log files; the UI cannot show live progress |
| File-polling race conditions | Two concurrent simulations targeting the same directory can consume each other's command files |
| Silent crash blindness | If the subprocess exits unexpectedly, the polling loop waits silently until timeout (up to 60 s) before surfacing an error |

Fix

Replace subprocess.Popen and the filesystem IPC with asyncio.create_subprocess_exec and per-simulation asyncio.Queue pairs:

# Before (simulation_runner.py)
process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# After (simulation_runner_async.py)
process = await asyncio.create_subprocess_exec(
    *cmd,
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.STDOUT,
)
async for line in process.stdout:
    await dispatch_log_event(simulation_id, line.decode())

Replace SimulationIPCClient / SimulationIPCServer with an AsyncSimulationBridge holding one queue pair per simulation. Crash detection becomes immediate: when the subprocess exits, process.stdout closes in the same event-loop tick, automatically rejecting all pending IPC futures.
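A minimal sketch of that bridge, with hypothetical names (`AsyncSimulationBridge`, `send_command`, `resolve`, `on_exit`) since the issue does not pin down the interface. The real runner would additionally serialize each command to the subprocess's stdin and call `resolve` from the stdout-reader task; this sketch only shows the future bookkeeping that makes crash rejection immediate:

```python
import asyncio
import itertools


class AsyncSimulationBridge:
    """Tracks one table of in-flight command futures per simulation."""

    def __init__(self) -> None:
        self._pending: dict[str, dict[int, asyncio.Future]] = {}
        self._ids = itertools.count(1)

    def register(self, simulation_id: str) -> None:
        self._pending[simulation_id] = {}

    async def send_command(self, simulation_id: str, command: dict) -> dict:
        # The real runner would also write `command` (tagged with cmd_id)
        # to the subprocess's stdin; here we only track the awaiting caller.
        cmd_id = next(self._ids)
        command["cmd_id"] = cmd_id
        fut: asyncio.Future = asyncio.get_running_loop().create_future()
        self._pending[simulation_id][cmd_id] = fut
        return await fut

    def resolve(self, simulation_id: str, cmd_id: int, response: dict) -> None:
        # Called by the stdout reader when a tagged response line arrives.
        self._pending[simulation_id].pop(cmd_id).set_result(response)

    def on_exit(self, simulation_id: str, returncode: int) -> None:
        # Runs in the same event-loop tick in which process.stdout closes:
        # every pending command fails immediately, no 60 s polling timeout.
        for fut in self._pending.pop(simulation_id, {}).values():
            if not fut.done():
                fut.set_exception(RuntimeError(
                    f"simulation {simulation_id} exited "
                    f"with code {returncode}"))
```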

Impact

  • Real-time log streaming to the frontend (SSE or WebSocket)
  • Eliminates filesystem polling race conditions
  • Crash detection in < 1 event-loop iteration instead of up to 60 s
  • Cleaner test surface: mock a queue instead of a directory

Fix 2 — Add Auth + Input Validation Before Any Multi-User Deployment

Problem

All API endpoints in simulation.py, graph.py, and report.py have zero authentication. Any caller who knows the URL can create unlimited simulations, read any user's data by guessing a simulation_id, and inject arbitrary prompts through the /interview endpoint.

Three specific issues in config.py:

SECRET_KEY = os.environ.get('SECRET_KEY', 'mirofish-secret-key')  # hardcoded fallback
DEBUG = os.environ.get('FLASK_DEBUG', 'True').lower() == 'true'   # True by default (leaks tracebacks)
# API_KEY does not exist anywhere in config.py

Fix

1. API key middleware using constant-time comparison to prevent timing attacks:

import functools
import hashlib
import hmac

from flask import jsonify, request

def require_api_key(f):
    @functools.wraps(f)
    def decorated(*args, **kwargs):
        key = request.headers.get('X-API-Key') or \
              request.headers.get('Authorization', '').removeprefix('Bearer ')
        if not key or not hmac.compare_digest(
            hashlib.sha256(key.encode()).digest(),
            hashlib.sha256(Config.API_KEY.encode()).digest(),
        ):
            return jsonify({"success": False, "error": "Unauthorized"}), 401
        return f(*args, **kwargs)
    return decorated
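The comparison at the heart of the decorator can be exercised without Flask; `keys_match` is a hypothetical helper isolating just the constant-time check. Hashing both sides first gives `hmac.compare_digest` two equal-length byte strings, so comparison time leaks nothing about the expected key's length or content:

```python
import hashlib
import hmac

def keys_match(provided: str, expected: str) -> bool:
    # Hash both sides so compare_digest sees equal-length byte strings;
    # a direct string == comparison could short-circuit on the first
    # mismatching byte and leak timing information.
    return hmac.compare_digest(
        hashlib.sha256(provided.encode()).digest(),
        hashlib.sha256(expected.encode()).digest(),
    )
```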

2. Input validators for every parameter that arrives from HTTP:

import re

_ID_RE = re.compile(r'^[a-zA-Z0-9_-]{1,128}$')

def validate_id(value, field_name="id") -> str:
    if not _ID_RE.match(str(value)):
        raise ValueError(f"Invalid {field_name} format")
    return value

def validate_prompt(value, max_len=4000) -> str:
    if len(value.strip()) > max_len:
        raise ValueError(f"prompt exceeds {max_len} characters")
    return value.strip()
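For illustration, the ID validator repeated in self-contained form to show what the pattern catches; the sample inputs are hypothetical:

```python
import re

_ID_RE = re.compile(r'^[a-zA-Z0-9_-]{1,128}$')

def validate_id(value, field_name="id") -> str:
    # Rejects path traversal ("../"), shell/SQL metacharacters, and
    # oversized values, since only [a-zA-Z0-9_-] up to 128 chars pass.
    if not _ID_RE.match(str(value)):
        raise ValueError(f"Invalid {field_name} format")
    return value

validate_id("sim_2024-01_run3")  # accepted, returned unchanged
for bad in ("../../etc/passwd", "id; DROP TABLE", "a" * 129):
    try:
        validate_id(bad, "simulation_id")
    except ValueError:
        pass  # each of these is rejected
```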

3. Secure config defaults — no hardcoded SECRET_KEY, DEBUG=False, API_KEY required, validate() raises on startup rather than returning a list:

SECRET_KEY = os.environ.get('SECRET_KEY', '')         # no hardcoded fallback
DEBUG = os.environ.get('FLASK_DEBUG', 'false').lower() == 'true'
API_KEY = os.environ.get('API_KEY', '')

@classmethod
def validate(cls):
    missing = [k for k in ('SECRET_KEY', 'API_KEY', 'LLM_API_KEY', 'ZEP_API_KEY')
               if not getattr(cls, k)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {missing}")
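The fail-fast semantics can be exercised in isolation; this is a trimmed sketch (two keys instead of four) of the pattern above, not the project's actual config module:

```python
import os

class Config:
    # Populated from the environment at import time; empty means "missing".
    SECRET_KEY = os.environ.get('SECRET_KEY', '')
    API_KEY = os.environ.get('API_KEY', '')

    @classmethod
    def validate(cls) -> None:
        missing = [k for k in ('SECRET_KEY', 'API_KEY')
                   if not getattr(cls, k)]
        if missing:
            # Raising here aborts startup, instead of the old behavior of
            # returning a warning list and serving with insecure defaults.
            raise RuntimeError(f"Missing required env vars: {missing}")

# At process startup, before registering any routes:
# Config.validate()
```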

Impact

  • Prevents unauthorized access and resource exhaustion
  • Eliminates prompt-injection risk via the interview endpoint
  • No refactoring of business logic needed — pure middleware layer
  • Required before any multi-user or internet-facing deployment

Fix 3 — Resolve AGPL Exposure Before Any Commercial Wrapper is Built

Problem

MiroFish depends on OASIS (camel-ai/oasis), which is AGPL-3.0 licensed. AGPL copyleft extends to network use: any service that exposes AGPL-covered functionality over a network must publish its complete source under AGPL-3.0. Because MiroFish imports OASIS directly as a library, a closed-source SaaS wrapper built on this stack would violate AGPL unless its source is also published.

Fix Options

| Option | Cost | Source disclosure | Timeline |
|---|---|---|---|
| A — Comply with AGPL | Free | Full stack public | Immediate |
| B — Commercial license from CAMEL-AI | Negotiated | None required | Weeks |
| C — Process isolation | Engineering | OASIS service only | Weeks |
| D — Re-implement simulation layer | High | None required | Months |

Recommended immediate action regardless of option chosen:

  1. Add a LICENSING.md documenting the AGPL dependency and deployer obligations
  2. Add a licensing section to README-EN.md

Option C architecture (process isolation cleanly separates AGPL from non-AGPL code):

[Commercial product — no OASIS code, not AGPL-covered]
        |
   REST/gRPC API
        |
[MiroFish OASIS service — AGPL, source published]

Impact

  • Removes legal risk before commercial development starts
  • Establishes a clear architectural boundary
  • Protects future investors and acquirers from open-source license liability

Fix 4 — Prompt Layer Improvements (Fastest ROI, No Refactoring)

Problem

The interview prompt prefix in simulation.py is minimal and produces shallow agent responses:

INTERVIEW_PROMPT_PREFIX = "结合你的人设、所有的过往记忆与行动,不调用任何工具直接用文本回复我:"
# English: "Combining your persona and all of your past memories and actions,
# reply to me directly in text without calling any tools:"

Profile generation prompts do not request behavioral anchors (posting style, active hours, opinion drift), so all agents behave with roughly uniform patterns, reducing simulation realism.

Fix

Drop-in interview prefix replacement:

INTERVIEW_PROMPT_PREFIX = (
    "你是一个拥有独特背景、价值观和社交媒体历史的真实人物。"
    "请结合你的完整人设描述、所有过往记忆与社交媒体行动历史,"
    "以第一人称、自然口语化的方式直接回答下面的问题。"
    "不要调用任何工具,不要解释推理过程,"
    "不要用作为AI开头,直接表达你真实的想法、情绪和立场:"
)
# English: "You are a real person with a unique background, values, and social
# media history. Drawing on your full persona description, all past memories,
# and your social media action history, answer the question below directly,
# in the first person and in natural, colloquial language. Do not call any
# tools, do not explain your reasoning, and do not open with 'As an AI' --
# express your genuine thoughts, emotions, and stance:"

Add behavioral anchors to the profile generation prompt:

behavioral_anchors:
  posting_style: terse | verbose | meme-heavy | data-driven | emotional
  active_hours: list of 0-23 integers (at least 4 hours)
  stance: supportive | opposing | neutral | observer | amplifier
  opinion_drift_rate: 0.0-1.0 (how easily they shift position under social pressure)
  influence_weight: 0.5-3.0 (opinion leaders > 2.0, lurkers < 0.8)
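The anchor ranges above can be enforced when parsing generated profiles. A hypothetical validator (the class name and enforcement style are illustrative; only the field names and ranges come from the schema above):

```python
from dataclasses import dataclass

POSTING_STYLES = {"terse", "verbose", "meme-heavy", "data-driven", "emotional"}
STANCES = {"supportive", "opposing", "neutral", "observer", "amplifier"}

@dataclass
class BehavioralAnchors:
    posting_style: str
    active_hours: list
    stance: str
    opinion_drift_rate: float
    influence_weight: float

    def __post_init__(self) -> None:
        # Each check mirrors one line of the anchor schema.
        if self.posting_style not in POSTING_STYLES:
            raise ValueError(f"unknown posting_style: {self.posting_style}")
        if len(self.active_hours) < 4 or not all(
                isinstance(h, int) and 0 <= h <= 23
                for h in self.active_hours):
            raise ValueError("active_hours needs >= 4 integers in 0-23")
        if self.stance not in STANCES:
            raise ValueError(f"unknown stance: {self.stance}")
        if not 0.0 <= self.opinion_drift_rate <= 1.0:
            raise ValueError("opinion_drift_rate must be in [0.0, 1.0]")
        if not 0.5 <= self.influence_weight <= 3.0:
            raise ValueError("influence_weight must be in [0.5, 3.0]")
```

Rejecting out-of-range profiles at parse time keeps a single malformed LLM response from silently flattening the agent population back toward uniformity.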

Add archetype diversity instruction to simulation config prompt:

"Agent archetypes must be diverse: include opinion leaders (high influence, low drift), lurkers (low activity, observer stance), reactors (high drift, emotional style), and amplifiers (repost-heavy, high activity). Avoid uniform parameters — homogeneity kills simulation validity."

Impact

  • Prompt-only change — zero refactoring risk
  • Directly increases simulation output quality and interview response depth
  • Behavioral anchors enable downstream analysis (filter by stance, segment by activity pattern)
  • Archetype diversity produces richer emergent dynamics in the multi-agent simulation

Priority Ordering

| Priority | Fix | Effort | Risk if skipped |
|---|---|---|---|
| 1 | Fix 4 — Prompt improvements | Hours | Poor simulation quality |
| 2 | Fix 2 — Auth + validation | 1-2 days | Security breach on any shared deployment |
| 3 | Fix 3 — AGPL resolution | Low-High depending on path | Legal liability for commercial wrapper |
| 4 | Fix 1 — Async execution layer | 3-5 days | Fragile subprocess management at scale |

Fixes 2 and 3 are blockers for any production or commercial deployment. Fix 4 can be shipped immediately with no risk.
