Engineering Roadmap: 4 Critical Fixes for Production-Ready Deployment #421
Overview
This issue tracks four engineering improvements needed before MiroFish can be reliably deployed in multi-user or commercial contexts. Each is independent and can be tackled in any order.
Fix 1 — Replace OASIS Subprocess with an Observable Async Execution Layer
Problem
simulation_runner.py launches simulation scripts via subprocess.Popen and communicates with them through simulation_ipc.py, which implements IPC by writing/reading JSON files in ipc_commands/ and ipc_responses/ directories. This file-polling pattern has three concrete failure modes:
| Failure | Impact |
|---|---|
| No real-time stdout streaming | Debugging a stuck simulation requires grepping log files; the UI cannot show live progress |
| File-polling race conditions | Two concurrent simulations targeting the same directory can consume each other's command files |
| Silent crash blindness | If the subprocess exits unexpectedly, the polling loop waits silently until timeout (up to 60 s) before surfacing an error |
Fix
Replace subprocess.Popen and the filesystem IPC with asyncio.create_subprocess_exec and per-simulation asyncio.Queue pairs:
```python
# Before (simulation_runner.py)
process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# After (simulation_runner_async.py)
process = await asyncio.create_subprocess_exec(
    *cmd,
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.STDOUT,
)
async for line in process.stdout:
    await dispatch_log_event(simulation_id, line.decode())
```

Replace SimulationIPCClient / SimulationIPCServer with an AsyncSimulationBridge holding one queue pair per simulation. Crash detection becomes immediate: when the subprocess exits, process.stdout closes in the same event-loop tick, automatically rejecting all pending IPC futures.
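The queue-pair idea can be sketched in a few lines. This is an illustrative shape for the proposed AsyncSimulationBridge, not existing MiroFish code — the class and method names are assumptions:

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class AsyncSimulationBridge:
    # One (command, response) queue pair per simulation.
    commands: asyncio.Queue = field(default_factory=asyncio.Queue)
    responses: asyncio.Queue = field(default_factory=asyncio.Queue)
    closed: bool = False

    async def send_command(self, payload: dict) -> dict:
        if self.closed:
            # Set as soon as process.stdout closes, so callers fail fast
            raise RuntimeError("simulation has exited")
        await self.commands.put(payload)
        return await self.responses.get()

async def demo():
    bridge = AsyncSimulationBridge()

    async def fake_simulation():
        # Stands in for the subprocess side of the bridge
        cmd = await bridge.commands.get()
        await bridge.responses.put({"echo": cmd})

    asyncio.create_task(fake_simulation())
    return await bridge.send_command({"action": "step"})

result = asyncio.run(demo())
print(result)  # {'echo': {'action': 'step'}}
```

A real implementation would additionally drain `process.stdout` in a background task and flip `closed` (rejecting pending futures) the moment the stream ends.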
Impact
- Real-time log streaming to the frontend (SSE or WebSocket)
- Eliminates filesystem polling race conditions
- Crash detection in < 1 event-loop iteration instead of up to 60 s
- Cleaner test surface: mock a queue instead of a directory
Fix 2 — Add Auth + Input Validation Before Any Multi-User Deployment
Problem
All API endpoints in simulation.py, graph.py, and report.py have zero authentication. Any caller who knows the URL can create unlimited simulations, read any user's data by guessing a simulation_id, and inject arbitrary prompts through the /interview endpoint.
Three specific issues in config.py:
```python
SECRET_KEY = os.environ.get('SECRET_KEY', 'mirofish-secret-key')  # hardcoded fallback
DEBUG = os.environ.get('FLASK_DEBUG', 'True').lower() == 'true'   # True by default (leaks tracebacks)
# API_KEY does not exist
```

Fix
1. API key middleware using constant-time comparison to prevent timing attacks:
```python
def require_api_key(f):
    @functools.wraps(f)
    def decorated(*args, **kwargs):
        key = request.headers.get('X-API-Key') or \
              request.headers.get('Authorization', '').removeprefix('Bearer ')
        if not key or not hmac.compare_digest(
            hashlib.sha256(key.encode()).digest(),
            hashlib.sha256(Config.API_KEY.encode()).digest(),
        ):
            return jsonify({"success": False, "error": "Unauthorized"}), 401
        return f(*args, **kwargs)
    return decorated
```

2. Input validators for every parameter that arrives from HTTP:
```python
_ID_RE = re.compile(r'^[a-zA-Z0-9_-]{1,128}$')

def validate_id(value, field_name="id") -> str:
    if not _ID_RE.match(str(value)):
        raise ValueError(f"Invalid {field_name} format")
    return value

def validate_prompt(value, max_len=4000) -> str:
    if len(value.strip()) > max_len:
        raise ValueError(f"prompt exceeds {max_len} characters")
    return value.strip()
```

3. Secure config defaults — no hardcoded SECRET_KEY, DEBUG=False, API_KEY required, validate() raises on startup rather than returning a list:
```python
SECRET_KEY = os.environ.get('SECRET_KEY', '')  # no hardcoded fallback
DEBUG = os.environ.get('FLASK_DEBUG', 'false').lower() == 'true'
API_KEY = os.environ.get('API_KEY', '')

@classmethod
def validate(cls):
    missing = [k for k in ('SECRET_KEY', 'API_KEY', 'LLM_API_KEY', 'ZEP_API_KEY')
               if not getattr(cls, k)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {missing}")
```

Impact
- Prevents unauthorized access and resource exhaustion
- Eliminates prompt-injection risk via the interview endpoint
- No refactoring of business logic needed — pure middleware layer
- Required before any multi-user or internet-facing deployment
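The hash-then-compare trick in the middleware can be exercised in isolation with only the standard library; the function name here is illustrative, not part of the proposed API:

```python
import hashlib
import hmac

def key_matches(presented: str, expected: str) -> bool:
    # Hashing both sides first means compare_digest always receives
    # equal-length inputs, so comparison time leaks nothing about
    # where the presented key diverges from the expected one.
    return hmac.compare_digest(
        hashlib.sha256(presented.encode()).digest(),
        hashlib.sha256(expected.encode()).digest(),
    )

print(key_matches("secret-123", "secret-123"))  # True
print(key_matches("guess", "secret-123"))       # False
```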
Fix 3 — Resolve AGPL Exposure Before Any Commercial Wrapper is Built
Problem
MiroFish depends on OASIS (camel-ai/oasis), which is AGPL-3.0 licensed. AGPL copyleft extends to network use: any service that exposes AGPL-covered functionality over a network must publish its complete source under AGPL-3.0. Because MiroFish imports OASIS directly as a library, a closed-source SaaS wrapper built on this stack would violate AGPL unless its source is also published.
Fix Options
| Option | Cost | Source disclosure | Timeline |
|---|---|---|---|
| A — Comply with AGPL | Free | Full stack public | Immediate |
| B — Commercial license from CAMEL-AI | Negotiated | None required | Weeks |
| C — Process isolation | Engineering | OASIS service only | Weeks |
| D — Re-implement simulation layer | High | None required | Months |
Recommended immediate action regardless of option chosen:
- Add a `LICENSING.md` documenting the AGPL dependency and deployer obligations
- Add a licensing section to `README-EN.md`
Option C architecture (process isolation cleanly separates AGPL from non-AGPL code):
```
[Commercial product — no OASIS code, not AGPL-covered]
            |
      REST/gRPC API
            |
[MiroFish OASIS service — AGPL, source published]
```
Impact
- Removes legal risk before commercial development starts
- Establishes a clear architectural boundary
- Protects future investors and acquirers from open-source license liability
Fix 4 — Prompt Layer Improvements (Fastest ROI, No Refactoring)
Problem
The interview prompt prefix in simulation.py is minimal and produces shallow agent responses:
```python
INTERVIEW_PROMPT_PREFIX = "结合你的人设、所有的过往记忆与行动,不调用任何工具直接用文本回复我:"
# Roughly: "Based on your persona and all past memories and actions,
# reply to me in text directly without calling any tools:"
```

Profile generation prompts do not request behavioral anchors (posting style, active hours, opinion drift), so all agents behave with roughly uniform patterns, reducing simulation realism.
Fix
Drop-in interview prefix replacement:
```python
INTERVIEW_PROMPT_PREFIX = (
    "你是一个拥有独特背景、价值观和社交媒体历史的真实人物。"
    "请结合你的完整人设描述、所有过往记忆与社交媒体行动历史,"
    "以第一人称、自然口语化的方式直接回答下面的问题。"
    "不要调用任何工具,不要解释推理过程,"
    "不要以“作为AI”开头,直接表达你真实的想法、情绪和立场:"
)
# Roughly: "You are a real person with a unique background, values, and
# social-media history. Drawing on your full persona, all past memories,
# and your social-media action history, answer the question below in the
# first person, in a natural conversational tone. Do not call any tools,
# do not explain your reasoning, do not open with 'As an AI' — express
# your genuine thoughts, emotions, and stance directly:"
```

Add behavioral anchors to the profile generation prompt:
```yaml
behavioral_anchors:
  posting_style: terse | verbose | meme-heavy | data-driven | emotional
  active_hours: list of 0-23 integers (at least 4 hours)
  stance: supportive | opposing | neutral | observer | amplifier
  opinion_drift_rate: 0.0-1.0 (how easily they shift position under social pressure)
  influence_weight: 0.5-3.0 (opinion leaders > 2.0, lurkers < 0.8)
```
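Downstream code can enforce this schema before a generated profile enters the simulation. A minimal sketch — field names come from the schema above, but the validator itself is hypothetical:

```python
STYLES = {"terse", "verbose", "meme-heavy", "data-driven", "emotional"}
STANCES = {"supportive", "opposing", "neutral", "observer", "amplifier"}

def validate_anchors(a: dict) -> dict:
    # Reject LLM outputs that drift outside the declared ranges
    if a["posting_style"] not in STYLES:
        raise ValueError("unknown posting_style")
    hours = a["active_hours"]
    if len(hours) < 4 or not all(h in range(24) for h in hours):
        raise ValueError("active_hours needs at least 4 values in 0-23")
    if a["stance"] not in STANCES:
        raise ValueError("unknown stance")
    if not 0.0 <= a["opinion_drift_rate"] <= 1.0:
        raise ValueError("opinion_drift_rate out of range")
    if not 0.5 <= a["influence_weight"] <= 3.0:
        raise ValueError("influence_weight out of range")
    return a

ok = validate_anchors({
    "posting_style": "terse",
    "active_hours": [8, 12, 20, 22],
    "stance": "observer",
    "opinion_drift_rate": 0.2,
    "influence_weight": 2.5,
})
```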
Add archetype diversity instruction to simulation config prompt:
"Agent archetypes must be diverse: include opinion leaders (high influence, low drift), lurkers (low activity, observer stance), reactors (high drift, emotional style), and amplifiers (repost-heavy, high activity). Avoid uniform parameters — homogeneity kills simulation validity."
Impact
- Prompt-only change — zero refactoring risk
- Directly increases simulation output quality and interview response depth
- Behavioral anchors enable downstream analysis (filter by stance, segment by activity pattern)
- Archetype diversity produces richer emergent dynamics in the multi-agent simulation
Priority Ordering
| Priority | Fix | Effort | Risk if skipped |
|---|---|---|---|
| 1 | Fix 4 — Prompt improvements | Hours | Poor simulation quality |
| 2 | Fix 2 — Auth + validation | 1-2 days | Security breach on any shared deployment |
| 3 | Fix 3 — AGPL resolution | Low to high, depending on option chosen | Legal liability for commercial wrapper |
| 4 | Fix 1 — Async execution layer | 3-5 days | Fragile subprocess management at scale |
Fixes 2 and 3 are blockers for any production or commercial deployment. Fix 4 can be shipped immediately with no risk.