LLM JSON parse retries when model wraps response in markdown code fences #645

@feniix

Description

Summary

When using MiniMax models (M2.5-HighSpeed, M2.7-highspeed) as the extraction LLM, every retain and consolidation call triggers 5-11 JSON parse retries because the model wraps its JSON response in markdown code fences:

```json
{"facts": [...]}
```

The OpenAICompatibleLLM provider already has code fence stripping logic, but it only runs for specific providers — MiniMax and other OpenAI-compatible providers hit a different branch with no stripping.

Root cause

In openai_compatible_llm.py, the call() method has three relevant code paths:

Path A — Ollama native redirect (line ~201):

if self.provider == "ollama" and response_format is not None:
    return await self._call_ollama_native(...)

When a response_format is provided (which retain and consolidation always do), Ollama is redirected to a separate method that uses the native /api/chat endpoint with a schema-based format parameter. This path has its own json.loads() with no fence stripping, and because retain and consolidation always pass a response_format, Ollama structured-output calls never reach the main parsing logic below.
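
For context, here is a minimal sketch of the kind of request that native path issues. This is an illustration, not the actual _call_ollama_native() body; the field names follow Ollama's public /api/chat API, and the HTTP client, model, and schema are placeholders.

import json
import httpx  # placeholder HTTP client; the repo may use a different one

# Illustrative sketch of an Ollama native structured-output call.
# Ollama's /api/chat accepts a JSON schema in "format", so the server
# itself constrains the output, which is why this path usually gets bare JSON.
schema = {"type": "object", "properties": {"facts": {"type": "array"}}}
payload = {
    "model": "gemma3:12b",
    "messages": [{"role": "user", "content": "Extract facts as JSON."}],
    "format": schema,   # JSON schema, enforced server-side
    "stream": False,
}
resp = httpx.post("http://localhost:11434/api/chat", json=payload)
content = resp.json()["message"]["content"]
json_data = json.loads(content)  # parsed here, with no fence stripping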

Path B — Fence stripping, gated to local providers (line ~317):

if self.provider in ("lmstudio", "ollama"):
    # strips ```json ... ``` fences ✓

Due to Path A, only LM Studio actually reaches this branch for structured output calls. Ollama can reach it for unstructured calls (when response_format is None), but the primary use cases (retain, consolidation) always provide a response format and get intercepted by Path A.

Path C — No fence stripping, used by everyone else (else branch):

else:
    json_data = json.loads(content)  # fails on fenced JSON → retry loop

MiniMax, Groq, OpenAI, and any other provider hit this branch. No fence stripping — fenced JSON triggers the retry loop up to max_retries (default 10) times.
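
This failure is easy to reproduce in isolation; the error is the same one that appears in the logs in the reproduction section below:

import json

content = '```json\n{"facts": []}\n```'
try:
    json.loads(content)
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 1 (char 0)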

Additional detail — json_object mode (line ~291):

if self.provider not in ("lmstudio", "ollama"):
    call_params["response_format"] = {"type": "json_object"}

LM Studio and Ollama are excluded from json_object response format enforcement (they don't support it reliably), getting only schema-in-prompt guidance instead. This is another reason local models may produce fenced JSON — less format enforcement from the API layer.

Proposed fix

Remove the provider gate — apply fence stripping to all providers in the main call() method. It's a no-op when the response is already bare JSON:

# Strip markdown code fences if present (any provider may produce these)
clean_content = content
if "```json" in content:
    clean_content = content.split("```json")[1].split("```")[0].strip()
elif "```" in content:
    clean_content = content.split("```")[1].split("```")[0].strip()
try:
    json_data = json.loads(clean_content)
except json.JSONDecodeError:
    json_data = json.loads(content)

Additionally, _call_ollama_native() (line ~728) could use the same fence stripping as a safety net, even though Ollama models with schema-enforced format typically return bare JSON.
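
As a sketch, the stripping could be factored into a small shared helper (the name is hypothetical; no such helper exists in the codebase today) so that call() and _call_ollama_native() apply identical logic:

import json

def strip_code_fences(content: str) -> str:
    """Remove surrounding markdown code fences from an LLM response.

    Hypothetical shared helper; a no-op when the response is bare JSON.
    """
    if "```json" in content:
        return content.split("```json")[1].split("```")[0].strip()
    if "```" in content:
        return content.split("```")[1].split("```")[0].strip()
    return content

# Fenced and bare responses both parse:
assert json.loads(strip_code_fences('```json\n{"facts": []}\n```')) == {"facts": []}
assert json.loads(strip_code_fences('{"facts": []}')) == {"facts": []}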

Reproduction

  1. Configure Hindsight with MiniMax:
    HINDSIGHT_API_LLM_PROVIDER=minimax
    HINDSIGHT_API_LLM_MODEL=MiniMax-M2.7-highspeed
    HINDSIGHT_API_LLM_API_KEY=<key>
    
  2. Retain any content:
    curl -X POST http://127.0.0.1:9077/v1/default/banks/test/memories \
      -H 'Content-Type: application/json' \
      -d '{"async":false,"items":[{"content":"Alice works at Google as a senior engineer."}]}'
  3. Observe logs:
    WARNING - JSON parse error from LLM response (attempt 6/11): Expecting value: line 1 column 1 (char 0)
      Model: minimax/MiniMax-M2.7-highspeed
      Content preview: '```json\n{\n  "facts": [...]}\n```'
      Finish reason: stop
    

Impact

  • 5-10x slower retain and consolidation (each call retries 5-11 times)
  • Some operations fail entirely after exhausting all retries
  • Rate limit cascading — retry storm triggers provider rate limits (429)
  • Affects both retain (fact extraction) and consolidation (observation generation)

Affected models

Confirmed on:

  • MiniMax-M2.5-HighSpeed
  • MiniMax-M2.7-highspeed

The same behavior has been reported across other AI projects with various models.

Environment

  • Hindsight: v0.4.19
  • Provider: minimax (OpenAI-compatible endpoint)
  • OS: Ubuntu 24.04
  • Deployment: hindsight-api via uvx with embedded PostgreSQL (pg0)

Workaround

Switch to a model/provider that returns bare JSON. We benchmarked 10 models and switched to local Ollama with gemma3:12b, which returns bare JSON natively via Ollama's schema-enforced /api/chat endpoint.
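
For reference, the workaround configuration reuses the environment variables from the reproduction section (any additional Ollama-specific settings, such as a base URL, are not covered here):

HINDSIGHT_API_LLM_PROVIDER=ollama
HINDSIGHT_API_LLM_MODEL=gemma3:12b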
