## Summary

When using MiniMax models (M2.5-HighSpeed, M2.7-highspeed) as the extraction LLM, every `retain` and `consolidation` call triggers 5-11 JSON parse retries because the model wraps its JSON response in markdown code fences:
```json
{"facts": [...]}
```
The `OpenAICompatibleLLM` provider already has code fence stripping logic, but it only runs for specific providers — MiniMax and other OpenAI-compatible providers hit a different branch with no stripping.
## Root cause

In `openai_compatible_llm.py`, the `call()` method has three relevant code paths:

**Path A — Ollama native redirect (line ~201):**

```python
if self.provider == "ollama" and response_format is not None:
    return await self._call_ollama_native(...)
```
When a `response_format` is provided (which `retain` and `consolidation` always do), Ollama is redirected to a separate method that uses the native `/api/chat` endpoint with a schema-based `format` parameter. This path has its own `json.loads()` without fence stripping. Since both `retain` and `consolidation` always pass a `response_format`, Ollama structured-output calls never reach the main parsing logic below.
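For reference, the parse on this path looks roughly like the sketch below (reconstructed from the description above, not verbatim source; the client and variable names are illustrative):

```python
# Illustrative sketch of the native Ollama structured-output path (not verbatim source)
payload = {
    "model": self.model,
    "messages": messages,
    "format": json_schema,  # Ollama enforces this schema server-side
    "stream": False,
}
resp = await self._http.post(f"{self.base_url}/api/chat", json=payload)
content = resp.json()["message"]["content"]
json_data = json.loads(content)  # direct parse, no fence stripping
```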
**Path B — Fence stripping, gated to local providers (line ~317):**

```python
if self.provider in ("lmstudio", "ollama"):
    # strips ```json ... ``` fences ✓
```
Due to Path A, only LM Studio actually reaches this branch for structured-output calls. Ollama can reach it for unstructured calls (when `response_format is None`), but the primary use cases (`retain`, `consolidation`) always provide a response format and get intercepted by Path A.
**Path C — No fence stripping, used by everyone else (the `else` branch):**

```python
else:
    json_data = json.loads(content)  # fails on fenced JSON → retry loop
```
MiniMax, Groq, OpenAI, and every other provider hit this branch. No fence stripping — fenced JSON triggers the retry loop up to `max_retries` (default 10) times.
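The failure is easy to reproduce in isolation, since `json.loads()` rejects fenced output at the very first character:

```python
import json

# A fenced response of the kind MiniMax returns (content shortened for illustration)
fenced = '```json\n{"facts": [{"content": "Alice works at Google."}]}\n```'

try:
    json.loads(fenced)
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 1 (char 0), the same error as in the logs below
```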
**Additional detail — `json_object` mode (line ~291):**

```python
if self.provider not in ("lmstudio", "ollama"):
    call_params["response_format"] = {"type": "json_object"}
```
LM Studio and Ollama are excluded from `json_object` response-format enforcement (they don't support it reliably) and get only schema-in-prompt guidance instead. This is another reason local models may produce fenced JSON — less format enforcement from the API layer.
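For contrast, API-layer enforcement in a generic OpenAI-compatible call looks like this (an illustration using the public `openai` client, not Hindsight source; the model name is a placeholder):

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()  # any OpenAI-compatible endpoint

async def extract_json(messages: list[dict]) -> str:
    resp = await client.chat.completions.create(
        model="placeholder-model",
        messages=messages,
        # Server-side guarantee of a bare JSON object; Hindsight omits this
        # parameter for LM Studio and Ollama, leaving only prompt guidance.
        response_format={"type": "json_object"},
    )
    return resp.choices[0].message.content
```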
## Proposed fix

Remove the provider gate — apply fence stripping to all providers in the main `call()` method. It's a no-op when the response is already bare JSON:
```python
# Strip markdown code fences if present (any provider may produce these)
clean_content = content
if "```json" in content:
    clean_content = content.split("```json")[1].split("```")[0].strip()
elif "```" in content:
    clean_content = content.split("```")[1].split("```")[0].strip()
try:
    json_data = json.loads(clean_content)
except json.JSONDecodeError:
    json_data = json.loads(content)
```
Additionally, `_call_ollama_native()` (line ~728) could apply the same fence stripping as a safety net, even though Ollama models with a schema-enforced `format` typically return bare JSON.
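One way to share that logic between `call()` and `_call_ollama_native()` is a small helper; this is only a sketch, and the helper name is hypothetical:

```python
def _strip_code_fences(content: str) -> str:
    """Remove a surrounding ```json ... ``` (or bare ```) fence if present.

    No-op for bare JSON, so it is safe to apply on every parse path.
    (Hypothetical helper, not existing Hindsight code.)
    """
    if "```json" in content:
        return content.split("```json")[1].split("```")[0].strip()
    if "```" in content:
        return content.split("```")[1].split("```")[0].strip()
    return content
```

Both parse sites would then call `json.loads(_strip_code_fences(content))`, falling back to `json.loads(content)` on a `JSONDecodeError` as in the snippet above.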
## Reproduction

- Configure Hindsight with MiniMax:

  ```
  HINDSIGHT_API_LLM_PROVIDER=minimax
  HINDSIGHT_API_LLM_MODEL=MiniMax-M2.7-highspeed
  HINDSIGHT_API_LLM_API_KEY=<key>
  ```

- Retain any content:

  ```bash
  curl -X POST http://127.0.0.1:9077/v1/default/banks/test/memories \
    -H 'Content-Type: application/json' \
    -d '{"async":false,"items":[{"content":"Alice works at Google as a senior engineer."}]}'
  ```

- Observe the logs:

  ```
  WARNING - JSON parse error from LLM response (attempt 6/11): Expecting value: line 1 column 1 (char 0)
  Model: minimax/MiniMax-M2.7-highspeed
  Content preview: '```json\n{\n "facts": [...]}\n```'
  Finish reason: stop
  ```
## Impact

- 5-10x slower `retain` and `consolidation` (each call retries 5-11 times)
- Some operations fail entirely after exhausting all retries
- Rate limit cascading — retry storm triggers provider rate limits (429)
- Affects both `retain` (fact extraction) and `consolidation` (observation generation)
## Affected models

Confirmed on:

- MiniMax-M2.5-HighSpeed
- MiniMax-M2.7-highspeed

The same behavior has been reported across other AI projects with various models.
## Environment
- Hindsight: v0.4.19
- Provider: minimax (OpenAI-compatible endpoint)
- OS: Ubuntu 24.04
- Deployment: hindsight-api via uvx with embedded PostgreSQL (pg0)
## Workaround

Switch to a model/provider that returns bare JSON. We benchmarked 10 models and switched to local Ollama with `gemma3:12b`, which returns bare JSON natively via Ollama's schema-enforced `/api/chat` endpoint.
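Assuming the same environment-variable scheme as in the reproduction above, the switch looks like this:

```
# Assumes the env-var scheme shown in the reproduction; a local Ollama
# instance does not require an API key.
HINDSIGHT_API_LLM_PROVIDER=ollama
HINDSIGHT_API_LLM_MODEL=gemma3:12b
```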