LLM JSON parse retries when model wraps response in markdown code fences #645

@feniix

Description

Summary

When using MiniMax models (M2.5-HighSpeed, M2.7-highspeed) as the extraction LLM, every retain and consolidation call triggers 5-11 JSON parse retries because the model wraps its JSON response in markdown code fences:

```json
{"facts": [...]}
```

The OpenAICompatibleLLM provider already has code fence stripping logic, but it only runs for specific providers — MiniMax and other OpenAI-compatible providers hit a different branch with no stripping.

Root cause

In openai_compatible_llm.py, the call() method has three relevant code paths:

Path A — Ollama native redirect (line ~201):

if self.provider == "ollama" and response_format is not None:
    return await self._call_ollama_native(...)

When a response_format is provided (which retain and consolidation always do), Ollama is redirected to a separate method that uses the native /api/chat endpoint with a schema-based format parameter. This path has its own json.loads() with no fence stripping, and because retain and consolidation always pass a response_format, Ollama structured-output calls never reach the main parsing logic below.
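
For context, here is a minimal sketch of the kind of request that native path issues. This is an illustration, not the actual _call_ollama_native() body; the field names follow Ollama's public /api/chat API, and the HTTP client, model, and schema are placeholders.

import json
import httpx  # placeholder HTTP client; the repo may use a different one

# Illustrative sketch of an Ollama native structured-output call.
# Ollama's /api/chat accepts a JSON schema in "format", so the server
# itself constrains the output, which is why this path usually gets bare JSON.
schema = {"type": "object", "properties": {"facts": {"type": "array"}}}
payload = {
    "model": "gemma3:12b",
    "messages": [{"role": "user", "content": "Extract facts as JSON."}],
    "format": schema,   # JSON schema, enforced server-side
    "stream": False,
}
resp = httpx.post("http://localhost:11434/api/chat", json=payload)
content = resp.json()["message"]["content"]
json_data = json.loads(content)  # parsed here, with no fence stripping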

Path B — Fence stripping, gated to local providers (line ~317):

if self.provider in ("lmstudio", "ollama"):
    # strips ```json ... ``` fences ✓

Due to Path A, only LM Studio actually reaches this branch for structured output calls. Ollama can reach it for unstructured calls (when response_format is None), but the primary use cases (retain, consolidation) always provide a response format and get intercepted by Path A.

Path C — No fence stripping, used by everyone else (else branch):

else:
    json_data = json.loads(content)  # fails on fenced JSON → retry loop

MiniMax, Groq, OpenAI, and any other provider hit this branch. No fence stripping — fenced JSON triggers the retry loop up to max_retries (default 10) times.
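
This failure is easy to reproduce in isolation; the error is the same one that appears in the logs in the reproduction section below:

import json

content = '```json\n{"facts": []}\n```'
try:
    json.loads(content)
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 1 (char 0)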

Additional detail — json_object mode (line ~291):

if self.provider not in ("lmstudio", "ollama"):
    call_params["response_format"] = {"type": "json_object"}

LM Studio and Ollama are excluded from json_object response format enforcement (they don't support it reliably), getting only schema-in-prompt guidance instead. This is another reason local models may produce fenced JSON — less format enforcement from the API layer.

Proposed fix

Remove the provider gate — apply fence stripping to all providers in the main call() method. It's a no-op when the response is already bare JSON:

# Strip markdown code fences if present (any provider may produce these)
clean_content = content
if "```json" in content:
    clean_content = content.split("```json")[1].split("```")[0].strip()
elif "```" in content:
    clean_content = content.split("```")[1].split("```")[0].strip()
try:
    json_data = json.loads(clean_content)
except json.JSONDecodeError:
    json_data = json.loads(content)

Additionally, _call_ollama_native() (line ~728) could use the same fence stripping as a safety net, even though Ollama models with schema-enforced format typically return bare JSON.
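
As a sketch, the stripping could be factored into a small shared helper (the name is hypothetical; no such helper exists in the codebase today) so that call() and _call_ollama_native() apply identical logic:

import json

def strip_code_fences(content: str) -> str:
    """Remove surrounding markdown code fences from an LLM response.

    Hypothetical shared helper; a no-op when the response is bare JSON.
    """
    if "```json" in content:
        return content.split("```json")[1].split("```")[0].strip()
    if "```" in content:
        return content.split("```")[1].split("```")[0].strip()
    return content

# Fenced and bare responses both parse:
assert json.loads(strip_code_fences('```json\n{"facts": []}\n```')) == {"facts": []}
assert json.loads(strip_code_fences('{"facts": []}')) == {"facts": []}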

Reproduction

  1. Configure Hindsight with MiniMax:
    HINDSIGHT_API_LLM_PROVIDER=minimax
    HINDSIGHT_API_LLM_MODEL=MiniMax-M2.7-highspeed
    HINDSIGHT_API_LLM_API_KEY=<key>
    
  2. Retain any content:
    curl -X POST http://127.0.0.1:9077/v1/default/banks/test/memories \
      -H 'Content-Type: application/json' \
      -d '{"async":false,"items":[{"content":"Alice works at Google as a senior engineer."}]}'
  3. Observe logs:
    WARNING - JSON parse error from LLM response (attempt 6/11): Expecting value: line 1 column 1 (char 0)
      Model: minimax/MiniMax-M2.7-highspeed
      Content preview: '```json\n{\n  "facts": [...]}\n```'
      Finish reason: stop
    

Impact

  • 5-10x slower retain and consolidation (each call retries 5-11 times)
  • Some operations fail entirely after exhausting all retries
  • Rate limit cascading — retry storm triggers provider rate limits (429)
  • Affects both retain (fact extraction) and consolidation (observation generation)

Affected models

Confirmed on:

  • MiniMax-M2.5-HighSpeed
  • MiniMax-M2.7-highspeed

The same behavior has been reported across other AI projects with various models.

Environment

  • Hindsight: v0.4.19
  • Provider: minimax (OpenAI-compatible endpoint)
  • OS: Ubuntu 24.04
  • Deployment: hindsight-api via uvx with embedded PostgreSQL (pg0)

Workaround

Switch to a model/provider that returns bare JSON. We benchmarked 10 models and switched to local Ollama with gemma3:12b, which returns bare JSON natively via Ollama's schema-enforced /api/chat endpoint.
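
For reference, the workaround configuration reuses the environment variables from the reproduction section (any additional Ollama-specific settings, such as a base URL, are not covered here):

HINDSIGHT_API_LLM_PROVIDER=ollama
HINDSIGHT_API_LLM_MODEL=gemma3:12b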
