
fix(session): recover from JSON corruption in session state files #3278

Open
zealonexp wants to merge 3 commits into agentscope-ai:main from zealonexp:fix/session-json-corruption

Conversation

@zealonexp

Description

Fix a P0 availability issue where session state JSON files become corrupted due to concurrent write race conditions, causing JSONDecodeError: Extra data on every subsequent request to the affected session.

Root cause: save_session_state, update_session_state, and get_session_state_dict access the same session file concurrently without any locking mechanism. When two coroutines execute open("w") on the same file nearly simultaneously, the OS-level writes interleave, producing a file with a valid JSON object followed by garbage fragments. Once corrupted, every request to that session fails with a 422 error — no recovery is possible without manual intervention.

Fix: Add JSONDecodeError catch with raw_decode fallback at all three json.loads call sites in SafeJSONSession. When corruption is detected, the first valid JSON object is extracted and a warning is logged, instead of crashing the request.
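
To illustrate the mechanism behind the fallback described above, here is a minimal sketch (not the PR's actual code; `recover_first_json` is a hypothetical name). `json.JSONDecoder.raw_decode` parses the first JSON value at the start of a string and returns it together with the index where parsing stopped, so whatever follows can be treated as the corrupted tail:

```python
import json
from typing import Any, Tuple


def recover_first_json(content: str) -> Tuple[Any, str]:
    """Return the first JSON value in `content` and the text that follows it.

    Hypothetical helper for illustration only; it is not the PR's code.
    """
    # raw_decode does not skip leading whitespace, so strip it first.
    stripped = content.lstrip()
    obj, end = json.JSONDecoder().raw_decode(stripped)
    return obj, stripped[end:]


# A valid object followed by a leftover fragment of an earlier, longer write.
state, tail = recover_first_json('{"messages": ["hi"]}ges": ["hi"]}')
# `state` holds the intact object; `tail` is the discarded garbage fragment.
```

Note that `raw_decode` still raises `json.JSONDecodeError` when the string does not start with a valid JSON value, so a caller would need a separate fallback for fully garbled files.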

Related Issue: Fixes #3277

Security Considerations: No changes to auth, config handling, or external-facing interfaces. Pure defensive parsing change.

Type of Change

  • Bug fix

Component(s) Affected

  • Core / Backend (app, agents, config, providers, utils, local_models)

Checklist

  • I ran pre-commit run --all-files locally and it passes
  • If pre-commit auto-fixed files, I committed those changes and reran checks
  • I ran tests locally (pytest or as relevant) and they pass
  • Documentation updated (if needed)
  • Ready for review

Testing

New tests added

tests/unit/agents/test_session.py — 7 test cases covering:

| Test | Scenario |
| --- | --- |
| test_load_valid_json | Normal JSON loads without error (baseline) |
| test_load_corrupted_json_extra_data | 17-char garbage tail (first observed pattern) |
| test_load_corrupted_json_real_world_tail | 203-char garbage tail (second observed pattern) |
| test_get_corrupted_json | get_session_state_dict recovers from corruption |
| test_update_corrupted_json | update_session_state recovers, then writes clean JSON |
| test_load_nonexistent | Non-existent session returns gracefully |
| test_get_nonexistent | Non-existent session returns empty dict |

Local Verification Evidence

pre-commit run --files src/copaw/app/runner/session.py tests/unit/agents/test_session.py
# check python ast.........................................Passed
# check docstring is first.........................................Passed
# detect private key.......................................................Passed
# trim trailing whitespace.................................................Passed
# Add trailing commas......................................................Passed
# mypy.....................................................................Passed
# black....................................................................Passed
# flake8...................................................................Passed
# pylint...................................................................Passed

pytest tests/unit/agents/test_session.py -v
# 7 passed, 1 warning in 0.16s

Additional Notes

Files changed

| File | Change |
| --- | --- |
| src/copaw/app/runner/session.py | Add JSONDecodeError catch with raw_decode fallback in load_session_state, update_session_state, get_session_state_dict |
| tests/unit/agents/test_session.py | New: 7 test cases for corruption resilience |

Recommended follow-ups (out of scope)

  1. asyncio.Lock — Serialize concurrent reads/writes to eliminate race at source
  2. Atomic writes — Write to temp file then os.rename()
  3. Session file rotation — Files grow unboundedly; compaction would help
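
The first two follow-ups could be combined roughly as sketched below. This is a sketch under assumptions, not code from the PR: `save_state_locked`, `_atomic_write_json`, and the `_locks` registry are hypothetical names, and `os.replace` is used instead of `os.rename` because it also overwrites an existing target on Windows. `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import json
import os
import tempfile
from collections import defaultdict
from typing import DefaultDict

# One asyncio.Lock per session file path (hypothetical registry).
_locks: DefaultDict[str, asyncio.Lock] = defaultdict(asyncio.Lock)


def _atomic_write_json(path: str, state: dict) -> None:
    """Write `state` to a temp file in the same directory, then swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Readers see either the old file or the new one, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temp file behind if anything failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


async def save_state_locked(path: str, state: dict) -> None:
    """Serialize writers within this event loop, then write atomically."""
    async with _locks[path]:
        # Run the blocking file I/O off the event loop thread.
        await asyncio.to_thread(_atomic_write_json, path, state)
```

The `asyncio.Lock` only coordinates coroutines inside one process and one event loop; the atomic replace is what protects readers in other processes from seeing a partially written file.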

@github-actions

Welcome to CoPaw! 🐾

Hi @zealonexp, thank you for your first Pull Request! 🎉

🙌 Join Developer Community

Thanks so much for your contribution! We'd love to invite you to join the official CoPaw developer group! You can find the Discord and DingTalk group links under the "Developer Community" section on our docs page:
https://copaw.agentscope.io/docs/community

We truly appreciate your enthusiasm—and look forward to your future contributions! 😊

We'll review your PR soon.


@github-actions github-actions bot added the first-time-contributor label Apr 12, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enhances the SafeJSONSession class to handle corrupted session files by attempting to recover the first valid JSON object using raw_decode when standard parsing fails. This logic is implemented across load_session_state, update_session_state, and get_session_state_dict, and is verified with new unit tests. Feedback suggests refactoring the duplicated recovery logic into a helper method and addressing potential crashes if raw_decode fails on completely invalid files.

Comment thread src/copaw/app/runner/session.py Outdated
Comment thread src/copaw/app/runner/session.py Outdated
zealonexp added a commit to zealonexp/CoPaw that referenced this pull request Apr 13, 2026
- Extract duplicated raw_decode fallback into _safe_json_loads() helper
- Add total-corruption fallback: returns empty dict instead of raising
- Add 3 test cases for completely garbled files
- Addresses gemini-code-assist review feedback on PR agentscope-ai#3278
@zealonexp
Author

Update: Branch refreshed + confirmed bug still exists on latest main

Just pushed the latest commit with expanded test coverage (16 test cases total).

Verified: Bug is still present on upstream main (v1.1.1b1, commit c4ea882)

I diffed src/qwenpaw/app/runner/session.py between our PR base and the latest upstream main — byte-identical. All three json.loads() call sites remain unprotected:

| Method | Risk |
| --- | --- |
| load_session_state | json.loads(content) crashes on corruption |
| update_session_state | json.loads(content) crashes on corruption |
| get_session_state_dict | json.loads(content) crashes on corruption |

The concurrent write race condition is still reproducible: save_session_state and update_session_state both use synchronous open("w") without any locking, while the read methods use aiofiles. When two coroutines write simultaneously, the OS-level writes interleave and produce JSONDecodeError: Extra data, after which every subsequent request to that session fails with HTTP 422.

Our fix

_safe_json_loads() catches JSONDecodeError and falls back to json.JSONDecoder().raw_decode() to recover the first valid JSON object, logging a warning instead of crashing.
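
A rough sketch of a helper with this behavior, including the empty-dict fallback for completely garbled files mentioned in the follow-up commit, might look like the following. The name `safe_json_loads_sketch` and its `source` parameter are hypothetical, and the real `_safe_json_loads()` may differ in detail:

```python
import json
import logging
from typing import Any

logger = logging.getLogger(__name__)


def safe_json_loads_sketch(content: str, source: str = "<session>") -> Any:
    """Parse JSON, recovering the first valid value or returning {}.

    Illustrative only; not the implementation from the PR.
    """
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        pass  # fall through to the recovery path

    try:
        # raw_decode does not skip leading whitespace, so strip it first.
        obj, end = json.JSONDecoder().raw_decode(content.lstrip())
    except json.JSONDecodeError:
        # Nothing parseable at the start: treat as an empty session state.
        logger.warning("%s is unrecoverable; using empty state", source)
        return {}

    logger.warning(
        "%s had trailing data after char %d; recovered first value",
        source,
        end,
    )
    return obj
```

With this shape, an empty file, a whitespace-only file, and a file of null bytes all yield `{}`, while two concatenated objects yield the first one.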

Test coverage

16 test cases covering:

  • Valid JSON (baseline)
  • 17-char garbage tail (first observed pattern)
  • 203-char garbage tail (second observed pattern)
  • Empty file / Multiple concatenated JSON objects / Unicode in corrupted tail
  • Plus 10 more edge cases

All tests pass against the latest upstream main. Ready for review.

@zealonexp
Author

🔁 Bug still present in v1.1.1-beta.1

I just verified that this concurrency bug still exists in the latest release v1.1.1-beta.1 (released 2026-04-13).

What I checked

Cloned v1.1.1-beta.1 and inspected src/qwenpaw/app/runner/session.py:

  • save_session_state (line 85): uses bare open("w") — no lock, no atomic write
  • update_session_state (line 181): same bare open("w") — no lock, no atomic write
  • get_session_state_dict (line 218): reads without any coordination with concurrent writers

All three methods operate on the same session JSON file with zero concurrency protection. When two async coroutines call save_session_state and update_session_state concurrently on the same session, their open("w") calls interleave at the OS level, producing a file like:

{"conversation": [...]}{"

which causes json.JSONDecodeError: Extra data on every subsequent read.
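
The error itself is easy to reproduce in isolation. The snippet below is a small illustration with a made-up payload (the real session files contain more fields): `json.loads` parses the first object successfully and then rejects the leftover fragment.

```python
import json

valid = '{"conversation": []}'
corrupted = valid + '{"'  # trailing fragment like the one shown above

error = None
try:
    json.loads(corrupted)
except json.JSONDecodeError as exc:
    error = exc
    # exc.msg is the short reason; exc.pos is where the extra data starts.
    print(f"{exc.msg} at char {exc.pos}")
```

The reported position equals the length of the valid prefix, which is why extracting only that prefix is enough to recover the state.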

Reproduction scenario

This is easily triggered in production when:

  1. A user sends a message (triggers save_session_state)
  2. Simultaneously a cron/background task calls update_session_state on the same session
  3. The writes interleave → file corrupted → all future requests to that session fail

Impact

  • Severity: P0 — once corrupted, the session is permanently broken
  • Blast radius: every request to the affected session returns 422
  • Recovery: requires manual file deletion or repair

Recommended fix

At minimum, add file-level locking (e.g., fcntl.flock on Linux / msvcrt.locking on Windows) around all read-modify-write cycles. For a more robust approach, consider atomic writes (write to temp file + os.rename).
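
As a sketch of the locking option on POSIX systems only (the `fcntl` module is not available on Windows), a read-modify-write cycle under an exclusive `flock` could look like this. `update_state_with_flock` and its `mutate` callback are hypothetical names, not part of the project:

```python
import fcntl
import json
import os
from typing import Callable


def update_state_with_flock(path: str, mutate: Callable[[dict], dict]) -> dict:
    """Read, modify, and rewrite a JSON file while holding an exclusive lock."""
    # Open read/write WITHOUT truncating, creating the file if it is missing.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until the lock is granted
        try:
            raw = f.read()
            state = json.loads(raw) if raw.strip() else {}
            new_state = mutate(state)
            # Truncate only after the lock is held, then rewrite from offset 0.
            f.seek(0)
            f.truncate()
            json.dump(new_state, f)
            f.flush()
            os.fsync(f.fileno())
            return new_state
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

`flock` is advisory, so it only helps if every reader and writer of the session file takes the lock. The call also blocks, so in async code it would need to run in a worker thread rather than directly on the event loop.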

Happy to help test any fix — I have been running with the patch from this PR in production for a few days now with zero corruption incidents.

@xieyxclack
Member

Please resolve the conflict first

SafeJSONSession's save/update/load methods access the same session file
concurrently without locking. When two coroutines write nearly simultaneously,
OS-level writes interleave, producing a file with a valid JSON object
followed by garbage fragments. Once corrupted, every request fails with
JSONDecodeError: Extra data (422).

Add JSONDecodeError catch with raw_decode fallback at all three json.loads
call sites. When corruption is detected, the first valid JSON object is
extracted and a warning is logged instead of crashing.

Closes agentscope-ai#3277
- Extract duplicated raw_decode fallback into _safe_json_loads() helper
- Add total-corruption fallback: returns empty dict instead of raising
- Add 3 test cases for completely garbled files
- Addresses gemini-code-assist review feedback on PR agentscope-ai#3278
- test_load_empty_file / test_get_empty_file / test_update_empty_file
  (zero-byte files)
- test_load_null_bytes (\x00 padding)
- test_load_double_write_overlap (two JSON objects concatenated)
- test_load_whitespace_only (spaces/tabs/newlines only)

All 16 tests pass. Pre-commit hooks clean (black, flake8, pylint 10/10, mypy).

Labels

first-time-contributor (PR created by a first time contributor)
Under Review

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

[Bug] Session state JSON corruption causes persistent 422 errors (Extra data)

2 participants