fix(session): recover from JSON corruption in session state files#3278
fix(session): recover from JSON corruption in session state files#3278zealonexp wants to merge 3 commits intoagentscope-ai:mainfrom
Conversation
|
Hi @zealonexp, thank you for your first Pull Request! 🎉 🙌 Join Developer CommunityThanks so much for your contribution! We'd love to invite you to join the official CoPaw developer group! You can find the Discord and DingTalk group links under the "Developer Community" section on our docs page: We truly appreciate your enthusiasm—and look forward to your future contributions! 😊 We'll review your PR soon. |
There was a problem hiding this comment.
Code Review
This pull request enhances the SafeJSONSession class to handle corrupted session files by attempting to recover the first valid JSON object using raw_decode when standard parsing fails. This logic is implemented across load_session_state, update_session_state, and get_session_state_dict, and is verified with new unit tests. Feedback suggests refactoring the duplicated recovery logic into a helper method and addressing potential crashes if raw_decode fails on completely invalid files.
- Extract duplicated raw_decode fallback into _safe_json_loads() helper - Add total-corruption fallback: returns empty dict instead of raising - Add 3 test cases for completely garbled files - Addresses gemini-code-assist review feedback on PR agentscope-ai#3278
Update: Branch refreshed + confirmed bug still exists on latest mainJust pushed the latest commit with expanded test coverage (16 test cases total). Verified: Bug is still present on upstream main (v1.1.1b1, commit c4ea882)I diffed
The concurrent write race condition is still reproducible: Our fix
Test coverage16 test cases covering:
All tests pass against the latest upstream main. Ready for review. |
🔁 Bug still present in v1.1.1-beta.1I just verified that this concurrency bug still exists in the latest release v1.1.1-beta.1 (released 2026-04-13). What I checkedCloned
All three methods operate on the same session JSON file with zero concurrency protection. When two async coroutines call which causes Reproduction scenarioThis is easily triggered in production when:
Impact
Recommended fixAt minimum, add file-level locking (e.g., Happy to help test any fix — I have been running with the patch from this PR in production for a few days now with zero corruption incidents. |
|
Please resolve the conflict first |
SafeJSONSession's save/update/load methods access the same session file concurrently without locking. When two coroutines write nearly simultaneously, OS-level writes interleave, producing a file with a valid JSON object followed by garbage fragments. Once corrupted, every request fails with JSONDecodeError: Extra data (422). Add JSONDecodeError catch with raw_decode fallback at all three json.loads call sites. When corruption is detected, the first valid JSON object is extracted and a warning is logged instead of crashing. Closes agentscope-ai#3277
- Extract duplicated raw_decode fallback into _safe_json_loads() helper - Add total-corruption fallback: returns empty dict instead of raising - Add 3 test cases for completely garbled files - Addresses gemini-code-assist review feedback on PR agentscope-ai#3278
- test_load_empty_file / test_get_empty_file / test_update_empty_file (zero-byte files) - test_load_null_bytes (\x00 padding) - test_load_double_write_overlap (two JSON objects concatenated) - test_load_whitespace_only (spaces/tabs/newlines only) All 16 tests pass. Pre-commit hooks clean (black, flake8, pylint 10/10, mypy).
1103877 to
3ce06a4
Compare

Description
Fix a P0 availability issue where session state JSON files become corrupted due to concurrent write race conditions, causing
JSONDecodeError: Extra dataon every subsequent request to the affected session.Root cause:
save_session_state,update_session_state, andget_session_state_dictaccess the same session file concurrently without any locking mechanism. When two coroutines executeopen("w")on the same file nearly simultaneously, the OS-level writes interleave, producing a file with a valid JSON object followed by garbage fragments. Once corrupted, every request to that session fails with a 422 error — no recovery is possible without manual intervention.Fix: Add
JSONDecodeErrorcatch withraw_decodefallback at all threejson.loadscall sites inSafeJSONSession. When corruption is detected, the first valid JSON object is extracted and a warning is logged, instead of crashing the request.Related Issue: Fixes #3277
Security Considerations: No changes to auth, config handling, or external-facing interfaces. Pure defensive parsing change.
Type of Change
Component(s) Affected
Checklist
pre-commit run --all-fileslocally and it passespytestor as relevant) and they passTesting
New tests added
tests/unit/agents/test_session.py— 7 test cases covering:test_load_valid_jsontest_load_corrupted_json_extra_datatest_load_corrupted_json_real_world_tailtest_get_corrupted_jsonget_session_state_dictrecovers from corruptiontest_update_corrupted_jsonupdate_session_staterecovers, then writes clean JSONtest_load_nonexistenttest_get_nonexistentLocal Verification Evidence
Additional Notes
Files changed
src/copaw/app/runner/session.pyJSONDecodeError→raw_decodefallback inload_session_state,update_session_state,get_session_state_dicttests/unit/agents/test_session.pyRecommended follow-ups (out of scope)
asyncio.Lock— Serialize concurrent reads/writes to eliminate race at sourceos.rename()