Claude/polymarket accuracy improvements #433
Open
ethernetabc1-source wants to merge 3 commits into 666ghj:main
… LLM retry

- Remove `"traceback": traceback.format_exc()` from all 51 HTTP API responses in graph.py, simulation.py, report.py; add exc_info=True to server-side logs so full stack traces are still captured without leaking to clients
- Replace CORS wildcard origins="*" with a configurable allowlist read from the new CORS_ORIGINS env var (defaults to localhost dev ports)
- Change FLASK_DEBUG default from True → False to prevent accidental debug mode in production deployments
- Add MIME magic-byte validation to FileParser._validate_file: blocks PE/ELF/ZIP files disguised with a .pdf/.txt extension, and enforces %PDF header on PDFs
- Rewrite LLMClient.chat with exponential-backoff retry (max 3, 2/4/8 s) for RateLimitError/APITimeoutError/APIConnectionError; re-raise APIStatusError (4xx) immediately without wasting retries; surface specific openai exceptions instead of bare Exception
- Add comparison_demo.py: self-contained before/after simulation of all four improvements, runnable without real API credentials

https://claude.ai/code/session_01TSufK4MuqeYHvT6m3855CE
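For readers who want to see the retry behaviour described in this commit in isolation, here is a minimal sketch. It is not the code from `LLMClient.chat`: `call_with_backoff`, `TransientError`, and `ClientError` are hypothetical names, the two exception classes stand in for the openai exceptions named above, and "max 3, 2/4/8 s" is read here as up to three retries after the first attempt, which is an assumption.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for rate-limit, timeout, and connection errors."""


class ClientError(Exception):
    """Hypothetical stand-in for a non-retryable 4xx status error."""


def call_with_backoff(
    fn: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...] = (TransientError,),
    max_retries: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying only `retryable` errors with delays of 2, 4, 8 s."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_retries:
                # Retries exhausted: surface the specific exception as-is.
                raise
            # Exponential backoff: base_delay * 2**attempt -> 2, 4, 8 seconds.
            sleep(base_delay * (2 ** attempt))
        # Any exception not listed in `retryable` (e.g. ClientError) is not
        # caught here, so it propagates immediately without using a retry.
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. If real openai exception classes were substituted for the stand-ins, their inheritance relationships would need checking so that retryable errors are matched before the immediate re-raise path.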
…use case

Translate all service-layer system prompts and user prompt templates from Chinese to English, and remove all China-specific hardcoding:

- ontology_generator: rewrite ONTOLOGY_SYSTEM_PROMPT in English; add geopolitical entity type examples (GovernmentOfficial, Military, ThinkTank, Diplomat, Trader, Analyst) for prediction market scenarios; translate user message template
- oasis_profile_generator: change system prompt from '使用中文' ("use Chinese") to 'Write in English'; translate both individual and group persona prompt templates; change country field from '国家(使用中文,如"中国")' (country, written in Chinese, e.g. "China") to English country names
- simulation_config_generator: replace Beijing/Chinese timezone assumptions in time config, event config, and agent config prompts with global news cycle patterns (analyst business hours, trader market hours, evening citizen peaks); update default time config peak hours from [19,20,21,22] to [14..21] UTC range; translate all system prompts and user prompts to English
- report_agent: translate PLAN_SYSTEM_PROMPT, PLAN_USER_PROMPT_TEMPLATE, SECTION_SYSTEM_PROMPT_TEMPLATE, SECTION_USER_PROMPT_TEMPLATE, CHAT_SYSTEM_PROMPT_TEMPLATE, and all ReACT loop message strings to English; add 'predicted_probability' (0-100) field to the report outline JSON output; add mandatory "Prediction Verdict" final section with probability estimate and top factors driving the prediction up/down
- zep_tools: translate sub-query generation, agent selection, interview question generation, and interview summary prompts to English; replace Chinese quote marks 「」 with standard double quotes; translate fallback strings
- .env.example: add CORS_ORIGINS variable; update comments to English

https://claude.ai/code/session_01TSufK4MuqeYHvT6m3855CE
…rdict

Report outline now outputs a richer probability object:

- predicted_probability: integer point estimate (0-100)
- probability_low / probability_high: 80% confidence interval bounds
- key_upside_factors / key_downside_factors: string arrays

PLAN_SYSTEM_PROMPT updated with three-step calibration rule:

1. Base rate anchor (historical frequency of this event type)
2. Simulation signal (did agents escalate or de-escalate?)
3. Market anchor (compare to Polymarket/Metaculus price if provided)

This forces probability to be grounded in base rates rather than pure LLM intuition, the main cause of overconfidence in single-run estimates.

SECTION_SYSTEM_PROMPT for "Prediction Verdict" section now requires:

a) Explicit base-rate statement
b) Simulation signal summary
c) Market comparison (if price given in scenario)
d) Probability verdict with range
e) Upside / downside risk bullet lists
f) Confidence note explaining main uncertainty source

New script: backend/scripts/ensemble_predict.py

- Runs N independent simulations for the same simulation_id
- Collects predicted_probability + range from each report
- Aggregates: mean point estimate, stdev-widened confidence interval
- Extracts consensus factors (mentioned in ≥2 runs)
- Prints a formatted verdict table and optionally writes JSON output
- Usage: python ensemble_predict.py --simulation-id sim_xxx --runs 3

https://claude.ai/code/session_01TSufK4MuqeYHvT6m3855CE
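The aggregation step listed for the ensemble script can be sketched roughly as follows. This is an illustration only, not the contents of `ensemble_predict.py`: the function name `aggregate_runs` is hypothetical, "stdev-widened" is assumed to mean pushing the averaged bounds outward by one sample standard deviation of the point estimates, and factors are assumed to match only on identical strings.

```python
from collections import Counter
from statistics import mean, stdev
from typing import Dict, List


def aggregate_runs(runs: List[Dict], min_mentions: int = 2) -> Dict:
    """Combine per-run probability objects into one ensemble verdict."""
    points = [r["predicted_probability"] for r in runs]
    # Sample stdev needs at least two runs; a single run adds no widening.
    spread = stdev(points) if len(points) > 1 else 0.0

    # Assumption: widen the averaged interval by one stdev on each side,
    # then clamp to the valid 0-100 range.
    low = max(0.0, mean(r["probability_low"] for r in runs) - spread)
    high = min(100.0, mean(r["probability_high"] for r in runs) + spread)

    def consensus(key: str) -> List[str]:
        # Count each factor at most once per run, keep those mentioned in
        # at least `min_mentions` runs, most frequent first.
        counts = Counter(f for r in runs for f in set(r.get(key, [])))
        kept = [(f, n) for f, n in counts.items() if n >= min_mentions]
        return [f for f, _ in sorted(kept, key=lambda fn: (-fn[1], fn[0]))]

    return {
        "predicted_probability": mean(points),
        "probability_low": low,
        "probability_high": high,
        "key_upside_factors": consensus("key_upside_factors"),
        "key_downside_factors": consensus("key_downside_factors"),
    }
```

In practice, LLM-generated factor strings rarely repeat verbatim across runs, so a real consensus step would probably need some normalisation or fuzzy matching before counting.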
melevsky approved these changes on Apr 1, 2026
Hermes Agent Code Review
Automated review completed ✅
✅ Approved
- Security: Proper credential handling, removed traceback exposure
- Reliability: LLM API retry mechanism for transient errors
- Features: Well-implemented ensemble prediction for accuracy improvement
- Internationalization: English prompt support for global use
💡 Notes
- Large PR (1310+ lines) but changes are well-structured
- Print statements in scripts/demos are acceptable
- Consider adding tests for ensemble functionality in future PRs
Recommendation: Ready for merge. This enhances both security and prediction accuracy.
Reviewed by Hermes Agent at $(date)
No description provided.