Traceable Ideological Education RAG

Traceable KG-RAG system for ideological education with citation grounding, multi-agent review, political safety checking, and cross-modal learning interaction.

This repository is part of a National College Student Innovation and Entrepreneurship Training Program project. The current focus is to build a traceable retrieval backend, stable API contract, and evidence-grounded response workflow for ideological education materials.

Status: backend-first prototype with real FAISS smoke validation. The current implementation focuses on the retrieval API, schema stability, citation structure, generation scaffolding, source checking, policy checking, and demo validation. The FAISS path can be enabled for /retrieve by setting DACHUANG_VECTOR_BACKEND=faiss; the default path remains the stable lightweight retriever. Full cross-modal interaction, digital-human presentation, and XR sandbox workflows are planned for later stages.

What It Does

Current implemented capabilities:

Provides a FastAPI backend.
Exposes /health and /retrieve APIs.
Returns a stable retrieval response structure.
Reads processed ideological education text chunks.
Validates a 245-chunk ideological education corpus through a Qwen text-embedding-v4 + FAISS smoke workflow.
Provides an optional FAISS vector backend for retrieve_vector.
Uses a schema-driven API contract.
Includes graph storage, evidence generation, source checking, and policy checking modules.
Maintains demo questions and basic acceptance tests.
Keeps project deliverables separated from core code.

Planned capabilities:

Promote the optional FAISS backend to the default vector path after chunk IDs and GraphSim triples are fully aligned.
Strengthen hybrid retrieval with keyword, vector, structured fields, and reranking.
Add knowledge graph reasoning for people, events, places, timelines, and ideological concepts.
Add a multi-agent review workflow for generation, citation auditing, and political safety review.
Add timeline, map, event-card, digital-human, or XR-based learning interaction.

Why This Project Exists

Ideological education materials usually involve long historical timelines, complex relationships between people and events, and strict requirements for source-grounded expression. A general open-domain chatbot may generate plausible but unsupported answers, which makes it unsuitable for teaching support.

This project follows a retrieval-first and audit-first workflow:

retrieve evidence -> generate answer -> check citations -> review safety boundary -> display to learners
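The workflow above can be sketched as a chain of small functions that stops as soon as evidence or citations are missing. This is a minimal illustration, not the repository's implementation: the Evidence and PipelineResult types, the keyword-matching retriever, the placeholder generator, the blocked-term check, and the final_decision values ("approved", "rejected", "insufficient_evidence") are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    chunk_id: str  # hypothetical identifier for a text chunk
    text: str


@dataclass
class PipelineResult:
    answer: str
    citations_used: list = field(default_factory=list)
    final_decision: str = "insufficient_evidence"  # hypothetical label


def retrieve(query, corpus):
    """Toy retriever: keep chunks that contain any whitespace-separated query term."""
    terms = [t.lower() for t in query.split()]
    return [ev for ev in corpus if any(t in ev.text.lower() for t in terms)]


def generate(query, evidence):
    """Placeholder generator: quote the first evidence chunk and cite it."""
    top = evidence[0]
    return f"{top.text} [{top.chunk_id}]", [top.chunk_id]


def check_citations(cited_ids, evidence):
    """Every cited chunk ID must belong to the retrieved evidence."""
    known = {ev.chunk_id for ev in evidence}
    return bool(cited_ids) and all(cid in known for cid in cited_ids)


def review_safety(answer, blocked_terms):
    """Toy safety review: reject answers containing any blocked term."""
    return not any(term in answer for term in blocked_terms)


def answer_query(query, corpus, blocked_terms=()):
    evidence = retrieve(query, corpus)
    if not evidence:
        return PipelineResult(answer="Insufficient evidence.")
    answer, cited = generate(query, evidence)
    if not check_citations(cited, evidence):
        return PipelineResult(answer="Insufficient evidence.")
    if not review_safety(answer, blocked_terms):
        return PipelineResult(answer="Withheld by safety review.",
                              final_decision="rejected")
    return PipelineResult(answer=answer, citations_used=cited,
                          final_decision="approved")
```

The ordering is the point of the sketch: nothing reaches the learner unless retrieval, citation checking, and safety review have all passed.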

Architecture

Target architecture:

User Query
    |
    v
Intent Router
    |
    v
Hybrid Retriever
    |
    v
Knowledge Graph Reasoner
    |
    v
Generation Agent
    |
    v
Citation Auditor
    |
    v
Political Safety Auditor
    |
    v
Cross-modal Interaction Layer

The current repository implements the backend foundation first. Some downstream modules are present as scaffolding or planned integration points, not as a completed production system.

Tech Stack

Backend: FastAPI, Python
Retrieval: hybrid retriever scaffold, FAISS-ready dependency
Graph module: NetworkX-ready graph store
Schema and config: YAML, JSON
Validation and tests: pytest, JSONL validation utilities

Current Repository Scope

src/                   Core backend code
configs/               Schema and API contract
data/                  Runtime demo data
tests/                 Acceptance and module tests
docs/                  Knowledge-base ingestion notes
team_deliverables/     Reports, presentation materials, drafts, and team notes
outputs/               Local outputs
README_run.md          Local running guide
README_architecture.md Retrieval architecture notes

Core code should stay outside team_deliverables/. Runtime data should stay inside data/. Presentation drafts and team materials can stay in team_deliverables/.

Current Retrieval Experiment

The latest local FAISS smoke test uses:

data/processed/text_chunks_sizheng_v1.jsonl
data/processed/text_chunks_sizheng_v2.jsonl
text-embedding-v4
1024-dimensional embeddings
IndexFlatIP with L2 normalization
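An inner-product index ranks by the raw dot product, so L2-normalizing both the chunk vectors and the query vector makes each score equal to cosine similarity. The sketch below shows that relationship in pure Python; it is a stand-in for illustration only and does not call FAISS or reproduce the smoke script. The function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query_vec, chunk_vecs, k):
    """Return the top-k (chunk_index, score) pairs by inner product.

    Because both sides are L2-normalized first, the score is the cosine
    similarity between the query and each chunk.
    """
    q = l2_normalize(query_vec)
    scored = [(inner_product(q, l2_normalize(v)), i)
              for i, v in enumerate(chunk_vecs)]
    # Highest score first; ties broken by the lower chunk index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:k]]
```

Without normalization, a long vector such as [10, 10] would outrank an exact directional match like [1, 0] for the query [2, 0], which is why the normalization step matters for an inner-product index.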

The 2026-06-23 smoke run on 245 chunks and 10 demo questions reported:

Metric	Value
Recall@1	1.0000
Recall@3	1.0000
Recall@5	1.0000
MRR	1.0000
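For reference, Recall@k and MRR are commonly computed as in the sketch below. It assumes one gold chunk ID per demo question and treats Recall@k as the share of questions whose gold chunk appears in the top k; the smoke script's exact definitions may differ, and the function names are hypothetical.

```python
def recall_at_k(runs, k):
    """Share of questions whose gold chunk ID appears in the top-k results.

    `runs` is a list of (ranked_chunk_ids, gold_chunk_id) pairs.
    """
    hits = sum(1 for ranked, gold in runs if gold in ranked[:k])
    return hits / len(runs)


def mean_reciprocal_rank(runs):
    """Average of 1/rank of the gold chunk; a missing gold chunk contributes 0."""
    total = 0.0
    for ranked, gold in runs:
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(runs)
```

Under these definitions, a value of 1.0000 on every metric means the gold chunk was ranked first for every question.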

These figures come from an engineering smoke test, not a formal paper experiment. The scores include the smoke script's lightweight section-aware rerank for near-neighbor sections. The validated FAISS path can be enabled by setting DACHUANG_VECTOR_BACKEND=faiss and DACHUANG_FAISS_INDEX_DIR, while /retrieve keeps the same stable response contract.
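One way such an environment switch could be resolved at startup is sketched below. Only the two variable names come from this README; the "lightweight" backend label, the function name, and the choice to fall back to the default when the index directory is unset are assumptions for illustration.

```python
import os

# Hypothetical label for the default retriever; the README does not name it.
DEFAULT_BACKEND = "lightweight"


def resolve_vector_backend(environ=None):
    """Pick the vector backend from environment variables.

    Returns a dict with the backend name and, for FAISS, the index directory.
    """
    env = os.environ if environ is None else environ
    backend = env.get("DACHUANG_VECTOR_BACKEND", "").strip().lower()
    if backend != "faiss":
        return {"backend": DEFAULT_BACKEND, "index_dir": None}
    index_dir = env.get("DACHUANG_FAISS_INDEX_DIR", "").strip()
    if not index_dir:
        # Assumed behavior: fall back to the default retriever instead of
        # failing, so the API still answers when no index is configured.
        return {"backend": DEFAULT_BACKEND, "index_dir": None}
    return {"backend": "faiss", "index_dir": index_dir}
```

Passing a plain dict as `environ` makes the selection logic easy to test without touching the real process environment.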

Quick Start

Install dependencies:

pip install -r requirements.txt

Start the API:

uvicorn src.api.main:app --reload

Open API docs:

http://127.0.0.1:8000/docs

Health check:

GET /health

Retrieve API:

POST /retrieve

Example request:

{
  "query": "延安时期思想政治教育有什么特点？"
}
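The same request can be sent from Python with the standard library alone. This is a minimal client sketch that assumes the API is running locally on the default uvicorn port shown above; the helper names are hypothetical.

```python
import json
import urllib.request


def build_retrieve_request(query, base_url="http://127.0.0.1:8000"):
    """Build a POST /retrieve request carrying the query as a JSON body."""
    body = json.dumps({"query": query}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/retrieve",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_retrieve(query, base_url="http://127.0.0.1:8000", timeout=30):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_retrieve_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    result = call_retrieve("延安时期思想政治教育有什么特点？")
    print(result.get("status"))
```

Encoding the body as UTF-8 with ensure_ascii=False keeps the Chinese query readable on the wire; the default escaped form is equally valid JSON.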

Run tests:

python -m pytest tests -q

API Contract

The current /retrieve response includes:

status
project
query
query_entities
vector_hits
graph_hits
hybrid_hits
answer
citations_used
generator_mode
generator_provider
provider_status
used_fallback
source_check
policy_check
agent_trace
final_decision

vector_hits records vector-side evidence candidates, graph_hits records GraphSim/graph-side candidates, and hybrid_hits records fused evidence blocks with citations and scores. citations_used, source_check, policy_check, agent_trace, and final_decision belong to the current three-agent scaffold. New frontend or teammate integrations should not rely on the legacy citations or debug fields.
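A consumer can guard against contract drift with a small key check like the sketch below. The field list is copied from this section; the function name is hypothetical, and the check assumes the legacy fields are literally named citations and debug. It verifies only that top-level keys are present, not the shape of their values.

```python
REQUIRED_FIELDS = (
    "status", "project", "query", "query_entities",
    "vector_hits", "graph_hits", "hybrid_hits",
    "answer", "citations_used",
    "generator_mode", "generator_provider", "provider_status", "used_fallback",
    "source_check", "policy_check", "agent_trace", "final_decision",
)

# Assumed names of the legacy fields that new integrations should avoid.
LEGACY_FIELDS = ("citations", "debug")


def check_retrieve_response(payload):
    """Report missing contract fields and any legacy fields still present."""
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    legacy = [name for name in LEGACY_FIELDS if name in payload]
    return {"ok": not missing, "missing": missing, "legacy": legacy}
```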

Roadmap

Phase	Goal
Phase 1	Build KG-RAG backend, schema, retrieval API, and validation questions
Phase 2	Add stronger vector retrieval and knowledge graph modules
Phase 3	Integrate generation, citation auditing, and political safety auditing agents
Phase 4	Build timeline, map, and event-card interaction layers
Phase 5	Explore XR sandbox, digital-human explanation, and classroom demo workflow

Safety and Use Boundary

This system is designed for ideological education support, teaching demonstration, and source-grounded knowledge retrieval. It does not replace teachers, official teaching materials, or final human review.

Generated content should be used only after citation checking and safety review. When evidence is missing, mismatched, or insufficient, the system should report insufficient evidence rather than produce unsupported claims.
