Commit 1a240bc

Tyler committed
feat: migrate knowledge graph workflow to JanusGraph with controlled NER ingestion
Adopt JanusGraph as the graph backend and add a review-first NER insertion flow with ontology-aligned predicate validation to reduce noisy writes. Refresh OpenKG movie-domain bootstrap data, conversion tooling, and docs to keep runtime behavior and guidance consistent.
1 parent 55dc251 commit 1a240bc
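
The review-first insertion flow named in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the `Candidate` class, the helper names, and the sample rows are all hypothetical, and the TSV is assumed to hold three tab-separated columns (subject, predicate, object). The actual JanusGraph write is left out.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical candidate triple produced by the LLM NER step."""
    candidate_id: int
    subject: str
    predicate: str
    obj: str
    can_insert: bool = False


def load_allowed_predicates(tsv_text):
    """Collect the predicate set, assuming predicate is column 2 of 3."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[1].strip() for row in reader if len(row) >= 3}


def validate(candidates, allowed):
    """Write-time check: only ontology-aligned predicates may be inserted."""
    for c in candidates:
        c.can_insert = c.predicate in allowed
    return candidates


def remove_candidates(candidates, ids_to_remove):
    """Review step: drop the candidates the reviewer rejected by ID."""
    drop = set(ids_to_remove)
    return [c for c in candidates if c.candidate_id not in drop]


def insertable(candidates):
    """Batch step: keep only triples flagged can_insert=True."""
    return [(c.subject, c.predicate, c.obj) for c in candidates if c.can_insert]


# Invented sample rows standing in for data/openkg_triples.tsv.
SAMPLE_TSV = "星际穿越\t导演\t克里斯托弗·诺兰\n星际穿越\t类型\t科幻\n"
allowed = load_allowed_predicates(SAMPLE_TSV)

candidates = validate(
    [
        Candidate(1, "银翼杀手", "导演", "雷德利·斯科特"),
        Candidate(2, "银翼杀手", "喜欢", "观众"),  # predicate not in the ontology
        Candidate(3, "银翼杀手", "类型", "科幻"),
    ],
    allowed,
)
remaining = remove_candidates(candidates, [3])  # reviewer rejects candidate 3
batch = insertable(remaining)
print(batch)
```

In this sketch, candidate 2 survives review but is still excluded from the batch because its predicate is outside the allowed set, which is the kind of noisy write the commit message aims to reduce.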

12 files changed

Lines changed: 16382 additions & 1028 deletions

README.md

Lines changed: 12 additions & 5 deletions
@@ -87,14 +87,21 @@ Note: If no preloaded file is present, the system will still start but the text
8787
### Preloaded Knowledge Graph (Read-Only)
8888

8989
The system automatically loads a preloaded Chinese knowledge graph if available:
90-
- **Primary Source**: `data/openkg_triples.tsv` - Real OpenKG concept hierarchy data (290 entities, 254 relations)
91-
- **Fallback**: `data/preloaded_knowledge_graph.json` - Alternative format if TSV not available
92-
- **Auto-generation**: Run `python tools/openkg_generator.py` to download fresh OpenKG sample data
93-
- **Format**: TSV format with concept-category relationships (e.g., "移动应用 属于 软件")
94-
- **Data Source**: OpenKG OpenConcepts project from GitHub
90+
- **Primary Source**: `data/openkg_triples.tsv` (currently generated from OpenKG Douban movie subset with sci-fi closure filtering)
91+
- **Fallback**: `data/preloaded_knowledge_graph.json` (used only when TSV is unavailable)
92+
- **Generator**: `python tools/openkg_movie_to_triples.py --scifi-closure --seed-genre 科幻`
93+
- **Storage Backend**: JanusGraph (Gremlin Server) is the single graph backend; TSV/JSON are bootstrap inputs only
9594

9695
The knowledge graph powers entity recognition and context engineering features.
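
The primary/fallback order described in the added README lines could look something like the sketch below. It only covers choosing the bootstrap input and does not touch JanusGraph. The function name is hypothetical, and the JSON layout (`{"triples": [[s, p, o], ...]}`) is an assumption, since the diff does not show the schema of `preloaded_knowledge_graph.json`.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_bootstrap_triples(tsv_path, json_path):
    """Return triples from the TSV if it exists, else from the JSON fallback."""
    tsv_path, json_path = Path(tsv_path), Path(json_path)
    if tsv_path.is_file():
        with tsv_path.open(encoding="utf-8", newline="") as fh:
            reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
            return [tuple(row[:3]) for row in reader if len(row) >= 3]
    if json_path.is_file():
        with json_path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Assumed layout: {"triples": [[subject, predicate, object], ...]}
        return [tuple(t) for t in data.get("triples", [])]
    # No preloaded file: start with an empty bootstrap set.
    return []


# Demonstrate the order with throwaway files and placeholder triples.
with tempfile.TemporaryDirectory() as tmp:
    tsv = Path(tmp) / "openkg_triples.tsv"
    fallback = Path(tmp) / "preloaded_knowledge_graph.json"

    nothing = load_bootstrap_triples(tsv, fallback)

    fallback.write_text(json.dumps({"triples": [["A", "rel", "B"]]}), encoding="utf-8")
    from_json = load_bootstrap_triples(tsv, fallback)

    tsv.write_text("C\trel\tD\n", encoding="utf-8")
    from_tsv = load_bootstrap_triples(tsv, fallback)
```

Once the TSV exists it takes precedence, so the JSON file is read only when the TSV is missing.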
9796

97+
### NER Review-to-Insert Workflow (Controlled Write)
98+
99+
The knowledge-graph page supports a controlled pipeline:
100+
- Extract candidate triples from free text using LLM NER
101+
- Remove unwanted candidates by candidate IDs
102+
- Insert remaining candidates in batch (`can_insert=true` only)
103+
- Enforce ontology alignment: predicates must match the predicate set in `data/openkg_triples.tsv` (both prompt-time constraint and write-time validation)
104+
98105
### Start the System
99106

100107
```bash

README_zh.md

Lines changed: 16 additions & 9 deletions
@@ -189,14 +189,21 @@ pip install -r requirements.txt
189189
### 预加载知识图谱(只读)
190190

191191
系统会自动加载预加载的中文知识图谱(如果可用):
192-
- **主要来源**`data/openkg_triples.tsv` - 真实的 OpenKG 概念层次数据(290 个实体,254 个关系)
193-
- **备用格式**`data/preloaded_knowledge_graph.json` - 如果 TSV 不可用时的替代格式
194-
- **自动生成**:运行 `python tools/openkg_generator.py` 下载最新的 OpenKG 样本数据
195-
- **格式**:TSV 格式,包含概念-类别关系(例如:"移动应用 属于 软件")
196-
- **数据源**:GitHub 上的 OpenKG OpenConcepts 项目
192+
- **主要来源**`data/openkg_triples.tsv`(当前默认使用 OpenKG Douban 电影子集,含科幻闭包筛选结果)
193+
- **备用格式**`data/preloaded_knowledge_graph.json`(仅在 TSV 不可用时兜底)
194+
- **生成脚本**`python tools/openkg_movie_to_triples.py --scifi-closure --seed-genre 科幻`
195+
- **存储后端**:JanusGraph(Gremlin Server)为唯一图数据后端;TSV/JSON 仅用于初始化补种
197196

198197
知识图谱为实体识别和上下文工程功能提供支持。
199198

199+
### NER 审阅入库流程(受控写入)
200+
201+
知识图谱页支持“文本抽取 -> 人工审阅 -> 入库”的受控流程:
202+
- **抽取候选**:输入文本后调用 LLM 提取候选三元组
203+
- **删除候选**:通过候选 ID 删除不需要的行
204+
- **批量写入**:写入剩余候选(仅 `可入库=true` 生效)
205+
- **本体约束**:谓词需与 `data/openkg_triples.tsv` 谓词集合对齐(前置提示词约束 + 后置写入校验)
206+
200207
### 启动系统
201208

202209
```bash
@@ -385,7 +392,7 @@ graph TB
385392
386393
subgraph "💾 存储层"
387394
FileStore["文件存储<br/>• 模型 (pkl, h5)<br/>• 索引 (json)<br/>• 图像 (png, jpg)"]
388-
DataStore["数据存储<br/>• CTR 数据 (json)<br/>• 知识图谱 (pkl)<br/>• 训练检查点"]
395+
DataStore["数据存储<br/>• CTR 数据 (json)<br/>• 知识图谱初始化数据 (tsv/json)<br/>• 训练检查点"]
389396
end
390397
391398
subgraph "🏗️ 基础设施"
@@ -560,14 +567,14 @@ Testbed/
560567
│ ├── wide_deep_ctr_model.h5 # Wide & Deep CTR
561568
│ ├── wide_deep_ctr_model_tf_serving/ # TF Serving 格式
562569
│ ├── index_data.json # 倒排索引
563-
│ ├── knowledge_graph.pkl # 知识图谱
564570
│ └── images/ # 图像嵌入
565571
│ ├── image_index.json
566572
│ └── image_embeddings.npy
567573
568574
├── data/ # 数据存储
569575
│ ├── preloaded_documents.json # 维基百科语料
570-
│ ├── preloaded_kg.json # 预加载知识图谱
576+
│ ├── preloaded_knowledge_graph.json # 预加载知识图谱(备用)
577+
│ ├── openkg_triples.tsv # OpenKG 三元组(默认初始化输入)
571578
│ ├── ctr_data.json # CTR 训练数据
572579
│ ├── llmops/ # LLMOps 数据集
573580
│ │ ├── dataset_info.json # 数据集注册表
@@ -599,7 +606,7 @@ Testbed/
599606
600607
├── tools/ # 工具
601608
│ ├── model_service_manager.py # 模型服务生命周期
602-
│ ├── openkg_generator.py # 知识图谱数据生成
609+
│ ├── openkg_movie_to_triples.py # OpenKG 电影三元组生成
603610
│ └── performance_monitor.py
604611
605612
├── start_system.py # 主启动脚本
