Workflow automation — whether human-driven, software-driven, or LLM-driven — does not fail because systems cannot reason. It fails because systems cannot reliably retrieve and contextualize the information they need to reason about.
This repository explores information retrieval as the organizing principle of intelligent systems, grounded in two complementary perspectives:
- Epistemic foundations: All modeling is structured loss. Before any retrieval can happen, ontological commitments must be made: what exists, what matters, what to discard. These commitments determine what a system can ever surface, and they are never neutral.
- The IR design space: Given bounded attention (human cognition, software memory, LLM context windows), how do we select, represent, retrieve, rank, and assemble information so that downstream reasoning remains reliable?
Every intelligent system — a person reading a report, a search engine ranking results, an LLM generating a response — passes through the same epistemic pipeline:
SYSTEM (high-dimensional reality)
│
│ Ontological Cut — what to capture, what to lose
│
▼
DATA (knower-independent capture)
│
│ Structuring — schema, indexing, representation
│
▼
INFORMATION (structured, queryable state)
│
│ Projection — purpose-relative filtering
│
▼
VIEW (what a specific consumer sees)
│
│ Interpretation — sense-making in context
│
▼
KNOWLEDGE (knower-internal understanding)
At every boundary, dimensionality is reduced and actionability increases. The tradeoffs are unavoidable — a representation that captures everything is just a useless copy of reality. The question is whether these losses are principled and auditable or implicit and ungoverned.
This is the null tool argument: declining to use a structured retrieval system does not avoid representation. It just makes representation implicit, private, and untraceable. Every workflow has an information retrieval strategy; the only question is whether it is explicit.
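As a rough illustration, the first three loss boundaries of the pipeline above can be sketched as plain functions. Everything in this sketch is hypothetical: the `RAW_EVENT` record, the `Ticket` type, the field names, and the `capture`, `structure`, `project`, and `dropped_fields` helpers are assumptions made for the example, not part of this repository. The point is only that each stage discards dimensions explicitly, so the loss can be named and audited.

```python
from dataclasses import dataclass

# Hypothetical raw event: a stand-in for "high-dimensional reality".
RAW_EVENT = {
    "user": "alice",
    "text": "Checkout page times out after login",
    "mouse_trace": [(10, 12), (11, 14)],  # discarded by the ontological cut
    "severity": "high",
    "timestamp": "2024-01-05T10:00:00",
}

# Ontological cut: the fields we commit to capturing. Anything else is lost.
CAPTURED_FIELDS = ("user", "text", "severity", "timestamp")


def capture(system_state: dict) -> dict:
    """System -> Data: keep only the committed fields."""
    return {key: system_state[key] for key in CAPTURED_FIELDS}


@dataclass(frozen=True)
class Ticket:
    """Data -> Information: a structured, queryable record."""
    user: str
    text: str
    severity: str
    timestamp: str


def structure(data: dict) -> Ticket:
    """Impose the schema on captured data."""
    return Ticket(**data)


# Projection: each consumer sees a purpose-relative subset of the fields.
VIEWS = {
    "on_call_engineer": ("text", "severity"),
    "account_manager": ("user", "severity"),
}


def project(info: Ticket, consumer: str) -> dict:
    """Information -> View: what a specific consumer sees."""
    return {field: getattr(info, field) for field in VIEWS[consumer]}


def dropped_fields(before: dict, after: dict) -> set:
    """Audit helper: name what a boundary discarded."""
    return set(before) - set(after)


data = capture(RAW_EVENT)
ticket = structure(data)
view = project(ticket, "on_call_engineer")

print(dropped_fields(RAW_EVENT, data))  # {'mouse_trace'}
print(view)  # {'text': 'Checkout page times out after login', 'severity': 'high'}
```

The final boundary (View → Knowledge) is left out on purpose: interpretation happens inside the consumer, so it is not something this kind of code can capture directly.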
The retrieval design space spans the full pipeline from raw data to assembled context:
graph LR
A[Design Space] --> B[Data Landscape]
B --> B1[Unstructured]
B --> B2[Semi-structured]
B --> B3[Structured]
A --> C[Query Landscape]
C --> C1[Fact]
C --> C2[Procedural]
C --> C3[Analytical]
C --> C4[Contextual]
A --> D[Representation]
D --> D1[Sparse]
D --> D2[Dense]
D --> D3[Hybrid]
D --> D4[Graph]
D --> D5[Summarized]
A --> E[Retrieval]
E --> E1[Lexical]
E --> E2[Semantic]
E --> E3[Hybrid]
E --> E4[Generative]
A --> F[Reranking & Generation]
F --> F1[Cross-encoders]
F --> F2[LLM rerankers]
F --> F3[RAG]
F --> F4[GenIR]
A --> G[Metadata & Summarization]
A --> H[Context Optimization]
A --> I[Dynamic Retrieval Loop]
A --> J[Post-retrieval Processing]
A --> K[Systems & Evaluation]
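To make one branch of this map concrete: hybrid retrieval is often built by fusing a lexical ranking with a semantic one. The sketch below uses reciprocal rank fusion (RRF), one common fusion method picked here purely for illustration. The document IDs and both input rankings are made up, and `k = 60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Documents missing from a list add nothing.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a lexical retriever and a semantic retriever.
lexical = ["doc_a", "doc_b", "doc_c"]
semantic = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that lexical and embedding scores live on different scales; the cost is that score magnitudes are ignored entirely.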
| Epistemic Layer | Loss Boundary | IR Pipeline Stage |
|---|---|---|
| System → Data | Ontological cut (what to capture) | Data landscape, ontology design |
| Data → Information | Structuring (how to represent) | Representation, chunking, metadata |
| Information → View | Projection (what to surface) | Retrieval, reranking, context optimization |
| View → Knowledge | Interpretation (how to use it) | Generation, RAG, human-in-the-loop reasoning |
Each transition is a governed loss — an intentional reduction of dimensionality that increases fitness for a specific purpose.
- All modeling is structured loss. You cannot retrieve what you did not represent. Ontology precedes data, and schema precedes query.
- Data and queries are co-dependent. The structure of data constrains possible queries. Anticipated queries inform how data should be represented and indexed. Design both together.
- Retrieval is a control process, not a static lookup. It adapts dynamically to evolving goals, contexts, and feedback. Treat it as a closed loop.
- Attention constraints are fundamental. Optimizing for utility within bounded context is more valuable than expanding capacity. Context quality beats context quantity.
- Views are semantic commitments, not summaries. Different consumers need different projections of the same base representation. What a view hides is as important as what it shows.
- Explicit bias beats implicit bias. A structured retrieval system makes its ontological commitments visible and auditable. The null tool (declining to formalize retrieval) does not eliminate bias; it just makes bias untraceable.
- Context construction defines reasoning quality. Downstream understanding emerges from how input is selected, ordered, and assembled, not from the reasoning engine alone.
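The attention and context-construction principles above can be illustrated with a small budgeted packer. This is a minimal sketch under stated assumptions: the `Chunk` type, the relevance scores, and the whitespace-based `count_tokens` are hypothetical stand-ins (a real system would use its model's tokenizer and its own scoring), and greedy selection by score is only one simple heuristic among many.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    score: float  # assumed relevance score from an upstream retriever


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def pack_context(chunks, budget):
    """Greedily select the highest-scoring chunks that fit the token budget.

    Returns (selected, skipped) so that what the context hides stays auditable.
    """
    selected, skipped, used = [], [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = count_tokens(chunk.text)
        if used + cost <= budget:
            selected.append(chunk)
            used += cost
        else:
            skipped.append(chunk)
    return selected, skipped


chunks = [
    Chunk("faq", "Reset the password from the account settings page", 0.91),  # 8 tokens
    Chunk("log", "Stack trace line " * 10, 0.55),                             # 30 tokens
    Chunk("policy", "Passwords expire every ninety days", 0.78),              # 5 tokens
]

selected, skipped = pack_context(chunks, budget=15)
print([c.doc_id for c in selected])  # ['faq', 'policy']
print([c.doc_id for c in skipped])   # ['log']
```

Returning the skipped chunks alongside the selected ones is a deliberate choice in this sketch: it keeps the projection explicit, in line with the principle that what a view hides matters as much as what it shows.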
.
├── README.md # This file
├── Design Space.md # Map of content (MOC) for the full design space
├── 00_Presentation_Flow/
│ └── Presentation Flow.md # Slide-deck-style presentation of the full argument
├── 10_Design_Space/
│ ├── 00 Epistemic Foundations.md # System → Data → Information → View → Knowledge framework
│ ├── 01 Introduction.md # IR as the automation bottleneck
│ ├── 02 Attention & Context.md # Bounded attention as system constraint
│ ├── 03 Central Design Question.md # The core retrieval problem
│ ├── 04 Retrieval as Decision.md # Retrieval choices govern workflow robustness
│ ├── 05 Data vs Query Landscape.md # Co-dependency of data and query design
│ ├── 06 Data Landscape.md # Unstructured / semi-structured / structured
│ ├── 07 Query Landscape.md # Fact / procedural / analytical / contextual
│ ├── 08 Data Representation.md # Sparse, dense, hybrid, graph representations
│ ├── 09 Retrieval Techniques.md # Lexical, semantic, hybrid, generative retrieval
│ ├── 10 Reranking.md # Cross-encoders, LLM rerankers, cascades
│ ├── 11 Generation (RAG, GenIR).md # Retrieval-augmented and generative IR
│ ├── 12 Metadata & Summarization.md # Tags, entity linking, multi-view storage
│ ├── 13 Chunking Strategies.md # Naive, semantic, adaptive chunking
│ ├── 14 Context Optimization.md # Maximizing utility under bounded context
│ ├── 15 Dynamic Retrieval Loop.md # Retrieval as adaptive feedback control
│ ├── 16 Post-retrieval Processing.md # Dedup, clustering, context assembly
│ ├── 17 Systems & Evaluation.md # Vector DBs, benchmarks, metrics
│ ├── 18 Implications for Automation.md # Retrieval-aware workflow design
│ ├── 19 Design Principles.md # Consolidated design guidance
│ └── 20 Conclusion.md # Synthesis
├── 90_References/
│ └── IR References.md # Curated bibliography
└── Templates/
  └── Concept Template.md # Template for expanding individual concepts
The epistemic foundations in this repository draw from work in The Epistemic Architecture of Models and Tools, which develops the philosophical grounding for structured loss, the null tool argument, and persona-relative projection in the context of compositional system modeling (MSML, GDS).