
04. Extending DAAF

This guide focuses on the primary extension path: bringing new datasets, data domain expertise, and methodological tooling into DAAF for your own purposes. If you want to make any of these modifications available to the broader community by sharing these changes/extensions back with the DAAF project, see 05. Contributing to DAAF.

Back to main




The Extension Model: Skills, Agents, and Data-Ingest

Here's the fundamental insight behind DAAF's extensibility: the framework separates what it knows from how it behaves. This distinction is what makes the whole extension model work, so it's worth spelling out clearly.

Building on the initial discussion of agents and skills in 02. Understanding and Working with DAAF: the framework has two main types of building blocks:

  • Skills are structured knowledge documents. They tell DAAF's agents what they need to know about a specific topic -- a data source, a Python library, a visualization framework, a domain of expertise. Think of skills as thorough, well-organized reference guides that an agent loads into its context when it needs specialized knowledge to do its job; because they're just documents, they can easily be shared or reused across multiple agents.

  • Agents are behavioral protocols. They tell a subagent how to behave -- what steps to follow, what to validate, when to stop, how to format output. Think of agents as detailed job descriptions that define a specific role in the pipeline (the code reviewer, the data planner, the report writer, etc.).

This separation is what makes DAAF extensible without being fragile. When you want DAAF to work with a new dataset, you generally shouldn't need to touch the workflow, the validation logic, or the agent protocols at all. You just add a new skill that teaches the existing agents about the new data. The agents already know how to fetch, clean, transform, and analyze data -- they just need to be told the specifics of your data.

The Three Extension Paths

| Extension Type | What You're Adding | Tool to Use | Result |
| --- | --- | --- | --- |
| Data source | Knowledge about a specific dataset | data-ingest agent | A new data-source-skill |
| Methodology | Knowledge about a statistical or analytical method | skill-authoring skill | A new methodology-skill |
| Domain expertise | Knowledge about a content area or field | skill-authoring skill | A new context-skill |

The most common extension path by far -- and the one I'll spend the most time on in this guide -- is adding new data sources. DAAF ships with a dedicated agent specifically for this purpose: the data-ingest agent, which does the heavy lifting of profiling a dataset and generating the skill documentation for you. You still need to review its output (this is always true with DAAF), but it should dramatically reduce the manual effort involved.

For methodology and domain expertise skills, the process is lighter-weight -- you ask DAAF to use the skill-authoring skill, point it at documentation or literature to research, and it drafts a skill for you to review and refine. I'll cover that process too, but it's more straightforward than data ingestion.

Step-by-Step: Profiling a New Dataset with Data-Ingest

The data-ingest agent is DAAF's built-in tool for turning a raw dataset (or online dataset source) into a comprehensive data source skill that DAAF can use in tandem with its existing data source skills. It automates the tedious but critical work of profiling every column, detecting coded values, checking data quality, and reconciling what any provided documentation says against what the data actually contains.

Before You Start

You'll need:

  1. A data file or link in a supported format (parquet, CSV, Excel, or TSV). Public data sources are strongly preferred. If you're working with proprietary or sensitive data, please be extremely careful to abide by your organization's AI policy and data protection standards -- Claude will be examining the actual contents of the data.
  2. Any available documentation -- codebooks, data dictionaries, README files, or documentation website URLs. These aren't strictly required, but they dramatically improve the quality of the resulting skill because the agent can cross-reference what the documentation says against what the data actually shows.
  3. A sense of how the data will be used -- what research questions it might inform, what domain it belongs to, and which columns are most important for your purposes.

Where Your New Skill Will Fit in

When you ask DAAF to work with any current data source (say, CCD enrollment data), here's the flow:

  1. The orchestrator dispatches a subagent to explore available data (Stage 2)
  2. The subagent loads the relevant skill (e.g., education-data-source-ccd) into its context
  3. The skill tells the subagent everything it needs: what variables exist, what the coded values mean, what the known pitfalls are, how to access the data
  4. The subagent uses that knowledge to do its job and returns findings to the orchestrator

The key thing to understand: when you add a new data source skill, the orchestrator just needs to know what the skill is and when it would be useful, so it can tell its subagents when to load it for their specific task. To do this, you're primarily adding knowledge to the system at two points:

  1. Exploration (Stage 2-3): Your skill tells agents what data is available, what variables exist, and what caveats to watch for
  2. Context application (Stage 6): Your skill tells agents how to handle coded values, missing data patterns, and source-specific quirks during cleaning

The fetch mechanics (Stage 5) are mostly handled by the query skill and mirror configuration. If your data source is available through the existing mirrors, you may not need to change anything there. If your data comes from a different source entirely, you'll need to either add a new mirror configuration or provide the data files directly.


Preparing Your Data

Place your data file somewhere accessible within the Docker volume (the easiest spot is the research/ directory or a subfolder of it). Exactly where doesn't really matter, as long as you provide Claude with the actual filepaths when you start the conversation. If you have documentation files, put those inside the same folder. See 01. Installation and Quickstart for reminders on managing files within the Docker volume if needed.

A few practical considerations:

  • File size: The agent can handle files up to about 1GB without special handling. For larger files, it'll ask you about a sampling strategy before proceeding.
  • File format: Parquet is ideal (fast, preserves types). CSV works fine but may have type inference quirks. Excel files work via the openpyxl library, which is included in the standard DAAF Docker installation.
  • Multiple files: If your data source spans multiple files (e.g., one file per year), start with a single representative file. The skill can document the multi-file structure, but profiling works best on one file at a time.
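To make the file-size consideration concrete, here's a minimal sketch of one sampling strategy the agent might agree on with you for an oversized CSV -- the file paths and keep fraction are hypothetical, and this is just standard-library Python, not DAAF's own implementation:

```python
import csv
import random

def sample_csv(src_path, dst_path, keep_fraction=0.05, seed=42):
    """Write a reproducible random sample of a large CSV, preserving the header."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))          # always keep the header row
        for row in reader:
            if rng.random() < keep_fraction:   # keep roughly 5% of data rows
                writer.writerow(row)
```

Profiling a sample like this trades a little statistical coverage for a lot of speed; rare coded values may be missed, which is one reason the review step later matters.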

Running the Data-Ingest Agent

You don't need to invoke the agent directly -- just ask DAAF conversationally. Something like:

```
I have a new dataset I'd like to profile and integrate into DAAF.
The data file is at: /daaf/research/my-data/state_spending_2023.parquet
I also have a codebook at: /daaf/research/my-data/codebook.xlsx
The documentation website is: https://example.gov/data-documentation

This is state-level education spending data. I'd like to use it
for analyzing per-pupil expenditure trends across states. The most
important columns are probably the ones related to total spending,
enrollment counts, and state identifiers.
```

DAAF will classify this as a data-ingest task and dispatch the data-ingest agent, which will then execute a systematic profiling protocol:

Phase 1 -- Structural Profile: Basic shape of the data (rows, columns, memory footprint, column types). This gives the agent a bird's-eye view of what it's working with.

Phase 2 -- Column-Level Profile: Detailed statistics for every column -- null rates, unique value counts, distributions, min/max ranges. For numeric columns, it checks for potential coded values (those suspicious negative numbers like -1, -2, -9 that often mean "missing" or "suppressed" rather than being real values). For categorical columns, it enumerates all unique values.
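The coded-value check in Phase 2 boils down to something like the following sketch -- the sentinel set and function name are illustrative, not DAAF's actual code:

```python
# Common sentinel codes that often mean "missing"/"suppressed" in public data.
SENTINELS = {-1, -2, -9, -99, -999}

def detect_coded_values(column_values):
    """Return the sentinel codes present in a numeric column, with counts."""
    found = {}
    for v in column_values:
        if v in SENTINELS:
            found[v] = found.get(v, 0) + 1
    return found
```

A column where -9 appears hundreds of times alongside otherwise-plausible positive values is a strong hint that -9 is a code, not a measurement.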

Phase 3 -- Relationship Profiling: Identifying potential key columns (high uniqueness suggests an identifier), foreign keys (naming patterns like _id suffixes), and hierarchical relationships between columns.
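The key-detection heuristics in Phase 3 can be sketched like this (the threshold and the `_id` suffix rule are illustrative assumptions):

```python
def uniqueness_ratio(values):
    """Share of distinct non-null values; ratios near 1.0 suggest an identifier."""
    values = [v for v in values if v is not None]
    return len(set(values)) / len(values) if values else 0.0

def looks_like_key(name, values, threshold=0.99):
    """Flag a column as a potential key by uniqueness or by naming convention."""
    return uniqueness_ratio(values) >= threshold or name.endswith("_id")
```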

Phase 4 -- Quality Profile: Systematic data quality checks -- completeness rates, coded missing value detection, anomalous patterns, potential duplicates.

Phase 5 -- Semantic Interpretation: This is where it gets interesting. The agent uses column names, value patterns, and domain conventions to make educated guesses about what each column means. Every interpretation is explicitly marked as [PRELIMINARY] -- the agent knows it's hypothesizing, not asserting. Column named fips? Probably a FIPS geographic code. Column with values 0 and 1? Probably a binary indicator, but is 1 "Yes" or "Male" or "Urban"? The agent will flag the ambiguity.

If you provided documentation, the agent also runs Documentation Reconciliation (Mode 2): it parses your codebook or data dictionary, extracts every claim it can find (column definitions, expected types, coded value meanings), and then verifies each claim against the actual data. Documentation says there are 50 columns? The agent checks. Codebook says state_code should be a string? The agent confirms or flags the mismatch. This reconciliation is one of the most valuable things the data-ingest agent does -- it catches the disturbingly common case where documentation is outdated or describes a different version of the data than what you actually have.
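The spirit of documentation reconciliation is easy to illustrate. In this hypothetical sketch, codebook claims are reduced to expected type names and checked claim-by-claim against the observed data (DAAF's actual reconciliation covers far more kinds of claims):

```python
def reconcile(codebook, observed_columns):
    """Compare codebook claims against observed data.

    codebook: {column_name: expected_type_name, ...}
    observed_columns: {column_name: list_of_values, ...}
    Returns a list of human-readable discrepancies.
    """
    issues = []
    for col, expected in codebook.items():
        if col not in observed_columns:
            issues.append(f"{col}: documented but absent from data")
            continue
        actual = {type(v).__name__ for v in observed_columns[col] if v is not None}
        if actual - {expected}:
            issues.append(f"{col}: codebook says {expected}, observed {sorted(actual)}")
    for col in observed_columns:
        if col not in codebook:
            issues.append(f"{col}: present in data but undocumented")
    return issues
```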

Reviewing the Profile Output

The agent will return a structured report with:

  • Structural summary: Row/column counts, memory size, format
  • Column summary: Type, null rate, unique count, and notes for every column
  • Coded values detected: Which columns have potential coded values, and whether documentation confirms their meaning
  • Quality assessment: Scores for completeness, documentation accuracy, and coded value coverage
  • Preliminary interpretations: The agent's best guesses for what columns mean, each flagged with a confidence level and basis for the interpretation
  • Discrepancies found: Every case where documentation contradicted observed data, with evidence for both sides
  • User review requested: Explicit questions for you to answer -- which interpretations are correct, how to handle undocumented values, whether missing columns are expected

This review step is not optional. The whole point of marking interpretations as [PRELIMINARY] is that you need to confirm or correct them. The agent has done the mechanical work of profiling, but the semantic understanding -- what these columns actually mean in context -- requires your domain expertise. Take the time to go through the review questions carefully. Your answers will directly determine the quality of the resulting skill.

Once you've provided your feedback, the agent uses your corrections to finalize the skill and writes it to .claude/skills/[skill-name]/. From there, you can start a fresh session with DAAF and ask it to analyze the new data source alongside whatever other datasets you'd like! I'd strongly recommend first running the new skill through some simple paces to test it and work out any issues.


Step-by-Step: Authoring Other Types of New Skills

Methodology Skills (via Skill-Authoring)

For adding knowledge about a statistical method, Python library, or analytical technique, you'll use the skill-authoring skill directly. This is more free-form than data ingestion, and the content depends heavily on what you're documenting. You may find it helpful to point DAAF at the existing skills most similar to the one you're creating. Python library? Try referencing the plotnine or polars skills. Something more methodological in nature? Try pointing it to the data-scientist skill. And so on. My hope is that as the community continues to extend DAAF in a few directions, we'll have plenty of exemplars to point to.

Ask DAAF something like:

```
I'd like to create a new methodology skill for pyfixest
(fixed-effects regression in Python). Please use the
skill-authoring skill to guide the process, and research
the pyfixest documentation online to build a comprehensive
reference. You might refer to the `polars` skill as a model
for some of what it could look like. Please run some initial
explorations and then come back to me with a plan for my
approval.
```

DAAF will use the skill-authoring skill to guide the process. The skill-authoring skill provides detailed guidance on:

  • Frontmatter requirements: The YAML header that every skill needs, including naming conventions (lowercase-hyphenated, 1-64 chars) and description best practices
  • Body structure patterns: Different organizing patterns depending on whether the skill is workflow-based (sequential steps), task-based (tool collection), reference-based (standards/specs), or capabilities-based (features)
  • Progressive disclosure: How to keep the main SKILL.md under 500 lines by splitting detailed content into references/ files
  • Decision trees: How to write effective navigation trees that help agents find what they need quickly
  • Content limits: SKILL.md body should stay under 500 lines and 5,000 words -- be concise and justify every token
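The frontmatter requirement above can be illustrated with a minimal sketch -- the skill name and description here are hypothetical, and the exact required fields come from the skill-authoring skill's guidance:

```yaml
---
name: pyfixest-regression
description: >
  Reference for fixed-effects regression in Python with pyfixest:
  model specification, standard error options, and common pitfalls.
  Load when an analysis calls for panel or fixed-effects models.
---
```

Note the naming convention (lowercase-hyphenated) and a description that tells the orchestrator both what the skill covers and when to load it.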

The resulting skill gets placed at .claude/skills/[skill-name]/SKILL.md with optional references/, scripts/, and assets/ subdirectories.

Domain Expertise Skills (via Skill-Authoring)

Same process as methodology skills, but the content focuses on domain knowledge rather than tooling. For example, you might create a skill that documents the nuances of interpreting graduation rate data, or the policy context around school funding formulas, or the methodological considerations for analyzing panel data in education research.

```
I'd like to create a context skill for understanding Community
Eligibility Provision (CEP) and its impact on free/reduced-price
lunch data. This is critical context for anyone analyzing school
poverty measures after 2014. Please use the skill-authoring skill
and launch a few web searching subagents to research this topic
in depth before coming up with a plan for my approval.
```

Registering Your New Skill

Here's the part that people will sometimes miss: creating the skill file is not enough. DAAF uses a manual, documentation-based discovery system -- auto-discovery of skills with Claude Code is imperfect and can't always be relied on. After creating a new skill, it needs to be registered in several places to ensure that the orchestrator and agents can find it and know when to use it.

For data source skills, the data-ingest agent will provide you with a specific registration checklist at the end of its report. It looks something like this:

| Priority | File | Section to Update | What to Add |
| --- | --- | --- | --- |
| 1 (Required) | CLAUDE.md | Data Need Source Skill Lookup table | New row mapping data need to skill name |
| 2 (Required) | agent_reference/03_SKILL_INVOCATIONS.md | Available source skills list | New bullet with skill name and description |
| 3 (Required) | agents/source-researcher.md | Step 1 examples | Add skill to example list |

The agent will typically offer to make these updates for you -- just confirm and it'll handle the file edits. Note that these registration edits touch core framework files, which means they fall under the "contribution" category if you plan to share them (see When to Extend vs. When to Contribute).

For methodology and domain expertise skills, registration is simpler -- you primarily need to update CLAUDE.md so the orchestrator knows the skill exists and when to recommend loading it.


Adding a New Agent

Adding data sources is the most common extension path, but sometimes you need something different: a new behavioral role in the pipeline. Maybe you need a specialized validator for a particular type of analysis, or a new synthesis pattern for cross-domain work, or a domain-specific planner that understands the constraints of your field. That's when you add a new agent.

This is a less common operation and a more involved one. Agents are deeply wired into the DAAF ecosystem -- they have producer/consumer relationships with other agents, they reference shared protocols, and they need to be discoverable by the orchestrator. The agent-authoring skill exists specifically to guide you through this process and tries to make sure nothing gets missed.

The Agent-Authoring Workflow

Ask DAAF to use the agent-authoring skill:

```
I need to create a new agent for [describe the behavioral role]. I'd
like this to be an agent focused on [x, y, z], and likely should be
involved in doing [a, b, c] at [specific part of the research process].
Please use the agent-authoring skill to guide me through the process,
and let me know what more detail would be useful to make sure this is
successful.
```

The workflow has five phases:

Phase 1: Design (before writing). This is where you get crystal clear on the fundamentals. The agent-authoring skill will make sure you can answer five critical questions:

  1. What does this agent do and why does it exist? (one sentence)
  2. Which pipeline stage(s) does it operate in?
  3. Which existing agents are most similar, and how does yours differ?
  4. Does it need file-write access (general-purpose) or is it read-only (Plan)?
  5. Will it need to invoke any skills?

If any of these answers are vague, the agent-authoring skill will push you to sharpen them. This upfront clarity is genuinely important -- a poorly defined role leads to a poorly functioning agent.

Phase 2: Author. Write the agent definition file following the canonical 12-section template (defined in agent_reference/AGENT_TEMPLATE.md). The required sections include: Identity, Inputs, Core Behaviors, Protocol, Output Format, Boundaries, STOP Conditions, Anti-Patterns, Quality Standards, Invocation, References, and Consumers. The agent-authoring skill provides section-by-section guidance and a self-validation checklist covering everything from minimum anti-pattern counts to expected file length (400-700 lines).

Phase 3: Integrate. This is the step where the most things can go wrong if you're not careful. A new agent needs to be registered across multiple files in the DAAF ecosystem. The agent-authoring skill provides a complete integration checklist organized into tiers:

  • Tier 1 (Mandatory, 6 files): Every new agent must be registered in agents/README.md, CLAUDE.md, README.md, and several other core files
  • Tier 2 (Conditional): Additional updates if the agent maps to a specific pipeline stage
  • Tier 3 (Conditional): Additional updates if the agent affects specific workflow areas

Phase 4: Validate. Verification checks to confirm cross-agent consistency and completeness. The skill provides specific grep commands to run.

Phase 5: Human review. This is non-negotiable. You must review the agent file yourself for accuracy, intention, completeness, and value before it's considered done.

Key Resources

| Resource | Purpose |
| --- | --- |
| agent-authoring skill | Full workflow with integration checklist |
| agent_reference/AGENT_TEMPLATE.md | Canonical 12-section template |
| agents/README.md | Current agent landscape, commonly confused pairs, coordination matrix |

For changes to existing agents (modifying behavior rather than adding new ones), see 05. Contributing to DAAF.


Testing Your New Extension End-to-End

You've created a new skill (or agent). How do you know it actually works? Here's a practical testing sequence, ordered from lightest to heaviest.

Discovery Test

The simplest test: can DAAF find your new skill and understand what it's for?

```
What data sources does DAAF know about? Can you tell me about
[your new data source]?
```

If the skill is properly registered, DAAF should be able to describe the data source, list key variables, and mention important caveats. If it can't find the skill or gives a generic response, check your registration entries in CLAUDE.md and the other files listed in the registration checklist.

Fetch Test

If your data source is accessible through the mirror system (or available as a local file), test that DAAF can actually retrieve and load the data:

```
Can you fetch [your data source] for [year] and show me the first
few rows and basic summary statistics?
```

This tests the data access pathway -- the dataset paths in your skill, the mirror configuration, and the basic loading mechanics. The fetch should complete with a CP1 validation (shape, types, missingness checks). If CP1 fails, it usually means the dataset path in your skill doesn't match what's actually available on the mirror, or the expected column structure differs from reality.
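A CP1-style validation amounts to something like the following sketch -- the function name, expected columns, and null-rate threshold are all illustrative, not DAAF's actual checkpoint code:

```python
def cp1_checks(rows, expected_columns, max_null_rate=0.5):
    """Validate shape, column presence, and missingness after a fetch.

    rows: list of dicts (one per record); raises AssertionError on failure.
    """
    assert rows, "fetch returned no rows"
    cols = set(rows[0])
    missing = set(expected_columns) - cols
    assert not missing, f"missing expected columns: {sorted(missing)}"
    for col in expected_columns:
        nulls = sum(1 for r in rows if r.get(col) is None)
        assert nulls / len(rows) <= max_null_rate, f"{col}: null rate too high"
    return True
```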

Context Test

This tests whether your skill's coded value mappings, missing data codes, and caveats are being correctly applied during data cleaning:

```
Can you fetch and clean [your data source] for [year], making sure
to handle any coded missing values and apply the source-specific
caveats documented in the skill?
```

Watch the cleaning script that DAAF produces. It should reference the specific coded values, suppression patterns, and pitfalls documented in your skill. If it's treating -9 as a real numeric value instead of a missing data code, the coded value documentation in your skill may not be clear enough.
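What "handled correctly" looks like in a cleaning script is roughly this -- the specific code-to-meaning mapping is hypothetical and would come from your skill's documentation:

```python
# Sentinel codes documented in the (hypothetical) skill, with their meanings.
MISSING_CODES = {-9: "missing", -2: "suppressed", -1: "not applicable"}

def decode_missing(values, codes=MISSING_CODES):
    """Replace documented sentinel codes with None so downstream
    statistics don't treat them as real measurements."""
    return [None if v in codes else v for v in values]
```

If the cleaning script skips this step, a mean over a column containing -9s will be silently dragged downward -- exactly the failure mode the Context Test is designed to catch.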

Full Pipeline Test

The gold standard: run a simple research question that exercises your new skill through the entire pipeline.

```
Using [your new data source], can you analyze [simple, well-defined
research question]? Keep the scope narrow -- I just want to verify
the data flows through correctly.
```

Pick a question that's deliberately simple -- something like "What is the average [measure] by [grouping variable] for [year]?" You're not testing analytical sophistication here, you're testing integration. Does the data flow through fetch, clean, transform, and analysis without errors? Do the coded values get handled correctly? Does the report reference the right caveats?
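The "average [measure] by [grouping variable]" question is deliberately trivial -- computationally it's nothing more than this sketch (dict-of-rows shape and column names assumed for illustration):

```python
from collections import defaultdict

def average_by_group(rows, group_key, measure):
    """Average of `measure` by `group_key` over a list of dict rows."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        v = r.get(measure)
        if v is not None:                      # skip decoded missing values
            sums[r[group_key]] += v
            counts[r[group_key]] += 1
    return {g: sums[g] / counts[g] for g in sums}
```

If even this simple aggregation comes back with obviously wrong numbers, the problem is almost certainly upstream -- in fetch, cleaning, or coded-value handling -- not in the analysis itself.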

Methodology/Domain Skill Test

For non-data-source skills, the testing is more straightforward:

```
I'd like to run a [method from your new skill] analysis on
[some existing DAAF data]. Can you walk me through the approach?
```

Check that DAAF references your skill's guidance -- the correct function calls, the appropriate assumptions to validate, the known limitations to document.


Submitting Your Extension for Inclusion

If you've created a useful skill or agent and want to share it with the broader DAAF community -- please do! The whole point of this being open-source is that the framework gets better as more people contribute their domain expertise. A skill you create for, say, health survey data or labor market statistics could save someone else weeks of profiling work.

Before You Submit

A few things to check:

  • Quality: Did you thoroughly review the data-ingest output and correct any preliminary interpretations? Skills with [PRELIMINARY] markers still in place aren't ready for sharing.
  • Completeness: Does the skill follow the appropriate template (for data sources)? Does it have at least 2 decision trees? Is the Common Pitfalls section substantive?
  • Privacy: Does the skill reference only publicly accessible data? If it was built from proprietary data, make sure the skill documentation doesn't leak any confidential information or values.
  • Testing: Have you run at least a Discovery Test and a Fetch Test to confirm the skill works end-to-end?

How to Submit

See 05. Contributing to DAAF for the full contribution workflow. The short version: fork the repository, add your skill files, update the registration entries, and submit a pull request. The contribution guide covers pull request formatting, quality standards, and the review process in detail.

If you're not comfortable with the pull request process, you can also open an issue describing your new skill and sharing the files -- the community can help get it integrated.

LEARNINGS.md: The Other Way to Contribute

Even if you're not creating new skills, there's a contribution path that requires almost zero effort: sharing your LEARNINGS.md files. Every time DAAF completes a Full Pipeline project, it produces a LEARNINGS.md file documenting everything it learned about data quirks, process issues, and methodology edge cases along the way. These learnings are written to be immediately actionable -- they often contain specific suggestions for updating skills, improving documentation, or adding new pitfall entries.

If you open an issue with your LEARNINGS.md content, the community can fold those insights back into the framework. This is genuinely one of the most impactful things you can do -- every project run generates practical knowledge that benefits every future project.


Recommended Next Steps

  • 05. Contributing to DAAF — Get involved in developing DAAF! How to file issues via GitHub, support expanding the capabilities of the framework, contribute to educational tutorials and how-to's, and more!
  • 06. FAQ: Philosophy — Grapples with the broader implications of this work, AI automation in general, model advancement pace, approaching the "exponential", environmental ethics, what this means for the next generation of researchers, and more
  • 07. FAQ: Technical Support — Covers frequently asked questions about Docker, issues with Claude Code, usage limits, authentication errors, and other common errors
  • Back to main