Implementation #198 - Refactor completed by Kiro #199

mstackhouse · 2025-12-09T18:17:11Z

This is intentionally going to be a large PR, because I'd like to include a details about the experience of using Kiro to implement the refactoring described in #198.

PR Formalities

The updates to the source code were generally handled by Kiro, but a few notes:

I cross checked that the existing unit tests were not modified. There were several unit tests that were implemented by Kiro, but existing tests have not been modified or removed.
I went through every vignette and cross checked the data within the tables. Where I saw discrepancies, I noticed that we sometimes used random samples without setting a seed. In these cases, I ensured that the data in these tables were correct.

At this point, I'm confident that the refactor is returning equivalent results to the existing development version of Tplyr.

Kiro Experience

My initial testing with Kiro was focused around designing a new proof-of-concept. I've been a fan of Claude Code for a little while now, and I've been using it to bring an idea to life. I know there's a huge ecosystem of tools for Claude Code, and different ways that you can manage planning and implementation. But I just haven't tried it myself.

What stuck out to me about Kiro was Spec mode. If you've been using any sort of AI tools or building anything AI related, you're probably aware that managing context is everything right now. The LLMs take input and generate output, so it becomes your job (or the tools job) to keep that context relevant and manage the window of context that the models can actually digest. Kiro's spec mode works with you to build requirements, create a design, and generate a task list for how the implementation will be conducted. This all fits quite nicely within the UI as well.

Using Kiro on Tplyr

Having used Claude Code and seen how effective these agentic tools have gotten, the potential of what Kiro can do was apparent very quickly. So I embarked on a side quest from my prototype to throw something different at Kiro that would help me stress test it; refactor a large, complicated code base that I wrote to fix design decisions that I regret.

The issue that I was trying to address is summarized in the attached GitHub issue in more detail. To summarize, when Tplyr would process data, we evaluated the functions within the object environments instead of the individual function environments. This was a common pattern throughout the entire code base.

This felt like a good use case for an AI tool based on a few factors:

The issue was specific, but in practice fairly complicated to change
It was a pervasive problem throughout the code base
In making the update, I did not want anything user facing to change. So parity of results with the existing code base was critical.

I think this makes a great challenge for Kiro because the tool needs to first understand the existing architecture, conventions, and problem, and then create and implement a plan. The updates are so widespread that if something went sideways, it would be easy to have to trash everything and start over. The code base is very interconnected, so it would be easy to get lost.

If I were to try to do this update myself, I'd estimate that it would take me about 2 dedicated weeks to get everything completed. Given the complexity of the code base, if a new developer that's never worked on Tplyr came on board and tried to do it, I think it could easily be a month.

Watching Kiro implement

The first thing I asked Kiro to do was analyze the code base and document conventions. Seeing how Kiro plans things out, this felt like a good approach to help manage context about Tplyr up front, so when it started implementing there was already context set aside to provide to the model.

From there, based on my request, Kiro started designing requirements. Here's a little snippet:

Requirement 1: Eliminate evalq() Usage

User Story: As a Tplyr maintainer, I want to eliminate evalq() usage from the codebase, so that code is easier to understand, test, and > maintain.

Acceptance Criteria

WHEN refactoring is complete THEN the system SHALL contain zero uses of evalq() for executing multi-line code blocks in table/layer environments

WHEN refactoring is complete THEN the system SHALL use standard function calls with explicit parameters instead of evalq()

WHEN a function previously using evalq() is refactored THEN the system SHALL preserve all existing functionality

WHEN a function previously using evalq() is refactored THEN the system SHALL maintain the same inputs and outputs from a user perspective

WHEN code is executed THEN the system SHALL NOT create unintended variable bindings in table/layer environments

You can find the full set of documents that Kiro wrote under .kiro/specs/tplyr_refactor - a lot more context was set aside. But based on my problem description, Kiro wrote quite thorough documentation around the requirements of the refactor (and separately, functional requirements), and used that to build a design for the refactor, and ultimately and a task list for implementation.

One of the things that truly impressed me about this process was how smart the implementation approach was. The task list Kiro created was about 30 steps. Here's an example of what those steps look like:

Task List

1. Preparation and Setup

Document all current evalq() usage locations

Establish performance baseline for key functions

Verify test suite is comprehensive and passing

Create refactoring branch

Requirements: All requirements (foundation)

2. Refactor treatment_group_build()

Extract bindings at function start (target, treat_var, pop_data, etc.)

Move all processing logic to function environment

Explicitly bind results (built_target, built_pop_data) at function end

Remove evalq() wrapper

Verify no temporary variables remain in table environment

Requirements: 1.1, 1.3, 1.5, 2.1-2.6, 3.1-3.8

2.1 Write tests for treatment_group_build()

Test that built_target and built_pop_data are created correctly

Test that no temporary variables (fct_levels, grp_i, i) remain in table environment

Test filter error handling

Test treatment group expansion

Test factor handling

Requirements: 12.1-12.5

I made it clear up front that parity with the current code base was critical, so it made sure to measure the current state of the code base before hand. Instead of making a wide swath of updates immediately, it took things incrementally and focused on one part at a time. After each change, it would rerun tests and make sure things were functioning properly before moving on.

One extremely interesting part of this process was when Kiro was trying to update Tplyr's count layers. This is realistically the most complex part of the code base - specifically nested count layers. After Kiro updated the general count layer processing, a chunk of tests started failing. Instead of fixating on resolving the problem right then and there, Kiro figured out the source of the error, but noted that the culprit function was going to be updated later on in another task. It took note of the failures and let them go for the time being. That's an impressive level or strategic thinking - something a lot developers, including myself, can get stuck on.

Results

By the time Kiro was done, the entire code base had been refactored and about 98% of the existing tests were passing. Overall, that's impressive. But let's dig into the failing tests. One of the failing tests was just a the capture of a snapshot error that changed slightly, so this was a non issue. Everything else was essentially cases of variable inheritance within teh objects. Recall the the issue being addressed was that functions were executing within the Tplyr environments. The nice part of this approach was that if a value wasn't present in the layer, it would automatically pull from the parent environment of the table. All I had to manually change to fix this was specify to inherit values when the value was pulled from the layer environments. I would also say this was a reasonable failure on the part of the AI, because the errors this created were unexpected, pretty difficult to trace, and not exactly clear from the package design either.

A Couple Pitfalls

There were a few small issues that I did run into while Kiro was going through its implementation. These were all realistically trivial. A couple times, Kiro would hang and it would be hard to tell what happened. Most of the time this was because there was some sort of interactive thing happening in the terminal. For example, Kiro ran a git command that triggered paging within the terminal window, so it needed me to scroll the window down to the let the command return. Fortunately there's a way to prevent this. Kiro let's you create steering rules to control how the agent goes about doing its tasks. Once I noticed this was happening, I asked it to create a rule to be cautious of the terminal commands being executed and mitigate the possibility of it requiring my interaction.

There were a few more circumstances where the agent started hanging as well and I couldn't find anyplace it was waiting for my interaction. For this, I had to hit cancel on the agencts execution. It recovered and moved on. I'm not sure the source of these issues, but Kiro wrote code and that code as executing, so it could be anything. Ultimately, not a large issue.

Lastly, one thing I admittedly didn't like is that Kiro also added a number of unit tests. In some ways, this is good - this was a substantial change and the tests were aimed at verifying those changes. That said, some of the tests were failing and the failures were false positives. Either Kiro had the wrong idea, or in some cases the unit test file had some changes to the global environment that had trickle effects. Another lesson learned from Tplyr - having good tests is more important than having a lot of tests.

Conclusion

Going back to the beginning, I estimated this task would've taken me - as one of the core developers of the package - about 2 weeks (80-100 hours) of development time. Using Kiro, the planning and updates took about 6-8 hours. The clean-up on my end after the fact took maybe 2-3 hours. Keep in mind that the 6-8 hours spent by Kiro also didn't require my full attention. My dedicated focus was easily half of that. There was a substantial part of that time that I wasn't even at my computer.

I'll note that no one asked me to write this, to test this, or to endorse Kiro. In my current position, I don't have a lot of time code like I used to. That said, I have no shortage of hopes and dreams. Tools like this give me freedom to bring some of those dreams to life. Possibly more importantly, they can help me articulate what I want to do and communicate that with my team.

Long story short, Kiro has blown me a way.

…eline, verify tests

…scalculation. Extremely unobvious.

…ests are passing.

…onal-pattern Refactor/evalq to functional pattern

Pull in extra commit from previous branch

…re causing failures

Copilot

Pull request overview

This PR implements a comprehensive refactoring of Tplyr's internal architecture to eliminate evalq() usage and adopt an Extract-Process-Bind (EPB) pattern. The refactoring was completed using Kiro, an AI-assisted development tool, and represents approximately 6-8 hours of automated work plus 2-3 hours of manual cleanup.

Key achievements:

Refactored 34 internal functions from evalq() pattern to Extract-Process-Bind pattern
Eliminated environment pollution from temporary variables
Improved code testability, maintainability, and clarity
Maintained 100% backward compatibility with no user-facing changes

Reviewed changes

Copilot reviewed 45 out of 46 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
vignettes/denom.Rmd	Added `set.seed()` calls for reproducible random sampling in examples
vignettes/.gitignore	Added Quarto build directory to gitignore
tests/testthat/test-sort.R	Wrapped test code in proper `test_that()` structure and added new sorting tests
tests/testthat/test-pop_data.R	Added comprehensive tests for `build_header_n()` function
tests/testthat/test-nested.R	Added new test file for nested count layer functions
tests/testthat/test-count.R	Minor formatting fix (closing brace)
tests/testthat/_snaps/count.md	Updated error message snapshot to include function name
man/Tplyr.Rd	Added package website URL
R/zzz.R	Added global variable declaration for `tot_fill`
R/stats.R	Refactored risk difference functions to EPB pattern with detailed documentation
R/sort.R	Added `@noRd` tags to internal functions
R/shift.R	Refactored shift layer functions to EPB pattern with comprehensive documentation
R/riskdiff.R	Renamed and refactored `prep_two_way()` to explicit parameter passing
R/process_metadata.R	Refactored metadata functions to EPB pattern with improved structure
R/prebuild.R	Refactored `treatment_group_build()` to EPB pattern, removed empty evalq wrapper
R/pop_data.R	Refactored `build_header_n()` and `add_total_group()` to EPB pattern
R/nested.R	Refactored nested count processing to EPB pattern with clear documentation
R/layer.R	Replaced `evalq()` with direct environment binding
R/gather_defaults.R	Refactored to use `env_get()` instead of `evalq()`
R/desc.R	Refactored descriptive statistics layer to EPB pattern
R/count.R	Extensively refactored count layer functions to EPB pattern
R/collapse_row_labels.R	Removed commented-out browser() call
R/assertions.R	Refactored to use `env_get()` with inheritance
NEWS.md	Added internal changes section documenting refactoring
NAMESPACE	Added S3 method export for `process_metadata.tplyr_statistic`
.kiro/**	Extensive documentation of refactoring process, requirements, and implementation
.Rbuildignore	Added `.kiro` directory to build ignore list

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

mstackhouse and others added 20 commits December 6, 2025 17:35

Task 1: Preparation and Setup - Document evalq() usage, establish bas…

e50edc2

…eline, verify tests

Add preparation summary document

b2d779c

Save kiro progress

cf408d1

Finish kiro refactor

72547ea

denoms_by didn't pass pass into the outer layer processing causing mi…

b2a3c9c

…scalculation. Extremely unobvious.

Valid snap change

4640c36

Correct to grab inherited formats)

09ed401

Missed inherited values

3d8a26d

Work through inheritance pieces

238cd80

This was all Kiro generated - existing tests pass.

142a3d0

All these tests aren't necessary and were Kiro generated. Necessary t…

48fd09f

…ests are passing.

Extra inheritance and additional snaps from new tests.

432650f

This scared me but it was just lack of a seed on a random sample

11e11fd

new required global binding

2bfb4d2

patching up R CMD check and last failing tests

0496c93

Merge pull request #197 from atorus-research/refactor/evalq-to-functi…

f533461

…onal-pattern Refactor/evalq to functional pattern

Ignore Kiro for R builds

8672198

Merge branch 'refactor/evalq-to-functional-pattern' into kiro_refactor

b7f1dd3

Pull in extra commit from previous branch

Get rid of Kiro generated tests that weren't truly necessary. Some we…

ca40089

…re causing failures

I think I accidentally deleted this.

6173c1c

mstackhouse requested a review from Copilot December 10, 2025 14:40

Copilot started reviewing on behalf of mstackhouse December 10, 2025 14:41 View session

Copilot AI reviewed Dec 10, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Implementation #198 - Refactor completed by Kiro #199

Implementation #198 - Refactor completed by Kiro #199

Uh oh!

mstackhouse commented Dec 9, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Implementation #198 - Refactor completed by Kiro #199

Are you sure you want to change the base?

Implementation #198 - Refactor completed by Kiro #199

Uh oh!

Conversation

mstackhouse commented Dec 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Formalities

Kiro Experience

Using Kiro on Tplyr

Watching Kiro implement

Requirement 1: Eliminate evalq() Usage

Acceptance Criteria

Task List

Results

A Couple Pitfalls

Conclusion

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mstackhouse commented Dec 9, 2025 •

edited

Loading