Develop new Koza API + general refactoring #163
Merged
kevinschaper merged 55 commits into main, Jul 10, 2025
added 27 commits
January 10, 2025 13:01
Without making any changes to functionality, this separates a koza configuration into a ReaderConfiguration, TransformConfiguration, and WriterConfiguration, all contained within a KozaConfiguration.
The big changes are:
1. Taking in a JSON{,L}ReaderConfig object for all configuration
2. Defining iteration via `__iter__()` and `yield`
First, replaces the many named parameters with a single CSVReaderConfig object. Second, uses `__iter__()` and `yield` to define iteration. Third, refactors the header consumption and validation code, and wraps accessing the header in a property on the class.
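A rough sketch of the reader shape described above: one config object instead of many named parameters, iteration via `__iter__()` and `yield`, and the header behind a property. `SketchCSVConfig` and `SketchCSVReader` are made-up names for illustration and are not Koza's actual classes or signatures.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, Optional, TextIO


@dataclass
class SketchCSVConfig:
    """Hypothetical stand-in for a reader config (not Koza's CSVReaderConfig)."""

    delimiter: str = ","


class SketchCSVReader:
    """Yields each row as a dict; the header is consumed once and cached."""

    def __init__(self, io_str: TextIO, config: SketchCSVConfig) -> None:
        self.io_str = io_str
        self.config = config
        self._header: Optional[list[str]] = None

    @property
    def header(self) -> list[str]:
        # Read the header line only on first access.
        if self._header is None:
            reader = csv.reader(self.io_str, delimiter=self.config.delimiter)
            first = next(reader, None)
            if first is None:
                raise ValueError("empty input: no header row")
            self._header = first
        return self._header

    def __iter__(self) -> Iterator[dict[str, str]]:
        header = self.header
        for row in csv.reader(self.io_str, delimiter=self.config.delimiter):
            yield dict(zip(header, row))


rows = list(SketchCSVReader(io.StringIO("id,name\n1,a\n2,b\n"), SketchCSVConfig()))
```

Because `__iter__` is a generator, a consumer can simply write `for row in reader:` without the reader tracking line counts itself.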
This adds a new class: KozaRunner, which represents a new way of running Koza transforms. It is a work in progress and still not at feature parity with existing transforms. Essentially, the KozaRunner class takes three parameters:
1. Data (the data to be transformed)
2. A function to transform that data, either all at once or row-by-row
3. A writer that will do something with the transformed output
See the documentation in src/koza/runner.py for more details.
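To make the three-part idea concrete, here is a toy runner. It is only a sketch under assumptions: `SketchRunner`, `ListWriter`, and the parameter names are invented here and do not mirror the real `KozaRunner` signature (see src/koza/runner.py for that).

```python
from typing import Any, Callable, Iterable, Optional


class ListWriter:
    """Hypothetical writer: collects output in memory instead of writing files."""

    def __init__(self) -> None:
        self.written: list[Any] = []

    def write(self, item: Any) -> None:
        self.written.append(item)


class SketchRunner:
    """Toy runner: data in, one transform function, a writer for the output."""

    def __init__(
        self,
        data: Iterable[dict],
        writer: ListWriter,
        transform: Optional[Callable] = None,
        transform_record: Optional[Callable] = None,
    ) -> None:
        # Exactly one of the two function styles must be supplied.
        if (transform is None) == (transform_record is None):
            raise ValueError("pass exactly one of transform / transform_record")
        self.data = data
        self.writer = writer
        self.transform = transform
        self.transform_record = transform_record

    def run(self) -> None:
        if self.transform is not None:
            # All at once: the function consumes the whole data iterable.
            self.transform(self.writer, self.data)
        else:
            # Row by row: the function is called once per record.
            for record in self.data:
                self.transform_record(self.writer, record)


writer = ListWriter()
runner = SketchRunner(
    data=[{"n": 1}, {"n": 2}],
    writer=writer,
    transform_record=lambda w, record: w.write({"n": record["n"] * 2}),
)
runner.run()
```

Keeping the data, the function, and the writer as plain constructor arguments is what makes this kind of object easy to build in a test without any configuration file.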
This commit makes multiple changes to koza.io.utils.open_resource:
- Adds support for opening tar files.
- Handles archives (zip and tar) in the same way that the old `file_archive` source configuration did: it assumes all files in an archive are of the same format (CSV, JSONL, etc.). It will likely be future work to allow a way to specify that only certain files in an archive should be handled.
- Adds more robust checking for gzip compression than checking for a `.gz` extension.
- open_resource() now returns one or more SizedResource objects that report the size of the resource being opened and provide a `.tell()` method that indicates the position being read in that resource. This will be necessary to add some sort of progress bar in the future.
- Resources downloaded from the Web now use the same logic as local files to check for compression/archives.
- Importantly, the resources returned by `open_resource` *are not automatically closed*. This was inconsistent in the previous version. It is up to the consumer of the function to explicitly close resources.
- Adds more tests for compressed and archival formats.
- Small typing changes for other koza.io.utils functions, adding Optional where appropriate.
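For readers wondering what "more robust than checking for `.gz`" can look like: one common approach is to sniff the two gzip magic bytes (`0x1f 0x8b`) at the start of the stream. The helpers below are a sketch of that general technique with invented names; this is not necessarily how `open_resource` does it.

```python
import gzip
import io
from typing import BinaryIO

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip member


def is_gzipped(stream: BinaryIO) -> bool:
    """Peek at the first two bytes of a seekable binary stream."""
    start = stream.tell()
    magic = stream.read(2)
    stream.seek(start)  # rewind so the caller sees the stream untouched
    return magic == GZIP_MAGIC


def open_maybe_gzipped(stream: BinaryIO) -> BinaryIO:
    """Wrap the stream in a decompressor only if it is actually gzipped."""
    if is_gzipped(stream):
        return gzip.GzipFile(fileobj=stream)
    return stream


compressed = io.BytesIO(gzip.compress(b"id,name\n1,a\n"))
plain = io.BytesIO(b"id,name\n1,a\n")
same_bytes = open_maybe_gzipped(compressed).read() == open_maybe_gzipped(plain).read()
```

Note that, in line with the commit message, nothing here closes the underlying stream; that stays the caller's job.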
This was not working correctly with the discriminated union field
I realized at some point that creating a map from a reader file is just a type of transform. This change in the configuration makes achieving that possible. A map transform is just a transform that relies on two additional configuration keys: `key` and `values`. To make passing those values in a YAML config possible, this commit makes it so that any extra fields in the configuration are parsed into an `extra_fields` field in a transform.
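As an illustration of the "collect unknown keys into `extra_fields`" idea, here is a stdlib-only sketch. The class name, the `code` field, and the `from_dict` helper are assumptions made up for this example; the real config models are defined differently.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class SketchTransformConfig:
    """Hypothetical transform config: one declared field plus a catch-all."""

    code: Optional[str] = None  # illustrative field name, not Koza's
    extra_fields: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "SketchTransformConfig":
        # Declared fields are passed through; everything else is kept as extra.
        known = {f.name for f in fields(cls)} - {"extra_fields"}
        declared = {k: v for k, v in raw.items() if k in known}
        extra = {k: v for k, v in raw.items() if k not in known}
        return cls(**declared, extra_fields=extra)


# A map transform is then an ordinary transform config with `key` and `values`.
config = SketchTransformConfig.from_dict(
    {"code": "my_map.py", "key": "id", "values": ["column_a", "column_b"]}
)
```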
This makes config creation more lenient. Note that this means it's possible to have an empty transform. The lack of a transform would be detected when a KozaRunner is run.
Also remove unnecessary `files=[]` calls, since that is the default as of eaff691.
This allows a transform to be defined as a module (resolvable from the Python import path), e.g. `mypackage.transforms.example_transform`, rather than having to define it as a file (`/home/user/code/mypackage/transforms/example_transform.py`). This allows the possibility of creating generic transforms that can be packaged, installed, and re-used, without having to track down the filename of the python file where the transform code is located.
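The two ways of locating transform code map onto two standard `importlib` entry points. A minimal sketch follows; `load_transform_module` is a hypothetical helper name, and the "ends in `.py` and exists" test is an assumption about how one might tell the two cases apart.

```python
import importlib
import importlib.util
from pathlib import Path
from types import ModuleType


def load_transform_module(ref: str) -> ModuleType:
    """Load transform code from a .py file path or from a dotted module name."""
    path = Path(ref)
    if path.suffix == ".py" and path.is_file():
        # File case: build a module spec straight from the file location.
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            raise ImportError(f"cannot load transform from {ref}")
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module
    # Module case: anything importable, e.g. an installed package.
    return importlib.import_module(ref)


# Works for any importable dotted name; a stdlib module stands in here.
module = load_transform_module("json.decoder")
```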
This commit builds on the changes in a60c607, bfa87d3, and eaff691.
It fully implements the mapping functionality that was present in the
previous method of writing transforms, although with a new API.
Instead of being given a large dict-of-dicts with mappings defined for
terms, a method is passed via the KozaTransform object used in a
transform, where a map lookup is done like so:
def transform(koza: KozaTransform):
    term = "example"
    mapped_term = koza.lookup(term, "column_b")
...where the map was loaded from a CSV file that might look like this:
id,column_a,column_b
example,alias1,alias2
...resulting in mapped_term evaluating to `"alias2"`.
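The lookup described above can be mimicked in a few lines, which may help when reasoning about what a map file needs to contain. `build_map` and `SketchTransform` are invented for this sketch, and returning the term unchanged on a miss is an assumption, not documented Koza behavior.

```python
import csv
import io

MAP_CSV = "id,column_a,column_b\nexample,alias1,alias2\n"


def build_map(csv_text: str, key: str, values: list[str]) -> dict[str, dict[str, str]]:
    """Index the map rows by the `key` column, keeping only the `values` columns."""
    table: dict[str, dict[str, str]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        table[row[key]] = {column: row[column] for column in values}
    return table


class SketchTransform:
    """Hypothetical stand-in for the object handed to a transform function."""

    def __init__(self, table: dict[str, dict[str, str]]) -> None:
        self._table = table

    def lookup(self, term: str, column: str) -> str:
        # Assumption of this sketch: unmapped terms come back unchanged.
        return self._table.get(term, {}).get(column, term)


koza = SketchTransform(build_map(MAP_CSV, key="id", values=["column_a", "column_b"]))
mapped_term = koza.lookup("example", "column_b")  # "alias2"
```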
* Use match statement for header detection logic
* Remove unused line_num and line_count variables
Formatting, renaming variables tests, removing unnecessary config params
* Use [project] instead of [tool.poetry]
* Set minimum python version to 3.10
* Use pyupgrade lint rules
I had switched to itertuples (from iterrows), but didn't change how rows were interacted with.
I'm going to go ahead and bring this into main, but not make a release yet.
kevinschaper
approved these changes
Jul 10, 2025
This is a huge commit, because it was hard to change one part of Koza without changing everything.
Here are the headlines:
Change the API for writing transforms
The best way to see this is in the diffs for the examples. Take the example `string-w-map` transform. (Here's the split diff for convenience)
Old API:
New API:
Note a few things here:
- There's no more boilerplate (`import cli_utils`, `koza_app = get_koza_app()`, `koza_app.get_map()`, `koza_app.get_row()` and so on). Instead, you just need to write one function, called `transform_record`, which receives two arguments: `koza` (which provides `koza.write` for writing, `koza.lookup` for mapping, and a few other things), and `record` (the `dict` representing a row in a CSV file, a line in JSON lines, or an object in a JSON array).
- Map lookups go through a method (`koza.lookup(term, map_name)`), rather than indexing into a nested dict (`koza_map[term]["entrez"]`).
- The `koza` argument: it's a central place for functionality provided to a transform.
This was an example of what we used to call a "loop" transform. Here's an example of a "flat" transform. We can take `examples/string/protein-links-detailed.py` as an example.
Old:
New:
About the same as the previous example, but this time all of the functionality for the transform is in a function named `transform`.
In the old API, "loop" and "flat" transforms were differentiated in the YAML config. Now, it's determined by the name of the function you define in your transform module. In both cases, transforms will only be loaded once, whether in what we used to call "loop" or "flat" mode. But:
- If you define `transform`, it will be run once, and it's up to the consumer to read all the data. (This is equivalent to "flat" mode.)
- If you define `transform_record`, it will be run for every record from every reader. (Equivalent to "loop" mode.)
If neither or both of these functions are defined, an error is raised and Koza bails out.
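The "exactly one of `transform` / `transform_record`" rule is simple to express. This is only a sketch of the rule as described here: `pick_transform` and the `"single"` / `"serial"` labels are made up for illustration and are not the runner's actual code.

```python
from types import ModuleType
from typing import Callable


def pick_transform(module: ModuleType) -> tuple[str, Callable]:
    """Return which style a transform module uses, plus the function itself."""
    single = getattr(module, "transform", None)
    per_record = getattr(module, "transform_record", None)
    # Both defined or neither defined is an error.
    if callable(single) == callable(per_record):
        raise ValueError(
            "transform module must define exactly one of "
            "`transform` or `transform_record`"
        )
    if callable(single):
        return "single", single  # run once; the function reads all the data
    return "serial", per_record  # run once per record


# Build throwaway modules in memory to exercise both styles.
flat_style = ModuleType("flat_style")
flat_style.transform = lambda koza: None

loop_style = ModuleType("loop_style")
loop_style.transform_record = lambda koza, record: None

modes = [pick_transform(flat_style)[0], pick_transform(loop_style)[0]]
```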
Major overhaul of the runner
To support this new API, there is a (basically) completely new runner class.
At a high level, here's how Koza used to work:
In "loop" mode:
- [...] `.get_row()` and that loading it will have side effects. Let all code run.
In "flat" mode:
- [...] `koza_app.get_row()` repeatedly. Run this code repeatedly until one of the following expected exceptions is thrown: `NextRowException` (meaning that a source has been exhausted and we should move on to the next source), or `StopIteration` (meaning that all sources have been exhausted). Other exceptions are not expected and will raise an error and stop everything.
- [...] `koza_app.get_row()` enough to exhaust all sources, run forever.
This is the main code that did that:
koza/src/koza/app.py
Lines 101 to 124 in 8a3bab9
The way to run a transform in Python was to run `koza.cli_utils.transform_source`:
koza/src/koza/cli_utils.py
Lines 39 to 94 in 8a3bab9
In the new API, there is a new class -- `KozaRunner` -- that takes care of loading a configuration and kicking off the transform. Here is its `__init__` function:
koza/src/koza/runner.py
Lines 146 to 155 in 4d5bc95
Note that this doesn't take any configuration file. The normal intended way to instantiate a `KozaRunner` object is with one of these class methods:
koza/src/koza/runner.py
Lines 257 to 264 in 4d5bc95
koza/src/koza/runner.py
Lines 311 to 320 in 4d5bc95
When a `KozaRunner` object is instantiated from a configuration, it looks for a transform:
koza/src/koza/runner.py
Lines 268 to 288 in 4d5bc95
(Note that it loads this transform module once, in line 280).
Once the `KozaRunner` object is instantiated, you run the transform with... `runner.run()`:
koza/src/koza/runner.py
Lines 197 to 207 in 4d5bc95
Which just delegates to either `run_single` (if you have defined a `transform` function in your module), or `run_serial` (if you defined `transform_record`). As mentioned before, `transform` is called once, whereas `transform_record` is called for every record defined in the source.
With all of this said, here is how you run Koza from Python with this new API:
Or, more likely, given a declarative YAML configuration:
That's basically all the command line interface does (along with parsing options and so on). Long story short, I moved almost all the wiring into one class -- `KozaRunner` -- and removed a ton of plumbing logic from the `cli_utils` module (which was confusingly part of the main Koza API).
(Side effect: this is vastly easier to test.)
New configuration format
Speaking of configuration: Koza's configuration, which was a massive pile of top-level YAML options, has been changed to be a nested series of configurations for three sections: `reader`, `writer`, and `transform`. Here's a comparison, from `examples/string-w-map/map-protein-links-detailed.yaml`.
Old config:
New config:
This may seem like a small change, but compare the old Pydantic class for `SourceConfig` to the new separated config classes:
- `KozaConfig`
- `ReaderConfig`
- `WriterConfig`
- `TransformConfig`
This makes it drastically easier to add/remove/change options. It also drastically simplifies how configurations are passed to readers and writers. Readers should and do only know about `ReaderConfig`s; writers should and do only know about `WriterConfig`s. For a dramatic example, check out `CSVReader`.
Old `__init__`:
koza/src/koza/io/reader/csv_reader.py
Lines 43 to 58 in 8a3bab9
And here's where it was created:
koza/src/koza/model/source.py
Lines 38 to 48 in 8a3bab9
And here's the `__post_init__` method in the parent class that did a lot of validation for CSV configs. (To be clear, this was the configuration for everything, not just CSV readers):
koza/src/koza/model/config/source_config.py
Lines 201 to 288 in 8a3bab9
...And in the new API:
koza/src/koza/io/reader/csv_reader.py
Lines 40 to 46 in 4d5bc95
Instantiation site:
koza/src/koza/model/source.py
Lines 53 to 56 in 4d5bc95
Parent model `__post_init__`:
koza/src/koza/model/koza.py
Lines 44 to 64 in 4d5bc95
🎉
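For anyone skimming without opening the linked files, the shape of the split looks roughly like the following. This is a sketch with plain dataclasses and invented `Toy*` names and fields (the real classes are Pydantic models with different options); the point is only that each component receives its own section of the config and can extend it by inheritance.

```python
from dataclasses import dataclass, field


@dataclass
class ToyReaderConfig:
    files: list[str] = field(default_factory=list)
    delimiter: str = ","


@dataclass
class ToyWriterConfig:
    format: str = "tsv"


@dataclass
class ToyKozaConfig:
    """Top level: just a name plus one nested section per concern."""

    name: str
    reader: ToyReaderConfig = field(default_factory=ToyReaderConfig)
    writer: ToyWriterConfig = field(default_factory=ToyWriterConfig)


class ToyReader:
    """A reader only ever sees the reader section, never the whole config."""

    def __init__(self, config: ToyReaderConfig) -> None:
        self.config = config


@dataclass
class MyNewWriterConfig(ToyWriterConfig):
    """Adding a writer option means extending one class, in one place."""

    compress: bool = False


class MyNewWriter:
    def __init__(self, config: MyNewWriterConfig) -> None:
        self.config = config


toy = ToyKozaConfig(
    name="example",
    reader=ToyReaderConfig(files=["a.tsv"], delimiter="\t"),
    writer=MyNewWriterConfig(compress=True),
)
reader = ToyReader(toy.reader)
new_writer = MyNewWriter(toy.writer)
```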
Want to write a new writer?
In the old code: add a bunch of options to the `SourceConfig` pydantic class in the top level, alongside `header_mode`, `filters`, `transform_module`, and the couple dozen other ones. Create a writer with an `__init__` method that duplicates all the names and types of all those options you added to `SourceConfig`. When you create an instance of that class, duplicate those names a third time. Need to change or add an option? I hope you remembered to make your edits in all three of those places. (Also: you can't rely on Pydantic to mark any of your options as required, because those options will never have values when people are not using your writer. Better do a bunch of validation logic in the writer itself, which won't run until the transform starts running. Or maybe you could add to the mess of `SourceConfig`, adding more logic to its `__post_init__`).
In the new code: inherit from `WriterConfig` and extend it. Have your writer take a `config: MyNewWriterConfig` parameter. Test it! Good job.
Misc.
Removals
- [...] `linkml` yourself). We could add this back in easily.
(...more to come...)
(note to self: no special mapping transform, extra fields in transforms, no difference between `files` and `file_archive`)