Develop new Koza API + general refactoring by ptgolden · Pull Request #163 · monarch-initiative/koza

ptgolden · 2025-01-14T21:15:21Z

This is a huge commit, because it was hard to change one part of Koza without changing everything.

Here are the headlines:

Change the API for writing transforms

The best way to see this is in the diffs for the examples. Take the example string-w-map transform. (Here's the split diff for convenience)

Old API:

import uuid

from biolink_model.datamodel.pydanticmodel_v2 import Gene, PairwiseGeneToGeneInteraction

from koza.cli_utils import get_koza_app

source_name = "map-protein-links-detailed"
map_name = "entrez-2-string"

koza_app = get_koza_app(source_name)
row = koza_app.get_row()
koza_map = koza_app.get_map(map_name)

from loguru import logger

logger.info(koza_map)

gene_a = Gene(id="NCBIGene:" + koza_map[row["protein1"]]["entrez"])
gene_b = Gene(id="NCBIGene:" + koza_map[row["protein2"]]["entrez"])

pairwise_gene_to_gene_interaction = PairwiseGeneToGeneInteraction(
    id="uuid:" + str(uuid.uuid1()),
    subject=gene_a.id,
    object=gene_b.id,
    predicate="biolink:interacts_with",
    knowledge_level="not_provided",
    agent_type="not_provided",
)

koza_app.write(gene_a, gene_b, pairwise_gene_to_gene_interaction)

New API:

import uuid

from biolink_model.datamodel.pydanticmodel_v2 import Gene, PairwiseGeneToGeneInteraction

from koza.runner import KozaTransform


def transform_record(koza: KozaTransform, record: dict):
    a = record["protein1"]
    b = record["protein2"]
    mapped_a = koza.lookup(a, "entrez")
    mapped_b = koza.lookup(b, "entrez")
    gene_a = Gene(id="NCBIGene:" + mapped_a)
    gene_b = Gene(id="NCBIGene:" + mapped_b)

    pairwise_gene_to_gene_interaction = PairwiseGeneToGeneInteraction(
        id="uuid:" + str(uuid.uuid1()),
        subject=gene_a.id,
        object=gene_b.id,
        predicate="biolink:interacts_with",
        knowledge_level="not_provided",
        agent_type="not_provided",
    )

    koza.write(gene_a, gene_b, pairwise_gene_to_gene_interaction)

Note a few things here:

1. No more song and dance preamble in the transform (import cli_utils, koza_app = get_koza_app(), koza_app.get_map(), koza_app.get_row() and so on). Instead, you just need to write one function, called transform_record, which receives two arguments: koza (which provides koza.write for writing, koza.lookup for mapping, and a few other things), and record (the dict representing a row in a CSV file, a line in JSON lines, or an object in a JSON array).
2. Following from (1), we do not assume that this file will be reloaded on every record. Instead, one function is run once for each record. This should be a more intuitive way to think about how Koza transforms run. (It may pave the way for transforms that run in parallel, too.)
3. Map lookups are done with a function (koza.lookup(term, "entrez")), rather than by indexing into a nested dict (koza_map[term]["entrez"]).
4. More abstractly, further functionality could be added to the koza argument: it's a central place for functionality provided to a transform.
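
One consequence of this shape is that a transform_record function can be exercised without running Koza at all. The following is only a sketch under stated assumptions, not code from this PR: FakeKoza is a hypothetical stand-in that offers just lookup and write, and the output records are plain dicts rather than Biolink objects.

```python
from typing import Any


class FakeKoza:
    """Hypothetical stand-in for the `koza` argument (not the real KozaTransform)."""

    def __init__(self, mappings: dict[str, dict[str, str]]):
        # term -> {column name -> mapped value}
        self.mappings = mappings
        self.written: list[Any] = []

    def lookup(self, term: str, column: str) -> str:
        return self.mappings[term][column]

    def write(self, *records: Any) -> None:
        self.written.extend(records)


def transform_record(koza: FakeKoza, record: dict) -> None:
    # Same shape as the new-API example, with plain dicts as output.
    gene_a = "NCBIGene:" + koza.lookup(record["protein1"], "entrez")
    gene_b = "NCBIGene:" + koza.lookup(record["protein2"], "entrez")
    koza.write({"subject": gene_a, "object": gene_b, "predicate": "biolink:interacts_with"})


fake = FakeKoza({"P1": {"entrez": "111"}, "P2": {"entrez": "222"}})
for row in [{"protein1": "P1", "protein2": "P2"}]:
    transform_record(fake, row)
```

After the loop, fake.written holds whatever the transform wrote, so a test can assert on it directly.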

This was an example of what we used to call a "loop" transform. Here's an example of a "flat" transform. We can take examples/string/protein-links-detailed.py as an example.

Old:

import re
import uuid

from biolink_model.datamodel.pydanticmodel_v2 import PairwiseGeneToGeneInteraction, Protein

from koza.cli_utils import get_koza_app

koza_app = get_koza_app('protein-links-detailed')

while (row := koza_app.get_row()) is not None:
    protein_a = Protein(id='ENSEMBL:' + re.sub(r'\d+\.', '', row['protein1']))
    protein_b = Protein(id='ENSEMBL:' + re.sub(r'\d+\.', '', row['protein2']))

    pairwise_gene_to_gene_interaction = PairwiseGeneToGeneInteraction(
        id="uuid:" + str(uuid.uuid1()),
        subject=protein_a.id,
        object=protein_b.id,
        predicate="biolink:interacts_with",
        knowledge_level="not_provided",
        agent_type="not_provided",
    )

    koza_app.write(protein_a, protein_b, pairwise_gene_to_gene_interaction)

New:

import re
import uuid

from biolink_model.datamodel.pydanticmodel_v2 import PairwiseGeneToGeneInteraction, Protein

from koza.runner import KozaTransform


def transform(koza: KozaTransform):
    for row in koza.data:
        protein_a = Protein(id="ENSEMBL:" + re.sub(r"\d+\.", "", row["protein1"]))
        protein_b = Protein(id="ENSEMBL:" + re.sub(r"\d+\.", "", row["protein2"]))

        pairwise_gene_to_gene_interaction = PairwiseGeneToGeneInteraction(
            id="uuid:" + str(uuid.uuid1()),
            subject=protein_a.id,
            object=protein_b.id,
            predicate="biolink:interacts_with",
            knowledge_level="not_provided",
            agent_type="not_provided",
        )

        koza.write(protein_a, protein_b, pairwise_gene_to_gene_interaction)

About the same as the previous example, but this time all of the functionality for the transform is in a function named transform.

In the old API, "loop" and "flat" transforms were differentiated in the YAML config. Now, it's determined by the name of the function you define in your transform module. In both cases, transforms will only be loaded once, whether in what we used to call "loop" or "flat" mode. But:

If there is a function named transform, it will be run once, and it's up to the consumer to read all the data. (This is equivalent to "flat" mode).
If there is a function named transform_record, it will be run for every record from every reader. (Equivalent to "loop" mode)

If neither or both of these functions are defined, an error is raised and Koza bails out.
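
As a rough illustration of that rule (a sketch, not the PR's implementation): find_transform is a hypothetical helper that inspects a module for the two function names, and it raises ValueError in both failure cases purely for simplicity.

```python
import types
from typing import Callable, Optional


def find_transform(module: types.ModuleType) -> tuple[str, Callable]:
    """Return ("transform" | "transform_record", function) or raise ValueError."""
    transform: Optional[Callable] = getattr(module, "transform", None)
    transform_record: Optional[Callable] = getattr(module, "transform_record", None)

    if callable(transform) and callable(transform_record):
        raise ValueError("Define only one of `transform` or `transform_record`")
    if callable(transform):
        return "transform", transform
    if callable(transform_record):
        return "transform_record", transform_record
    raise ValueError("Define one of `transform` or `transform_record`")


# Build a throwaway in-memory module to stand in for a transform file.
demo = types.ModuleType("demo_transform")
demo.transform_record = lambda koza, record: None

mode, fn = find_transform(demo)
```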

Major overhaul of the runner

To support this new API, there is a (basically) completely new runner class.

At a high level, here's how Koza used to work:

In "loop" mode:

1. Load the Python module defined in the transform.
2. Assume that this code has top-level code that calls .get_row() and that loading it will have side effects. Let all code run.
3. Once the top-level code in the module has run, reload it and go to (2) until all rows have been processed.

In "flat" mode:

1. Load the Python module defined in the transform.
2. Assume that this code has top-level code that calls koza_app.get_row() repeatedly. Run this code repeatedly until one of the following expected exceptions is thrown: NextRowException (meaning that a source has been exhausted and we should move on to the next source), or StopIteration (meaning that all sources have been exhausted). Other exceptions are not expected and will raise an error and stop everything.
3. If neither of those expected exceptions is thrown (for example, if you managed to not call koza_app.get_row() enough to exhaust all sources), run forever.

This is the main code that did that:

koza/src/koza/app.py

Lines 101 to 124 in 8a3bab9

    
if self.source.config.transform_mode == 'flat':
    while True:
        try:
            if is_first:
                transform_module = importlib.import_module(transform_code)
                is_first = False
            else:
                importlib.reload(transform_module)
        except MapItemException as mie:
            if self.logger:
                self.logger.debug(f"{str(mie)} not found in map")
        except NextRowException:
            continue
        except ValidationError:
            if self.logger:
                self.logger.error(f"Validation error while processing: {self.source.last_row}")
            raise ValidationError
        except StopIteration:
            break
elif self.source.config.transform_mode == 'loop':
    if transform_code not in sys.modules.keys():
        importlib.import_module(transform_code)
    else:
        importlib.reload(importlib.import_module(transform_code))

The way to run a transform in Python was to call koza.cli_utils.transform_source:

koza/src/koza/cli_utils.py

Lines 39 to 94 in 8a3bab9

    
def transform_source(
    source: str,
    output_dir: str,
    output_format: OutputFormat = OutputFormat("tsv"),
    global_table: str = None,
    local_table: str = None,
    schema: str = None,
    node_type: str = None,
    edge_type: str = None,
    row_limit: int = None,
    verbose: bool = None,
    log: bool = False,
):
    """Create a KozaApp object, process maps, and run the transform

    Args:
        source (str): Path to source metadata file
        output_dir (str): Path to output directory
        output_format (OutputFormat, optional): Output format. Defaults to OutputFormat('tsv').
        global_table (str, optional): Path to global translation table. Defaults to None.
        local_table (str, optional): Path to local translation table. Defaults to None.
        schema (str, optional): Path to schema file. Defaults to None.
        row_limit (int, optional): Number of rows to process. Defaults to None.
        verbose (bool, optional): Verbose logging. Defaults to None.
        log (bool, optional): Log to file. Defaults to False.
    """
    logger = get_logger(name=Path(source).name if log else None, verbose=verbose)

    with open(source, "r") as source_fh:
        source_config = PrimaryFileConfig(**yaml.load(source_fh, Loader=UniqueIncludeLoader))

    # Set name and transform code if not provided
    if not source_config.name:
        source_config.name = Path(source).stem

    if not source_config.transform_code:
        filename = f"{Path(source).parent / Path(source).stem}.py"
        if not Path(filename).exists():
            filename = Path(source).parent / "transform.py"
        if not Path(filename).exists():
            raise FileNotFoundError(f"Could not find transform file for {source}")
        source_config.transform_code = filename

    koza_source = Source(source_config, row_limit)
    logger.debug(f"Source created: {koza_source.config.name}")
    translation_table = get_translation_table(
        global_table if global_table else source_config.global_table,
        local_table if local_table else source_config.local_table,
        logger,
    )

    koza_app = _set_koza_app(
        koza_source, translation_table, output_dir, output_format, schema, node_type, edge_type, logger
    )
    koza_app.process_maps()
    koza_app.process_sources()

In the new API, there is a new class-- KozaRunner-- that takes care of loading a configuration and kicking off the transform. Here is its __init__ function:

koza/src/koza/runner.py

Lines 146 to 155 in 4d5bc95

    
class KozaRunner:
    def __init__(
        self,
        data: Iterator[Record],
        writer: KozaWriter,
        mapping_filenames: list[str] | None = None,
        extra_transform_fields: dict[str, Any] | None = None,
        transform_record: Callable[[KozaTransform, Record], None] | None = None,
        transform: Callable[[KozaTransform], None] | None = None,
    ):

Note that this doesn't take any configuration file. The normal intended way to instantiate a KozaRunner object is with one of these class methods:

koza/src/koza/runner.py

Lines 257 to 264 in 4d5bc95

    
@classmethod
def from_config(
    cls,
    config: KozaConfig,
    output_dir: str = "",
    row_limit: int = 0,
    show_progress: bool = False,
):

koza/src/koza/runner.py

Lines 311 to 320 in 4d5bc95

    
@classmethod
def from_config_file(
    cls,
    config_filename: str,
    output_dir: str = "",
    output_format: OutputFormat | None = None,
    row_limit: int = 0,
    show_progress: bool = False,
    overrides: dict | None = None,
):

When a KozaRunner object is instantiated from a configuration, it looks for a transform:

koza/src/koza/runner.py

Lines 268 to 288 in 4d5bc95

    
if config.transform.code:
    transform_code_path = Path(config.transform.code)
    parent_path = transform_code_path.absolute().parent
    module_name = transform_code_path.stem
    logger.debug(f"Adding `{parent_path}` to system path to load transform module")
    sys.path.append(str(parent_path))
    # FIXME: Remove this from sys.path
elif config.transform.module:
    module_name = config.transform.module

if module_name:
    logger.debug(f"Loading module `{module_name}`")
    transform_module = importlib.import_module(module_name)

transform = getattr(transform_module, "transform", None)
if transform:
    logger.debug(f"Found transform function `{module_name}.transform`")
transform_record = getattr(transform_module, "transform_record", None)
if transform_record:
    logger.debug(f"Found transform function `{module_name}.transform_record`")
source = Source(config, row_limit=row_limit, show_progress=show_progress)

(Note that it loads this transform module once, in line 280).

Once the KozaRunner object is instantiated, you run the transform with... runner.run():

koza/src/koza/runner.py

Lines 197 to 207 in 4d5bc95

    
def run(self):
    if callable(self.transform) and callable(self.transform_record):
        raise ValueError("Can only define one of `transform` or `transform_record`")
    elif callable(self.transform):
        self.run_single()
    elif callable(self.transform_record):
        self.run_serial()
    else:
        raise NoTransformException("Must define one of `transform` or `transform_record`")

    self.writer.finalize()

Which just delegates to either run_single (if you have defined a transform function in your module), or run_serial (if you defined transform_record). As mentioned before, transform is called once, whereas transform_record is called for every record defined in the source.
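
To make the difference between the two entry points concrete, here is a minimal sketch that does not use Koza. MiniTransform and the bodies of run_single and run_serial are assumptions made for illustration; the real methods live on KozaRunner and do more than this.

```python
from typing import Any, Callable, Iterable, Iterator

Record = dict[str, Any]


class MiniTransform:
    """Hypothetical stand-in for the object handed to a transform."""

    def __init__(self, data: Iterable[Record]):
        self.data: Iterator[Record] = iter(data)
        self.written: list[Any] = []

    def write(self, *items: Any) -> None:
        self.written.extend(items)


def run_single(koza: MiniTransform, transform: Callable[[MiniTransform], None]) -> None:
    # One call in total; the transform iterates over koza.data itself.
    transform(koza)


def run_serial(koza: MiniTransform, transform_record: Callable[[MiniTransform, Record], None]) -> None:
    # One call per record; the runner owns the loop.
    for record in koza.data:
        transform_record(koza, record)


def transform(koza: MiniTransform) -> None:
    for row in koza.data:
        koza.write(row["n"] * 10)


def transform_record(koza: MiniTransform, record: Record) -> None:
    koza.write(record["n"] * 10)


rows = [{"n": 1}, {"n": 2}]
single = MiniTransform(rows)
run_single(single, transform)
serial = MiniTransform(rows)
run_serial(serial, transform_record)
```

Both runs write the same values; the only difference is who owns the loop.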

With all of this said, here is how you run Koza from Python with this new API:

from koza import KozaConfig, KozaRunner

runner = KozaRunner.from_config(KozaConfig({
    "reader": { ... },
    "writer": { ... },
    "transform": { ... },
}))
runner.run()

Or, more likely, given a declarative YAML configuration:

from koza import KozaRunner

config, runner = KozaRunner.from_config_file("myconfig.yaml")
runner.run()

That's basically all the command line interface does (along with parsing options and so on). Long story short, I moved almost all the wiring into one class-- KozaRunner-- and removed a ton of plumbing logic away from the cli_utils module (which was confusingly part of the main Koza API).

(Side effect: this is vastly easier to test).

New configuration format

Speaking of configuration. Koza's configuration, which was a massive pile of top-level YAML options, has been changed to be a nested series of configurations for three sections: reader, writer, and transform. Here's a comparison, from examples/string-w-map/map-protein-links-detailed.yaml.

Old config:

name: 'map-protein-links-detailed'

delimiter: ' '

files:
  - './examples/data/string.tsv'
  - './examples/data/string2.tsv'

metadata: !include './examples/string-w-map/metadata.yaml'

columns:
  - 'protein1'
  - 'protein2'
  - 'neighborhood'
  - 'fusion'
  - 'cooccurence'
  - 'coexpression'
  - 'experimental'
  - 'database'
  - 'textmining'
  - 'combined_score' : 'int'

filters:
  - inclusion: 'include'
    column: 'combined_score'
    filter_code: 'lt'
    value: 700

depends_on:
  - './examples/maps/entrez-2-string.yaml'

transform_mode: 'flat'

node_properties:
  - 'id'
  - 'category'
  - 'provided_by'

edge_properties:
  - 'id'
  - 'subject'
  - 'predicate'
  - 'object'
  - 'category'
  - 'relation'
  - 'provided_by'

New config:

name: 'map-protein-links-detailed'

metadata: !include './examples/string-w-map/metadata.yaml'

reader:
  format: csv
  delimiter: ' '
  files:
    - './examples/data/string.tsv'
    - './examples/data/string2.tsv'

  columns:
    - 'protein1'
    - 'protein2'
    - 'neighborhood'
    - 'fusion'
    - 'cooccurence'
    - 'coexpression'
    - 'experimental'
    - 'database'
    - 'textmining'
    - 'combined_score' : 'int'

transform:
  filters:
    - inclusion: 'include'
      column: 'combined_score'
      filter_code: 'lt'
      value: 700
  mappings:
    - './examples/maps/entrez-2-string.yaml'

writer:
  node_properties:
    - 'id'
    - 'category'
    - 'provided_by'

  edge_properties:
    - 'id'
    - 'subject'
    - 'predicate'
    - 'object'
    - 'category'
    - 'relation'
    - 'provided_by'
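
Judging only from the two example files above, moving from the old layout to the new one is mostly a matter of nesting keys. The helper below is a hypothetical sketch, not something shipped in this PR: nest_config and its key table are assumptions that cover just the keys visible in the example, and the helper does not add the new `format` key.

```python
from typing import Any

# Where each old top-level key appears to live in the new layout, inferred
# from the example files only; this table is an assumption and is incomplete.
KEY_LOCATIONS = {
    "delimiter": ("reader", "delimiter"),
    "files": ("reader", "files"),
    "columns": ("reader", "columns"),
    "filters": ("transform", "filters"),
    "depends_on": ("transform", "mappings"),
    "node_properties": ("writer", "node_properties"),
    "edge_properties": ("writer", "edge_properties"),
}
DROPPED_KEYS = {"transform_mode"}  # now implied by the transform function's name


def nest_config(old: dict[str, Any]) -> dict[str, Any]:
    new: dict[str, Any] = {}
    for key, value in old.items():
        if key in DROPPED_KEYS:
            continue
        if key in KEY_LOCATIONS:
            section, new_key = KEY_LOCATIONS[key]
            new.setdefault(section, {})[new_key] = value
        else:
            new[key] = value  # e.g. `name` and `metadata` stay at the top level
    return new


old_config = {
    "name": "map-protein-links-detailed",
    "delimiter": " ",
    "files": ["./examples/data/string.tsv"],
    "depends_on": ["./examples/maps/entrez-2-string.yaml"],
    "transform_mode": "flat",
}
new_config = nest_config(old_config)
```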

This may seem like a small change, but compare the old Pydantic class for SourceConfig to the new separated config classes.

This makes it drastically easier to add/remove/change options. It also drastically simplifies how configurations are passed to readers and writers. Readers should and do only know about ReaderConfigs; writers should and do only know about WriterConfigs. For a dramatic example, check out CSVReader.

Old __init__:

koza/src/koza/io/reader/csv_reader.py

Lines 43 to 58 in 8a3bab9

    
def __init__(
    self,
    io_str: IO[str],
    field_type_map: Dict[str, FieldType] = None,
    delimiter: str = ",",
    header: Union[int, HeaderMode] = HeaderMode.infer,
    header_delimiter: str = None,
    header_prefix: str = None,
    dialect: str = "excel",
    skip_blank_lines: bool = True,
    name: str = "csv file",
    comment_char: str = "#",
    row_limit: int = None,
    *args,
    **kwargs,
):

And here's where it was created:

koza/src/koza/model/source.py

Lines 38 to 48 in 8a3bab9

    
CSVReader(
    resource_io,
    name=config.name,
    field_type_map=config.field_type_map,
    delimiter=config.delimiter,
    header=config.header,
    header_delimiter=config.header_delimiter,
    header_prefix=config.header_prefix,
    comment_char=self.config.comment_char,
    row_limit=self.row_limit,
)

And here's the __post_init__ method in the parent class that did a lot of validation for CSV configs. (To be clear, this was the configuration for everything, not just CSV readers):

koza/src/koza/model/config/source_config.py

Lines 201 to 288 in 8a3bab9

    
def __post_init__(self):
    # Get files as paths, or extract them from an archive
    if self.file_archive:
        files = self.extract_archive()
    else:
        files = self.files

    files_as_paths: List[Path] = []
    for file in files:
        if isinstance(file, str):
            files_as_paths.append(Path(file))
        else:
            files_as_paths.append(file)
    object.__setattr__(self, "files", files_as_paths)

    # If metadata looks like a file path attempt to load it from the yaml
    if self.metadata and isinstance(self.metadata, str):
        try:
            with open(self.metadata, "r") as meta:
                object.__setattr__(self, "metadata", DatasetDescription(**yaml.safe_load(meta)))
        except Exception as e:
            raise ValueError(f"Unable to load metadata from {self.metadata}: {e}")

    # Format tab as delimiter
    if self.delimiter in ["tab", "\\t"]:
        object.__setattr__(self, "delimiter", "\t")

    # Filter columns
    filtered_columns = [column_filter.column for column_filter in self.filters]
    all_columns = []
    if self.columns:
        all_columns = [next(iter(column)) if isinstance(column, Dict) else column for column in self.columns]
    if self.header == HeaderMode.none and not self.columns:
        raise ValueError(
            "there is no header and columns have not been supplied\n"
            "configure the 'columns' field or set header to the 0-based"
            "index in which it appears in the file, or set this value to"
            "'infer'"
        )
    for column in filtered_columns:
        if column not in all_columns:
            raise (ValueError(f"Filter column {column} not in column list"))
    for column_filter in self.filters:
        if column_filter.filter_code in ["lt", "gt", "lte", "gte"]:
            if not isinstance(column_filter.value, (int, float)):
                raise ValueError(f"Filter value must be int or float for operator {column_filter.filter_code}")
        elif column_filter.filter_code == "eq":
            if not isinstance(column_filter.value, (str, int, float)):
                raise ValueError(
                    f"Filter value must be string, int or float for operator {column_filter.filter_code}"
                )
        elif column_filter.filter_code == "in":
            if not isinstance(column_filter.value, List):
                raise ValueError(f"Filter value must be List for operator {column_filter.filter_code}")

    # Check for conflicting configurations
    if self.format == FormatType.csv and self.required_properties:
        raise ValueError(
            "CSV specified but required properties have been configured\n"
            "Either set format to jsonl or change properties to columns in the config"
        )
    if self.columns and self.format != FormatType.csv:
        raise ValueError(
            "Columns have been configured but format is not csv\n"
            "Either set format to csv or change columns to properties in the config"
        )
    if self.json_path and self.format != FormatType.json:
        raise ValueError(
            "iterate_over has been configured but format is not json\n"
            "Either set format to json or remove iterate_over in the configuration"
        )

    # Create a field_type_map if columns are supplied
    if self.columns:
        field_type_map = {}
        for field in self.columns:
            if isinstance(field, str):
                field_type_map[field] = FieldType.str
            else:
                if len(field) != 1:
                    raise ValueError("Field type map contains more than one key")
                for key, val in field.items():
                    field_type_map[key] = val
        object.__setattr__(self, "field_type_map", field_type_map)

...And in the new API:

koza/src/koza/io/reader/csv_reader.py

Lines 40 to 46 in 4d5bc95

    
def __init__(
    self,
    io_str: IO[str],
    config: CSVReaderConfig,
    *args: Any,
    **kwargs: Any,
):

Instantiation site:

koza/src/koza/model/source.py

Lines 53 to 56 in 4d5bc95

    
CSVReader(
    resource.reader,
    config=reader_config,
)

Parent model __post_init__:

koza/src/koza/model/koza.py

Lines 44 to 64 in 4d5bc95

    
def __post_init__(self):
    # If metadata looks like a file path attempt to load it from the yaml
    if self.metadata and isinstance(self.metadata, str):
        try:
            with open(self.metadata) as meta:
                object.__setattr__(self, "metadata", DatasetDescription(**yaml.safe_load(meta)))
        except Exception as e:
            raise ValueError(f"Unable to load metadata from {self.metadata}: {e}") from e

    if self.reader.format == InputFormat.csv and self.reader.columns is not None:
        filtered_columns = OrderedSet([column_filter.column for column_filter in self.transform.filters])
        all_columns = OrderedSet(
            [column if isinstance(column, str) else list(column.keys())[0] for column in self.reader.columns]
        )
        extra_filtered_columns = filtered_columns - all_columns
        if extra_filtered_columns:
            quote = "'"
            raise ValueError(
                "One or more filter columns not present in designated CSV columns:"
                f" {', '.join([f'{quote}{c}{quote}' for c in extra_filtered_columns])}"
            )

🎉

Want to write a new writer?

In the old code: add a bunch of options to the SourceConfig pydantic class in the top level, alongside header_mode, filters, transform_module, and the couple dozen other ones. Create a writer with an __init__ method that duplicates all the names and types of all those options you added to SourceConfig. When you create an instance of that class, duplicate those names a third time. Need to change or add an option? I hope you remembered to make your edits in all three of those places. (Also: you can't rely on Pydantic to mark any of your options as required, because those options will never have values when people are not using your writer. Better do a bunch of validation logic in the writer itself, which won't run until the transform starts running. Or maybe you could add to the mess of SourceConfig, adding more logic to its __post_init__).

In the new code: inherit from WriterConfig and extend it. Have your writer take a config: MyNewWriterConfig parameter. Test it! Good job.
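
The pattern looks roughly like the sketch below. It uses standard-library dataclasses so that it runs on its own; the WriterConfig here is a hypothetical stand-in rather than Koza's class, and MyNewWriter and its separator option are invented for illustration.

```python
import io
from dataclasses import dataclass
from typing import IO


@dataclass
class WriterConfig:
    """Hypothetical stand-in for Koza's base writer config."""

    node_properties: list[str]


@dataclass
class MyNewWriterConfig(WriterConfig):
    separator: str = "|"  # an option only this writer understands


class MyNewWriter:
    def __init__(self, out: IO[str], config: MyNewWriterConfig):
        # All options arrive in one typed object; nothing is repeated here.
        self.out = out
        self.config = config

    def write(self, node: dict) -> None:
        values = [str(node.get(prop, "")) for prop in self.config.node_properties]
        self.out.write(self.config.separator.join(values) + "\n")


buffer = io.StringIO()
writer = MyNewWriter(buffer, MyNewWriterConfig(node_properties=["id", "category"]))
writer.write({"id": "NCBIGene:1", "category": "biolink:Gene"})
```

Adding or renaming an option then touches only the config class and the place the writer reads it.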

Misc.

Better type annotations everywhere.
Drop support for Python <3.10, add support for 3.13
Add a bunch of tests. (Which in turn led to fixing a number of bugs).
Progress bars. Cool! Check out 144ebca.
Stream input from compressed files. (i.e. there's no need to extract them first).

Removals

No more translation tables (global or local). Use maps or SSSOM instead. (Need to expand on this)
No more linkml validation on every row (which, to be clear, caused huge overhead when enabled. Also I haven't seen people use it). Instead: transform, then validate (by just using linkml yourself). We could add this back in easily.
Got rid of lots of unused code. See df7baa2 626979f 645835c 2f3f4f0 e05c813

(...more to come...)

(note to self: no special mapping transform, extra fields in transforms, no difference between files and file_archive)

Without making any changes to functionality, this separates a koza configuration into a ReaderConfiguration, TransformConfiguration, and WriterConfiguration, all contained within a KozaConfiguration.

The big changes are:
1. Taking in a JSON{,L}ReaderConfig object for all configuration
2. Defining iteration via `__iter__()` and `yield`

First, replaces the many named parameters with a single CSVReaderConfig object. Second, uses `__iter__()` and `yield` to define iteration. Third, refactors the header consumption and validation code, and wraps accessing the header in a property on the class.
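
Those three changes can be pictured with a much smaller reader. This is a sketch only: CSVConfig and MiniCSVReader are hypothetical, carry two options instead of the real config's full set, and make no claim about how Koza's CSVReader handles headers internally.

```python
import csv
import io
from dataclasses import dataclass
from typing import IO, Iterator, Optional


@dataclass
class CSVConfig:
    """Hypothetical, much smaller stand-in for CSVReaderConfig."""

    delimiter: str = ","
    comment_char: str = "#"


class MiniCSVReader:
    def __init__(self, io_str: IO[str], config: CSVConfig):
        self.io_str = io_str
        self.config = config
        self._header: Optional[list[str]] = None

    @property
    def header(self) -> list[str]:
        # Consume the header lazily, skipping leading comment lines.
        if self._header is None:
            for line in self.io_str:
                if not line.startswith(self.config.comment_char):
                    self._header = next(csv.reader([line], delimiter=self.config.delimiter))
                    break
            else:
                raise ValueError("No header found")
        return self._header

    def __iter__(self) -> Iterator[dict[str, str]]:
        header = self.header
        for row in csv.reader(self.io_str, delimiter=self.config.delimiter):
            yield dict(zip(header, row))


text = "# comment\nprotein1\tprotein2\nA\tB\n"
rows = list(MiniCSVReader(io.StringIO(text), CSVConfig(delimiter="\t")))
```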

This adds a new class: KozaRunner, which represents a new way of running Koza transforms. It is a work in progress and still not at feature parity with existing transforms. Essentially, the KozaRunner class takes three parameters:
1. Data (the data to be transformed)
2. A function to transform that data, either all at once or row-by-row
3. A writer that will do something with the transformed output
See the documentation in src/koza/runner.py for more details.

This commit makes multiple changes to koza.io.utils.open_resource:
- Adds support for opening tar files.
- Handles archives (zip and tar) in the same way that the old `file_archive` source configuration did: it assumes all files in an archive are of the same format (CSV, JSONL, etc.). It will likely be future work to allow a way to specify that only certain files in an archive should be handled.
- Adds more robust checking for gzip compression than checking for a `.gz` extension.
- open_resource() now returns one or more SizedResource objects that indicate the size of the resource being opened, and a `.tell()` method that indicates the position being read in that resource. This will be necessary to add some sort of progress bar in the future.
- Resources downloaded from the Web now use the same logic as local files to check for compression/archives.
- Importantly, the resources returned by `open_resource` *are not automatically closed*. This was inconsistent in the previous version. It is up to the consumer of the function to explicitly close resources.
- Adds more tests for compressed and archival formats.
- Small typing changes for other koza.io.utils functions, adding Optional where appropriate
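
One common way to detect gzip compression without trusting the file extension is to peek at the two-byte gzip magic number. The sketch below shows that general technique; is_gzipped and open_text are hypothetical helpers, and this is not necessarily how open_resource performs its check.

```python
import gzip
import io
from typing import BinaryIO

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(stream: BinaryIO) -> bool:
    """Peek at the first two bytes of a seekable stream, then rewind."""
    start = stream.tell()
    magic = stream.read(2)
    stream.seek(start)
    return magic == GZIP_MAGIC


def open_text(stream: BinaryIO) -> io.TextIOWrapper:
    # Decompress on the fly when needed; the caller must close the result.
    raw = gzip.GzipFile(fileobj=stream) if is_gzipped(stream) else stream
    return io.TextIOWrapper(raw, encoding="utf-8")


compressed = io.BytesIO(gzip.compress(b"a,b\n1,2\n"))
plain = io.BytesIO(b"a,b\n1,2\n")
```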


I realized at some point that creating a map from a reader file is just a type of transform. This change in the configuration makes achieving that possible. A map transform is just a transform that relies on two additional configuration keys: `key` and `values`. To make passing those values in a YAML config possible, this commit makes it so that any extra fields in the configuration are parsed into an `extra_fields` field in a transform.
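To make the idea concrete, here is a minimal sketch of collecting undeclared configuration keys into an `extra_fields` mapping. It uses a plain dataclass and a hypothetical `TransformConfigSketch.from_dict` helper; the real Koza configuration classes may do this differently, and the module name in the example is made up.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class TransformConfigSketch:
    """Hypothetical stand-in for a transform configuration."""

    code: str | None = None
    module: str | None = None
    extra_fields: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> TransformConfigSketch:
        known = {f.name for f in fields(cls)} - {"extra_fields"}
        declared = {k: v for k, v in raw.items() if k in known}
        # Anything the class does not declare is kept rather than rejected.
        extra = {k: v for k, v in raw.items() if k not in known}
        return cls(**declared, extra_fields=extra)


# What a parsed YAML `transform:` section for a map transform might look like.
raw = {
    "module": "mypackage.transforms.make_map",  # hypothetical module name
    "key": "id",
    "values": ["column_a", "column_b"],
}
config = TransformConfigSketch.from_dict(raw)
```

With this shape, a map transform can read `key` and `values` from `config.extra_fields` without the configuration class needing to know about them in advance.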


This allows a transform to be defined as a module resolvable from the Python import path (`sys.path`), e.g. `mypackage.transforms.example_transform`, rather than having to define it as a file (`/home/user/code/mypackage/transforms/example_transform.py`). This makes it possible to create generic transforms that can be packaged, installed, and re-used, without having to track down the filename of the Python file where the transform code is located.
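A small sketch of what loading by module name involves, using only `importlib` from the standard library. The helper `load_transform_functions` and the module name `example_transform_sketch` are hypothetical; a temporary directory stands in for an installed package.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def load_transform_functions(module_name: str):
    """Import a module by dotted name and return (transform, transform_record)."""
    module = importlib.import_module(module_name)
    return getattr(module, "transform", None), getattr(module, "transform_record", None)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for an installed package: a module that is importable by name.
    Path(tmp, "example_transform_sketch.py").write_text(
        "def transform_record(koza, record):\n    return record\n"
    )
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        transform, transform_record = load_transform_functions("example_transform_sketch")
    finally:
        # Leave the import path as it was found.
        sys.path.remove(tmp)
```

The caller never needs to know where the module lives on disk, only its importable name.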


This commit builds on the changes in a60c607, bfa87d3, and eaff691. It fully implements the mapping functionality that was present in the previous method of writing transforms, although with a new API. Instead of being given a large dict-of-dicts with mappings defined for terms, a method is passed via the KozaTransform object used in a transform, where a map lookup is done like so:

    def transform(koza: KozaTransform):
        term = "example"
        mapped_term = koza.lookup(term, "column_b")

...where the map was loaded from a CSV file that might look like this:

    id,column_a,column_b
    example,alias1,alias2

...resulting in mapped_term evaluating to `"alias2"`.
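For readers who want to try the lookup behavior without Koza installed, the following self-contained sketch reproduces the example above with the standard library. `load_map` and `LookupSketch` are hypothetical helpers; how the real `KozaTransform.lookup` handles missing terms or columns is not covered here.

```python
import csv
import io

MAP_CSV = """id,column_a,column_b
example,alias1,alias2
"""


def load_map(text: str, key: str) -> dict[str, dict[str, str]]:
    """Index the rows of a CSV document by the value in the `key` column."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[key]: row for row in reader}


class LookupSketch:
    """Hypothetical object exposing a `lookup(term, column)` method."""

    def __init__(self, mapping: dict[str, dict[str, str]]) -> None:
        self._mapping = mapping

    def lookup(self, term: str, column: str) -> str:
        # Raises KeyError for unknown terms or columns in this sketch.
        return self._mapping[term][column]


koza = LookupSketch(load_map(MAP_CSV, key="id"))
mapped_term = koza.lookup("example", "column_b")
```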


kevinschaper · 2025-07-10T20:06:18Z

I'm going to go ahead and bring this into main, but not make a release yet.

Patrick Golden added 27 commits January 10, 2025 13:01

Refactor main koza configuration file

db4fc50

Without making any changes to functionality, this separates a koza configuration into a ReaderConfiguration, TransformConfiguration, and WriterConfiguration, all contained within a KozaConfiguration.

Refactor JSON{,L} readers

88f40e8

The big changes are: 1. Taking in a JSON{,L}ReaderConfig object for all configuration 2. Defining iteration via `__iter__()` and `yield`
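The pattern described here, a reader that takes one config object and exposes iteration through `__iter__()` and `yield`, can be sketched as follows. `JSONLReaderSketch` and `JSONLReaderConfigSketch` are hypothetical names for this illustration and do not mirror the fields of the real Koza reader configs.

```python
import io
import json
from collections.abc import Iterator
from dataclasses import dataclass, field
from typing import IO, Any


@dataclass
class JSONLReaderConfigSketch:
    """Hypothetical config object replacing a long list of named parameters."""

    required_properties: list[str] = field(default_factory=list)


class JSONLReaderSketch:
    def __init__(self, io_str: IO[str], config: JSONLReaderConfigSketch) -> None:
        self.io_str = io_str
        self.config = config

    def __iter__(self) -> Iterator[dict[str, Any]]:
        # Because this method contains `yield`, calling it returns a generator,
        # so no separate `__next__` bookkeeping is needed.
        for line in self.io_str:
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            missing = [p for p in self.config.required_properties if p not in row]
            if missing:
                raise ValueError(f"Row is missing required properties: {missing}")
            yield row


lines = io.StringIO('{"id": "a", "n": 1}\n\n{"id": "b", "n": 2}\n')
reader = JSONLReaderSketch(lines, JSONLReaderConfigSketch(required_properties=["id"]))
rows = list(reader)
```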

Refactor the CSV reader

cdbc6f6

First, replaces the many named parameters with a single CSVReaderConfig object. Second, uses `__iter__()` and `yield` to define iteration. Third, refactors the header consumption and validation code, and wraps accessing the header in a property on the class.

Refactor writers to use WriterConfig rather than named parameters

0f0bd6b

Move metadata from transform config to KozaConfig

ce1a47e

Add ability to override output format when creating a KozaRunner

caa10f4

Make CSV the default type of reader

cb3e0c6

This was not working correctly with the discriminated union field

Add a PassthroughWriter that simply store written data from a transform

0ceb760

Add back missing header_delimiter option in CSVReaderConfig

73fa492

Provide defaults for all Config options

c12442a

This makes config creation more lenient. Note that this means it's possible to have an empty transform. The lack of a transform would be detected when a KozaRunner is run.

Add tests to CSVReader

939769e

Also remove unnecessary `files=[]` calls, since that is the default as of eaff691.

Add ability to override configuration fields from a transform YAML

b5617a3

Addresses #137

Move transform integration tests to new API

0d98af1

Remove unused tests and test configuration using old API

df7baa2

Fix filter tests

2bcb4e2

Re-implement row_limit

fcf8314

Fix testing for sources with multiple files

cadffde

Remove unused MapDict class

626979f

Remove unused TranslationTable class

645835c

Breakup model/config/source_config.py into smaller modules

9cc6678

Move CLI to new API

1c2d8fb

Add a bit more logging to the runner to match messages in cli_utils.py

a8525fe

ptgolden mentioned this pull request Jan 14, 2025

What is the purpose of "is not None" in the the transform.py idiom? #157

Closed

Patrick Golden added 2 commits January 15, 2025 09:48

Add ruff linting rules, set line length to 120

32a96e3

Use ruff as formatter, remove black dependency

d1c128e

Patrick Golden and others added 24 commits January 28, 2025 13:10

Move split_file function into a tools directory

530f317

Handle row limit in Source rather than individual readers

bc06a99

Add coverage package and make target

287e532

Linting log_utils.py

e870fad

Add caplog pytest fixture

0dc3f0c

Linting sssom_config.py

77fcf06

Linting source_config.py

8f82086

Small edits in CSVReader

bfde4f9

* Use match statement for header detection logic
* Remove unused line_num and line_count variables

Linting io/utils.py

e6e78f6

Linting utils/row_filter.py

a5f97a0

Refactor and fix types on UniqueIncludeYamlLoader

02085dd

Clarify some unit tests

a572d26

Formatting, renaming variables tests, removing unnecessary config params

Remove unused code from old API

2f3f4f0

Temporarily disable validator integration test

15d8ce7

Update pyproject.toml

9e480d9

* Use [project] instead of [tool.poetry]
* Set minimum python version to 3.10
* Use pyupgrade lint rules

Formatting for new pyupgrade (with 3.10+) rules

70fb33d

Fix formatting errors that weren't able to be fixed automically

0a861c7

Fix poetry deprecation warning about dev depdendency group

12aaad6

Remove unused linkml validation code (& linkml dependency)

e05c813

Fix one typing and one runtime error in SSSOMConfig

6e15bee

I had switched to itertuples (from iterrows), but didn't change how rows were interacted with.

Bump sssom-py to 0.4.15 and get rid of pandas warning

86770ae

Fix type assertion for gzip.open

99d3637

Update GH test action to test against 3.10-3.13

8b3f1ea

Re-resolve lockfile

4d5bc95

kevinschaper self-requested a review July 10, 2025 20:05

kevinschaper approved these changes Jul 10, 2025


kevinschaper merged commit ab47894 into main Jul 10, 2025
4 checks passed

kevinschaper deleted the koza-api-new branch July 10, 2025 20:06

kevinschaper mentioned this pull request Aug 26, 2025

Update to latest koza API monarch-initiative/zfin-orthology-ingest#2

Open

kevinschaper merged 55 commits into main from koza-api-new


if self.source.config.transform_mode == 'flat':
    while True:
        try:
            if is_first:
                transform_module = importlib.import_module(transform_code)
                is_first = False
            else:
                importlib.reload(transform_module)
        except MapItemException as mie:
            if self.logger:
                self.logger.debug(f"{str(mie)} not found in map")
        except NextRowException:
            continue
        except ValidationError:
            if self.logger:
                self.logger.error(f"Validation error while processing: {self.source.last_row}")
            raise ValidationError
        except StopIteration:
            break
elif self.source.config.transform_mode == 'loop':
    if transform_code not in sys.modules.keys():
        importlib.import_module(transform_code)
    else:
        importlib.reload(importlib.import_module(transform_code))

def transform_source(
    source: str,
    output_dir: str,
    output_format: OutputFormat = OutputFormat("tsv"),
    global_table: str = None,
    local_table: str = None,
    schema: str = None,
    node_type: str = None,
    edge_type: str = None,
    row_limit: int = None,
    verbose: bool = None,
    log: bool = False,
):
    """Create a KozaApp object, process maps, and run the transform

    Args:
        source (str): Path to source metadata file
        output_dir (str): Path to output directory
        output_format (OutputFormat, optional): Output format. Defaults to OutputFormat('tsv').
        global_table (str, optional): Path to global translation table. Defaults to None.
        local_table (str, optional): Path to local translation table. Defaults to None.
        schema (str, optional): Path to schema file. Defaults to None.
        row_limit (int, optional): Number of rows to process. Defaults to None.
        verbose (bool, optional): Verbose logging. Defaults to None.
        log (bool, optional): Log to file. Defaults to False.
    """
    logger = get_logger(name=Path(source).name if log else None, verbose=verbose)

    with open(source, "r") as source_fh:
        source_config = PrimaryFileConfig(**yaml.load(source_fh, Loader=UniqueIncludeLoader))

    # Set name and transform code if not provided
    if not source_config.name:
        source_config.name = Path(source).stem

    if not source_config.transform_code:
        filename = f"{Path(source).parent / Path(source).stem}.py"
        if not Path(filename).exists():
            filename = Path(source).parent / "transform.py"
        if not Path(filename).exists():
            raise FileNotFoundError(f"Could not find transform file for {source}")
        source_config.transform_code = filename

    koza_source = Source(source_config, row_limit)
    logger.debug(f"Source created: {koza_source.config.name}")
    translation_table = get_translation_table(
        global_table if global_table else source_config.global_table,
        local_table if local_table else source_config.local_table,
        logger,
    )

    koza_app = _set_koza_app(
        koza_source, translation_table, output_dir, output_format, schema, node_type, edge_type, logger
    )
    koza_app.process_maps()
    koza_app.process_sources()

class KozaRunner:
    def __init__(
        self,
        data: Iterator[Record],
        writer: KozaWriter,
        mapping_filenames: list[str] | None = None,
        extra_transform_fields: dict[str, Any] | None = None,
        transform_record: Callable[[KozaTransform, Record], None] | None = None,
        transform: Callable[[KozaTransform], None] | None = None,
    ):

    @classmethod
    def from_config(
        cls,
        config: KozaConfig,
        output_dir: str = "",
        row_limit: int = 0,
        show_progress: bool = False,
    ):

    @classmethod
    def from_config_file(
        cls,
        config_filename: str,
        output_dir: str = "",
        output_format: OutputFormat | None = None,
        row_limit: int = 0,
        show_progress: bool = False,
        overrides: dict | None = None,
    ):

        if config.transform.code:
            transform_code_path = Path(config.transform.code)
            parent_path = transform_code_path.absolute().parent
            module_name = transform_code_path.stem
            logger.debug(f"Adding `{parent_path}` to system path to load transform module")
            sys.path.append(str(parent_path))
            # FIXME: Remove this from sys.path
        elif config.transform.module:
            module_name = config.transform.module

        if module_name:
            logger.debug(f"Loading module `{module_name}`")
            transform_module = importlib.import_module(module_name)

            transform = getattr(transform_module, "transform", None)
            if transform:
                logger.debug(f"Found transform function `{module_name}.transform`")
            transform_record = getattr(transform_module, "transform_record", None)
            if transform_record:
                logger.debug(f"Found transform function `{module_name}.transform_record`")
        source = Source(config, row_limit=row_limit, show_progress=show_progress)

    def run(self):
        if callable(self.transform) and callable(self.transform_record):
            raise ValueError("Can only define one of `transform` or `transform_record`")
        elif callable(self.transform):
            self.run_single()
        elif callable(self.transform_record):
            self.run_serial()
        else:
            raise NoTransformException("Must define one of `transform` or `transform_record`")

        self.writer.finalize()

def __init__(
    self,
    io_str: IO[str],
    field_type_map: Dict[str, FieldType] = None,
    delimiter: str = ",",
    header: Union[int, HeaderMode] = HeaderMode.infer,
    header_delimiter: str = None,
    header_prefix: str = None,
    dialect: str = "excel",
    skip_blank_lines: bool = True,
    name: str = "csv file",
    comment_char: str = "#",
    row_limit: int = None,
    *args,
    **kwargs,
):

CSVReader(
    resource_io,
    name=config.name,
    field_type_map=config.field_type_map,
    delimiter=config.delimiter,
    header=config.header,
    header_delimiter=config.header_delimiter,
    header_prefix=config.header_prefix,
    comment_char=self.config.comment_char,
    row_limit=self.row_limit,
)

def __post_init__(self):
    # Get files as paths, or extract them from an archive
    if self.file_archive:
        files = self.extract_archive()
    else:
        files = self.files

    files_as_paths: List[Path] = []
    for file in files:
        if isinstance(file, str):
            files_as_paths.append(Path(file))
        else:
            files_as_paths.append(file)
    object.__setattr__(self, "files", files_as_paths)

    # If metadata looks like a file path attempt to load it from the yaml
    if self.metadata and isinstance(self.metadata, str):
        try:
            with open(self.metadata, "r") as meta:
                object.__setattr__(self, "metadata", DatasetDescription(**yaml.safe_load(meta)))
        except Exception as e:
            raise ValueError(f"Unable to load metadata from {self.metadata}: {e}")

    # Format tab as delimiter
    if self.delimiter in ["tab", "\\t"]:
        object.__setattr__(self, "delimiter", "\t")

    # Filter columns
    filtered_columns = [column_filter.column for column_filter in self.filters]

    all_columns = []
    if self.columns:
        all_columns = [next(iter(column)) if isinstance(column, Dict) else column for column in self.columns]

    if self.header == HeaderMode.none and not self.columns:
        raise ValueError(
            "there is no header and columns have not been supplied\n"
            "configure the 'columns' field or set header to the 0-based"
            "index in which it appears in the file, or set this value to"
            "'infer'"
        )

    for column in filtered_columns:
        if column not in all_columns:
            raise (ValueError(f"Filter column {column} not in column list"))

    for column_filter in self.filters:
        if column_filter.filter_code in ["lt", "gt", "lte", "gte"]:
            if not isinstance(column_filter.value, (int, float)):
                raise ValueError(f"Filter value must be int or float for operator {column_filter.filter_code}")
        elif column_filter.filter_code == "eq":
            if not isinstance(column_filter.value, (str, int, float)):
                raise ValueError(
                    f"Filter value must be string, int or float for operator {column_filter.filter_code}"
                )
        elif column_filter.filter_code == "in":
            if not isinstance(column_filter.value, List):
                raise ValueError(f"Filter value must be List for operator {column_filter.filter_code}")

    # Check for conflicting configurations
    if self.format == FormatType.csv and self.required_properties:
        raise ValueError(
            "CSV specified but required properties have been configured\n"
            "Either set format to jsonl or change properties to columns in the config"
        )
    if self.columns and self.format != FormatType.csv:
        raise ValueError(
            "Columns have been configured but format is not csv\n"
            "Either set format to csv or change columns to properties in the config"
        )
    if self.json_path and self.format != FormatType.json:
        raise ValueError(
            "iterate_over has been configured but format is not json\n"
            "Either set format to json or remove iterate_over in the configuration"
        )

    # Create a field_type_map if columns are supplied
    if self.columns:
        field_type_map = {}
        for field in self.columns:
            if isinstance(field, str):
                field_type_map[field] = FieldType.str
            else:
                if len(field) != 1:
                    raise ValueError("Field type map contains more than one key")
                for key, val in field.items():
                    field_type_map[key] = val
        object.__setattr__(self, "field_type_map", field_type_map)
