Finally settled on annotated examples as reference documentation for source & source_file, copied README.md for the MkDocs index.md, still need to replace README.md content with something slimmer & more appropriate.
kevinschaper committed Aug 18, 2021
1 parent dbb052f commit c097915
Showing 6 changed files with 174 additions and 75 deletions.
62 changes: 62 additions & 0 deletions docs/index.md
@@ -0,0 +1,62 @@
## Koza Data Transformation Framework

![pupa](img/pupa.png)

*Disclaimer*: Koza is in pre-alpha

Transform csv, json, and jsonl files into a target csv, json, or jsonl
format based on your dataclass model. Koza can also output data in the
[KGX format](https://github.com/biolink/kgx/blob/master/specification/kgx-format.md#kgx-format-as-tsv).


##### Highlights

- Author data transforms in semi-declarative Python
- Configure source files, expected columns/json properties and path filters, field filters, and metadata in yaml
- Create or import mapping files to be used in ingests (e.g. id mappings, type mappings)
- Create and use translation tables to map between source and target vocabularies


### Installation

```
pip install koza
```

### Getting Started

#### Writing an ingest

[Ingest Configuration](ingest_configuration.md)

#### Running an ingest

Send a local or remote csv file through Koza to get some basic information (headers, number of rows):

```bash
koza validate \
--file https://raw.githubusercontent.com/monarch-initiative/koza/dev/tests/resources/source-files/string.tsv \
--delimiter ' '
```
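
As a rough illustration of the kind of summary described above (headers and number of rows), the following standard-library sketch reads delimited text and reports both. It is not Koza's implementation; the function name `summarize_delimited` and the sample rows are hypothetical.

```python
import csv
import io


def summarize_delimited(text: str, delimiter: str = ",") -> dict:
    """Return the header fields and the number of non-blank data rows."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, [])  # first line is treated as the header
    row_count = sum(1 for row in reader if row)  # blank lines are skipped
    return {"headers": header, "rows": row_count}


# Hypothetical space-delimited sample, loosely shaped like the StringDB file
sample = "protein1 protein2 combined_score\nABC1 XYZ4 700\nABC2 XYZ5 650\n"
summary = summarize_delimited(sample, delimiter=" ")
print(summary)
```

The delimiter is passed explicitly, mirroring the `--delimiter` option in the command above.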

Sending a json or jsonl formatted file will confirm whether the file is valid json or jsonl:

```bash
koza validate \
--file ./examples/data/ZFIN_PHENOTYPE_0.jsonl.gz \
--format jsonl
```

```bash
koza validate \
--file ./examples/data/ddpheno.json.gz \
--format json \
--compression gzip
```
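
The difference between the two formats can be sketched in a few lines: a json file must parse as a single document, while a jsonl file must parse line by line. The helper below is an assumption-laden illustration using only the standard library; `is_valid` and its parameters are hypothetical names, not Koza's API.

```python
import gzip
import json
from typing import Optional


def is_valid(data: bytes, fmt: str = "json", compression: Optional[str] = None) -> bool:
    """Check whether raw bytes are valid json or jsonl, optionally gzip-compressed."""
    if compression == "gzip":
        data = gzip.decompress(data)
    text = data.decode("utf-8")
    try:
        if fmt == "jsonl":
            # jsonl: every non-blank line must be its own JSON document
            for line in text.splitlines():
                if line.strip():
                    json.loads(line)
        else:
            # json: the whole payload must be one JSON document
            json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


two_records = b'{"a": 1}\n{"b": 2}\n'
print(is_valid(two_records, fmt="jsonl"))  # each line parses on its own
print(is_valid(two_records, fmt="json"))   # two documents are not one JSON value
```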

##### Example: transforming StringDB

```bash
koza transform --source examples/string/metadata.yaml

koza transform --source examples/string-declarative/metadata.yaml
```
107 changes: 107 additions & 0 deletions docs/ingest_configuration.md
@@ -0,0 +1,107 @@
## Ingest Configuration

### Source (aka metadata.yaml)

The Source file (`metadata.yaml`) provides metadata describing the dataset and lists the Source Files to be ingested.

```yaml
name: 'somethingbase'

dataset_description:
  ingest_title: 'SomethingBase'
  ingest_url: 'https://somethingbase.org'
  description: 'SomethingBase: A Website With Some Data'
  rights: 'https://somethingbase.org/rights.html'

# The list of source files should map
source_files:
- 'gene-information.yaml'
- 'gene-to-phenotype.yaml'
```
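
To make the shape of this file concrete, here is a minimal sketch of how the parsed YAML might be held in Python dataclasses. The class names `DatasetDescriptionSketch` and `SourceSketch` are hypothetical and only mirror the keys shown above; the `parsed` dict stands in for the output of a YAML loader.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetDescriptionSketch:
    # Hypothetical mirror of the dataset_description keys in the example
    ingest_title: Optional[str] = None
    ingest_url: Optional[str] = None
    description: Optional[str] = None
    rights: Optional[str] = None


@dataclass
class SourceSketch:
    # Hypothetical mirror of the top-level keys in metadata.yaml
    name: str
    dataset_description: DatasetDescriptionSketch
    source_files: List[str] = field(default_factory=list)


# Stand-in for what a YAML loader would return for the example above
parsed = {
    "name": "somethingbase",
    "dataset_description": {
        "ingest_title": "SomethingBase",
        "ingest_url": "https://somethingbase.org",
        "description": "SomethingBase: A Website With Some Data",
        "rights": "https://somethingbase.org/rights.html",
    },
    "source_files": ["gene-information.yaml", "gene-to-phenotype.yaml"],
}

source = SourceSketch(
    name=parsed["name"],
    dataset_description=DatasetDescriptionSketch(**parsed["dataset_description"]),
    source_files=list(parsed["source_files"]),
)
print(source.source_files)
```
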
### Source File(s)
This YAML file sets properties for the ingest of a single file type from within a Source.
```yaml
name: 'name-of-ingest'

format: 'json' # Options are json or csv, default is csv

# Sets file compression, options are gzip and none, default is none
compression: 'gzip'

# list of files to be ingested
files:
- './data/really-cool-data-1.json.gz'
- './data/really-cool-data-2.json.gz'

# in a JSON ingest, this will be the path to the array to be iterated over as the input collection
json_path:
- 'data'

# Ordered list of columns for CSV files, data type can be specified as float, int or str
columns:
- 'protein1'
- 'protein2'
- 'combined_score' : 'int'

# Specify a delimiter for CSV files, default is a comma
delimiter: '\t'

# Optional delimiter for header row
header_delimiter: '|'

# Boolean to configure presence of header, default is true
has_header: False

# Number of lines to be ignored at the head of an ingest data file, default is 0
skip_lines: 10

# Boolean to skip blank lines, default is true
skip_blank_lines: True

# Set pre-defined source_file properties (like column lists) for common file formats.
# Options: 'gpi' and 'oban'
# Additional standard formats can be added in source_config.py.
standard_format: 'gpi'

# include a map file
depends_on:
- './examples/maps/alliance-gene.yaml'

# The filter DSL allows including or excluding rows based on filter blocks
filters:
- inclusion: 'include' # 'include' to include rows matching, 'exclude' to exclude rows that match
  column: 'combined_score'
  # filter_code (with 'in' expecting a list of values)
  filter_code: 'lt' # options: 'gt', 'ge', 'lt', 'le', 'eq', 'ne', 'in'
  value: 700
- inclusion: 'exclude'
  column: 'protein1'
  filter_code: 'in' # 'in' expects the value to be a list and checks that the column value is matched within the list
  value:
  - 'ABC1'
  - 'XYZ4'

# node and edge categories are required to avoid empty KGX files; the order here isn't important
node_properties:
- 'id'
- 'category'
- 'provided_by'

edge_properties:
- 'id'
- 'subject'
- 'predicate'
- ...

# In 'flat' mode, the transform operates on a single row and looping doesn't need to be specified
# In 'loop' mode, the transform code is executed only once and so the loop code that iterates over rows must be contained within the transform code
# The default is 'flat'
transform_mode: 'loop'

# Python code to run for ingest. Default is the same file name as the source_file yaml, but with a .py extension
# You probably don't need to set this property
transform_code: 'name-of-ingest.py'
```
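
The filter blocks in the example above can be read as predicates over a row. The sketch below shows one plausible way to evaluate them with the standard `operator` module. It is not Koza's implementation: `FILTER_OPS` and `row_passes` are hypothetical names, and combining multiple filters with AND is an assumption.

```python
import operator
from typing import Any, Dict, List

# Map each filter_code from the YAML example to a comparison function
FILTER_OPS = {
    "gt": operator.gt,
    "ge": operator.ge,
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
    "in": lambda cell, values: cell in values,  # value is expected to be a list
}


def row_passes(row: Dict[str, Any], filters: List[Dict[str, Any]]) -> bool:
    """Return True if the row survives every filter (assumed AND semantics)."""
    for f in filters:
        matched = FILTER_OPS[f["filter_code"]](row[f["column"]], f["value"])
        if f["inclusion"] == "include" and not matched:
            return False
        if f["inclusion"] == "exclude" and matched:
            return False
    return True


# The two filter blocks from the YAML example, as parsed dicts
filters = [
    {"inclusion": "include", "column": "combined_score", "filter_code": "lt", "value": 700},
    {"inclusion": "exclude", "column": "protein1", "filter_code": "in", "value": ["ABC1", "XYZ4"]},
]

# Hypothetical rows
rows = [
    {"protein1": "ABC1", "protein2": "P2", "combined_score": 500},
    {"protein1": "DEF2", "protein2": "P3", "combined_score": 500},
    {"protein1": "DEF2", "protein2": "P4", "combined_score": 900},
]
kept = [r for r in rows if row_passes(r, filters)]
print(kept)
```

Only the second row is kept here: the first is excluded by the `in` filter on `protein1`, and the third fails the `lt` inclusion on `combined_score`.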
40 changes: 0 additions & 40 deletions docs/source.md

This file was deleted.

33 changes: 0 additions & 33 deletions docs/source_file.md
@@ -1,33 +0,0 @@
### Source File

This YAML file sets properties for the ingest of a single file type from a within a [Source](http://example.com?link-to-source).

#### Example

name: 'gene-to-phenotype'

format: 'json'

files:
- './data/PHENOTYPE_RGD.json.gz' # "https://fms.alliancegenome.org/download/PHENOTYPE_RGD.json.gz"
- './data/PHENOTYPE_MGI.json.gz' # "https://fms.alliancegenome.org/download/PHENOTYPE_MGI.json.gz"
- './data/PHENOTYPE_WB.json.gz' # "https://fms.alliancegenome.org/download/PHENOTYPE_WB.json.gz"
- './data/PHENOTYPE_HUMAN.json.gz' # "https://fms.alliancegenome.org/download/PHENOTYPE_HUMAN.json.gz"

json_path:
- 'data'

depends_on:
- './mingestibles/maps/alliance-gene.yaml'

node_properties:
- 'id'
- 'category'
- 'provided_by'

edge_properties:
- 'id'
- 'subject'
- 'predicate'
- ...

2 changes: 0 additions & 2 deletions koza/model/config/source_config.py
@@ -151,13 +151,11 @@ class SourceFileConfig:
name: str
files: List[Union[str, Path]]
format: FormatType = FormatType.csv
path: Path = None
standard_format: StandardFormat = None
file_metadata: DatasetDescription = None
columns: List[Union[str, Dict[str, FieldType]]] = None
node_properties: List[str] = None
edge_properties: List[str] = None
required_properties: List[str] = None
delimiter: str = None
header_delimiter: str = None
has_header: bool = True
5 changes: 5 additions & 0 deletions mkdocs.yaml
@@ -0,0 +1,5 @@
site_name: Koza
nav:
- Home: '../'
- Ingest Configuration: 'ingest_configuration.md'
