Finally settled on annotated examples as reference documentation for source & source_file, copied README.md for the MkDocs index.md, still need to replace README.md content with something slimmer & more appropriate.
1 parent dbb052f, commit c097915
Showing 6 changed files with 174 additions and 75 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
## Koza Data Transformation Framework | ||
|
||
 | ||
|
||
*Disclaimer*: Koza is in pre-alpha | ||
|
||
Transform csv, json, and jsonl files into a target | ||
csv, json, or jsonl format based on your dataclass model. Koza can also output | ||
data in the [KGX format](https://github.com/biolink/kgx/blob/master/specification/kgx-format.md#kgx-format-as-tsv). | ||
|
||
|
||
##### Highlights | ||
|
||
- Author data transforms in semi-declarative Python | ||
- Configure source files, expected columns/json properties and path filters, field filters, and metadata in yaml | ||
- Create or import mapping files to be used in ingests (e.g. id mappings, type mappings) | ||
- Create and use translation tables to map between source and target vocabularies | ||
|
||
|
||
### Installation | ||
|
||
``` | ||
pip install koza | ||
``` | ||
|
||
### Getting Started | ||
|
||
#### Writing an ingest | ||
|
||
[Ingest Configuration](ingest_configuration.md) | ||
|
||
#### Running an ingest | ||
|
||
Send a local or remote csv file through Koza to get some basic information (headers, number of rows): | ||
|
||
```bash | ||
koza validate \ | ||
--file https://raw.githubusercontent.com/monarch-initiative/koza/dev/tests/resources/source-files/string.tsv \ | ||
--delimiter ' ' | ||
``` | ||
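
To make concrete what "headers, number of rows" means for a delimited file, here is a minimal Python sketch using only the standard library. It is not Koza's implementation; the function name `summarize_delimited` and the sample rows are hypothetical, and it assumes the first row is the header.

```python
import csv
import io


def summarize_delimited(text_stream, delimiter=","):
    """Return (headers, row_count), treating the first row as the header.

    Hypothetical helper for illustration only; not part of Koza.
    """
    reader = csv.reader(text_stream, delimiter=delimiter)
    headers = next(reader, [])          # empty list if the stream has no rows
    row_count = sum(1 for _ in reader)  # count the remaining data rows
    return headers, row_count


# Made-up, space-delimited sample data (column names borrowed from the config example)
sample = io.StringIO("protein1 protein2 combined_score\nA B 700\nC D 950\n")
headers, row_count = summarize_delimited(sample, delimiter=" ")
print(headers, row_count)
```

The same function works on a file opened with `open(path, newline="")`; only the delimiter changes between comma-, tab-, and space-separated inputs.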
|
||
Sending a json or jsonl formatted file will confirm whether the file is valid json or jsonl: | ||
```bash | ||
koza validate \ | ||
--file ./examples/data/ZFIN_PHENOTYPE_0.jsonl.gz \ | ||
--format jsonl | ||
``` | ||
|
||
```bash | ||
koza validate \ | ||
--file ./examples/data/ddpheno.json.gz \ | ||
--format json \ | ||
--compression gzip | ||
``` | ||
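
As a rough illustration of what "valid json or jsonl" means, the sketch below checks both formats with the Python standard library. It is an assumption-laden sketch, not Koza's validation code: the helper names (`is_valid_json`, `is_valid_jsonl`, `open_text`) are hypothetical, and skipping blank lines in jsonl input is a choice made here for illustration.

```python
import gzip
import json


def is_valid_json(text):
    """True if the whole text parses as a single JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def is_valid_jsonl(lines):
    """True if every non-blank line parses as its own JSON value."""
    for line in lines:
        if not line.strip():
            continue  # assumption: blank lines are tolerated
        try:
            json.loads(line)
        except json.JSONDecodeError:
            return False
    return True


def open_text(path, compression=None):
    """Open a file as text, decompressing gzip when compression == 'gzip'."""
    if compression == "gzip":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


good_jsonl = ['{"a": 1}', '{"b": 2}']
bad_jsonl = ['{"a": 1}', '{oops']
print(is_valid_jsonl(good_jsonl), is_valid_jsonl(bad_jsonl), is_valid_json('{"data": []}'))
```

A file handle returned by `open_text` can be passed straight to `is_valid_jsonl`, since iterating a text file yields one line at a time.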
|
||
##### Example: transforming StringDB | ||
|
||
```bash | ||
koza transform --source examples/string/metadata.yaml | ||
|
||
koza transform --source examples/string-declarative/metadata.yaml | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,107 @@ | ||
## Ingest Configuration | ||
|
||
### Source (aka metadata.yaml) | ||
|
||
The Source (metadata.yaml) provides metadata describing the dataset and lists the Source Files to be ingested. | ||
|
||
```yaml | ||
name: 'somethingbase' | ||
|
||
dataset_description: | ||
ingest_title: 'SomethingBase' | ||
ingest_url: 'https://somethingbase.org' | ||
description: 'SomethingBase: A Website With Some Data' | ||
rights: 'https://somethingbase.org/rights.html' | ||
|
||
# The list of source files should map | ||
source_files: | ||
- 'gene-information.yaml' | ||
- 'gene-to-phenotype.yaml' | ||
``` | ||
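
The shape of this file can be pictured as a pair of Python dataclasses. The sketch below is illustrative only: the class names `DatasetDescription` and `Source` and the function `source_from_dict` are hypothetical and are not taken from Koza's code. It assumes the YAML has already been parsed into a plain dictionary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetDescription:
    ingest_title: str
    ingest_url: str
    description: str
    rights: str


@dataclass
class Source:
    name: str
    dataset_description: DatasetDescription
    source_files: List[str] = field(default_factory=list)


def source_from_dict(raw):
    """Build a Source from a parsed metadata.yaml dictionary (hypothetical helper)."""
    return Source(
        name=raw["name"],
        dataset_description=DatasetDescription(**raw["dataset_description"]),
        source_files=list(raw.get("source_files", [])),
    )


# Dictionary mirroring the YAML example above
raw = {
    "name": "somethingbase",
    "dataset_description": {
        "ingest_title": "SomethingBase",
        "ingest_url": "https://somethingbase.org",
        "description": "SomethingBase: A Website With Some Data",
        "rights": "https://somethingbase.org/rights.html",
    },
    "source_files": ["gene-information.yaml", "gene-to-phenotype.yaml"],
}
source = source_from_dict(raw)
print(source.name, source.source_files)
```

Because `DatasetDescription(**...)` raises a `TypeError` on missing or unexpected keys, a model like this also acts as a light check on the metadata.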
### Source File(s) | ||
This YAML file sets properties for the ingest of a single file type from within a Source. | ||
```yaml | ||
name: 'name-of-ingest' | ||
|
||
format: 'json' # Options are json or csv, default is csv | ||
|
||
# Sets file compression, options are gzip and none, default is none | ||
compression: 'gzip' | ||
|
||
# list of files to be ingested | ||
files: | ||
- './data/really-cool-data-1.json.gz' | ||
- './data/really-cool-data-2.json.gz' | ||
|
||
# in a JSON ingest, this will be the path to the array to be iterated over as the input collection | ||
json_path: | ||
- 'data' | ||
|
||
# Ordered list of columns for CSV files, data type can be specified as float, int or str | ||
columns: | ||
- 'protein1' | ||
- 'protein2' | ||
- 'combined_score' : 'int' | ||
|
||
# Specify a delimiter for CSV files, default is a comma | ||
delimiter: '\t' | ||
|
||
# Optional delimiter for header row | ||
header_delimiter: '|' | ||
|
||
# Boolean to configure presence of header, default is true | ||
has_header: False | ||
|
||
# Number of lines to be ignored at the head of an ingest data file, default is 0 | ||
skip_lines: 10 | ||
|
||
# Boolean to skip blank lines, default is true | ||
skip_blank_lines: True | ||
|
||
# Set pre-defined source_file properties (like column lists) for common file formats. | ||
# Options: 'gpi' and 'oban' | ||
# Additional standard formats can be added in source_config.py. | ||
standard_format: 'gpi' | ||
|
||
# include a map file | ||
depends_on: | ||
- './examples/maps/alliance-gene.yaml' | ||
|
||
# The filter DSL allows including or excluding rows based on filter blocks | ||
filters: | ||
- inclusion: 'include' # 'include' to include rows matching, 'exclude' to exclude rows that match | ||
column: 'combined_score' | ||
# filter_code (with 'in' expecting a list of values) | ||
filter_code: 'lt' # options: 'gt', 'ge', 'lt', 'le', 'eq', 'ne', 'in' | ||
value: 700 | ||
- inclusion: 'exclude' | ||
column: 'protein1' | ||
filter_code: 'in' # 'in' expects the value to be a list and checks that the column value is matched within the list | ||
value: | ||
- 'ABC1' | ||
- 'XYZ4' | ||
|
||
# node and edge categories are required to avoid empty KGX files, the order here isn't important | ||
node_properties: | ||
- 'id' | ||
- 'category' | ||
- 'provided_by' | ||
|
||
edge_properties: | ||
- 'id' | ||
- 'subject' | ||
- 'predicate' | ||
- ... | ||
|
||
# In 'flat' mode, the transform operates on a single row and looping doesn't need to be specified | ||
# In 'loop' mode, the transform code is executed only once, so the loop that iterates over rows must be contained within the transform code | ||
# The default is 'flat' | ||
transform_mode: 'loop' | ||
|
||
# Python code to run for ingest. Default is the same file name as the source_file yaml, but with a .py extension | ||
# You probably don't need to set this property | ||
transform_code: 'name-of-ingest.py' | ||
``` |
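
The `filters` block in the example above can be read as a list of per-row predicates. The following Python sketch shows one plausible interpretation; it is not Koza's implementation. The names `COMPARATORS` and `row_passes` are hypothetical, and combining multiple filter blocks with a logical AND is an assumption made here.

```python
import operator

# Map each filter_code from the YAML comments to a comparison function
COMPARATORS = {
    "gt": operator.gt,
    "ge": operator.ge,
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
    "in": lambda cell, allowed: cell in allowed,  # 'in' expects a list of values
}


def row_passes(row, filters):
    """True if the row survives every filter block (assumption: blocks are ANDed)."""
    for f in filters:
        matched = COMPARATORS[f["filter_code"]](row[f["column"]], f["value"])
        # 'include' keeps matching rows; 'exclude' drops them
        keep = matched if f["inclusion"] == "include" else not matched
        if not keep:
            return False
    return True


# The two filter blocks from the YAML example, as plain dictionaries
filters = [
    {"inclusion": "include", "column": "combined_score", "filter_code": "lt", "value": 700},
    {"inclusion": "exclude", "column": "protein1", "filter_code": "in", "value": ["ABC1", "XYZ4"]},
]

# Made-up rows for illustration
rows = [
    {"protein1": "DEF2", "protein2": "GHI3", "combined_score": 650},  # passes both blocks
    {"protein1": "ABC1", "protein2": "GHI3", "combined_score": 650},  # dropped by the exclude block
    {"protein1": "DEF2", "protein2": "GHI3", "combined_score": 900},  # dropped: score is not below 700
]
kept = [r for r in rows if row_passes(r, filters)]
print(kept)
```

Note that comparisons such as `lt` only behave numerically if the column has been given a numeric type, which is what the `'combined_score' : 'int'` entry under `columns` is for.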
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,33 +0,0 @@ | ||
### Source File | ||
|
||
This YAML file sets properties for the ingest of a single file type from within a [Source](http://example.com?link-to-source). | ||
|
||
#### Example | ||
|
||
name: 'gene-to-phenotype' | ||
|
||
format: 'json' | ||
|
||
files: | ||
- './data/PHENOTYPE_RGD.json.gz' # "https://fms.alliancegenome.org/download/PHENOTYPE_RGD.json.gz" | ||
- './data/PHENOTYPE_MGI.json.gz' # "https://fms.alliancegenome.org/download/PHENOTYPE_MGI.json.gz" | ||
- './data/PHENOTYPE_WB.json.gz' # "https://fms.alliancegenome.org/download/PHENOTYPE_WB.json.gz" | ||
- './data/PHENOTYPE_HUMAN.json.gz' # "https://fms.alliancegenome.org/download/PHENOTYPE_HUMAN.json.gz" | ||
|
||
json_path: | ||
- 'data' | ||
|
||
depends_on: | ||
- './mingestibles/maps/alliance-gene.yaml' | ||
|
||
node_properties: | ||
- 'id' | ||
- 'category' | ||
- 'provided_by' | ||
|
||
edge_properties: | ||
- 'id' | ||
- 'subject' | ||
- 'predicate' | ||
- ... | ||
|
||
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
site_name: Koza | ||
nav: | ||
- Home: '../' | ||
- Ingest Configuration: 'ingest_configuration.md' | ||
|