Finally settled on annotated examples as reference documentation for source & source_file, copied README.md for the MkDocs index.md, still need to replace README.md content with something slimmer & more appropriate.
monarch-initiative · Aug 18, 2021 · c097915
1 parent dbb052f
Showing 6 changed files with 174 additions and 75 deletions.
diff --git a/docs/index.md b/docs/index.md
@@ -0,0 +1,62 @@
+## Koza Data Transformation Framework
+
+![pupa](img/pupa.png)
+
+*Disclaimer*: Koza is in pre-alpha
+
+Transform csv, json, and jsonl files into a target
+csv, json, or jsonl format based on your dataclass model. Koza can also output
+data in the [KGX format](https://github.com/biolink/kgx/blob/master/specification/kgx-format.md#kgx-format-as-tsv).
+
+
+##### Highlights
+
+- Author data transforms in semi-declarative Python
+- Configure source files, expected columns/json properties and path filters, field filters, and metadata in yaml
+- Create or import mapping files to be used in ingests (e.g. id mappings, type mappings)
+- Create and use translation tables to map between source and target vocabularies
+
+
+### Installation
+
+```
+pip install koza
+```
+
+### Getting Started
+
+#### Writing an ingest
+
+[Ingest Configuration](ingest_configuration.md)
+
+#### Running an ingest
+
+Send a local or remote csv file through Koza to get some basic information (headers, number of rows)
+
+```bash
+koza validate \
+  --file https://raw.githubusercontent.com/monarch-initiative/koza/dev/tests/resources/source-files/string.tsv \
+  --delimiter ' '
+```
+
+Sending a json or jsonl formatted file will confirm if the file is valid json or jsonl
+```bash
+koza validate \
+  --file ./examples/data/ZFIN_PHENOTYPE_0.jsonl.gz \
+  --format jsonl
+```
+
+```bash
+koza validate \
+  --file ./examples/data/ddpheno.json.gz \
+  --format json \
+  --compression gzip
+```
+
+##### Example: transforming StringDB
+
+```bash
+koza transform --source examples/string/metadata.yaml 
+
+koza transform --source examples/string-declarative/metadata.yaml 
+```
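
The jsonl check described in `docs/index.md` (confirming that a file is valid jsonl) could look roughly like the sketch below. This is an illustration only, not Koza's `validate` implementation: the function name `is_valid_jsonl` and its `compressed` parameter are hypothetical, and the sample files are created on the fly.

```python
import gzip
import json
import tempfile
from pathlib import Path


def is_valid_jsonl(path: Path, compressed: bool = False) -> bool:
    """Return True if every non-blank line of the file parses as JSON.

    Hypothetical helper for illustration; not part of Koza.
    """
    opener = gzip.open if compressed else open
    with opener(path, 'rt', encoding='utf-8') as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError:
                return False
    return True


# Build one valid gzipped file and one invalid plain file, then check both.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / 'good.jsonl.gz'
    with gzip.open(good, 'wt', encoding='utf-8') as handle:
        handle.write('{"id": 1}\n{"id": 2}\n')

    bad = Path(tmp) / 'bad.jsonl'
    bad.write_text('{"id": 1}\n{not json}\n', encoding='utf-8')

    good_ok = is_valid_jsonl(good, compressed=True)
    bad_ok = is_valid_jsonl(bad)

print(good_ok, bad_ok)
```

Reading line by line keeps memory use flat for large files, which is the usual reason to prefer jsonl over a single json array.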
diff --git a/docs/ingest_configuration.md b/docs/ingest_configuration.md
@@ -0,0 +1,107 @@
+## Ingest Configuration
+
+### Source (aka metadata.yaml)
+
+The Source file (metadata.yaml) provides metadata describing the dataset and lists the Source Files to be ingested.
+
+```yaml
+name: 'somethingbase'
+
+dataset_description:
+  ingest_title: 'SomethingBase'
+  ingest_url: 'https://somethingbase.org'
+  description: 'SomethingBase: A Website With Some Data'
+  rights: 'https://somethingbase.org/rights.html'
+
+# The list of source files should map 
+source_files:
+  - 'gene-information.yaml'
+  - 'gene-to-phenotype.yaml'
+```
+
+### Source File(s)
+
+This YAML file sets properties for the ingest of a single file type from within a Source.
+
+```yaml
+name: 'name-of-ingest'
+
+format: 'json' # Options are json or csv, default is csv
+
+# Sets file compression, options are gzip and none, default is none
+compression: 'gzip'
+
+# list of files to be ingested
+files:
+- './data/really-cool-data-1.json.gz'
+- './data/really-cool-data-2.json.gz'
+
+# in a JSON ingest, this will be the path to the array to be iterated over as the input collection
+json_path:
+- 'data'
+
+# Ordered list of columns for CSV files, data type can be specified as float, int or str
+columns:
+    - 'protein1'
+    - 'protein2'
+    - 'combined_score' : 'int'
+
+# Specify a delimiter for CSV files, default is a comma
+delimiter: '\t'
+
+# Optional delimiter for header row
+header_delimiter: '|' 
+
+# Boolean to configure presence of header, default is true
+has_header: False
+
+# Number of lines to be ignored at the head of an ingest data file, default is 0
+skip_lines: 10 
+
+# Boolean to skip blank lines, default is true
+skip_blank_lines: True
+
+# Set pre-defined source_file properties (like column lists) for common file formats. 
+# Options: 'gpi' and 'oban'
+# Additional standard formats can be added in source_config.py. 
+standard_format: 'gpi'
+
+# include a map file
+depends_on:
+- './examples/maps/alliance-gene.yaml'
+
+# The filter DSL allows including or excluding rows based on filter blocks
+filters: 
+- inclusion: 'include' # 'include' to include rows matching, 'exclude' to exclude rows that match
+  column: 'combined_score'
+  # filter_code  (with 'in' expecting a list of values)
+  filter_code: 'lt' # options: 'gt', 'ge', 'lt', 'le', 'eq', 'ne', 'in'  
+  value: 700
+- inclusion: 'exclude'
+  column: 'protein1'
+  filter_code: 'in' # 'in' expects the value to be a list and checks that the column value is matched within the list
+  value: 
+    - 'ABC1'
+    - 'XYZ4'
+
+# node and edge categories are required to avoid empty KGX files, the order here isn't important  
+node_properties:
+- 'id'
+- 'category'
+- 'provided_by'
+
+edge_properties:
+- 'id'
+- 'subject'
+- 'predicate'
+- ...
+
+# In 'flat' mode, the transform code operates on a single row, so no explicit loop is needed.
+# In 'loop' mode, the transform code is executed only once, so it must contain the loop that iterates over rows.
+# The default is 'flat'
+transform_mode: 'loop'
+
+# Python code to run for ingest. Default is the same file name as the source_file yaml, but with a .py extension
+# You probably don't need to set this property
+transform_code: 'name-of-ingest.py'
+```
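
The `filters` block documented above can be read as a list of predicates applied to each row. The sketch below shows one possible interpretation, not Koza's actual filter code: the `COMPARATORS` table and `row_passes` function are hypothetical names, and it assumes that a row must satisfy every filter block to be kept.

```python
import operator
from typing import Any, Dict, List

# Hypothetical mapping from filter_code to a comparison; not Koza's own code.
COMPARATORS = {
    'gt': operator.gt,
    'ge': operator.ge,
    'lt': operator.lt,
    'le': operator.le,
    'eq': operator.eq,
    'ne': operator.ne,
    'in': lambda cell, allowed: cell in allowed,
}


def row_passes(row: Dict[str, Any], filters: List[Dict[str, Any]]) -> bool:
    """Return True if the row survives every filter block (assumed AND semantics)."""
    for block in filters:
        compare = COMPARATORS[block['filter_code']]
        matched = compare(row[block['column']], block['value'])
        # 'include' keeps only matching rows; 'exclude' drops matching rows.
        if block['inclusion'] == 'include' and not matched:
            return False
        if block['inclusion'] == 'exclude' and matched:
            return False
    return True


# The two filter blocks from the YAML example, expressed as Python dicts.
filters = [
    {'inclusion': 'include', 'column': 'combined_score', 'filter_code': 'lt', 'value': 700},
    {'inclusion': 'exclude', 'column': 'protein1', 'filter_code': 'in', 'value': ['ABC1', 'XYZ4']},
]

# Made-up rows for illustration.
rows = [
    {'protein1': 'DEF2', 'protein2': 'GHI3', 'combined_score': 650},
    {'protein1': 'ABC1', 'protein2': 'GHI3', 'combined_score': 650},
    {'protein1': 'DEF2', 'protein2': 'GHI3', 'combined_score': 900},
]

kept = [row for row in rows if row_passes(row, filters)]
print(kept)
```

Only the first row is kept: the second is dropped by the `exclude` block on `protein1`, and the third fails the `include` block because its score is not below 700.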
diff --git a/docs/source.md b/docs/source.md
diff --git a/docs/source_file.md b/docs/source_file.md
@@ -1,33 +0,0 @@
-### Source File 
-
-This YAML file sets properties for the ingest of a single file type from within a [Source](http://example.com?link-to-source). 
-
-#### Example
-
-    name: 'gene-to-phenotype'
-
-    format: 'json'
-
-    files:
-    - './data/PHENOTYPE_RGD.json.gz' # "https://fms.alliancegenome.org/download/PHENOTYPE_RGD.json.gz"
-    - './data/PHENOTYPE_MGI.json.gz' # "https://fms.alliancegenome.org/download/PHENOTYPE_MGI.json.gz"
-    - './data/PHENOTYPE_WB.json.gz' # "https://fms.alliancegenome.org/download/PHENOTYPE_WB.json.gz"
-    - './data/PHENOTYPE_HUMAN.json.gz' # "https://fms.alliancegenome.org/download/PHENOTYPE_HUMAN.json.gz"
-
-    json_path:
-    - 'data'
-
-    depends_on:
-    - './mingestibles/maps/alliance-gene.yaml'
-
-    node_properties:
-    - 'id'
-    - 'category'
-    - 'provided_by'
-
-    edge_properties:
-    - 'id'
-    - 'subject'
-    - 'predicate'
-    - ...
-

diff --git a/koza/model/config/source_config.py b/koza/model/config/source_config.py
@@ -151,13 +151,11 @@ class SourceFileConfig:
     name: str
     files: List[Union[str, Path]]
     format: FormatType = FormatType.csv
-    path: Path = None
     standard_format: StandardFormat = None
     file_metadata: DatasetDescription = None
     columns: List[Union[str, Dict[str, FieldType]]] = None
     node_properties: List[str] = None
     edge_properties: List[str] = None
-    required_properties: List[str] = None
     delimiter: str = None
     header_delimiter: str = None
     has_header: bool = True
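
The `columns` field accepts a mix of bare names and single-entry `{name: type}` mappings. One way such a list could be flattened into a name-to-type lookup is sketched below. This is an assumption-laden illustration, not the code in `source_config.py`: `FIELD_TYPES` and `column_types` are hypothetical names, plain strings stand in for `FieldType`, and untyped columns are assumed to default to `str`.

```python
from typing import Dict, List, Union

# Assumed mapping from the type names used in the YAML to Python callables.
FIELD_TYPES = {'str': str, 'int': int, 'float': float}


def column_types(columns: List[Union[str, Dict[str, str]]]) -> Dict[str, type]:
    """Flatten a mixed columns list into an ordered name -> type mapping."""
    resolved: Dict[str, type] = {}
    for entry in columns:
        if isinstance(entry, str):
            resolved[entry] = str  # assumption: untyped columns are strings
        else:
            for name, type_name in entry.items():
                resolved[name] = FIELD_TYPES[type_name]
    return resolved


# The columns list from the ingest configuration example, as parsed YAML.
columns = ['protein1', 'protein2', {'combined_score': 'int'}]
types = column_types(columns)

# Cast a made-up raw row (all strings, as read from a CSV) using the lookup.
raw_row = ['ABC1', 'XYZ4', '490']
typed_row = {name: cast(value) for (name, cast), value in zip(types.items(), raw_row)}
print(typed_row)
```

Because dicts preserve insertion order, the mapping keeps the column order declared in the YAML, which matters when the file has no header row.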

diff --git a/mkdocs.yaml b/mkdocs.yaml
@@ -0,0 +1,5 @@
+site_name: Koza
+nav:
+- Home: '../'
+- Ingest Configuration: 'ingest_configuration.md'
+