Skip to content
This repository has been archived by the owner on Aug 20, 2024. It is now read-only.

Add array support for samplesheets #128

Merged
merged 9 commits into from
Jan 30, 2024

Conversation

awgymer
Copy link
Collaborator

@awgymer awgymer commented Oct 26, 2023

Had to open afresh as my attempt to rebase after I branched off another feature branch spectacularly failed

This Draft PR demonstrates the feasibility of adding limited support for simple array fields to the fromSamplesheet method (they is already undocumented support for full jsonschema spec in the validateFile method).

The proposal for what to support would be:

  • only supported in samplesheet validation (I don't think CLI params have a syntax for specifying an array so not often used?)
  • simple item arrays only [spec]
  • only supported for input fields, not for meta
  • no support for this in TSV/CSV (how would that even logically work)
  • only formats already supported are supported within the array

The code so far has no tests in the plugin but has been demonstrated to work almost as it does for flat files with one major exception:

  • There is currently no support for exists checking with file/directory format fields within an array.

Array file/directory existence is now supported.

The following samplesheet.yaml:

- sample: mysample1_10
  fastq: 
    - input1_10_R1.fq.gz
    - input1_10_R2.fq.gz
  strandedness: forward
- sample: mysample1_11
  fastq: 
    - input1_11_R1.fq.gz
    - input1_11_R2.fq.gz
  strandedness: forward
- sample: mysample1_12
  fastq: 
    - input1_12_R1.fq.gz
    - input1_12_R2.fq.gz
  strandedness: forward
- sample: mysample1_13
  fastq: 
    - input1_13_R1.fq.gz
    - input1_13_R2.fq.gz
  strandedness: forward
- sample: mysample1_14
  fastq: 
    - input1_14_R1.fq.gz
    - input1_14_R2.fq.gz
  strandedness: forward

With the following schema.json:

{
    "$schema": "http://json-schema.org/draft-07/schema",
    "$id": "https://raw.githubusercontent.com/nf-validation/example/master/assets/schema_input.json",
    "title": "nf-validation example - params.input schema",
    "description": "Schema for the file provided with params.input",
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "sample": {
          "type": "string",
          "pattern": "^\\S+$",
          "errorMessage": "Sample name must be provided and cannot contain spaces"
        },
        "fastq": {
          "type": "array",
          "items": {
              "type": "string",
              "pattern": "^\\S+\\.f(ast)?q\\.gz$"
        },
          "errorMessage": "FastQ files cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
        },
        "strandedness": {
          "type": "string",
          "errorMessage": "Strandedness must be provided and be one of 'forward', 'reverse' or 'unstranded'",
          "enum": ["forward", "reverse", "unstranded"]
        }
      },
      "required": ["sample", "fastq", "strandedness"]
    }
  }

Gives the following in nextflow:

Channel.fromSamplesheet("input").view()
---
[mysample1_10, [input1_10_R1.fq.gz, input1_10_R2.fq.gz], forward]
[mysample1_11, [input1_11_R1.fq.gz, input1_11_R2.fq.gz], forward]
[mysample1_12, [input1_12_R1.fq.gz, input1_12_R2.fq.gz], forward]
[mysample1_13, [input1_13_R1.fq.gz, input1_13_R2.fq.gz], forward]
[mysample1_14, [input1_14_R1.fq.gz, input1_14_R2.fq.gz], forward]

Things to consider:

  • warnings if we detect CSV/TSV and "type": "array" (we don't currently do this and it technically is supported for validation)
  • warnings for "type": "array" with "prefixItems" (this is the tuple-type validation)
  • check that length and uniqueness constraints work as intended

N.B.: I tried to use polymorphism (is this the right term) to have different methods depending on the input type, but everything just went to the method declared for Object (the fallback) so if anyone can help me understand that I would appreciate it

@adamrtalbot
Copy link
Collaborator

Relates to #81

@awgymer awgymer marked this pull request as ready for review December 13, 2023 16:37
Copy link
Collaborator

@mirpedrol mirpedrol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's looking good! thanks for the contribution 😄

We should add tests for this. You can add a samplesheet and json schema with the required structure in testResources here, and use it as in other tests.
Also add documentation, I think samplesheet schema specification will be the best place.
Don't forget to update the changelog! 😄

@awgymer
Copy link
Collaborator Author

awgymer commented Dec 21, 2023

Ok. So. State of play after this latest set of changes:

  • tests are now correct and working for YAML/JSON
  • Simple array support (not tuple arrays) has been added for YAML/JSON format samplesheets.
  • tests demonstrate support for most functionality around the array type

What is shown to be supported:

  • max/min items is supported
  • uniqueness of the items within the array is supported
  • nested arrays are shown to work
  • top-level default works properly for array types (default for the nested level of arrays does not do anything - but a nested array entry can have a default!)
  • file and directory paths work from within array including nested arrays!
  • array type fields in meta

What is not shown to be supported (some of this may work by chance):

  • the prefixItems (tuple) or contains type array schemas
  • the custom unique field applied to array type fields
  • an enum (choices) value of an array field
  • dependentRequired (probably just needs someone to write the correct test to show it works)
  • any of the anyOf or allOf type keywords from the jsonschema spec

There is also obviously not any support yet in nf-core/tools for building of array type fields.

N.B. It would let me merge this branch without any approvals which seems risky - should probably make it so all PRs require at least 1 approval!

@nvnieuwk
Copy link
Collaborator

Hi, I'll take a look now. But just a heads up that a lot of the non-supported features will probably work if #141 is finished

Copy link
Collaborator

@nvnieuwk nvnieuwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome LGTM!

@awgymer
Copy link
Collaborator Author

awgymer commented Dec 21, 2023

What does changing to jsonschemafriend provide that can help with the additional features?

@nvnieuwk
Copy link
Collaborator

It has support for all major versions of the JSON schema, while the current library is missing support the two newest ones (which is why we had to implement dependentRequired and unique separately before as they are feature of these newer versions)

@awgymer
Copy link
Collaborator Author

awgymer commented Dec 21, 2023

Oooh unique as part of schema would be good!

But the issue isn't the validation of most of those things - I think anything that is valid jsonschema Draft7 will validate OK. It's rather the conversion to an NFType and the support in the nf-core/tools builders/linters.

Plus we don't have any internal checks to warn against features of a schema that are not supported for csv/tsv

Copy link
Collaborator

@mirpedrol mirpedrol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks amazing!!

I think we would also add a note in the documentation, your comment above seems a good summary to add there, especially the limitations.

And don't forget to add your contribution to the CHANGLOG 😄

@nvnieuwk
Copy link
Collaborator

@awgymer can you also add an error if the user gives a CSV or TSV samplesheet when there is an array in the schema?

@awgymer
Copy link
Collaborator Author

awgymer commented Dec 22, 2023

@awgymer can you also add an error if the user gives a CSV or TSV samplesheet when there is an array in the schema?

Yes. I will need to work out the best place to raise this error.

…samplesheet is passed. Add tests for this behaviour
@awgymer
Copy link
Collaborator Author

awgymer commented Jan 4, 2024

Now raises an error if array used with an incompatible type of samplesheet:

ERROR ~ ERROR: Validation of pipeline parameters failed!

 -- Check '.nextflow.log' file for details
Using {"type": "array"} in schema with a ".csv" samplesheet is not supported

@nvnieuwk
Copy link
Collaborator

nvnieuwk commented Jan 4, 2024

Awesome!!! So this should be done now? Or should I wait a bit more for a review?

@awgymer
Copy link
Collaborator Author

awgymer commented Jan 4, 2024

This should be complete with the suggestions from the prior reviews now! So feel free to go ahead and re-review the changes

Copy link
Collaborator

@nvnieuwk nvnieuwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@awgymer
Copy link
Collaborator Author

awgymer commented Jan 4, 2024

There is a slight wrinkle I did not account for:
The spec we are currently running against (v7) allows items to be an array of schema (as well as just a single item).
This is to allow tuple-type array validation. In newer schema versions this behaviour is not allowed and tuple-type uses the keyword prefixItems instead.

This means that currently someone could supply a valid v7 schema with tuple-typeitems array, which would then break when it tried to validate because we make the assumption items is only ever a single value.

I'm not sure if we should just leave it unsupported and breaking, or add an explicit error if we detect items is an array?

This also ties in to whether we ever intend for a single nf-validation version to support multiple json-schema versions?

@nvnieuwk
Copy link
Collaborator

nvnieuwk commented Jan 4, 2024

I'd prefer to add an error for now as we are actively looking into adding support for all types of JSON schema. So let's try to not break much that could impact that transition 😁

@awgymer
Copy link
Collaborator Author

awgymer commented Jan 4, 2024

So, as always, it's a case of "supported in validation but not in conversion".

Where do you think is best to raise the error? Only in the converter code?

@nvnieuwk
Copy link
Collaborator

nvnieuwk commented Jan 4, 2024

So, as always, it's a case of "supported in validation but not in conversion".

Where do you think is best to raise the error? Only in the converter code?

Yeah the conversion will always be more limited I think, but it's getting better (thanks to you mostly now 😉) IMO it should only be in the places where it can cause issues in the future so only in the conversion code should be fine

@awgymer
Copy link
Collaborator Author

awgymer commented Jan 4, 2024

Turns out that I have lied!

If you make items in an array type an array we actually get a somewhat obscure java error:

ERROR ~ No signature of method: java.util.ArrayList.containsKey() is applicable for argument types: (String) values: [exists]
Possible solutions: contains(java.lang.Object), containsAll(java.util.Collection), contains(java.lang.Object), containsAll([Ljava.lang.Object;)

This is all due to the collectExists method which makes the assumption that if a field has an items keyword (as arrays do) that it will be a Map

@nvnieuwk
Copy link
Collaborator

nvnieuwk commented Jan 4, 2024

Turns out that I have lied!

If you make items in an array type an array we actually get a somewhat obscure java error:

ERROR ~ No signature of method: java.util.ArrayList.containsKey() is applicable for argument types: (String) values: [exists]
Possible solutions: contains(java.lang.Object), containsAll(java.util.Collection), contains(java.lang.Object), containsAll([Ljava.lang.Object;)

This is all due to the collectExists method which makes the assumption that if a field has an items keyword (as arrays do) that it will be a Map

Ah that needs fixing too then 🔨

@awgymer
Copy link
Collaborator Author

awgymer commented Jan 4, 2024

I don't know if there is some good way to effectively have a schema for our schema 😂

The exists thing will always be broken for tuple-type validation unless I (or someone else) can logic-up a good method for how to only check a value if it's supposed to be a path.

This is because tuple-type supports e.g. [<filepath>, <number>, <number>] as a valid array field but we just collect which fields are marked as exists and to handle this properly we would need to collect the index within an array field too.

@awgymer
Copy link
Collaborator Author

awgymer commented Jan 4, 2024

My best long-term proposal for this would be to make the format contain the exists information (i.e. file-path-exists) and have validators for ones which need to exist and ones which don't.

However obviously that would be backwards-incompatible with everyone's current schema

@nvnieuwk
Copy link
Collaborator

nvnieuwk commented Jan 4, 2024

My best long-term proposal for this would be to make the format contain the exists information (i.e. file-path-exists) and have validators for ones which need to exist and ones which don't.

However obviously that would be backwards-incompatible with everyone's current schema

We considered this at first but eventually decided not to do it with the formats because it would add a load of new formats that would have to be added into Tower and the nf-core CLI (but I agree in hindsight that it would maybe have the better solution). Also we're going to start up some discussions to only support the 2020-12 draft of JSON schema

@awgymer
Copy link
Collaborator Author

awgymer commented Jan 4, 2024

What shall we do in the current situation then?

I think array support is useful and the current implementation works for all schema versions for standard array type, but it broken for tuple-type (and never worked before).

Since this was never supported previously, is it possible to push it through with a clear note in the docs that tuple-type is not (and never has been) supported for either validation or channel conversion?

@mirpedrol
Copy link
Collaborator

I don't know if there is some good way to effectively have a schema for our schema

We have a meta-schema for the parameters schema, we could add one for the samplesheet schema if this helps. Also, something to consider for the schema-builder tooling once we implement it for samplesheet schemas.

@nvnieuwk
Copy link
Collaborator

nvnieuwk commented Jan 4, 2024

As far as I know, no one uses the tuple type because we don't validate for it yet in the current schema. But yeah let's add it to the docs and the meta-schema, it won't hurt anyone :p

@awgymer
Copy link
Collaborator Author

awgymer commented Jan 4, 2024

So tuple isn't it's own type.

What happens is you define "type": "array"
Arrays can be:

  • List-type - all values must be same type and there is a single schema, arbitrary length array
  • Tuple-type - value position is meaningful, each value has it's own schema, specific/fixed length array

Pre 2020 both types used the items keyword to define the schema(s) of the array items. If items is an array then it's for tuple-type. If items is just a single schema then it's list-type.

2020 changes this to make tuple-type have it's own keyword prefixItems instead.

So I think for our "meta-schema" we need to have it define items as a property keyword, where it must be an object (that should be recursively the same definition as the possible options for what can be in properties technically 😬)?

@nvnieuwk
Copy link
Collaborator

So tuple isn't it's own type.

What happens is you define "type": "array" Arrays can be:

  • List-type - all values must be same type and there is a single schema, arbitrary length array
  • Tuple-type - value position is meaningful, each value has it's own schema, specific/fixed length array

Pre 2020 both types used the items keyword to define the schema(s) of the array items. If items is an array then it's for tuple-type. If items is just a single schema then it's list-type.

2020 changes this to make tuple-type have it's own keyword prefixItems instead.

So I think for our "meta-schema" we need to have it define items as a property keyword, where it must be an object (that should be recursively the same definition as the possible options for what can be in properties technically 😬)?

From what I've seen during my tests in #141, the properties keyword still works perfectly, so I would maybe suggest keeping that for now to keep the changes minimal

@nvnieuwk
Copy link
Collaborator

So feel free to merge this @awgymer 🚀

@awgymer awgymer merged commit 981c126 into nextflow-io:master Jan 30, 2024
3 checks passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants