Create an ontology of genomic region file formats in JSON format #92

Open
donaldcampbelljr opened this issue Jan 28, 2025 · 1 comment

@donaldcampbelljr
Member

We should also create an ontology of genomic region file formats, in JSON format, to be hosted on PEPhub:
narrowPeak + broadPeak vs. BED 6+4, etc.
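
To make the BED n+m notation concrete, here is a rough Python sketch of how the column layouts might be recorded. The column counts follow the published UCSC/ENCODE format descriptions; the `BED_LAYOUTS` name, the dictionary keys, and the `bed_notation` helper are hypothetical and only for illustration, not an agreed ontology.

```python
import json

# Hypothetical lookup: format name -> number of standard BED columns
# and number of additional format-specific columns.
BED_LAYOUTS = {
    "narrowPeak": {"bed_columns": 6, "extra_columns": 4},
    "broadPeak": {"bed_columns": 6, "extra_columns": 3},
    "gappedPeak": {"bed_columns": 12, "extra_columns": 3},
}


def bed_notation(format_name):
    """Return the 'bedN+M' shorthand for a known format name."""
    layout = BED_LAYOUTS[format_name]
    return f"bed{layout['bed_columns']}+{layout['extra_columns']}"


if __name__ == "__main__":
    # Dump the lookup as JSON, the format suggested for hosting.
    print(json.dumps(BED_LAYOUTS, indent=2))
    print(bed_notation("narrowPeak"))
```

With this layout, `bed_notation("narrowPeak")` gives `bed6+4`, the same shape as a generic BED 6+4 file, which is why the ontology would need to distinguish the two by more than column count.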

Originally posted by @donaldcampbelljr in #91

@donaldcampbelljr
Member Author

donaldcampbelljr commented Feb 13, 2025

Here is a first pass of what this could look like. We still need to determine how to handle the non-strict versions of the format (this schema only has ns_narrowpeak):

type: object
properties:
  chrom:
    type: string
    description: "Name of the chromosome (e.g., chr1, chrX)."
  chromStart:
    type: integer
    description: "Starting position of the feature (0-based)."
    minimum: 0
  chromEnd:
    type: integer
    description: "Ending position of the feature (0-based, exclusive)."
    minimum: 0
  spec_compliant_columns:
    type: integer
    description: "Number of columns compliant with the UCSC BED specification."
    minimum: 3
  non_spec_compliant_columns:
    type: integer
    description: "Number of additional, non-compliant BED columns."
    minimum: 0
  bed_format:
    type: string
    enum:
      - ucsc_bed
      - encode_narrowpeak
      - encode_broadpeak
      - encode_rna_elements
      - encode_gappedpeak
      - unknown_bed_format
  format_specific_details:
    type: object
    description: "Details specific to the detected BED format."
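    # anyOf, not oneOf: the empty-object fallback below matches every object,
    # so oneOf would reject anything that also matches a named format.
    anyOf: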
      - $ref: '#/definitions/narrowpeak_details'
      - $ref: '#/definitions/broadpeak_details'
      - $ref: '#/definitions/gappedpeak_details'
      - $ref: '#/definitions/rna_elements_details'
      - $ref: '#/definitions/ucsc_bed_details'
      - type: object
        properties: {}

definitions:
  narrowpeak_details:
    type: object
    properties:
      name:
        type: string
        description: "Name of the BED line."
      score:
        type: integer
        description: "Score between 0 and 1000."
        minimum: 0
        maximum: 1000
      strand:
        type: string
        enum: [".", "+", "-"]
        description: "Strand. Either '.', '+', or '-'."
      signalValue:
        type: number
        description: "Measurement of overall enrichment."
      pValue:
        type: number
        description: "Measurement of statistical significance (-log10)."
      qValue:
        type: number
        description: "Measurement of statistical significance using FDR (-log10)."
      peak:
        type: integer
        description: "Point-source called for this peak; 0-based offset."
    required:
      - name
      - score
      - strand
      - signalValue
      - pValue
      - qValue
      - peak

  broadpeak_details:
    type: object
    properties:
      name:
        type: string
        description: "Name of the BED line."
      score:
        type: integer
        description: "Score between 0 and 1000."
        minimum: 0
        maximum: 1000
      strand:
        type: string
        enum: [".", "+", "-"]
        description: "Strand. Either '.', '+', or '-'."
      signalValue:
        type: number
        description: "Measurement of overall enrichment."
      pValue:
        type: number
        description: "Measurement of statistical significance (-log10)."
      qValue:
        type: number
        description: "Measurement of statistical significance using FDR (-log10)."
    required:
      - name
      - score
      - strand
      - signalValue
      - pValue
      - qValue

  gappedpeak_details:
    type: object
    properties:
      name:
        type: string
        description: "Name of the BED line."
      score:
        type: integer
        description: "Score between 0 and 1000."
        minimum: 0
        maximum: 1000
      strand:
        type: string
        enum: [".", "+", "-"]
        description: "Strand. Either '.', '+', or '-'."
      thickStart:
        type: integer
        description: "Starting position at which the feature is drawn thickly."
      thickEnd:
        type: integer
        description: "Ending position at which the feature is drawn thickly."
      itemRgb:
        type: string
        description: "RGB value of the form R,G,B."
      blockCount:
        type: integer
        description: "Number of blocks (exons)."
      blockSizes:
        type: string
        description: "Comma-separated list of block sizes."
      blockStarts:
        type: string
        description: "Comma-separated list of block starts relative to chromStart."
      signalValue:
        type: number
        description: "Measurement of overall enrichment."
      pValue:
        type: number
        description: "Measurement of statistical significance (-log10)."
      qValue:
        type: number
        description: "Measurement of statistical significance using FDR (-log10)."
    required:
      - name
      - score
      - strand
      - thickStart
      - thickEnd
      - itemRgb
      - blockCount
      - blockSizes
      - blockStarts
      - signalValue
      - pValue
      - qValue

  rna_elements_details:
    type: object
    properties:
      name:
        type: string
        description: "Name of the BED line."
      score:
        type: integer
        description: "Score between 0 and 1000."
        minimum: 0
        maximum: 1000
      strand:
        type: string
        enum: [".", "+", "-"]
        description: "Strand. Either '.', '+', or '-'."
      level:
        type: number
        description: "Expression level, e.g. RPKM or FPKM."
      signif:
        type: number
        description: "Statistical significance, e.g. IDR."
      score2:
        type: integer
        description: "Additional measurement/count, e.g. number of reads."
    required:
      - name
      - score
      - strand
      - level
      - signif
      - score2

  ucsc_bed_details:
    type: object
    properties:
      name:
        type: string
        description: "Name of the BED line."
      score:
        type: integer
        description: "Score between 0 and 1000."
        minimum: 0
        maximum: 1000
      strand:
        type: string
        enum: [".", "+", "-"]
        description: "Strand. Either '.', '+', or '-'."
      thickStart:
        type: integer
        description: "Starting position at which the feature is drawn thickly."
      thickEnd:
        type: integer
        description: "Ending position at which the feature is drawn thickly."
      itemRgb:
        type: string
        description: "RGB value of the form R,G,B."
      blockCount:
        type: integer
        description: "Number of blocks (exons)."
      blockSizes:
        type: string
        description: "Comma-separated list of block sizes."
      blockStarts:
        type: string
        description: "Comma-separated list of block starts relative to chromStart."

required:
  - chrom
  - chromStart
  - chromEnd
  - spec_compliant_columns
  - non_spec_compliant_columns
  - bed_format
  - format_specific_details

Edit: Updated to remove ns_narrowpeak for now.
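
As a rough illustration of the kind of record this schema describes, the sketch below turns one narrowPeak line into a dictionary with the schema's top-level fields. It uses only the Python standard library. The function name `narrowpeak_line_to_record` and the example line are hypothetical, and setting `spec_compliant_columns` to 6 and `non_spec_compliant_columns` to 4 assumes narrowPeak is treated as BED 6+4.

```python
import json

# The seven narrowPeak columns after chrom/chromStart/chromEnd,
# paired with the Python type matching the schema's declared type.
NARROWPEAK_FIELDS = [
    ("name", str),
    ("score", int),
    ("strand", str),
    ("signalValue", float),
    ("pValue", float),
    ("qValue", float),
    ("peak", int),
]


def narrowpeak_line_to_record(line):
    """Convert one tab-separated narrowPeak line into a schema-shaped dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    details = {
        name: cast(value)
        for (name, cast), value in zip(NARROWPEAK_FIELDS, cols[3:])
    }
    return {
        "chrom": cols[0],
        "chromStart": int(cols[1]),
        "chromEnd": int(cols[2]),
        # Assumption: narrowPeak counted as 6 standard + 4 extra columns.
        "spec_compliant_columns": 6,
        "non_spec_compliant_columns": 4,
        "bed_format": "encode_narrowpeak",
        "format_specific_details": details,
    }


# Made-up example line, not taken from a real dataset.
example_line = "chr1\t9356548\t9356648\t.\t0\t.\t182\t5.0945\t-1\t50"
record = narrowpeak_line_to_record(example_line)

if __name__ == "__main__":
    print(json.dumps(record, indent=2))
```

The resulting dictionary could then be checked against the schema with any JSON Schema validator once the YAML is loaded.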
