Extend bed formats to bed classifier, e.g. gappedPeak #91

Open
donaldcampbelljr opened this issue Jan 28, 2025 · 5 comments

@donaldcampbelljr
Member

We should add more BED formats to the classifier, e.g. those listed in the UCSC format FAQ:
https://genome.ucsc.edu/FAQ/FAQformat.html

@donaldcampbelljr
Member Author

We should also create an ontology of genomic region file formats, in JSON format, to be hosted on PEPhub,
e.g. narrowPeak + broadPeak vs. bed6+4, etc.

@donaldcampbelljr
Member Author

gappedPeak has now been added to the classifier

@donaldcampbelljr
Member Author

I investigated historical formats this morning.

Do we want to add these to bedbase?

ENCODE tagAlign: BED3+3 format (historical)

  • This format used hg18.
  • GEO had only ~142 files. Interestingly, none of them followed the spec, in which column 4 is a reported sequence; instead, all files had 'N' in column 4.
  • ENCODE had over 5000 results; many, if not all, were labeled as archived.

ENCODE pairedTagAlign: BED6+2 format (historical)

  • None found in GEO or ENCODE when searching

ENCODE peptideMapping: BED6+4 format

  • These simply have a .bed extension
  • No hits on GEO, only 68 hits on ENCODE
  • They are similar to narrowPeak files, but we could possibly use the 9th column to help differentiate between the two (int vs float).

@donaldcampbelljr
Member Author

No to tagAlign.
No to pairedTagAlign.

The above two are actually reads, not regions.

No to peptideMapping: it appears to be historical, with only a small set of data available.

@donaldcampbelljr
Member Author

Investigating ENCODE RNA elements: BED6 + 3 scores format
https://genome.ucsc.edu/FAQ/FAQformat.html#format11

Of the 1500 bed6+3 files currently uploaded to bedbase, none are examples of this format. After some internet research, I did find an example here.

With the above commit, I added logic that checks whether the 9th column is an integer value (but NOT -1).

Example test file:

chr1	56969	57098	id-1	19	.	-1	-1	10
chr1	180739	180871	id-2	144	.	-1	-1	20
chr1	181108	181267	id-3	76	.	-1	-1	30

vs a broadpeak:

DEBUG: COLUMN TYPE VALUE
6 float64 3.4429
7 float64 5.7737
8 float64 4.2324
('bed6+3', 'broadpeak')
DEBUG: COLUMN TYPE VALUE
6 int64 -1
7 int64 -1
8 int64 10
('bed6+3', 'encode_rna_elements')
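
For anyone reading along, here is a minimal sketch of the 9th-column check described above. It is not the code from the commit: the function names `is_int` and `classify_bed6_plus_3` are hypothetical, it uses only the standard library, and falling back to `broadpeak` whenever the check fails is a simplifying assumption made for illustration.

```python
def is_int(text):
    """Return True if the field parses as an integer (e.g. '10' or '-1', not '4.2324')."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def classify_bed6_plus_3(lines):
    """Classify tab-separated bed6+3 lines by looking at the 9th column.

    Assumption: every non-empty line has at least 9 tab-separated fields.
    Returns a tuple shaped like the debug output in this thread.
    """
    # Collect the 9th column (index 8) of every non-empty line.
    ninth = [line.rstrip("\n").split("\t")[8] for line in lines if line.strip()]

    # Integer values that are never -1 suggest ENCODE RNA elements.
    if ninth and all(is_int(v) and int(v) != -1 for v in ninth):
        return ("bed6+3", "encode_rna_elements")

    # Hypothetical fallback: treat everything else as broadPeak.
    return ("bed6+3", "broadpeak")


# Usage with the example test file from this comment.
example = [
    "chr1\t56969\t57098\tid-1\t19\t.\t-1\t-1\t10",
    "chr1\t180739\t180871\tid-2\t144\t.\t-1\t-1\t20",
    "chr1\t181108\t181267\tid-3\t76\t.\t-1\t-1\t30",
]
print(classify_bed6_plus_3(example))
```

Requiring the check to hold for every row, rather than only the first, is a deliberate choice in this sketch: a broadPeak file whose 9th column happens to contain a whole number on one row would otherwise be misclassified.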
