check bed classifier #82

Open
khoroshevskyi opened this issue Nov 3, 2024 · 22 comments

Comments

@khoroshevskyi commented Nov 3, 2024

https://pephub.databio.org/bedbase/gse266949?tag=samples
Files apparently are narrowPeak, but it classified as bed4+6 and as bed file. Investigate it

https://bedbase.org/bedset/gse266949

p.s. Also it is a good example how genome validator predicted reference genome

donaldcampbelljr added a commit that referenced this issue Jan 30, 2025
@donaldcampbelljr
Copy link
Member

Pulling the recent data, there was a gap of about 10% where files were classified as 6+3 but not as broadPeaks.

After some investigation, I realized that we were only reading 4 rows, and that this window would sometimes miss the floats in columns 7, 8, and 9; those columns would then be classified as int64 and fail broadPeak classification. I increased the rows from 4 → 15 → 20 → 25 → 30 and found that reading 30 rows reclassifies all of the 6+3 files as broadPeaks.

broadPeak: 145
not broadPeak: 0

Another, more efficient approach would be to relax the type required in those columns to be either float or int64.
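To make the row-window problem concrete, here is a minimal sketch (with made-up rows, not from the real data) showing how pandas infers int64 for a signalValue-like column when the only float lies outside the read window:

```python
import io
import pandas as pd

# Hypothetical narrowPeak-like rows: the first four values of column 7
# (0-based index 6) happen to be whole numbers, so a 4-row window infers
# int64; the first float only appears in row 5.
data = "\n".join([
    "chr1\t0\t100\tpeak1\t0\t.\t5\t10\t12",
    "chr1\t200\t300\tpeak2\t0\t.\t6\t11\t13",
    "chr1\t400\t500\tpeak3\t0\t.\t7\t12\t14",
    "chr1\t600\t700\tpeak4\t0\t.\t8\t13\t15",
    "chr1\t800\t900\tpeak5\t0\t.\t8.5\t13.2\t15.1",
])

small_window = pd.read_csv(io.StringIO(data), sep="\t", header=None, nrows=4)
wider_window = pd.read_csv(io.StringIO(data), sep="\t", header=None, nrows=5)

print(small_window[6].dtype)  # int64 -> fails a float-only broadPeak check
print(wider_window[6].dtype)  # float64 -> passes
```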


@donaldcampbelljr

I believe we should not relax the logic because we would then get false positives with normal bed6+3 files:
Bed spec

  • col 7 thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays). When there is no thick part, thickStart and thickEnd are usually set to the chromStart position.
  • col 8 thickEnd - The ending position at which the feature is drawn thickly (for example the stop codon in gene displays).
  • col 9 itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.

The above columns would be in integer format.
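A small sketch of the false-positive risk described above (toy coordinate values, and a dtype-based check standing in for the classifier's logic): thickStart/thickEnd in an ordinary bed6+3 file are integer coordinates, so a relaxed check that accepts int64 as well as float64 for columns 7-9 would let them through, while a strict float-only check correctly rejects them.

```python
import pandas as pd

# Toy thickStart column from an ordinary bed6+3 file: integer coordinates.
thick_start = pd.Series([11873, 34610], dtype="int64")

relaxed_passes = thick_start.dtype.kind in ("i", "f")  # int OR float allowed
strict_passes = thick_start.dtype.kind == "f"          # float only

print(relaxed_passes, strict_passes)  # True False
```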

@donaldcampbelljr

For narrowPeaks:

I am seeing some files not classified IF narrowPeak is not in the name (the original file did include it, but if you re-pull it with just the digest, it will not classify as narrowPeak).

classified
bbb0a619c84d9b6b645fdd59a7179520

not classified
9098a59f469cd779b0c0f2824a807a01

It seems that ~60 rows down, the 0's become floats; once those rows are read, the file is flagged as a narrowPeak.

@donaldcampbelljr

Out of ~15,300 6+4 files, only 11 were classified as bed instead of narrowPeak. Extending the number of rows read by pandas re-classifies all of those as narrowPeaks.

@donaldcampbelljr

So what about files classified as narrowPeaks?

Interestingly, if we increase the rows read into the classifier (from 4 to 60), we actually start to see files being classified as bed4+6, bed.

Comparison on 10% of the Uploaded Data:

1400 files, 4 rows
narrowpeak: 1400
not narrowpeak:0

1400 files, 60 rows
narrowpeak: 1307
not narrowpeak:93

Why?

because of this check:

elif col == 4:  
    if df[col].dtype == "int" and df[col].between(0, 1000).all():  
        bedtype += 1  
    else:  
        n = num_cols - bedtype  
        return f"bed{bedtype}+{n}", bed_type_named

Some of these files do not fit the column 5 spec, which requires an integer value between 0 and 1000.

Some examples:

569304341e282330677bee56fd45db0a
row 45 has a value of 1613

d091b4b1e97ad3c284235d4d43082078
row 7 has a value of 26589

66788888eaea21c069763798ff719c33
row 12 has a value of 2238

f759ec1fd104ab1db5a1d01200807937
row 57 has a value of 1546

Question and Thoughts

So, is the 0-1000 range a guideline or a rule? (We have had this conversation in the past, I believe)

The original file names have narrowPeak in the filename so the uploaders considered them narrowPeak.

@donaldcampbelljr

And now for files that were originally classified as broadPeaks, what happens if we increase the number of rows:

4 rows
broadpeak: 1400
not broadpeak:0

60 rows
broadpeak: 1395
not broadpeak:5

@donaldcampbelljr

Similar to the narrowPeaks, the 5 files classified as not broadPeak when reading 60 rows instead of 4 were classified this way because one of the rows within the first 60 had a value of more than 1000.

@donaldcampbelljr commented Feb 3, 2025

We can also check to see if there were any files classified as narrowPeak but had the bed_type = bed4+6 or were classified as broadPeak and have the bed_type = bed4+5. Both of these scenarios could happen due to this code in the classifier:

        if num_cols == 9 and ("broadpeak" in bed or "broadPeak" in bed): 
            bed_type_named = "broadpeak"
        elif num_cols == 10 and ("narrowpeak" in bed or "narrowPeak" in bed):
            bed_type_named = "narrowpeak"
        else:
            bed_type_named = "bed"

So this could happen if the bed format were placed in the file name or path while the columns were not true to the specification.

However, our current data shows 0 hits for this scenario.

@donaldcampbelljr

Question and Thoughts

So, is the 0-1000 range a guideline or a rule? (We have had this conversation in the past, I believe)

The original file names have narrowPeak in the filename so the uploaders considered them narrowPeak.

Instead of checking the range for the first 60 rows, we could compute the median value and see if that is between 0 and 1000.

  if df[col].dtype == "int" and 0 <= df[col].median() <= 1000:
  #if df[col].dtype == "int" and df[col].between(0, 1000).all():
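A runnable sketch of the difference between the two checks, using toy score values with a single outlier above 1000 (as in the example digests above):

```python
import pandas as pd

# Toy score column: typical values plus one outlier above 1000.
scores = pd.Series([150, 300, 26589, 420, 610], dtype="int64")

strict_ok = scores.between(0, 1000).all()  # every value must be in [0, 1000]
median_ok = 0 <= scores.median() <= 1000   # only the typical value matters

print(strict_ok, median_ok)  # False True
```

The median check tolerates occasional outliers while still rejecting files whose scores are systematically out of range.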

@donaldcampbelljr

In the above commit, I've removed checking for "broadpeak" and "narrowpeak" in the input string because:

  • It was checking the entire path (not just the file name).
  • We are given the input format separately, so we can continue to keep track of that and compare it with our own results.
  • Depending on how the bed file is input into the function, it will not see the 'true' file name, e.g. if the file is zipped, it was only checking the .gz file path, not the file within the archive (which may have the format in the name).

@donaldcampbelljr

I've now added gappedPeak to the classifier per issue #91

Similar to broadPeak and narrowPeak files, if column 5 has any values over 1000, they will be classified as ('bed4+11', 'bed') instead of ('bed12+3', 'gappedpeak')

@donaldcampbelljr

We did discuss the column 5 being over 1000 here: #34 (comment)

May be worth re-visiting.

@donaldcampbelljr

For gappedPeak files pulled from GeoFetch (212 total):
gappedPeak: 123
not gappedPeak:89

The items classified as not gappedPeaks were due to either:

  1. value above 1000 in column 5
  2. '.' in column 5
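The second failure mode can be reproduced in a couple of lines (toy rows, not real data): a '.' placeholder in the score column forces pandas to infer an object dtype, so an integer-dtype check on column 5 fails and the file falls back to a plain bed classification.

```python
import io
import pandas as pd

# Toy rows: one '.' placeholder mixed with an integer in the score column
# (column 5, 0-based index 4) makes the whole column object-typed.
data = "chr1\t0\t100\tpeak1\t.\nchr1\t200\t300\tpeak2\t500"
df = pd.read_csv(io.StringIO(data), sep="\t", header=None)

print(df[4].dtype)  # object, not int64
```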

@donaldcampbelljr commented Feb 4, 2025

Do we want to record the values for column 5? We could re-visit this downstream, after opening the file for stats processing.

Move it to a spot where we've already opened it in memory (geniml RegionSet).

We are going with a strict interpretation.

@donaldcampbelljr commented Feb 12, 2025

Added Encode RNA elements yesterday here: 3e87b0f

@donaldcampbelljr

Investigating reading all rows vs 60. One must set the parameter low_memory=False and remove the nrows parameter. Reading the entire file is, as expected, slower. It will also change the classification, as there may be a row value towards the end of the file that changes how a column is classified (e.g. int vs float), or that is above 1000 in the case of the score column.
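A self-contained sketch of the two read modes (synthetic rows standing in for a real bed file): column 0 mixes "chrN" strings with bare integers past row 60, mimicking the mixed-type column behind the DtypeWarning. The 60-row window never sees the integers; reading everything with low_memory=False infers each column's dtype over the whole file in one pass, with no warning.

```python
import io
import pandas as pd

# 65 rows with "chr1" in column 0, then 5 rows with a bare integer there.
rows = ["chr1\t{}\t{}".format(i, i + 50) for i in range(65)]
rows += ["2\t{}\t{}".format(i, i + 50) for i in range(5)]
data = "\n".join(rows)

# Windowed read: only the first 60 rows are seen, so column 0 looks uniform.
windowed = pd.read_csv(io.StringIO(data), sep="\t", header=None, nrows=60)

# Full read: dtypes inferred over all rows at once, avoiding chunked inference.
full = pd.read_csv(io.StringIO(data), sep="\t", header=None, low_memory=False)

print(len(windowed), len(full))  # 60 70
```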

Some metrics:

60 rows

narrowpeak: 1307
not narrowpeak:93

real 0m5.841s
user 0m4.819s
sys 0m1.974s

all rows

DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv(bed, sep="\t", header=None, skiprows=row_count)

narrowpeak: 1138
not narrowpeak:262

real 2m48.225s
user 2m32.057s
sys 0m16.787s

Setting low_memory=False

narrowpeak: 1138
not narrowpeak:262

real 3m0.093s
user 2m30.672s
sys 0m26.109s

broadPeaks

60 rows

broadpeak: 1395
not broadpeak:5

real 0m6.293s
user 0m5.027s
sys 0m2.081s

all rows

broadpeak: 1339
not broadpeak:61

real 3m10.305s
user 2m50.379s
sys 0m17.236s

@donaldcampbelljr

Move it to a spot where we've already opened it in memory (geniml RegionSet).

The above comment refers to using RegionSet from geniml.io to read the entire bed file and send it to the classifier. We wanted to investigate using this workflow so that the pipeline opens the entire bedfile once.

I've added code that allows the classification function to take a dataframe as input: 8b3aff3

However, when testing this functionality (it works and gives the same classification results), I found that creating a RegionSet and then converting it to a dataframe with RegionSet.to_pandas() is much slower.


narrowpeak: 1138 
not narrowpeak:262

real    25m19.463s
user    24m56.004s
sys     0m12.930s

@donaldcampbelljr

Ok, I've now added logic so that a file that conforms to narrowPeak, except for the 5th column being greater than 1000, will now be classified as ns_narrowpeak (for #100).

I've checked to ensure that it will still properly classify as a bed4+6 in certain situations.

I did check the files that had been classified as bed4+6 and found that all would re-classify as bed6+4, ns_narrowpeak.

This was referenced Feb 12, 2025
@donaldcampbelljr

Also updated the regex to handle bed12+0 files more accurately.

@donaldcampbelljr commented Feb 13, 2025

We could also have nonstrict broadPeak, gappedPeak, and even bed as well if their scores are over 1000.
- [ ] implement nonstrict broadPeak
- [ ] implement nonstrict gappedPeak
- [ ] implement nonstrict bed (what's the best way to represent this?)

@donaldcampbelljr

Ok, I've now added logic so that a file that conforms to narrowPeak, except for the 5th column being greater than 1000, will now be classified as ns_narrowpeak (for #100).

I've checked to ensure that it will still properly classify as a bed4+6 in certain situations.

I did check the files that had been classified as bed4+6 and found that all would re-classify as bed6+4, ns_narrowpeak.

I decided to remove this logic for the time being, as I realized we would need a non-strict format for all the other bed formats, and this did not feel like the right approach.

Suggestion: if we want to capture how much data could be classified differently without a strict interpretation of column 5 (score), we could use a flag to enable a non-strict mode for the classifier that would allow the bed file to increment if the scores were positive integer values (so 0 < score, instead of 0 <= score <= 1000).
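A hypothetical helper sketching this flag (the function name, signature, and score values are mine, not from the codebase): strict mode keeps the spec's 0-1000 rule, non-strict mode only requires positive integers.

```python
import pandas as pd

def score_column_ok(scores: pd.Series, strict: bool = True) -> bool:
    """Check a bed score column; hypothetical sketch of a non-strict flag."""
    if scores.dtype.kind != "i":  # must still be an integer column
        return False
    if strict:
        return bool(scores.between(0, 1000).all())  # spec: 0 <= score <= 1000
    return bool((scores > 0).all())                 # relaxed: positive only

# Toy scores with one out-of-range value (2238, as in one example file above).
scores = pd.Series([150, 2238, 610], dtype="int64")
print(score_column_ok(scores), score_column_ok(scores, strict=False))  # False True
```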

@donaldcampbelljr

Marking this as solved after changes in related issues.
