check bed classifier #82

Open
khoroshevskyi opened this issue Nov 3, 2024 · 22 comments

Comments

@khoroshevskyi commented Nov 3, 2024

https://pephub.databio.org/bedbase/gse266949?tag=samples
Files apparently are narrowPeak, but it classified as bed4+6 and as bed file. Investigate it

https://bedbase.org/bedset/gse266949

p.s. Also it is a good example how genome validator predicted reference genome

donaldcampbelljr added a commit that referenced this issue Jan 30, 2025
@donaldcampbelljr
Copy link
Member

Pulling the recent data, there was a gap of about 10% where files were classified as 6+3 but not as broadPeaks.

After some investigation, I realized that we were only reading 4 rows, and that this window would sometimes miss the floats in columns 7, 8, and 9; those columns would then be classified as int64 and fail broadPeak classification. I increased the rows from 4 → 15 → 20 → 25 → 30 and found that reading 30 rows reclassifies all of the 6+3 files as broadPeaks.

broadPeak: 145
not broadPeak: 0

Another, more efficient approach would be to relax the type required in those columns to be either float or int64.
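To make the row-window problem concrete, here is a minimal sketch (with made-up rows, not from the real data) showing how pandas infers int64 for a signalValue-like column when the only float lies outside the read window:

```python
import io
import pandas as pd

# Hypothetical narrowPeak-like rows: the first four values of column 7
# (0-based index 6) happen to be whole numbers, so a 4-row window infers
# int64; the first float only appears in row 5.
data = "\n".join([
    "chr1\t0\t100\tpeak1\t0\t.\t5\t10\t12",
    "chr1\t200\t300\tpeak2\t0\t.\t6\t11\t13",
    "chr1\t400\t500\tpeak3\t0\t.\t7\t12\t14",
    "chr1\t600\t700\tpeak4\t0\t.\t8\t13\t15",
    "chr1\t800\t900\tpeak5\t0\t.\t8.5\t13.2\t15.1",
])

small_window = pd.read_csv(io.StringIO(data), sep="\t", header=None, nrows=4)
wider_window = pd.read_csv(io.StringIO(data), sep="\t", header=None, nrows=5)

print(small_window[6].dtype)  # int64 -> fails a float-only broadPeak check
print(wider_window[6].dtype)  # float64 -> passes
```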


@donaldcampbelljr

I believe we should not relax the logic because we would then get false positives with normal bed6+3 files:
Bed spec

  • col 7 thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays). When there is no thick part, thickStart and thickEnd are usually set to the chromStart position.
  • col 8 thickEnd - The ending position at which the feature is drawn thickly (for example the stop codon in gene displays).
  • col 9 itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.

The above columns would be in integer format.
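A small sketch of the false-positive risk described above (toy coordinate values, and a dtype-based check standing in for the classifier's logic): thickStart/thickEnd in an ordinary bed6+3 file are integer coordinates, so a relaxed check that accepts int64 as well as float64 for columns 7-9 would let them through, while a strict float-only check correctly rejects them.

```python
import pandas as pd

# Toy thickStart column from an ordinary bed6+3 file: integer coordinates.
thick_start = pd.Series([11873, 34610], dtype="int64")

relaxed_passes = thick_start.dtype.kind in ("i", "f")  # int OR float allowed
strict_passes = thick_start.dtype.kind == "f"          # float only

print(relaxed_passes, strict_passes)  # True False
```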

@donaldcampbelljr

For narrowPeaks:

I am seeing some files not classified IF narrowPeak is not in the name (the original file did include it, but if you re-pull it with just the digest, it will not classify as narrowPeak).

classified
bbb0a619c84d9b6b645fdd59a7179520

not classified
9098a59f469cd779b0c0f2824a807a01

It seems that ~60 rows down, the 0's become floats; once those rows are read, the file is flagged as a narrowPeak.

@donaldcampbelljr

Out of ~15,300 6+4 files, only 11 were classified as bed instead of narrowPeak. Extending the number of rows read by pandas re-classifies all of those as narrowPeaks.

@donaldcampbelljr

So what about files classified as narrowPeaks?

Interestingly, if we increase the rows read into the classifier (from 4 to 60), we actually start to see files being classified as bed4+6, bed.

Comparison on 10% of the Uploaded Data:

1400 files, 4 rows
narrowpeak: 1400
not narrowpeak:0

1400 files, 60 rows
narrowpeak: 1307
not narrowpeak:93

Why?

because of this check:

elif col == 4:  
    if df[col].dtype == "int" and df[col].between(0, 1000).all():  
        bedtype += 1  
    else:  
        n = num_cols - bedtype  
        return f"bed{bedtype}+{n}", bed_type_named

Some of these files do not fit the column 5 spec, which requires an integer value between 0 and 1000.

Some examples:

569304341e282330677bee56fd45db0a
row 45 has a value of 1613

d091b4b1e97ad3c284235d4d43082078
row 7 has a value of 26589

66788888eaea21c069763798ff719c33
row 12 has a value of 2238

f759ec1fd104ab1db5a1d01200807937
row 57 has a value of 1546

Question and Thoughts

So, is the 0-1000 range a guideline or a rule? (We have had this conversation in the past, I believe)

The original file names have narrowPeak in the filename so the uploaders considered them narrowPeak.

@donaldcampbelljr

And now for files that were originally classified as broadPeaks, what happens if we increase the number of rows:

4 rows
broadpeak: 1400
not broadpeak:0

60 rows
broadpeak: 1395
not broadpeak:5

@donaldcampbelljr

Similar to the narrowPeaks, the 5 files classified as not broadPeak when reading 60 rows instead of 4 were classified this way because one of the rows within the first 60 had a value of more than 1000.

@donaldcampbelljr commented Feb 3, 2025

We can also check to see if there were any files classified as narrowPeak but had the bed_type = bed4+6 or were classified as broadPeak and have the bed_type = bed4+5. Both of these scenarios could happen due to this code in the classifier:

        if num_cols == 9 and ("broadpeak" in bed or "broadPeak" in bed): 
            bed_type_named = "broadpeak"
        elif num_cols == 10 and ("narrowpeak" in bed or "narrowPeak" in bed):
            bed_type_named = "narrowpeak"
        else:
            bed_type_named = "bed"

So this could happen if the bed format were placed in the file name or path while the columns were not true to the specification.

However, our current data shows 0 hits for this scenario.

@donaldcampbelljr

Question and Thoughts

So, is the 0-1000 range a guideline or a rule? (We have had this conversation in the past, I believe)

The original file names have narrowPeak in the filename so the uploaders considered them narrowPeak.

Instead of checking the range for the first 60 rows, we could compute the median value and see if that is between 0 and 1000.

  if df[col].dtype == "int" and 0 <= df[col].median() <= 1000:
  #if df[col].dtype == "int" and df[col].between(0, 1000).all():
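A runnable sketch of the difference between the two checks, using toy score values with a single outlier above 1000 (as in the example digests above):

```python
import pandas as pd

# Toy score column: typical values plus one outlier above 1000.
scores = pd.Series([150, 300, 26589, 420, 610], dtype="int64")

strict_ok = scores.between(0, 1000).all()  # every value must be in [0, 1000]
median_ok = 0 <= scores.median() <= 1000   # only the typical value matters

print(strict_ok, median_ok)  # False True
```

The median check tolerates occasional outliers while still rejecting files whose scores are systematically out of range.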

@donaldcampbelljr

In the above commit, I've removed checking for "broadpeak" and "narrowpeak" in the input string because:

  • It was checking the entire path (not just the file name).
  • We are given the input format separately, so we can continue to keep track of that and compare it with our own results.
  • Depending on how the bed file is input into the function, it will not see the 'true' file name, e.g. if the file is zipped, it was only checking the .gz file path, not the file within the archive (which may have the format in the name).

@donaldcampbelljr

I've now added gappedPeak to the classifier per issue #91

Similar to broadPeak and narrowPeak files, if column 5 has any values over 1000, they will be classified as ('bed4+11', 'bed') instead of ('bed12+3', 'gappedpeak')

@donaldcampbelljr

We did discuss the column 5 being over 1000 here: #34 (comment)

May be worth re-visiting.

@donaldcampbelljr

For gappedPeak files pulled from GeoFetch (212 total):
gappedPeak: 123
not gappedPeak:89

The items classified as not gappedPeaks were due to either:

  1. value above 1000 in column 5
  2. '.' in column 5
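The second failure mode can be reproduced in a couple of lines (toy rows, not real data): a '.' placeholder in the score column forces pandas to infer an object dtype, so an integer-dtype check on column 5 fails and the file falls back to a plain bed classification.

```python
import io
import pandas as pd

# Toy rows: one '.' placeholder mixed with an integer in the score column
# (column 5, 0-based index 4) makes the whole column object-typed.
data = "chr1\t0\t100\tpeak1\t.\nchr1\t200\t300\tpeak2\t500"
df = pd.read_csv(io.StringIO(data), sep="\t", header=None)

print(df[4].dtype)  # object, not int64
```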

@donaldcampbelljr commented Feb 4, 2025

Do we want to record the values for column 5? We could re-visit this downstream, after opening the file for stats processing.

Move it to a spot where we've already opened it in memory (geniml RegionSet).

We are going with a strict interpretation.

@donaldcampbelljr commented Feb 12, 2025

Added Encode RNA elements yesterday here: 3e87b0f

@donaldcampbelljr

Investigating reading all rows vs 60. One must set the parameter low_memory=False and remove the nrows parameter. Reading the entire file is, as expected, slower. It will also change the classification, as there may be a row value towards the end of the file that changes how a column is classified (e.g. int vs float), or that is above 1000 in the case of the score column.
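A self-contained sketch of the two read modes (synthetic rows standing in for a real bed file): column 0 mixes "chrN" strings with bare integers past row 60, mimicking the mixed-type column behind the DtypeWarning. The 60-row window never sees the integers; reading everything with low_memory=False infers each column's dtype over the whole file in one pass, with no warning.

```python
import io
import pandas as pd

# 65 rows with "chr1" in column 0, then 5 rows with a bare integer there.
rows = ["chr1\t{}\t{}".format(i, i + 50) for i in range(65)]
rows += ["2\t{}\t{}".format(i, i + 50) for i in range(5)]
data = "\n".join(rows)

# Windowed read: only the first 60 rows are seen, so column 0 looks uniform.
windowed = pd.read_csv(io.StringIO(data), sep="\t", header=None, nrows=60)

# Full read: dtypes inferred over all rows at once, avoiding chunked inference.
full = pd.read_csv(io.StringIO(data), sep="\t", header=None, low_memory=False)

print(len(windowed), len(full))  # 60 70
```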

Some metrics:

60 rows

narrowpeak: 1307
not narrowpeak:93

real 0m5.841s
user 0m4.819s
sys 0m1.974s

all rows

DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv(bed, sep="\t", header=None, skiprows=row_count)

narrowpeak: 1138
not narrowpeak:262

real 2m48.225s
user 2m32.057s
sys 0m16.787s

Setting low_memory=False

narrowpeak: 1138
not narrowpeak:262

real 3m0.093s
user 2m30.672s
sys 0m26.109s

broadPeaks

60 rows

broadpeak: 1395
not broadpeak:5

real 0m6.293s
user 0m5.027s
sys 0m2.081s

all rows

broadpeak: 1339
not broadpeak:61

real 3m10.305s
user 2m50.379s
sys 0m17.236s

@donaldcampbelljr

Move it to a spot where we've already opened it in memory (geniml RegionSet).

The above comment refers to using RegionSet from geniml.io to read the entire bed file and send it to the classifier. We wanted to investigate using this workflow so that the pipeline opens the entire bedfile once.

I've added code that allows the classification function to take a dataframe as input: 8b3aff3

However, when testing this functionality (it works and gives the same classification results), I found that creating a RegionSet and then converting it to a dataframe with RegionSet.to_pandas() is much slower.


narrowpeak: 1138 
not narrowpeak:262

real    25m19.463s
user    24m56.004s
sys     0m12.930s

@donaldcampbelljr

Ok, I've now added logic so that a file that conforms to narrowPeak, except for the 5th column being greater than 1000, will now be classified as ns_narrowpeak (for #100).

I've checked to ensure that it will still properly classify as a bed4+6 in certain situations.

I did check the files that had been classified as bed4+6 and found that all would re-classify as bed6+4, ns_narrowpeak.

This was referenced Feb 12, 2025
@donaldcampbelljr

Also updated the regex to handle bed12+0 files more accurately.

@donaldcampbelljr commented Feb 13, 2025

We could also have nonstrict broadPeak, gappedPeak, and even bed as well if their scores are over 1000.
- [ ] implement nonstrict broadPeak
- [ ] implement nonstrict gappedPeak
- [ ] implement nonstrict bed (what's the best way to represent this?)

@donaldcampbelljr

Ok, I've now added logic so that a file that conforms to narrowPeak, except for the 5th column being greater than 1000, will now be classified as ns_narrowpeak (for #100).

I've checked to ensure that it will still properly classify as a bed4+6 in certain situations.

I did check the files that had been classified as bed4+6 and found that all would re-classify as bed6+4, ns_narrowpeak.

I decided to remove this logic for the time being, as I realized we would need a non-strict format for all the other bed formats, and this did not feel like the right approach.

Suggestion: if we want to capture how much data could be classified differently without a strict interpretation of column 5 (score), we could use a flag to enable a non-strict mode for the classifier that would allow the bed file to increment if the scores were positive integer values (so 0 < score, instead of 0 <= score <= 1000).
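A hypothetical helper sketching this flag (the function name, signature, and score values are mine, not from the codebase): strict mode keeps the spec's 0-1000 rule, non-strict mode only requires positive integers.

```python
import pandas as pd

def score_column_ok(scores: pd.Series, strict: bool = True) -> bool:
    """Check a bed score column; hypothetical sketch of a non-strict flag."""
    if scores.dtype.kind != "i":  # must still be an integer column
        return False
    if strict:
        return bool(scores.between(0, 1000).all())  # spec: 0 <= score <= 1000
    return bool((scores > 0).all())                 # relaxed: positive only

# Toy scores with one out-of-range value (2238, as in one example file above).
scores = pd.Series([150, 2238, 610], dtype="int64")
print(score_column_ok(scores), score_column_ok(scores, strict=False))  # False True
```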

@donaldcampbelljr

Marking this as solved after changes in related issues.
