check bed classifier #82
…curate, fixes broadPeak classification #82
Pulling the recent data, there was a gap of about 10% where files were classified as bed6+3 but not as broadPeak. After some investigation, I realized that we are only reading 4 rows, and that this window would sometimes miss the floats in columns 7, 8, and 9; those columns would then be inferred as int64 and fail broadPeak classification. I increased the row count from 4 → 15 → 20 → 25 → 30 until I found that reading 30 rows reclassifies all of the bed6+3 files as broadPeak.
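A minimal sketch of the dtype-inference problem described above, using a hypothetical 9-column BED6+3 fragment (the data and column positions are illustrative, not from the real files):

```python
import io
import pandas as pd

# Hypothetical BED6+3 fragment: columns 7-9 hold signal floats, but the
# first rows happen to contain whole numbers only, so a small read
# window infers int64 and the broadPeak check fails.
data = "\n".join([
    "chr1\t10\t100\tpeak1\t500\t+\t2\t3\t1",
    "chr1\t200\t300\tpeak2\t600\t-\t4\t5\t2",
    "chr1\t400\t500\tpeak3\t700\t+\t1.5\t2.7\t0.01",
])

few = pd.read_csv(io.StringIO(data), sep="\t", header=None, nrows=2)
more = pd.read_csv(io.StringIO(data), sep="\t", header=None)

print(few[6].dtype)   # int64 -> misses the later float, fails the check
print(more[6].dtype)  # float64 -> passes the float check
```

Reading more rows widens the window in which at least one float value appears, which is why the 4 → 30 bump closed the gap.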
Another, more efficient approach would be to relax the type required in those columns to be either float or int64.
I believe the reason we should not relax the logic is that we would then get false positives on normal bed6+3 files:
The above columns would be in integer format.
For narrowPeaks: I am seeing some not classified if narrowPeak is not in the name (the original file did include it, but if you re-pull it with just the digest, it will not classify as narrowPeak). It seems that ~60 rows down, the 0's become floats, which then flags the file as a narrowPeak.
Out of ~15,300 files.
So what about files classified as narrowPeaks? Interestingly, if we increase the rows read into the classifier (from 4 to 60), we actually start to see files falling out of the narrowPeak classification. Comparison on 10% of the uploaded data: 1400 files at 4 rows vs. 1400 files at 60 rows. Why? Because of this check:

```python
elif col == 4:
    if df[col].dtype == "int" and df[col].between(0, 1000).all():
        bedtype += 1
    else:
        n = num_cols - bedtype
        return f"bed{bedtype}+{n}", bed_type_named
```

Some of these files do not fit the column 5 spec, which requires an integer value between 0 and 1000. Some examples:

- 569304341e282330677bee56fd45db0a
- d091b4b1e97ad3c284235d4d43082078
- 66788888eaea21c069763798ff719c33
- f759ec1fd104ab1db5a1d01200807937

Question and thoughts: is the 0-1000 range a guideline or a rule? (We have had this conversation in the past, I believe.) The original file names have narrowPeak in the filename, so the uploaders considered them narrowPeak.
And now for files that were originally classified as broadPeaks: what happens if we increase the number of rows from 4 to 60?
Similar to the narrowPeaks, the 5 files classified as not broadPeak when reading 60 rows instead of 4 were classified this way because one of the rows within the first 60 had a value of more than 1000.
We can also check whether there were any files named as narrowPeak but falling through to the plain bed_type in this check:

```python
if num_cols == 9 and ("broadpeak" in bed or "broadPeak" in bed):
    bed_type_named = "broadpeak"
elif num_cols == 10 and ("narrowpeak" in bed or "narrowPeak" in bed):
    bed_type_named = "narrowpeak"
else:
    bed_type_named = "bed"
```

That is, cases where the bed format was placed in the file name or path and yet the columns were not true to the specification. However, our current data shows 0 hits for this scenario.
Instead of checking the range for the first 60 rows, we could compute the median value and see if that is between 0 and 1000:

```python
if df[col].dtype == "int" and 0 <= df[col].median() <= 1000:
# if df[col].dtype == "int" and df[col].between(0, 1000).all():
```
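A quick comparison of the two checks on a hypothetical score column with one outlier, showing why the median-based variant is more tolerant:

```python
import pandas as pd

# Hypothetical score column with a single outlier above 1000.
scores = pd.Series([0, 500, 1000, 5000], dtype="int64")

# Strict check: one out-of-range value fails the whole column.
strict_ok = bool(scores.between(0, 1000).all())

# Median-based check: robust to a few outliers.
median_ok = bool(scores.dtype == "int64" and 0 <= scores.median() <= 1000)

print(strict_ok)   # False
print(median_ok)   # True (median is 750.0)
```

The trade-off is that a file whose scores are mostly in range but occasionally out of spec would now pass, which is exactly the guideline-vs-rule question above.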
In the above commit, I've removed checking for "broadpeak" and "narrowpeak" in the input string because:
I've now added gappedPeak to the classifier per issue #91. Similar to broadPeak and narrowPeak files, if column 5 has any values over 1000, they will fail the check and not be classified as gappedPeak.
We did discuss column 5 being over 1000 here: #34 (comment). It may be worth re-visiting.
For gappedPeak files pulled from GeoFetch (212 total): the items classified as not gappedPeak were due to either:
Do we want to record the values for column 5? We could revisit this downstream after opening the file for stats processing, i.e. move it to a spot where we've already opened it in memory (geniml RegionSet). For now, we are going with the strict interpretation.
Added Encode RNA elements yesterday here: 3e87b0f |
Investigating reading all rows vs. 60. One must set `low_memory=False` (per the DtypeWarning below) when reading all rows. Some metrics:

narrowPeaks:
- 60 rows: narrowpeak: 1307; real 0m5.841s
- All rows: `DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.` narrowpeak: 1138; real 2m48.225s
- All rows, `low_memory=False`: narrowpeak: 1138; real 3m0.093s

broadPeaks:
- 60 rows: broadpeak: 1395; real 0m6.293s
- All rows: broadpeak: 1339; real 3m10.305s
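A sketch of a read helper covering both modes, assuming a hypothetical `read_for_classification` wrapper (the name and the tab-separated, headerless layout are assumptions):

```python
import pandas as pd

def read_for_classification(source, nrows=None):
    """Hypothetical helper: read a BED-like file for the classifier.

    With nrows=None the whole file is read; low_memory=False makes
    pandas infer each column's dtype from the full column rather than
    chunk by chunk, which silences the DtypeWarning seen above at the
    cost of extra memory and time.
    """
    return pd.read_csv(source, sep="\t", header=None,
                       nrows=nrows, low_memory=False)
```

Timing each variant with the shell's `time`, as in the metrics above, shows the ~30x wall-clock difference between the 60-row and full reads.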
The above comment refers to using RegionSet from geniml.io to read the entire bed file and send it to the classifier. We wanted to investigate this workflow so that the pipeline opens the entire bedfile only once. I've added code that allows the classification function to take a dataframe as input: 8b3aff3. However, when testing this functionality (while it works and provides the same classification results), I found that creating a RegionSet and then converting it to a dataframe using RegionSet.to_pandas() is much slower.
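A minimal sketch of what accepting either input type could look like; the function name and the column-count placeholder are assumptions, not the actual code from commit 8b3aff3:

```python
import pandas as pd

def get_bed_classification(bed, nrows=60):
    """Hypothetical entry point: accept a path/buffer OR an
    already-loaded DataFrame, so a pipeline that has opened the file
    once (e.g. via geniml's RegionSet) need not read it from disk again."""
    if isinstance(bed, pd.DataFrame):
        df = bed.head(nrows)
    else:
        df = pd.read_csv(bed, sep="\t", header=None, nrows=nrows)
    # Placeholder for the real bed-type logic: just report column count.
    return df.shape[1]
```

Either path yields the same frame for the classifier; the slowdown observed above comes from building the RegionSet and converting it, not from the classification itself.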
Ok, I've now added logic so that a file that conforms to narrowPeak except for the 5th column being greater than 1000 will now be classified under a separate type. I've checked to ensure that conforming files will still classify properly, and I did check the files that had previously been classified this way.
Also updated the regex to handle bed12+0 files accurately.
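For illustration, a hypothetical pattern for parsing "bedN+M" type strings; the real regex in the classifier may differ:

```python
import re

# Hypothetical "bedN+M" pattern. Anchoring both ends means "bed12+0"
# parses cleanly as (12, 0) instead of being partially matched.
BED_TYPE_RE = re.compile(r"^bed(\d+)\+(\d+)$")

def parse_bed_type(s):
    m = BED_TYPE_RE.match(s)
    return (int(m.group(1)), int(m.group(2))) if m else None

print(parse_bed_type("bed12+0"))  # (12, 0)
print(parse_bed_type("bed6+3"))   # (6, 3)
print(parse_bed_type("bed12"))    # None (no +M suffix)
```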
I decided to remove this logic for the time being, as I realized we would need to have a non-strict format for all the other types as well. Suggestion: if we want to capture how much data could be classified differently without a strict interpretation, we could measure that separately.
Marking this as solved after changes in related issues. |
https://pephub.databio.org/bedbase/gse266949?tag=samples
Files are apparently narrowPeak, but they were classified as bed4+6 and as a plain bed file. Investigate it.
https://bedbase.org/bedset/gse266949
P.S. It is also a good example of how the genome validator predicted the reference genome.