Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Callvar throws an error when run on broadPeak file #643

Open
LinearParadox opened this issue May 2, 2024 · 15 comments
Open

Bug: Callvar throws an error when run on broadPeak file #643

LinearParadox opened this issue May 2, 2024 · 15 comments
Assignees

Comments

@LinearParadox
Copy link

When running on a broadpeak file, callvar does the following:

macs3 callvar -b Met.mRp.clN_peaks.broadPeak -t ~/results/chromap/merged_replicate/met_filtered.bam --outdir results/var -o variants.vcf

File "MACS3/Signal/RACollection.pyx", line 358, in MACS3.Signal.RACollection.RACollection.get_PosReadsInfo_ref_pos
File "MACS3/Signal/RACollection.pyx", line 382, in MACS3.Signal.RACollection.RACollection.get_PosReadsInfo_ref_pos
File "MACS3/Signal/ReadAlignment.pyx", line 376, in MACS3.Signal.ReadAlignment.ReadAlignment.get_variant_bq_by_ref_pos
File "MACS3/Signal/ReadAlignment.pyx", line 408, in MACS3.Signal.ReadAlignment.ReadAlignment.get_variant_bq_by_ref_pos
File "MACS3/Signal/ReadAlignment.pyx", line 307, in MACS3.Signal.ReadAlignment.ReadAlignment.get_REFSEQ
ValueError: invalid literal for int() with base 10: b''
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/miniconda3/envs/MACS3/bin/macs3", line 1028, in
main()
File "/envs/MACS3/bin/macs3", line 105, in main
run( args )
File "/miniconda3/envs/MACS3/lib/python3.12/site-packages/MACS3/Commands/callvar_cmd.py", line 235, in run
results = mapresults.get(timeout=window_size*300)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/MACS3/lib/python3.12/multiprocessing/pool.py", line 774, in get
raise self._value
ValueError: invalid literal for int() with base 10: b''

@LinearParadox
Copy link
Author

So i fixed this by removing the peak names. However this leads to another issue:

Traceback (most recent call last):
File "/miniconda3/envs/MACS3/lib/python3.12/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/MACS3/lib/python3.12/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
^^^^^^^^^^^^^^^^
File "/miniconda3/envs/MACS3/lib/python3.12/site-packages/MACS3/Commands/callvar_cmd.py", line 342, in call_variants_at_range
PRI = collection.get_PosReadsInfo_ref_pos ( i, ref_nt, Q=minQ )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "MACS3/Signal/RACollection.pyx", line 358, in MACS3.Signal.RACollection.RACollection.get_PosReadsInfo_ref_pos
File "MACS3/Signal/RACollection.pyx", line 382, in MACS3.Signal.RACollection.RACollection.get_PosReadsInfo_ref_pos
File "MACS3/Signal/ReadAlignment.pyx", line 376, in MACS3.Signal.ReadAlignment.ReadAlignment.get_variant_bq_by_ref_pos
File "MACS3/Signal/ReadAlignment.pyx", line 408, in MACS3.Signal.ReadAlignment.ReadAlignment.get_variant_bq_by_ref_pos
File "MACS3/Signal/ReadAlignment.pyx", line 307, in MACS3.Signal.ReadAlignment.ReadAlignment.get_REFSEQ
ValueError: invalid literal for int() with base 10: b''
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/miniconda3/envs/MACS3/bin/macs3", line 1028, in
main()
File "/miniconda3/envs/MACS3/bin/macs3", line 105, in main
run( args )
File "/miniconda3/envs/MACS3/lib/python3.12/site-packages/MACS3/Commands/callvar_cmd.py", line 235, in run
results = mapresults.get(timeout=window_size*300)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/MACS3/lib/python3.12/multiprocessing/pool.py", line 774, in get
raise self._value
ValueError: invalid literal for int() with base 10: b''

@taoliu
Copy link
Contributor

taoliu commented May 3, 2024

@LinearParadox It seems that MACS3 can't read the broadPeak file correctly. Could you share us the top few lines of your broadpeak file?

@LinearParadox
Copy link
Author

LinearParadox commented May 3, 2024

My current file is:

chr1 903718 907055 + 3338
chr1 936442 936670 + 229
chr1 940313 943409 + 3097
chr1 980575 982962 + 2388

I realize for the original error I accidentally put peak names as the first column. For context, I called a set of differential peaks in R. I'm interested in seeing if there are any variants within these peaks. So I filtered my bam file to include these peaks only, and am running it on the filtered bam and this bed file.

I tested running this on the original peak calls and still get the same error. The original peak calls:

chr1 17311 17589 Met.mRp.clN_peak_1 388 . 3.13389 40.8058 38.8152
chr1 136647 136963 Met.mRp.clN_peak_2 87 . 2.01892 10.5187 8.76954
chr1 139009 139315 Met.mRp.clN_peak_3 22 . 1.51709 3.67699 2.29135
chr1 180870 181834 Met.mRp.clN_peak_4 389 . 3.13599 40.9225 38.9172
chr1 182700 184519 Met.mRp.clN_peak_5 103 . 2.0168 11.9825 10.3187
chr1 185721 187131 Met.mRp.clN_peak_6 57 . 1.78087 7.34263 5.77247

@taoliu
Copy link
Contributor

taoliu commented May 4, 2024

Your original peak file looks fine to me. It's in the appropriate BED+ format. Only the first six columns are essential and they should be in order: 1. Chromosome; 2. Start; 3. End; 4. Name; 5. Score; 6. Strand. Perhaps you can try to cut the first six columns? Also, please make sure the delimiter in the file is "tab" (or "\t"). Then try again?

@taoliu
Copy link
Contributor

taoliu commented May 4, 2024

Hmm... I am wrong with my previous comments. The error was found in:

ind += int(MD_op)

Here MACS tries to use the MD tag and CIGAR in BAM format to recover the reference sequence. MACS are supposed to see an integer but it turns out to be an empty string. I saw you mentioned that you made a BAM file for only the reads in the peaks. In fact, you don't have to do so. Please check the document here: https://macs3-project.github.io/MACS/docs/callvar.html You can use the unfiltered BAM file to call variants given the BAM file has been sorted and indexed (with a .bai file).

@taoliu taoliu self-assigned this May 6, 2024
@LinearParadox
Copy link
Author

LinearParadox commented May 6, 2024

SO I tried it with this, and I'm getting a different error. It's not fatal, but I'm not sure if it's impacting the results:

Arguments: ('chr1', 226478788, 226479694, '. Skipped!')
---begin of peak---
INFO  @ 06 May 2024 20:48:31: [130 MB] Peak: chr1 226537600 226537891 
--- Logging error ---
Traceback (most recent call last):
  File "/miniconda3/envs/MACS3/lib/python3.12/site-packages/MACS3/Commands/callvar_cmd.py", line 195, in run
    ra_collection = RACollection( chrom, peak, tbam.get_reads_in_region( chrom, peak["start"], peak["end"], maxDuplicate=maxDuplicate ) )
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "MACS3/IO/BAM.pyx", line 537, in MACS3.IO.BAM.BAMaccessor.get_reads_in_region
  File "MACS3/IO/BAM.pyx", line 574, in MACS3.IO.BAM.BAMaccessor.get_reads_in_region
OverflowError: value too large to convert to int

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "miniconda3/envs/MACS3/lib/python3.12/logging/__init__.py", line 1160, in emit
    msg = self.format(record)
          ^^^^^^^^^^^^^^^^^^^
  File "miniconda3/envs/MACS3/lib/python3.12/logging/__init__.py", line 999, in format
    return fmt.format(record)
           ^^^^^^^^^^^^^^^^^^
  File "miniconda3/envs/MACS3/lib/python3.12/logging/__init__.py", line 703, in format
    record.message = record.getMessage()
                     ^^^^^^^^^^^^^^^^^^^
  File "miniconda3/envs/MACS3/lib/python3.12/logging/__init__.py", line 392, in getMessage
    msg = msg % self.args
          ~~~~^~~~~~~~~~~
TypeError: not all arguments converted during string formatting
Call stack:
  File "miniconda3/envs/MACS3/bin/macs3", line 1028, in <module>
    main()
  File "miniconda3/envs/MACS3/bin/macs3", line 105, in main
    run( args )
  File "miniconda3/envs/MACS3/lib/python3.12/site-packages/MACS3/Commands/callvar_cmd.py", line 197, in run
    info ("No reads found in peak: ", chrom.decode(), peak["start"], peak["end"], ". Skipped!")
  File "miniconda3/envs/MACS3/lib/python3.12/logging/__init__.py", line 1539, in info
    self._log(INFO, msg, args, **kwargs)
  File "miniconda3/envs/MACS3/lib/python3.12/site-packages/MACS3/Utilities/Logger.py", line 14, in _log
    super()._log(level, f"[{mem_usage} MB] {msg}", args, exc_info, extra, stack_info)
Message: '[130 MB] No reads found in peak: '
Arguments: ('chr1', 226537600, 226537891, '. Skipped!')

Edit: This also produces an empty variants file, which is somehwat unexpected.

@taoliu
Copy link
Contributor

taoliu commented May 7, 2024

This error is caused due to the fact that, the size of a chunk of alignment result in BAM is decoded as larger than 2^32. It shouldn't happen normally since we only tried to decode 4bytes data into an unsigned int in the line of code pointed out by the error message:

entrylength = unpack( '<I', dcdata[ i_bytes : i_bytes + 4 ] )[ 0 ]

There must be something I didn't notice before.

Could we check if this time, the BAM file you used is legit? How did you generate it? What's the output from samtools flagstat? Is the ".BAI" file updated?

taoliu added a commit that referenced this issue May 7, 2024
@taoliu
Copy link
Contributor

taoliu commented May 7, 2024

Also, thanks for pointing out the error related to:

info ("No reads found in peak: ", chrom.decode(), peak["start"], peak["end"], ". Skipped!")

Ideally, if there is any error during retrieving reads from a given peak region, MACS3 should skip the peak. And it seems this line make the whole program quit. A possible solution is to remove the peak from your peak file and try again. And I will fix it in the next release. -- fixed in 759e100

@LinearParadox
Copy link
Author

LinearParadox commented May 7, 2024

Let me double check if it's updated. Bam was generated through the nfcore atac seq pipeline. Weird thing is I remember running these samples through the same pipeline a while ago, and callvar running fine then. I reran them again, only difference being I used chromap as an aligner instead of bwa (default).

samtools flagstat gives the following:

1547945886 + 0 in total (QC-passed reads + QC-failed reads)
1547945886 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
1547945886 + 0 mapped (100.00% : N/A)
1547945886 + 0 primary mapped (100.00% : N/A)
1547945886 + 0 paired in sequencing
773972943 + 0 read1
773972943 + 0 read2
1547945886 + 0 properly paired (100.00% : N/A)
1547945886 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

@taoliu
Copy link
Contributor

taoliu commented May 8, 2024

I see. This is a very important information. I just tested a bam file from chromap and can reproduce your error. Let me check where the problem is.

@taoliu
Copy link
Contributor

taoliu commented May 8, 2024

OK. Now I got it. Here is an alignment in SAM format from chromap that can cause the problem in callvar:

SRR1822137.116112	83	chrI	149	60	50M	=	39	-160	GCCCTGTCCCATTCAACCATACCACTCCGAACCACCATCCATCCCTCTAC	1HFAFD>@GFIJJJJJJGHF?GC?F?GHGG@IIGJIHFCDF?AFFFF@@@	NM:i:1	MD:Z:A49

And this is the alignment from bwa:

SRR1822137.116112	83	chrI	149	54	50M	=	39	-160	GCCCTGTCCCATTCAACCATACCACTCCGAACCACCATCCATCCCTCTAC	1HFAFD>@GFIJJJJJJGHF?GC?F?GHGG@IIGJIHFCDF?AFFFF@@@	NM:i:1	MD:Z:0A49	MC:Z:50M	AS:i:49	XS:i:44	XA:Z:chrII,-5937,50M,2;

The difference is on the 'MD' tag, with which callvar can recover the original DNA sequence from the reference genome. According to the definition in SAM optional tags, the MD tag should have a format of MD:Z:[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*. And there should be a number, even if it is 0, at the beginning of the MD tag. In bwa, it's correct -- 0A49, but in chromap, it's incorrect -- A49. When callvar processes an MD tag as A49, it will throw the error you found because it tries to interpret the first character in the MD tag as a number. There are more similar problems in the MD tag field in the SAM/BAM file generated from chromap.

The solution: if you have the original fasta file of the genome, you can fix these 'MD' tags by using samtools calmd function, and run callvar on the new BAM file. Example:

samtools calmd -b chromap.bam genome.fa > md_fixed.bam
samtools sort -o md_fixed.sorted.bam md_fixed.bam
samtools index md_fixed.bam

So, @LinearParadox, do you mind submitting a ticket to chromap regarding this?

@taoliu
Copy link
Contributor

taoliu commented May 13, 2024

Hi @LinearParadox, did the samtools calmd command solve this issue?

@LinearParadox
Copy link
Author

Seems to be running fine now! Is there any way to speed it up by chance? It's been a day and I'm still waiting on it to finish.

@taoliu
Copy link
Contributor

taoliu commented May 14, 2024

There are a couple of things to tweak to make the process faster. 1. decrease the number of peaks you plan to call variants; 2. use multiple CPUs with option -m NP; 3. turn off fermi for assembling the reads in the peak region with option -F off, and this will sacrifice the specificity for sensitivity and speed.

taoliu added a commit that referenced this issue May 14, 2024
@LinearParadox
Copy link
Author

Got it. The odd thing is when I pass the m option, my CPU usage never seems to go above 100% (when monitored with top). There seems to be no difference vs if I just didn't pass the option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants