Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ValueError: file has no sequences defined (mode='r') #7

Open
getziadz opened this issue Nov 7, 2021 · 11 comments
Open

ValueError: file has no sequences defined (mode='r') #7

getziadz opened this issue Nov 7, 2021 · 11 comments

Comments

@getziadz
Copy link

getziadz commented Nov 7, 2021

I am getting the following error when running data with human genome and the gtf file with circbase bed file.

== 10:33:39-Nov-06-2021 == [Mapping] Mapping consensus sequence to genome done!
== 10:33:39-Nov-06-2021 == [Classifying] Classifying consensus alignment ...
== 10:33:39-Nov-06-2021 == [classify_bam_core] Processing ./cons.fa.sam ... 
Traceback (most recent call last):
  File "/home/user/anaconda3/bin/isocirc", line 219, in <module>
    main()
  File "/home/user/anaconda3/bin/isocirc", line 216, in main
    isocirc_core(args)
  File "/home/user/anaconda3/bin/isocirc", line 117, in isocirc_core
    bc.bam_classify(cons_all_sam, high_bam, low_bam, args.high_max_ratio, args.high_min_ratio, args.high_iden_ratio, args.high_repeat_ratio, args.low_repeat_ratio)
  File "/home/user/anaconda3/lib/python3.8/site-packages/isocirc/bam_classify.py", line 168, in bam_classify
    with ps.AlignmentFile(in_bam_fn) as in_bam, ps.AlignmentFile(high_bam_fn, 'wb', template=in_bam) as high_bam, \
  File "pysam/libcalignmentfile.pyx", line 742, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 991, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='r') - is it SAM/BAM format? Consider opening with check_sq=False

I am using the latest version of isoCirc. Any idea on how I can resolve the issue?

@yangao07
Copy link
Collaborator

yangao07 commented Nov 8, 2021

Hi, can you paste the output of this command in output folder:

ls -l

Seems like the cons.fa.sam is empty.

@getziadz
Copy link
Author

getziadz commented Nov 8, 2021

Hi, the file is not empty:

-rw-rw-r--  1 user  16G Nov  6 10:10 cons.fa.sam

the first line of the file also looks correct:

@PG     ID:minimap2     PN:minimap2     VN:2.17-r941    CL:minimap2 -ax splice -ub --MD --eqx -t 8 Homo_sapiens.GRCh38.dna.toplevel.fa.gz ./cons.fa

@yangao07
Copy link
Collaborator

yangao07 commented Nov 9, 2021

There should be SQ lines in the sam header. Can you paste all the header of cons.fa.sam?
Also, not sure if the reference sequence is correct. Can you try fxtools lp Homo_sapiens.GRCh38.dna.toplevel.fa.gz and paste the output?

@getziadz
Copy link
Author

getziadz commented Nov 9, 2021

Hi, Thank you for your responses.

Yes, you are right, it does not have SQ line. I am not sure what could be the reason though:

@PG     ID:minimap2     PN:minimap2     VN:2.17-r941    CL:minimap2 -ax splice -ub --MD --eqx -t 8 Homo_sapiens.GRCh38.dna.toplevel.fa.gz ./cons.fa

de442b94-5914-48e7-be85-347fcab3600f_cons0      0       13      42168439        60      646S15=2D75=1D1=1D15=4D126N29=2D67=1D4=1I23=2I14=5241N67=1X17=3990N69=1I17=8828N3=1D1=1D55=1D38=1887N44=1D102=1D11=1X37=1I77=1093N22=1D12=1X28=2D39=1D17=259S   *       0       0

       CATATGAACTCAAATTGCCACCAAAAGCTTCCCTACT (and the sequence continues)

I also attached the output of the fxtools here.
fxtools_output.txt

Thanks!

@yangao07
Copy link
Collaborator

The problem is the reference genome you used.
See lh3/minimap2#58 and lh3/minimap2#37 for more information.
Changing to a non-toplevel/primary_assembly reference should work.

Yan

@getziadz
Copy link
Author

Hi Yan,
Thank you for your help. The above suggestion worked and solved the problem of the SQ lines.
The sam file now starts with multiple SQ lines (I am pasting an excerpt below), is this what we expected?:

@SQ     SN:1    LN:248956422
@SQ     SN:10   LN:133797422
@SQ     SN:11   LN:135086622
@SQ     SN:12   LN:133275309
@SQ     SN:13   LN:114364328
@SQ     SN:14   LN:107043718
@SQ     SN:15   LN:101991189
@SQ     SN:16   LN:90338345
@SQ     SN:17   LN:83257441
@SQ     SN:18   LN:80373285
@SQ     SN:19   LN:58617616
@SQ     SN:2    LN:242193529
@SQ     SN:20   LN:64444167
@SQ     SN:21   LN:46709983
@SQ     SN:22   LN:50818468
@SQ     SN:3    LN:198295559
@SQ     SN:4    LN:190214555
@SQ     SN:5    LN:181538259
@SQ     SN:6    LN:170805979
@SQ     SN:7    LN:159345973
@SQ     SN:8    LN:145138636
@SQ     SN:9    LN:138394717
@SQ     SN:MT   LN:16569
@SQ     SN:X    LN:156040895
@SQ     SN:Y    LN:57227415
@SQ     SN:KI270728.1   LN:1872759
@SQ     SN:KI270727.1   LN:448248

However, now there is a new error:

== 21:11:45-Nov-09-2021 == [classify_bam_core] Processing primary_ref/cons.fa.sam done.
== 21:11:45-Nov-09-2021 == [Classifying] Classifying consensus alignment done!
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.8/site-packages/pyfaidx/__init__.py", line 370, in __init__
    self.file = self._fasta_opener(filename, 'r+b'
  File "/home/user/anaconda3/lib/python3.8/site-packages/Bio/bgzf.py", line 273, in open
    return BgzfReader(filename, mode)
  File "/home/user/anaconda3/lib/python3.8/site-packages/Bio/bgzf.py", line 584, in __init__
    self._load_block(handle.tell())
  File "/home/user/anaconda3/lib/python3.8/site-packages/Bio/bgzf.py", line 611, in _load_block
    block_size, self._buffer = _load_bgzf_block(handle, self._text)
  File "/home/user/anaconda3/lib/python3.8/site-packages/Bio/bgzf.py", line 444, in _load_bgzf_block
    raise ValueError(
ValueError: A BGZF (e.g. a BAM file) block should start with b'\x1f\x8b\x08\x04', not b'\x1f\x8b\x08\x00'; handle.tell() now says 4

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/anaconda3/bin/isocirc", line 219, in <module>
    main()
  File "/home/user/anaconda3/bin/isocirc", line 216, in main
    isocirc_core(args)
  File "/home/user/anaconda3/bin/isocirc", line 132, in isocirc_core
    hf.hcBSJ_fullIso(high_bam, low_bam, long_len_fn, cons_info, cons_fa,
  File "/home/user/anaconda3/lib/python3.8/site-packages/isocirc/hcBSJ_fullIso.py", line 761, in hcBSJ_fullIso
    ref_fa, cons_fa = Fasta(ref), Fasta(cons)
  File "/home/user/anaconda3/lib/python3.8/site-packages/pyfaidx/__init__.py", line 998, in __init__
    self.faidx = Faidx(
  File "/home/user/anaconda3/lib/python3.8/site-packages/pyfaidx/__init__.py", line 374, in __init__
    raise UnsupportedCompressionFormat(
pyfaidx.UnsupportedCompressionFormat: Compressed FASTA is only supported in BGZF format. Use the samtools bgzip utility (instead of gzip) to compress your FASTA.

FYI, my fastq file is not zipped, it is just fastq, and I did not touch the reference genomes, they are zipped as they come from NCBI. Any suggestions for this error?

Thank you for your time on this.

@yangao07
Copy link
Collaborator

yangao07 commented Nov 11, 2021

pyfaidx.UnsupportedCompressionFormat: Compressed FASTA is only supported in BGZF format

The pyfaidx only accepts BGZF compressed .gz fasta file.
I guess the reference genome you downloaded from NCBI is gziped'.
You can simply unzip it and re-run isocirc.

@getziadz
Copy link
Author

Hi,
Thank you for the suggestion, it worked all the way through when I unzipped the genome files.

One last question I have is, in the stats.out file, although I used human RNA, it finds a lot of reads with BSJs, and total candidate BSJ, it reports that total known high confidence BSJs are 0. I used the CircBase’s human circRNA bed file. Why could that be? Shouldn’t some of the circRNAs detected be the known ones?

@yangao07
Copy link
Collaborator

The circBase coordinates are hg19, did you convert them to hg38?

@getziadz
Copy link
Author

Hi Yan,
Thank you for your suggestion, that was a great catch. I indeed had not convert it to the hg38.
I was sure your suggestion was going to solve the problem but I converted it using LiftOver and reran isocirc, but I got the same results.
Known BSJs are zero, and in the out all isoform IDs are “isocirc0”, “isocirc1”… etc.
Do you have any other ideas of how I can fix it?

@yangao07
Copy link
Collaborator

In addition to the liftover, the chromosome names should also be the same, like, "chr1" and "1" should be consistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants