Description
Dear developers,
Describe the issue
I encountered an issue trying to run RepeatModeler on a large and fragmented genome assembly. The assembly is 18 Gbp. The issue seems to occur during the family refinement steps.
The stderr log looks like this, continuing over many lines:
Could not open astacus_scaffold1.translation file for reading!
Could not open astacus_scaffold1.translation file for reading!
Could not open astacus_scaffold1.translation file for reading!
Could not open astacus_scaffold1.translation file for reading!
Could not open astacus_scaffold1.translation file for reading!
Could not open astacus_scaffold1.translation file for reading!
Could not open astacus_scaffold1.translation file for reading!
Could not open astacus_scaffold1.translation file for reading!
Could not open astacus_scaffold1.translation file for reading!
Then the software stopped after round 4 with this last error line:
eleredef failed. Exit code 9
The astacus_scaffold1.translation file did exist in the folder when I ran RepeatModeler; it was created during the BuildDatabase process and is not empty. So I think the error message is somewhat misleading.
I read a GitHub issue which seems to be similar to mine (here), but I do not think it was resolved. As for the people reporting on that issue, my folder has a BLAST database split over two volumes, which I think might be causing the problem.
-rw-r--r-- 1 nevers bio3d 768733 déc. 19 14:26 astacus_scaffold1.00.nhr
-rw-r--r-- 1 nevers bio3d 249444 déc. 19 14:26 astacus_scaffold1.00.nin
-rw-r--r-- 1 nevers bio3d 166240 déc. 19 14:26 astacus_scaffold1.00.nnd
-rw-r--r-- 1 nevers bio3d 700 déc. 19 14:26 astacus_scaffold1.00.nni
-rw-r--r-- 1 nevers bio3d 83152 déc. 19 14:26 astacus_scaffold1.00.nog
-rw-r--r-- 1 nevers bio3d 2999992548 déc. 19 14:26 astacus_scaffold1.00.nsq
-rw-r--r-- 1 nevers bio3d 5502041 déc. 19 14:27 astacus_scaffold1.01.nhr
-rw-r--r-- 1 nevers bio3d 1741356 déc. 19 14:27 astacus_scaffold1.01.nin
-rw-r--r-- 1 nevers bio3d 1160848 déc. 19 14:27 astacus_scaffold1.01.nnd
-rw-r--r-- 1 nevers bio3d 4580 déc. 19 14:27 astacus_scaffold1.01.nni
-rw-r--r-- 1 nevers bio3d 580456 déc. 19 14:27 astacus_scaffold1.01.nog
-rw-r--r-- 1 nevers bio3d 1635106568 déc. 19 14:27 astacus_scaffold1.01.nsq
-rw-r--r-- 1 nevers bio3d 116 déc. 19 14:27 astacus_scaffold1.nal
-rw-r--r-- 1 nevers bio3d 761 déc. 19 14:27 astacus_scaffold1.njs
-rw-r--r-- 1 nevers bio3d 3593168 déc. 19 14:25 astacus_scaffold1.translation
The issue does not occur if I split the genome into two equal parts and run RepeatModeler on each part separately. However, I know this is not recommended.
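For anyone wanting to reproduce the split workaround, here is a minimal sketch of one way to divide a FASTA file into two parts of roughly equal size. It is not necessarily the procedure used for this report: the function names `read_fasta` and `split_fasta_in_two` and the output file names are hypothetical, and the greedy balancing by base count is just one reasonable choice.

```python
from typing import Iterator, List, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file, one record at a time."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta_in_two(src: str, out_a: str, out_b: str) -> List[int]:
    """Write each record to whichever output currently holds fewer bases.

    Returns the number of bases written to each of the two outputs.
    """
    totals = [0, 0]
    with open(out_a, "w") as a, open(out_b, "w") as b:
        outs = (a, b)
        for header, seq in read_fasta(src):
            # Greedy choice: send the record to the lighter half so far.
            i = 0 if totals[0] <= totals[1] else 1
            outs[i].write(f">{header}\n{seq}\n")
            totals[i] += len(seq)
    return totals
```

It could be called as `split_fasta_in_two("Ast_scaffold_assembly.fa", "part1.fa", "part2.fa")`, where `part1.fa` and `part2.fa` are placeholder names. Each sequence is written on a single line, and whole scaffolds are never cut.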
I tried changing the makeblastdb parameters so that the database files are not split upon creation, but the maximum file size that makeblastdb allows is 4G, which is smaller than what I would need.
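A rough check of why 4G is not enough, using only the figures from the log and the directory listing in this report. It assumes the .nsq files store sequence at about 2 bits per base (4 bases per byte), and it reads "4G" as 4 GiB; both are assumptions, and the variable names are only illustrative.

```python
# Figures copied from the RepeatModeler log and the directory listing in this report.
total_bases = 18_539_308_152
nsq_sizes = [2_999_992_548, 1_635_106_568]  # astacus_scaffold1.00.nsq, .01.nsq

# Assumption: about 2 bits per base, i.e. 4 bases per byte, in the .nsq data.
packed_bytes = -(-total_bases // 4)  # ceiling division
observed_bytes = sum(nsq_sizes)

# Assumption: "4G" interpreted as 4 GiB; 4 * 10**9 would be smaller still.
limit_bytes = 4 * 2**30

print(f"expected packed size : {packed_bytes:,} bytes")
print(f"observed .nsq total  : {observed_bytes:,} bytes")
print(f"assumed 4G limit     : {limit_bytes:,} bytes")
print(f"fits in one volume?  : {packed_bytes <= limit_bytes}")
```

Under these assumptions the sequence data alone needs about 4.63 GB, which is close to the combined size of the two .nsq files and above the limit, so a single-volume database does not look achievable for this assembly.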
I believe my issue stems from this, but would you know whether it is something else, and how I could solve it? If not, what would be the best workaround?
Reproduction steps
The commands used were:
BuildDatabase -name astacus_scaffold1 -engine ncbi Ast_scaffold_assembly.fa
Then:
RepeatModeler -threads 64 -engine ncbi -database astacus_scaffold1 -LTRStruct
Log output
I pasted part of the stderr output above. The stdout output of RepeatModeler starts like this:
RepeatModeler Version 2.0.5
Using output directory = /EDITED/RM_3235912.FriDec191444552025
Search Engine = rmblast 2.14.1+
Threads = 64
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.5
LTR Structural Analysis: Enabled ( GenomeTools 1.6.4, LTR_Retriever ,
Ninja 0.97-cluster_only, MAFFT 7.520,
CD-HIT 4.8.1 )
Random Number Seed: 1766151892
Database = astacus_scaffold1 .................
- Sequences = 165886
- Bases = 18539308152
- N50 = 347608
- Contig Histogram:
Size(bp) Count
12163548-13032237 | [ 4 ]
11294859-12163547 | [ 1 ]
10426170-11294858 | [ 5 ]
9557481-10426169 | [ 7 ]
8688792-9557480 | [ 9 ]
7820103-8688791 | [ 20 ]
6951414-7820102 | [ 28 ]
6082726-6951414 | [ 28 ]
5214037-6082725 | [ 57 ]
4345348-5214036 | [ 90 ]
3476659-4345347 | [ 175 ]
2607970-3476658 | [ 279 ]
1739281-2607969 | [ 579 ]
870592-1739280 | [ 1800 ]
1904-870592 |************************************************** [ 162804 ]
Storage Throughput = excellent ( 1935.96 MB/s )
Ready to start the sampling process.
INFO: The runtime of RepeatModeler heavily depends on the quality of the assembly
and the repetitive content of the sequences. It is not imperative
that RepeatModeler completes all rounds in order to obtain useful
results. At the completion of each round, the files ( consensi.fa, and
families.stk ) found in:
/EDITED/RM_3235912.FriDec191444552025/
will contain all results produced thus far. These files may be
manually copied and run through RepeatClassifier should the program
be terminated early.
RepeatModeler Round # 1
Searching for Repeats
-- Sampling from the database...
- Gathering up to 40000000 bp
- Final Sample Size = 40029971 bp ( 40026171 non ambiguous )
- Num Contigs Represented = 1131
- Sequence extraction : 00:00:05 (hh:mm:ss) Elapsed Time
-- Running RepeatScout on the sequences...
- RepeatScout: Running build_lmer_table ( l = 14 )..
- RepeatScout: Running RepeatScout.. : 2017 raw families identified
- RepeatScout: Running filtering stage.. 942 families remaining
- RepeatScout: 00:21:44 (hh:mm:ss) Elapsed Time
- Large Satellite Filtering.. : 5 found in 00:00:11 (hh:mm:ss) Elapsed Time
- Collecting repeat instances...: 00:02:12 (hh:mm:ss) Elapsed Time
Refinement: 00:00:25 (hh:mm:ss) Elapsed Time
WARNING: Refiner did not return a consensus for /EDITED/RM_3235912.FriDec191444552025/round-1/family-302.fa.
WARNING: Refiner did not return a consensus for /EDITED/RM_3235912.FriDec191444552025/round-1/family-107.fa.
...
Environment (please include as much of the following information as you can find out):
I used RepeatModeler v2.0.5 through a Singularity container.
It was run on a Linux server running Ubuntu 20.04.6 LTS.
Thank you,
Best wishes,
Yannis Nevers