minor help improvements

hasindu2008 · Aug 17, 2024 · 6237ce7 · 6237ce7
1 parent d339dcf
commit 6237ce7
Show file tree

Hide file tree

Showing 6 changed files with 18 additions and 269 deletions.
diff --git a/docs/dna.md b/docs/dna.md
@@ -1,6 +1,6 @@
 # squigulator evaluation - genomic DNA
 
-**This document is based on very early experimentsre conducted using commit [2b9f3e886eea16](https://github.com/hasindu2008/squigulator/commit/2b9f3e886eea1601d9f0b2021bcd303ad48005a8) when the tool was under activate development.r more upto date information, please refer to the [squigulator publication](https://genome.cshlp.org/content/34/5/778.full?sid=cd2c8aec-be46-4c9e-885c-8452ac069f64).**
+**This document is based on very early experiments conducted using commit [2b9f3e886eea16](https://github.com/hasindu2008/squigulator/commit/2b9f3e886eea1601d9f0b2021bcd303ad48005a8) when the tool was under activate development. For more up to date information, please refer to the [squigulator publication](https://genome.cshlp.org/content/34/5/778.full?sid=cd2c8aec-be46-4c9e-885c-8452ac069f64).**
 
 For comparing *squigulator* against real data, reads mapping to the chr22 from a NA12878 dataset sequenced on a PromethION sequencer was used. The extracted dataset had ~135,000 reads whose mean read length was ~10,800 bases (will be referred to as *real*). To simulate reads using *squigulator*, first, a heterozygous NA12878 heterozygous chr22 reference was created by applying NA12878 GIAB variants to chr22 in hg38 reference genome. Then, *squigulator* was executed to generate 135,000 reads at mean read length of 10,800 (will be referred to as *simulated*).  See [Methods](#Methods) for more information.
 

diff --git a/docs/man.md b/docs/man.md
@@ -25,14 +25,14 @@ Advanced options are as below:
 -  `--dwell-mean FLOAT`: Mean of number of signal samples per k-mer/base. This is usually the sampling rate (4000Hz for DNA and 3000Hz for RNA) divided by translocation speed in bases per second (450 for R9.4.1 pore for DNA and 70 for RNA).  [default: 9.0]
 -  `--dwell-std FLOAT`: Standard deviation of number of signal samples per k-mer/base. Increasing this will increase time-domain noise. Setting this to 0 is same as `--ideal-time`. See example [here](img/dwell.svg). [default: 4.0]
 - `--bps INT`: translocation speed in bases per second (incompatible with --dwell-mean).
--  `--kmer-model FILE`: custom nucleotide k-mer model file (format similar to [here](https://github.com/hasindu2008/f5c/blob/master/test/r9-models/r9.4_450bps.nucleotide.6mer.template.model))
 -  `--prefix=yes/no`: generate prefixes such as adaptor (and polya for RNA). [default: no]
 -  `--seed INT`: seed for random generators (if 0, will be autogenerated). Giving the same seed will produce same results. [default: 0]
 - `--paf-ref`:  in paf output, use the reference as the target instead of read (needs -c)
 - `--cdna`: generate cDNA reads (only valid with dna profiles and the reference must a transcriptome, experimental)
-- `--trans-count FILE`: simulate relative abundance using specified 2-column tsv with first column containing transcript name and the second containing the count  (only for direct-rna and cDNA, experimental). You may generate this a test fatq dataset using minimap2, for example, `minimap2 -cx map-ont transcripts.fa  reads.fastq --secondary=no -t20 -uf | cut -f 6 | sort | uniq -c | awk '{print$2"\t"$1}'`.
+- `--trans-count FILE`: simulate relative abundance using specified 2-column tsv with first column containing transcript name and the second containing the count. See the example at [test/sequin_count.tsv]()  (only for direct-rna and cDNA, experimental). You may generate this using a dataset using minimap2, for example, `minimap2 -cx map-ont transcripts.fa  reads.fastq --secondary=no -t20 -uf | cut -f 6 | sort | uniq -c | awk '{print$2"\t"$1}'`.
 - `--trans-trunc=yes/no`: simulate transcript truncation (only for direct-rna, experimental) [default: no]
 - `--ont-friendly=yes/no`:   generate fake uuid for readids and add a dummy end_reason [default: no]
+-- `--meth-freq FILE`: simulate CpG methylation using a frequency tsv file. The tsv file should have three columns, chr, 0-based pos, and methylation frequency. See the example at [test/mfreq.tsv](). (for DNA, experimental)
 
 Developer options (which are not much tested and error handling) are as below:
 
@@ -43,3 +43,5 @@ Developer options (which are not much tested and error handling) are as below:
 -  `--offset-std FLOAT`:         ADC offset standard deviation (see [here](https://hasindu2008.github.io/slow5specs/summary))
 -  `--median-before-mean`:       Median before mean (see [here](https://hasindu2008.github.io/slow5specs/summary))
 -  `--median-before-std`:        Median before standard deviation (see [here](https://hasindu2008.github.io/slow5specs/summary))
+-  `--kmer-model FILE`:          custom nucleotide k-mer model file (format similar to [f5c models](https://github.com/hasindu2008/f5c/blob/master/test/r9-models/r9.4_450bps.nucleotide.6mer.template.model))
+-  `--meth-model FILE`:          custom methylation k-mer model file (format similar to [f5c models](https://github.com/hasindu2008/f5c/blob/master/test/r9-models/r9.4_450bps.cpg.6mer.template.model))
diff --git a/scripts/test.sh b/scripts/test.sh
@@ -143,8 +143,11 @@ diff -q test/r9_mfreq.exp a.slow5  || die "diff failed"
 # f5c index a.fastq --slow5 a.slow5 && f5c call-methylation -r a.fastq  -g test/nCoV-2019.reference.fasta -b a.bam --slow5 a.slow5 -o meth.tsv && f5c meth-freq -i meth.tsv -s -o meth-freq.tsv
 
 #meth r10
-# ex ./squigulator -x dna-r9-prom -o a.slow5 --seed 1 -t1 -n 2 -r 29000 test/nCoV-2019.reference.fasta --meth-freq test/methfreq.tsv
+ex ./squigulator -x dna-r10-prom -o a.slow5 --seed 1 -t1 -n 2 -r 29000 test/nCoV-2019.reference.fasta --meth-freq test/methfreq.tsv
 # diff -q test/r10_methfreq.exp a.slow5  || die "diff failed"
+# /install/buttery-eel-0.3.1+6.5.7/scripts/eel -i a.slow5 -o a.fastq --config  dna_r10.4.1_e8.2_400bps_5khz_sup.cfg -x cuda:all
+# minimap2 -ax map-ont /genome/nCoV-2019.reference.fasta  a.fastq --secondary=no | samtools sort - -o a.bam && samtools index a.bam
+# f5c index a.fastq --slow5 a.slow5 && f5c call-methylation -r a.fastq  -g test/nCoV-2019.reference.fasta -b a.bam --slow5 a.slow5 -o meth.tsv && f5c meth-freq -i meth.tsv -s -o meth-freq.tsv
 
 # threads and batch size
 redundancy_check () {

diff --git a/src/genref.c b/src/genref.c
diff --git a/src/sim.c b/src/sim.c
@@ -743,25 +743,27 @@ static void print_help(FILE *fp_help, opt_t opt, profile_t p, int64_t nreads) {
     fprintf(fp_help,"   --dwell-mean FLOAT         mean of number of signal samples per base [%.1f]\n",p.dwell_mean);
     fprintf(fp_help,"   --dwell-std FLOAT          standard deviation of number of signal samples per base [%.1f]\n",p.dwell_std);
     fprintf(fp_help,"   --bps INT                  translocation speed in bases per second (incompatible with --dwell-mean) [%ld]\n",(long)(p.sample_rate/p.dwell_mean));
-    fprintf(fp_help,"   --kmer-model FILE          custom nucleotide k-mer model file (format similar to https://github.com/hasindu2008/f5c/blob/master/test/r9-models/r9.4_450bps.nucleotide.6mer.template.model)\n");
-    fprintf(fp_help,"   --meth-model FILE          custom methylation k-mer model file (format similar to https://github.com/hasindu2008/f5c/blob/master/test/r9-models/r9.4_450bps.cpg.6mer.template.model)\n");
     fprintf(fp_help,"   --prefix=yes|no            generate prefixes such as adaptor (and polya for RNA) [no]\n");
     fprintf(fp_help,"   --seed INT                 seed or random generators (if 0, will be autogenerated) [%ld]\n",opt.seed);
     fprintf(fp_help,"   --paf-ref                  in paf output, use the reference as the target instead of read (needs -c)\n");
-    fprintf(fp_help,"   --cdna                     generate cDNA reads (only valid with dna profiles and the reference must a transcriptome, experimental)\n");
-    fprintf(fp_help,"   --trans-count FILE         simulate relative abundance using specified tsv with transcript name & count  (only for direct-rna and cDNA, experimental)\n");
+    fprintf(fp_help,"   --cdna                     generate cDNA reads (only for dna profiles & the reference must a transcriptome, experimental)\n");
+    fprintf(fp_help,"   --trans-count FILE         simulate relative abundance using tsv file [transcript name, count]  (for direct-rna & cDNA, experimental)\n");
     fprintf(fp_help,"   --trans-trunc=yes/no       simulate transcript truncattion (only for direct-rna, experimental) [no]\n");
     fprintf(fp_help,"   --ont-friendly=yes/no      generate fake uuid for readids and add a dummy end_reason [no]\n");
-    fprintf(fp_help,"   --meth-freq FILE           simulate CpG methylation using specified frequency file, tsv file with chr, 0-based pos and freq as the columns (only for DNA, experimental)\n");
+    fprintf(fp_help,"   --meth-freq FILE           simulate CpG methylation using frequency tsv file [chr, 0-based pos, freq] (for DNA, experimental)\n");
 
-    fprintf(fp_help,"\ndeveloper options (not much tested yet):\n");
+    fprintf(fp_help,"\ndeveloper options (less tested):\n");
     fprintf(fp_help,"   --digitisation FLOAT       ADC digitisation [%.1f]\n",p.digitisation);
     fprintf(fp_help,"   --sample-rate FLOAT        ADC sampling rate [%.1f]\n",p.sample_rate);
     fprintf(fp_help,"   --range FLOAT              ADC range [%.1f]\n",p.range);
     fprintf(fp_help,"   --offset-mean FLOAT        ADC offset mean [%.1f]\n",p.offset_mean);
     fprintf(fp_help,"   --offset-std FLOAT         ADC offset standard deviation [%.1f]\n",p.offset_std);
     fprintf(fp_help,"   --median-before-mean FLOAT Median before mean [%.1f]\n",p.median_before_mean);
     fprintf(fp_help,"   --median-before-std FLOAT  Median before standard deviation [%.1f]\n",p.median_before_std);
+    fprintf(fp_help,"   --kmer-model FILE          custom nucleotide k-mer model file (format similar to f5c models)\n");
+    fprintf(fp_help,"   --meth-model FILE          custom methylation k-mer model file (format similar to f5c models)\n");
+    fprintf(fp_help,"\n");
+    fprintf(fp_help,"See the manual page on GitHub for more details, options and the format of input/output files.\n");
 
     if(fp_help == stdout){
         exit(EXIT_SUCCESS);

diff --git a/test/methfreq.tsv b/test/methfreq.tsv
@@ -1,4 +1,4 @@
 MN908947.3	99	1.0
 MN908947.3	1184	0.0
 MN908947.3	1456	0.5
-MN908947.3	20031 	0.7
+MN908947.3	20031 	0.7