Issue with reads when template length is shorter than read length #5

Open
lhomas opened this issue May 3, 2018 · 4 comments

lhomas commented May 3, 2018

Hello,

I am trying to simulate reads using a template length distribution in which some templates are shorter than the read length, so not every read should reach the full 75 bp read length I am using. However, all reads in the output files are 75 bp.

As an example of the issue, I have provided a link to a Dropbox folder containing two files: the simulated reads and the read model used to create them. The template length is set to a mean of 50 and a standard deviation of 0, so every DNA fragment should be 50 bp, yet all of the reads are 75 bp (the read length set in the model).

https://www.dropbox.com/sh/uz2zjo2ze33978f/AAC8OXPwwnOtevohZ5dv_qjka?dl=0
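
One quick way to confirm the symptom is to tally the sequence lengths in the simulated FASTQ file. The sketch below assumes a plain four-line-per-record FASTQ; `read_length_counts` is a hypothetical helper name, and the toy records stand in for the real file, which is not reproduced here.

```python
from collections import Counter
import io

def read_length_counts(fastq_handle):
    """Tally sequence lengths in a 4-line-per-record FASTQ stream."""
    counts = Counter()
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # the sequence is the second line of each record
            counts[len(line.strip())] += 1
    return counts

# Toy input standing in for the simulated reads file (assumption, not real data)
example = io.StringIO(
    "@r1\nACGTA\n+\nIIIII\n"
    "@r2\nACG\n+\nIII\n"
    "@r3\nACGTA\n+\nIIIII\n"
)
length_counts = read_length_counts(example)
print(dict(length_counts))  # {5: 2, 3: 1}
```

If every read in the simulated file has the model's read length, the tally will contain a single key (75 in the case described above).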

ghost commented May 4, 2018

Hi @lhomas

Thank you for using Mitty. Unfortunately, Mitty does not have the capability to perform this kind of simulation out of the box.

However, it should be possible for you to write a sequencer model that allows this.

The built-in Illumina model is here. This should give you a decent idea of what you need to modify to get the result you want.

You can copy this code to start a new sequencer model (say variable_length_reads) and place it under Mitty/mitty/simulation/sequencing/. (BTW, out of curiosity, do you have a particular sequencer/library you are trying to simulate this way?)

Right now, I don't have a formal plugin system. Once you have your model ready, you can add it to cli.py around here, alongside the other existing models.
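
As a rough illustration of what such a model would need to do differently, here is a minimal sketch that caps each read of a pair at the template length. It is not Mitty's sequencer model interface: `reverse_complement` and `paired_reads_from_template` are hypothetical names, and the capping behavior is only an assumption about what the modified model should produce.

```python
# Translation table for complementing DNA bases
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def paired_reads_from_template(template, read_len):
    """Cut a forward and a reverse read, each capped at the template length.

    Hypothetical helper: when the template is shorter than read_len,
    both reads are len(template) bases long instead of read_len.
    """
    n = min(len(template), read_len)
    r1 = template[:n]                      # read 1 from the forward strand
    r2 = reverse_complement(template)[:n]  # read 2 from the opposite strand
    return r1, r2

print(paired_reads_from_template("ACGTT", 10))  # ('ACGTT', 'AACGT')
print(paired_reads_from_template("ACGTT", 3))   # ('ACG', 'AAC')
```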

I'm happy to help you along in this process, should you decide to do this.

Thanks!
-Kaushik

yassineS commented Jan 8, 2019

Hi Kaushik,

We have a similar issue here. We always simulate Illumina reads, but we are trying to simulate the case where the DNA fragments (templates) are shorter than the number of sequencer cycles. This is the case in forensics and ancient DNA. However, even when specifying a template length smaller than the read length (50 vs 120), we still get reads of 120 bp.

Cheers,
Yas

ghost commented Jan 8, 2019

Hi @yassineS, this is an interesting use case. When the fragment is shorter than the number of cycles (and presumably paired-end), what is the behavior of the real machine?

Machine cycle 10 (for easier counting)
Template length 5 (12345)

What kind of reads will be produced?

Thanks!

yassineS commented Jan 9, 2019

Well, the first thing is that the sequencer reads into the barcodes and index, and then it spits out a random sequence of bases at very low quality. So during the demultiplexing step we trim the low-quality bases. Here's a snippet from a real dataset (forward reads; R1):

@ST-E00106:394:HJ2L3ALXX:7:1101:2646:1116 1:N:0:NACTCCG
NAGGTACCTCCCAGGTAGCTGAGATTACAGGCACTTGAGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAACTCCGATCTCGTATGCCGTCTTCTGCTTGAAAAAACCCTCCTGCCGCCGTCTCTGGCTCGAGGCCGCGCCGGCGT
+
#A-AAFJJAFJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAJJJFJJJ<FFJJJJJJJJFFJJJJJJJJJJJJJJJJJJ7JFJ7AAFFJAJJJ<<A<FJJJJFJJJAJJF----77------7--7---7--7)7)7)))7)-)))))))
@ST-E00106:394:HJ2L3ALXX:7:1101:4797:1116 1:N:0:NACTCCG
NCTTACGTGGTTCCAGTGCTTTAACTTTGGTTCGCTGGTGGCGTTGAAAGCAGGAGCGTCGGCATGATCGAGTTCGGTGATTTCTATCAGCAGTAGTACAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAACTCCGATCCCGTATG
+
#AAFFJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJAJJJJJJJJFJJJJJJAJJJJ7JF7FFFJFJFFFJJJJJJJJJJJJJAFJJJJJFJJJJJJJJJFJJJJJJ77AJJFJJJA-AFJFJJAFJJA7A)7AA7-7
@ST-E00106:394:HJ2L3ALXX:7:1101:6380:1116 1:N:0:NACTCCG
NGAACGTTCCAGATCGTCCAGAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAACTCCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAATAAGCCCCTCCCCCCTCGTACCAGATCCACACATCACTCGCCCTATCGCCCTGCTC
+
#AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJ7AJJJJJJJJFJJJJJJJJJJJJJJJJ--7--<-------7-7-----7-----------------7-)-)-)---7-))))7)
@ST-E00106:394:HJ2L3ALXX:7:1101:6400:1116 1:N:0:NACTCCG
NATTGCAAGGCCCACCAAGCTGCTGCCCTAGGCATCGATTAAGATCAAGAGCACACGTCTGAACTCCAGTCACAACTCCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAACTTCTTCTTCTCTATCCTTTTTATTGCTTGCTTCGGCTGC
+
#AAA-FJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAJJJJJJJJJJJJJJJJJJJJJJJJJJ7---7--7------------------7----)7--))))))
@ST-E00106:394:HJ2L3ALXX:7:1101:7314:1116 1:N:0:NACTCCG
NATAGGCTGCAATGGCCTAGGGGACTCAAGGAGTCAGGCTCTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAACTCCGATCTCGTATGCCGTCTTCTGCTTGAAAAAACACAACAACCTCCCCCCAGCCCCCTGCCTCGAGCCCCC
+
#AAFAJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ<FJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJFJJJJJJJFJ7FJJJJJJJJJJJJJJJJFJJJJJJJJ---7--------7---<--))))))))))7))))))))
@ST-E00106:394:HJ2L3ALXX:7:1101:7679:1116 1:N:0:NACTCCG
NCCAAGTAATCTGCACCTGGGCACTGGAGCATCACCTGCATCATGAACTGAAATATTTAAACTACGCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAACTCCGATCTCGTATGCCGTCTTCTGCTTGAAAAAACCCCCCCCCCGCC
+
#AAAAAJJJJJJJJFJJJJJJJJJJJFJJJJJJJJJJJJJFJJJAJFJJFAJJFFJJJJJJJJJJ<JFAA<JJAJF77AF-JJJJ<FFA7FF<FJJJFJFJFJJFJJ<FAJ7JFFFAAJJ7<J<AJAAF<-<AJFAA-7-7)))7--))-7
@ST-E00106:394:HJ2L3ALXX:7:1101:7699:1116 1:N:0:NACTCCG
NATAGGCTGCAGCGGAGGCGGCGGAGCGCACCGCCCAAGGCTCTAGATCGAAGAGCACACGTCTGAACTCCAGTCACAACTCCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAACCTCCTTTCTCGGTTTTCCTCTCTGGGCCGCGGGTT
+
#AAAFJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAJJJJJJJJJJJJJJJJJJJFJJJJJJ-----7---77--77----7---)--))))7)7))))
@ST-E00106:394:HJ2L3ALXX:7:1101:8613:1116 1:N:0:NACTCCG
NAGGTACATACCGGTTCTGCAAGCGCCGTGTGGCTATGGCCGCCGACGATGATATCGATGCCTTCTACATTTTCTGCAAGTTCCTTGAGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAACTCCGATCTCGTATGCCGTCTTCTG
+
#AAAAJJJJJJJJJ<JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAJ7FJJFJJJJFAJFFJAF
@ST-E00106:394:HJ2L3ALXX:7:1101:8937:1116 1:N:0:NACTCCG
NTTGACTATGGAACAGAATAGAGAGCCCAGAAATAATGCTGCACACCTACAATCATGAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAACTCCGATCTCGTATGCCGTCTTCTGCTTGAAAAAACCTCGACTCCTTGTGGTGTGGC
+
#AA-AJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJFJ7FJJJJJJJJJJJFJJJJJJJJJJJJJ7---)))-)----777-)-7))
@ST-E00106:394:HJ2L3ALXX:7:1101:9851:1116 1:N:0:NACTCCG
NATTGCAATGAGTACCCTGTCGGATGAGCATGGGCCACAGGCGCATGGCCACGCGCCGCGCTTCGATTAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAACTCCGATCTCGTATGCCGTCTTCTGCTTGAAAAAACCCGGCCCCCC
+
#AAF<FJFJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJ<JFJJJJJJJJJFJJJJJJJJJAJJJJ7JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJFJJAJFJJJFJJAFJ<<JF7FFFFJJJA-A7--))7))-)-
@ST-E00106:394:HJ2L3ALXX:7:1101:9912:1116 1:N:0:NACTCCG
NTCCGTAGTGAGGATCACTTGGGCCTGGGAGGTAGATGCTGAGAATCTCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAACTCCGATCTCGTATGCCGTCTTCTGCTTGAAAAAAACTTTTTTAATCAATCGGCCGCGCTGGTGCG
+
#AAFFJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ7---7-7<<---7-----))))))))-)))7
@ST-E00106:394:HJ2L3ALXX:7:1101:10236:1116 1:N:0:NACTCCG
NCTTACGGATCACAAGGTCAGGAGTTCCAGACTAGCCTGGCCTAGTACAAGATCGGAAGAGCACACGTCTGACTCCAGTCACAACTCCGATCTCGTATGCCGTCTTCTGCTTGAAAAAACATACCCATGCCTCCAACCGCCCGCGTGGCGG
+
#AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJ7AJJJJJJJJJJJJJJJJFJJJJJJJJ-7--7--------7-)--))))-)-))-))))
@ST-E00106:394:HJ2L3ALXX:7:1101:11089:1116 1:N:0:NACTCCG
NATTGCATGGTGGTGCCACTCAGCGGAGATCCGGGGAGCCCTCGTGGCAGATGGTTGAGGGTCGTCGATTAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAACTCCGATCTCGTATGCCGTCTTCTGCTTGAAAAAACTCACCGCG
+
#AAAFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJFJJJJJJFJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJFJ-FJFJFFJJJ<JAAJFJJ<FF<AAF7F----7))))
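
The read-through pattern in these records can be imitated with a small sketch: template bases, then adapter bases, then low-quality filler once the adapter runs out. The adapter string is the sequence that recurs in the R1 reads above (a 7-base index and further adapter sequence follow it in the real reads, which the sketch leaves out); the function names, the fixed quality characters, and the exact-match trimming are all simplifying assumptions, not the behavior of any particular tool.

```python
import random

# Sequence that recurs after the insert in the R1 reads above
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

def simulate_read_through(template, cycles, adapter=ADAPTER, rng=None):
    """Return (seq, qual) for a read that runs past a short template."""
    rng = rng or random.Random(0)
    known = (template + adapter)[:cycles]
    # Once the adapter is exhausted, fill with random low-quality bases
    filler = "".join(rng.choice("ACGT") for _ in range(cycles - len(known)))
    qual = "J" * len(known) + ")" * len(filler)  # Phred+33: J = Q41, ) = Q8
    return known + filler, qual

def trim_adapter(seq, qual, adapter=ADAPTER, min_overlap=3):
    """Cut the read at the first exact match to the adapter or its prefix."""
    for i in range(len(seq)):
        tail = seq[i:i + len(adapter)]
        if len(tail) >= min_overlap and adapter.startswith(tail):
            return seq[:i], qual[:i]
    return seq, qual

# The 10-cycle, 5-base-template case asked about earlier in the thread
seq, qual = simulate_read_through("CCGTT", cycles=10)
print(seq)                      # CCGTTAGATC
print(trim_adapter(seq, qual))  # ('CCGTT', 'JJJJJ')
```

Exact matching is the main shortcut here: real reads contain sequencing errors inside the adapter, so a practical trimmer would tolerate mismatches, and a short `min_overlap` can also clip genuine template bases that happen to look like the start of the adapter.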

ghost self-assigned this on Jan 11, 2019
ghost added the enhancement label on Jan 11, 2019