You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am looking for a method to achieve the task as in the title. This will be used to simulate sequencing read data to reconstruct the trajectory by sampling the simulated reads data in ART read simulator or something like this.
Specifically, I want to use msprime to simulate a bottleneck at 12000 generations ago and a exponential increase from 12000 generations ago to Nt of 200000, kept stable until 4000 generations ago followed by a exponential decrease till present. I intend to simulate sequences for a sample size 100 haploid individuals at a mutation rate and recombination rate for 10 chromosome with chr names in numbers and with each having different size, and output to a vcf file. I wish the simulation uses/applies the same reference genome, which will be later used with bcftools consensus -H A -f ref.fa sim.vcf.gz -o sim_sensus.fa to generate consensus genome sequence for each haploid individual (DNA). Or alternatively, after the simulation, a reference genome or the whole DNA sequence is expected to generated along with simulation data.
I have made the following codes but I did not succeed.
import msprime
import numpy as np
import tskit
import random
# Define parameters
sample_size = 100 # Number of haploid individuals
mutation_rate = 2.9e-9 # Mutation rate per site per generation
recombination_rate = 2.0e-9 # Recombination rate per site per generation
# Define chromosome sizes and names
chromosome_sizes = [100000000, 100000000, 150000000, 150000000, 100000000,
100000000, 100000000, 100000000, 100000000, 100000000] # Sizes for 10 chromosomes
chromosome_names = [str(i+1) for i in range(10)] # Names as strings "1" to "10"
# Define nucleotide composition (equal proportions of A, T, C, G)
nucleotides = ["A", "T", "C", "G"]
# Function to generate a random sequence
def generate_sequence(length):
return "".join(random.choices(nucleotides, k=length))
# Generate the reference genome and save it to a FASTA file
reference_genome = {}
with open("reference_genome.fasta", "w") as fasta_file:
for name, size in zip(chromosome_names, chromosome_sizes):
# Generate a random sequence for the chromosome
sequence = generate_sequence(size)
# Store the sequence in the reference genome dictionary
reference_genome[name] = sequence
# Write the chromosome to the FASTA file
fasta_file.write(f">{name}\n") # Write the header
fasta_file.write(sequence + "\n") # Write the sequence
# Load the reference genome into a dictionary
reference_genome = {}
with open("reference_genome.fasta", "r") as fasta_file:
current_chr = None
for line in fasta_file:
if line.startswith(">"):
current_chr = line.strip().split(">")[1]
reference_genome[current_chr] = ""
else:
reference_genome[current_chr] += line.strip()
# Define demographic model
demography = msprime.Demography()
demography.add_population(name="pop", initial_size=200000) # Initial population size
# Add events in reverse chronological order (most recent to oldest)
# Exponential decrease from 4,000 generations ago to present
final_population_size = 1000
growth_rate_decrease = np.log(final_population_size / 200000) / 4000 # Growth rate for exponential decrease
demography.add_population_parameters_change(time=4000, growth_rate=growth_rate_decrease, population="pop")
# Stable population size from 12,000 to 4,000 generations ago
demography.add_population_parameters_change(time=12000, growth_rate=0, population="pop")
# Exponential growth from 12,000 generations ago to Nt = 200,000
growth_rate = np.log(200000 / 1000) / (12000 - 4000) # Growth rate for exponential increase
demography.add_population_parameters_change(time=12000, growth_rate=growth_rate, population="pop")
# Bottleneck at 12,000 generations ago
demography.add_population_parameters_change(time=12000, initial_size=1000, population="pop")
# Simulate for each chromosome
ts_list = []
for i, size in enumerate(chromosome_sizes):
# Simulate ancestry
ts = msprime.sim_ancestry(
samples=sample_size,
demography=demography,
sequence_length=size,
recombination_rate=recombination_rate,
ploidy=1 # Haploid
)
# Simulate mutations using the reference genome as the ancestral state
tables = ts.dump_tables()
for site in tables.sites:
# Get the position of the site
pos = int(site.position)
# Get the ancestral state from the reference genome
ancestral_state = reference_genome[chromosome_names[i]][pos - 1] # Positions are 1-based in VCF
# Update the ancestral state in the site table
site.ancestral_state = ancestral_state
site.derived_state = ancestral_state # Default derived state (can be updated later)
# Rebuild the tree sequence
ts = tables.tree_sequence()
# Simulate mutations
ts = msprime.sim_mutations(ts, rate=mutation_rate)
ts_list.append(ts)
# Optionally, write each tree sequence to a separate VCF file
with open(f"chromosome_{chromosome_names[i]}.vcf", "w") as vcf_file:
ts.write_vcf(vcf_file, contig_id=chromosome_names[i])
The subsequent steps had issues with inconsistent REF alleles based on above codes.
# subset vcf for an individual tsk_99
bcftools view chromosome_1.vcf- -s tsk_99 -O z -o chromosome_1_tsk_99.vcf.gz
bcftools consensus -H A -f reference_genome.fasta chromosome_1_tsk_99.vcf.gz -o chromosome_1_tsk_99_genome.fa 2>&1 | tee chromosome_1_tsk_99.debug.log
# flag in the above line: -H, --haplotype, R: REF allele in het genotypes; A: ALT allele
The first five lines of the debug log file are as below. This is just an example and there are more inconsistency across the chr.
The fasta sequence does not match the REF allele at 1:64:
REF .vcf: [G]
ALT .vcf: [G]
REF .fa : [T]
TCGTCCGCAATTTGCAA...
I could not figure out this issue. Any help or improvement will be appreciated a lot. Looking forward!
The text was updated successfully, but these errors were encountered:
yzliu01
changed the title
how to simulate data for a demographic trajectory in conjection with a reference genome sequence?
How to simulate data for a demographic trajectory in conjection with a reference genome sequence or output whole DNA sequence?
Feb 28, 2025
Hi everyone,
I am looking for a method to achieve the task as in the title. This will be used to simulate sequencing read data to reconstruct the trajectory by sampling the simulated reads data in
ART
read simulator or something like this.Specifically, I want to use
msprime
to simulate a bottleneck at 12000 generations ago and a exponential increase from 12000 generations ago toNt
of 200000, kept stable until 4000 generations ago followed by a exponential decrease till present. I intend to simulate sequences for a sample size 100 haploid individuals at a mutation rate and recombination rate for 10 chromosome withchr
names in numbers and with each having different size, and output to avcf
file. I wish the simulation uses/applies the same reference genome, which will be later used withbcftools consensus -H A -f ref.fa sim.vcf.gz -o sim_sensus.fa
to generate consensus genome sequence for each haploid individual (DNA). Or alternatively, after the simulation,a reference genome
orthe whole DNA sequence
is expected to generated along with simulation data.I have made the following codes but I did not succeed.
The subsequent steps had issues with inconsistent REF alleles based on above codes.
The first five lines of the debug log file are as below. This is just an example and there are more inconsistency across the chr.
I could not figure out this issue. Any help or improvement will be appreciated a lot. Looking forward!
The text was updated successfully, but these errors were encountered: