Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to simulate data for a demographic trajectory in conjection with a reference genome sequence or output whole DNA sequence? #2340

Open
yzliu01 opened this issue Feb 28, 2025 · 0 comments

Comments

@yzliu01
Copy link

yzliu01 commented Feb 28, 2025

Hi everyone,

I am looking for a method to achieve the task as in the title. This will be used to simulate sequencing read data to reconstruct the trajectory by sampling the simulated reads data in ART read simulator or something like this.

Specifically, I want to use msprime to simulate a bottleneck at 12000 generations ago and a exponential increase from 12000 generations ago to Nt of 200000, kept stable until 4000 generations ago followed by a exponential decrease till present. I intend to simulate sequences for a sample size 100 haploid individuals at a mutation rate and recombination rate for 10 chromosome with chr names in numbers and with each having different size, and output to a vcf file. I wish the simulation uses/applies the same reference genome, which will be later used with bcftools consensus -H A -f ref.fa sim.vcf.gz -o sim_sensus.fa to generate consensus genome sequence for each haploid individual (DNA). Or alternatively, after the simulation, a reference genome or the whole DNA sequence is expected to generated along with simulation data.

I have made the following codes but I did not succeed.

import msprime
import numpy as np
import tskit
import random

# Define parameters
sample_size = 100  # Number of haploid individuals
mutation_rate = 2.9e-9  # Mutation rate per site per generation
recombination_rate = 2.0e-9  # Recombination rate per site per generation

# Define chromosome sizes and names
chromosome_sizes = [100000000, 100000000, 150000000, 150000000, 100000000, 
                    100000000, 100000000, 100000000, 100000000, 100000000]  # Sizes for 10 chromosomes
chromosome_names = [str(i+1) for i in range(10)]  # Names as strings "1" to "10"

# Define nucleotide composition (equal proportions of A, T, C, G)
nucleotides = ["A", "T", "C", "G"]

# Function to generate a random sequence
def generate_sequence(length):
    return "".join(random.choices(nucleotides, k=length))

# Generate the reference genome and save it to a FASTA file
reference_genome = {}
with open("reference_genome.fasta", "w") as fasta_file:
    for name, size in zip(chromosome_names, chromosome_sizes):
        # Generate a random sequence for the chromosome
        sequence = generate_sequence(size)
        # Store the sequence in the reference genome dictionary
        reference_genome[name] = sequence
        # Write the chromosome to the FASTA file
        fasta_file.write(f">{name}\n")  # Write the header
        fasta_file.write(sequence + "\n")  # Write the sequence

# Load the reference genome into a dictionary
reference_genome = {}
with open("reference_genome.fasta", "r") as fasta_file:
    current_chr = None
    for line in fasta_file:
        if line.startswith(">"):
            current_chr = line.strip().split(">")[1]
            reference_genome[current_chr] = ""
        else:
            reference_genome[current_chr] += line.strip()

# Define demographic model
demography = msprime.Demography()
demography.add_population(name="pop", initial_size=200000)  # Initial population size

# Add events in reverse chronological order (most recent to oldest)

# Exponential decrease from 4,000 generations ago to present
final_population_size = 1000
growth_rate_decrease = np.log(final_population_size / 200000) / 4000  # Growth rate for exponential decrease
demography.add_population_parameters_change(time=4000, growth_rate=growth_rate_decrease, population="pop")

# Stable population size from 12,000 to 4,000 generations ago
demography.add_population_parameters_change(time=12000, growth_rate=0, population="pop")

# Exponential growth from 12,000 generations ago to Nt = 200,000
growth_rate = np.log(200000 / 1000) / (12000 - 4000)  # Growth rate for exponential increase
demography.add_population_parameters_change(time=12000, growth_rate=growth_rate, population="pop")

# Bottleneck at 12,000 generations ago
demography.add_population_parameters_change(time=12000, initial_size=1000, population="pop")

# Simulate for each chromosome
ts_list = []
for i, size in enumerate(chromosome_sizes):
    # Simulate ancestry
    ts = msprime.sim_ancestry(
        samples=sample_size,
        demography=demography,
        sequence_length=size,
        recombination_rate=recombination_rate,
        ploidy=1  # Haploid
    )
    
    # Simulate mutations using the reference genome as the ancestral state
    tables = ts.dump_tables()
    for site in tables.sites:
        # Get the position of the site
        pos = int(site.position)
        # Get the ancestral state from the reference genome
        ancestral_state = reference_genome[chromosome_names[i]][pos - 1]  # Positions are 1-based in VCF
        # Update the ancestral state in the site table
        site.ancestral_state = ancestral_state
        site.derived_state = ancestral_state  # Default derived state (can be updated later)
    # Rebuild the tree sequence
    ts = tables.tree_sequence()
    
    # Simulate mutations
    ts = msprime.sim_mutations(ts, rate=mutation_rate)
    ts_list.append(ts)
    
    # Optionally, write each tree sequence to a separate VCF file
    with open(f"chromosome_{chromosome_names[i]}.vcf", "w") as vcf_file:
        ts.write_vcf(vcf_file, contig_id=chromosome_names[i])

The subsequent steps had issues with inconsistent REF alleles based on above codes.

# subset vcf for an individual tsk_99
bcftools view chromosome_1.vcf- -s tsk_99 -O z -o chromosome_1_tsk_99.vcf.gz

bcftools consensus -H A -f reference_genome.fasta chromosome_1_tsk_99.vcf.gz -o chromosome_1_tsk_99_genome.fa 2>&1 | tee chromosome_1_tsk_99.debug.log
# flag in the above line: -H, --haplotype, R: REF allele in het genotypes; A: ALT allele

The first five lines of the debug log file are as below. This is just an example and there are more inconsistency across the chr.

The fasta sequence does not match the REF allele at 1:64:
   REF .vcf: [G]
   ALT .vcf: [G]
   REF .fa : [T]
TCGTCCGCAATTTGCAA...

I could not figure out this issue. Any help or improvement will be appreciated a lot. Looking forward!

@yzliu01 yzliu01 changed the title how to simulate data for a demographic trajectory in conjection with a reference genome sequence? How to simulate data for a demographic trajectory in conjection with a reference genome sequence or output whole DNA sequence? Feb 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant