Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Oxford phased haplotype file (.haps) + .sample export #55

Open
hyanwong opened this issue Nov 28, 2023 · 3 comments
Open

Oxford phased haplotype file (.haps) + .sample export #55

hyanwong opened this issue Nov 28, 2023 · 3 comments

Comments

@hyanwong
Copy link
Member

hyanwong commented Nov 28, 2023

Both Relate and ARG-needle read data in "Oxford phased haplotype file" (.haps + .sample) format, as described at https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample (also see https://www.cog-genomics.org/plink/2.0/formats#haps). This could be a useful output format for a tree sequence, to supplement VCF?

Conversion can apparently be done from VCF using plink or one of the Relate utilities:

plink2 --vcf phased.vcf --export haps --out new_filename_prefix

or

PATH_TO_RELATE/bin/RelateFileFormats \
                 --mode ConvertFromVcf \
                 --haps example.haps \
                 --sample example.sample \
                 -i example 

(see https://myersgroup.github.io/relate/input_data.html#ConvertToFileFormat)

@jeromekelleher
Copy link
Member

If it's just text we can happily support in tskit itself.

@hyanwong
Copy link
Member Author

hyanwong commented Nov 29, 2023

It's a bit annoying that the output consists of two text files, which is which I thought tsconvert might be better. Here's a first stab at a conversion routine. It's not extensively tested, though, and there's some faffing with metadata that might be considered unnecessary (but was useful e.g. for plotting Relate inferences from the Kreitman dataset).

def ts_to_haps_sample(ts, haps_output, sample_output, chromosome_number=1, sample_name_field="name"):
    """
    Output the tree sequence as in haps / sample format as required by Relate and ARGneedle
    (see https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample)

    ``haps_output`` and ``sample_output`` are the filehandles to which the data will be written.
    To obtain either as strings, you can pass an io.StringIO object here.

    ``sample_name_field`` gives the metadata field in which to look up names to use in the
    output sample file. Where possible, names are taken from the associated individual metadata.
    If samples are not associated with individuals (i.e. this is haploid data), then
    names are taken from node metadata. If no ``sample_name_field`` is present in the metadata,
    the names used are "Individual_N" if samples are associated with individuals, or "Sample_N"
    otherwise.

    Returns an array of the site_ids that were written to the haps file (sites
    with 1 allele or > 2 alleles are skipped)

    .. example::

        with open("out.haps", "wt") as haps, open("out.sample", "wt") as sample:
            ts_to_haps_sample(ts, haps, sample)
    """
    used = np.zeros(ts.num_sites, dtype=bool)
    for v in ts.variants():
        if len(v.alleles) == 1:
            continue
        if len(v.alleles) > 2:
            print(f"Multialleic site ({v.alleles}) at position {v.site.position} ignored")
            continue
        used[v.site.id] = True
        print(
            str(chromosome_number),
            f"SNP{v.site.id}",
            int(v.site.position),
            v.alleles[0],
            v.alleles[1],
            " ".join([str(g) for g in v.genotypes]),
            sep=" ",
            file=haps_output,
        )

    print("ID_1 ID_2 missing", file=sample_output)
    print("0    0    0", file=sample_output)
    individuals = ts.nodes_individual[ts.samples()]
    if np.all(individuals == tskit.NULL):
        # No individuals, just use node metadata
        pass
    else:
        if np.any(individuals == tskit.NULL):
            raise ValueError("Some samples have no individuals")
        _, counts = np.unique(individuals, return_counts=True)
        if np.all(counts == 2):
            if np.any(np.diff(individuals)[0::2]) != 0:
                ValueError("Pairs of adjacent samples must come from the same individual")
        elif np.all(counts == 1):
            pass
        else:
            raise ValueError("Must have all diploid or all haploid samples")
    samples = ts.samples()
    i=0
    while i < len(samples):
        ind1 = ts.node(samples[i]).individual
        ind2 = tskit.NULL
        if ind1 == tskit.NULL:
            try:
                name = ts.node(samples[i]).metadata[sample_name_field].replace(" ", "_")
            except (TypeError, KeyError):
                name = f"Sample_{samples[i]}"
        else:
            try:
                name = ts.individual(ind1).metadata[sample_name_field].replace(" ", "_")
            except (TypeError, KeyError):
                name = f"Individual_{ind1}"
            try:
                ind2 = ts.node(samples[i+1]).individual
            except IndexError:
                pass
        if ind2 == tskit.NULL or ind2 != ind1:
            print(f'{name} NA 0', file=sample_output)
            i += 1
        else:
            print(f'{name} {name} 0', file=sample_output)
            i += 2
    return np.where(used)[0].astype(ts.mutations_site.dtype)

@jeromekelleher
Copy link
Member

Yep, pass in the file handles directly, and leave out the metadata stuff (follow the existing conventions for vcf and other text formats for generating sample names), and it should be fine for tskit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants