Oxford phased haplotype file (.haps) + .sample export #55

hyanwong · 2023-11-28T22:40:31Z

Both Relate and ARG-needle read data in "Oxford phased haplotype file" (.haps + .sample) format, as described at https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample (also see https://www.cog-genomics.org/plink/2.0/formats#haps). This could be a useful output format for a tree sequence, to supplement VCF?

Conversion can apparently be done from VCF using plink or one of the Relate utilities:

plink2 --vcf phased.vcf --export haps --out new_filename_prefix

or

PATH_TO_RELATE/bin/RelateFileFormats \
                 --mode ConvertFromVcf \
                 --haps example.haps \
                 --sample example.sample \
                 -i example

(see https://myersgroup.github.io/relate/input_data.html#ConvertToFileFormat)

The text was updated successfully, but these errors were encountered:

jeromekelleher · 2023-11-29T09:24:59Z

If it's just text we can happily support in tskit itself.

hyanwong · 2023-11-29T14:05:44Z

It's a bit annoying that the output consists of two text files, which is which I thought tsconvert might be better. Here's a first stab at a conversion routine. It's not extensively tested, though, and there's some faffing with metadata that might be considered unnecessary (but was useful e.g. for plotting Relate inferences from the Kreitman dataset).

def ts_to_haps_sample(ts, haps_output, sample_output, chromosome_number=1, sample_name_field="name"):
    """
    Output the tree sequence as in haps / sample format as required by Relate and ARGneedle
    (see https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample)

    ``haps_output`` and ``sample_output`` are the filehandles to which the data will be written.
    To obtain either as strings, you can pass an io.StringIO object here.

    ``sample_name_field`` gives the metadata field in which to look up names to use in the
    output sample file. Where possible, names are taken from the associated individual metadata.
    If samples are not associated with individuals (i.e. this is haploid data), then
    names are taken from node metadata. If no ``sample_name_field`` is present in the metadata,
    the names used are "Individual_N" if samples are associated with individuals, or "Sample_N"
    otherwise.

    Returns an array of the site_ids that were written to the haps file (sites
    with 1 allele or > 2 alleles are skipped)

    .. example::

        with open("out.haps", "wt") as haps, open("out.sample", "wt") as sample:
            ts_to_haps_sample(ts, haps, sample)
    """
    used = np.zeros(ts.num_sites, dtype=bool)
    for v in ts.variants():
        if len(v.alleles) == 1:
            continue
        if len(v.alleles) > 2:
            print(f"Multialleic site ({v.alleles}) at position {v.site.position} ignored")
            continue
        used[v.site.id] = True
        print(
            str(chromosome_number),
            f"SNP{v.site.id}",
            int(v.site.position),
            v.alleles[0],
            v.alleles[1],
            " ".join([str(g) for g in v.genotypes]),
            sep=" ",
            file=haps_output,
        )

    print("ID_1 ID_2 missing", file=sample_output)
    print("0    0    0", file=sample_output)
    individuals = ts.nodes_individual[ts.samples()]
    if np.all(individuals == tskit.NULL):
        # No individuals, just use node metadata
        pass
    else:
        if np.any(individuals == tskit.NULL):
            raise ValueError("Some samples have no individuals")
        _, counts = np.unique(individuals, return_counts=True)
        if np.all(counts == 2):
            if np.any(np.diff(individuals)[0::2]) != 0:
                ValueError("Pairs of adjacent samples must come from the same individual")
        elif np.all(counts == 1):
            pass
        else:
            raise ValueError("Must have all diploid or all haploid samples")
    samples = ts.samples()
    i=0
    while i < len(samples):
        ind1 = ts.node(samples[i]).individual
        ind2 = tskit.NULL
        if ind1 == tskit.NULL:
            try:
                name = ts.node(samples[i]).metadata[sample_name_field].replace(" ", "_")
            except (TypeError, KeyError):
                name = f"Sample_{samples[i]}"
        else:
            try:
                name = ts.individual(ind1).metadata[sample_name_field].replace(" ", "_")
            except (TypeError, KeyError):
                name = f"Individual_{ind1}"
            try:
                ind2 = ts.node(samples[i+1]).individual
            except IndexError:
                pass
        if ind2 == tskit.NULL or ind2 != ind1:
            print(f'{name} NA 0', file=sample_output)
            i += 1
        else:
            print(f'{name} {name} 0', file=sample_output)
            i += 2
    return np.where(used)[0].astype(ts.mutations_site.dtype)

jeromekelleher · 2023-11-29T14:13:10Z

Yep, pass in the file handles directly, and leave out the metadata stuff (follow the existing conventions for vcf and other text formats for generating sample names), and it should be fine for tskit.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Oxford phased haplotype file (.haps) + .sample export #55

Oxford phased haplotype file (.haps) + .sample export #55

hyanwong commented Nov 28, 2023 •

edited

Loading

jeromekelleher commented Nov 29, 2023

hyanwong commented Nov 29, 2023 •

edited

Loading

jeromekelleher commented Nov 29, 2023

Oxford phased haplotype file (.haps) + .sample export #55

Oxford phased haplotype file (.haps) + .sample export #55

Comments

hyanwong commented Nov 28, 2023 • edited Loading

jeromekelleher commented Nov 29, 2023

hyanwong commented Nov 29, 2023 • edited Loading

jeromekelleher commented Nov 29, 2023

hyanwong commented Nov 28, 2023 •

edited

Loading

hyanwong commented Nov 29, 2023 •

edited

Loading