-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Oxford phased haplotype file (.haps) + .sample export #55
Comments
If it's just text we can happily support in tskit itself. |
It's a bit annoying that the output consists of two text files, which is which I thought def ts_to_haps_sample(ts, haps_output, sample_output, chromosome_number=1, sample_name_field="name"):
"""
Output the tree sequence as in haps / sample format as required by Relate and ARGneedle
(see https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample)
``haps_output`` and ``sample_output`` are the filehandles to which the data will be written.
To obtain either as strings, you can pass an io.StringIO object here.
``sample_name_field`` gives the metadata field in which to look up names to use in the
output sample file. Where possible, names are taken from the associated individual metadata.
If samples are not associated with individuals (i.e. this is haploid data), then
names are taken from node metadata. If no ``sample_name_field`` is present in the metadata,
the names used are "Individual_N" if samples are associated with individuals, or "Sample_N"
otherwise.
Returns an array of the site_ids that were written to the haps file (sites
with 1 allele or > 2 alleles are skipped)
.. example::
with open("out.haps", "wt") as haps, open("out.sample", "wt") as sample:
ts_to_haps_sample(ts, haps, sample)
"""
used = np.zeros(ts.num_sites, dtype=bool)
for v in ts.variants():
if len(v.alleles) == 1:
continue
if len(v.alleles) > 2:
print(f"Multialleic site ({v.alleles}) at position {v.site.position} ignored")
continue
used[v.site.id] = True
print(
str(chromosome_number),
f"SNP{v.site.id}",
int(v.site.position),
v.alleles[0],
v.alleles[1],
" ".join([str(g) for g in v.genotypes]),
sep=" ",
file=haps_output,
)
print("ID_1 ID_2 missing", file=sample_output)
print("0 0 0", file=sample_output)
individuals = ts.nodes_individual[ts.samples()]
if np.all(individuals == tskit.NULL):
# No individuals, just use node metadata
pass
else:
if np.any(individuals == tskit.NULL):
raise ValueError("Some samples have no individuals")
_, counts = np.unique(individuals, return_counts=True)
if np.all(counts == 2):
if np.any(np.diff(individuals)[0::2]) != 0:
ValueError("Pairs of adjacent samples must come from the same individual")
elif np.all(counts == 1):
pass
else:
raise ValueError("Must have all diploid or all haploid samples")
samples = ts.samples()
i=0
while i < len(samples):
ind1 = ts.node(samples[i]).individual
ind2 = tskit.NULL
if ind1 == tskit.NULL:
try:
name = ts.node(samples[i]).metadata[sample_name_field].replace(" ", "_")
except (TypeError, KeyError):
name = f"Sample_{samples[i]}"
else:
try:
name = ts.individual(ind1).metadata[sample_name_field].replace(" ", "_")
except (TypeError, KeyError):
name = f"Individual_{ind1}"
try:
ind2 = ts.node(samples[i+1]).individual
except IndexError:
pass
if ind2 == tskit.NULL or ind2 != ind1:
print(f'{name} NA 0', file=sample_output)
i += 1
else:
print(f'{name} {name} 0', file=sample_output)
i += 2
return np.where(used)[0].astype(ts.mutations_site.dtype) |
Yep, pass in the file handles directly, and leave out the metadata stuff (follow the existing conventions for vcf and other text formats for generating sample names), and it should be fine for tskit. |
Both Relate and ARG-needle read data in "Oxford phased haplotype file" (.haps + .sample) format, as described at https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#hapsample (also see https://www.cog-genomics.org/plink/2.0/formats#haps). This could be a useful output format for a tree sequence, to supplement VCF?
Conversion can apparently be done from VCF using plink or one of the Relate utilities:
or
(see https://myersgroup.github.io/relate/input_data.html#ConvertToFileFormat)
The text was updated successfully, but these errors were encountered: