general_stat()/sample_count_stat() and f(0)=0 requirement #1188

grahamgower · 2021-02-03T15:47:38Z

grahamgower
Feb 3, 2021
Collaborator

Consider the following code, which has a bug. The bug is triggered by the final line, but not by the penultimate line.

import msprime

def allele_counts(ts, samples):

    def f(x):
        if x == 0 or x == len(samples):
            return 0
        return x

    return ts.sample_count_stat(
        [samples],
        f,
        1,
        windows="sites",
        polarised=False,
        mode="site",
        span_normalise=False,
    )

ts = msprime.simulate(
    100, Ne=10000, mutation_rate=1e-8, length=100000, random_seed=1234
)

samples = ts.samples()

ac1 = allele_counts(ts, ts.samples())
ac2 = allele_counts(ts, ts.samples()[:10])

Traceback (most recent call last):
  File "/home/grg/src/AIstats/test_sample_count_stat.py", line 27, in <module>
    ac2 = allele_counts(ts, ts.samples()[:10])
  File "/home/grg/src/AIstats/test_sample_count_stat.py", line 10, in allele_counts
    return ts.sample_count_stat(
  File "/home/grg/.local/lib/python3.9/site-packages/tskit/trees.py", line 5393, in sample_count_stat
    return self.general_stat(
  File "/home/grg/.local/lib/python3.9/site-packages/tskit/trees.py", line 5292, in general_stat
    return self.__run_windowed_stat(
  File "/home/grg/.local/lib/python3.9/site-packages/tskit/trees.py", line 5432, in __run_windowed_stat
    stat = method(*args, **kwargs, windows=windows)
ValueError: object of too small depth for desired array

Now, reading the "Note" in the docs for sample_count_stat() I see the following:

The summary function f should return zero when given both 0 and the sample size (i.e., f(0) = 0 and f(np.array([len(x) for x in sample_sets])) = 0).

Indeed, the code above follows this to the letter. Presumably the code in the note must not be taken literally, because f(x) is never called with a scalar x; x is always a numpy array. So we must interpret the zero in f(0) above as a zero in the algebraic sense? With the following changes, things work as expected.

--- test_sample_count_stat.py
+++ test_sample_count_stat2.py
@@ -1,10 +1,11 @@
 import msprime
+import numpy as np
 
 def allele_counts(ts, samples):
 
     def f(x):
-        if x == 0 or x == len(samples):
-            return 0
+        if all(x == 0) or all(x == len(samples)):
+            return np.zeros_like(x)
         return x
 
     return ts.sample_count_stat(
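To make the difference between the two versions of f concrete, here is a minimal numpy-only sketch. It assumes, as noted above, that f receives a one-element array for a single sample set; the size n = 10 and the names f_scalar and f_array are hypothetical stand-ins for the two versions in the diff. It only shows the shape of what each version returns, which is presumably what the "too small depth" error is complaining about.

```python
import numpy as np

n = 10  # hypothetical sample-set size, standing in for len(samples)

def f_scalar(x):
    # Mirrors the original f: returns a plain Python int in the special cases.
    if x == 0 or x == n:
        return 0
    return x

def f_array(x):
    # Mirrors the revised f: always returns an array shaped like x.
    if np.all(x == 0) or np.all(x == n):
        return np.zeros_like(x)
    return x

x_zero = np.array([0.0])  # count of zero in the single sample set
x_mid = np.array([3.0])   # an intermediate count

# Number of dimensions of each return value.
print(np.ndim(f_scalar(x_zero)), np.ndim(f_scalar(x_mid)))  # 0 1
print(np.ndim(f_array(x_zero)), np.ndim(f_array(x_mid)))    # 1 1
```

The original version returns a zero-dimensional value only when the count is 0 or n. That would fit the failure appearing for the 10-sample subset, where some sites can have a count of 0, though this reading is an assumption and not confirmed in the thread.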

This isn't really a bug in tskit, although maybe the checks could be more strict in the strict=True case, and/or the docs could be clearer (and likewise for general_stat()). However, I'm curious why this requirement exists, but this isn't automatically done for the user? Instead of strict=True checking for this behaviour, why doesn't it just implement this behaviour?

Answered by petrelharp

petrelharp · 2021-02-03T17:30:06Z

petrelharp
Feb 3, 2021
Maintainer

Hm, ok: this is a documentation bug, for sure. Let's see:

What if the docs said:

The summary function `f` should return zero (i.e., an array of zeros of appropriate length)
when given either zero or the sample size: i.e., both `f([0 for _ in sample_sets])` and `f([len(x) for x in sample_sets])`
should return zero.

I'm curious why this requirement exists, but this isn't automatically done for the user? Instead of strict=True checking for this behaviour, why doesn't it just implement this behaviour?

The requirement exists because statistics with this requirement are insensitive to parts of the tree that are not segregating between any of the samples. So, if you've got a summary function that doesn't satisfy this requirement, then it depends on parts of the tree that are unobservable from polymorphism data (e.g., the length of a branch above the root, or branches not ancestral to any of the samples). That might be what you want to do, but we want you to make sure it's what you want to do. (So you have to do strict=False to turn it off.)

I could have sworn this was in the docs somewhere, but I can't find it. I can add something about it.
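The proposed wording can be read as a check that anyone can run on their own summary function. The sketch below is only an illustration of that reading: satisfies_zero_requirement is a hypothetical helper, not a tskit function, and it is not claimed to match how strict=True is implemented internally.

```python
import numpy as np

def satisfies_zero_requirement(f, sample_set_sizes):
    """Hypothetical helper (not part of tskit): evaluate f at all-zero counts
    and at the full sample-set sizes, and report whether both are all zeros."""
    n = np.asarray(sample_set_sizes, dtype=float)
    at_zero = np.asarray(f(np.zeros_like(n)))
    at_n = np.asarray(f(n))
    return bool(np.all(at_zero == 0) and np.all(at_n == 0))

sizes = [10]  # one sample set of (hypothetical) size 10

def identity(x):
    # Raw allele count: equals n for a fixed allele, so f(n) != 0.
    return x

def segregating(x):
    # Zero whenever no sample, or every sample, carries the allele.
    return x * (sizes[0] - x)

print(satisfies_zero_requirement(identity, sizes))     # False
print(satisfies_zero_requirement(segregating, sizes))  # True
```

A function that fails this check, like identity here, depends on parts of the tree that are not segregating among the samples, which is the case strict=False is meant to opt into.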

11 replies

petrelharp Feb 5, 2021
Maintainer

Yeah, that's my worry. This check has already caught at least one bug in statistics implementation.

In fact, I think you have a (conceptual) bug here: really, your code should be

def allele_counts(ts, samples):

    def f(x):
        return x

    return ts.sample_count_stat(
        [samples],
        f,
        1,
        windows="sites",
        polarised=False,
        mode="site",
        span_normalise=False,
        strict=False,
    )

... since you're returning the allele counts, per site, which should be equal to the number of samples for fixed alleles, not 0.

petrelharp Feb 5, 2021
Maintainer

Note: this is nearly what I've got here: #504

grahamgower Feb 5, 2021
Collaborator Author

Ah, ok, thanks. I guess I've been struggling with the stats framework because of the sheer volume of possibilities, and thus also the volume of information in the docs. The strict option makes me second guess myself too.

grahamgower Feb 5, 2021
Collaborator Author

In fact, I think you have a (conceptual) bug here: really, your code should be [...]

I've gone back and reread the tree stats paper, and I now see why the suggestion to have a strict="auto" option is a bad one. For some reason, I had misunderstood the strict=True note as suggesting that the user should manually check for the f(0) and f(n) cases, and return zero. But really, many summary functions inherently have the property that f(0)=f(n)=0, and we should have strict=True set here, to double check that the summary function implementation does indeed do this (rather than attempting to obtain this property with a separate if statement).
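As an illustration of a summary function that has this property inherently, here is a small numpy-only sketch for two sample sets. The sizes and the name f_divergence are hypothetical; the formula is the proportion of cross-set sample pairs that carry different alleles, chosen only because it vanishes at both extremes without any if statement.

```python
import numpy as np

n = np.array([10.0, 20.0])  # hypothetical sizes of two sample sets

def f_divergence(x):
    # x[i] is the number of samples in set i carrying the allele.
    # Count cross-set pairs in which exactly one sample carries it,
    # then divide by the total number of cross-set pairs.
    differing = x[0] * (n[1] - x[1]) + (n[0] - x[0]) * x[1]
    return np.array([differing / (n[0] * n[1])])

print(f_divergence(np.zeros(2)))            # [0.]
print(f_divergence(n))                      # [0.]
print(f_divergence(np.array([5.0, 0.0])))   # [0.5]
```

Because f is zero at all-zero counts and at the full sample-set sizes by construction, leaving strict=True on acts as a double check on the implementation and does not need to be worked around.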

jeromekelleher Feb 5, 2021
Maintainer

Great discussion to have here @grahamgower, perfect material!
