Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow null values to represent "unknown" or missing data #88

Open
jeromekelleher opened this issue Jun 29, 2021 · 7 comments
Open

Allow null values to represent "unknown" or missing data #88

jeromekelleher opened this issue Jun 29, 2021 · 7 comments

Comments

@jeromekelleher
Copy link
Member

I'm not suggesting that we do this for version 1.0, but here's a counter-proposal for the semantics of null values to #76.

There will be situations in which key parameters are unknown, or unknowable by particular inference methods. Forcing them to specify a value would be artificial, and unhelpful. It would be better if there was an agreed-upon way of stating this, and using "null" values seems like as good a way as any.

This has some desirable properties, as simulators that don't support unknown values (if we add it in to a later version of the spec) SHOULD raise an error when an unexpected type (null) is provided for a value.

@grahamgower
Copy link
Member

This will make it very hard to validate a model. E.g. how would the parser validate a deme that has a null start_time?

@jeromekelleher
Copy link
Member Author

Yeah, we'd need to choose the supported fields with care. The main point is that there can be useful semantics we might want to hang on null, which I don't think we should throw away all at once by making it equivalent to "not present".

@grahamgower
Copy link
Member

I'll note one other thing for future us while I think of it: the Builder API in demes-python is probably not setup to permit useful semantics for None values, because add_deme, add_migration, etc. have params with default set to None and this currently does mean "don't set a value in the data dictionary". (The Builder API isn't used when loading a yaml file though, so no problem there)

@apragsdale
Copy link
Member

apragsdale commented Jun 29, 2021

I've been thinking about this a bit more today, and I agree with Jerome that it's a lot more helpful to be able to place null in the scenario of unknown or unknowable parameters in the model. And this is probably a good time to settle out how it would work, when the number of supporting software is minimal and we don't have a 1.0 release out.

As for which parameters in a model we allow to be unknown, I think we can make a reasonable choice along the lines of allowing null for deme-specific parameters (e.g. sizes, selfing rates), but disallow null for parameters that determine the topology of the demography or start/end times for demes and epochs. That way we can still verify that we have a valid model as far as ancestor/descendant relationships and that time flows properly through the graph.

Maybe we only allow this in the YAML? Or should we allow, e.g., an epoch population size to be None in the Builder API?

Just to give a concrete example, this would be a valid model for Tracts (cc: @sgravel, do you agree?), which doesn't use population sizes:

description: Tracts model with no population sizes set. A, B, and C contribute to admixed deme D.
time_units: generations
demes:
- name: A
  epochs:
  - start_size: null
    end_time: 0
- name: B
  epochs:
  - start_size: null
    end_time: 0
- name: C
  epochs:
  - start_size: null
    end_time: 0
- name: D
  ancestors: [A, B]
  proportions: [0.8, 0.2]
  start_time: 10
  epochs:
  - start_size: null
    end_time: 0
pulses:
- time: 5
  source: C
  dest: D
  proportion: 0.1

@sgravel
Copy link

sgravel commented Jun 30, 2021

Yes, that is pretty much how I had coded the model, but with dummy placeholder values rather than null. So I can do without a null in the spec, but of course the explicit "null" is preferable for my purposes if only so that people don't confusedly go about assuming that the dummy variables are good enough for msprime simulations, of fussing over which value they should use.

@grahamgower
Copy link
Member

grahamgower commented Jun 30, 2021

I'm definitely -1 on doing this for Demes 1.0.

But if you want to do something Demes-like now, without waiting for the spec, you can use the demes.load_asdict() method to load the YAML. This won't validate the model, so you can have anything you like in there. You could then put a dummy value into all the start_size fields and convert it to a demes.Graph so the rest of the fields get validated. Something like this (untested):

import demes

def load_with_null_sizes(file):
    data = demes.load_asdict(file)
    demes = data.get("demes")
    if demes is None:
        raise ValueError("must specify one or more demes")
    for deme in demes:
        for k in ("start_size", "end_size", "size_function"):
            if deme.get(k) is not None:
                raise ValueError("start_size, end_size, and size_function must be omitted (or `null`)")
        demes["start_size"] = 1
    return demes.Graph.fromdict(data)

When writing such a model out to a file, you'd have to bypass the demes API again. E.g.

import demes

def dump_with_null_sizes(graph, file):
    data = graph.asdict_simplified()
    for deme in data["demes"]:
        for k in ("start_size", "end_size", "size_function"):
            if k in deme:
                del deme[k]
    with open(file) as f:
        _dump_yaml_fromdict(data, f)

# This is the inverse of demes.load_asdict(), but it's not part of the public API.
# We could easily make it part of the API if it proves useful.
# https://github.com/popsim-consortium/demes-python/blob/8ec47962acb5d023b169dac0affa13f03122b8f0/demes/load_dump.py#L44

import ruamel.yaml

def _dump_yaml_fromdict(data, fp, multidoc=False) -> None:
    """
    Dump data dict to a YAML file-like object.

    :param bool multidoc: If True, output the YAML document start line ``---``,
        and document end line ``...``, which indicate the beginning and end of
        a YAML document respectively. The start indicator is needed when
        outputing multiple YAML documents to a single file (or file stream).
        The end indicator is not strictly needed, but may be desirable
        depending on the underlying communication channel.
    """
    with ruamel.yaml.YAML(typ="safe", output=fp) as yaml:
        # Output flow style, but only for collections that consist only
        # of scalars (i.e. the leaves in the document tree).
        yaml.default_flow_style = None
        # Don't emit obscure unicode, output "\Uxxxxxxxx" instead.
        # Needed for string equality after round-tripping.
        yaml.allow_unicode = False
        # Keep dict insertion order, thank you very much!
        yaml.sort_base_mapping_type_on_output = False
        if multidoc:
            yaml.explicit_start = True
            yaml.explicit_end = True
        yaml.dump(data)

@jeromekelleher
Copy link
Member Author

I'd also prefer to kick this one down the road a bit, as it's a change that we can incorporate in 1.1 pretty smoothly, and we really do want to ship the spec and paper ASAP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants