Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Method to in-place subset a table #2666

Closed
jeromekelleher opened this issue Dec 15, 2022 · 3 comments
Closed

Method to in-place subset a table #2666

jeromekelleher opened this issue Dec 15, 2022 · 3 comments
Labels
C API Issue is about the C API enhancement New feature or request
Milestone

Comments

@jeromekelleher
Copy link
Member

jeromekelleher commented Dec 15, 2022

In #2665 I added a draft function that looked like this:

int
tsk_node_table_delete_rows(tsk_node_table_t *self, bool *delete_rows,
    tsk_flags_t TSK_UNUSED(options), tsk_id_t *id_map)
{
    int ret;
    tsk_node_table_t copy;
    tsk_node_t node;
    tsk_size_t j;
    tsk_id_t ret_id;

    ret = tsk_node_table_copy(self, &copy, 0);
    if (ret != 0) {
        goto out;
    }
    ret = tsk_node_table_clear(self);
    if (ret != 0) {
        goto out;
    }
    for (j = 0; j < copy.num_rows; j++) {
        if (id_map != NULL) {
            id_map[j] = TSK_NULL;
        }
        if (!delete_rows[j]) {
            tsk_node_table_get_row_unsafe(&copy, (tsk_id_t) j, &node);
            ret_id = tsk_node_table_add_row(self, node.flags, node.time, node.population,
                node.individual, node.metadata, node.metadata_length);
            if (ret_id < 0) {
                ret = (int) ret_id;
                goto out;
            }
            if (id_map != NULL) {
                id_map[j] = ret_id;
            }
        }
    }
out:
    tsk_node_table_free(&copy);
    return ret;
}

The idea is that we can keep/delete a subset of the rows in a table, in place, and optionally return the mapping of new-to-old IDs. I implemented this as delete_rows here, but it felt a bit unnatural and maybe it would be better to do it as keep_rows instead? Maybe subset(bool *keep_rows)? Any better ideas?

I juggled with the idea of providing a list of row indexes instead, but I think the boolean mask is more generally useful, and is easy to set up if you do have a list of rows you want to get rid of/keep.

The idea is to do this for all tables, and for tables that self-reference (e.g. mutation table) to automatically remap those references and to (by default) return an error if you try to delete a row that has references to it.

See also #1034

@jeromekelleher jeromekelleher added enhancement New feature or request C API Issue is about the C API labels Dec 15, 2022
@benjeffery
Copy link
Member

benjeffery commented Dec 17, 2022

Sounds like a good addition to the API - a boolean mask seems to best way to me. I assume that we're not too concerned with perf here? If we are an easy win would be to expand the table once (as the number of rows needed is known) and call tsk_node_table_add_row_internal instead.

As for naming, subset(bool *keep_rows) makes sense to me.

@benjeffery
Copy link
Member

benjeffery commented Dec 17, 2022

One more thought, just realised - the code above is almost equal to:

ret = tsk_node_table_copy(self, &copy, 0);
...
ret = tsk_node_table_clear(self);
...
ret = tsk_node_table_extend(self, copy, num_rows, num_indexes)

So maybe an alternative is adding id_map to extend??

@jeromekelleher
Copy link
Member Author

Aha - so it is!

x_table_extend is documented though, so changes break the 1.0 API contracts. Also we don't think about ID remapping at all in extend, which is a key part of the operation here (updating the mutation/individual table is sooooo tedious!).

I think tsk_x_table_subset(tsk_x_table_t *self, bool *keep_rows, tsk_flags_t options, tsk_id_t *id_map) is good, I think we should go with that.

@jeromekelleher jeromekelleher modified the milestone: C API 1.1.1 Jan 11, 2023
@benjeffery benjeffery added this to the C API 1.1.2 milestone Jan 11, 2023
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Jan 25, 2023
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Feb 3, 2023
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Feb 3, 2023
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Feb 5, 2023
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Feb 5, 2023
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Feb 9, 2023
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Feb 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C API Issue is about the C API enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants