profiles: Clarify Sample merge semantics #706

@jhalliday

I started out just looking at reordering and documenting the Sample message fields to make the distinction between those that are 'primary key' elements and those that are not a bit clearer, but I think it's a wider problem:

Encoding an array of raw observations into an array of Sample messages, when building a Profile message in a space-optimal manner, is effectively a merge (reduce) operation. We want the minimum possible number of Sample instances without losing data.

For each observation, calculate the key as a tuple of the identity fields {stack_index, sorted(attribute_indices), link_index}.

If a Sample with the same key already exists and both have a timestamp, the reduce proceeds with 'append' semantics: append the additional observation's timestamp (and value, if present) to the existing Sample.

If neither Sample has a timestamp, the merge semantics are dependent on the ValueType:
[Note: the ValueType is implicitly part of the identity key, but not part of the Sample message, which is ugly]

  • a counter type has 'sum' semantics: increment the value of the first sample by the value of the second, resulting in a single-element values array. Assume a value of 1 if either is missing.

  • all other ValueTypes have 'append' semantics: append the additional observation's value to the values array.
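
A minimal sketch of the merge described above, in Python for illustration only. The `Observation` and `Sample` classes, the `merge` function and the `is_counter` flag are hypothetical names, not part of the proto. The flag stands in for the ValueType, which lives outside the Sample. The case where one side has a timestamp and the other does not is not covered by the rules above, so the sketch simply rejects it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:  # hypothetical raw input record
    stack_index: int
    attribute_indices: list[int]
    link_index: int
    value: Optional[int] = None
    timestamp: Optional[int] = None


@dataclass
class Sample:  # simplified stand-in for the Sample message
    stack_index: int
    attribute_indices: list[int]
    link_index: int
    values: list[int] = field(default_factory=list)
    timestamps: list[int] = field(default_factory=list)


def identity_key(obs: Observation) -> tuple:
    # {stack_index, sorted(attribute_indices), link_index}
    return (obs.stack_index, tuple(sorted(obs.attribute_indices)), obs.link_index)


def merge(observations: list[Observation], is_counter: bool) -> list[Sample]:
    samples: dict[tuple, Sample] = {}
    for obs in observations:
        key = identity_key(obs)
        has_ts = obs.timestamp is not None
        sample = samples.get(key)
        if sample is None:
            sample = samples[key] = Sample(
                obs.stack_index, sorted(obs.attribute_indices), obs.link_index
            )
        elif bool(sample.timestamps) != has_ts:
            # Assumption: mixing timestamped and untimestamped data is an error.
            raise ValueError("mixed timestamp presence is not covered by the rules")

        if has_ts:
            # 'append' semantics: timestamp, plus the value if present.
            sample.timestamps.append(obs.timestamp)
            if obs.value is not None:
                sample.values.append(obs.value)
        elif is_counter:
            # 'sum' semantics: a missing value counts as 1.
            increment = 1 if obs.value is None else obs.value
            current = sample.values[0] if sample.values else 0
            sample.values = [current + increment]
        elif obs.value is not None:
            # 'append' semantics for every other ValueType.
            sample.values.append(obs.value)
    return list(samples.values())
```

Note that `merge` cannot choose the right branch without `is_counter` being supplied from outside the Sample.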

This has two effects:

  • Merge semantics of all ValueTypes must be known at compile time, in order that the correct algorithm can be applied. This makes them non-extensible.
    Maybe make them self-describing by adding an enum field - this is effectively a variation on the AggregationTemporality problem.

  • There is no compact representation for 'an observation with value X was made Y times'.

Do we care? Appending repeated values to the array works and if we assume timestamps are the common case or compression is always used on messages, we probably don't need to worry.

Alternatively, a separate 'occurrence_count' field could be used, but only if timestamps are not present. This would effectively allow non-counter types to treat the value as part of the Sample identity key, similar to an attribute. The occurrence_count always having 'sum' semantics even for non-counter ValueTypes is fine. However, for counter types this is at best confusing duplicate information, or at worst inconsistent, as it has the same purpose as value.
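
A rough sketch of the occurrence_count idea for untimestamped, non-counter observations, in Python for illustration. The function name `merge_with_occurrence_count`, the tuple input shape and the dict output shape are all hypothetical. It assumes the value joins the identity key and that occurrence_count is summed.

```python
from collections import Counter


def merge_with_occurrence_count(observations):
    """Collapse untimestamped observations into (key, value, count) records.

    Each observation is a hypothetical tuple:
    (stack_index, attribute_indices, link_index, value).
    """
    counts = Counter()
    for stack_index, attribute_indices, link_index, value in observations:
        # The value is treated as part of the identity key, like an attribute.
        key = (stack_index, tuple(sorted(attribute_indices)), link_index, value)
        counts[key] += 1  # occurrence_count has 'sum' semantics

    return [
        {
            "stack_index": stack_index,
            "attribute_indices": list(attrs),
            "link_index": link_index,
            "value": value,
            "occurrence_count": n,
        }
        for (stack_index, attrs, link_index, value), n in counts.items()
    ]
```

With this shape, 'value X observed Y times' costs one record rather than Y repeated array entries.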

Perhaps have Profile.sample be oneof(Sample|AggregateSample) such that we carry the sum/append semantics in the message type information instead of an enum and can have different fields: AggregateSample removes the timestamps field and the 'repeated' modifier on value, and adds occurrence_count.
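
To make the oneof idea concrete, here is a sketch using Python dataclasses in place of proto messages. The field layout of `AggregateSample` and the `merge_into` helper are hypothetical, and the sketch assumes the caller only merges records that share an identity key (including value, for `AggregateSample`).

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Sample:  # 'append' semantics
    stack_index: int
    attribute_indices: tuple
    link_index: int
    values: list = field(default_factory=list)
    timestamps: list = field(default_factory=list)


@dataclass
class AggregateSample:  # 'sum' semantics: no timestamps, single value
    stack_index: int
    attribute_indices: tuple
    link_index: int
    value: int
    occurrence_count: int = 1


# Stand-in for oneof(Sample|AggregateSample).
ProfileSample = Union[Sample, AggregateSample]


def merge_into(existing: ProfileSample, incoming: ProfileSample) -> ProfileSample:
    # The message type alone selects the merge semantics; no enum is consulted.
    if type(existing) is not type(incoming):
        raise TypeError("cannot merge Sample with AggregateSample")
    if isinstance(existing, AggregateSample):
        existing.occurrence_count += incoming.occurrence_count
    else:
        existing.values.extend(incoming.values)
        existing.timestamps.extend(incoming.timestamps)
    return existing
```

A reader that does not recognise a ValueType could still merge correctly here, since the choice depends only on which message type is present.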
