Skip to content

Conversation

pbibra
Copy link
Collaborator

@pbibra pbibra commented Sep 5, 2025

Summary

As per: #51 (comment), we'd like to abstract away fork specific span schema details when building nodes for a trace dependency graph.

This PR introduces a configurable dependency graph node metadata mapping for Zipkin spans. Instead of hardcoding tag names for filling in node metadata, the logic can be driven by a YAML configuration file. This makes field resolution more flexible and maintainable, while allowing conditional overrides for specific span types.

The config format is defined as follow:

node_metadata_tag_mapping:
  <metadata-field-name>:
    default_key: <span.tag>
    default_value: <default-value>
    rules:
      - field: <span.tag>
        value: <match-value>
        override_key: <span.tag>

where rules can be a list on various fields which will be applied sequentially

The GraphConfig class:

  • Supports node_metadata_tag_mapping for defining default tag keys and fallback values.
  • Supports rules, grouped by logical field, to conditionally override which tag key to use based on other span attributes.
  • Exposes a resolve(...) method to compute the correct tag value for a given field, taking into account both defaults and applicable rules.

The GraphService will now take in a GraphConfig instance. If the config file is not provided at startup, this instance will be null in which case, the eventual graph builder should fall back to reasonable defaults e.g. using required span fields as metadata.

Testing

 ▲ ~ kubectl logs --context=m-obs-b-ea1-us --namespace=airtrace-traces-staging airtrace-query-traces-staging-59f75986c-pspsz | grep INFO
{"@timestamp":"2025-09-12T21:18:35.567Z","level":"INFO","message":"Using the APIs optimized for: Java 12+","thread.name":"armeria-common-worker-epoll-3-2","logger_name":"com.linecorp.armeria.internal.common.JavaVersionSpecific"}
{"@timestamp":"2025-09-12T21:18:57.676Z","level":"INFO","message":"GraphConfig: {\n  \"nodeMetadataTagMapping\" : {\n    \"app\" : {\n      \"defaultKey\" : \"kube.app\",\n      \"defaultValue\" : \"unknown_app\",\n      \"rules\" : [ ]\n    },\n    \"namespace\" : {\n      \"defaultKey\" : \"kube-namespace\",\n      \"defaultValue\" : \"unknown_namespace\",\n      \"rules\" : [ ]\n    },\n    \"operation\" : {\n      \"defaultKey\" : \"operation_name\",\n      \"defaultValue\" : \"unknown_operation\",\n      \"rules\" : [ ]\n    },\n    \"resource\" : {\n      \"defaultKey\" : \"resource\",\n      \"defaultValue\" : \"unknown_resource\",\n      \"rules\" : [ {\n        \"field\" : \"operation_name\",\n        \"value\" : \"http.request\",\n        \"overrideKey\" : \"tag.operation.canonical_path\"\n      } ]\n    }\n  }\n}","thread.name":"armeria-common-worker-epoll-3-3","logger_name":"com.slack.astra.graphApi.GraphService","trace_id":"b096b3282ceca4c0","span_id":"b096b3282ceca4c0"}

Requirements

@pbibra pbibra marked this pull request as ready for review September 12, 2025 21:36
Comment on lines 143 to 154
// Apply rules in order. Later matching rules override earlier ones.
for (RuleConfig rule : baseCfg.getRules()) {
// Get the value of the field defined in the rule.
String matchVal = tags.getOrDefault(rule.getField(), "unknown_" + rule.getField());
// If the rule value matches, override which key to use for resolution of the logical field
// only if it exists in tags, otherwise keep the default.
if (rule.getValue().equals(matchVal)) {
if (tags.containsKey(rule.getOverrideKey())) {
keyToUse = rule.getOverrideKey();
}
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given the later matching override, I'd probably implement this by reversing the list and picking the first one that matches. Using streams, it could look like this.

Suggested change
// Apply rules in order. Later matching rules override earlier ones.
for (RuleConfig rule : baseCfg.getRules()) {
// Get the value of the field defined in the rule.
String matchVal = tags.getOrDefault(rule.getField(), "unknown_" + rule.getField());
// If the rule value matches, override which key to use for resolution of the logical field
// only if it exists in tags, otherwise keep the default.
if (rule.getValue().equals(matchVal)) {
if (tags.containsKey(rule.getOverrideKey())) {
keyToUse = rule.getOverrideKey();
}
}
}
// Later rules override earlier ones, so start from the back of the list and use the first one that matches
// where the rule's override key exists in the document.
String keyToUse = baseCfg.getRules().reversed().stream()
.filter(rule -> rule.getValue().equals(tags.getOrDefault(rule.getField(), "unknown_" + rule.getField()))
.map(RuleConfig::getOverrideKey)
.filter(tags::containsKey)
.findFirst().orElse(baseCfg.getDefaultKey());

having a method on Rule for checking the field would make that read cleaner by making the first filter less noisy.

Also if the resolve happens in a hot path, you might want to do the reversing of the list at config construction time, among other things, but that should probably happen later.

public class GraphConfigTest {

@Test
public void testTagConfigGettersAndSetters() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you leave off the getter / setter testing? I think the main behavior we're gonna care about is how resolution/normalization works, so I'd expect most of the tests to be about that.

Also, I would expect that the caller wouldn't care about the specifics of the default key/values.

Comment on lines 244 to 246
GraphConfig graphConfig =
GraphConfig.load(astraConfig.getQueryConfig().getDepGraphConfigFile());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if this bit followed the pattern of the schema file, where if the file is not defined, we log that it isn't and use the default whatever that ends up being.

https://github.com/airbnb/kaldb/pull/52/files#diff-1def4f01c8d94dbbb54e4c9ed39002029aa0861c11d07f0d38d660973b80c815R519-R528

return mapper.readValue(yaml, GraphConfig.class);
} catch (Exception e) {
LOG.warn("Failed to read or parse dependency graph config file. Returning null config", e);
return null;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of null, could this return a default instance that maybe noops resolution or something?

Comment on lines +78 to +96
GraphConfig.TagConfig serviceConfig = config.getNodeMetadataTagMapping().get("service");
assertThat(serviceConfig.getDefaultKey()).isEqualTo("service.name");
assertThat(serviceConfig.getDefaultValue()).isEqualTo("unknown_service");
assertThat(serviceConfig.getRules()).hasSize(2);

GraphConfig.RuleConfig rule1 = serviceConfig.getRules().getFirst();
assertThat(rule1.getOverrideKey()).isEqualTo("prod.service.name");
assertThat(rule1.getField()).isEqualTo("cluster.name");
assertThat(rule1.getValue()).isEqualTo("prod");

GraphConfig.RuleConfig rule2 = serviceConfig.getRules().get(1);
assertThat(rule2.getOverrideKey()).isEqualTo("test.service.name");
assertThat(rule2.getField()).isEqualTo("cluster.name");
assertThat(rule2.getValue()).isEqualTo("staging");

GraphConfig.TagConfig clusterConfig = config.getNodeMetadataTagMapping().get("cluster");
assertThat(clusterConfig.getDefaultKey()).isEqualTo("cluster.name");
assertThat(clusterConfig.getDefaultValue()).isEqualTo("unknown_cluster");
assertThat(clusterConfig.getRules()).hasSize(0);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about instead of these, do something more like this, covering the rules and making sure to catch the fallthrough behavior.

var emptyTestMap = Map.of();

assertThat(config.resolve(emptyTestMap, "service").isEqualTo("unknown_service");
// ... etc etc

var testMap = Map.of("service.name", "something", "cluster.name", "prod");
assertThat(config.resolve(testMap, "service").isEqualTo("something");
assertThat(config.resolve(testMap, "cluster").isEqualTo("prod....");

}

@Test
public void testResolveWithDefaultValue() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, you have some of the resolution tests later.

}

// Loads a GraphConfig from a YAML file on disk. On failure, log and return a null config.
public static GraphConfig load(String configFile) throws IOException {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about breaking this up into a load(Path filePath) and load(String config)? that would make it easier to test against.


@Test
public void testResolveWithDefaultValue() {
GraphConfig config = createTestConfig();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way of referring to a test fixture makes the test harder to follow. If you had GraphConfig.load(String configContents), you could have each test do GraphConfig.load("""<yaml>""");, which would make them easier to skim.

Comment on lines 53 to 67
"""
nodeMetadataTagMapping:
service:
defaultKey: service.name
defaultValue: unknown_service
rules:
- field: cluster.name
value: prod
overrideKey: prod.service.name
- field: cluster.name
value: staging
overrideKey: test.service.name
cluster:
defaultKey: cluster.name
defaultValue: unknown_cluster

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having the yaml here in the test case is nice

@pbibra pbibra merged commit f825cb2 into airbnb-main Sep 19, 2025
2 checks passed
@pbibra pbibra deleted the pbibra-graph-config branch September 19, 2025 21:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants