Remodel rule aggregation format #1265

anderseknert · 2024-11-18T11:16:48Z

Regal evaluates aggregate rules in a 2-step process:

1. Have all aggregate rules collect data they're interested in and send it back to the Go process
2. Have the Go process launch another Rego eval where all aggregate_report rules are queried with this data as input.

Each file evaluated by the aggregate rule will result in a data structure like this returned:

{
  "organizational/at-least-one-allow": [
    {
      "aggregate_data": [{"aggregated": "data"}],
      "aggregate_source": {
        "file": "foo/bar.rego",
        "package_path": ["foo", "bar"]
      },
      "rule": {
        "category": "organizational",
        "title": "at-least-one-allow"
      }
    },
    {
       "one of these": "for each aggregated entry!"
    }
  ],
  "other/aggregate_rule" : []  
}

Each input file will then append a similar object to the array of any given rule. This data is verbose, and while it's easily readable, that's not a huge win here as this isn't meant to be consumed by humans in the first place. Running regal lint on its own bundle results in almost one megabyte of this data being sent back and forth, and we don't even have many aggregate rules! As recent work has shown, lots of data equals lots of allocations. Sometimes that's unavoidable, but this is one place where we could be optimizing more.

To make this more efficient, we should look into:

- Drop the rule attribute from the payload. The outer map is already keyed by category/rule, and repeating it in every entry is just a waste of resources.
- Use an object keyed by filenames rather than an array duplicating data.
- Have each aggregate rule append to the aggregate_data list for its file rather than to the top-level one.

{
  "organizational/at-least-one-allow": {
    "foo/bar.rego": {
      "package_path": ["foo", "bar"],
      "aggregate_data": [{"aggregated": "data"}, {"more": "items"}]
    },
    "baz.rego": {}
  },
  "other/aggregate_rule" : {}  
}
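To make the difference between the two shapes concrete, here is a minimal Go sketch that converts the current array-based format into the proposed file-keyed one. All type and function names (OldEntry, FileAggregate, ToFileKeyed) are hypothetical and made up for illustration; they are not Regal's actual Go types, and the sketch assumes that entries for the same file can simply have their aggregate_data concatenated.

```go
package main

// NOTE: hypothetical types mirroring the JSON shapes in this issue,
// not Regal's actual internal types.

// Source identifies the file an entry came from (current format).
type Source struct {
	File        string   `json:"file"`
	PackagePath []string `json:"package_path"`
}

// RuleRef is the per-entry rule attribute that the proposal drops.
type RuleRef struct {
	Category string `json:"category"`
	Title    string `json:"title"`
}

// OldEntry is one element of the per-rule array in the current format.
type OldEntry struct {
	AggregateData   []any   `json:"aggregate_data"`
	AggregateSource Source  `json:"aggregate_source"`
	Rule            RuleRef `json:"rule"`
}

// FileAggregate is the per-file value in the proposed format.
// Both fields are omitted when empty, so a file without aggregates
// serializes as {}.
type FileAggregate struct {
	PackagePath   []string `json:"package_path,omitempty"`
	AggregateData []any    `json:"aggregate_data,omitempty"`
}

// ToFileKeyed converts rule -> []entry into rule -> file -> aggregate.
// The rule attribute is dropped, since the outer key already carries it.
func ToFileKeyed(old map[string][]OldEntry) map[string]map[string]FileAggregate {
	out := make(map[string]map[string]FileAggregate, len(old))
	for rule, entries := range old {
		byFile := make(map[string]FileAggregate, len(entries))
		for _, e := range entries {
			file := e.AggregateSource.File
			agg := byFile[file]
			if agg.PackagePath == nil {
				agg.PackagePath = e.AggregateSource.PackagePath
			}
			// Append to the list for this file rather than to a top-level array.
			agg.AggregateData = append(agg.AggregateData, e.AggregateData...)
			byFile[file] = agg
		}
		out[rule] = byFile
	}
	return out
}
```

A rule with no entries ends up with an empty object, matching "other/aggregate_rule" in the example above.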

As aggregate rules are also available to custom rule authors, this would be a breaking change, and we'll need to document well how to migrate from the old format to the new one, both on the aggregating side (where this would likely not require a change as long as authors use result.aggregate and we update that) and on the reporting side (where policies would need to be updated to handle the new format).

While we are at it, let's make sure we also handle empty aggregate results correctly and efficiently for both built-in and custom rules.

anderseknert · 2024-11-18T11:20:34Z

Thinking about it more, I guess we could even provide a way for the aggregate_report rules to be provided the data in the same format as they're currently served. This would be less efficient, but could be an option to avoid breaking existing policies. The question would be how to best opt in/out of that.. 🤔 or if it's worth maintaining..
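
As a rough illustration of what such a compatibility layer could look like on the Go side, the sketch below expands the file-keyed data for one rule back into the array form. The names (ToLegacy, LegacyEntry, FileAggregate) are hypothetical and not part of Regal, and the sketch assumes one legacy entry per item in aggregate_data, with the rule attribute rebuilt from the "category/title" key. Files are visited in sorted order only to keep the output deterministic.

```go
package main

import (
	"sort"
	"strings"
)

// NOTE: hypothetical types mirroring the JSON shapes in this issue,
// not Regal's actual internal types.

// FileAggregate is the per-file value in the proposed format.
type FileAggregate struct {
	PackagePath   []string `json:"package_path,omitempty"`
	AggregateData []any    `json:"aggregate_data,omitempty"`
}

type LegacySource struct {
	File        string   `json:"file"`
	PackagePath []string `json:"package_path"`
}

type LegacyRule struct {
	Category string `json:"category"`
	Title    string `json:"title"`
}

// LegacyEntry is one element of the per-rule array in the current format.
type LegacyEntry struct {
	AggregateData   []any        `json:"aggregate_data"`
	AggregateSource LegacySource `json:"aggregate_source"`
	Rule            LegacyRule   `json:"rule"`
}

// ToLegacy expands the file-keyed aggregates of a single rule into the
// array form. ruleKey is the outer map key, e.g. "category/title".
func ToLegacy(ruleKey string, files map[string]FileAggregate) []LegacyEntry {
	category, title, _ := strings.Cut(ruleKey, "/")

	// Go map iteration order is random, so sort the file names first.
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names)

	var out []LegacyEntry
	for _, name := range names {
		agg := files[name]
		// Files without aggregate data contribute no entries.
		for _, item := range agg.AggregateData {
			out = append(out, LegacyEntry{
				AggregateData:   []any{item},
				AggregateSource: LegacySource{File: name, PackagePath: agg.PackagePath},
				Rule:            LegacyRule{Category: category, Title: title},
			})
		}
	}
	return out
}
```

This re-introduces exactly the duplication the new format removes, so it would only make sense as an opt-in path for policies that have not been migrated yet.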

charlieegan3 · 2024-11-18T16:25:33Z

Worth checking if the caching of these can be improved too:
https://github.com/StyraInc/regal/blob/main/internal/lsp/cache/cache.go#L186-L226

There is some reformatting of that data when the data is cached, and then reformatting on the way out too in order to index by file and not rule.

anderseknert added performance design labels Nov 18, 2024
