Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remodel rule aggregation format #1265

Open
anderseknert opened this issue Nov 18, 2024 · 2 comments
Open

Remodel rule aggregation format #1265

anderseknert opened this issue Nov 18, 2024 · 2 comments

Comments

@anderseknert
Copy link
Member

Regal evaluates aggregate rules in a 2-step process:

  1. Have all aggregate rules collect data they're interested in and send it back to the Go process
  2. Have the Go process launch another Rego eval where all aggregate_report rules are queried with this data as input.

Each file evaluated by the aggregate rule will result in a data structure like this returned:

{
  "organizational/at-least-one-allow": [
    {
      "aggregate_data": [{"aggregated": "data"}],
      "aggregate_source": {
        "file": "foo/bar.rego",
        "package_path": ["foo", "bar"]
      },
      "rule": {
        "category": "organizational",
        "title": "at-least-one-allow"
      }
    },
    {
       "one of these": "for each aggregated entry!"
    }
  ],
  "other/aggregate_rule" : []  
}

Each input file will then append a similar object to the array of any given rule. This data is verbose, and while it's easily readable, that's not a huge win here as this isn't meant to be consumed by humans in the first place. Running regal lint on its own bundle results in almost one megabyte of this data being sent back and forth, and we don't even have many aggregate rules! As recent work has shown, lots of data equals lots of allocations. Sometimes that's unavoidable, but this is one place where we could be optimizing more.

To make this more efficient, we should look into:

  1. Drop the rule attribute from the payload. The outer map is already keyed by category/rule and repeating it in the payload is just a waste of resources.
  2. Use an object keyed by filenames rather than an array duplicating data.
  3. Have the aggregator rules append to the aggregate_data list for its file rather than the top level one.
{
  "organizational/at-least-one-allow": {
    "foo/bar.rego": {
      "package_path": ["foo", "bar"],
      "aggregate_data": [{"aggregated": "data"}, {"more": "items"}]
    },
    "baz.rego": {}
  },
  "other/aggregate_rule" : {}  
}

As aggregate rules are also available to custom rule authors, this would be a breaking change, and one we'll need to document well how to go from the old format to the new, both on the aggregating side (where this would likely not require a change as long as they are using result.aggregate and we update that) and on the reporting side (where policies would need to be updated to handled the new format).

While we are at it, let's make sure we also handle empty aggregate results correctly and efficiently for both built-in and custom rules.

@anderseknert
Copy link
Member Author

Thinking about it more, I guess we could even provide a way for the aggregate_report rules to be provided the data in the same format as they're currently served. This would be less efficient, but could be an option to avoid breaking existing policies. The question would be how to best opt in/out of that.. 🤔 or if it's worth maintaining..

@charlieegan3
Copy link
Member

Worth checking if the caching of these can be improved too:
https://github.com/StyraInc/regal/blob/main/internal/lsp/cache/cache.go#L186-L226

There is some reformatting of that data when the data is cached, and then reformatting on the way out too in order to index by file and not rule.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants
@anderseknert @charlieegan3 and others