[Amazon Security Lake] - OCSF v1.1 and how to move forward #11276
Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)
Pinging @elastic/fleet (Team:Fleet)
The OCSF schema allows an infinite number of fields if you include all optional fields and go to infinite depth (the chart shared here combined fields from all categories and profiles). But when a specific source generates an OCSF document it will have a finite, much smaller number of fields. It's the optional attributes that add depth; with required attributes only, I think you stop at 101 fields.

The good option: map all fields at levels 1 and 2, then map any further required fields, but map optional fields as flattened. I think it's better to transition to this even if there's no smooth migration path from the current mappings. That's decent for a generic solution. If deeper fields are useful, then I think the documents should be re-indexed with mappings that are specific to the source.

A more ambitious option: try to use dynamic templates to match OCSF fields regardless of where they are nested. I'm not sure it's possible to write match conditions that correctly match all the OCSF fields. Even if it is, there's still the issue that indexing a wide range of document types may exceed the maximum number of fields for an index. So ideally the dynamic templates would cover all fields as necessary, but documents from different sources would be directed to different indexes, so that the number of actual fields per index stays limited.
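To make the good option concrete, here is a minimal sketch of what such a mapping could look like, expressed as a Python dict for use with the Elasticsearch client. The field names are illustrative examples, not a complete OCSF mapping.

```python
# Illustrative sketch only: the first two levels mapped explicitly, deeper
# optional subtrees collapsed into `flattened` fields. Field names are examples.
level_2_mapping = {
    "properties": {
        "time": {"type": "date"},
        "severity_id": {"type": "integer"},
        "actor": {
            "properties": {
                "user": {"type": "flattened"},     # optional subtree deeper than level 2
                "process": {"type": "flattened"},  # optional subtree deeper than level 2
            }
        },
        "metadata": {
            "properties": {
                "version": {"type": "keyword"},
                "product": {"type": "flattened"},
            }
        },
    }
}

# Applying it to a test index with the official Python client might look like:
#   from elasticsearch import Elasticsearch
#   Elasticsearch("http://localhost:9200").indices.create(
#       index="test-ocsf-level2", mappings=level_2_mapping)
```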
The concern I have with the good option is that users will keep asking for more depth, as a lot of critical info is buried within these deeper levels. Also, the schema keeps expanding with every update; in 1.3 new categories and a bunch of new fields have been added, and eventually it will hit the field limit even with level 2 mappings sometime in the future.
I think we need to begin with some schema analysis to better understand the problem before we try to solve it. I strongly believe that tooling should be used to create both the mappings and the pipelines, and that the tooling should codify any decisions we make. It's impractical to spend time building fields.yml files by hand and then spend even more time reviewing them. Our time would be better spent building and reviewing the tooling. Each successive update to OCSF gets much easier to apply if there is automation around it.

IIUC, the events we want to represent correspond to OCSF classes. OCSF conveniently has JSON Schema representations of these classes available, which likely makes analysis easier since it is a standardized format. The schema allows for circular references (e.g. a user has a manager, and the manager is a user; or a process has a parent_process, and the parent_process is a process). When we translate this into concrete fields we get a tree of infinite depth, so we need to place limits on depth to control the number of fields in our mappings. This may not have much impact, depending on whether real events actually have a high level of nesting. When reaching these cycles we could choose to not index the data after a certain level; if a user wanted to override this decision they could then apply an … Moving the data stream to use …
I'd like to see data on:
And I would like to see the code behind the analysis just so we can verify it. With that data I think we would be in a better position to try to answer how we should structure data streams and the mappings to use OCSF in Elasticsearch.
I started looking at some similar data. It was convenient to work with data from https://github.com/ocsf/ocsf-lib-py, since that has versions of the schema merged into a single file. The script I wrote is here; it generates simplified representations of the schema down to a given depth.
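For readers without access to the linked script, a rough sketch of this kind of depth-limited traversal might look like the following. It is not the actual script, and it assumes a merged schema file with top-level `classes` and `objects` dicts whose entries expose an `attributes` dict, where an attribute either has a scalar `type` or references another object via `object_type`; the real ocsf-lib-py output may differ.

```python
# A rough sketch, not the script linked above. Schema layout is assumed.
import json

MAX_DEPTH = 3

def expand(attributes, objects, depth, prefix=""):
    """Turn schema attributes into dotted leaf paths, expanding objects up to MAX_DEPTH."""
    paths = []
    for name, attr in sorted(attributes.items()):
        path = f"{prefix}{name}"
        obj_ref = attr.get("object_type")
        if obj_ref and obj_ref in objects and depth < MAX_DEPTH:
            paths += expand(objects[obj_ref]["attributes"], objects, depth + 1, path + ".")
        else:
            paths.append((path, attr.get("type", obj_ref or "unknown")))
    return paths

if __name__ == "__main__":
    with open("ocsf-schema-1.3.0.json") as f:  # assumed file name
        schema = json.load(f)
    objects = schema.get("objects", {})
    for class_name, cls in sorted(schema.get("classes", {}).items()):
        for path, type_name in expand(cls.get("attributes", {}), objects, depth=1):
            print(f"{class_name}\t{path}\t{type_name}")
```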
Thanks for sharing, @chrisberkhout. I also wanted to delve into the schema to gain a better understanding. I took a different approach to 'depth' and examined where circular references occurred, stopping my traversal at these cycles.
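To illustrate the "stop at cycles" idea with a toy example (not the actual analysis code, and using the same assumed schema layout as the sketch above):

```python
# Stop expanding an object reference once that object type is already on the
# current path, instead of using a fixed depth limit.
def expand_until_cycle(attributes, objects, seen, prefix=""):
    paths = []
    for name, attr in sorted(attributes.items()):
        path = f"{prefix}{name}"
        obj_ref = attr.get("object_type")
        if obj_ref and obj_ref in objects and obj_ref not in seen:
            paths += expand_until_cycle(
                objects[obj_ref]["attributes"], objects, seen | {obj_ref}, path + "."
            )
        else:
            # Either a scalar leaf, or a reference that would start a cycle.
            paths.append(path)
    return paths

# Toy schema demonstrating the user -> manager -> user cycle:
toy_objects = {
    "user": {
        "attributes": {
            "name": {"type": "string_t"},
            "manager": {"object_type": "user"},
        }
    }
}
print(expand_until_cycle({"actor_user": {"object_type": "user"}}, toy_objects, seen=set()))
# -> ['actor_user.manager', 'actor_user.name']  (the manager subtree is not expanded again)
```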
As discussed in the meeting... This is based on OCSF schema 1.3.0, but I'd expect to see similar results for later versions. We can get each leaf key name and its value type, looking down to depth 10:
Total count of leaf field names:
Going deeper doesn't surface new leaf field names; we have already covered them all:
There are no duplicates: each leaf field name has only one value type:
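A hedged sketch of how this kind of check could be reproduced, building on the assumed schema layout from the earlier sketches (the real analysis may have worked differently):

```python
# Collect each leaf field name (last path segment) down to a depth limit and the
# set of value types it is associated with; distinct names can then be counted
# and any name with more than one type flagged.
import json
from collections import defaultdict

def leaf_name_types(schema, max_depth=10):
    objects = schema.get("objects", {})
    types_by_leaf = defaultdict(set)

    def walk(attributes, depth, prefix=""):
        for name, attr in attributes.items():
            obj_ref = attr.get("object_type")
            if obj_ref and obj_ref in objects and depth < max_depth:
                walk(objects[obj_ref]["attributes"], depth + 1, f"{prefix}{name}.")
            else:
                types_by_leaf[name].add(attr.get("type", obj_ref or "unknown"))

    for cls in schema.get("classes", {}).values():
        walk(cls.get("attributes", {}), depth=1)
    return types_by_leaf

if __name__ == "__main__":
    with open("ocsf-schema-1.3.0.json") as f:  # assumed file name
        leaves = leaf_name_types(json.load(f))
    print("distinct leaf field names:", len(leaves))
    print("names with more than one type:",
          {k: sorted(v) for k, v in leaves.items() if len(v) > 1})
```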
Updates after recent discussions:
With these necessary steps in place we should have an auto-mapper tool or dynamic templates ready that would generate most of the mappings for us. Refer to the OCSF Swagger documentation for more functionality: https://schema.ocsf.io/doc/index.html#/
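As one possible shape for that tooling, here is a minimal sketch of generating dynamic templates from a leaf-name/type table like the one discussed earlier in the thread. The OCSF-to-Elasticsearch type table is an assumption for illustration, not the integration's actual mapping rules.

```python
# If each leaf field name maps to exactly one value type, one dynamic template
# per leaf name with a wildcard path_match can cover that field wherever it nests.
OCSF_TO_ES = {  # assumed type translation, for illustration only
    "string_t": "keyword",
    "integer_t": "long",
    "long_t": "long",
    "boolean_t": "boolean",
    "timestamp_t": "date",
    "float_t": "double",
}

def build_dynamic_templates(types_by_leaf):
    templates = []
    for leaf, ocsf_types in sorted(types_by_leaf.items()):
        es_type = OCSF_TO_ES.get(next(iter(ocsf_types)))
        if es_type is None:
            continue  # object-valued or unknown types need separate handling
        templates.append({
            f"ocsf_{leaf}": {
                "path_match": f"*.{leaf}",
                "mapping": {"type": es_type},
            }
        })
    return {"dynamic_templates": templates}
```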
Current Scenario: For the last couple of months we've struggled to incorporate OCSF mappings into our traditional integration pipeline due to the nature of the OCSF schema and how deeply nested the OCSF mappings can get. We have a limited number of fields available on our end at the moment, with the field limit per data stream being 2048. We have heavily refactored the whole integration, separated mappings into individual units for better maintainability, and added dynamic mappings to supplement explicit field mappings in some scenarios, but this is not enough to make the integration sustainable going into the future. If we choose to stay up to date with newer OCSF versions, the direction will probably need to shift.
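For context, a quick way to see how close a data stream actually gets to that limit is to count the fields in its backing index mappings. A rough sketch follows; the endpoint and data stream name are assumptions, and the count ignores multi-fields and runtime fields, which also count toward the limit.

```python
import requests

ES = "http://localhost:9200"                              # assumed endpoint
DATA_STREAM = "logs-amazon_security_lake.event-default"   # example data stream name

def count_fields(properties):
    """Approximate field count: objects plus their leaves, ignoring multi-fields."""
    total = 0
    for field in properties.values():
        total += 1
        if "properties" in field:
            total += count_fields(field["properties"])
    return total

response = requests.get(f"{ES}/{DATA_STREAM}/_mapping").json()
for index_name, body in response.items():
    used = count_fields(body["mappings"].get("properties", {}))
    print(f"{index_name}: roughly {used} mapped fields (limit 2048)")
```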
Main issues:
We can easily maintain mappings up to level 2, but beyond that it becomes increasingly difficult.
The OCSF schema is growing quickly as new fields, classes, and entirely new categories are added. This kind of growth is hard to keep pace with if we stick to the traditional integration philosophy and are unable to remove some of the mapping limitations we currently have.
The initial build of the integration did not follow a specific flattening rule, i.e. some objects have highly nested mappings while others flatten out at levels 1-2. This makes it difficult for us to follow a uniform level-based mapping, as we would need to break the integration to move to a more maintainable level 2 mapping. It also costs the user quality of life, since flattening objects reduces the desirability of the integration (see the sketch below for what that trade-off looks like in practice).
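To make the QoL point concrete, a small hedged example of how the same leaf behaves under the two approaches (the field name is illustrative):

```python
# With an explicit mapping, process.pid can be a numeric type, so range queries
# and numeric aggregations behave as expected:
explicit_range_query = {"range": {"process.pid": {"gte": 1000}}}

# If the parent object is mapped as `flattened`, the same leaf is still reachable
# via dotted-key notation, but its value is indexed as a keyword, so ranges and
# sorts compare strings and numeric aggregations are unavailable:
flattened_term_query = {"term": {"process.pid": "4242"}}
```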
Steps we can take: We need to decide if we want to carry on supporting this integration as-is, or if we want to take a different approach altogether.
Some things that come to mind:
I'm open to suggestions and would like to hear your thoughts on this.
Related Issues: