
[system tests] Validate fields in transforms are documented based on mappings #2341

mrodm opened this issue Jan 14, 2025 · 8 comments
mrodm commented Jan 14, 2025

Follows #2207

In order to be agnostic to the structure of the documents ingested when running validation in system tests, it would be helpful to run validations by comparing the mapping definitions instead.

For those mappings that cannot be validated against the preview mappings in #2206, it is also necessary to validate whether they match any of the dynamic templates found in the data stream.

Packages may also define transforms with their own fields, which should be validated as well.

Check and, if possible, validate those fields based on the available mappings and dynamic templates.

Mappings and dynamic templates can be retrieved from these APIs:

  • Mappings and dynamic templates installed by Fleet before ingesting any doc.
    • These preview mappings can be retrieved using the simulate API:
      POST /_index_template/_simulate/<index_template_name>
      
      # Example
      POST /_index_template/_simulate/logs-microsoft_dhcp.log
      
  • Mappings and dynamic templates that are present after ingesting the docs as part of the system tests.
    • These mappings can be retrieved using this API:
      GET /<data_stream_test>/_mapping/
      
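As an illustration, the kind of check these mappings enable could be sketched like this (hypothetical helper names and simplified dynamic-template matching; not the actual elastic-package implementation):

```python
import fnmatch

def lookup_mapping(properties: dict, field_path: str):
    """Walk nested mapping properties (as returned by GET <index>/_mapping)
    to find the definition for a dotted field path, or None if absent."""
    parts = field_path.split(".")
    node = properties
    for i, part in enumerate(parts):
        if part not in node:
            return None
        node = node[part]
        if i < len(parts) - 1:
            node = node.get("properties", {})
    return node

def matches_dynamic_template(templates: list, field_path: str) -> bool:
    """Check whether a field path matches any dynamic template's path_match
    pattern (simplified: ignores match_mapping_type, unmatch, etc.)."""
    for entry in templates:
        template = next(iter(entry.values()))  # entries have a {"name": {...}} shape
        pattern = template.get("path_match")
        if pattern and fnmatch.fnmatch(field_path, pattern):
            return True
    return False

# Trimmed example shapes, mimicking what the APIs above return:
properties = {"event": {"properties": {"dataset": {"type": "constant_keyword"}}}}
templates = [{"strings_as_keyword": {"path_match": "labels.*", "mapping": {"type": "keyword"}}}]

assert lookup_mapping(properties, "event.dataset")["type"] == "constant_keyword"
assert matches_dynamic_template(templates, "labels.env")
assert lookup_mapping(properties, "labels.env") is None  # only a dynamic template covers it
```

A field would pass validation if it is found in the concrete properties or, failing that, matches a dynamic template.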

The destination index (transforms.dest.index) could be used in the above APIs:

{
  "count": 1,
  "transforms": [
    {
      "id": "logs-ti_anomali.latest_intelligence-default-0.1.0",
      "dest": {
        "index": "logs-ti_anomali_latest.intelligence-1",
        "aliases": [
          {
            "alias": "logs-ti_anomali_latest.intelligence",
            "move_on_creation": true
          }
        ]
      },
      ...
    }
  ]
}

The index template name also looks like it could be known in advance (suffix -template?):

GET /logs-ti_anomali_latest.intelligence-1/_mapping

# using index
POST /_index_template/_simulate_index/logs-ti_anomali_latest.intelligence-1

# using template 
POST /_index_template/_simulate/logs-ti_anomali.latest_intelligence-template
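For illustration, deriving the template name from the transform id could look like this (assuming a trailing '-<namespace>-<version>' suffix on the id, which is a guess based on the example above):

```python
import re

def template_name_from_transform_id(transform_id: str) -> str:
    """Hypothetical derivation: strip the trailing '-<namespace>-<version>'
    suffix from the transform id and append '-template'."""
    base = re.sub(r"-[^-]+-\d+\.\d+\.\d+$", "", transform_id)
    return f"{base}-template"

assert (template_name_from_transform_id("logs-ti_anomali.latest_intelligence-default-0.1.0")
        == "logs-ti_anomali.latest_intelligence-template")
```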

To be tested:

  • Run these validations in stack 7.x
  • Run these validations in stack 8.x
  • Run these validations in input and integration packages.
  • Run these validations in stacks with LogsDB enabled (synthetic _source).

mrodm commented Jan 29, 2025

Looking at how tests work and how transforms are tested by elastic-package, there are scenarios in which not all documents would be processed/validated by the transform.

For context, for every system test configuration defined in the package, the process is:

  • the documents found in the data stream are validated (via fields/mappings),
  • if there is any transform and it matches the given configuration, the documents processed by the transform are also validated (via fields/mappings).

So for each system test, elastic-package needs to wait until some documents have been processed by the transform before running the corresponding validations.

I think there is an issue that could be raised in two different scenarios:

  • there are several test configuration files defined for a specific data stream.
  • there are several data streams matching the same transform configuration (source.index).

Focusing on the first scenario to show the problem: a package could contain two system test configuration files for the same data stream. When validating the first test case, the process waits until the destination index of the transform has processed the expected documents. The problem is that the second test case will probably not wait for all of its documents, since the destination index (which is the same for all test configurations) has already processed the documents from the first test case.

Moreover, it could happen that the documents for the second test case have the same unique fields (e.g. event.dataset and package.event.id) as the first test case. If so, no new document is going to be processed by the transform. But if there were different documents with different package.event.id values, it is likely that the process would not wait for them.

For instance, if the first test produces 16 documents and the second test 10 documents (with different unique fields), the first transform validation would wait for those 16 documents. For the second test, however, as the transform destination already holds 16 documents, the wait would finish immediately and not wait for the additional 10 documents.

In contrast, if those 10 documents of the second test case have the same unique keys as the documents ingested by the first test case, the transform wait should not expect any new documents (none will be produced).
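A toy model of the wait makes the undercounting concrete (hypothetical, simplified logic; not elastic-package's actual implementation):

```python
def wait_target_naive(own_doc_count: int) -> int:
    """Naive wait target: stop once the destination index holds at least
    as many docs as this test case produced on its own."""
    return own_doc_count

def wait_is_satisfied(dest_count: int, target: int) -> bool:
    """The wait loop exits as soon as the destination count reaches the target."""
    return dest_count >= target

# Test 1: 16 docs with distinct unique keys -> destination ends with 16 docs.
dest_count = 16

# Test 2 produces 10 docs with *new* unique keys. The destination should
# eventually reach 26, but the naive target is only 10, so the wait
# returns immediately without the new docs being processed:
assert wait_is_satisfied(dest_count, wait_target_naive(10))

# A correct target would have to account for docs already present:
assert not wait_is_satisfied(dest_count, dest_count + 10)
```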

One option could be to validate transforms only once all data stream tests have been validated/tested, although that implies rethinking how system tests are run.
For instance, first run all data stream system tests in parallel, and once they are finished, test all transforms (in parallel too?).

Any other ideas on how to handle this? @jsoriano

As the validation based on fields is going to be kept (related: #2381), that behavior should remain the same as it is now.


jsoriano commented Jan 29, 2025

I'm not sure what the best approach is. With what we have now, and with the proposal of testing after all tests, we would not be testing in isolation; there would be mixed tests either way, we would just be choosing different ways of mixing them.

If isolation of tests is really the objective, I think the only way to do this would be, for each transform and test:

  • Clone the transform.
  • Replace the source index in the cloned transform to refer only to the tested data stream.
  • Run the transform tests checking the cloned transform.
  • Cleanup the clones and their destination indexes.

Though even in this case, maybe the developer actually expects multiple data streams to be executed at the same time, and we could not test that.
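A rough sketch of what the cloning idea could look like (hypothetical config shapes and naming; the real transform configuration would differ):

```python
import copy

def clone_transform_for_test(transform: dict, test_id: str, data_stream_index: str) -> dict:
    """Clone a transform config, narrow its source to the tested data stream
    only, and point it at a per-test destination index so each test is isolated."""
    clone = copy.deepcopy(transform)
    clone["source"]["index"] = [data_stream_index]
    clone["dest"]["index"] = f'{transform["dest"]["index"]}-{test_id}'
    return clone

original = {
    "source": {"index": ["logs-ti_anomali.latest_intelligence-*"]},
    "dest": {"index": "logs-ti_anomali_latest.intelligence-1"},
}
clone = clone_transform_for_test(
    original, "test01", "logs-ti_anomali.latest_intelligence-test01"
)

assert clone["source"]["index"] == ["logs-ti_anomali.latest_intelligence-test01"]
assert clone["dest"]["index"] == "logs-ti_anomali_latest.intelligence-1-test01"
assert original["dest"]["index"] == "logs-ti_anomali_latest.intelligence-1"  # untouched
```

The per-test destination index is what would later be deleted in the cleanup step.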

Maybe we should introduce some kind of explicit definition of transform tests, so developers can choose the sets of data streams and configurations the transform should be tested with, and these tests would be executed after the data stream tests, as you suggest.

Only some thoughts; I see pros and cons in all of the options.


mrodm commented Jan 30, 2025

Apart from the issue mentioned above, while working on validating transforms based on mappings, I've found the following issues:

  • there is a package where the source indices defined in the transform settings are not recognized as a list in 8.10.1:
  • transform state changes to "failed" when the documents produced by the second test case of a given data stream are being processed by the transform:
    • elastic-package, to ensure that tests can be run in parallel, assigns a different namespace value to each test.
    • Error:
      Failed to index documents into destination index due to permanent error: ... failed to parse field [data_stream.namespace] of type [constant_keyword] in document with id 'M3TykoJe00n44xIS1rdAO87CAAAAAAAA'. Preview of field's value: '59467'; java.lang.IllegalArgumentException: [constant_keyword] field [data_stream.namespace] only accepts values that are equal to the value defined in the mappings [75519], but got [59467]]
      
  • how to determine a threshold for the number of processed documents at which to stop waiting, while ensuring the corresponding new mappings (if any) of the test have been added to the destination index.


mrodm commented Jan 31, 2025

Created a PR in package-spec to disallow commas when setting the source.index (and destination index): elastic/package-spec#869


mrodm commented Feb 4, 2025

Related to elastic/package-spec#869, PR #2388 adds support for validating transforms that define their source indices as a list of comma-separated strings.

This would be needed for the elasticsearch package.
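A minimal sketch of such normalization (hypothetical helper; not the actual code in #2388):

```python
def normalize_source_indices(index_field) -> list:
    """Accept either a list of index patterns or a single string, where any
    string may itself contain comma-separated patterns, and return a flat list."""
    if isinstance(index_field, str):
        index_field = [index_field]
    result = []
    for entry in index_field:
        result.extend(p.strip() for p in entry.split(",") if p.strip())
    return result

assert normalize_source_indices("logs-a-*,logs-b-*") == ["logs-a-*", "logs-b-*"]
assert normalize_source_indices(["logs-a-*", "logs-b-*,logs-c-*"]) == [
    "logs-a-*", "logs-b-*", "logs-c-*"
]
```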


mrodm commented Feb 5, 2025

Given the comments above, I'm not sure it would be worthwhile to add validation based on mappings to transforms. There are scenarios where it would not be possible to ensure which documents are ingested/processed by the transform (and therefore that the mappings are created in the destination index). More info at: #2341 (comment) and #2341 (comment)

Moreover, adding this validation implies that all packages defining transforms are going to spend more time in testing.

As an example, the tychon package:

  • validation based on mappings just in the data streams takes ~30 min (buildkite link)
  • validation based on mappings enforced in transforms too takes ~1h 5min (buildkite link).

Probably the way transforms are validated should be revisited. One option could be to add a new test command for transforms (elastic-package test transform), with its own lifecycle and its own test documents. Or, as Jaime mentioned above, clone the transform and replace the source index so we know exactly which documents should be processed. In any case, this looks like a bigger task than initially expected.

WDYT? @jsoriano @kpollich Should we defer this issue, updating the description? Or close this one and create a new issue with this information to re-assess how to validate transforms in elastic-package?

@jsoriano

I agree that, in their current form, transforms are going to be really hard to validate properly. And I think we should not add this validation if it is going to increase build times so much.

The main problem I see is that we put few limitations on what transforms in packages can do, so it is difficult to handle all cases in system testing; for example, in the cases where logs-* patterns are used, it is difficult to control what we are receiving in the destination index.

I think we should go back to the design table and think about what transform use cases make sense to support in packages. A transform for logs-* probably doesn't make sense in an integration package, that should only care about its own data streams, but maybe it makes sense in a content package.

And given that there can be transforms matching different data streams, they should probably have their own tests.

We could add a meta-issue summarizing all the problems we have with transforms, we could also include #859 and #2242.


mrodm commented Feb 11, 2025

Created a meta-issue to discuss and work on transform validation in elastic-package: #2394

@jsoriano feel free to update the description if you think there is something missing, thanks!
