[FEATURE]Add flatten Command to PPL #669

Open
YANG-DB opened this issue Sep 16, 2024 · 4 comments
Labels
enhancement New feature or request Lang:PPL Pipe Processing Language support

Comments

@YANG-DB
Member

YANG-DB commented Sep 16, 2024

Is your feature request related to a problem?
OpenSearch Piped Processing Language (PPL) currently lacks a native command to flatten nested objects or arrays in documents. Many datasets, especially those containing JSON objects, have deeply nested fields that are difficult to work with in their raw form. The flatten command will simplify these structures and make it easier to analyze and extract data.

What solution would you like?
Introduce a flatten command in PPL that can handle arrays or nested fields, producing a flattened result that contains all the nested elements at the top level.

Syntax:

source=<data_source> | flatten <nested_field>  | fields <fields_to_select>
  • The flatten command takes a nested array or object field and returns each element as part of a flat structure.

Example Use Cases

  1. Flattening an Array Field
source=my-index  | flatten bridges | fields _time, bridges, city, country

This query flattens the bridges array field.

Example Input:

{
  "_time": "2024-09-13T12:00:00",
  "bridges": [
    {"name": "Tower Bridge", "length": 801},
    {"name": "London Bridge", "length": 928}
  ],
  "city": "London",
  "country": "England"
}

Expected Output:

[
  {
    "_time": "2024-09-13T12:00:00",
    "name": "Tower Bridge",
    "length": 801,
    "city": "London",
    "country": "England"
  },
  {
    "_time": "2024-09-13T12:00:00",
    "name": "London Bridge",
    "length": 928,
    "city": "London",
    "country": "England"
  }
]
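
The array case above can be sketched in plain Python to make the intended semantics concrete. This is only an illustration of the expected input/output mapping, not the PPL or Spark implementation; the helper name `flatten_array_field` is hypothetical, and the sketch assumes that element keys are hoisted unqualified to the top level, as in the expected output.

```python
from typing import Any


def flatten_array_field(row: dict[str, Any], field: str) -> list[dict[str, Any]]:
    """Emit one output row per element of row[field], hoisting element keys."""
    elements = row.get(field)
    if not isinstance(elements, list):
        # Nothing to flatten: pass the row through unchanged.
        return [dict(row)]
    # Every output row repeats the parent's other fields.
    parent = {k: v for k, v in row.items() if k != field}
    out = []
    for element in elements:
        if isinstance(element, dict):
            # Assumption: element keys win if they collide with a parent key.
            out.append({**parent, **element})
        else:
            # Scalar elements keep the original field name.
            out.append({**parent, field: element})
    return out


doc = {
    "_time": "2024-09-13T12:00:00",
    "bridges": [
        {"name": "Tower Bridge", "length": 801},
        {"name": "London Bridge", "length": 928},
    ],
    "city": "London",
    "country": "England",
}
rows = flatten_array_field(doc, "bridges")
```

With the example input, `rows` holds two records, one per bridge, each carrying the same `_time`, `city`, and `country` values.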
  2. Flattening a Nested Object
source=my-index | flatten details | fields _time, details 

This query flattens the details object field.

Example Input:

{
  "_time": "2024-09-13T12:00:00",
  "details": {
    "name": "Alice",
    "age": 30,
    "address": {
      "street": "Main St",
      "city": "New York"
    }
  }
}

Expected Output:

{
  "_time": "2024-09-13T12:00:00",
  "name": "Alice",
  "age": 30,
  "street": "Main St",
  "city": "New York"
}
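
The object case can be sketched the same way. Again, this is a minimal Python illustration of the expected output, assuming that flattening recurses through every nested level and hoists leaf keys unqualified; `flatten_object_field` is a hypothetical name, not part of any existing API.

```python
from typing import Any


def flatten_object_field(row: dict[str, Any], field: str) -> dict[str, Any]:
    """Replace row[field] with its leaf keys, recursing through nested objects."""
    value = row.get(field)
    if not isinstance(value, dict):
        # Nothing to flatten: return a copy of the row unchanged.
        return dict(row)
    out = {k: v for k, v in row.items() if k != field}

    def hoist(obj: dict[str, Any]) -> None:
        for key, val in obj.items():
            if isinstance(val, dict):
                hoist(val)  # descend into deeper levels
            else:
                out[key] = val  # assumption: later keys overwrite earlier ones

    hoist(value)
    return out


doc = {
    "_time": "2024-09-13T12:00:00",
    "details": {
        "name": "Alice",
        "age": 30,
        "address": {"street": "Main St", "city": "New York"},
    },
}
flat = flatten_object_field(doc, "details")
```

Here `flat` matches the expected output above: `details` and `address` disappear and their leaf values sit at the top level.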

Additional Considerations

  • The flatten command should work efficiently with large arrays or deeply nested structures.
  • It must handle complex JSON objects where multiple levels of nesting exist.
  • Consider supporting multi-level flattening for more deeply nested fields (e.g., flatten details.address).
@YANG-DB YANG-DB added enhancement New feature or request untriaged Lang:PPL Pipe Processing Language support labels Sep 16, 2024
@vamsi-amazon
Member

Shouldn't it be bridges.name in the flattened object?
What if multiple object fields have the same key inside them?
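
To illustrate the collision being asked about, here is a small Python sketch that compares unqualified hoisting with path-qualified names such as bridges.name. Both helper names and the `owner` field are hypothetical and exist only for this illustration; the sketch does not reflect a decided behavior for the command.

```python
from typing import Any


def hoist_unqualified(obj: dict[str, Any]) -> dict[str, Any]:
    """Hoist leaf keys without a prefix; colliding keys silently overwrite."""
    flat: dict[str, Any] = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            flat.update(hoist_unqualified(value))
        else:
            flat[key] = value
    return flat


def hoist_qualified(obj: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Hoist leaf keys under their full dotted path, so siblings cannot collide."""
    flat: dict[str, Any] = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(hoist_qualified(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical document: two object fields that both contain a "name" key.
doc = {
    "city": "London",
    "bridges": {"name": "Tower Bridge"},
    "owner": {"name": "Example Owner"},
}
unqualified = hoist_unqualified(doc)  # one "name" value is lost
qualified = hoist_qualified(doc)      # both values survive
```

In the unqualified result only the last `name` survives, while the qualified result keeps `bridges.name` and `owner.name` side by side.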

@YANG-DB YANG-DB removed the untriaged label Sep 16, 2024
@salyh
Contributor

salyh commented Oct 1, 2024

Shouldn't it be bridges.name in the flattened object? What if multiple object fields have the same key inside them?

As mentioned above: "Consider supporting multi-level flattening for more deeply nested fields (e.g., flatten details.address)." I read it as: yes, we support it. The question is what the default should be and whether it should be configurable.

When dealing with nested fields, see #565

@salyh
Contributor

salyh commented Oct 5, 2024

@YANG-DB @vamsi-amazon

cc @dr-lilienthal

Before opening a PR, a few design questions and requirement refinements need to be discussed.

  1. Do the terms "nested objects or arrays" and nested_field refer to a) the "Nested Field" datatype in OpenSearch, as described here, or b) just a field in a JSON document whose value is another JSON array or object? In the first case it's relevant to point out that the OpenSearch "Nested Field" datatype is only for arrays and not objects. So the assumption is that b) applies.

  2. If b) above applies, then it appears that nested arrays in OpenSearch are always "NULL" when queried via Spark SQL or Spark PPL. Nested objects, however, can be queried as expected. This is possibly a bug and needs to be addressed before this issue can be solved.

  3. It would help to add the expected input and output in a table structure to the examples, because it's not yet clear whether the flattened object should be added as separate fields to the data row.

  4. Clarify the relation between flatten and expand_field, as proposed in "[FEATURE]New expand_field PPL Command" (#657).

@YANG-DB
Member Author

YANG-DB commented Oct 8, 2024

@salyh

  1. IMO a general nested field (not specifically an OpenSearch mapping).
  2. Yes, we need to solve this specific OpenSearch PPL issue separately.
  3. Examples:
source=employees 
| flatten contact 
| fields name, age, contact.phone as phone, contact.address.city as city, contact.address.zipcode as zipcode

results using alias:

Name   Age  Phone     City         Zipcode
Alice  30   123-4567  New York     10001
Alice  30   789-0123  New York     10001
Bob    25   234-5678  Los Angeles  90001
Bob    25   890-1234  Los Angeles  90001

source=employees 
| flatten contact 
| fields name, age, contact.phone, contact.address.city, contact.address.zipcode

results:

Name   Age  contact.phone  contact.address.city  contact.address.zipcode
Alice  30   123-4567       New York              10001
Alice  30   789-0123       New York              10001
Bob    25   234-5678       Los Angeles           90001
Bob    25   890-1234       Los Angeles           90001
  4. IMO we can merge both; let's discuss this more.
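
For reference, the second result table in point 3 (qualified column names) can be reproduced with a short Python sketch. The shape of the `employees` input is an assumption, since the thread only shows the output: here `contact` is taken to be an array of objects, each holding a `phone` and a nested `address`. The helper names `qualify` and `flatten_rows` are hypothetical.

```python
from typing import Any


def qualify(value: Any, prefix: str) -> dict[str, Any]:
    """Turn a nested value into {dotted.path: leaf} pairs rooted at prefix."""
    if not isinstance(value, dict):
        return {prefix: value}
    flat: dict[str, Any] = {}
    for key, child in value.items():
        flat.update(qualify(child, f"{prefix}.{key}"))
    return flat


def flatten_rows(row: dict[str, Any], field: str) -> list[dict[str, Any]]:
    """One output row per element of row[field], with dotted column names."""
    value = row.get(field)
    elements = value if isinstance(value, list) else [value]
    parent = {k: v for k, v in row.items() if k != field}
    return [{**parent, **qualify(element, field)} for element in elements]


# Assumed input shape, chosen only so that it matches the result table.
employees = [
    {"name": "Alice", "age": 30, "contact": [
        {"phone": "123-4567", "address": {"city": "New York", "zipcode": "10001"}},
        {"phone": "789-0123", "address": {"city": "New York", "zipcode": "10001"}},
    ]},
    {"name": "Bob", "age": 25, "contact": [
        {"phone": "234-5678", "address": {"city": "Los Angeles", "zipcode": "90001"}},
        {"phone": "890-1234", "address": {"city": "Los Angeles", "zipcode": "90001"}},
    ]},
]
columns = ["name", "age", "contact.phone",
           "contact.address.city", "contact.address.zipcode"]
table = [[row[c] for c in columns]
         for employee in employees
         for row in flatten_rows(employee, "contact")]
```

Under this assumed input, `table` contains four rows, two per employee, in the same column order as the result table. The aliased variant would only rename the columns after flattening.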

Projects
Status: In Progress

3 participants