[BUG] JSON arrays can not be queried as expected (return NULL or unsupported data type: JString(nested)) #750

salyh · 2024-10-07T09:11:05Z

What is the bug?
While working on #669 I noticed that JSON arrays indexed into OpenSearch always return NULL when queried via Spark. JSON objects and primitives work as expected.

How can one reproduce the bug?
Steps to reproduce the behavior:

Index the following documents into OpenSearch with default dynamic mapping:

curl localhost:9200/json/_bulk -H 'content-type: application/json' -d '
{"index":{"_id":"1"}}
{"id":3,"name":"Bob Smith","title":null,"projects":[{"name":"SQL Spectrum querying","started_year":1990},{"name":"SQL security","started_year":1999},{"name":"OpenSearch security","started_year":2015}]}
{"index":{"_id":"2"}}
{"id":4,"name":"Susan Smith","title":"Dev Mgr","projects":[]}
{"index":{"_id":"3"}}
{"id":6,"name":"Jane Smith","title":"Software Eng 2","projects":[{"name":"SQL security","started_year":1998},{"name":"Hello security","started_year":2015,"address":[{"city":"Dallas","state":"TX"}]}]}
{"index":{"_id":"4"}}
{"id":7,"name":"Jane Smith2","title":"Software Eng 22","projectsasobject":{"name":"SQL security","started_year":1998}}
'

Connect Spark to OpenSearch
Open a Spark Shell

./spark-bin/bin/spark-shell -c spark.sql.catalog.dev=org.apache.spark.opensearch.catalog.OpenSearchCatalog \
  --packages "org.opensearch:opensearch-spark-standalone_2.12:0.6.0-SNAPSHOT,org.opensearch:opensearch-spark-ppl_2.12:0.6.0-SNAPSHOT" \
  --conf "spark.sql.extensions=org.opensearch.flint.spark.FlintPPLSparkExtensions,org.opensearch.flint.spark.FlintSparkExtensions"

Run val dfp = spark.sql("source=dev.default.json"); dfp.show()

+-----------+--------------------+---+--------+---------------+                 
|       name|    projectsasobject| id|projects|          title|
+-----------+--------------------+---+--------+---------------+
|  Bob Smith|                NULL|  3|    NULL|           NULL|
|Susan Smith|                NULL|  4|    NULL|        Dev Mgr|
| Jane Smith|                NULL|  6|    NULL| Software Eng 2|
|Jane Smith2|{SQL security, 1998}|  7|    NULL|Software Eng 22|
+-----------+--------------------+---+--------+---------------+

In the first three rows, the projects column is not expected to be NULL.
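For comparison, the same documents can be parsed by Spark's own JSON reader, bypassing OpenSearch entirely. This is only a sketch meant to be pasted into the same spark-shell session; it uses two shortened documents from the bulk request above, and the names `session`, `docs` and `expected` are chosen here for illustration. With this input, `projects` should be inferred as an array of structs rather than come back as NULL.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the session that is already running.
val session = SparkSession.builder().getOrCreate()
import session.implicits._

// Two (shortened) documents from the bulk request, as raw JSON strings.
val docs = Seq(
  """{"id":3,"name":"Bob Smith","title":null,"projects":[{"name":"SQL Spectrum querying","started_year":1990}]}""",
  """{"id":4,"name":"Susan Smith","title":"Dev Mgr","projects":[]}"""
).toDS()

// Let Spark infer the schema directly from the JSON strings.
val expected = session.read.json(docs)
expected.printSchema()
expected.select("name", "projects").show(false)
```

Comparing this schema with `dfp.printSchema()` may help narrow down whether the array information is lost when the schema is derived from the OpenSearch mapping.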

If the projects field is mapped to the "nested" type, like

{
  "mappings" : {
    "properties": {
      "projects": {
        "type" : "nested"
      }
    }
  }
}

then an error is thrown:

scala> val dfp = spark.sql("source=dev.default.json"); dfp.show()
java.lang.IllegalStateException: unsupported data type: JString(nested)
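The message format suggests that the mapping type string `nested` reached a fallback branch while the index mapping was being translated into a Spark schema. The following self-contained sketch is purely illustrative and is not taken from the Flint code base: `SparkType`, `FieldMapping` and `toSparkType` are hypothetical names, and the real translation may work differently. It only shows how a type dispatch without a case for `nested` ends in this kind of exception.

```scala
// Hypothetical, simplified model of Spark-side types.
sealed trait SparkType
case object StringT extends SparkType
case object LongT extends SparkType
final case class StructT(fields: Map[String, SparkType]) extends SparkType

// Hypothetical stand-in for one field of an OpenSearch mapping.
// Object fields usually carry "properties" and no explicit "type".
final case class FieldMapping(
    tpe: Option[String],
    properties: Map[String, FieldMapping] = Map.empty)

def toSparkType(m: FieldMapping): SparkType = m match {
  case FieldMapping(Some("keyword") | Some("text"), _) => StringT
  case FieldMapping(Some("long"), _)                   => LongT
  case FieldMapping(None | Some("object"), props) =>
    StructT(props.map { case (name, field) => name -> toSparkType(field) })
  // No case handles "nested", so it falls through to the error branch.
  case FieldMapping(Some(other), _) =>
    throw new IllegalStateException(s"unsupported data type: $other")
}

// An object field translates fine; a field typed "nested" does not.
val asObject = FieldMapping(None, Map("name" -> FieldMapping(Some("text"))))
val asNested = FieldMapping(Some("nested"), asObject.properties)

println(toSparkType(asObject))
val failure =
  try { toSparkType(asNested); None }
  catch { case e: IllegalStateException => Some(e.getMessage) }
println(failure)
```

Note that an OpenSearch mapping has no dedicated array type (any field may hold zero or more values), so a schema derived from the mapping alone cannot tell a single object from an array of objects. Whether that is what causes the NULL values above is an assumption, not something confirmed in this issue.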


salyh · 2024-10-07T09:25:25Z

@YANG-DB This blocks #669

YANG-DB · 2024-10-07T16:13:38Z

@salyh
thanks for this update - does it also block the development of Spark-based PPL?

salyh · 2024-10-07T16:30:00Z

> @salyh thanks for this update - does it also block the development of Spark-based PPL?

yes, #669 is Spark-based

YANG-DB · 2024-10-07T18:26:24Z

@dai-chen @noCharger @penghuo can you please look at this?

YANG-DB · 2024-10-07T18:45:35Z

> > @salyh thanks for this update - does it also block the development of Spark-based PPL?

> yes, #669 is Spark-based

I see, but I'm referring to a PPL direct query to a Spark engine with an S3-based table (dataset). Is that relevant to the JSON arrays indexed into OpenSearch that you described here?

salyh · 2024-10-08T09:35:22Z

> > > @salyh thanks for this update - does it also block the development of Spark-based PPL?

> > yes, #669 is Spark-based

> I see, but I'm referring to a PPL direct query to a Spark engine with an S3-based table (dataset). Is that relevant to the JSON arrays indexed into OpenSearch that you described here?

Presumably not. Currently I only code and test against OpenSearch-based tables/datasets but, with regard to flatten, I can also set up an S3-based table and work with that. In any case, I think this issue needs to be solved because all PPL should at least work with OpenSearch. If there are limitations, they should at least be documented.

Speaking of documentation: I don't see any quickstart docs for using opensearch-spark with S3.

salyh added the bug (Something isn't working) and untriaged labels Oct 7, 2024

salyh mentioned this issue Oct 7, 2024

[FEATURE]Add flatten Command to PPL #669

Open

YANG-DB added the Lang:PPL Pipe Processing Language support label Oct 7, 2024

YANG-DB removed the untriaged label Oct 8, 2024
