[BUG] JSON arrays can not be queried as expected (return NULL or unsupported data type: JString(nested)) #750

Open
salyh opened this issue Oct 7, 2024 · 6 comments
Labels: bug (Something isn't working), Lang:PPL (Pipe Processing Language support)

Comments

Contributor

salyh commented Oct 7, 2024

What is the bug?
While working on #669 I noticed that JSON arrays indexed into OpenSearch always return NULL when queried via Spark. JSON objects and primitives work as expected.

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Index the following documents into OpenSearch with default dynamic mapping:
curl localhost:9200/json/_bulk -H 'content-type: application/json' -d '
{"index":{"_id":"1"}}
{"id":3,"name":"Bob Smith","title":null,"projects":[{"name":"SQL Spectrum querying","started_year":1990},{"name":"SQL security","started_year":1999},{"name":"OpenSearch security","started_year":2015}]}
{"index":{"_id":"2"}}
{"id":4,"name":"Susan Smith","title":"Dev Mgr","projects":[]}
{"index":{"_id":"3"}}
{"id":6,"name":"Jane Smith","title":"Software Eng 2","projects":[{"name":"SQL security","started_year":1998},{"name":"Hello security","started_year":2015,"address":[{"city":"Dallas","state":"TX"}]}]}
{"index":{"_id":"4"}}
{"id":7,"name":"Jane Smith2","title":"Software Eng 22","projectsasobject":{"name":"SQL security","started_year":1998}}
'
  2. Connect Spark to OpenSearch
  3. Open a Spark Shell
./spark-bin/bin/spark-shell -c spark.sql.catalog.dev=org.apache.spark.opensearch.catalog.OpenSearchCatalog \
  --packages "org.opensearch:opensearch-spark-standalone_2.12:0.6.0-SNAPSHOT,org.opensearch:opensearch-spark-ppl_2.12:0.6.0-SNAPSHOT" \
  --conf "spark.sql.extensions=org.opensearch.flint.spark.FlintPPLSparkExtensions,org.opensearch.flint.spark.FlintSparkExtensions"
  4. Run val dfp = spark.sql("source=dev.default.json"); dfp.show()
+-----------+--------------------+---+--------+---------------+                 
|       name|    projectsasobject| id|projects|          title|
+-----------+--------------------+---+--------+---------------+
|  Bob Smith|                NULL|  3|    NULL|           NULL|
|Susan Smith|                NULL|  4|    NULL|        Dev Mgr|
| Jane Smith|                NULL|  6|    NULL| Software Eng 2|
|Jane Smith2|{SQL security, 1998}|  7|    NULL|Software Eng 22|
+-----------+--------------------+---+--------+---------------+

In the first three rows, NULL is not expected in the projects column.
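
To narrow this down, it may help to check what dynamic mapping actually generated for the index. The following is a minimal sketch that assumes the same local endpoint and index name as the bulk request above; the variable names are only illustrative. OpenSearch has no dedicated array type, so projects should appear as a plain object field with sub-properties such as name and started_year, and nothing in the mapping itself indicates that the field holds an array. Whether that is what leads to the NULL values is an open question, not something confirmed here.

```shell
# Assumption: same local OpenSearch endpoint and index name as in the bulk request.
OPENSEARCH_URL="localhost:9200"
INDEX="json"

# Print the mapping that dynamic mapping generated for the index.
curl -s "$OPENSEARCH_URL/$INDEX/_mapping?pretty" \
  || echo "request failed: is OpenSearch reachable at $OPENSEARCH_URL?" >&2
```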

If the projects field is mapped to the "nested" type, like

{
  "mappings" : {
    "properties": {
      "projects": {
        "type" : "nested"
      }
    }
  }
}

then an error is thrown:

scala> val dfp = spark.sql("source=dev.default.json"); dfp.show()
java.lang.IllegalStateException: unsupported data type: JString(nested)
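
For anyone reproducing the nested case: the type of an already mapped field cannot be changed in place, so the explicit mapping has to be applied before the documents are indexed. A minimal sketch of one way to do that, assuming the same local endpoint and index name as in the bulk request (the variable names are only illustrative):

```shell
# Assumption: same local OpenSearch endpoint and index name as in the bulk request.
OPENSEARCH_URL="localhost:9200"
INDEX="json"

# Explicit mapping from the report: projects as a nested field.
MAPPING='{
  "mappings": {
    "properties": {
      "projects": { "type": "nested" }
    }
  }
}'

# WARNING: deletes the existing index and all documents in it.
curl -s -X DELETE "$OPENSEARCH_URL/$INDEX" \
  || echo "delete failed: is OpenSearch reachable at $OPENSEARCH_URL?" >&2

# Recreate the index with the explicit mapping.
curl -s -X PUT "$OPENSEARCH_URL/$INDEX" \
  -H 'content-type: application/json' \
  -d "$MAPPING" \
  || echo "create failed: is OpenSearch reachable at $OPENSEARCH_URL?" >&2
```

After this, rerunning the bulk request and the spark.sql query from the reproduction steps should trigger the IllegalStateException shown above.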
salyh added the bug and untriaged labels Oct 7, 2024
Contributor Author

salyh commented Oct 7, 2024

@YANG-DB This blocks #669

Member

YANG-DB commented Oct 7, 2024

@salyh
Thanks for this update. Does it also block the development of Spark-based PPL?

Contributor Author

salyh commented Oct 7, 2024

> @salyh thanks for this update - does it also blocks the development of spark based PPL ?

Yes, #669 is Spark-based.

Member

YANG-DB commented Oct 7, 2024

@dai-chen @noCharger @penghuo can you please look at this?

YANG-DB added the Lang:PPL label Oct 7, 2024
Member

YANG-DB commented Oct 7, 2024

> > @salyh thanks for this update - does it also blocks the development of spark based PPL ?
>
> yes, #669 is spark based

I see, but I'm referring to a direct PPL query against a Spark engine with an S3-based table (dataset). Is the problem with JSON arrays indexed into OpenSearch that you described here relevant to that case?

Contributor Author

salyh commented Oct 8, 2024

> > > @salyh thanks for this update - does it also blocks the development of spark based PPL ?
> >
> > yes, #669 is spark based
>
> I see but I'm referring to a PPL direct query to a spark engine with S3 based table (dataset) - is that relevant to OpenSearch JSON arrays, indexed into OpenSearch you described here ?

Presumably not. Currently I only code and test against OpenSearch-based tables/datasets, but with regard to flatten I can also set up an S3-based table and work with that. In any case, I think this issue needs to be solved, because all PPL should at least work with OpenSearch. If there are limitations, they should at least be documented.

Speaking of documentation: I don't see any quickstart docs for using this opensearch-spark with S3.

YANG-DB removed the untriaged label Oct 8, 2024