Expose document_stream interface #71

lemire · 2020-12-07T15:43:17Z

The pysimdjson library could support our document_stream interface (parse_many function). It is well tested as of release 0.7 (with fuzz testing) and works well today. It supports streams of indefinite size.

See https://github.com/simdjson/simdjson/blob/master/doc/parse_many.md

Related to #70
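
For readers unfamiliar with the idea, here is a minimal, stdlib-only sketch of what "a stream of JSON documents" looks like to a caller. It is not simdjson code and does not use parse_many or document_stream: `split_ndjson` is a hypothetical helper, and it assumes one document per line (ndjson), which is stricter than what parse_many accepts. It only shows the shape of the interface: many documents in one buffer, handed out one at a time without copying.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of simdjson): returns one view per
// non-empty line of an ndjson buffer, without copying any bytes.
// Assumption: documents are separated by '\n' and contain no raw newlines.
std::vector<std::string_view> split_ndjson(std::string_view input) {
  std::vector<std::string_view> docs;
  std::size_t start = 0;
  while (start < input.size()) {
    std::size_t end = input.find('\n', start);
    if (end == std::string_view::npos) {
      end = input.size();  // last document has no trailing newline
    }
    std::string_view line = input.substr(start, end - start);
    if (!line.empty()) {
      docs.push_back(line);  // view into the caller's buffer
    }
    start = end + 1;
  }
  return docs;
}
```

Each returned view points into the original buffer, so the buffer has to outlive the views. A real binding would hand each document to the parser instead of collecting views.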


lemire · 2021-08-19T22:35:07Z

Note that as of the upcoming 1.0 release, the On Demand API will support ndjson/stream parsing (sequences of JSON documents).

TkTech · 2021-08-19T23:08:47Z

How stable is the on-demand API now? I've avoided it so far due to a few issues:

- Quite a few (relatively) regressions, just from following issues
- Almost every use case of pysimdjson so far needs a DOM, and often wants to re-query data stored on the native side of the language barrier. If we switch entirely to on-demand, we'll need to re-invent the containers anyways.
- The biggest benefit of an API like on-demand is the potential for streaming chunks, which isn't there yet.
- The python side can't really inform the C side of a known document structure without significant overhead, a DSL, or some other approach like encoding the structure as a string. This is the purpose of the Shape API that was to be part of v4, but I lack the time to finish it at the moment.

lemire · 2021-08-19T23:29:24Z

The benefit of the On Demand approach is that you bypass the C++ DOM entirely and build up your own data structure directly. So instead of doing JSON -> indexing -> C++ DOM -> your code, you do JSON -> indexing -> your code, skipping a step.

It is not a replacement for the DOM approach. We will continue to support the DOM approach forever, I expect.

I am not urging you to use On Demand. I do think that it is something you should seriously examine eventually. If you do, you will get help from us.

> Quite a few (relatively) regressions, just from following issues

There were a few issues, yes. They have all been fixed quickly. The current code appears quite robust. It is a new approach, so it required a lot more work to get it to a solid state. It is also more challenging for fundamental reasons because the user has more power. But I think we are there (hence the 1.0 status). We will have an R wrapper. And we have extensive tests.

> Almost every use case of pysimdjson so far needs a DOM, and often wants to re-query data stored on the native side of the language barrier. If we switch entirely to on-demand, we'll need to re-invent the containers anyways.
>
> The biggest benefit of an API like on-demand is the potential for streaming chunks, which isn't there yet.

I am not sure that this is true. In both instances, we support streaming JSON data (streams of JSON documents, ndjson). The indexing phase (stage 1) in chunks would be easier to do with the DOM API since we are in full control (from within simdjson). It is more of a challenge with On Demand since the user is in control. You can move through the document (including moving back) and you can imagine the problems that could emerge.
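
One way to picture the difficulty, as a stdlib-only sketch and not as a description of how simdjson implements anything: a chunked consumer keeps only the unfinished tail of the input and drops everything before it. `ChunkedSplitter` is a hypothetical class, and it assumes newline-terminated documents. Dropping earlier bytes is harmless when the library finishes each document before moving on, but a user-driven cursor that can move back would need those bytes to still be there.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical illustration (not simdjson code): consumes input in
// arbitrary chunks and emits only complete newline-terminated documents.
class ChunkedSplitter {
public:
  // Feed the next chunk; returns the documents that this chunk completed.
  std::vector<std::string> feed(std::string_view chunk) {
    std::vector<std::string> complete;
    pending_.append(chunk);
    std::size_t start = 0;
    for (;;) {
      std::size_t end = pending_.find('\n', start);
      if (end == std::string::npos) {
        break;  // the rest is an unfinished document
      }
      if (end > start) {
        complete.emplace_back(pending_, start, end - start);
      }
      start = end + 1;
    }
    // Keep only the unfinished tail; the earlier bytes are discarded,
    // so nothing can move back into an already-consumed chunk.
    pending_.erase(0, start);
    return complete;
  }

  // Bytes of a document that is still waiting for its terminator.
  const std::string& pending() const { return pending_; }

private:
  std::string pending_;
};
```

Feeding `{"a":1}\n{"b"` and then `:2}\n` yields one document per call, with `{"b"` held in between. The held tail is the only history the splitter retains.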

> The python side can't really inform the C side of a known document structure without significant overhead, a DSL, or some other approach like encoding the structure as a string. This is the purpose of the Shape API that was to be part of v4, but I lack the time to finish it at the moment.

You can use On Demand in a DOM-like manner.

#include <iostream>
#include "simdjson.h"
using namespace std;
using namespace simdjson;

void recursive_print_json(ondemand::value element) {
  bool add_comma;
  switch (element.type()) {
  case ondemand::json_type::array:
    cout << "[";
    add_comma = false;
    for (auto child : element.get_array()) {
      if (add_comma) {
        cout << ",";
      }
      // We need the call to value() to get
      // an ondemand::value type.
      recursive_print_json(child.value());
      add_comma = true;
    }
    cout << "]";
    break;
  case ondemand::json_type::object:
    cout << "{";
    add_comma = false;
    for (auto field : element.get_object()) {
      if (add_comma) {
        cout << ",";
      }
      // key() returns the key as it appears in the raw
      // JSON document, if we want the unescaped key,
      // we should do field.unescaped_key().
      cout << "\"" << field.key() << "\": ";
      recursive_print_json(field.value());
      add_comma = true;
    }
    cout << "}\n";
    break;
  case ondemand::json_type::number:
    // assume it fits in a double
    cout << element.get_double();
    break;
  case ondemand::json_type::string:
    // get_string() would return the unescaped string, but
    // we are happy with the raw (still escaped) string.
    cout << "\"" << element.get_raw_json_string() << "\"";
    break;
  case ondemand::json_type::boolean:
    cout << element.get_bool();
    break;
  case ondemand::json_type::null:
    cout << "null";
    break;
  }
}
void basics_treewalk() {
  padded_string json = R"( [
  { "make": "Toyota", "model": "Camry",  "year": 2018, "tire_pressure": [ 40.1, 39.9, 37.7, 40.4 ] },
  { "make": "Kia",    "model": "Soul",   "year": 2012, "tire_pressure": [ 30.1, 31.0, 28.6, 28.7 ] },
  { "make": "Toyota", "model": "Tercel", "year": 1999, "tire_pressure": [ 29.8, 30.0, 30.2, 30.5 ] }
] )"_padded;
  ondemand::parser parser;
  ondemand::document doc = parser.iterate(json);
  ondemand::value val = doc;
  recursive_print_json(val);
  std::cout << std::endl;
}

lemire · 2021-08-19T23:36:47Z

@TkTech Let us rule out a scenario: the Python programmer uses the On Demand API directly. Though that's possible, I suspect that it would not be performant since each call would have to cross the language barrier.

TkTech · 2023-09-04T06:36:47Z

The prerequisite to unblock this was done in #110.

TkTech added the enhancement New feature or request label Dec 8, 2020

lemire mentioned this issue Aug 20, 2021

This parser can't support a document that big #86

Closed

TkTech mentioned this issue Jun 1, 2022

Support ndjson format #92

Closed
