Expose document_stream interface #71

Open
lemire opened this issue Dec 7, 2020 · 5 comments
Labels
enhancement New feature or request

Comments

@lemire
Contributor

lemire commented Dec 7, 2020

The pysimdjson library could support our document_stream interface (parse_many function). It is well tested as of release 0.7 (with fuzz testing) and works well today. It supports streams of indefinite size.

See https://github.com/simdjson/simdjson/blob/master/doc/parse_many.md

Related to #70
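
To make "a stream of JSON documents" concrete, here is a rough, standard-library-only sketch of what a document stream iterates over: several objects or arrays sitting back to back in one buffer. This is not simdjson code and not how parse_many is implemented; `split_documents` is a hypothetical helper written only for illustration, and it assumes well-formed input and ignores top-level scalars.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not a simdjson function): returns one string_view per
// top-level object or array found in `buffer`. Top-level scalars are skipped.
std::vector<std::string_view> split_documents(std::string_view buffer) {
  std::vector<std::string_view> docs;
  int depth = 0;           // current nesting level of {} and []
  bool in_string = false;  // true while inside a JSON string
  bool escaped = false;    // true right after a backslash inside a string
  std::size_t start = 0;   // offset where the current document began
  for (std::size_t i = 0; i < buffer.size(); ++i) {
    const char c = buffer[i];
    if (in_string) {
      // Braces and brackets inside strings must not affect the depth.
      if (escaped) {
        escaped = false;
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
      continue;
    }
    if (c == '"') {
      in_string = true;
    } else if (c == '{' || c == '[') {
      if (depth == 0) start = i;  // a new top-level document begins here
      ++depth;
    } else if (c == '}' || c == ']') {
      if (depth > 0 && --depth == 0) {
        docs.push_back(buffer.substr(start, i - start + 1));
      }
    }
  }
  return docs;
}
```

Calling `split_documents` on a buffer holding three newline-separated objects would yield three views into that buffer, one per document. A binding could hand each one to the caller lazily instead of collecting them in a vector.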

@TkTech TkTech added the enhancement New feature or request label Dec 8, 2020
@lemire
Contributor Author

lemire commented Aug 19, 2021

Note that as of the upcoming 1.0 release, the On Demand API will support ndjson/stream parsing (sequences of JSON documents).

@TkTech
Owner

TkTech commented Aug 19, 2021

How stable is the on-demand API now? I've avoided it so far due to a few issues:

  • Quite a few (relatively) regressions, just from following issues
  • Almost every use case of pysimdjson so far needs a DOM, and often wants to re-query data stored on the native side of the language barrier. If we switch entirely to on-demand, we'll need to re-invent the containers anyways.
  • The biggest benefit of an API like on-demand is the potential for streaming chunks, which isn't there yet.
  • The python side can't really inform the C side of a known document structure without significant overhead, a DSL, or some other approach like encoding the structure as a string. This is the purpose of the Shape API that was to be part of v4, but I lack the time to finish it at the moment.

@lemire
Contributor Author

lemire commented Aug 19, 2021

The benefit of the On Demand approach is that you bypass the C++ DOM entirely and build up your own data structure directly. So instead of doing JSON -> indexing -> C++ DOM -> your code, you do JSON -> indexing -> your code, skipping a step.

It is not a replacement for the DOM approach. I expect that we will continue to support the DOM approach indefinitely.

I am not urging you to use On Demand. I do think that it is something you should seriously examine eventually. If you do, you will get help from us.

> Quite a few (relatively) regressions, just from following issues

There were a few issues, yes. They have all been fixed quickly. The current code appears quite robust. It is a new approach, so it required a lot more work to get it to a solid state. It is also more challenging for fundamental reasons because the user has more power. But I think we are there (hence the 1.0 status). We will have an R wrapper. And we have extensive tests.

> Almost every use case of pysimdjson so far needs a DOM, and often wants to re-query data stored on the native side of the language barrier. If we switch entirely to on-demand, we'll need to re-invent the containers anyways.

> The biggest benefit of an API like on-demand is the potential for streaming chunks, which isn't there yet.

I am not sure that this is true. In both cases (DOM and On Demand), we support streaming JSON data (streams of JSON documents, ndjson). Running the indexing phase (stage 1) in chunks would be easier with the DOM API since we are in full control (from within simdjson). It is more of a challenge with On Demand since the user is in control: they can move through the document (including moving back), and you can imagine the problems that could emerge.
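
To illustrate what chunked indexing involves, here is a simplified scalar sketch, assuming that an indexing pass only records the offsets of structural characters outside strings. It is not simdjson's stage 1 (which is vectorized and does more, such as UTF-8 validation), and `ChunkedIndexer` is a hypothetical name. The point is that the string and escape state must be carried across chunk boundaries, and that the consumer needs the index for whatever part of the document it may still visit.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical scalar stand-in for an indexing pass. It records the offset of
// every structural character ({ } [ ] : ,) that lies outside a JSON string.
class ChunkedIndexer {
public:
  // Feed the next chunk; offsets are relative to the start of the whole stream.
  void feed(std::string_view chunk) {
    for (const char c : chunk) {
      if (in_string_) {
        if (escaped_) {
          escaped_ = false;
        } else if (c == '\\') {
          escaped_ = true;
        } else if (c == '"') {
          in_string_ = false;
        }
      } else if (c == '"') {
        in_string_ = true;
      } else if (c == '{' || c == '}' || c == '[' || c == ']' ||
                 c == ':' || c == ',') {
        index_.push_back(offset_);
      }
      ++offset_;
    }
  }

  const std::vector<std::size_t>& index() const { return index_; }

private:
  std::vector<std::size_t> index_;
  std::size_t offset_ = 0;
  bool in_string_ = false;  // carried across chunk boundaries
  bool escaped_ = false;    // carried across chunk boundaries
};
```

Feeding a document in several pieces gives the same index as feeding it in one piece, even when a chunk ends in the middle of a string or right after a backslash. What the sketch leaves out is the hard part: deciding when old index entries can be dropped if the consumer is free to move around in the document.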

> The python side can't really inform the C side of a known document structure without significant overhead, a DSL, or some other approach like encoding the structure as a string. This is the purpose of the Shape API that was to be part of v4, but I lack the time to finish it at the moment.

You can use On Demand in a DOM-like manner.

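#include <iostream>
#include "simdjson.h"

using namespace std;
using namespace simdjson;

// Recursively prints an On Demand value as JSON text.
void recursive_print_json(ondemand::value element) {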
  bool add_comma;
  switch (element.type()) {
  case ondemand::json_type::array:
    cout << "[";
    add_comma = false;
    for (auto child : element.get_array()) {
      if (add_comma) {
        cout << ",";
      }
      // We need the call to value() to get
      // an ondemand::value type.
      recursive_print_json(child.value());
      add_comma = true;
    }
    cout << "]";
    break;
  case ondemand::json_type::object:
    cout << "{";
    add_comma = false;
    for (auto field : element.get_object()) {
      if (add_comma) {
        cout << ",";
      }
      // key() returns the key as it appears in the raw
      // JSON document, if we want the unescaped key,
      // we should do field.unescaped_key().
      cout << "\"" << field.key() << "\": ";
      recursive_print_json(field.value());
      add_comma = true;
    }
    cout << "}\n";
    break;
  case ondemand::json_type::number:
    // assume it fits in a double
    cout << element.get_double();
    break;
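  case ondemand::json_type::string:
    // get_raw_json_string() returns the string as it appears in the
    // document (escape sequences left as-is), which is what we want
    // when echoing JSON; get_string() would return the unescaped string.
    cout << "\"" << element.get_raw_json_string() << "\"";
    break;
  case ondemand::json_type::boolean:
    // Print true/false rather than the default 1/0.
    cout << (bool(element.get_bool()) ? "true" : "false");
    break;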
  case ondemand::json_type::null:
    cout << "null";
    break;
  }
}
void basics_treewalk() {
  padded_string json = R"( [
  { "make": "Toyota", "model": "Camry",  "year": 2018, "tire_pressure": [ 40.1, 39.9, 37.7, 40.4 ] },
  { "make": "Kia",    "model": "Soul",   "year": 2012, "tire_pressure": [ 30.1, 31.0, 28.6, 28.7 ] },
  { "make": "Toyota", "model": "Tercel", "year": 1999, "tire_pressure": [ 29.8, 30.0, 30.2, 30.5 ] }
] )"_padded;
  ondemand::parser parser;
  ondemand::document doc = parser.iterate(json);
  ondemand::value val = doc;
  recursive_print_json(val);
  std::cout << std::endl;
}

@lemire
Contributor Author

lemire commented Aug 19, 2021

@TkTech Let us rule out one scenario: the Python programmer using the On Demand API directly. Though that is possible, I suspect that it would not be performant since each call would have to cross the language barrier.

@TkTech
Owner

TkTech commented Sep 4, 2023

The prerequisite to unblock this was done in #110.
