v7.0.0 #110

TkTech · 2023-09-03T01:14:37Z

This is a major breaking change release that removes Array and Object proxies. However, after checking all GitHub repos that have this one as a dependency with > 5 stars, only 2 were using these features. They were generally an anti-pattern - if you needed 1 value, use at_pointer() instead. If you needed more than 1 value, it was almost always faster to use at_pointer() for an entire object at once. This new approach also alleviates memory management issues on PyPy.

If all you used was simdjson.loads() and simdjson.parse(), you should notice no difference.
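
As a rough illustration of the lookup that at_pointer() performs, here is a minimal pure-Python sketch of JSON Pointer (RFC 6901) resolution over data parsed with the standard library. It is not pysimdjson's implementation, and the `default` parameter name is an assumption made for this example only.

```python
import json

_MISSING = object()  # sentinel, so that None can be passed as a real default


def at_pointer(doc, pointer, default=_MISSING):
    """Resolve an RFC 6901 JSON Pointer against already-parsed JSON data.

    Hypothetical stand-in written for illustration; not the library's code.
    """
    if pointer == "":
        return doc  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    node = doc
    for token in pointer[1:].split("/"):
        # Unescape in this order: "~1" becomes "/", then "~0" becomes "~".
        token = token.replace("~1", "/").replace("~0", "~")
        try:
            if isinstance(node, list):
                if not token.isdigit():
                    raise KeyError(token)
                node = node[int(token)]
            elif isinstance(node, dict):
                node = node[token]
            else:
                raise KeyError(token)  # cannot descend into a scalar
        except (KeyError, IndexError, ValueError):
            if default is _MISSING:
                raise KeyError(pointer) from None
            return default
    return node


doc = json.loads('{"a": {"b": [10, 20, 30]}, "x/y": 1}')
print(at_pointer(doc, "/a/b/1"))            # 20
print(at_pointer(doc, "/x~1y"))             # 1
print(at_pointer(doc, "/a/missing", None))  # None
```

Fetching "/a/b" once and indexing the resulting list in Python corresponds to the "entire object at once" pattern described above.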

- Drop Python 3.6 and 3.7, which are now beyond end-of-life. Add Python 3.11.
- Exploit CPython Unicode object internals for significantly faster string creation (up to 45%!).
- Remove Array and Object proxy objects.
  - Changing our approach to this has significantly improved memory safety internally and fixed PyPy support.
- Update deprecated GitHub Actions.
- Update vendored simdjson to version 3.2.3.
  - Minified floats no longer drop the .0 (see Float aware mini #102).
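
To see why keeping the .0 matters to a Python consumer, the sketch below uses only the standard library's json module as a stand-in for a minifier. The "lossy" string is written by hand to represent output with the .0 dropped; it is not produced by any simdjson version here.

```python
import json

original = '{"price": 1.0, "count": 1}'
parsed = json.loads(original)

# A float-aware minifier keeps "1.0", so the value is still a float
# after a round trip through the minified text.
minified = json.dumps(parsed, separators=(",", ":"))
print(minified)  # {"price":1.0,"count":1}

# Hand-written example of minified output that dropped the ".0".
lossy = '{"price":1,"count":1}'

print(type(json.loads(minified)["price"]).__name__)  # float
print(type(json.loads(lossy)["price"]).__name__)     # int
```

The two values compare equal in Python, but code that checks types or re-serializes the data will see an int where the source document had a float.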

ToDo:

- Re-add the JSON-to-buffer/numpy array support removed in the initial cleanup (this method is many times faster than naively loading JSON when turning a homogeneous array of JSON values into a numpy array).
- Add support for latest PyPy.
- Memory optimization pass.
- Update documentation and examples.
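
For context on the first to-do item, this is a sketch of the naive route it is compared against, written with the standard library only. It shows where the extra cost comes from: every number becomes a Python object before being copied into a typed buffer. The payload is made up for the example.

```python
import json
from array import array

payload = "[1.5, 2.0, 3.25, 4.0]"

# Naive route: every number becomes a Python float object first...
values = json.loads(payload)
# ...and is then copied into a contiguous buffer of C doubles.
buf = array("d", values)

view = memoryview(buf)
print(view.format, len(view))  # d 4
print(buf.tolist())            # [1.5, 2.0, 3.25, 4.0]
```

A buffer like this can be wrapped without another copy by anything that understands the buffer protocol (for example numpy.frombuffer), which is presumably why a direct JSON-to-buffer path avoids most of the per-element overhead.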


TkTech · 2023-09-04T05:51:17Z

For certain benchmarks, especially those that are string-heavy, this version is now roughly 45% faster.

---------------------------------------------------------------------- benchmark 'Complete load of data/twitter.json': 2 tests -----------------------------------------------------------------------
Name (time in us)             Min                   Max                  Mean              StdDev                Median                IQR            Outliers       OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
simdjson (NOW)           916.4979 (1.0)      4,188.6930 (1.0)      1,011.2369 (1.0)      391.3896 (1.0)        939.6220 (1.0)      24.2511 (1.0)         16;88  988.8880 (1.0)         690           1
simdjson (OLD)     1,328.0310 (1.45)     4,533.0260 (1.08)     1,428.9499 (1.41)     414.6389 (1.06)     1,355.7710 (1.44)     31.2135 (1.29)        12;49  699.8146 (0.71)        507           1
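
The parenthesized ratios in the table, and what "45% faster" means in terms of time saved and throughput, can be re-derived from the raw numbers. The values below are copied from the table; the variable names are just for this example.

```python
# Values copied from the benchmark table above (times in microseconds).
new = {"min": 916.4979, "mean": 1011.2369, "ops": 988.8880}
old = {"min": 1328.0310, "mean": 1428.9499, "ops": 699.8146}

min_ratio = old["min"] / new["min"]       # how much longer the old version takes
mean_ratio = old["mean"] / new["mean"]
time_saved = 1 - new["min"] / old["min"]  # fraction of wall time removed
ops_gain = new["ops"] / old["ops"] - 1    # increase in operations per second

print(f"{min_ratio:.2f} {mean_ratio:.2f}")  # 1.45 1.41
print(f"{time_saved:.0%} {ops_gain:.0%}")   # 31% 41%
```

In other words, the old version takes about 1.45x as long on the best run, which is the same as the new version removing about 31% of the run time.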

edgarsi · 2023-09-05T21:26:48Z

I am using pysimdjson to do work like this:

doc = Parser().parse('{"a": {"b": ...}}')
b = doc['a']['b']
s = b.mini

The contents under b are huge, and pysimdjson allows me to avoid creating Python objects of them.

With the new changes you destroy this feature. Now at_pointer constructs the Python objects forcefully.

I propose at_pointer returns the Document object, and the Document object implements the mini property or method. (I did not find how mini is even accessible in the current code.) With these changes, the sample code above can be rewritten to:

doc = Parser().parse('{"a": {"b": ...}}')
b = doc.at_pointer('/a/b')
s = b.mini

But this still requires one to know the full path.

I am also using pysimdjson as follows:

doc = Parser().parse('{"a": [{"x": ...}, ...]}')
items = list(doc['a'])
for item in items:
    item[y] = ...
s = deep_jsonify(items)  # uses .mini when possible

First of all, the drop-in functionality of read-only list and dict structures is very nice here. Second, the new Document does not offer any way to list items at all without creating Python objects for the full JSON subtree. If you hate the Array and Object classes, maybe add a Document.parse_shallow that returns the Python element one level deep: for a list, a list of Document objects, and likewise for a dict?

P.S. Document.root and Document.as_object are the same function, with two names, and neither seems to be implemented for backward compatibility reasons.

TkTech · 2023-09-06T01:23:54Z

Thanks for the feedback @edgarsi, appreciate it.

> With the new changes you destroy this feature. Now at_pointer constructs the Python objects forcefully.

This PR won't be merged until it's back to feature parity with v5. The Array and Object interfaces have to disappear for memory safety. While there are a bunch of ways to make it "safe", they come at a severe performance penalty for small documents. They also tended to be used to access more than a key or two, which is often slower than just getting the entire object.

> I propose at_pointer returns the Document object, and the Document object implements the mini property or method. (I did not find how mini is even accessible in the current code.) With these changes, the sample code above can be rewritten to:

Most of the methods on Document() will mimic their counterparts in py_yyjson, where every method can take a pointer. The .mini property will become a method that takes a pointer, along the lines of mini(at_pointer="/a/b"). You'll actually see a bit of a speed boost and slightly better memory usage.

> list lists Document objects, etc for dict?

1 JSON Document will return 1 Document() object. It's a memory container, not meant to represent a simd::element. The list, dict, and numpy helpers will be back before this is merged. Proxy objects cannot be used safely in Python, because the Document() may have been reused between calls. All methods in v6 will return Python objects.
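
The reuse hazard described above can be shown with a toy model in plain Python. ReusingParser and Proxy are hypothetical classes invented for this sketch; they only mimic the idea of a lazy view into a buffer that the parser overwrites, and do not reflect pysimdjson's internals.

```python
class Proxy:
    """Lazy view: remembers where the value lives instead of copying it."""

    def __init__(self, buffer, start, end):
        self._buffer, self._start, self._end = buffer, start, end

    def materialize(self) -> bytes:
        # Reads from the shared buffer at call time, not at creation time.
        return bytes(self._buffer[self._start:self._end])


class ReusingParser:
    """Toy parser that keeps one buffer and overwrites it on every parse()."""

    def __init__(self):
        self._buffer = bytearray()

    def parse(self, data: bytes) -> Proxy:
        self._buffer[:] = data  # reuse: the previous contents are gone
        return Proxy(self._buffer, 0, len(data))


parser = ReusingParser()
first = parser.parse(b"[1, 2, 3]")
eager = first.materialize()   # copied out while the buffer was still valid
parser.parse(b'{"k": "v"}')   # the parser is reused for another document

print(eager)                  # b'[1, 2, 3]'
print(first.materialize())    # b'{"k": "v"' (stale proxy, wrong data)
```

In this toy the stale proxy silently returns the wrong bytes; with native memory the same pattern can read freed or repurposed memory, which is why returning fully built Python objects sidesteps the problem.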

> P.S. Document.root and Document.as_object are the same function, with two names, and neither seems to be implemented for backward compatibility reasons.

This was already fixed locally. root() isn't exposed to Python, it's a cdef to return the document root for internal functions.

Drop Python 3.6 and 3.7, which are now beyond end-of-life. Add Python 3.11.

f22d3fc

TkTech self-assigned this Sep 3, 2023

TkTech added 5 commits September 2, 2023 21:16

ubuntu-latest is running ubuntu-22.04 which doesn't yet have clang-17, stuck on clang-15.

50b0de1

Fail all CI tests immediately if one fails, as our build time has increased significantly over the years.

fd09d98

Update vendored simdjson to version 3.2.3.

5577bda

Update actions/checkout to v3.

6d283fe

Run release workflow when github releases are published. Run tests only once when pushed into a PR.

Upstream simdjson changed capitalization on empty buffer exception.

10e2d7d

This was referenced Sep 3, 2023

doc: Update supported Python versions #107

Closed

Float aware mini #102

Closed

TkTech added 2 commits September 3, 2023 08:39

Completely remove Array and Object proxies, they were a mistake and resulted in misuses. Closes #82, #109.

092ce41

Optimized path for ASCII strings.

5d3533f

TkTech mentioned this pull request Sep 4, 2023

add option to make re-use check late #109

Closed

Default argument support for at_pointer (closes #105)

d2cc66b

This was referenced Sep 4, 2023

Return a default value from at_pointer when the key doesn't exist #105

Closed

Improve user experience of memory safety. #82

Closed

Missing loads and load type definitions (closes #103)

3596a9e

TkTech mentioned this pull request Sep 4, 2023

Expose document_stream interface #71

Open

TkTech changed the title ~~v6.0.0~~ v7.0.0 Feb 4, 2024
