
Improve user experience of memory safety. #82

Closed

TkTech opened this issue May 28, 2021 · 4 comments

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@TkTech (Owner) commented May 28, 2021

We've added a check in v4 (https://github.com/TkTech/pysimdjson/blob/master/simdjson/csimdjson.pyx#L437) that prevents parsing new documents while references to the old one still exist. This is correct, in that it ensures no use-after-free errors. I wasn't terribly happy with this, but it's better than segfaulting.
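To illustrate from the user's side, roughly (a sketch of the behavior described above; the exact exception type and message may differ from what v4 actually raises):

import simdjson

parser = simdjson.Parser()
doc = parser.parse(b'{"a": 1}')  # doc is a proxy into parser-owned memory

# While doc is alive, re-using the parser is refused rather than
# silently invalidating doc:
try:
    parser.parse(b'{"b": 2}')
except RuntimeError:
    print("parser is still referenced by a live proxy")

del doc
parser.parse(b'{"b": 2}')  # fine once the old proxies are gone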

It has downsides, though. Brainstorming welcome. Some alternatives:

  • Probably the easiest approach would be for a Parser to keep a list of the Object and Array proxies that hold a reference to it, and set a dirty bit on them when parse() is called with a different document (a sketch of this follows the list). The performance of this would probably be unacceptable, but I might be wrong.
  • Use the new parse_into_document() and create a new document for every parse. This is potentially both slow and very wasteful of memory, but it would let us keep a document around and valid for as long as an Object or Array references it.
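A minimal sketch of the dirty-bit idea from the first alternative, in plain Python with illustrative names (Proxy and Parser here are stand-ins, not pysimdjson's internals):

import weakref

class Proxy:
    """Stand-in for an Object/Array proxy into parser-owned memory."""
    def __init__(self, value):
        self._value = value
        self.stale = False

    def get(self):
        if self.stale:
            raise ValueError("proxy refers to a document that was replaced")
        return self._value

class Parser:
    def __init__(self):
        # Weak references, so proxies the caller has dropped cost nothing.
        self._proxies = weakref.WeakSet()

    def parse(self, data):
        # The dirty-bit step: flag every still-live proxy from the
        # previous document as stale before reusing the buffer.
        for proxy in self._proxies:
            proxy.stale = True
        self._proxies = weakref.WeakSet()
        proxy = Proxy(data)  # real code would parse into reused memory here
        self._proxies.add(proxy)
        return proxy

The cost being worried about is the per-parse walk over the proxy set, plus the stale check on every proxy access.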
TkTech added the enhancement and help wanted labels on May 28, 2021
@chutz commented May 28, 2021

  • Use the new `parse_into_document()` and create a new document for every parse. This is potentially both slow and very wasteful with memory, but would let us keep a document around and valid for as long as Object or Array reference it.

Rather than making it raise an exception, it might be a nicer user experience to have the Python wrapper for the Parser object emit a RuntimeWarning and create a new parser when the current one still has references. This would avoid breaking existing code while giving developers a way to find and fix these bugs. It also avoids any extra performance overhead unless it's needed for the code to work as expected.
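A rough sketch of that warn-and-replace behavior, using illustrative stand-ins rather than pysimdjson's actual internals:

import warnings

class _Core:
    """Stand-in for the C-level parser that owns the document memory."""
    def __init__(self):
        self.live_proxies = 0  # real code would count Object/Array proxies

    def parse(self, data):
        return data  # real code would return proxies into owned memory

class Parser:
    def __init__(self):
        self._core = _Core()

    def parse(self, data):
        if self._core.live_proxies:
            warnings.warn(
                "Parser re-used while proxies to the previous document "
                "are still alive; allocating a fresh internal parser",
                RuntimeWarning,
            )
            # The old core stays alive behind its proxies; new parses go
            # to a fresh core, so existing code keeps working.
            self._core = _Core()
        return self._core.parse(data)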

@kaber2 commented Jun 11, 2022

  • Use the new parse_into_document() and create a new document for every parse. This is potentially both slow and very wasteful with memory, but would let us keep a document around and valid for as long as Object or Array reference it.

Getting back to this: I have some long-lived objects created from JSON that are lazily constructed when needed (most of them are not). What I've done so far with orjson is simply parse JSON messages as they're received, create those objects, and pass them on to the caching and lazy-construction logic. I now want to switch to pysimdjson, but the one-document-per-parser limitation means my use case won't work.

My naive assumption would be that the Object and Array proxies hold a reference to the document, and that it is freed when no longer referenced.

So would something like this work:

  • The Parser checks that the document has no remaining references and reuses it, as it does currently.
  • Otherwise, it uses parse_into_document to create a new one and keeps a reference.
  • When a document is released, the Parser keeps one released document for reuse on the next parse.

That would make all memory handling transparent to the user and "simply work". If used in the currently supported way, there's no performance impact; if multiple parsed documents are required, you pay the price, whatever that may be.
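A sketch of that scheme, with illustrative names (Document.fill_from stands in for parse_into_document(); real code would also recycle released documents back into the spare slot, e.g. via a weakref callback):

import weakref

class Document:
    """Stand-in for a simdjson document buffer."""
    def fill_from(self, data):
        self.data = data  # real code: parse_into_document()

class Parser:
    def __init__(self):
        self._doc = Document()  # current document
        self._doc_refs = []     # weakrefs to proxies into the current document
        self._spare = None      # one released document kept for reuse

    def _register_proxy(self, proxy):
        # Called whenever an Object/Array proxy into self._doc is created.
        self._doc_refs.append(weakref.ref(proxy))

    def parse(self, data):
        if any(ref() is not None for ref in self._doc_refs):
            # The caller still holds proxies, so the current document must
            # survive; switch to the spare, or allocate a new one.
            self._doc = self._spare if self._spare is not None else Document()
            self._spare = None
            self._doc_refs = []
        self._doc.fill_from(data)
        return self._doc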

Alternatively, could an API be exposed that allows parsing into new documents manually?

Thanks!

@zrothberg commented:
I was poking around this library for use in a project and stumbled upon this issue. I think the approach below may work out for you.

@TkTech, something that may save you some headaches with not only this problem but also your thread-safety problem is to take an implementation detail from HTTP libraries. Instead of creating your object directly, return a parser via with. That lets you isolate the creation and destruction of parsers from entering and exiting the block. You can use a simple queue to hold all the parsers instead of just letting them get GC'd, and if there are no parsers in the queue you create a new one.

That may also help you with the PyPy issue, since you can hold off returning the parser to the queue until its references are gone. Though TBH you probably want to just throw an error if an Object or Array hasn't been cleaned up when you close the parser.

That would just change your access API to something like the following:

with simdjson.getparser() as parser:
    parser.parse(b'{"res": [{"name": "first"}, {"name": "second"}]}')

In getparser you then check whether a parser is available in the queue, or create one if there is not. When you close the parser you simply return it to the queue to be reused. LMK if I wasn't being clear anywhere; rough pseudocode of what I mean is below.
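Something along these lines (getparser is a proposed name here, not an existing pysimdjson function):

import contextlib
import queue

import simdjson

_pool = queue.SimpleQueue()

@contextlib.contextmanager
def getparser():
    try:
        parser = _pool.get_nowait()  # reuse an idle parser if one exists
    except queue.Empty:
        parser = simdjson.Parser()   # otherwise create a fresh one
    try:
        yield parser
    finally:
        # Return the parser for reuse; a stricter version would raise
        # here if Object/Array proxies are still alive.
        _pool.put(parser)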

@TkTech (Owner, Author) commented Sep 4, 2023

Avoided entirely by using parse_into_document and removing long-lived object proxies in #110.

TkTech closed this as completed on Sep 4, 2023