
Improve user experience of memory safety. #82

Closed

TkTech opened this issue May 28, 2021 · 4 comments

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@TkTech (Owner) commented May 28, 2021

We've added a check in v4 (https://github.com/TkTech/pysimdjson/blob/master/simdjson/csimdjson.pyx#L437) that prevents parsing new documents while references to the old one still exist. This is correct, in that it ensures no use-after-free errors. I wasn't terribly happy with this, but it's better than segfaulting.
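To illustrate from the user's side, roughly (a sketch of the behavior described above; the exact exception type and message may differ from what v4 actually raises):

import simdjson

parser = simdjson.Parser()
doc = parser.parse(b'{"a": 1}')  # doc is a proxy into parser-owned memory

# While doc is alive, re-using the parser is refused rather than
# silently invalidating doc:
try:
    parser.parse(b'{"b": 2}')
except RuntimeError:
    print("parser is still referenced by a live proxy")

del doc
parser.parse(b'{"b": 2}')  # fine once the old proxies are gone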

It has downsides, though. Brainstorming welcome. Some alternatives:

  • Probably the easiest approach would be for a Parser to keep a list of the Object and Array proxies that hold a reference to it, and set a dirty bit on them when parse() is called with a different document (a sketch of this follows the list). The performance of this would probably be unacceptable, but I might be wrong.
  • Use the new parse_into_document() and create a new document for every parse. This is potentially both slow and very wasteful of memory, but it would let us keep a document around and valid for as long as an Object or Array references it.
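A minimal sketch of the dirty-bit idea from the first alternative, in plain Python with illustrative names (Proxy and Parser here are stand-ins, not pysimdjson's internals):

import weakref

class Proxy:
    """Stand-in for an Object/Array proxy into parser-owned memory."""
    def __init__(self, value):
        self._value = value
        self.stale = False

    def get(self):
        if self.stale:
            raise ValueError("proxy refers to a document that was replaced")
        return self._value

class Parser:
    def __init__(self):
        # Weak references, so proxies the caller has dropped cost nothing.
        self._proxies = weakref.WeakSet()

    def parse(self, data):
        # The dirty-bit step: flag every still-live proxy from the
        # previous document as stale before reusing the buffer.
        for proxy in self._proxies:
            proxy.stale = True
        self._proxies = weakref.WeakSet()
        proxy = Proxy(data)  # real code would parse into reused memory here
        self._proxies.add(proxy)
        return proxy

The cost being worried about is the per-parse walk over the proxy set, plus the stale check on every proxy access.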
TkTech added the enhancement and help wanted labels on May 28, 2021
@chutz commented May 28, 2021

  • Use the new `parse_into_document()` and create a new document for every parse. This is potentially both slow and very wasteful with memory, but would let us keep a document around and valid for as long as Object or Array reference it.

Rather than making it raise an exception, it might be a nicer user experience to have the Python wrapper for the Parser object emit a RuntimeWarning and create a new parser when the current one still has references. This would avoid breaking existing code while giving developers a way to find and fix these bugs. It also avoids any extra performance overhead unless it's needed for the code to work as expected.
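A rough sketch of that warn-and-replace behavior, using illustrative stand-ins rather than pysimdjson's actual internals:

import warnings

class _Core:
    """Stand-in for the C-level parser that owns the document memory."""
    def __init__(self):
        self.live_proxies = 0  # real code would count Object/Array proxies

    def parse(self, data):
        return data  # real code would return proxies into owned memory

class Parser:
    def __init__(self):
        self._core = _Core()

    def parse(self, data):
        if self._core.live_proxies:
            warnings.warn(
                "Parser re-used while proxies to the previous document "
                "are still alive; allocating a fresh internal parser",
                RuntimeWarning,
            )
            # The old core stays alive behind its proxies; new parses go
            # to a fresh core, so existing code keeps working.
            self._core = _Core()
        return self._core.parse(data)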

@kaber2 commented Jun 11, 2022

  • Use the new parse_into_document() and create a new document for every parse. This is potentially both slow and very wasteful with memory, but would let us keep a document around and valid for as long as Object or Array reference it.

Getting back to this: I have some long-lived objects created from JSON that are lazily constructed when needed (most of them are not). What I've done so far with orjson is simply parse JSON messages as they're received, create those objects, and pass them on to the caching and lazy-construction logic. I now want to switch to pysimdjson, but the one-document-per-parser limitation means my use case won't work.

My naive assumption would be that the Object and Array proxies hold a reference to the document, and that it is freed when no longer referenced.

So would something like this work:

  • The Parser checks that the document has no remaining references and reuses it, as it does currently.
  • Otherwise, it uses parse_into_document to create a new one and keeps a reference.
  • When a document is released, the Parser keeps one released document for reuse on the next parse.

That would make all memory handling transparent to the user and "simply work". If used in the currently supported way, there's no performance impact; if multiple parsed documents are required, you pay the price, whatever that may be.
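A sketch of that scheme, with illustrative names (Document.fill_from stands in for parse_into_document(); real code would also recycle released documents back into the spare slot, e.g. via a weakref callback):

import weakref

class Document:
    """Stand-in for a simdjson document buffer."""
    def fill_from(self, data):
        self.data = data  # real code: parse_into_document()

class Parser:
    def __init__(self):
        self._doc = Document()  # current document
        self._doc_refs = []     # weakrefs to proxies into the current document
        self._spare = None      # one released document kept for reuse

    def _register_proxy(self, proxy):
        # Called whenever an Object/Array proxy into self._doc is created.
        self._doc_refs.append(weakref.ref(proxy))

    def parse(self, data):
        if any(ref() is not None for ref in self._doc_refs):
            # The caller still holds proxies, so the current document must
            # survive; switch to the spare, or allocate a new one.
            self._doc = self._spare if self._spare is not None else Document()
            self._spare = None
            self._doc_refs = []
        self._doc.fill_from(data)
        return self._doc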

Alternatively, could an API be exposed that allows parsing into new documents manually?

Thanks!

@zrothberg commented:
I was poking around this library for use in a project and stumbled upon this issue. I think the approach below may work out for you.

@TkTech, something that may save you some headaches with not only this problem but also your thread-safety problem is to take an implementation detail from HTTP libraries. Instead of creating your object directly, return a parser via with. That lets you isolate the creation and destruction of parsers from entering and exiting the block. You can use a simple queue to hold all the parsers instead of just letting them get GC'd, and if there are no parsers in the queue you create a new one.

That may also help you with the PyPy issue, since you can hold off returning the parser to the queue until its references are gone. Though TBH you probably want to just throw an error if an Object or Array hasn't been cleaned up when you close the parser.

That would just change your access API to something like the following:

with simdjson.getparser() as parser:
    parser.parse(b'{"res": [{"name": "first"}, {"name": "second"}]}')

In getparser you then check whether a parser is available in the queue, or create one if there is not. When you close the parser you simply return it to the queue to be reused. LMK if I wasn't being clear anywhere; rough pseudocode of what I mean is below.
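Something along these lines (getparser is a proposed name here, not an existing pysimdjson function):

import contextlib
import queue

import simdjson

_pool = queue.SimpleQueue()

@contextlib.contextmanager
def getparser():
    try:
        parser = _pool.get_nowait()  # reuse an idle parser if one exists
    except queue.Empty:
        parser = simdjson.Parser()   # otherwise create a fresh one
    try:
        yield parser
    finally:
        # Return the parser for reuse; a stricter version would raise
        # here if Object/Array proxies are still alive.
        _pool.put(parser)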

@TkTech (Owner, Author) commented Sep 4, 2023

Avoided entirely by using parse_into_document and removing long-lived object proxies in #110.

TkTech closed this as completed on Sep 4, 2023