Add streaming 'jsonl' parser #3831
The current build setup doesn't seem to have a concept for benchmark scripts, so for now you'll have to run it with something like ts-node.
Perhaps add a minor comment about why
LGTM, but maybe the owning team wants a say as well.
Looks great. Some minor comments.
  await handler(JSON.parse(buffer));
} catch (e) {
  reject(e);
  return;
Minor: If you move the `logger.log` statement and the `resolve()` call into the try block, you won't need a `return` here.
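The suggestion above could look roughly like the following. This is only a sketch of the control flow: `flushRemaining`, the `Logger` interface, and the log message are hypothetical, and the surrounding code in the PR may differ. Only `handler`, `buffer`, `logger.log`, `resolve`, and `reject` come from the review thread.

```typescript
// Hypothetical minimal logger shape, assumed for this sketch.
interface Logger {
  log(message: string): void;
}

// Hypothetical helper: parses the final buffered entry and settles the promise.
function flushRemaining(
  buffer: string,
  handler: (value: unknown) => Promise<void>,
  logger: Logger,
): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    void (async () => {
      try {
        await handler(JSON.parse(buffer));
        // With the success path inside the try block, the catch block
        // no longer needs an early return.
        logger.log("Finished parsing"); // message text is an assumption
        resolve();
      } catch (e) {
        reject(e);
      }
    })();
  });
}
```

One side effect of this shape is that an exception thrown by `logger.log` would also reject the promise, which may or may not be desirable.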
I think it's fine to check the benchmark script in.
Can you add a file comment, or a README describing what the benchmark does and that it's not being run on a regular basis? Also, it would be nice to include the results that you have in the PR description.
Nice.
Replaces the jsonl parser with a streaming version that is at least as fast as the sync version. Also see original PR against the hackathon branch.
Thanks to @esbena for the initial streaming version used for the hackathon. That version used the `readline` library to split lines, which turned out to be a bottleneck and made the streaming parser slower than the original sync version. So I wrote one that doesn't use `readline`.
Running the benchmark on a 21 MB logfile:
- `readJsonlReferenceImpl`: 172.4 ms (original non-streaming version)
- `readJsonlFile`: 283.3 ms (streaming version based on `readline`)
- `readJsonlFile2`: 151.3 ms (new version without `readline`)
- `justReadline`: 187.5 ms (consumes the file with `readline` and nothing else)
On a 520 MB logfile:
- `readJsonlReferenceImpl`: out of memory
- `readJsonlFile`: 6439.4 ms
- `readJsonlFile2`: 3538.4 ms
- `justReadline`: 3664.3 ms
I've added the benchmark script, although the project doesn't seem to have much infrastructure for benchmark scripts. At the moment you'll have to run it with something like `ts-node`, and there are no tests to ensure the benchmark script keeps working (but it will be checked for compilation errors). I'm on the fence about whether it should be committed.
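For readers who want a feel for the approach, here is a minimal sketch of a streaming JSONL parser that splits the input manually instead of going through `readline`. It is not the implementation from this PR: `parseJsonlChunks` is a hypothetical name, and the sketch assumes one JSON value per line separated by `\n`, which may not match the separator used by the real log files.

```typescript
// Sketch only: consumes string chunks and calls `handler` for each JSON line.
async function parseJsonlChunks(
  chunks: AsyncIterable<string>,
  handler: (value: unknown) => Promise<void>,
): Promise<void> {
  let buffer = ""; // trailing partial line carried over between chunks
  for await (const chunk of chunks) {
    buffer += chunk;
    let start = 0;
    let newline = buffer.indexOf("\n", start);
    while (newline !== -1) {
      const line = buffer.slice(start, newline).trim();
      if (line.length > 0) {
        await handler(JSON.parse(line));
      }
      start = newline + 1;
      newline = buffer.indexOf("\n", start);
    }
    // Keep only the unfinished tail; complete lines have been handled.
    buffer = buffer.slice(start);
  }
  // The last entry may not be terminated by a newline.
  const rest = buffer.trim();
  if (rest.length > 0) {
    await handler(JSON.parse(rest));
  }
}
```

In Node, a stream created with `fs.createReadStream(path, { encoding: "utf8" })` is async iterable and yields strings, so it could be passed as `chunks`. Because only the unfinished tail of the input is buffered, memory use stays bounded by the chunk size plus the longest single entry, which is consistent with the reference implementation running out of memory on the 520 MB file while the streaming versions complete.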