Add streaming 'jsonl' parser #3831

asgerf · 2024-11-19T10:36:34Z

Replaces the jsonl parser with a streaming version that is at least as fast as the sync version. Also see original PR against the hackathon branch.

Thanks to @esbena for the initial streaming version used for the hackathon. That version used the readline library to split lines, which turned out to be a bottleneck and made the streaming parser slower than the original sync version. So I wrote one that doesn't use readline.
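
As a rough illustration of the idea (not the code in this PR), a reader can split chunks from a file stream itself instead of going through readline. The sketch below assumes one JSON value per line; `LineSplitter` and `readJsonlStreaming` are hypothetical names, and the delimiter handling in the actual parser may differ.

```typescript
import { createReadStream } from "fs";

/** Accumulates text chunks and hands back every line completed so far. */
export class LineSplitter {
  private pending = "";

  /** Feed one chunk; returns the complete, non-empty lines it finished. */
  push(chunk: string): string[] {
    const parts = (this.pending + chunk).split("\n");
    // The last part has no terminating newline yet, so keep it for later.
    this.pending = parts.pop() ?? "";
    return parts.map((p) => p.trim()).filter((p) => p.length > 0);
  }

  /** Call at end of input; returns the final unterminated line, if any. */
  flush(): string[] {
    const rest = this.pending.trim();
    this.pending = "";
    return rest.length > 0 ? [rest] : [];
  }
}

/** Streams a file and calls `handler` once per parsed line (hypothetical helper). */
export async function readJsonlStreaming<T>(
  path: string,
  handler: (value: T) => Promise<void>,
): Promise<void> {
  const splitter = new LineSplitter();
  // With an encoding set, the stream yields strings rather than Buffers.
  const stream = createReadStream(path, { encoding: "utf8" });
  for await (const chunk of stream) {
    for (const line of splitter.push(chunk as string)) {
      await handler(JSON.parse(line) as T);
    }
  }
  for (const line of splitter.flush()) {
    await handler(JSON.parse(line) as T);
  }
}
```

Only the unfinished tail of the current chunk is carried over between reads, so memory use is bounded by the chunk size plus the longest single entry rather than by the file size.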

Running the benchmark on a 21 MB logfile:

readJsonlReferenceImpl: 172.4 ms (original non-streaming version)
readJsonlFile: 283.3 ms (streaming version based on readline)
readJsonlFile2: 151.3 ms (new version without readline)
justReadline: 187.5 ms (consumes the file with readline and nothing else)

On a 520 MB logfile:

readJsonlReferenceImpl: out of memory
readJsonlFile: 6439.4 ms
readJsonlFile2: 3538.4 ms
justReadline: 3664.3 ms
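
The out-of-memory result is what one would expect from a reader that loads the whole file before parsing. The source of `readJsonlReferenceImpl` is not shown here, so the following is only a sketch of that general approach, assuming one JSON value per line; `parseJsonlText` and `readJsonlSync` are hypothetical names.

```typescript
import { readFileSync } from "fs";

/** Parses text that holds one JSON value per line, skipping blank lines. */
export function parseJsonlText<T>(text: string): T[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as T);
}

/** Reads the entire file into one string first, then parses it. */
export function readJsonlSync<T>(path: string): T[] {
  // The full file contents, the split lines, and all parsed values
  // are held in memory at the same time.
  return parseJsonlText<T>(readFileSync(path, "utf8"));
}
```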

I've added the benchmark script, although the project doesn't seem to have much infrastructure for benchmarks. The current build setup has no concept of benchmark scripts, so for now you'll have to run it with something like ts-node, and there are no tests to ensure it keeps working (it is still checked for compilation errors). I'm on the fence about whether it should be committed.
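
The benchmark script itself is not reproduced in this description. A timing loop of roughly the following shape could produce lines in the same format as the results above; `measure` and its `runs` parameter are hypothetical, and the real script may measure differently.

```typescript
import { performance } from "perf_hooks";

/** Runs `fn` several times and reports the mean wall-clock time. */
export async function measure(
  name: string,
  fn: () => Promise<void>,
  runs = 5,
): Promise<string> {
  const times: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await fn();
    times.push(performance.now() - start);
  }
  const mean = times.reduce((sum, t) => sum + t, 0) / times.length;
  return `${name}: ${mean.toFixed(1)} ms`;
}
```

Each reader under test would be wrapped in a function that consumes the whole file with a no-op handler, so that only reading and parsing are timed.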

esbena · 2024-11-19T10:43:16Z

Perhaps add a minor comment about why readFileSync and readline are insufficient alternatives.

esbena

LGTM, but maybe the owning team wants a say as well.

aeisenberg

Looks great. Some minor comments.

extensions/ql-vscode/src/common/jsonl-reader.ts

aeisenberg · 2024-11-19T23:03:05Z

+          await handler(JSON.parse(buffer));
+        } catch (e) {
+          reject(e);
+          return;


Minor: If you move the logger.log statement and the resolve() call into the try block, you won't need a return here.
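
A sketch of what that restructuring might look like, assuming the snippet sits in the handler that runs once the stream has ended. `flushRemainder` and its parameters are made up for illustration and do not appear in the diff.

```typescript
/** Handles whatever text is left in the buffer once the input has ended. */
export async function flushRemainder<T>(
  buffer: string,
  handler: (value: T) => Promise<void>,
  log: (message: string) => void,
  resolve: () => void,
  reject: (error: unknown) => void,
): Promise<void> {
  try {
    if (buffer.trim().length > 0) {
      await handler(JSON.parse(buffer) as T);
    }
    // Logging and resolving live inside the try block, so they are
    // skipped automatically when parsing or the handler throws.
    log("Finished reading JSONL file");
    resolve();
  } catch (e) {
    // No early return needed: nothing follows the try/catch.
    reject(e);
  }
}
```

One side effect of this shape is that an exception thrown by the log call would also end up in `reject`, which seems harmless here.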

aeisenberg · 2024-11-19T23:07:46Z

I think it's fine to check the benchmark script in.

aeisenberg · 2024-11-19T23:08:47Z

extensions/ql-vscode/test/benchmarks/jsonl-reader.bench.ts

Can you add a file comment, or a README describing what the benchmark does and that it's not being run on a regular basis? Also, it would be nice to include the results that you have in the PR description.

aeisenberg

Nice.

asgerf added 2 commits November 19, 2024 11:23

Add streaming jsonl parser

fac7961

Add benchmark script

2cde3b9

The current build setup doesn't seem to have a concept for benchmark scripts, so for now you'll have to run it with something like ts-node.

Explain why we need to stream and why not use readline

bb1da9c

asgerf marked this pull request as ready for review November 19, 2024 13:02

asgerf requested a review from a team as a code owner November 19, 2024 13:02

esbena approved these changes Nov 19, 2024

aeisenberg reviewed Nov 19, 2024

asgerf and others added 4 commits November 20, 2024 11:06

Apply suggestions from code review

38849f7

Co-authored-by: Andrew Eisenberg <[email protected]>

Fix missing import

d05cdf4

Move some calls into the try block

b90cfb6

Add a file comment to the benchmark script

57e2b51

aeisenberg approved these changes Nov 20, 2024

asgerf merged commit b840c38 into github:main Nov 21, 2024
16 checks passed