Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster streaming 'jsonl' parser #3829

Merged
merged 3 commits into from
Nov 19, 2024

Conversation

asgerf
Copy link
Contributor

@asgerf asgerf commented Nov 18, 2024

Replaces the streaming JSONL parser with a faster one that doesn't call readline. The streaming JSONL parser on our branch was slower than the original sync version (but doesn't run out of memory), and it seems readline is a bottleneck:

Running the benchmark on a 21 MB logfile:

  • readJsonlReferenceImpl: 172.4 ms (original non-streaming version)
  • readJsonlFile: 283.3 ms (streaming version based on readline)
  • readJsonlFile2: 151.3 ms (🚀 new version without readline)
  • justReadline: 187.5 ms (consumes the file with readline and nothing else)

Note: targeting the hackathon branch. I think it's best to polish that up and try to merge it into main.

@asgerf asgerf requested a review from esbena November 18, 2024 07:48
@asgerf asgerf requested a review from a team as a code owner November 18, 2024 07:48
@esbena
Copy link

esbena commented Nov 18, 2024

nice! I would expect the non-streaming version to be faster for small files like the one you benchmarked on, and would have advocated that as the default for small files.

For completeness, can you try on some larger files like https://github.com/codeql-dca-runners/codeql-dca-worker_javascript/actions/runs/11840886420/artifacts/2188663427 (amphtml on https://github.com/github/codeql-dca-main/issues/24803)

(you need to run codeql generate log-summary --format=predicates in.jsonl in.summary.jsonl to get the right format)

@esbena
Copy link

esbena commented Nov 18, 2024

(on second thought, that will be a lot of work, so don't worry about it)

@asgerf
Copy link
Contributor Author

asgerf commented Nov 18, 2024

I ran it anyway:

  • readJsonlReferenceImpl: out of memory
  • readJsonlFile: 6439.4 ms
  • readJsonlFile2: 3538.4 ms
  • justReadline: 3664.3 ms

@esbena
Copy link

esbena commented Nov 18, 2024

Excellent. That is indeed convincing!

@aeisenberg
Copy link
Contributor

Wow. This is great. Is this something you want to contribute to the main branch?

Also, do you have a hypothesis why readline is so slow?

@esbena
Copy link

esbena commented Nov 18, 2024

My two guesses:

@asgerf
Copy link
Contributor Author

asgerf commented Nov 19, 2024

(@aeisenberg) Is this something you want to contribute to the main branch?

Yes the intent is to get this into main along with the 'Compare Performance' UI. But for benchmarking purposes it was easier to target the hackathon branch initially since that's where the previous streaming implementation lived.

@asgerf
Copy link
Contributor Author

asgerf commented Nov 19, 2024

Although nobody formally 'Approved' the PR I'm gonna merge to the hackathon branch so I can start preparing the PR against main.

@asgerf asgerf merged commit 93dacb0 into github:hackathon/compare-perf Nov 19, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants