[Code Search, II] A tree-sitter parser alternative #1160

Ellpeck
Member

@Ellpeck Ellpeck commented Nov 19, 2024

No description provided.

@Ellpeck Ellpeck linked an issue Nov 19, 2024 that may be closed by this pull request
@EagleoutIce
Member

EagleoutIce commented Nov 22, 2024

Still open (to be done within this PR or to be outsourced):

  • Switch to the npm package once available, to avoid the wasm file and (hopefully) the init step
    • Still allow the wasm backend (e.g., if we want to develop/explore part of it) with a user-supplied .wasm file.
  • For this we maybe want to create a startup configuration (like a feature list that determines what things are done/initialized on startup) (Julian) (Reconsider Startup Configuration #942)
    • Allow for configuration to switch between the tree sitter version and the RShell
  • Verify that docker etc. work (we should incorporate this into the system tests as done by @gigalasr in Basic System Tests for flowR #1164) -> needs RShell etc. to work
  • pass the TreeSitterExecutor object similar to the RShell (Julian)
  • do not use an RShellExecutor when using TreeSitter for Sourcing (Julian)
  • Verify the source tests with tree-sitter too
  • Reflect the backend in the HelloMessage and the other server Requests
  • Document the new backend and the way we can configure it (Flo, later)
  • Also execute the benchmarks with tree-sitter and the R shell separately
  • Fix the three-ish errors that occur with the tree-sitter parser in the social-science benchmark suite

Please add any further items, @Ellpeck.
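
As a rough illustration of the configuration-related items in the list above (switching between the tree-sitter version and the RShell, an optional user-supplied .wasm file, and reflecting the backend in the hello message), here is a minimal sketch. Every name in it (`EngineKind`, `StartupConfig`, `ParserBackend`, `createBackend`, `helloMessage`, and the example .wasm path) is hypothetical and not taken from flowR's actual API; it only shows one possible shape for such a switch.

```typescript
// Hypothetical names throughout; none of these are flowR's real types or functions.
type EngineKind = 'r-shell' | 'tree-sitter';

// Assumed startup configuration: which engine to use and, optionally,
// a user-supplied .wasm grammar for the tree-sitter engine.
interface StartupConfig {
    engine: EngineKind;
    wasmPath?: string;
}

// Stand-in for whatever object is passed around instead of a concrete shell.
interface ParserBackend {
    readonly kind: EngineKind;
    readonly description: string;
}

// Pick the backend from the configuration instead of hard-coding the R shell.
function createBackend(config: StartupConfig): ParserBackend {
    if (config.engine === 'tree-sitter') {
        const grammar = config.wasmPath ?? 'bundled grammar';
        return { kind: 'tree-sitter', description: `tree-sitter (${grammar})` };
    }
    return { kind: 'r-shell', description: 'R shell' };
}

// Report the active backend to clients, e.g. as part of a hello message.
function helloMessage(backend: ParserBackend): { type: 'hello'; engine: EngineKind } {
    return { type: 'hello', engine: backend.kind };
}

// The path below is a placeholder, not a file that ships with the project.
const backend = createBackend({ engine: 'tree-sitter', wasmPath: './tree-sitter-r.wasm' });
console.log(backend.description, helloMessage(backend));
```

Keeping the engine choice in one configuration object would let both the sourcing logic and the server requests ask the same `ParserBackend` which engine is active, rather than each assuming an R shell.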

@EagleoutIce EagleoutIce changed the base branch from main to new-parser-staging November 22, 2024 07:04
@Ellpeck
Member Author

Ellpeck commented Dec 19, 2024

interim benchmark on artificial suite, it's looking gooooooooood

R SHELL
AST retrieval:                   169:604000ms -    520:805700ms (median:    188:329100ms, mean:    217:305623ms, std:     85:662448ms)
AST retrieval per token:           0:037992ms -     16:838227ms (median:      0:465445ms, mean:      2:977281ms, std:      4:075893ms)
AST retrieval per R token:         0:020693ms -      9:748447ms (median:      0:262474ms, mean:      1:721018ms, std:      2:367208ms)
AST normalization:                 2:698600ms -     72:282900ms (median:      5:148000ms, mean:     10:763441ms, std:     18:270777ms)
AST normalization per token:       0:005473ms -      0:245327ms (median:      0:014127ms, mean:      0:053661ms, std:      0:061915ms)
AST normalization per R token:     0:002981ms -      0:142032ms (median:      0:007828ms, mean:      0:030998ms, std:      0:036063ms)

TREE SITTER
AST retrieval:                     3:339000ms -     25:667300ms (median:      4:349800ms, mean:      6:194559ms, std:      5:404490ms)
AST retrieval per token:           0:001699ms -      0:311318ms (median:      0:014314ms, mean:      0:061600ms, std:      0:078489ms)
AST retrieval per R token:         0:001012ms -      0:214031ms (median:      0:009550ms, mean:      0:040615ms, std:      0:053337ms)
AST normalization:                 2:292700ms -     68:047400ms (median:      8:915700ms, mean:     12:703023ms, std:     16:045872ms)
AST normalization per token:       0:004327ms -      0:208427ms (median:      0:026130ms, mean:      0:059023ms, std:      0:057071ms)
AST normalization per R token:     0:002693ms -      0:143294ms (median:      0:016274ms, mean:      0:038604ms, std:      0:038610ms)

this is with all apparent slicing issues in tree sitter parsing fixed!

social science suite:

R SHELL
Summarized: 50 requests and 12710 slices
Shell init time:                   6:270800ms -     10:907500ms (median:      6:858000ms, mean:      7:262062ms, std:      1:062877ms)
AST retrieval:                   189:291000ms -    377:633600ms (median:    225:340700ms, mean:    240:042182ms, std:     46:944565ms)
AST retrieval per token:           0:054312ms -      1:148321ms (median:      0:336911ms, mean:      0:339304ms, std:      0:246117ms)
AST retrieval per R token:         0:029874ms -      0:626686ms (median:      0:184206ms, mean:      0:189423ms, std:      0:136088ms)
AST normalization:                 4:839700ms -     73:103400ms (median:      8:497000ms, mean:     14:493988ms, std:     13:104362ms)
AST normalization per token:       0:006348ms -      0:032733ms (median:      0:010613ms, mean:      0:013018ms, std:      0:006595ms)
AST normalization per R token:     0:003491ms -      0:017897ms (median:      0:005775ms, mean:      0:007244ms, std:      0:003531ms)

TREE SITTER
Summarized: 45 requests and 11224 slices
Shell init time:                   5:777900ms -     11:120500ms (median:      6:596900ms, mean:      6:962200ms, std:      1:145879ms)
AST retrieval:                     3:775400ms -     18:962800ms (median:      6:510800ms, mean:      7:855360ms, std:      3:482286ms)
AST retrieval per token:           0:002727ms -      0:023071ms (median:      0:008657ms, mean:      0:009514ms, std:      0:005599ms)
AST retrieval per R token:         0:001365ms -      0:011926ms (median:      0:004386ms, mean:      0:004980ms, std:      0:002877ms)
AST normalization:                 6:918300ms -     47:157200ms (median:     16:228100ms, mean:     17:194524ms, std:      8:229940ms)
AST normalization per token:       0:006666ms -      0:048566ms (median:      0:019909ms, mean:      0:019410ms, std:      0:009571ms)
AST normalization per R token:     0:003394ms -      0:024213ms (median:      0:009944ms, mean:      0:010154ms, std:      0:004850ms)

some errors still but waaaay less time than the r shell
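
To put a number on "waaaay less time", the medians from the social science suite above can be compared directly. The sketch below assumes that a value such as `225:340700ms` means 225 whole milliseconds followed by six digits of remaining nanoseconds (that is, 225.3407 ms); the helper names `parseMs` and `speedup` are made up for this illustration. Note that the tree-sitter run summarized 45 requests versus 50 for the R shell, so the comparison is indicative rather than exact.

```typescript
// Parse a duration such as "225:340700ms" into milliseconds.
// Assumption: "<whole milliseconds>:<six digits of nanoseconds>ms".
function parseMs(text: string): number {
    const match = /^\s*(\d+):(\d{6})ms\s*$/.exec(text);
    if (match === null) {
        throw new Error(`unexpected duration format: ${text}`);
    }
    return Number(match[1]) + Number(match[2]) / 1e6;
}

// Ratio of R shell time to tree-sitter time; values above 1 favor tree-sitter.
function speedup(rShell: string, treeSitter: string): number {
    return parseMs(rShell) / parseMs(treeSitter);
}

// Median values copied from the social science suite output above.
const retrieval = speedup('225:340700ms', '6:510800ms');
const normalization = speedup('8:497000ms', '16:228100ms');

console.log(`AST retrieval (median):     ${retrieval.toFixed(1)}x`);
console.log(`AST normalization (median): ${normalization.toFixed(2)}x`);
```

Under that reading, median AST retrieval is more than 30 times faster with tree-sitter, while median AST normalization is roughly half as fast, so retrieval dominates the overall saving.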

@Ellpeck Ellpeck marked this pull request as ready for review December 23, 2024 09:50
@EagleoutIce EagleoutIce merged commit d6cf3d2 into new-parser-staging Dec 23, 2024
7 checks passed
@EagleoutIce EagleoutIce deleted the 1126-code-search-ii-a-tree-sitter-parser-alternative branch December 23, 2024 10:21
Development

Successfully merging this pull request may close these issues.

[Code Search, II] A tree-sitter parser alternative
2 participants