Optimize the normal form detection #123
even_rows was always called together with every_row_has_delim, which could mean two full scans of all the rows. Joining the two functions saves one of those scans. The logic previously implemented by even_rows now also exits early whenever possible (the previous implementation scanned the whole file no matter what).
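A rough sketch of the merged check described above, not the code from this PR. The name even_rows_with_delim and its signature are hypothetical, and cells are counted with a naive str.split that ignores quoting and escaping:

```python
from typing import List, Optional


def even_rows_with_delim(rows: List[str], delimiter: str) -> bool:
    """Hypothetical merged check: a single pass that stops at the first failure."""
    expected: Optional[int] = None
    for row in rows:
        # Former "every row has the delimiter" check: fail on the first row without it.
        if delimiter not in row:
            return False
        # Former "even rows" check: every row must have the same number of cells.
        # Assumption: a plain split is enough here (no quoting or escaping).
        n_cells = len(row.split(delimiter))
        if expected is None:
            expected = n_cells
        elif n_cells != expected:
            return False
    # Note: an empty list of rows passes trivially in this sketch.
    return True
```

Because the function returns on the first row that violates either condition, a malformed file is rejected without reading the remaining rows.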
Sorry for the failed build; I have amended the formatting issues.
clevercsv/normal_form.py
@@ -62,7 +62,7 @@ def detect_dialect_normal(
         return None

     form_and_dialect: List[
-        Tuple[int, Callable[[str, SimpleDialect], bool], SimpleDialect]
+        Tuple[int, Callable[[list[str], SimpleDialect], bool], SimpleDialect]
I think the build is failing because you need `from typing import List` here for Python 3.8 (also in a few places below).
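For context, subscripting the built-in list (as in list[str]) is only supported at runtime from Python 3.9 onwards, so code that must run on 3.8 uses the typing aliases. A minimal illustration, where FormCheck and count_cells are hypothetical names that are not part of CleverCSV:

```python
from typing import Callable, List

# Hypothetical alias for a check that receives the already-split rows.
FormCheck = Callable[[List[str]], bool]


def count_cells(rows: List[str], delimiter: str) -> List[int]:
    """Return the number of cells per row, using a naive split (illustration only)."""
    return [len(row.split(delimiter)) for row in rows]
```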
Oh, that is probably it. I am so used to the lowercase versions that I did not think of it :) Thank you for your patience; I will fix it as soon as I can.
I changed the types, hopefully it will pass now :)
Thanks for opening this PR @no23reason! Looks like there are just a few build failures to iron out, but other than that it looks good.
Avoid unnecessary splitting and joining of the rows. The previous implementation split the file into rows separately in each of the is_form_x functions, always in the same way. Instead, we can split the file once and pass the lines to the is_form_x functions directly. This also avoids re-joining the lines in is_form_5 when it calls is_form_2. The test_normal_forms test inputs were changed accordingly: they are split on `\n` and the trailing newlines were removed manually (the actual code always strips the trailing newlines before calling the is_form_x functions).
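The split-once idea can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: split_rows, first_passing_check, and the two example checks are hypothetical stand-ins for the is_form_x functions.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical type for a check that receives the already-split rows.
RowCheck = Callable[[List[str]], bool]


def split_rows(data: str) -> List[str]:
    """Strip trailing newlines, then split the data into rows exactly once."""
    return data.rstrip("\n").split("\n")


def first_passing_check(
    data: str, checks: List[Tuple[str, RowCheck]]
) -> Optional[str]:
    """Return the name of the first check that accepts the rows, or None."""
    rows = split_rows(data)  # the only split; every check reuses `rows`
    for name, check in checks:
        if check(rows):
            return name
    return None


# Two toy checks standing in for the real normal-form tests.
checks: List[Tuple[str, RowCheck]] = [
    (
        "all_rows_quoted",
        lambda rows: all(r.startswith('"') and r.endswith('"') for r in rows),
    ),
    ("all_rows_have_comma", lambda rows: all("," in r for r in rows)),
]
```

Since every check takes a list of rows, one check can call another with the same list and no re-joining is needed in between.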
Thanks again @no23reason!
Thank you, especially for the patience with the mistakes I should have caught faster :)
This PR aims to avoid as many full-file scans as possible, which should improve the performance of the normal form detection.
Steps taken (there are more details in the individual commits):
- Reworked the even_rows logic so that it exits as soon as possible instead of going through the whole file
- Split the file into rows once and pass them to the is_form_x functions