Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize the normal form detection #123

Merged
merged 2 commits into from
Mar 21, 2024

Conversation

no23reason
Copy link
Contributor

Aimed at avoiding as much full file scans as possible, this PR should bring improved performance of the normal form detection.

Steps taken (there are more details in the individual commits):

  • optimize the even_rows logic so that it exits as soon as possible instead of going through the whole file
  • avoid repeated file splitting by splitting the file once and passing the split rows to the individual is_form_x functions

The even_rows was always called with every_row_has_delim. This meant possibly two full scans of all the rows.

By joining those two functions, we can save one of the scans. Also, the logic previously implemented by even_rows now exits early whenever possible (previous implementation used to scan the whole file no matter what).
@no23reason
Copy link
Contributor Author

Sorry for the failed build, I amended the formatting issues.

@@ -62,7 +62,7 @@ def detect_dialect_normal(
return None

form_and_dialect: List[
Tuple[int, Callable[[str, SimpleDialect], bool], SimpleDialect]
Tuple[int, Callable[[list[str], SimpleDialect], bool], SimpleDialect]
Copy link
Collaborator

@GjjvdBurg GjjvdBurg Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the build is failing because you need from typing import List here for Python 3.8 (also in a few places below).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that is probably it, I am so used to the liwercase versions I did not think of it :) thank you for the patience, I will fix it as soon as I can

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the types, hopefully it will pas now :)

@GjjvdBurg
Copy link
Collaborator

Thanks for opening this PR @no23reason! Looks like there are just a few build failures to iron out, but other than that it looks good

Avoid unnecessary splitting and joining of the rows. The current implementation would split the file into rows in each of the is_form_x separately. They all would do it the same way.

So instead, we can split the file once and pass the lines to the is_form_x directly. It also allows us to avoid "re-joining" of the lines in is_form_5 when it calls the is_form_2.

The test_normal_forms test inputs were changed accordingly: they are split by the `\n` and the trailing newlines were manually removed (the actual code will always strip the trailing newlines before calling the is_form_x functions).
@GjjvdBurg GjjvdBurg merged commit 7098c4f into alan-turing-institute:master Mar 21, 2024
7 checks passed
@GjjvdBurg
Copy link
Collaborator

Thanks again @no23reason!

@no23reason
Copy link
Contributor Author

Thank you, especially for the patience with the mistakes I should have caught faster :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants