Skip to content

Optimize the normal form detection#123

Merged
GjjvdBurg merged 2 commits intoalan-turing-institute:masterfrom
no23reason:optimize
Mar 21, 2024
Merged

Optimize the normal form detection#123
GjjvdBurg merged 2 commits intoalan-turing-institute:masterfrom
no23reason:optimize

Conversation

@no23reason
Copy link
Copy Markdown
Contributor

Aimed at avoiding as much full file scans as possible, this PR should bring improved performance of the normal form detection.

Steps taken (there are more details in the individual commits):

  • optimize the even_rows logic so that it exits as soon as possible instead of going through the whole file
  • avoid repeated file splitting by splitting the file once and passing the split rows to the individual is_form_x functions

The even_rows was always called with every_row_has_delim. This meant possibly two full scans of all the rows.

By joining those two functions, we can save one of the scans. Also, the logic previously implemented by even_rows now exits early whenever possible (previous implementation used to scan the whole file no matter what).
@no23reason
Copy link
Copy Markdown
Contributor Author

Sorry for the failed build, I amended the formatting issues.

Comment thread clevercsv/normal_form.py Outdated

form_and_dialect: List[
Tuple[int, Callable[[str, SimpleDialect], bool], SimpleDialect]
Tuple[int, Callable[[list[str], SimpleDialect], bool], SimpleDialect]
Copy link
Copy Markdown
Collaborator

@GjjvdBurg GjjvdBurg Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the build is failing because you need from typing import List here for Python 3.8 (also in a few places below).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that is probably it, I am so used to the liwercase versions I did not think of it :) thank you for the patience, I will fix it as soon as I can

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the types, hopefully it will pas now :)

@GjjvdBurg
Copy link
Copy Markdown
Collaborator

Thanks for opening this PR @no23reason! Looks like there are just a few build failures to iron out, but other than that it looks good

Avoid unnecessary splitting and joining of the rows. The current implementation would split the file into rows in each of the is_form_x separately. They all would do it the same way.

So instead, we can split the file once and pass the lines to the is_form_x directly. It also allows us to avoid "re-joining" of the lines in is_form_5 when it calls the is_form_2.

The test_normal_forms test inputs were changed accordingly: they are split by the `\n` and the trailing newlines were manually removed (the actual code will always strip the trailing newlines before calling the is_form_x functions).
@GjjvdBurg GjjvdBurg merged commit 7098c4f into alan-turing-institute:master Mar 21, 2024
@GjjvdBurg
Copy link
Copy Markdown
Collaborator

Thanks again @no23reason!

@no23reason
Copy link
Copy Markdown
Contributor Author

Thank you, especially for the patience with the mistakes I should have caught faster :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants