Optimize the normal form detection #123

no23reason · 2024-03-17T16:43:37Z

Aimed at avoiding as many full file scans as possible, this PR should improve the performance of the normal form detection.

Steps taken (there are more details in the individual commits):

optimize the even_rows logic so that it exits as soon as possible instead of going through the whole file
avoid repeated file splitting by splitting the file once and passing the split rows to the individual is_form_x functions

even_rows was always called together with every_row_has_delim, which could mean two full scans of all the rows. By merging the two functions, we save one of the scans. Also, the logic previously implemented by even_rows now exits early whenever possible (the previous implementation scanned the whole file no matter what).
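As a rough illustration of the merged, early-exit check described above, here is a minimal sketch. It is not the code from this PR: the function name `rows_even_and_delimited` is hypothetical, it takes a plain delimiter string instead of a dialect object, and it splits rows naively without any quote handling.

```python
from typing import List, Optional


def rows_even_and_delimited(rows: List[str], delim: str) -> bool:
    """Single pass: every row must contain `delim` and all rows must
    split into the same number of cells. Hypothetical helper."""
    expected: Optional[int] = None
    for row in rows:
        if delim not in row:
            return False  # early exit: this row has no delimiter
        n_cells = len(row.split(delim))  # naive split, ignores quoting
        if expected is None:
            expected = n_cells  # first row sets the reference width
        elif n_cells != expected:
            return False  # early exit: uneven row found
    # Assumption for this sketch: an empty list of rows does not qualify.
    return expected is not None
```

Both conditions are checked in the same loop, so a file that fails on an early row is rejected without touching the remaining rows.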

no23reason · 2024-03-18T08:08:11Z

Sorry for the failed build, I amended the formatting issues.

GjjvdBurg · 2024-03-18T23:07:54Z

clevercsv/normal_form.py

@@ -62,7 +62,7 @@ def detect_dialect_normal(
            return None

    form_and_dialect: List[
-        Tuple[int, Callable[[str, SimpleDialect], bool], SimpleDialect]
+        Tuple[int, Callable[[list[str], SimpleDialect], bool], SimpleDialect]


I think the build is failing because you need from typing import List here for Python 3.8 (also in a few places below).

Oh, that is probably it. I am so used to the lowercase versions that I did not think of it :) Thank you for your patience, I will fix it as soon as I can.

I changed the types, hopefully it will pass now :)
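For reference, a minimal sketch of annotations that also work on Python 3.8, where subscripting the built-in `list` in an evaluated annotation raises a `TypeError`. The `SimpleDialect` class, the `FormCheck` alias, and `has_delim` below are simplified stand-ins for illustration, not the project's actual definitions.

```python
from typing import Callable, List, Tuple


class SimpleDialect:
    """Stand-in for the real dialect class (assumption for this sketch)."""

    def __init__(self, delimiter: str) -> None:
        self.delimiter = delimiter


# typing.List / typing.Tuple are subscriptable on Python 3.8,
# whereas list[str] only became valid at runtime in Python 3.9.
FormCheck = Callable[[List[str], SimpleDialect], bool]
form_and_dialect: List[Tuple[int, FormCheck, SimpleDialect]] = []


def has_delim(rows: List[str], dialect: SimpleDialect) -> bool:
    # Hypothetical check, used only to populate the list above.
    return all(dialect.delimiter in row for row in rows)


form_and_dialect.append((1, has_delim, SimpleDialect(",")))
```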

GjjvdBurg · 2024-03-18T23:36:43Z

Thanks for opening this PR @no23reason! Looks like there are just a few build failures to iron out, but other than that it looks good.

Avoid unnecessary splitting and joining of the rows. The current implementation would split the file into rows in each of the is_form_x functions separately, always in the same way. So instead, we can split the file once and pass the rows to the is_form_x functions directly. This also lets us avoid "re-joining" the lines in is_form_5 when it calls is_form_2. The test_normal_forms test inputs were changed accordingly: they are split on `\n` and the trailing newlines were manually removed (the actual code will always strip the trailing newlines before calling the is_form_x functions).
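The split-once idea described above could look roughly like the sketch below. The names `is_form_a`, `is_form_b`, and `detect_form` are hypothetical and the checks are deliberately simplistic. Only the data flow is the point: the caller splits the data once, and one check can hand a list of rows to another without joining and re-splitting.

```python
from typing import List, Optional


def is_form_a(rows: List[str], delim: str) -> bool:
    # Hypothetical check: unquoted rows that all contain the delimiter.
    return all(delim in row and '"' not in row for row in rows)


def is_form_b(rows: List[str], delim: str) -> bool:
    # Hypothetical check: each row is wrapped in double quotes and the
    # unwrapped rows satisfy is_form_a.
    if not all(len(r) >= 2 and r[0] == '"' and r[-1] == '"' for r in rows):
        return False
    inner = [r[1:-1] for r in rows]
    return is_form_a(inner, delim)  # rows are passed on directly, no re-join


def detect_form(data: str, delim: str) -> Optional[str]:
    # Strip trailing newlines and split exactly once for all checks.
    rows = data.rstrip("\n").split("\n")
    for name, check in (("a", is_form_a), ("b", is_form_b)):
        if check(rows, delim):
            return name
    return None
```

For example, `detect_form('"a,b"\n"c,d"\n', ",")` goes through `is_form_b`, which reuses `is_form_a` on the already-split rows.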

GjjvdBurg · 2024-03-21T20:38:34Z

Thanks again @no23reason!

no23reason · 2024-03-22T09:48:02Z

Thank you, especially for the patience with the mistakes I should have caught faster :)

no23reason force-pushed the optimize branch from b04b6c6 to 5d50da2 Compare March 18, 2024 08:04

GjjvdBurg reviewed Mar 18, 2024

no23reason force-pushed the optimize branch from 5d50da2 to d67e172 Compare March 19, 2024 08:07

GjjvdBurg merged commit 7098c4f into alan-turing-institute:master Mar 21, 2024
7 checks passed
