
Performance degrades exponentially as the number of lines in a file increases #939

Open
concur1 opened this issue Nov 8, 2024 · 18 comments · Fixed by #940
Labels: enhancement (New feature or request), rust (Pull requests that update Rust code)

Comments

concur1 (Contributor) commented Nov 8, 2024

  • It seems that the time it takes to lint a file increases exponentially with the size of the file.
  • It seems that performance with the BigQuery dialect is worse than with other SQL dialects (e.g. ANSI): a fairly simple BigQuery file with 400 lines takes ~2 seconds to lint.

I have tested this by generating SQL files with an increasing number of lines, linting each file, and measuring the time it takes.

Here is the result in graph form:
[image: plot of lint time vs. number of lines in the file]

In case you would like to try to replicate this, here is my config file:

[sqlfluff]
dialect = bigquery
max_line_length = 120
exclude_rules = CP02, LT01

[sqlfluff:indentation]
tab_space_size = 2

[sqlfluff:rules:capitalisation.keywords]
extended_capitalisation_policy = upper

[sqlfluff:rules:capitalisation.functions]
extended_capitalisation_policy = lower

[sqlfluff:rules:capitalisation.types]
extended_capitalisation_policy = upper

Here is the final file with 608 lines: test_sql.txt (I couldn't upload a .sql file here for some reason).

Here is the Python script I used to generate the different files:

import random
import subprocess
import time

initial_statement_sql = """CREATE TABLE IF NOT EXISTS `project_id.dataset_id.table_id` (
  id STRING NOT NULL,
  product STRING NOT NULL,
  cost INTEGER NOT NULL,
  timestamp_added TIMESTAMP NOT NULL
);

INSERT INTO `project_id.dataset_id.table_id` (
    id,
    product,
    cost,
    timestamp_added
)
VALUES"""


def run_sqruff_check(size_param):
    # Build a test file: the initial statement is 14 lines and each VALUES
    # block adds 6 lines, so the file has (size_param * 6) + 14 lines.
    insert_value_strings_list = []
    for i in range(size_param):
        insert_value_strings_list.append(f"""(
        {i},
        {random.choice(["bike", "car", "tv", "orange"])},
        {random.choice([100, 200, 300, 400])},
        current_timestamp()
    )""")

    output = initial_statement_sql + "\n" + ",\n".join(insert_value_strings_list)

    with open("./test.sql", "w") as file:
        file.write(output)

    # Time one lint run and append "<lines>, <seconds>" to results.csv.
    start_time = time.time()
    subprocess.run(["sqruff", "lint"], check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    with open("./results.csv", "a") as file:
        file.write(f"{(size_param * 6) + 14}, {time.time() - start_time}\n")


for size_param in range(100):
    run_sqruff_check(size_param)
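
For anyone following along: results.csv can be turned into a plot with something like this (a minimal sketch; it assumes matplotlib is installed, which the script above does not need):

import csv

import matplotlib.pyplot as plt

# results.csv rows look like "<lines>, <seconds>"
lines, seconds = [], []
with open("./results.csv") as file:
    for row in csv.reader(file):
        lines.append(int(row[0]))
        seconds.append(float(row[1]))

plt.plot(lines, seconds, marker=".")
plt.xlabel("lines in file")
plt.ylabel("lint time (seconds)")
plt.show()
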
@gvozdvmozgu (Collaborator)
[image: profiling screenshot]
It seems the issue is with the lexer, specifically with the regular expressions.
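
For a generic illustration of how this kind of thing degrades (a Python sketch, not sqruff's actual lexer): if a lexer slices off the remaining input for every token, each slice copies O(n) characters, so lexing n tokens costs O(n²) even when every individual regex match is cheap. Matching at an offset instead keeps it linear:

import re
import time

# Every character matches exactly one alternative, so match() never fails here.
TOKEN = re.compile(r"\s+|\w+|[^\w\s]")

def lex_with_slices(text):
    tokens, rest = [], text
    while rest:
        m = TOKEN.match(rest)
        tokens.append(m.group())
        rest = rest[m.end():]  # copies the whole remainder on every token
    return tokens

def lex_with_positions(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)  # match at an offset, no copying
        tokens.append(m.group())
        pos = m.end()
    return tokens

src = "SELECT 1;\n" * 5_000
for lexer in (lex_with_slices, lex_with_positions):
    start = time.time()
    lexer(src)
    print(f"{lexer.__name__}: {time.time() - start:.3f}s")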

concur1 (Contributor, Author) commented Nov 9, 2024

This probably fixes it for my use case, but I think the performance still degrades exponentially, just more slowly. I have a plot demonstrating this below, but first I just want to say thank you for looking into this @gvozdvmozgu and @benfdking; I didn't expect a fix so quickly! This really improves things for my use case.

Here is the plot with lines for both before and after the fix, if you are interested:
[image: before/after plot of lint time vs. number of lines]

Here is a plot comparing sqruff after the fix to sqlfluff. I know this is a special case, as it is a simple example; in my experience sqruff will generally beat sqlfluff in performance on more complex files.

Even so, I want to share this plot, which shows the sqruff lint time increasing exponentially while the sqlfluff lint time looks like it is increasing linearly, or at least with a much less steep curve. The sqlfluff byte limit kicks in a bit after 1500 lines, which is why there is a drop-off in the sqlfluff times.
[image: sqruff vs. sqlfluff lint time plot]

benfdking reopened this Nov 9, 2024
@gvozdvmozgu (Collaborator)
How can I replicate this?

concur1 (Contributor, Author) commented Nov 9, 2024

> How can I replicate this?

I created the data for all of these plots with the Python script above. Running the script creates a file called results.csv; you just need to rename or move the results file after each run.

To run against the latest sqruff I changed this line:

subprocess.run(["sqruff", "lint"], check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

to instead be:

subprocess.run(["cargo", "run", "lint"], check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

(This works if you are running the script from the cli directory with the .sqruff file present, as cargo run builds and runs the sqruff CLI there.)

To run for sqlfluff, sqlfluff must be installed and that line must be changed to:

subprocess.run(["sqlfluff", "lint"], check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

Hope that helps. If not, I will try to create a notebook, which will hopefully be a bit more reproducible, and share that with you.

concur1 (Contributor, Author) commented Nov 9, 2024

One other thing: you will want to change the value inside the range function to control how many loops there are and how many files are created.

To create that last graph there were 300 loops, i.e. the loop at the bottom of the script becomes:
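
# 300 iterations: the largest generated file has (299 * 6) + 14 = 1808 lines
for size_param in range(300):
    run_sqruff_check(size_param)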

concur1 (Contributor, Author) commented Nov 9, 2024

One more thing: when running for sqlfluff you need to create the .sqlfluff file with the config that I gave above.

concur1 (Contributor, Author) commented Nov 9, 2024

OK, I have made a single script that should work as long as you have sqlfluff installed and you run it from the cli directory of the sqruff repo. It will create two files: sqruff_results.csv and sqlfluff_results.csv:

import random
import subprocess
import time

initial_statement_sql = """CREATE TABLE IF NOT EXISTS `project_id.dataset_id.table_id` (
  id STRING NOT NULL,
  product STRING NOT NULL,
  cost INTEGER NOT NULL,
  timestamp_added TIMESTAMP NOT NULL
);

INSERT INTO `project_id.dataset_id.table_id` (
    id,
    product,
    cost,
    timestamp_added
)
VALUES"""


def run_sqruff_check(size_param: int, output_filename: str, command: list):
    # Build a file with (size_param * 6) + 14 lines, lint it with the given
    # command, and append "<lines>, <seconds>" to the output CSV.
    insert_value_strings_list = []
    for i in range(size_param):
        insert_value_strings_list.append(f"""(
        {i},
        {random.choice(["bike", "car", "tv", "orange"])},
        {random.choice([100, 200, 300, 400])},
        current_timestamp()
    )""")

    output = initial_statement_sql + "\n" + ",\n".join(insert_value_strings_list)

    with open("./test.sql", "w") as file:
        file.write(output)

    start_time = time.time()
    subprocess.run(command, check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    elapsed = time.time() - start_time
    with open(f"./{output_filename}", "a") as file:
        file.write(f"{(size_param * 6) + 14}, {elapsed}\n")
    print(f"{(size_param * 6) + 14}, {elapsed}")


def run_performance_checks(loops: int, output_filename: str, command: list):
    # Both linters read the same minimal .sqlfluff config.
    with open("./.sqlfluff", "w") as file:
        file.write("""
[sqlfluff]
dialect = bigquery
""")

    for size_param in range(loops):
        run_sqruff_check(size_param, output_filename, command)


print("running sqruff performance checks")
run_performance_checks(200, "sqruff_results.csv", ["cargo", "run", "lint"])

print("running sqlfluff performance checks")
run_performance_checks(200, "sqlfluff_results.csv", ["sqlfluff", "lint"])

It only runs up to a file size of 1208 lines, but that is enough to see that the performance of sqruff starts lagging behind the performance of sqlfluff (at least on my machine).

Let me know if it doesn't work for some reason.

benfdking added the enhancement and rust labels Nov 27, 2024
@gvozdvmozgu (Collaborator)
@concur1, please check how PR #1008 affects performance.

concur1 (Contributor, Author) commented Dec 2, 2024

Hi @gvozdvmozgu, I have just run the script for that PR/branch and performance is very much improved =).

Here is the plot comparing sqruff and sqlfluff (the sqlfluff data is only valid below ~1600 lines due to large_file_skip_bytes_limit). In my testing sqruff now performs better than sqlfluff for all file sizes.

[image: sqruff vs. sqlfluff lint time plot after the PR]

@gvozdvmozgu (Collaborator)
Yeah, the final solution would be to remove fancy-regex and replace it with native functions; there are only a few of those patterns, but they're very slow. I started doing this in another PR, but it's really exhausting, to be honest.
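
To illustrate the kind of rewrite meant here (a Python sketch for brevity; the real replacements would be Rust functions in the lexer): a hand-written scanner does a single linear pass with no regex engine or backtracking involved.

def match_backtick_quoted(text: str, pos: int):
    # Return the end offset of a `...`-quoted identifier starting at pos,
    # or None if there is no match at pos. Equivalent to a regex like
    # `[^`]*` anchored at pos, but as a single linear scan.
    if pos >= len(text) or text[pos] != "`":
        return None
    end = text.find("`", pos + 1)
    return None if end == -1 else end + 1

assert match_backtick_quoted("`project_id.dataset_id.table_id` (", 0) == 32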

concur1 (Contributor, Author) commented Dec 2, 2024

Oooof yeah, sounds painful!

To be honest, the performance of the latest branch/PR is more than sufficient for my use case.

Do you think it would be useful to include one of the files I have been testing with in the automated benchmarks? E.g. a check to make sure a 2000-line file is linted within a certain amount of time (on my machine it completes in less than 0.5 seconds, but GHA may differ); a sketch of the kind of check I mean is below.

Let me know if you think it would be useful and I will upload one of these files.
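
Something like this (the benchmark file path and the time limit are placeholders, and it assumes sqruff lint accepts a file argument):

import subprocess
import time

BENCH_FILE = "benchmarks/large_2000_lines.sql"  # placeholder path
LIMIT_SECONDS = 2.0  # generous, to absorb CI runner jitter

start = time.time()
subprocess.run(["sqruff", "lint", BENCH_FILE], check=False,
               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
elapsed = time.time() - start
assert elapsed < LIMIT_SECONDS, f"lint took {elapsed:.2f}s (limit {LIMIT_SECONDS}s)"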

@gvozdvmozgu (Collaborator)
I think we need a separate repository that compares each dialect and each rule individually, as well as all of them together, against the Python implementation, and it should be deployed on GitHub Pages.

concur1 (Contributor, Author) commented Dec 3, 2024

Any chance you could do a release so I can try out the Python package with the latest updates?

@benfdking (Collaborator)
@concur1 I've pushed a new version; if you want to add a long one of your generated files to our benchmarks, that would be useful.

concur1 (Contributor, Author) commented Dec 4, 2024

@benfdking thanks for that! It looks like something went wrong with the Python release and a couple of the jobs were skipped (including the publish to PyPI). Maybe try re-running?

I can probably try to add a large file to the benchmarks next week.

@benfdking (Collaborator)
Oh, I think I know what this is... the old macOS image has been killed. Will look into it.

@benfdking (Collaborator)
That's released and should be on its way out, sorry @concur1.

concur1 (Contributor, Author) commented Dec 5, 2024

No worries, thanks for updating!
