
Performance degrades exponentially as the number of lines in a file increases #939

Open
concur1 opened this issue Nov 8, 2024 · 18 comments · Fixed by #940
Labels: enhancement (New feature or request), rust (Pull requests that update Rust code)

Comments

concur1 (Contributor) commented Nov 8, 2024

  • It seems that the time it takes to lint a file increases exponentially with the size of the file.
  • It seems that performance with the BigQuery dialect is worse than with other SQL dialects (e.g. ANSI): a fairly simple BigQuery file with 400 lines takes ~2 seconds to lint.

I have tested this by generating SQL files with an increasing number of lines, linting each file, and measuring the time it takes.

Here is the result in graph form:
[image: plot of lint time vs. number of lines in the file]

In case you would like to try to replicate this, here is my config file:

[sqlfluff]
dialect = bigquery
max_line_length = 120
exclude_rules = CP02, LT01

[sqlfluff:indentation]
tab_space_size = 2

[sqlfluff:rules:capitalisation.keywords]
extended_capitalisation_policy = upper

[sqlfluff:rules:capitalisation.functions]
extended_capitalisation_policy = lower

[sqlfluff:rules:capitalisation.types]
extended_capitalisation_policy = upper

Here is the final file with 608 lines: test_sql.txt (I couldn't upload a .sql file here for some reason).

Here is the Python script I used to generate the different files:

import random
import subprocess
import time

initial_statement_sql = """CREATE TABLE IF NOT EXISTS `project_id.dataset_id.table_id` (
  id STRING NOT NULL,
  product STRING NOT NULL,
  cost INTEGER NOT NULL,
  timestamp_added TIMESTAMP NOT NULL
);

INSERT INTO `project_id.dataset_id.table_id` (
    id,
    product,
    cost,
    timestamp_added
)
VALUES"""


def run_sqruff_check(size_param):
    # Build a test file: the initial statement is 14 lines and each VALUES
    # block adds 6 lines, so the file has (size_param * 6) + 14 lines.
    insert_value_strings_list = []
    for i in range(size_param):
        insert_value_strings_list.append(f"""(
        {i},
        {random.choice(["bike", "car", "tv", "orange"])},
        {random.choice([100, 200, 300, 400])},
        current_timestamp()
    )""")

    output = initial_statement_sql + "\n" + ",\n".join(insert_value_strings_list)

    with open("./test.sql", "w") as file:
        file.write(output)

    # Time one lint run and append "<lines>, <seconds>" to results.csv.
    start_time = time.time()
    subprocess.run(["sqruff", "lint"], check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    with open("./results.csv", "a") as file:
        file.write(f"{(size_param * 6) + 14}, {time.time() - start_time}\n")


for size_param in range(100):
    run_sqruff_check(size_param)
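
For anyone following along: results.csv can be turned into a plot with something like this (a minimal sketch; it assumes matplotlib is installed, which the script above does not need):

import csv

import matplotlib.pyplot as plt

# results.csv rows look like "<lines>, <seconds>"
lines, seconds = [], []
with open("./results.csv") as file:
    for row in csv.reader(file):
        lines.append(int(row[0]))
        seconds.append(float(row[1]))

plt.plot(lines, seconds, marker=".")
plt.xlabel("lines in file")
plt.ylabel("lint time (seconds)")
plt.show()
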
@gvozdvmozgu (Collaborator)
[image: profiling screenshot]
It seems the issue is with the lexer, specifically with the regular expressions.
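
For a generic illustration of how this kind of thing degrades (a Python sketch, not sqruff's actual lexer): if a lexer slices off the remaining input for every token, each slice copies O(n) characters, so lexing n tokens costs O(n²) even when every individual regex match is cheap. Matching at an offset instead keeps it linear:

import re
import time

# Every character matches exactly one alternative, so match() never fails here.
TOKEN = re.compile(r"\s+|\w+|[^\w\s]")

def lex_with_slices(text):
    tokens, rest = [], text
    while rest:
        m = TOKEN.match(rest)
        tokens.append(m.group())
        rest = rest[m.end():]  # copies the whole remainder on every token
    return tokens

def lex_with_positions(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)  # match at an offset, no copying
        tokens.append(m.group())
        pos = m.end()
    return tokens

src = "SELECT 1;\n" * 5_000
for lexer in (lex_with_slices, lex_with_positions):
    start = time.time()
    lexer(src)
    print(f"{lexer.__name__}: {time.time() - start:.3f}s")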

concur1 (Contributor, Author) commented Nov 9, 2024

This probably fixes it for my use case, but I think the performance still degrades exponentially, just more slowly. I have a plot demonstrating this below, but first I just want to say thank you for looking into this @gvozdvmozgu and @benfdking; I didn't expect a fix so quickly! This really improves things for my use case.

Here is the plot with lines for both before and after the fix, if you are interested:
[image: before/after plot of lint time vs. number of lines]

Here is a plot comparing sqruff after the fix to sqlfluff. I know this is a special case, as it is a simple example; in my experience sqruff will generally beat sqlfluff in performance on more complex files.

Even so, I want to share this plot, which shows the sqruff lint time increasing exponentially while the sqlfluff lint time looks like it is increasing linearly, or at least with a much less steep curve. The sqlfluff byte limit kicks in a bit after 1500 lines, which is why there is a drop-off in the sqlfluff times.
[image: sqruff vs. sqlfluff lint time plot]

benfdking reopened this Nov 9, 2024
@gvozdvmozgu (Collaborator)
How can I replicate this?

concur1 (Contributor, Author) commented Nov 9, 2024

> How can I replicate this?

I created the data for all of these plots with the Python script above. Running the script creates a file called results.csv; you just need to rename or move the results file after each run.

To run against the latest sqruff I changed this line:

subprocess.run(["sqruff", "lint"], check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

to instead be:

subprocess.run(["cargo", "run", "lint"], check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

(This works if you are running the script from the cli directory with the .sqruff file present, as cargo run builds and runs the sqruff CLI there.)

To run for sqlfluff, sqlfluff must be installed and that line must be changed to:

subprocess.run(["sqlfluff", "lint"], check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

Hope that helps. If not, I will try to create a notebook, which will hopefully be a bit more reproducible, and share that with you.

concur1 (Contributor, Author) commented Nov 9, 2024

One other thing: you will want to change the value inside the range function to control how many loops there are and how many files are created.

To create that last graph there were 300 loops, i.e. the loop at the bottom of the script becomes:
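
# 300 iterations: the largest generated file has (299 * 6) + 14 = 1808 lines
for size_param in range(300):
    run_sqruff_check(size_param)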

concur1 (Contributor, Author) commented Nov 9, 2024

One more thing: when running for sqlfluff you need to create the .sqlfluff file with the config that I gave above.

concur1 (Contributor, Author) commented Nov 9, 2024

OK, I have made a single script that should work as long as you have sqlfluff installed and you run it from the cli directory of the sqruff repo. It will create two files: sqruff_results.csv and sqlfluff_results.csv:

import random
import subprocess
import time

initial_statement_sql = """CREATE TABLE IF NOT EXISTS `project_id.dataset_id.table_id` (
  id STRING NOT NULL,
  product STRING NOT NULL,
  cost INTEGER NOT NULL,
  timestamp_added TIMESTAMP NOT NULL
);

INSERT INTO `project_id.dataset_id.table_id` (
    id,
    product,
    cost,
    timestamp_added
)
VALUES"""


def run_sqruff_check(size_param: int, output_filename: str, command: list):
    # Build a file with (size_param * 6) + 14 lines, lint it with the given
    # command, and append "<lines>, <seconds>" to the output CSV.
    insert_value_strings_list = []
    for i in range(size_param):
        insert_value_strings_list.append(f"""(
        {i},
        {random.choice(["bike", "car", "tv", "orange"])},
        {random.choice([100, 200, 300, 400])},
        current_timestamp()
    )""")

    output = initial_statement_sql + "\n" + ",\n".join(insert_value_strings_list)

    with open("./test.sql", "w") as file:
        file.write(output)

    start_time = time.time()
    subprocess.run(command, check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    elapsed = time.time() - start_time
    with open(f"./{output_filename}", "a") as file:
        file.write(f"{(size_param * 6) + 14}, {elapsed}\n")
    print(f"{(size_param * 6) + 14}, {elapsed}")


def run_performance_checks(loops: int, output_filename: str, command: list):
    # Both linters read the same minimal .sqlfluff config.
    with open("./.sqlfluff", "w") as file:
        file.write("""
[sqlfluff]
dialect = bigquery
""")

    for size_param in range(loops):
        run_sqruff_check(size_param, output_filename, command)


print("running sqruff performance checks")
run_performance_checks(200, "sqruff_results.csv", ["cargo", "run", "lint"])

print("running sqlfluff performance checks")
run_performance_checks(200, "sqlfluff_results.csv", ["sqlfluff", "lint"])

It only runs up to a file size of 1208 lines, but that is enough to see that the performance of sqruff starts lagging behind the performance of sqlfluff (at least on my machine).

Let me know if it doesn't work for some reason.

benfdking added the enhancement and rust labels Nov 27, 2024
@gvozdvmozgu (Collaborator)
@concur1, please check how PR #1008 affects performance.

concur1 (Contributor, Author) commented Dec 2, 2024

Hi @gvozdvmozgu, I have just run the script for that PR/branch and performance is very much improved =).

Here is the plot comparing sqruff and sqlfluff (the sqlfluff data is only valid below ~1600 lines due to large_file_skip_bytes_limit). In my testing sqruff now performs better than sqlfluff for all file sizes.

[image: sqruff vs. sqlfluff lint time plot after the PR]

@gvozdvmozgu (Collaborator)
Yeah, the final solution would be to remove fancy-regex and replace it with native functions; there are only a few of those patterns, but they're very slow. I started doing this in another PR, but it's really exhausting, to be honest.
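
To illustrate the kind of rewrite meant here (a Python sketch for brevity; the real replacements would be Rust functions in the lexer): a hand-written scanner does a single linear pass with no regex engine or backtracking involved.

def match_backtick_quoted(text: str, pos: int):
    # Return the end offset of a `...`-quoted identifier starting at pos,
    # or None if there is no match at pos. Equivalent to a regex like
    # `[^`]*` anchored at pos, but as a single linear scan.
    if pos >= len(text) or text[pos] != "`":
        return None
    end = text.find("`", pos + 1)
    return None if end == -1 else end + 1

assert match_backtick_quoted("`project_id.dataset_id.table_id` (", 0) == 32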

concur1 (Contributor, Author) commented Dec 2, 2024

Oooof yeah, sounds painful!

To be honest, the performance of the latest branch/PR is more than sufficient for my use case.

Do you think it would be useful to include one of the files I have been testing with in the automated benchmarks? E.g. a check to make sure a 2000-line file is linted within a certain amount of time (on my machine it completes in less than 0.5 seconds, but GHA may differ); a sketch of the kind of check I mean is below.

Let me know if you think it would be useful and I will upload one of these files.
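
Something like this (the benchmark file path and the time limit are placeholders, and it assumes sqruff lint accepts a file argument):

import subprocess
import time

BENCH_FILE = "benchmarks/large_2000_lines.sql"  # placeholder path
LIMIT_SECONDS = 2.0  # generous, to absorb CI runner jitter

start = time.time()
subprocess.run(["sqruff", "lint", BENCH_FILE], check=False,
               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
elapsed = time.time() - start
assert elapsed < LIMIT_SECONDS, f"lint took {elapsed:.2f}s (limit {LIMIT_SECONDS}s)"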

@gvozdvmozgu (Collaborator)
I think we need a separate repository that compares each dialect and each rule individually, as well as all of them together, against the Python implementation, and it should be deployed on GitHub Pages.

concur1 (Contributor, Author) commented Dec 3, 2024

Any chance you could do a release so I can try out the Python package with the latest updates?

@benfdking (Collaborator)
@concur1 I've pushed a new version; if you want to add a long one of your generated files to our benchmarks, that would be useful.

concur1 (Contributor, Author) commented Dec 4, 2024

@benfdking thanks for that! It looks like something went wrong with the Python release and a couple of the jobs were skipped (including the publish to PyPI). Maybe try re-running?

I can probably try to add a large file to the benchmarks next week.

@benfdking (Collaborator)
Oh, I think I know what this is... the old macOS image has been killed. Will look into it.

@benfdking (Collaborator)
That's released and should be on its way out, sorry @concur1.

concur1 (Contributor, Author) commented Dec 5, 2024

No worries, thanks for updating!
