Performance degrades exponentially as the number of lines in a file increases #939
This probably fixes it for my use-case, but I think the performance still degrades exponentially, just more slowly; I have a plot demonstrating this below. I just want to say thank you for looking into this @gvozdvmozgu and @benfdking, I didn't expect a fix to be done so quickly! This really improves things for my use-case. Here is the plot with lines for both before and after the fix, if you are interested:
How can I replicate this?
I created the data for all of these plots with the above python script. Running the script creates the test SQL file and the results CSV used for the plots.

To run it against the latest sqruff I changed this line:

subprocess.run(["sqruff", "lint"]

to instead be

subprocess.run(["cargo", "run", "lint"]

(This works if you are running the script from inside the sqruff repository.)

To run it for sqlfluff:

subprocess.run(["sqlfluff", "lint"]

Hope that helps. If not I will try and create a notebook which will hopefully be a bit more reproducible and share that with you.
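To put those variants side by side, here is a minimal runnable sketch of the swap described above; which line you leave uncommented is up to you, and the cargo variant assumes you are inside a checkout of the sqruff repository:

import subprocess

# Swap in whichever linter you want to time.
command = ["cargo", "run", "lint"]    # latest sqruff, built from source
# command = ["sqruff", "lint"]        # an installed sqruff release
# command = ["sqlfluff", "lint"]      # sqlfluff, for comparison

subprocess.run(command, check=False)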
One other thing: you will want to change the value inside the range function to control how many loops there are and how many files are created. To create that last graph there were 300 loops.
One more thing; when running for
Ok, I have made a single script that should work as long as you have sqlfluff installed and you run it from the sqruff repository:

import random
import subprocess
import time

# Header for the generated test file: a CREATE TABLE plus the start of an INSERT.
initial_statement_sql = """CREATE TABLE IF NOT EXISTS `project_id.dataset_id.table_id` (
    id STRING NOT NULL,
    product STRING NOT NULL,
    cost INTEGER NOT NULL,
    timestamp_added TIMESTAMP NOT NULL
);
INSERT INTO `project_id.dataset_id.table_id` (
    id,
    product,
    cost,
    timestamp_added
)
VALUES"""


def run_sqruff_check(size_param: int, output_filename: str, command: list):
    # Build size_param VALUES tuples (6 lines each) and write them to ./test.sql.
    insert_value_strings_list = []
    for i in range(size_param):
        insert_value_strings_list.append(f"""(
    {i},
    {random.choice(["bike", "car", "tv", "orange"])},
    {random.choice([100, 200, 300, 400])},
    current_timestamp()
)""")
    output = initial_statement_sql + "\n" + ",\n".join(insert_value_strings_list)
    with open("./test.sql", "w") as file:
        file.write(output)
    # Time one lint run and append "<approximate line count>, <seconds>" to the results CSV.
    start_time = time.time()
    subprocess.run(command, check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    with open(f"./{output_filename}", "a") as file:
        file.write(f"{(size_param*6)+14}, {time.time()-start_time}\n")
    print(f"{(size_param*6)+14}, {time.time()-start_time}")


def run_performance_checks(loops: int, output_filename: str, command: list):
    # Write a minimal .sqlfluff config so both linters use the bigquery dialect.
    with open("./.sqlfluff", "w") as file:
        file.write("""
[sqlfluff]
dialect = bigquery
""")
    for size_param in range(loops):
        run_sqruff_check(size_param, output_filename, command)


print("running sqruff performance checks")
run_performance_checks(200, "sqruff_results.csv", ["cargo", "run", "lint"])
print("running sqlfluff performance checks")
run_performance_checks(200, "sqlfluff_results.csv", ["sqlfluff", "lint"])

It only runs up to a file size of 1208 lines, but that is enough to see that the performance of sqruff starts lagging behind the performance of sqlfluff (at least on my machine). Let me know if it doesn't work for some reason.
hi @gvozdvmozgu I have just run the script for that PR/branch and performance is very much improved =). Here is the comparison plot:
Yeah, the final solution would be to remove fancy regex and replace it with native functions; even though there are only a few of them, they're very slow. I started doing this in another PR, but it's really exhausting, to be honest.
Oooof yeah, that sounds painful! To be honest the performance of the latest branch/PR is more than sufficient for my use-case. Do you think it would be useful to include one of the files I have been testing with in the automated benchmarks? e.g. a check to make sure linting a 2000-line file completes within a certain amount of time (on my machine it would complete in less than 0.5 seconds, but GHA may differ). Let me know if you think it would be useful and I will upload one of these files.
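As a rough illustration, here is a sketch of what such a timing check could look like, in the same style as the script above. The threshold, directory name, and committed test file are hypothetical placeholders, not an existing sqruff benchmark:

import subprocess
import time

THRESHOLD_SECONDS = 5.0  # placeholder; a CI runner will be slower than a laptop

start = time.time()
subprocess.run(
    ["sqruff", "lint"],           # same invocation the script above times
    cwd="benchmarks/large_file",  # hypothetical directory holding the 2000-line file
    check=False,
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
elapsed = time.time() - start
assert elapsed < THRESHOLD_SECONDS, f"lint took {elapsed:.2f}s (limit {THRESHOLD_SECONDS}s)"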
I think we need a separate repository that will compare each dialect and each rule individually, as well as all of them together with the Python implementation, and it should be deployed on GitHub Pages.
Any chance you could do a release so I could try out the Python package with the latest updates?
@concur1 I've pushed a new version; if you want to add a long one of your generated files to our benchmarks, that would be useful.
@benfdking thanks for that! It looks like something happened with the Python release and a couple of the jobs were skipped (including the publish to PyPI). Maybe try re-running? I can try and add a large file to the benchmarks next week, probably.
Oh, I think I know what this is... the old macOS image has been killed; will look into it.
That's released and should be on its way out, sorry @concur1
no worries, thanks for updating!
I have tested this by generating SQL files with an increasing number of lines, linting each file and measuring the time it takes.
Here is the result in graph form:
In case you would like to try and replicate this:
Here is my config file:
Here is the final file with 608 lines: test_sql.txt (I couldn't upload a .sql file here for some reason)
Here is the Python script I used to generate the different files: