[FEATURE] Performance - LIMIT clause is not optimized in execution stage #754

Open
penghuo opened this issue Oct 8, 2024 · 0 comments
Labels: enhancement, untriaged

penghuo commented Oct 8, 2024

Is your feature request related to a problem?

Problem statement

A query with a LIMIT clause, such as SELECT * FROM cloudtrail LIMIT 10, takes 1.5 minutes to execute when querying a CloudTrail dataset containing over 10,000 files.
Key issues observed:

  • Excessive resource consumption: The query plans more than 10K tasks. With the Dynamic Resource Allocation (DRA) feature, EMR-S Spark spins up additional nodes to run them, which adds a delay of 30+ seconds.
  • Inefficient execution: Despite Spark's LIMIT optimization, which minimizes unnecessary file scans by reading splits incrementally, the query does not skip unneeded files.

Root cause analysis

We use the following example to explain the problem. The dataset consists of 225 files, which Spark splits and groups into 12 input splits.

When the user submits the query SELECT * FROM alb_logs LIMIT 10, the expected behavior is that Spark scans only one split (i.e., one file). If that split yields the 10 rows, Spark returns the result without scanning additional files; otherwise it scans more files, with the growth rate controlled by spark.sql.limit.scaleUpFactor (a simplified sketch of this incremental scan follows the figure below). The execution plan for this query is shown in the figure below: the Spark job contains a single task that fetches 10 rows from one file, no shuffle stage is required, and the entire job completes in 24 milliseconds.
[Screenshot: execution plan with a single task and no shuffle stage]
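For context, the incremental scan behaves roughly like the sketch below. This is a simplification loosely modeled on Spark's SparkPlan.executeTake; the function name, parameters, and the scanSplits callback are illustrative, not Spark's actual internals.

```scala
import org.apache.spark.sql.Row

// Simplified sketch of Spark's incremental LIMIT scan (loosely modeled on
// SparkPlan.executeTake); names and signatures are illustrative only.
def executeTakeSketch(
    limit: Int,
    totalSplits: Int,
    scaleUpFactor: Int,            // spark.sql.limit.scaleUpFactor
    scanSplits: Range => Seq[Row]  // runs a job over the given range of splits
): Seq[Row] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[Row]
  var scanned = 0
  var numToTry = 1                 // start with a single split
  while (buf.length < limit && scanned < totalSplits) {
    val upTo = math.min(scanned + numToTry, totalSplits)
    buf ++= scanSplits(scanned until upTo)  // scan the next batch of splits
    scanned = upTo
    numToTry *= scaleUpFactor      // widen the next batch exponentially
  }
  buf.take(limit).toSeq
}
```

With limit = 10 and one file already containing 10 rows, the loop exits after the first batch, which matches the single-task, 24 ms job observed above.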

However, when the query is submitted through FlintREPL, it interacts with Spark using the following code: spark.sql("SELECT * FROM alb_logs LIMIT 10").toJSON.collect()
In this case, the execution plan differs: the toJSON operator causes Spark to split the execution into two stages with a shuffle in between, which adds unnecessary overhead (a possible mitigation is sketched after the figure below).
[Screenshot: two-stage execution plan with a shuffle stage introduced by toJSON]
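The cost comes from toJSON rewriting the plan (it maps every row to a JSON string) before collect runs, so the take-based single-stage path no longer applies. One possible mitigation, purely as a sketch and not the project's actual fix, is to collect the limited rows first and serialize them on the driver; the JSON rendering below is deliberately naive (no escaping or type handling) and only illustrates the idea.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of a driver-side workaround (not FlintREPL's actual code): collect
// the limited rows via the take-based path, then serialize to JSON locally.
val spark = SparkSession.builder().appName("limit-sketch").getOrCreate()
val df = spark.sql("SELECT * FROM alb_logs LIMIT 10")
val fieldNames = df.schema.fieldNames
val jsonLines = df.collect().map { row =>   // single-stage CollectLimit path
  fieldNames.zip(row.toSeq)
    .map { case (name, value) => s""""$name":"$value"""" } // naive: no escaping
    .mkString("{", ",", "}")
}
jsonLines.foreach(println)
```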

What solution would you like?
I would like a solution that progressively plans InputPartitions and collects only the rows needed to satisfy the LIMIT (see the sketch below).
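As a rough illustration of what "progressively plan" could mean, the sketch below uses hypothetical types and callbacks (planSplits and readSplit are assumptions, not existing Spark or Flint APIs): plan input splits in growing batches and stop reading as soon as the LIMIT is satisfied, rather than planning all 10K+ splits up front.

```scala
// Hypothetical sketch of progressive partition planning; planSplits and
// readSplit are assumed callbacks, not existing Spark/Flint APIs.
def progressiveScan[S, R](
    limit: Int,
    scaleUpFactor: Int,
    planSplits: (Int, Int) => Seq[S], // (offset, count) => splits, empty when exhausted
    readSplit: S => Iterator[R]
): Seq[R] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[R]
  var planned = 0
  var batchSize = 1                   // plan a single split first
  var exhausted = false
  while (buf.length < limit && !exhausted) {
    val splits = planSplits(planned, batchSize)
    if (splits.isEmpty) exhausted = true
    else {
      // Stop reading mid-batch once enough rows have been collected.
      splits.iterator
        .takeWhile(_ => buf.length < limit)
        .foreach { s => readSplit(s).takeWhile(_ => buf.length < limit).foreach(buf += _) }
      planned += splits.length
      batchSize *= scaleUpFactor      // widen the next planning batch
    }
  }
  buf.take(limit).toSeq
}
```

The key difference from today's behavior is that partition planning itself is deferred, so a LIMIT 10 query over 10K+ files would plan and scan only a handful of splits in the common case.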

What alternatives have you considered?
n/a

Do you have any additional context?
attached
