Slow Table List Retrieval in Many Tables #589

mrchypark · 2024-08-05T07:55:22Z

Description

We are currently managing a large number of tables (964) in our database, with expectations for continued growth. We've encountered a significant performance issue when attempting to retrieve the list of tables using the SELECT * FROM information_schema.tables query. This operation is taking over 30 seconds to complete, which is causing concerns about efficiency and scalability.

Context

Current number of tables: 964
Expected: Continued growth in the number of tables
Current behavior: DROP TABLE and CREATE TABLE operations do not support IF NOT EXISTS or IF EXISTS clauses
Workaround attempt: Using SELECT * FROM information_schema.tables to check table existence
Problem: This query is taking more than 30 seconds to execute

Impact

The slow retrieval of the table list is affecting our ability to efficiently manage and operate on our database schema. This issue may become more severe as we continue to add more tables to the database.

Questions for Investigation

Is this query performance normal for the given number of tables?
Are there any optimizations we can apply to the information_schema.tables query to improve its performance?
Can we implement support for IF NOT EXISTS and IF EXISTS clauses in our CREATE TABLE and DROP TABLE operations to avoid the need for this query?
Are there any index optimizations we can apply to improve the performance of metadata queries?
Is there an alternative method to efficiently check for table existence without querying the entire table list?
How does this query performance scale with an increasing number of tables?

Additional Information

While other queries are running concurrently, the time taken for retrieving the table list seems disproportionately long.
We need to find a solution that scales well as our database continues to grow.

Next Steps

We would appreciate insights from the team on:

Potential causes of this performance issue
Recommended approaches for investigation
Possible solutions or optimizations to improve the table list retrieval performance

Any additional context or suggestions for improving our database management approach would be greatly appreciated.

The text was updated successfully, but these errors were encountered:

gruuya · 2024-08-07T05:37:48Z

Hey, thanks for a very detailed issue.

Indeed I anticipated this might be the case

seafowl/src/catalog/metastore.rs

Lines 148 to 152 in 5b0462f

    
           // Build a delta table but don't load it yet; we'll do that only for tables that are 
        
           // actually referenced in a statement, via the async `table` method of the schema provider. 
        
           // TODO: this means that any `information_schema.columns` query will serially load all 
        
           // delta tables present in the database. The real fix for this is to make DF use `TableSource` 
        
           // for the information schema, and then implement `TableSource` for `DeltaTable` in delta-rs.

So basically the root of the issue is that the DataFusion loads each (in our case Delta) table serially whilst generating the information_schema.tables table, which doesn't scale past a certain number of tables: https://github.com/apache/datafusion/blob/bddb6415a50746d2803dd908d19c3758952d74f9/datafusion/core/src/catalog_common/information_schema.rs#L93-L102

My presumption is that this can be averted by having a bulk table load method on the schema provider (or something else). The only real reason to load each table (since we already have its name) is to find out its table type to display.

In the short term we can take a look at fixing the IF (NOT) EXISTS functionality.

mrchypark · 2024-08-07T07:01:17Z

Thank you for your kind and detailed answer. I now understand that we need to request all information to verify the actual state of deleted tables. I was curious about this because SQLite has metadata, so I wondered why it would take a long time.
Regarding all the methods you suggested, I'm afraid I don't have a strong opinion on any of them yet (probably because I'm not very familiar with Rust). Implementing the functionality of IF (NOT) EXISTS might be a low-hanging fruit that we could tackle.

gruuya · 2024-08-07T09:19:49Z

There's now an accompanying DataFusion issue apache/datafusion#11865, though I'm not sure about a good long-term solution there.

In the meantime, the workaround for IF (NOT) EXISTS should be available with #593.

gruuya mentioned this issue Aug 7, 2024

Sub-optimal information_schema table retrieval logic apache/datafusion#11865

Open

gruuya mentioned this issue Aug 7, 2024

Add flag denoting first sync response #593

Merged

gruuya closed this as completed in #593 Aug 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slow Table List Retrieval in Many Tables #589

Slow Table List Retrieval in Many Tables #589

mrchypark commented Aug 5, 2024

gruuya commented Aug 7, 2024

mrchypark commented Aug 7, 2024

gruuya commented Aug 7, 2024

Slow Table List Retrieval in Many Tables #589

Slow Table List Retrieval in Many Tables #589

Comments

mrchypark commented Aug 5, 2024

Description

Context

Impact

Questions for Investigation

Additional Information

Next Steps

gruuya commented Aug 7, 2024

mrchypark commented Aug 7, 2024

gruuya commented Aug 7, 2024