Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow Table List Retrieval in Many Tables #589

Closed
mrchypark opened this issue Aug 5, 2024 · 3 comments · Fixed by #593
Closed

Slow Table List Retrieval in Many Tables #589

mrchypark opened this issue Aug 5, 2024 · 3 comments · Fixed by #593

Comments

@mrchypark
Copy link
Contributor

Description

We are currently managing a large number of tables (964) in our database, with expectations for continued growth. We've encountered a significant performance issue when attempting to retrieve the list of tables using the SELECT * FROM information_schema.tables query. This operation is taking over 30 seconds to complete, which is causing concerns about efficiency and scalability.

Context

  • Current number of tables: 964
  • Expected: Continued growth in the number of tables
  • Current behavior: DROP TABLE and CREATE TABLE operations do not support IF NOT EXISTS or IF EXISTS clauses
  • Workaround attempt: Using SELECT * FROM information_schema.tables to check table existence
  • Problem: This query is taking more than 30 seconds to execute

Impact

The slow retrieval of the table list is affecting our ability to efficiently manage and operate on our database schema. This issue may become more severe as we continue to add more tables to the database.

Questions for Investigation

  1. Is this query performance normal for the given number of tables?
  2. Are there any optimizations we can apply to the information_schema.tables query to improve its performance?
  3. Can we implement support for IF NOT EXISTS and IF EXISTS clauses in our CREATE TABLE and DROP TABLE operations to avoid the need for this query?
  4. Are there any index optimizations we can apply to improve the performance of metadata queries?
  5. Is there an alternative method to efficiently check for table existence without querying the entire table list?
  6. How does this query performance scale with an increasing number of tables?

Additional Information

  • While other queries are running concurrently, the time taken for retrieving the table list seems disproportionately long.
  • We need to find a solution that scales well as our database continues to grow.

Next Steps

We would appreciate insights from the team on:

  1. Potential causes of this performance issue
  2. Recommended approaches for investigation
  3. Possible solutions or optimizations to improve the table list retrieval performance

Any additional context or suggestions for improving our database management approach would be greatly appreciated.

@gruuya
Copy link
Contributor

gruuya commented Aug 7, 2024

Hey, thanks for a very detailed issue.

Indeed I anticipated this might be the case

// Build a delta table but don't load it yet; we'll do that only for tables that are
// actually referenced in a statement, via the async `table` method of the schema provider.
// TODO: this means that any `information_schema.columns` query will serially load all
// delta tables present in the database. The real fix for this is to make DF use `TableSource`
// for the information schema, and then implement `TableSource` for `DeltaTable` in delta-rs.

So basically the root of the issue is that the DataFusion loads each (in our case Delta) table serially whilst generating the information_schema.tables table, which doesn't scale past a certain number of tables: https://github.com/apache/datafusion/blob/bddb6415a50746d2803dd908d19c3758952d74f9/datafusion/core/src/catalog_common/information_schema.rs#L93-L102

My presumption is that this can be averted by having a bulk table load method on the schema provider (or something else). The only real reason to load each table (since we already have its name) is to find out its table type to display.

In the short term we can take a look at fixing the IF (NOT) EXISTS functionality.

@mrchypark
Copy link
Contributor Author

Thank you for your kind and detailed answer. I now understand that we need to request all information to verify the actual state of deleted tables. I was curious about this because SQLite has metadata, so I wondered why it would take a long time.
Regarding all the methods you suggested, I'm afraid I don't have a strong opinion on any of them yet (probably because I'm not very familiar with Rust). Implementing the functionality of IF (NOT) EXISTS might be a low-hanging fruit that we could tackle.

@gruuya
Copy link
Contributor

gruuya commented Aug 7, 2024

There's now an accompanying DataFusion issue apache/datafusion#11865, though I'm not sure about a good long-term solution there.

In the meantime, the workaround for IF (NOT) EXISTS should be available with #593.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants