NL2SQL "Best Practices" #3322
-
There is no GitHub Discussions tab in the Kernel Memory repo, so posting here. @crickman The work on NL2SQL is awesome. I've currently adapted it to a project I'm working on using SQLite. In the initial blog post, you mentioned exploring Microsoft SQL Server, MySQL, PostgreSQL, and SQLite. Are the other schema providers going to become open source, since it seems only SQL Server is out there now, or are you open to a SQLite contribution?
I understand, as outlined above, that there are many ways to potentially solve this problem, but I'm looking for "best practices" (which likely don't exist, but there are guidelines at least) or to start a discussion about the tradeoffs of approaches.
-
Thank you Eddie for your enthusiasm, great write-up, and thoughtful approach.

Before providing a direct response, I do want to call out that other options are available for integrating SQL into cognitive use-cases. For instance, Azure Cognitive Search supports a SQL indexer. It may also be worth emphasizing that using an LLM for code generation outside of design-time/co-pilot scenarios has inherent complexities and risks.

With regards to the approach you've outlined:

[1/2] I like your thinking here. I was recently working on a different POC that used a similar "glossary" approach (although not for query generation). For the NL2SQL demo, I didn't want to restrict to a single database or schema...so this informed the approach for defining and managing schema. It certainly makes sense that you are adapting this as you see fit. Yes, reducing the surface area w.r.t. how much schema is expressed in the prompts is the way to go. Using a vector database to create a semantic glossary of table columns and querying for the most relevant columns (with over-selection) can be an effective approach. I suspect you've thought through this already, but I'd include column name, table name, and description as metadata along with the embedding (basically, anything you'd need to extract to include in the schema passed to the prompt). I'd imagine a skeleton template that includes key fields, using the glossary to flesh out useful non-key/attribute fields, and "plain code" to assemble the schema definition from this template. If there's anything that resembles a "fact column", I might always include that in the template as well (e.g. Price, Count, etc.). It is also conceivable that there are some smaller tables whose complete definition is always in the schema template, depending on position/utilization. In terms of generating the YAML schema expression, or the source data for the column glossary, I might consider using SQL for code/data generation followed by manual edits/manipulation.
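To make the glossary idea concrete, here is a minimal sketch of over-selecting relevant columns and assembling a schema fragment for the prompt. All table/column names are hypothetical, and the token-overlap "embedding" is a stand-in for a real embedding model plus vector store:

```python
from dataclasses import dataclass

@dataclass
class GlossaryEntry:
    table: str
    column: str
    description: str

# Toy "embedding": bag-of-words token set. A real implementation would call
# an embedding model and query a vector database instead.
def embed(text: str) -> set:
    return set(text.lower().split())

def similarity(a: set, b: set) -> float:
    return len(a & b) / max(len(a | b), 1)

# Hypothetical glossary: one entry per table-column, with the metadata
# needed to reconstruct a schema fragment for the prompt.
GLOSSARY = [
    GlossaryEntry("orders", "order_total", "total price of the order in USD"),
    GlossaryEntry("orders", "order_date", "date the order was placed"),
    GlossaryEntry("customers", "state_code", "numeric code for the customer's US state"),
]

def select_columns(question: str, k: int = 2) -> list:
    """Over-select the k most relevant glossary columns for the question."""
    q = embed(question)
    ranked = sorted(GLOSSARY,
                    key=lambda e: similarity(q, embed(e.description)),
                    reverse=True)
    return ranked[:k]

def build_schema_fragment(entries) -> str:
    """Plain code assembling the schema definition passed to the prompt."""
    lines = [f"- {e.table}.{e.column}: {e.description}" for e in entries]
    return "Relevant columns:\n" + "\n".join(lines)
```

Key fields and "fact columns" from the skeleton template would simply be prepended to the selected entries before calling `build_schema_fragment`.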
You can certainly manually enrich the schema descriptions as needed. How to teach the model to translate its notion of "state" to a numeric column code may require some iteration. If you have enough schema control, a transform to a more intuitive value may be ideal.

[3] While I can envision a lot of cases where targeting a view might provide more consistent results, based on your schema and scenario I'd perhaps avoid utilizing this approach at the outset. For one, it might create cardinality issues for certain aggregate results.

[4] Planner can be useful to coordinate discrete "steps" (across functions, for instance), but for this application it might interfere with the comprehensive scope required for code generation. Planners are also useful for coordinating "steps" that may be dynamic (invoking different functions depending on context). I suspect they may not add value to your case.

Please let me know if I've missed anything and feel free to follow up with additional thoughts and questions.
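The cardinality concern in [3] is easy to demonstrate with SQLite's stdlib bindings. In this hypothetical schema, a convenience view joins orders to their line items; the join repeats each order's total once per item, so aggregating over the view double-counts:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders(id INTEGER PRIMARY KEY, order_total REAL);
CREATE TABLE items(order_id INTEGER, sku TEXT);
INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
INSERT INTO items VALUES (1, 'a'), (1, 'b'), (2, 'c');

-- View repeats order_total once per line item (one-to-many join).
CREATE VIEW order_items AS
  SELECT o.id, o.order_total, i.sku
  FROM orders o JOIN items i ON i.order_id = o.id;
""")

correct = con.execute("SELECT SUM(order_total) FROM orders").fetchone()[0]
inflated = con.execute("SELECT SUM(order_total) FROM order_items").fetchone()[0]
print(correct, inflated)  # 150.0 vs 250.0 -- the view inflates the aggregate
```

If the generated SQL targets the view instead of the base table, the model has no way to know the aggregate is wrong, which is why starting with base tables can be safer.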
-
This all sounds good. Totally agree that it depends on scale. If you can get all the tables in your schema described within a prompt with room for the additional required inputs, there's no need for additional complexity. I have had success with using a vector query to (over-)select from a huge input domain and then letting the model reason through the result in the context of a very specific goal. I suspect you'll find that breaking out the look-up table results in a lot less friction than fighting with the model (as you've concluded). Looking forward to hearing more.
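One way to "break out the look-up table" is to inline its full contents into the schema fragment, so the model can translate a natural-language value (e.g. "Texas") into the numeric code directly instead of guessing. A minimal sketch, with a hypothetical `state_codes` table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE state_codes(code INTEGER PRIMARY KEY, name TEXT);
INSERT INTO state_codes VALUES (6, 'California'), (48, 'Texas');
""")

def lookup_table_hint(con, table: str) -> str:
    """Render a small code table verbatim for inclusion in the prompt.

    Only sensible for tables small enough to fit in the context budget.
    """
    cur = con.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    body = "\n".join(", ".join(map(str, row)) for row in cur.fetchall())
    return f"Table {table} ({', '.join(cols)}):\n{body}"

print(lookup_table_hint(con, "state_codes"))
```

With the full mapping in the prompt, a question like "orders from Texas" can be translated to `WHERE state_code = 48` without any iteration on teaching the model the coding scheme.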