NL2SQL "Best Practices" #3322
-
There is no GitHub Discussions tab in the Kernel Memory repo, so posting here. @crickman The work on NL2SQL is awesome. I've currently adapted it to a project I'm working on using SQLite. In the initial blog post, you mentioned exploring Microsoft SQL Server, MySQL, PostgreSQL, and SQLite. Are the other schema providers going to become open source, since it seems only SQL Server is out there now, or are you open to a SQLite contribution?
I understand, as outlined above, that there are many ways to potentially solve this problem, but I'm looking for "best practices" (which likely don't exist, but there are guidelines at least) or to start a discussion about the tradeoffs of approaches.
-
Thank you Eddie for your enthusiasm, great write-up, and thoughtful approach.

Before providing a direct response, I do want to call out that other options are available for integrating SQL into cognitive use-cases. For instance, Azure Cognitive Search supports a SQL indexer. It may also be worth emphasizing that using an LLM for code generation outside of design-time/co-pilot scenarios has inherent complexities and risks.

With regards to the approach you've outlined:

[1/2] I like your thinking here. I was recently working on a different POC that used a similar "glossary" approach (although not for query generation). For the NL2SQL demo, I didn't want to restrict to a single database or schema...so this informed the approach for defining and managing schema. It certainly makes sense that you are adapting this as you see fit. Yes, reducing the surface area w.r.t. how much schema is expressed in the prompts is the way to go. Using a vector database to create a semantic glossary of table columns and querying for the most relevant columns (with over-selection) can be an effective approach. I suspect you've thought through this already, but I'd include column name, table name, and description as metadata along with the embedding (basically, anything you'd need to extract to include in the schema passed to the prompt). I'd imagine a skeleton template that includes key fields, using the glossary to flesh out useful non-key/attribute fields, and "plain code" to assemble the schema definition from this template. If there's anything that resembles a "fact column", I might always include that in the template as well (e.g. Price, Count, etc.). It is also conceivable that there are some smaller tables whose complete definition is always in the schema template, depending on position/utilization. In terms of generating the YAML schema expression, or the source data for the column glossary, I might consider using SQL for code/data generation followed by manual edits/manipulation.
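To make the glossary idea concrete, here is a minimal sketch of over-selecting relevant columns and assembling a schema fragment for the prompt. All table/column names are hypothetical, and the token-overlap "embedding" is a stand-in for a real embedding model plus vector store:

```python
from dataclasses import dataclass

@dataclass
class GlossaryEntry:
    table: str
    column: str
    description: str

# Toy "embedding": bag-of-words token set. A real implementation would call
# an embedding model and query a vector database instead.
def embed(text: str) -> set:
    return set(text.lower().split())

def similarity(a: set, b: set) -> float:
    return len(a & b) / max(len(a | b), 1)

# Hypothetical glossary: one entry per table-column, with the metadata
# needed to reconstruct a schema fragment for the prompt.
GLOSSARY = [
    GlossaryEntry("orders", "order_total", "total price of the order in USD"),
    GlossaryEntry("orders", "order_date", "date the order was placed"),
    GlossaryEntry("customers", "state_code", "numeric code for the customer's US state"),
]

def select_columns(question: str, k: int = 2) -> list:
    """Over-select the k most relevant glossary columns for the question."""
    q = embed(question)
    ranked = sorted(GLOSSARY,
                    key=lambda e: similarity(q, embed(e.description)),
                    reverse=True)
    return ranked[:k]

def build_schema_fragment(entries) -> str:
    """Plain code assembling the schema definition passed to the prompt."""
    lines = [f"- {e.table}.{e.column}: {e.description}" for e in entries]
    return "Relevant columns:\n" + "\n".join(lines)
```

Key fields and "fact columns" from the skeleton template would simply be prepended to the selected entries before calling `build_schema_fragment`.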
You can certainly manually enrich the schema descriptions as needed. How to teach the model to translate its notion of "state" to a numeric column code may require some iteration. If you have enough schema control, a transform to a more intuitive value may be ideal.

[3] While I can envision a lot of cases where targeting a view might provide more consistent results, based on your schema and scenario I'd perhaps avoid utilizing this approach at the outset. For one, it might create cardinality issues for certain aggregate results.

[4] Planner can be useful to coordinate discrete "steps" (across functions, for instance), but for this application it might interfere with the comprehensive scope required for code generation. Planners are also useful for coordinating "steps" that may be dynamic (invoking different functions depending on context). I suspect they may not add value to your case.

Please let me know if I've missed anything and feel free to follow up with additional thoughts and questions.
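The cardinality concern in [3] is easy to demonstrate with SQLite's stdlib bindings. In this hypothetical schema, a convenience view joins orders to their line items; the join repeats each order's total once per item, so aggregating over the view double-counts:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders(id INTEGER PRIMARY KEY, order_total REAL);
CREATE TABLE items(order_id INTEGER, sku TEXT);
INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
INSERT INTO items VALUES (1, 'a'), (1, 'b'), (2, 'c');

-- View repeats order_total once per line item (one-to-many join).
CREATE VIEW order_items AS
  SELECT o.id, o.order_total, i.sku
  FROM orders o JOIN items i ON i.order_id = o.id;
""")

correct = con.execute("SELECT SUM(order_total) FROM orders").fetchone()[0]
inflated = con.execute("SELECT SUM(order_total) FROM order_items").fetchone()[0]
print(correct, inflated)  # 150.0 vs 250.0 -- the view inflates the aggregate
```

If the generated SQL targets the view instead of the base table, the model has no way to know the aggregate is wrong, which is why starting with base tables can be safer.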
-
This all sounds good. Totally agree that it depends on scale. If you can get all the tables in your schema described within a prompt with room for the additional required inputs, there's no need for additional complexity. I have had success with using a vector query to (over-)select from a huge input domain and then letting the model reason through the result in the context of a very specific goal. I suspect you'll find that breaking out the look-up table results in a lot less friction than fighting with the model (as you've concluded). Looking forward to hearing more.
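One way to "break out the look-up table" is to inline its full contents into the schema fragment, so the model can translate a natural-language value (e.g. "Texas") into the numeric code directly instead of guessing. A minimal sketch, with a hypothetical `state_codes` table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE state_codes(code INTEGER PRIMARY KEY, name TEXT);
INSERT INTO state_codes VALUES (6, 'California'), (48, 'Texas');
""")

def lookup_table_hint(con, table: str) -> str:
    """Render a small code table verbatim for inclusion in the prompt.

    Only sensible for tables small enough to fit in the context budget.
    """
    cur = con.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    body = "\n".join(", ".join(map(str, row)) for row in cur.fetchall())
    return f"Table {table} ({', '.join(cols)}):\n{body}"

print(lookup_table_hint(con, "state_codes"))
```

With the full mapping in the prompt, a question like "orders from Texas" can be translated to `WHERE state_code = 48` without any iteration on teaching the model the coding scheme.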