Using AI to generate SQL queries from text prompts offers many potential benefits. For example, business analysts and executives can retrieve data without relying on specialized resources. Democratizing data access in this way, combined with the ease of natural language, enables faster querying and decision-making. It can also free up resources on your data team.
Oracle Autonomous Database Select AI’s text-to-SQL generation – also called natural language-to-SQL (NL2SQL) – is a game-changing way to query databases, but the following common challenges can make it difficult to obtain the results you may expect:
- You have hundreds of tables in your schema that support your enterprise application.
- The names of your tables, views, and columns are cryptic or ambiguous.
- The questions you ask involve potentially complex formulas, and enterprise-specific or domain-specific jargon.
- You expect AI to magically produce the right answers regardless of how you ask the question.
If any of these resonate, you may find the guidance in this blog helpful.
Writing queries is challenging, even for a SQL expert
Imagine that you have a newly hired SQL expert who knows how to write amazingly complex queries but understands little (or nothing) about your business domain, application, and database schema. Moreover, the schema has hundreds of tables. Suppose you ask the SQL expert to write a query to answer the question: “What are the best-performing products year over year, and what are the reasons for that performance?” The SQL expert will likely struggle to get the right answer, or at least the specific answer you’re looking for.
Perhaps it’s because:
- Multiple DBAs, developers, and architects had a hand in designing the tables and naming their columns, which resulted in a lack of naming and documentation consistency.
- Some tables may appear to be very similar, both in name and content, making it unclear which should be used.
- Some of your business logic—such as how certain KPIs are calculated—has been embedded in your application code, which obscures the calculation methods used to derive them.
- It’s unclear how certain terms are defined. For example, how do you define “performance” and what constitutes “good” performance? Or, how should a “year” be interpreted? By the calendar, your company’s fiscal year, or the trailing twelve months?
When faced with these stumbling blocks, even a well-versed human SQL expert could struggle to write the right SQL query to answer a seemingly simple question.
LLM as SQL expert
Now, enter AI. Why should we expect a Large Language Model (LLM) to fare much better? It is widely acknowledged that LLMs have a wealth of general knowledge and may even understand some business jargon. But in general, LLMs will have knowledge gaps that need to be filled before they can generate SQL queries with the specificity needed to correctly address your business questions.
And so, you may need “a reality check” on what you can expect from LLMs that generate SQL. More importantly, you may want to know what you can do to help LLMs get more context-relevant results.
In this blog, I’ll explore some best practices you can implement using Select AI, the feature of Oracle Autonomous Database that provides, among other capabilities, text-to-SQL generation:
- Enrich and enhance the metadata available to the LLM
- Define views that join tables that are normally used together
- Define columns to compute common or often-used KPIs and metrics
- Provide the most relevant and the minimum set of table metadata to the LLM
- Improve your prompts
- Choose an LLM that excels at SQL generation
1. Enrich your schema metadata
Let’s start with metadata. In Select AI, metadata is used to augment the user prompt and can include information about your data such as:
- Table and view definitions, including their names, column names, and data types
- Table, view, and column comments and annotations
- Referential integrity constraints on your tables
There are several ways you can enrich your metadata as we’ll discuss here. Each can be used independently or in combination.
Use meaningful names
If your tables and columns have names that are intrinsically meaningful, that’s a good start. If they’re cryptic or generic (like “FLEX_FIELD_1”), the LLM is unlikely to use them effectively because it can’t interpret their purpose. You could rename the tables and columns, but that may not be possible if applications or reports already depend on them. Instead, you can define a view over an existing table or view to address cryptic naming. For example, you can define a view over the following table definition to improve the table metadata, as shown below:
```sql
-- Table with cryptic names
CREATE TABLE tab1 (
  id          NUMBER,
  field1      VARCHAR2(256),
  dt          DATE,
  flex_field2 VARCHAR2(4000)
);

-- View with readable names
CREATE OR REPLACE VIEW employee AS
SELECT id          AS employee_id,
       field1      AS full_name,
       dt          AS hire_date,
       flex_field2 AS home_address
FROM   tab1;
```
The new view definition can be included in your AI profile.
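As a sketch of how that might look, the following uses the DBMS_CLOUD_AI package to create a profile whose object list includes the view; the profile name, provider, credential, and schema here are hypothetical placeholders:

```sql
-- Hypothetical profile: provider, credential, and object names are placeholders.
BEGIN
  DBMS_CLOUD_AI.CREATE_PROFILE(
    profile_name => 'HR_PROFILE',
    attributes   => '{"provider": "openai",
                      "credential_name": "OPENAI_CRED",
                      "object_list": [{"owner": "HR", "name": "EMPLOYEE"}]}'
  );
  -- Make the profile active for this session
  DBMS_CLOUD_AI.SET_PROFILE(profile_name => 'HR_PROFILE');
END;
/
```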
Enhance metadata with comments
In addition, you can include table and column comments and enable comments in your AI profile. Table and column comments are brief descriptions explaining what each is, how it should be used, what values are valid, and even how to interpret values. Such comments can improve an LLM’s ability to understand their purpose and can improve the applicability of the generated SQL queries. Here’s an example of how to specify a comment on a table:
```sql
COMMENT ON TABLE my_generic_table IS
  'This table describes customer demographic data…';
```
In your AI profile attributes, specify the following:
"comments":"true"
Note that extensive use of comments, and of metadata in general, can lead to exceeding token limits for certain LLMs, depending on the comments themselves and the number of tables and columns involved.
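Column comments are especially valuable when a column’s valid values aren’t self-evident. A sketch, with hypothetical table and column names:

```sql
-- Hypothetical columns: comments describe meaning and enumerate valid values.
COMMENT ON COLUMN customers.status IS
  'Customer lifecycle state. Valid values: ACTIVE, DORMANT, CLOSED.';
COMMENT ON COLUMN customers.country IS
  'Country of the billing address, stored in upper case, e.g., UNITED STATES.';
```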
Enhance metadata with constraints
Similarly, an LLM may not be able to determine which combination of tables should be joined to produce a desired result. Referential integrity constraints, such as foreign keys, may be defined on your tables, but these are not automatically included in the metadata that Select AI sends to the LLM. Such metadata can help the LLM understand relationships among tables.
In your AI profile attributes, specify the following:
"constraints":"true"
Here’s an example that defines a foreign key constraint to the department table using the department number (dept_no) column:
```sql
CREATE TABLE employee (
  emp_no   NUMBER PRIMARY KEY,
  emp_name VARCHAR2(50),
  salary   NUMBER,
  dept_no  NUMBER,
  CONSTRAINT emp_dept_fk FOREIGN KEY (dept_no)
    REFERENCES department (dept_no)
);
```
Enhance metadata with annotations
In Oracle Database 23ai, you also have an option to use annotations that can be specified on your tables and columns. To enable annotations in the AI profile, specify the following:
"annotations":"true"
Here’s an example that defines several annotations on columns and the table:
```sql
CREATE TABLE employee (
  emp_no   NUMBER
           ANNOTATIONS (mandatory, system_generated, alias 'employee number'),
  emp_name VARCHAR2(50) ANNOTATIONS (display_value),
  salary   NUMBER,
  dept_no  NUMBER ANNOTATIONS (alias 'department number')
) ANNOTATIONS (primary_table 'Table storing main information about employees');
```
2. Pre-join tables using views
If you or your DBA know that certain table joins are used repeatedly to answer user prompts, apply your domain knowledge to define views over those tables, which can simplify the LLM’s work. Consider examining the V$SQL view to identify previously run queries and the joins they use.
As mentioned above, enabling constraints can help the LLM; however, there is still a chance that the LLM deduces the needed joins incorrectly. Pre-joining tables lets the LLM use the intended relationships between tables.
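As a sketch, a pre-joined view over the employee and department tables from the earlier example might look like this (dept_name is an assumed column):

```sql
-- Pre-joined view so the LLM doesn't have to deduce the join itself
CREATE OR REPLACE VIEW employee_details AS
SELECT e.emp_no,
       e.emp_name,
       e.salary,
       d.dept_no,
       d.dept_name   -- assumed column on department
FROM   employee e
JOIN   department d ON e.dept_no = d.dept_no;
```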
3. Define columns that compute metrics
If you have common or often-used formulas, metrics, or KPIs that are derived from one or more columns or tables, consider defining a view that explicitly computes these as columns. This way, the LLM does not have to figure out your intended formula each time it encounters it and can potentially avoid the generation of inconsistent or erroneous results.
For example, if you often refer to net profit margin, you may want to define a column in a view with the formula:
```sql
SELECT product_id, …,
       net_income / total_revenue * 100 AS net_profit_margin
FROM   my_finance_table;
```
You may think that describing the formula in your prompt offers more flexibility; however, LLMs are still limited in their ability to reliably perform certain mathematical computations. Including such computations directly in your view definition is safer and ensures a uniform interpretation.
4. Minimize metadata
Select AI provides four options for determining which metadata should be provided to the LLM:
- Specify the schema(s) only. This includes metadata for all the tables in the schema(s) in the object list.
- Specify individual tables. This limits the metadata to the tables in the object list.
- Specify a combination of schema name(s) and individual tables.
- Specify the object list mode as automated, which will dynamically determine which metadata to provide to the LLM. This can be used with either the schema-only or individual table specification.
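The options above map to the "object_list" profile attribute. The sketch below mixes a schema-level entry with individual tables; the profile and object names are hypothetical, and DBMS_CLOUD_AI.SET_ATTRIBUTE is used to update an existing profile:

```sql
-- Hypothetical object list: all of SALES, plus two specific HR tables
BEGIN
  DBMS_CLOUD_AI.SET_ATTRIBUTE(
    profile_name    => 'SALES_PROFILE',
    attribute_name  => 'object_list',
    attribute_value => '[{"owner": "SALES"},
                         {"owner": "HR", "name": "EMPLOYEE"},
                         {"owner": "HR", "name": "DEPARTMENT"}]'
  );
END;
/
```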
When it comes to LLMs and metadata, the phrase “less is more” is especially true because having too many tables represented can introduce a few issues:
- The LLM needs to process and make sense of all the metadata provided.
- A larger prompt involves more tokens, resulting in higher costs.
- A large prompt could result in exceeding the LLM token limit, so it can’t process the request.
It is a good practice to specify a minimal number of tables in your AI Profile, unless your schema already has a small set of tables. Alternatively, as noted above, you can leverage the automated object list mode, which uses RAG on the metadata to identify tables that are most likely to be applicable to the prompt provided by the user. This avoids having to specify a set of tables and possibly define multiple AI profiles.
To limit metadata explicitly, you may choose to define multiple AI profiles with subsets of tables that correspond to certain domains or use cases. Then, using an agentic approach, your LLM could select the most likely profile (based on its description) to address the user prompt. For example, if your app has a “Sales Support” section, use an AI profile that includes related tables. Similarly, in the “Customer Relationship” section, use a customer-focused AI profile that lists only customer-related tables.
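As a sketch of the multi-profile approach, with hypothetical names throughout, each domain gets its own profile and the application (or an agent) activates the one matching the current context:

```sql
-- Hypothetical domain-scoped profile for the "Sales Support" section
BEGIN
  DBMS_CLOUD_AI.CREATE_PROFILE(
    profile_name => 'SALES_SUPPORT',
    description  => 'Questions about orders, products, and sales performance',
    attributes   => '{"provider": "openai",
                      "credential_name": "OPENAI_CRED",
                      "object_list": [{"owner": "SALES", "name": "ORDERS"},
                                      {"owner": "SALES", "name": "PRODUCTS"}]}'
  );
END;
/

-- Activate whichever profile matches the application context
EXEC DBMS_CLOUD_AI.SET_PROFILE('SALES_SUPPORT');
```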
In other cases, the number of tables can be reduced by joining tables and views using a new view. This is another benefit of pre-joining tables.
In addition, avoid listing tables and views in your AI profile that contain the same or similar content, especially “dev” or “test” versions of the real tables. If the wrong table is selected based on the object list in the AI profile, your results will likely be incorrect. Similarly, consider defining views that omit columns that are redundant or not (properly) populated; the LLM has no access to the content of these columns, so it cannot judge whether they are appropriate to use. This technique reduces the likelihood of the LLM including such columns in a generated query.
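For example, a view can expose only the trustworthy columns of a wider table (all names hypothetical):

```sql
-- View that hides redundant and unpopulated columns from the LLM
CREATE OR REPLACE VIEW customers_v AS
SELECT customer_id,
       customer_name,
       country,
       status
FROM   customers;  -- legacy_code and flex_field columns intentionally omitted
```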
5. Improve your prompt
If you ask an LLM a vague question, you should expect to get a general or less helpful result. For example, a question like “How many customers do I have?” could result in a count of all rows in your customer table. However, you may have actually meant to ask about active customers, or customers that you support, or customers of a specific product. As such, be sure to provide sufficient detail in your prompt to avoid the LLM having to make assumptions about what you mean.
On another level, do you know how to refer to the values in your tables, or better yet, does the LLM? For example, you ask, “How many customers are in the US?” You know there are thousands of such customers, but you get back zero as your answer. While the LLM knows from the metadata that there is a country column, it doesn’t know the correct values to use (unless you enumerated them in comments). Should the LLM use “US”, “USA”, “United States”, “United States of America”, or “Les États-Unis” in the WHERE clause? You see the problem. If the number of distinct values is reasonable and relatively unchanging, you may include this information as part of a column comment and enable using comments in your AI profile. Your application may allow or encourage users to view or search for related values, so they know what terms to use. For example, a user may supply a prompt to “list the distinct country names for product X.” Then when asking a more insightful question, use the appropriate country string in your prompt.
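With an active profile, that two-step workflow might look like the following sketch; the SELECT AI syntax is real, but the profile name and prompts are illustrative:

```sql
-- Assumes a profile has already been created; name is hypothetical
EXEC DBMS_CLOUD_AI.SET_PROFILE('SALES_PROFILE');

-- First, discover how the values are actually stored
SELECT AI list the distinct country names for product X;

-- Then use the exact stored value in the real question
SELECT AI how many customers are in the UNITED STATES;
```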
Case sensitivity can also cause problems. Suppose all values in your country column are in upper case (e.g., UNITED STATES). If your prompt states, “How many customers are in the United States?”, you will likely get no results. To address this problem, you can use an AI profile attribute to enable or disable case-sensitive WHERE clauses for string columns:
"case_sensitive_values": false
6. Choose your LLM wisely
Lastly, not all LLMs are good at SQL generation, although more and more can generate usable queries. It’s worth trying a few LLMs to compare results for your domain. Historically, Llama 3 and OpenAI models have performed well.
Consult benchmarks such as BIRD-SQL to understand the relative performance of different models on natural language-to-SQL generation.
For more information…
See these resources.
- Try Select AI for free on OCI: Autonomous Database Free Trial
- Documentation: Use Select AI for Natural Language Interaction with your Database
- Documentation: DBMS_CLOUD_AI package
- Video: Getting Started with Oracle Select AI
- Blog: Select AI Enhances Text-to-SQL Support
- Blog: Now Available! Pre-built Embedding Generation model for Oracle Database 23ai
- Blog: Enhance your AI/ML applications with flexible Bring Your Own Model options
- LiveLab: Chat with Your Data in Autonomous Database Using Generative AI
- LiveLab: Develop apps using GenAI, Autonomous Database, and React