Text-to-SQL Prompt Project (Prompt Engineering)

We've just started an open source project, pg-text-query, with the goal of developing production-ready Large Language Model (LLM) prompts for text-to-SQL translation. We aim to build a top-notch text-to-SQL translator by combining LLMs, our own deep knowledge of the PostgreSQL database, and rigorous testing.


1. Text to SQL: The Basics

SQL is the third most used programming language. It is the first language many aspiring developers learn. We just released AI-powered text-to-SQL translation in bit.io to lower the entry barrier for learning and using SQL, allowing users to focus on the underlying logic of queries rather than syntax.

You can use it now on bit.io. In the bit.io Query Editor or Query API, put #!translate:text on the first line. On the next line, enter your plain-language request, and it will be translated into a query you can edit or run.
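For example, a complete request in the Query Editor looks like this:

#!translate:text
What record with provider 'aws' has the lowest latency?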

Our text-to-SQL feature uses OpenAI's Codex model: it sends the user's text along with database schema information (the "prompt") to the OpenAI LLM. The model generates the requested SQL and returns it to the user, who can then edit (if necessary) and execute the query.
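To make this flow concrete, here is a minimal sketch of that round trip using the OpenAI Python client (pre-1.0 interface). The model name, parameters, and example request are illustrative assumptions, not bit.io's exact configuration:

import os

import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

# Prompt combining schema information with the user's natural language request
# (the prompt format is shown in the next section)
prompt = (
    "-- Language PostgreSQL\n"
    "-- Table penguins, columns = [species text, island text, "
    "bill_length_mm double precision, bill_depth_mm double precision, "
    "flipper_length_mm bigint, body_mass_g bigint, sex text, year bigint]\n"
    "-- A PostgreSQL query to return 1 and a PostgreSQL query for "
    "the most common species on each island\n"
    "SELECT 1;"
)

# Codex-era completion request; model choice and parameters are illustrative
response = openai.Completion.create(
    model="code-davinci-002",
    prompt=prompt,
    temperature=0.0,
    max_tokens=256,
)
print(response["choices"][0]["text"])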

This might sound simple, as if we're just sending text to a third-party API and returning the result; OpenAI's LLM certainly does the heavy lifting here. However, there is a lot of nuance involved in making sure the model returns usable results, and that they are returned quickly, efficiently, and safely. That's where we focus our energy. Decisions such as model selection, hyperparameter values, and prompt content all have a huge impact on the quality of the returned results.

2. The state of the art (so far)

Our text-to-SQL translation feature already works well. We spent a lot of time crafting the initial prompt, comparing several different models, and tuning model hyperparameters. So far, most prompt improvements have come through trial and error: we read a lot about prompt engineering, wrote a lot of prompts, and iterated on what worked. We also leveraged our existing schema aggregation feature to pass schema details along with the prompt, allowing the OpenAI model to return SQL with the correct identifiers.

The current prompt is fairly simple. It starts with three comment lines: the first specifies the language (PostgreSQL); the second passes schema details (schema, tables, columns, types); and the third specifies the desired output, incorporating the user's natural-language query. The last line of the prompt, SELECT 1;, is a hint that the completion should continue as raw, executable SQL rather than as more comments.

-- Language PostgreSQL
-- Table penguins, columns = [species text, island text, bill_length_mm double precision, bill_depth_mm double precision, flipper_length_mm bigint, body_mass_g bigint, sex text, year bigint]
-- A PostgreSQL query to return 1 and a PostgreSQL query for {natural language query}
SELECT 1;

With this prompt, given a clearly specified natural-language query, the OpenAI model typically:

  • returns working SQL corresponding to the user's plain-text query,
  • returns Postgres-compatible SQL, not code from other languages or other SQL dialects, and
  • includes the correct identifiers corresponding to the database schema (though not always properly formed; see below)


3. Challenges

There is still room for improvement.

One of the main hurdles is keeping prompts to the Codex model concise while still providing enough schema information. We want to provide enough information about the database schema to produce usable queries, without sending so many tokens in the prompt that we drive up the cost of using the OpenAI API.
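As a rough illustration of the trade-off, the token cost of schema details can be estimated up front. The sketch below uses the tiktoken library with the p50k_base encoding of Codex-era models; this is an assumption about how one might count tokens, not how bit.io meters its prompts:

import tiktoken

# p50k_base is the encoding used by Codex-era completion models (assumption)
enc = tiktoken.get_encoding("p50k_base")

# Schema comment for a single table, in the prompt format shown above
schema_comment = (
    "-- Table penguins, columns = [species text, island text, "
    "bill_length_mm double precision, bill_depth_mm double precision, "
    "flipper_length_mm bigint, body_mass_g bigint, sex text, year bigint]"
)

# Tokens consumed by one table's schema details; a database with many wide
# tables multiplies this cost on every request
print(len(enc.encode(schema_comment)))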

Properly formatting and quoting identifiers is another challenge. The output is correct most of the time, but in some cases references to tables in schemas other than "public" were malformed, and table names containing uppercase or special characters were returned without the required quotes. These nuances of SQL syntax require careful handling to ensure accurate output.
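For example, a hypothetical table with a mixed-case name in a non-public schema has to be referenced with quoted identifiers, which the model sometimes omits:

SELECT * FROM "Research"."Penguin Sightings";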

Preventing abuse of the text-to-SQL feature through prompt injection is another important challenge, critical to maintaining user trust in the system. Preventing misuse helps us keep the feature narrowly focused and ensures we don't incur unnecessary costs or unintentionally expose a general-purpose code translation tool. For example, it is currently possible (though inconvenient) to generate code in other languages:

#!translate:text
return a string defining a python function for adding two numbers

This query returns:

SELECT 'def add(x, y): return x + y'

There are easier ways to get AI-generated Python code, and there are good reasons to store code snippets in a database. Simply blocking this usage pattern won't solve the problem, but there is still value in anticipating and preparing for possible unexpected usage patterns.

Finally, it is important to prevent users from accidentally modifying or deleting data. No one should execute LLM-generated code without reviewing it, but we also want to make it very clear when a user is about to run a query that could modify or destroy data.

For example, the (rather vague) prompt:

#!translate:text
update the table to make it clear that all of the islands in the table are in Antarctica.

Returns the following SQL:

UPDATE penguins SET island = 'Antarctica' WHERE island IS NOT NULL

Overwriting the "island" column with 'Antarctica' is probably not the user's intent. Perhaps the intent was to add a "continent" column, or to append "(Antarctica)" to each entry in the island column, though the intent is not clear from the prompt. In any case, it would be useful to have safeguards in place to prevent users from blindly executing such queries and accidentally changing data.
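One possible safeguard, sketched below purely as an illustration (it is not part of pg-text-query), is to flag generated SQL containing data-modifying statements so the user must confirm before executing:

import re

# Keywords indicating the generated SQL may modify or delete data
DESTRUCTIVE = re.compile(
    r"\b(UPDATE|DELETE|INSERT|DROP|TRUNCATE|ALTER)\b", re.IGNORECASE
)

def needs_confirmation(sql: str) -> bool:
    """Return True if the generated SQL should require explicit confirmation."""
    return bool(DESTRUCTIVE.search(sql))

print(needs_confirmation(
    "UPDATE penguins SET island = 'Antarctica' WHERE island IS NOT NULL"
))  # True

A keyword check like this is naive (the same words can appear inside string literals), so a real safeguard would more likely inspect the parsed statement, but it illustrates the kind of guard rail we have in mind.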

Of course, we're not pointing out these challenges for fun. We have plans to address these issues.

4. pg-text-query open source project

We are continuously improving the text-to-SQL translation feature, and we want to share that work. We are developing an open source project for prompts, configuration, and testing so we can engage with the community, collect feedback, and share our findings widely. LLMs are often used as generalists: they are good at translating any language into any other and responding to arbitrary text prompts. We want to learn how to make the best text-to-SQL translator possible.

You can start using some key features right away: these tools let you begin improving your own text-to-SQL prompts immediately.

5. Prompt playground

Clone the repository; install Streamlit with pip install streamlit; then run streamlit run playground/app.py from the repository root. This opens an interactive "prompt playground" where you can experiment with different combinations of prompts and schema details.


This is useful for quickly testing and iterating on different prompt ideas and building intuition about what works and what doesn't. You can set an "initialization prompt" (which the end user does not have access to), the user's plain-language query, and the schema details, and see how these different parts of the prompt interact with each other. You can then generate SQL and even execute it against a live database. Be sure to double-check all generated SQL before executing it so you don't accidentally delete or modify data.

6. Schema Details Utility

The db_schema.py module includes utilities for extracting structured schema data from a Postgres database. Providing enough schema information helps the model include the correct identifiers; however, too much schema information ties up a large number of tokens, incurring unnecessary cost and possibly leaving too few tokens for the model to successfully generate the required SQL.

You can use this module as follows:

import os
from pprint import pprint

import bitdotio
from dotenv import load_dotenv
from pg_text_query import get_db_schema

# Load BITIO_KEY from a .env file, if present
load_dotenv()

DB_NAME = "bitdotio/palmerpenguins"
b = bitdotio.bitdotio(os.getenv("BITIO_KEY"))

# Extract a structured db schema from Postgres
with b.pooled_cursor(DB_NAME) as cur:
    db_schema = get_db_schema(cur, DB_NAME)

pprint(db_schema)

7. Prompt and query generation

You can generate prompts (based on our prompt engineering work so far) using the prompt.py module, which provides helpers to prepare Postgres query prompts.

# Assuming the helper is exported at the package level, like get_db_schema above
from pg_text_query import get_default_prompt

# Construct a prompt that includes a text description of the query
prompt = get_default_prompt(
    "most common species and island for each island",
    db_schema,
)

# Note: prompt includes an extra `SELECT 1` as a naive approach to hinting for
# raw SQL continuation
print(prompt)

This returns the following prompt:

-- Language PostgreSQL
-- Table penguins, columns = [species text, island text, bill_length_mm double precision, bill_depth_mm double precision, flipper_length_mm bigint, body_mass_g bigint, sex text, year bigint]
-- A PostgreSQL query to return 1 and a PostgreSQL query for most common species and island for each island
SELECT 1;

The gen_query.py module is a wrapper around openai.Completion.create which handles
sending requests to the OpenAI API.

# Assuming generate_query is also exported at the package level
from pg_text_query import generate_query

# Using the default OpenAI request config, which can be overridden here w/ kwargs
query = generate_query(prompt)
print(query)

From the prompt above, this returns:

SELECT species, island, COUNT(*) FROM penguins GROUP BY species, island
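Because the default request configuration can be overridden with keyword arguments, individual parameters can presumably be adjusted per call. The parameter names below are assumptions based on openai.Completion.create:

# Override parts of the default request config (parameter names assumed to
# match openai.Completion.create)
query = generate_query(prompt, temperature=0.0, max_tokens=256)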

8. Next steps

This project is still in its infancy, but we have a few main directions we want to explore:

  • The project will include a test suite of different query types and request texts for evaluating different prompts.
  • Using this test suite, we plan to compare different models, hyperparameters, and prompts. Are some more accurate than others? Can we achieve the same accuracy with shorter prompts? Can we use faster or more efficient models without sacrificing accuracy?
  • We plan to document any avenues for model abuse, along with mitigation strategies.
  • In the long run, we also plan to fine-tune a model on a large number of text-to-SQL translations.

Original Link: Text-to-SQL Prompt Project—BimAnt
