Code LLMs: An introduction to SQLCoder, its installation, and a detailed guide on how to use it

Table of contents

Introduction to SQLCoder

1. Results

2. Results by Question Category

Installation of SQLCoder

1. Hardware requirements

2. Download model weights

3. Use SQLCoder

4. Run SQLCoder in Colab

Step 1: Configure the environment

Step 2: Test the setup

Step 3: Download the model

Step 4: Set up the question and prompt, and tokenize

Step 5: Generate the SQL

How to use SQLCoder


Introduction to SQLCoder

SQLCoder, released in August 2023, is an advanced LLM for converting natural language questions into SQL queries, fine-tuned from the base StarCoder model. Despite having only 15 billion parameters, SQLCoder outperforms gpt-3.5-turbo on natural-language-to-SQL generation on Defog's sql-eval framework, and significantly outperforms all popular open-source models. It also significantly outperforms text-davinci-003, a model more than 10 times its size.

Defog trained SQLCoder on 10,537 human-curated questions over 2 epochs, based on 10 different schemas. None of the schemas in the evaluation framework were included in the training data.

Training was divided into 2 phases: the first phase used questions classified as "easy" or "medium" difficulty, and the second used questions classified as "hard" or "extra hard" difficulty.

The result of training on the easy+medium data was stored in a model called defog-easy. We found that the additional training on hard+extra-hard data led to a 7 percentage point increase in performance.

Official online demo: https://defog.ai/sqlcoder-demo/

GitHub repository: https://github.com/defog-ai/sqlcoder (SoTA LLM for converting natural language questions to SQL queries)

1. Results

| model            | perc_correct |
|------------------|--------------|
| gpt-4            | 74.3         |
| defog-sqlcoder   | 64.6         |
| gpt-3.5-turbo    | 60.6         |
| defog-easy       | 57.1         |
| text-davinci-003 | 54.3         |
| wizardcoder      | 52.0         |
| starcoder        | 45.1         |

2. Results by Question Category

We classify each generated question into one of 5 categories. The table shows the percentage of questions each model answered correctly, broken down by category.

| query_category | gpt-4 | defog-sqlcoder | gpt-3.5-turbo | defog-easy | text-davinci-003 | wizardcoder | starcoder |
|----------------|-------|----------------|---------------|------------|------------------|-------------|-----------|
| group_by       | 82.9  | 77.1           | 71.4          | 62.9       | 62.9             | 68.6        | 54.3      |
| order_by       | 71.4  | 65.7           | 60.0          | 68.6       | 60.0             | 54.3        | 57.1      |
| ratio          | 62.9  | 57.1           | 48.6          | 40.0       | 37.1             | 22.9        | 17.1      |
| table_join     | 74.3  | 57.1           | 60.0          | 54.3       | 51.4             | 54.3        | 51.4      |
| where          | 80.0  | 65.7           | 62.9          | 60.0       | 60.0             | 60.0        | 45.7      |

Installation of SQLCoder

1. Hardware requirements

SQLCoder has been tested on an A100 40GB GPU with bfloat16 weights. You can also load 8-bit and 4-bit quantized versions of the model on consumer GPUs with 20GB or more of memory, such as an RTX 4090 or RTX 3090, or on Apple M2 Pro, M2 Max, or M2 Ultra chips with 20GB or more of unified memory.
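To make the quantization options concrete, here is a minimal loading sketch using the transformers and bitsandbytes libraries. It is our own illustration, not the repository's official loading code, and note that bitsandbytes quantization requires a CUDA GPU (Apple Silicon needs a different toolchain):

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "defog/sqlcoder"
# 8-bit quantization fits on GPUs with roughly 20GB of memory;
# switch to BitsAndBytesConfig(load_in_4bit=True) for ~12GB GPUs.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # let accelerate spread the weights across available devices
)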

2. Download model weights

Model weights: defog/sqlcoder on Hugging Face (https://huggingface.co/defog/sqlcoder)

3. Use SQLCoder

You can use SQLCoder through the transformers library by downloading the model weights from the Hugging Face repository. Sample code for running inference on an example database schema is included:

python inference.py -q "Question about the sample database goes here"

Example question: "Do we get more revenue from customers in San Francisco compared to customers in New York? Give me the total revenue for each city, and the difference between the two."

You can also use the demo on Defog's website, or run SQLCoder in Colab.
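For orientation, the following hand-written query shows the kind of SQL such a question maps to. It is a sketch against the sample schema used in the Colab walkthrough below, not actual SQLCoder output, and matching cities with ILIKE on the free-text address column is our own assumption:

-- Hand-written sketch, not actual SQLCoder output. Assumes the sample schema
-- from step 4 below; ILIKE matching on customers.address is an assumption.
SELECT
  SUM(s.quantity * p.price) FILTER (WHERE c.address ILIKE '%San Francisco%') AS sf_revenue,
  SUM(s.quantity * p.price) FILTER (WHERE c.address ILIKE '%New York%') AS ny_revenue,
  COALESCE(SUM(s.quantity * p.price) FILTER (WHERE c.address ILIKE '%San Francisco%'), 0)
    - COALESCE(SUM(s.quantity * p.price) FILTER (WHERE c.address ILIKE '%New York%'), 0) AS difference
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
JOIN products p ON s.product_id = p.product_id;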

4. Run SQLCoder in Colab

Address: https://colab.research.google.com/drive/1z4rmOEiFkxkMiecAWeTUlPl0OmKgfEu7?usp=sharing#scrollTo=MKuocI44V-Bo

Step 1: Configure the environment

!pip install torch transformers bitsandbytes accelerate

Step 2: Test the setup

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

torch.cuda.is_available()  # should return True; a CUDA GPU is required for the steps below

Step 3: Download the model

Load the model in bf16 on an A100 in Colab Pro (or on any system with more than 30GB of VRAM). If that is not available, load it in 8-bit on a GPU with at least 20GB of VRAM, or in 4-bit on a GPU with at least 12GB of VRAM. On Colab, it works fine on a V100 but crashes on a T4.
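If you are unsure which variant your runtime can handle, a quick check like the following maps the VRAM thresholds above to a loading mode. This is our own convenience sketch, not part of the official notebook:

import torch

# Total memory of the first CUDA device, in GB
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
if vram_gb > 30:
    mode = "bfloat16"   # full bf16 weights (e.g. A100 40GB)
elif vram_gb >= 20:
    mode = "8-bit"      # load_in_8bit=True
elif vram_gb >= 12:
    mode = "4-bit"      # load_in_4bit=True
else:
    mode = "not enough VRAM for SQLCoder"
print(f"{vram_gb:.1f} GB of VRAM -> suggested loading mode: {mode}")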

The first run takes about 10 minutes to download the model and load it into memory, so please be patient :)

model_name = "defog/sqlcoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    # torch_dtype=torch.bfloat16,  # bf16: needs >30GB VRAM (e.g. A100)
    # load_in_8bit=True,           # 8-bit: needs at least 20GB VRAM
    load_in_4bit=True,             # 4-bit: needs at least 12GB VRAM
    device_map="auto",
    use_cache=True,
)

Step 4: Set up the question and prompt, and tokenize

Feel free to change the question below. Edit the schema in the prompt if you want to experiment with your own database schema.

question = "What product has the biggest fall in sales in 2022 compared to 2021? Give me the product name, the sales amount in both years, and the difference."

prompt = """### Instructions:
Your task is to convert a question into a SQL query, given a Postgres database schema.
Adhere to these rules:
- **Deliberately go through the question and database schema word by word** to appropriately answer the question
- **Use Table Aliases** to prevent ambiguity. For example, `SELECT table1.col1, table2.col1 FROM table1 JOIN table2 ON table1.id = table2.id`.
- When creating a ratio, always cast the numerator as float

### Input:
Generate a SQL query that answers the question `{question}`.
This query will run on a database whose schema is represented in this string:
CREATE TABLE products (
  product_id INTEGER PRIMARY KEY, -- Unique ID for each product
  name VARCHAR(50), -- Name of the product
  price DECIMAL(10,2), -- Price of each unit of the product
  quantity INTEGER  -- Current quantity in stock
);

CREATE TABLE customers (
   customer_id INTEGER PRIMARY KEY, -- Unique ID for each customer
   name VARCHAR(50), -- Name of the customer
   address VARCHAR(100) -- Mailing address of the customer
);

CREATE TABLE salespeople (
  salesperson_id INTEGER PRIMARY KEY, -- Unique ID for each salesperson
  name VARCHAR(50), -- Name of the salesperson
  region VARCHAR(50) -- Geographic sales region
);

CREATE TABLE sales (
  sale_id INTEGER PRIMARY KEY, -- Unique ID for each sale
  product_id INTEGER, -- ID of product sold
  customer_id INTEGER,  -- ID of customer who made purchase
  salesperson_id INTEGER, -- ID of salesperson who made the sale
  sale_date DATE, -- Date the sale occurred
  quantity INTEGER -- Quantity of product sold
);

CREATE TABLE product_suppliers (
  supplier_id INTEGER PRIMARY KEY, -- Unique ID for each supplier
  product_id INTEGER, -- Product ID supplied
  supply_price DECIMAL(10,2) -- Unit price charged by supplier
);

-- sales.product_id can be joined with products.product_id
-- sales.customer_id can be joined with customers.customer_id
-- sales.salesperson_id can be joined with salespeople.salesperson_id
-- product_suppliers.product_id can be joined with products.product_id

### Response:
Based on your instructions, here is the SQL query I have generated to answer the question `{question}`:
```sql
""".format(question=question)
eos_token_id = tokenizer.convert_tokens_to_ids(["```"])[0]
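For comparison with the model's output in step 5, here is one hand-written answer to the question above. It is a sketch, not SQLCoder's reference answer; in particular, interpreting "sales amount" as revenue (quantity times price) is our own assumption:

-- Hand-written reference sketch, not actual SQLCoder output.
-- Assumes "sales amount" means revenue (quantity * price).
WITH yearly AS (
  SELECT
    p.name,
    SUM(CASE WHEN EXTRACT(YEAR FROM s.sale_date) = 2021
             THEN s.quantity * p.price ELSE 0 END) AS sales_2021,
    SUM(CASE WHEN EXTRACT(YEAR FROM s.sale_date) = 2022
             THEN s.quantity * p.price ELSE 0 END) AS sales_2022
  FROM sales s
  JOIN products p ON s.product_id = p.product_id
  GROUP BY p.name
)
SELECT name, sales_2021, sales_2022, sales_2022 - sales_2021 AS difference
FROM yearly
ORDER BY difference ASC  -- the biggest fall is the most negative difference
LIMIT 1;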

Step 5: Generate the SQL

Generation can be very slow on a V100 with 4-bit quantization; each query may take about 1-2 minutes. On a single A100 40GB, it takes about 10-20 seconds.


inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
generated_ids = model.generate(
    **inputs,
    num_return_sequences=1,
    eos_token_id=eos_token_id,
    pad_token_id=eos_token_id,
    max_new_tokens=400,
    do_sample=False,
    num_beams=5
)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
torch.cuda.empty_cache()  # empty the cache so more results can be generated without memory crashes
torch.cuda.synchronize()
# This is particularly important on Colab; when running an inference service,
# memory management is much more straightforward.
# And here is the generated SQL:
print(outputs[0].split("```sql")[-1].split("```")[0].split(";")[0].strip() + ";")
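If you plan to run several questions in one session, it can help to wrap the tokenize, generate, and decode steps into a small helper. This is our own convenience sketch, not part of the official Colab; it assumes you keep the step 4 prompt string before the .format() call in a variable named prompt_template, and that model, tokenizer, and eos_token_id are already defined as above:

# Convenience wrapper (our own sketch, not part of the official Colab).
# Assumes `prompt_template` holds the step 4 prompt string *before* .format(),
# and that `model`, `tokenizer`, and `eos_token_id` are defined as above.
def generate_sql(question: str) -> str:
    filled_prompt = prompt_template.format(question=question)
    inputs = tokenizer(filled_prompt, return_tensors="pt").to("cuda")
    generated_ids = model.generate(
        **inputs,
        num_return_sequences=1,
        eos_token_id=eos_token_id,
        pad_token_id=eos_token_id,
        max_new_tokens=400,
        do_sample=False,
        num_beams=5,
    )
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    torch.cuda.empty_cache()  # free the cache between queries, as recommended above
    return outputs[0].split("```sql")[-1].split("```")[0].split(";")[0].strip() + ";"

print(generate_sql("Which product generated the most revenue?"))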

How to use SQLCoder

Updating…

Source: https://blog.csdn.net/qq_41185868/article/details/132571527