ChatGPT paper: Battle of the Large Language Models: Dolly, LLaMA, Vicuna, Guanaco, Bard, ChatGPT -- A Comparison on Natural Language to SQL (NL2SQL, Text-to-SQL) (Part 2)

3 Evaluation results

3.1 Spider dataset

Table 2 lists the execution accuracy (EX) and test suite (TS) accuracy for various prompting strategies and model combinations. Our main findings are:

  • Open-source models struggle on the Spider dataset: despite a positive correlation between parameter count and performance, open-source models have difficulty reaching high accuracy on Spider. For example, although Vicuna 7B and 13B show clear improvements over the original pre-trained LLaMA 7B and 13B models, they still lag far behind Bard and GPT-3.5. Furthermore, Dolly performs poorly across the different prompting strategies, even compared to the 13B version of LLaMA.
  • LLM performance is highly sensitive to prompting style: our empirical results confirm that there is no universal prompting strategy that works for all models. While the IS prompting strategy proved effective for GPT-3.5, Bard, Vicuna, and Guanaco, it produced suboptimal accuracy for Dolly and LLaMA. Surprisingly, LLaMA achieves its best results with the S3 prompting strategy, whereas the performance of GPT-3.5 deteriorates significantly under the same prompts.
  • Few-shot learning with random examples provides limited gains: most 1SL and 5SL results are poor or, at best, comparable to the other prompting strategies. There are, however, some exceptions. One is Dolly, whose 12B variant performs better with the 1SL prompting strategy than with the other strategies; this result appears anomalous, since no similar improvement was observed in the other 1SL and 5SL results. Another is LLaMA, for which the few-shot prompting strategies outperform some of the zero-shot strategies. For example, the 30B LLaMA model achieves 22.4% EX and 19.9% TS accuracy with only 5 given examples, close to the performance of the Guanaco model (24.4% EX and 19.0% TS). A sketch of how execution accuracy can be computed follows this list.
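For readers unfamiliar with the metrics above, the sketch below illustrates one simple way to compute execution accuracy (EX) against a SQLite database: a prediction counts as correct if executing it returns the same rows as the gold query. This is only an illustration, not the paper's evaluation harness (the paper additionally reports test-suite (TS) accuracy, which, as we understand it, checks predictions against multiple database variants); the function names and the sorted-rows comparison are our own simplifications.

```python
import sqlite3

def execute(db_path: str, sql: str):
    """Run a query against a SQLite database; return its rows, or None if it fails."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None  # invalid or failing SQL is treated as incorrect
    finally:
        conn.close()

def execution_accuracy(pairs, db_path: str) -> float:
    """Fraction of (gold_sql, predicted_sql) pairs whose executions return the same rows.

    Row order is ignored by comparing sorted result sets, a simplification of the
    official Spider evaluation.
    """
    if not pairs:
        return 0.0
    correct = 0
    for gold_sql, pred_sql in pairs:
        gold_rows = execute(db_path, gold_sql)
        pred_rows = execute(db_path, pred_sql)
        if gold_rows is not None and pred_rows is not None:
            if sorted(map(repr, gold_rows)) == sorted(map(repr, pred_rows)):
                correct += 1
    return correct / len(pairs)
```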

3.2 Classic datasets

Since Academic, Restaurants, IMDB, and Yelp do not have training sets, we draw the 1SL and 5SL examples from the evaluation sets of the other classic datasets. We highlight some key findings based on the results in Table 3:

  • LLMs perform poorly on most classic datasets: in particular, the highest accuracies achieved on these datasets are only 2.9% and 2.4%, respectively, far below the 34.0% and 45.2% baselines reported in earlier studies that used traditional LSTM- or BERT-based (Devlin et al., 2019) seq2seq models. Furthermore, even with instruction tuning, Vicuna, Guanaco, and Dolly face considerable challenges on the classic datasets; their execution accuracy across the various prompting strategy and dataset combinations is often close to zero.
  • The effectiveness of few-shot learning varies across models: in contrast to the findings on the Spider dataset, we observe improved 1SL and 5SL performance for LLaMA and GPT-3.5. For example, with 1SL the performance of GPT-3.5 on the GeoQuery dataset improves from 15.4% to 42.3%, while with 5SL the performance of LLaMA on the same dataset also improves noticeably, from 12.1% to 15.4%. However, we did not observe similar improvements with 1SL or 5SL for Dolly, Vicuna, or Bard.
  • Attaching database example rows is ineffective: just as with the Spider dataset, the S3 prompting strategy produces subpar results when applied to the classic datasets across the different models. It therefore appears that the S3 prompting strategy is not an effective choice for text-to-SQL (a sketch of attaching example rows to a prompt follows this list).
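As referenced in the last bullet, here is a hedged sketch of what "attaching database example rows" (the S3-style prompt) can look like in practice: the schema of each table plus a few rows pulled with a LIMIT query. The template, function name, and comment format are our own assumptions, not the paper's exact prompt.

```python
import sqlite3

def schema_with_sample_rows(db_path: str, rows_per_table: int = 3) -> str:
    """Build a prompt fragment: each table's CREATE statement plus a few sample rows.

    Assumes simple table names (no spaces or quoting); this is an illustration only.
    """
    conn = sqlite3.connect(db_path)
    parts = []
    try:
        tables = conn.execute(
            "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
        for name, create_sql in tables:
            parts.append(create_sql)
            rows = conn.execute(f"SELECT * FROM {name} LIMIT {rows_per_table}").fetchall()
            parts.append(f"/* sample rows from {name}: {rows} */")
    finally:
        conn.close()
    return "\n".join(parts)
```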

4 Discussion

4.1 Does LLM generate valid SQL?

One possible explanation for the poor performance of the large language models is an inability to understand the intent behind prompts designed to elicit SQL statements; GPT-3.5, for instance, often fails to produce a valid response to the S3 prompts. To assess the extent of such cases, we plot the proportion of valid SQL statements generated by the various large language models under the different prompting strategies in Figures 1a and 1b. On the Spider dataset, many models (except Dolly) consistently generate valid SQL more than 90% of the time with the IS, 1SL, and 5SL prompting strategies. Interestingly, LLaMA also demonstrates the ability to generate valid SQL, even though it was not specifically fine-tuned on instruction data. On the classic datasets, Bard-P2 and GPT-3.5 still generate valid SQL in the 80-100% range, whereas open-source models such as Vicuna and Dolly struggle to exceed 75%. Of particular note are the opposite trends observed for LLaMA and Guanaco: LLaMA generates more valid SQL with few-shot learning, while Guanaco's validity rate decreases as the number of examples increases.
Additionally, we note that the AD and S3 prompting strategies are often suboptimal, as they cause a significant reduction in the number of valid SQL responses across all datasets for many of the large language models. GPT-3.5 is particularly vulnerable to the S3 prompting strategy, which causes a sharp drop in the percentage of valid SQL generated on both the Spider and classic datasets. Finally, it is important to emphasize that even when these language models generate syntactically valid SQL, the queries are often semantically inaccurate and fail to answer the input question, which is why execution accuracy remains very low on most datasets.
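One lightweight way to approximate the validity check behind Figure 1 (not necessarily the procedure used in the paper) is to ask SQLite to compile each generated statement with EXPLAIN against the target database; the function names below are hypothetical.

```python
import sqlite3

def is_valid_sql(db_path: str, sql: str) -> bool:
    """Return True if SQLite can compile the statement against the given database.

    EXPLAIN forces parsing and query planning without running the full query, so
    syntax errors and references to missing tables or columns are caught.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(f"EXPLAIN {sql}")
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

def valid_sql_rate(predictions, db_path: str) -> float:
    """Proportion of generated SQL statements that compile (cf. Figures 1a and 1b)."""
    if not predictions:
        return 0.0
    return sum(is_valid_sql(db_path, sql) for sql in predictions) / len(predictions)
```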

[Figures 1a and 1b: proportion of valid SQL statements generated by each model under the different prompting strategies]

4.2 How does sample selection affect the performance of 1SL and 5SL?

Including random examples from the training set in prompts does not significantly improve the performance of the different models

Based on the results in Tables 2 and 3, it is clear that including random examples from the training set in the prompt does not significantly improve the performance of the different models. The only exceptions are LLaMA and GPT-3.5, which show significant improvements on most classic datasets when using the 1SL and 5SL prompting strategies. The improvement of LLaMA under 1SL and 5SL can be partially attributed to the fact that exposing LLaMA to more examples substantially increases its ability to generate valid SQL, as shown in Figure 1b.
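To make the 1SL/5SL setup concrete, the sketch below randomly samples k (question, SQL) demonstrations from a training split and prepends them, together with the schema, to the target question. The template and names are our assumptions; the paper's exact prompt format may differ.

```python
import random

def build_few_shot_prompt(schema: str, question: str, train_examples,
                          k: int = 5, seed: int = 0) -> str:
    """Assemble a k-shot prompt (k=1 for 1SL, k=5 for 5SL) from random training examples.

    `train_examples` is a list of (question, sql) pairs drawn from the training split.
    """
    rng = random.Random(seed)
    demos = rng.sample(train_examples, min(k, len(train_examples)))
    lines = [schema, ""]
    for demo_question, demo_sql in demos:
        lines.append(f"-- Question: {demo_question}")
        lines.append(demo_sql)
        lines.append("")
    lines.append(f"-- Question: {question}")  # the model is expected to complete the SQL
    return "\n".join(lines)
```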

LLMs adapt to the canonical SQL style

Another notable observation is that when the large language models are fed examples from the classic datasets, they start generating SQL in a style similar to the canonical format described in Finegan-Dollak et al. (2018), as shown in Figure 2, where table aliases follow the standardized convention of <TABLE_NAME>alias<N>.
[Figure 2: example of generated SQL following the canonical style, with table aliases of the form <TABLE_NAME>alias<N>]

LLM sensitivity to style changes

To evaluate the extent to which the LLMs follow the canonical SQL style when generating SQL under 1SL and 5SL, we examine in Table 4 the proportion of generated SQL statements containing the term "alias". Our results show that changes in the generated SQL style are significant only under the 1SL and 5SL prompting strategies. Notably, LLaMA stands out among all models, consistently including the term "alias" in more than 86% of its generated SQL statements. Interestingly, Bard is the least sensitive to the canonical SQL style, with the style change observed in only 16.0% of its generated SQL, while GPT-3.5 shows higher sensitivity, with more than 50% of its generated SQL affected. Based on this observation, we hypothesize that this difference in sensitivity may be a contributing factor to the greater success of the 1SL and 5SL prompting strategies for LLaMA and GPT-3.5.
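A hedged sketch of how the Table 4 measurement could be reproduced: count the fraction of generated statements that contain the token "alias", the marker of the canonical style of Finegan-Dollak et al. (2018). The example query in the docstring is illustrative, not taken from the paper.

```python
def canonical_style_rate(generated_sql) -> float:
    """Fraction of generated SQL statements written in the canonical alias style.

    Canonical-style queries name tables like `CITY AS CITYalias0`, for example:
        SELECT CITYalias0.CITY_NAME FROM CITY AS CITYalias0 WHERE ...
    so the presence of the substring "alias" is used as a simple proxy.
    """
    if not generated_sql:
        return 0.0
    hits = sum("alias" in sql.lower() for sql in generated_sql)
    return hits / len(generated_sql)
```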
[Table 4: proportion of generated SQL statements containing the term "alias"]

Impact of sampling from different sources on performance

We conclude this section with a brief discussion of experiments that sample examples from sources other than the training set. Table 5 lists the 1SL and 5SL results obtained when drawing examples from two different sources: (1) the Spider training set and (2) the evaluation set itself. In the second case, we guard against answer leakage by filtering out every example that shares the same SQL answer as the question of interest. We find that using examples from the Spider dataset not only fails to yield any benefit but actually degrades model performance below that of the zero-shot approaches. On the other hand, we observe improved evaluation results when we include examples from the evaluation set. After careful inspection of the prompts, we found that some examples were syntactically similar to the expected SQL responses, with the main differences being tables, columns, and values. This finding highlights the sensitivity of LLMs to the examples provided in the prompt: we hypothesize that if LLMs are given examples syntactically close to the expected SQL responses, they may generate more accurate SQL statements.
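To make the leakage safeguard concrete, the sketch below samples demonstrations from an evaluation set while excluding any example whose gold SQL matches the target's gold SQL. The normalization step and the names are our assumptions about how such a filter might be implemented.

```python
import random

def sample_eval_demos(eval_examples, target_gold_sql: str, k: int, seed: int = 0):
    """Sample k (question, sql) demos from the evaluation set, skipping any example
    whose gold SQL equals the target's gold SQL (to avoid answer leakage)."""
    def normalize(sql: str) -> str:
        return " ".join(sql.lower().split())  # crude case/whitespace normalization

    target = normalize(target_gold_sql)
    pool = [(q, s) for q, s in eval_examples if normalize(s) != target]
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))
```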
[Table 5: 1SL and 5SL results when examples are drawn from the Spider training set versus from the evaluation set]

4.3 Are we truly evaluating text-to-SQL in a zero-shot or few-shot setting?

We have identified several potential sources of data contamination (Elangovan et al., 2021; Lewis et al., 2021; Magar and Schwartz, 2022) that raise concerns about the true nature of zero-shot or few-shot evaluation on text-to-SQL datasets. These sources include the availability of the Spider and classic datasets in GitHub repositories, as well as the presence of the Spider dataset on platforms such as Huggingface Datasets. Additionally, text-to-SQL datasets may also be included in instruction-tuning dataset collections such as FLAN (Wei et al.). We end the paper with a question for researchers to consider: are we really performing zero-shot or few-shot evaluation of large language models when they have already been exposed to our evaluation data?

5 Related work

Recently, large decoder-based language models have made great strides on code generation tasks (Li et al., 2023b; Fu et al., 2023; Darm et al., 2023). These models rely on unsupervised autoregressive learning over large-scale text corpora, which enables them to capture rich semantic relationships and word probability distributions. Although they perform well in-context with just one or a few examples, recent research shows that they still struggle with text-to-SQL tasks involving complex reasoning (Liu et al., 2023).

Several works focus on improving the text-to-SQL parsing capabilities of large language models through better prompt design. In a study by Nan et al. (2023), the authors emphasize the importance of carefully selecting in-context learning examples and demonstrate that incorporating syntactic structure into the example queries can greatly enhance the few-shot capabilities of large language models. Chang and Fosler-Lussier (2023) conducted a comprehensive study of the impact of prompt length on the performance of text-to-SQL models and also examined the sensitivity of database knowledge representations across domains. Guo et al. (2023) proposed a case-based reasoning framework that adapts the input to GPT-3.5 in a cross-domain setting by adaptively retrieving case prompts. Rai et al. (2023) improve the generalization ability of large language models using boundary-based techniques that preprocess prompts at the token and sequence level of the schema and SQL.

Meanwhile, other studies have explored the potential benefits of complex multi-step reasoning for improving the text-to-SQL performance of large language models. Tai et al. (2023) showed that least-to-most prompting (Zhou et al., 2023) may be unnecessary and that directly applying Chain-of-Thought (CoT) prompting (Wei et al., 2022) may lead to error propagation. Liu and Tan (2023) introduced a divide-and-prompt paradigm for text-to-SQL that splits the task into multiple subtasks and applies CoT prompting to each subtask. In another study, Pourreza and Rafiei (2023) employed a self-correction module in a zero-shot setting to achieve new state-of-the-art results on the Spider leaderboard; this module feeds the solution of each sub-problem back to the large language model, allowing it to build a better overall solution.

6 Conclusions and future work

This paper systematically evaluates the text-to-SQL parsing capabilities of six popular large language models on nine benchmark datasets using five different prompting strategies. Our results show that the open-source models significantly underperform the closed-source models; notably, even GPT-3.5 performs worse than smaller baseline models on several classic datasets. We are making our results available for further analysis and to stimulate future research. There are several directions we would like to explore. First, we plan to fine-tune these large language models on text-to-SQL datasets with limited GPU resources, using techniques such as low-rank adaptation (Hu et al., 2021). Second, we want to explore methods that dynamically select examples for in-context learning. Finally, we are interested in studying the feasibility and limitations of applying these large language models to multi-turn text-to-SQL datasets such as SParC (Yu et al., 2019).
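As a pointer for the fine-tuning direction mentioned above, here is a minimal sketch of wrapping a causal language model with low-rank adapters using the Hugging Face peft library. The base checkpoint name, rank, and target modules are placeholders we chose for illustration, not settings from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "huggyllama/llama-7b"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Low-rank adaptation: train small rank-r update matrices instead of all weights.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights require gradients
```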
Limitations

First, we acknowledge that the scope of this study is limited to six large language models, which do not cover the entire research landscape; there are exciting new entries in the LLM family, such as the Falcon models. Second, appending 5 examples to the database schema of some classic datasets may, in some cases, exceed the 2048-token limit of the open-source models, resulting in truncation that may penalize models with shorter context windows. Finally, some models generate not only SQL statements but also supplementary information such as explanations. To ensure accuracy, we developed regular expression patterns to extract only the SQL statements on a best-effort basis; nonetheless, we acknowledge that our rules may not be completely foolproof and may introduce erroneous SQL in some cases.
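The limitations mention best-effort regular expressions for pulling SQL out of model responses that also contain explanations. The paper does not publish its patterns; the sketch below is only one plausible heuristic, and both the pattern and the function name are our assumptions.

```python
import re

def extract_sql(response: str):
    """Best-effort extraction of a single SQL statement from a model response.

    Prefers a fenced code block if one is present, then falls back to the first span
    that starts with SELECT/WITH and runs up to a semicolon or the end of the text.
    Returns None if nothing that looks like SQL is found.
    """
    fenced = re.search(r"`{3}(?:sql)?\s*(.*?)`{3}", response, re.IGNORECASE | re.DOTALL)
    candidate = fenced.group(1) if fenced else response
    match = re.search(r"\b(SELECT|WITH)\b.*?(;|$)", candidate, re.IGNORECASE | re.DOTALL)
    if not match:
        return None
    return match.group(0).strip().rstrip(";") + ";"
```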

Original paper: Battle of the Large Language Models: Dolly vs LLaMA vs Vicuna vs Guanaco vs Bard vs ChatGPT - A Text-to-SQL Parsing Comparison


Source: blog.csdn.net/rkjava/article/details/135432655