BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) is a pioneering, cross-domain dataset that examines the impact of extensive database contents on text-to-SQL parsing. BIRD contains over 12,751 unique question-SQL pairs and 95 big databases with a total size of 33.4 GB. It covers more than 37 professional domains, such as blockchain, hockey, healthcare, and education.

News

Oct. 9, 2025: We release our paper on BIRD-Interact, detailing how we built the dataset and our key findings on why interaction and communication are critical for frontier LLMs to become more reliable assistants in the DBA cycle. Currently, GPT-5 (Med) achieves only an 8.67% SR on c-Interact and a 17.00% SR on a-Interact over the full tasks. Additionally, we will release Mini-Interact, which will run on SQLite with our trained, stable local user simulator; it is currently undergoing alignment and functional testing before release. If you find our work helpful, we would be grateful if you starred us; it is a strong motivation for our open-source efforts!

Sep. 19, 2025: We've released BIRD23-train-filtered, a high-quality filtered subset of the BIRD train split (6,601/9,428, roughly 70%) that works as a drop-in replacement for text-to-SQL finetuning. See the example finetuning usage here.

Sep. 10, 2025: We're excited to announce the release of LiveSQLBench-Base-Full-V1 (600)! It is the first text-to-SQL benchmark covering the full SQL spectrum with a Hierarchical Knowledge Base (HKB) and test cases. We provide two types of queries, normal and colloquial, so people can test according to their own needs. The flagship model Gemini-2.5-Pro achieves only 28.67 on colloquial queries and 35.67 on normal queries. Base-Lite and Base-Full-V1 will be locked versions for the development of research methods. Detailed performance is on our website.

Aug. 26, 2025: We're excited to announce the release of the BIRD-Interact-Full (600) set! It's a tough one: the best LLMs achieve only a 16.33% success rate, with just 10.0% on the c-Interact and a-Interact portions. For more details, please visit our project website. We'll be sending the ground truth and test cases to our mailing list this week; if you want early access, please send an email as instructed on the site for an automatic download. We've also released a SQLite version of LiveSQLBench-Lite for easier local research. The full LiveSQLBench-Base and -Large versions are coming soon!

Jul. 15, 2025: LLMs often struggle with self-correction due to self-enhancement bias, a problem particularly pronounced in SQL generation, where declarative syntax and concise logs provide limited guidance. Our new ACL Main work, SHARE, mitigates this significantly by (1) converting SQL queries into procedural programming steps and (2) proposing a generalized pipeline based on on-policy multi-agent methods for fine-tuning SLMs. Models: BAM, SAM; Dataset: https://huggingface.co/datasets/birdsql/share-bam; Paper: https://huggingface.co/papers/2506.00391; Code: https://github.com/quge2023/SHARE

Jul. 10, 2025: We optimized and uploaded bird-mini-dev with 3 dialects to Hugging Face based on community requests. Thanks! Please download and check!

Jul. 10, 2025: We release the human performance scores on BIRD-CRITIC 1.0! It includes groups both with and without the use of GenAI tools. The huge gap encourages more intelligent solutions to serve users in real-world DB applications. Please check.

Jun. 25, 2025: We document our procedures, findings, and training recipe for open-source LLM agents that resolve SQL debugging tasks in our paper SWE-SQL.

Jun. 08, 2025: We have released bird-critic-1.0-postgresql (SQL issues for a single dialect). Check out the data on Hugging Face and the newest code on GitHub. BIRD-Critic appears to be a challenging reasoning task for text-to-SQL, since all top-performing models are reasoning-based. Have fun! Thanks!

Jun. 4, 2025: We release BIRD-Interact, a comprehensive interactive evaluation for text-to-SQL models. It contains conversational (c-Interact) and agentic (a-Interact) interaction modes. Top results: o3-mini achieves a 24.4% success rate on c-Interact, while Claude-3.7-Sonnet reaches 17.78% on a-Interact. We also discovered Interaction-Time Scaling (ITS): performance scales over extended interactions. Enjoy!

May 29, 2025: We release LiveSQLBench, the first contamination-free text-to-SQL benchmark covering the full SQL spectrum, featuring more advanced SQL, annotated RDBs, structured/unstructured knowledge bases, test cases, etc. First, we present LiveSQLBench-Base-Lite with 270 tasks as a trial. Even though Lite is the easiest version with the clearest conditions, the SOTA model o3-mini achieves only a 44.81% success rate. We will release the full set and settings. You can preview and interact with data samples on our website. Stay tuned!

May 25, 2025: Please report any further issues you find in BIRD-SQL 2023; we will check and clean the dev set (1534) one last time over the summer, taking all feedback into account. Next week, we will start our new project LiveSQLBench, the first one-stop, contamination-free text-to-SQL benchmark covering the full SQL spectrum, featuring more advanced SQL, databases, hierarchical/unstructured knowledge bases, and test cases. It also supports multi-turn conversational and interactive evaluation via BIRD-Interact, which will be released together.

May 22, 2025: We update the Single-Model Leaderboard; the Self-Consistency column now has 4 values: empty (no self-consistency used); Few (1-7 candidates used in majority vote); Many (8-32 candidates); Scale (>32 candidates).

Apr. 20, 2025: We have released bird-critic-1.0-open (600 tasks across 4 dialects). Check out the data on Hugging Face and the newest code on GitHub. The full PostgreSQL set will be released one week later. BIRD-Critic appears to be a challenging reasoning task for text-to-SQL, since all top-performing models are reasoning-based. Have fun! Thanks!

Feb. 4, 2025: We've launched BIRD-Critic (a.k.a. SWE-SQL), a brand new text-to-SQL benchmark that really digs into reasoning challenges! A lite version is ready for exploration; full sets are coming soon. Any feedback, inputs, and suggestions are much appreciated!

Nov. 26, 2024: Thanks for the support of BIRD-SQL 2023! We are pleased to share that the BIRD 2025 project has started. It will contain 4-6 new benchmarks, each covering a special focus on professional databases and their knowledge in wild applications. We will release the first benchmark by early January. Feel free to let us know your needs or suggestions for cooking new generations of text-to-SQL challenges. Thanks!

Aug. 4, 2024: The Reward-based Valid Efficiency Score (R-VES) will be used as the efficiency metric for future test submissions. The rationale and formula for R-VES can be found in the Mini-Dev repository. You can check the legacy VES scores for previous submissions here.

Jun. 30, 2024: Due to many requests for test submissions of mixed models (open-source + GPU-based closed-source), we have updated the submission instructions to reduce your waiting time. Please check them out!

Jun. 30, 2024: If you are interested in code agents, please do not miss a SOTA code agent implementation by OpenDevin for the BIRD dev set!

Jun. 27, 2024: Excited to announce the release of our BIRD Mini-Dev dataset with 500 high-quality examples. This dataset includes all BIRD keywords, with modifications to questions such as the addition of window functions. We are the first to deliver it not only in SQLite but also in MySQL and PostgreSQL. We include Soft-F1 and R-VES metrics to reduce bias. Don't miss the column_meaning.json file, preprocessed by TA-SQL and available for the dev and testing sets. Check out our work here: Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation (TA-SQL), appearing in ACL 2024 Findings.

Apr. 27, 2024: Due to the large volume of requests, we have changed the license of our data to CC BY-SA 4.0. However, we take no responsibility for any misuse of our data; we developed this benchmark for research and healthy applications only.

Mar. 13, 2024: Please also take a look at our related work Tapilot-Crossing, the first challenging and more realistic benchmark designed to evaluate LLM agents on interactive data analysis tasks. The code includes Python and a private library, and it covers 6 common agent actions in evaluation.

Sept. 25, 2023: We have released a cleaner version of the dev set; please download it again. We checked all cases of the dev set and fixed all errors we found. After cleaning, the ChatGPT (gpt-3.5-turbo) and GPT-4 (gpt-4-32k) EX scores improved to 42.24 (from 37.22) and 49.15 (from 46.35), respectively. Thanks for all the feedback!

Sept. 21, 2023: Our paper has been accepted by NeurIPS 2023 as a Spotlight! Thanks for all the efforts and suggestions of co-authors, anonymous reviewers, and the awesome researchers and users on GitHub and over email.

July 17, 2023: We update the newest results of GPT-4, Claude-2, and PaLM-2.

July 14, 2023: The data link has been updated, fixing the schema names in the CSV files. Additionally, tied results caused by ORDER BY ... LIMIT 1 are now considered; SQL queries both with and without accounting for tied results are valid at this time.

Jun. 12, 2023: We welcome any suggestions and reported gold errors in help_wanted. Any help is appreciated!

Jun. 5, 2023: We open-sourced Graphix-T5, a graph-aware, semi-pretrained text-to-text PLM specifically designed to improve multi-hop reasoning for the complex text-to-SQL task.

May 30, 2023: If you are interested in ICL, please check out our interesting work deep-thinking🤔. Generate 1000 models for 1000 people smoothly!

BIRD highlights three key challenges:

1. Large and dirty values: Because BIRD's database values were collected from real-world scenarios, they typically retain their original and frequently "dirty" format. Text-to-SQL parsers must therefore first analyze these values to account for their non-standard format before engaging in reasoning.
2. External knowledge: For example, "account.type = 'OWNER'" can be inferred from the knowledge evidence "The condition of the loans requires the account type to be the owner." (see the sketch after this list).
3. Text-to-Efficient-SQL: BIRD is the first text-to-SQL benchmark designed to encourage semantic parsers to produce SQL queries that are not only correct but also efficient. This emphasis on efficiency is especially valuable in real-world data and business analysis circumstances.
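To make point 2 concrete, here is a minimal sketch of how a BIRD-style example pairs a question with its external knowledge evidence and how that evidence surfaces as a predicate in the SQL. The field names (question, evidence, SQL, db_id) follow the BIRD JSON format; the specific question, schema, and query below are hypothetical illustrations, not an actual BIRD example, and the prompt template is an assumption rather than the official prompting code.

```python
# Illustrative sketch (not official BIRD code) of how external knowledge
# evidence accompanies each question and maps onto the SQL.
example = {
    "db_id": "financial",  # database the question is grounded in
    "question": "List the loans whose accounts belong to owners.",  # hypothetical question
    # Evidence sentence taken from the example in the text above:
    "evidence": "The condition of the loans requires the account type to be the owner.",
    "SQL": (
        "SELECT T2.loan_id "
        "FROM account AS T1 "
        "JOIN loan AS T2 ON T1.account_id = T2.account_id "
        "WHERE T1.type = 'OWNER'"  # predicate only recoverable from the evidence
    ),
}

def build_prompt(schema_ddl: str, ex: dict) -> str:
    """Assemble a simple prompt exposing schema, evidence, and question."""
    return (
        f"-- Database schema:\n{schema_ddl}\n"
        f"-- External knowledge: {ex['evidence']}\n"
        f"-- Question: {ex['question']}\n"
        f"-- Write a SQLite query:\n"
    )

if __name__ == "__main__":
    ddl = (
        "CREATE TABLE account (account_id INT, district_id INT, type TEXT);\n"
        "CREATE TABLE loan (loan_id INT, account_id INT, amount REAL);"
    )
    print(build_prompt(ddl, example))
```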
Please follow the Submission Guideline (below) and contact [email protected] for test evaluation. Usually, we will return your results within 10 days!

BIRD is a long-term research project aimed at bridging the gap between semantic parsing models and the success of database applications. To receive the latest updates on the dataset, you can leave your email address.

Citation:

@article{li2024can,
  title={Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
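On the evaluation side, the Text-to-Efficient-SQL emphasis above is scored with an efficiency-aware metric; the current test metric is R-VES, whose exact reward tiers are defined in the Mini-Dev repository. The sketch below only illustrates the original VES idea from the BIRD paper under simplifying assumptions (a single timing run per query, no outlier clipping): a correct prediction is rewarded by the square root of the ratio between gold and predicted execution time.

```python
import sqlite3
import time

def timed_run(db_path: str, sql: str) -> tuple[set, float]:
    """Execute a query and return (result set, elapsed seconds)."""
    with sqlite3.connect(db_path) as conn:
        start = time.perf_counter()
        rows = set(conn.execute(sql).fetchall())
        return rows, time.perf_counter() - start

def valid_efficiency_score(pairs: list[tuple[str, str, str]]) -> float:
    """pairs: (db_path, predicted_sql, gold_sql).
    Reward = sqrt(t_gold / t_pred) when result sets match, else 0.
    The official evaluation averages repeated timing runs; this is a
    single-run simplification."""
    total = 0.0
    for db_path, pred, gold in pairs:
        try:
            pred_rows, t_pred = timed_run(db_path, pred)
            gold_rows, t_gold = timed_run(db_path, gold)
        except sqlite3.Error:
            continue  # unexecutable prediction contributes 0
        if pred_rows == gold_rows and t_pred > 0:
            total += (t_gold / t_pred) ** 0.5
    return 100.0 * total / len(pairs) if pairs else 0.0
```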
Leaderboard - Execution Accuracy (EX)

Single Trained Model Track

This track was proposed to advance training techniques for text-to-SQL and to enable fair comparison of single-model capabilities. Self-consistency is allowed, as it reflects the model's own capabilities, and we categorize submissions based on the number of candidates used. The Self-Consistency column indicates: empty (self-consistency not used); Few (1-7 candidates); Many (8-32 candidates); Scale (>32 candidates).

| Model | Affiliation | Reference | Dev (EX) | Test (EX) |
| --- | --- | --- | --- | --- |
| CSC-SQL + XiYanSQL-QwenCoder-32B-2412 | Wuhan University of Technology + University of Science and Technology of China | [Lei Sheng et al. '25] [link] | 71.33 | 73.67 |
| CSC-SQL + Qwen2.5-Coder-7B-Instruct | Wuhan University of Technology + University of Science and Technology of China | [Lei Sheng et al. '25] [link] | 69.19 | 71.72 |
| SLM-SQL + Qwen2.5-Coder-1.5B-Instruct | Wuhan University of Technology + University of Science and Technology of China | [Lei Sheng et al. '25] [link] | 67.08 | 70.49 |
| SLM-SQL + Qwen2.5-Coder-0.5B-Instruct | Wuhan University of Technology + University of Science and Technology of China | [Lei Sheng et al. '25] [link] | 56.87 | 61.82 |
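As a rough illustration of the kind of self-consistency permitted in this track, the sketch below (an assumption about a typical setup, not the official evaluation harness or any submitted method) executes several candidate SQL queries against a SQLite database and keeps the candidate whose execution result wins the majority vote.

```python
import sqlite3
from collections import Counter

def execute(db_path: str, sql: str):
    """Run a candidate query; return a hashable result set, or None on failure."""
    try:
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(sql).fetchall()
        return frozenset(rows)  # order-insensitive comparison
    except sqlite3.Error:
        return None  # invalid candidates never win the vote

def self_consistency_pick(db_path: str, candidates: list[str]) -> str | None:
    """Majority vote over execution results; the Few/Many/Scale regimes differ
    only in how many candidates are sampled."""
    results = {sql: execute(db_path, sql) for sql in candidates}
    valid = [r for r in results.values() if r is not None]
    if not valid:
        return None
    winner = Counter(valid).most_common(1)[0][0]
    # Return the first candidate whose result matches the winning result set.
    return next(sql for sql, r in results.items() if r == winner)
```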
Mini Dev - Execution Accuracy (EX)

A Lite version of the development dataset, designed to facilitate efficient and cost-effective development cycles, especially for testing and refining SQL query generation models. For more details, please visit the GitHub repository. To update the leaderboard, please make sure your paper or resource is publicly available and submit a PR.

| Model | Affiliation | Reference | Mini-Dev (EX) |
| --- | --- | --- | --- |
| CSC-SQL + XiYanSQL-QwenCoder-32B-2412 | Wuhan University of Technology + University of Science and Technology of China | [Lei Sheng et al. '25] [link] | 67.84 |
| CSC-SQL + Qwen2.5-Coder-7B-Instruct | Wuhan University of Technology + University of Science and Technology of China | [Lei Sheng et al. '25] [link] | 67.47 |
| SLM-SQL + Qwen2.5-Coder-1.5B-Instruct | Wuhan University of Technology + University of Science and Technology of China | [Lei Sheng et al. '25] [link] | 65.25 |
| SLM-SQL + Qwen2.5-Coder-0.5B-Instruct | Wuhan University of Technology + University of Science and Technology of China | [Lei Sheng et al. '25] [link] | 57.11 |
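Both leaderboards above report Execution Accuracy (EX). As a simplified sketch of how EX is commonly computed for BIRD-style evaluation (the official scripts live in the BIRD GitHub repository; this is an assumption-level illustration, not that code), a predicted query counts as correct when its execution result on the corresponding SQLite database matches the gold query's result as a set.

```python
import sqlite3

def run(db_path: str, sql: str) -> set:
    """Execute a query on the SQLite database and return its rows as a set."""
    with sqlite3.connect(db_path) as conn:
        return set(conn.execute(sql).fetchall())

def execution_accuracy(pairs: list[tuple[str, str, str]]) -> float:
    """pairs: (db_path, predicted_sql, gold_sql).
    A prediction scores 1 when its result set equals the gold result set;
    execution errors score 0."""
    correct = 0
    for db_path, pred, gold in pairs:
        try:
            correct += int(run(db_path, pred) == run(db_path, gold))
        except sqlite3.Error:
            pass  # unexecutable predictions are simply counted as wrong
    return 100.0 * correct / len(pairs) if pairs else 0.0
```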