BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) is a pioneering, cross-domain dataset that examines the impact of extensive database contents on text-to-SQL parsing.
BIRD contains 12,751 unique question-SQL pairs and 95 large databases with a total size of 33.4 GB. It covers more than 37 professional domains, such as blockchain, hockey, healthcare, and education.
Oct. 9, 2025: We release the paper for BIRD-Interact, detailing how we built the dataset and our key findings on why interaction and communication are critical for frontier LLMs to become more reliable assistants in the DBA cycle. Currently, GPT-5 (Med) achieves only an 8.67% SR on c-Interact and a 17.00% SR on a-Interact over the full tasks. We will also release Mini-Interact, which runs on SQLite with our trained, stable local user simulator; it is currently undergoing alignment and function testing before release. If you find our work helpful, we would be grateful if you starred us; that is a strong motivation for boosting our open-source efforts!
Sep. 19, 2025: We've released BIRD23-train-filtered, a high-quality filtered subset of the BIRD train split (6,601/9,428, ≈70%) that works as a drop-in replacement for text-to-SQL fine-tuning. See an example of fine-tuning usage here.
Sep. 10, 2025: We're excited to announce the release of LiveSQLBench-Base-Full-V1 (600), the first text-to-SQL benchmark covering the full SQL spectrum with a Hierarchical Knowledge Base (HKB) and test cases. We provide two types of queries, normal and colloquial, so you can test according to your own needs. The flagship model Gemini-2.5-pro achieves only 28.67 on colloquial queries and 35.67 on normal queries. The base-lite and base-full-v1 sets will be locked versions for the development of research methods. Detailed performance numbers are on our website.
Aug. 26, 2025: We're excited to announce the release of the BIRD-Interact-Full (600) set! It's a tough one: the best LLMs achieve only a 16.33% success rate, with just 10.0% on the c-Interact and a-Interact portions. For more details, please visit our project website. We'll be sending the ground truth and test cases to our mailing list this week; if you want early access, please send an email as instructed on the site for an automatic download. On another note, we've also released a SQLite version of LiveSQLBench-Lite for easier local research. The full LiveSQLBench-Base and -Large versions are coming soon!
Jul. 15, 2025: LLMs often struggle with self-correction due to self-enhancement bias, a problem particularly pronounced in SQL generation, where declarative syntax and concise logs provide limited guidance. Our new ACL Main work, SHARE, mitigates this significantly by (1) converting SQL queries into procedural programming steps and (2) proposing a generalized pipeline for fine-tuning SLMs through on-policy multi-agent methods.
Models: BAM, SAM
Dataset: https://huggingface.co/datasets/birdsql/share-bam
Paper: https://huggingface.co/papers/2506.00391
Code: https://github.com/quge2023/SHARE
Jul. 10, 2025: We optimized and uploaded bird-mini-dev with 3 dialects to Hugging Face in response to community requests. Thanks! Please download and check it out!
Jul. 10, 2025: We release the human performance scores on BIRD-CRITIC 1.0! They cover both the group that used GenAI tools and the group that did not. The huge gap encourages more intelligent solutions for serving users in real-world DB applications. Please check them out!
Jun. 25, 2025:
We document our procedures, findings, and training recipe for open-source LLM agents that resolve SQL debugging tasks in our paper
SWE-SQL
.
Jun. 08, 2025: We have released bird-critic-1.0-postgresql (SQL issues for a single dialect). Check out the data on Hugging Face and the newest code on GitHub. bird-critic appears to be a challenging reasoning task for text-to-SQL, since all top-performing models are reasoning-based. Have fun! Thanks!
Jun 4, 2025:
We release
BIRD-Interact
, a comprehensive interactive evaluation for text-to-SQL models. It contains conversational (
c-Interact
) and agentic (
a-Interact
) interaction modes.
Top results: o3-mini achieves
24.4%
Success Rate on
c-Interact
, while Claude-3.7-Sonnet reaches
17.78%
Success Rate on
a-Interact
. We also discovered Interaction-Time Scaling (ITS): performance scales with extended interactions. Enjoy!
May 29, 2025: We release LiveSQLBench, the first contamination-free text-to-SQL benchmark covering the full SQL spectrum, featuring more advanced SQL, annotated RDBs, structured/unstructured knowledge bases, test cases, and more. First, we present LiveSQLBench-Base-Lite with 270 tasks for trial. Even though Lite is the easiest version with the clearest conditions, the SOTA model o3-mini achieves only a 44.81% Success Rate. We will release the full set and additional settings. You can preview and interact with data samples on our website. Stay tuned!
May. 25, 2025: Please report any remaining issues you find in BIRD-SQL 2023; we will check and clean the dev set (1,534) for the last time this summer, taking all feedback into account. Next week, we will start our new project LiveSQLBench, the first one-stop contamination-free text-to-SQL benchmark covering the full SQL spectrum, featuring more advanced SQL, databases, hierarchical/unstructured knowledge bases, and test cases. It will also support multi-turn conversational and interactive evaluation via our BIRD-Interact, which will be released together.
May 22, 2025: We update the Single-Model Leaderboard; the Self-Consistency column now has 4 values: empty (no self-consistency used); Few (1-7 candidates used in majority vote); Many (8-32 candidates); Scale (>32 candidates).
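The majority-vote flavor of self-consistency can be sketched as a vote over the execution results of sampled candidate SQLs. Below is a minimal illustration in Python over SQLite; the toy table and candidate queries are hypothetical, not drawn from BIRD:

```python
import sqlite3

def majority_vote(conn, candidates):
    """Pick a candidate SQL from the largest cluster of identical execution results."""
    clusters = {}
    for sql in candidates:
        try:
            result = frozenset(conn.execute(sql).fetchall())
        except sqlite3.Error:
            continue  # unexecutable candidates get no vote
        clusters.setdefault(result, []).append(sql)
    if not clusters:
        return None
    # Return a representative of the most common execution result.
    return max(clusters.values(), key=len)[0]

# Toy database and candidates (hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])
candidates = [
    "SELECT COUNT(*) FROM t",  # executes to (2,)
    "SELECT COUNT(*) FROM t",  # executes to (2,)
    "SELECT SUM(x) FROM t",    # executes to (3,): outvoted
]
chosen = majority_vote(conn, candidates)
print(chosen)  # SELECT COUNT(*) FROM t
```

Clustering by execution result rather than by SQL string lets syntactically different but equivalent candidates vote together.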
Apr. 20, 2025: We have released bird-critic-1.0-open (600 tasks across 4 dialects). Check out the data on Hugging Face and the newest code on GitHub. The full PostgreSQL set will be released one week later. bird-critic appears to be a challenging reasoning task for text-to-SQL, since all top-performing models are reasoning-based. Have fun! Thanks!
Feb. 4, 2025: We've launched BIRD-Critic (a.k.a. SWE-SQL), a brand-new text-to-SQL benchmark that really digs into reasoning challenges! A lite version is ready for exploration, and full sets are coming soon. Feel free to give any feedback; your input and suggestions are much appreciated!
Nov. 26, 2024: Thanks for the support of BIRD-SQL 2023! We are pleased to share that the BIRD 2025 project has started. It will contain 4-6 new benchmarks, each covering its own special focus on professional databases and their knowledge in in-the-wild applications. We will release the first benchmark by early January. Feel free to let us know your needs or suggestions for cooking up new generations of text-to-SQL challenges. Thanks!
Aug. 4, 2024:
The Reward-based Valid Efficiency Score (R-VES) will be used as the efficiency metric for future test submissions.
The rationale and formula for R-VES can be found in the
Mini-Dev repository
.
You can check the legacy VES scores for previous submissions
here
.
Jun. 30, 2024: Due to a large volume of test-submission requests involving mixed models (open-source + GPU-based closed-source), we have updated the submission instructions to reduce your waiting time. Please check them out!
Jun. 30, 2024: If you are interested in code agents, please don't miss a SOTA code-agent implementation by OpenDevin for the BIRD dev set!
Jun. 27, 2024: Excited to announce the release of our BIRD Mini-Dev dataset with 500 high-quality examples. It includes all BIRD keywords, with modifications to questions such as the addition of window functions. We are the first to deliver it not only in SQLite, but also in MySQL and PostgreSQL. We include Soft-F1 and R-VES metrics to reduce bias. Don't miss the column_meaning.json file, preprocessed by TA-SQL and available for the dev and test sets. Check out our work: Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation (TA-SQL), appearing in ACL 2024 Findings.
Apr. 27, 2024: Due to the large volume of requests, we have changed the license of our data to CC BY-SA 4.0. However, we take no responsibility for any misuse of our data; we developed this benchmark for research and healthy applications only.
Mar 13, 2024: Please also take a look at our related work Tapilot-Crossing, the first challenging and more realistic benchmark designed to evaluate Large Language Model (LLM) agents on interactive data analysis tasks. The code includes Python and a private library, and the evaluation covers 6 common agent actions.
Sept 25, 2023: We have released a cleaner version of the dev set; please download it again. We checked every case in the dev set and fixed all the errors we found. After cleaning, the ChatGPT (gpt-3.5-turbo) and GPT-4 (gpt-4-32k) EX scores improved to 42.24 (from 37.22) and 49.15 (from 46.35), respectively. Thanks for all the feedback!
Sept 21, 2023: Our paper has been accepted by NeurIPS 2023 as a Spotlight! Thanks for all the efforts and suggestions from co-authors, anonymous reviewers, and awesome researchers/users on GitHub and over email.
July 17, 2023: We update the newest results of GPT-4, Claude-2, and PaLM-2.
July 14, 2023: The data link has been updated, fixing the schema names in the CSV files. Additionally, tied results caused by ORDER BY ... LIMIT 1 are now considered: SQL queries both with and without accounting for tied results are valid at this time.
Jun 12, 2023: We welcome any suggestions and reports of gold errors in help_wanted. Any help from you is appreciated!
Jun 5, 2023:
We open-sourced our
Graphix-T5
, a graph-aware semi-pretrained text-to-text PLM
specifically designed to improve multi-hop reasoning for
the complex text-to-SQL task.
May 30, 2023:
If you are interested in ICL, please check out our
interesting work
deep-thinking🤔.
Generate 1000 models for 1000 people smoothly!
1. Large and Dirty Values: Because BIRD's database values were collected from real-world scenarios, they typically retain their original and frequently "dirty" formats. Hence, text-to-SQL parsers must first analyze these values to account for their non-standard formats before engaging in reasoning.
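As a concrete illustration, a naive predicate can miss matches when stored values carry stray whitespace or inconsistent casing, so the query has to normalize the values first. A minimal sketch in Python over SQLite (the account table and its values are hypothetical):

```python
import sqlite3

# Hypothetical table whose values keep their "dirty" real-world format.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER, type TEXT)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [(1, "OWNER "), (2, "owner"), (3, "DISPONENT")])

# A naive exact-match filter misses the non-standard variants...
naive = conn.execute(
    "SELECT COUNT(*) FROM account WHERE type = 'OWNER'").fetchone()[0]

# ...so the query must normalize stored values before comparing.
robust = conn.execute(
    "SELECT COUNT(*) FROM account WHERE UPPER(TRIM(type)) = 'OWNER'").fetchone()[0]

print(naive, robust)  # 0 2
```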
2. External Knowledge: For example, the condition account.type = 'OWNER' can be inferred from the knowledge evidence: "The condition of the loans requires the account type to be the owner."
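To make the grounding concrete, here is a minimal sketch over a toy SQLite schema (table and column names are illustrative, loosely following BIRD's financial domain): the evidence supplies the predicate on account.type that the question alone does not state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (account_id INTEGER PRIMARY KEY, type TEXT);
CREATE TABLE loan (loan_id INTEGER, account_id INTEGER, amount REAL);
INSERT INTO account VALUES (1, 'OWNER'), (2, 'DISPONENT');
INSERT INTO loan VALUES (10, 1, 5000.0), (11, 2, 3000.0);
""")

question = "How many loans were granted?"
evidence = ("The condition of the loans requires the account type "
            "to be the owner.")

# The evidence grounds the otherwise-implicit filter on account.type.
sql = """
SELECT COUNT(*)
FROM loan
JOIN account ON loan.account_id = account.account_id
WHERE account.type = 'OWNER'
"""
count = conn.execute(sql).fetchone()[0]
print(count)  # 1
```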
3. Text-to-Efficient-SQL: BIRD is the first text-to-SQL benchmark designed to encourage semantic parsers to produce SQL queries that are not only correct but also efficient. This emphasis on efficiency is especially valuable in real-world data and business analysis settings.
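One way efficiency shows up in practice: two semantically equivalent predicates can differ wildly in cost depending on whether the planner can use an index. A minimal SQLite sketch (the table and index are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trans (trans_id INTEGER, account_id INTEGER)")
conn.execute("CREATE INDEX idx_account ON trans(account_id)")

# Wrapping the column in an expression defeats the index (full scan)...
slow = "SELECT trans_id FROM trans WHERE account_id + 0 = 42"
# ...while the plain predicate lets the planner use idx_account.
fast = "SELECT trans_id FROM trans WHERE account_id = 42"

plan_slow = conn.execute("EXPLAIN QUERY PLAN " + slow).fetchall()[0][-1]
plan_fast = conn.execute("EXPLAIN QUERY PLAN " + fast).fetchall()[0][-1]
print(plan_slow)  # a SCAN over trans
print(plan_fast)  # a SEARCH using idx_account
```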
Please follow the Submission Guideline (below) and contact [email protected] for test evaluation. Usually, we will return your results within 10 days!
BIRD is a long-term research project aimed at bridging the gap between semantic parsing models and successful database applications. To receive the latest updates on the dataset, you can leave your email address.
@article{li2024can,
  title={Can LLM Already Serve as a Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
Model | Affiliation | Reference | Dev (EX) | Test (EX)
CSC-SQL + XiYanSQL-QwenCoder-32B-2412 | Wuhan University of Technology + University of Science and Technology of China | [Lei Sheng et al. '25] [link] | 71.33 | 73.67
CSC-SQL + Qwen2.5-Coder-7B-Instruct | Wuhan University of Technology + University of Science and Technology of China | [Lei Sheng et al. '25] [link] | 69.19 | 71.72
SLM-SQL + Qwen2.5-Coder-1.5B-Instruct | Wuhan University of Technology + University of Science and Technology of China | [Lei Sheng et al. '25] [link] | 67.08 | 70.49
SLM-SQL + Qwen2.5-Coder-0.5B-Instruct | Wuhan University of Technology + University of Science and Technology of China | [Lei Sheng et al. '25] [link] | 56.87 | 61.82
Model | Affiliation | Reference | EX
CSC-SQL + XiYanSQL-QwenCoder-32B-2412 | Wuhan University of Technology + University of Science and Technology of China | [Lei Sheng et al. '25] [link] | 67.84
CSC-SQL + Qwen2.5-Coder-7B-Instruct | Wuhan University of Technology + University of Science and Technology of China | [Lei Sheng et al. '25] [link] | 67.47
SLM-SQL + Qwen2.5-Coder-1.5B-Instruct | Wuhan University of Technology + University of Science and Technology of China | [Lei Sheng et al. '25] [link] | 65.25
SLM-SQL + Qwen2.5-Coder-0.5B-Instruct | Wuhan University of Technology + University of Science and Technology of China | [Lei Sheng et al. '25] [link] | 57.11
A Lite version of the development dataset, designed to facilitate efficient and cost-effective development cycles, especially for testing and refining SQL query generation models. For more details, please visit the GitHub repository.
To update the Leaderboard, please make sure your paper or resource is publicly available, then submit a PR.
Mini Dev - Execution Accuracy (EX)
Single Trained Model Track
This track was proposed to advance training techniques for text-to-SQL and to allow fair comparison of single-model capabilities. Self-consistency is allowed, since it reflects the model's own capabilities, and we categorize submissions by the number of candidates used. The Self-Consistency column indicates: empty (self-consistency not used); Few (1-7 candidates); Many (8-32 candidates); Scale (>32 candidates).