Overview
The paper proposes automatic metadata extraction for text-to-SQL via three components: database profiling, query log analysis, and SQL-to-text generation [1](https://arxiv.org/pdf/2505.19988).
They emphasize that understanding database contents is the hardest part; once understood, writing SQL is straightforward [2](https://arxiv.org/pdf/2505.19988).
Profiling Workflow
Basic profiles include statistics and sketches (e.g., MinHash) to characterize fields; more complex constraints exist but the study restricts to basic profiles [3](https://arxiv.org/pdf/2505.19988).
Schema Linking with Profile Metadata
Long descriptions can overflow LLM context, so the system uses schema linking to keep only task-relevant fields, improving performance; perfect linking helps even more [4](https://arxiv.org/pdf/2505.19988).
The method leverages LSH indices over field contents to align literals from the generated SQL to candidate fields and iteratively refine the query [5](https://arxiv.org/pdf/2505.19988).
End-to-end System (BIRD Submission)
Their BIRD leaderboard submission used profiling + LLM summarization + schema linking, plus few-shot retrieval and multi-candidate generation with SQL validation and majority voting [6](https://arxiv.org/pdf/2505.19988).
Reported effects: profiling metadata outperforms supplied metadata; fused metadata best; and schema linking yields gains with or without hints (table shows up to ~77% with perfect linking + hints) [7](https://arxiv.org/pdf/2505.19988).
Notes & Examples
- MinHash sketch m(f) stores K minima of hash functions over field values; resemblance is estimated by fraction of matching minima, enabling fast detection of large intersections between fields [8](https://arxiv.org/pdf/2505.19988).
- English profiles are generated mechanically (e.g., NULL counts, distincts, min/max, value length), then summarized by an LLM into short/long descriptions; examples include identifying CDSCode semantics and JSON-typed fields [9](https://arxiv.org/pdf/2505.19988).
- Example task shows aligning the literal 'Fresno County Office of Education' to the right field via LSH over field contents to fix the query [10](https://arxiv.org/pdf/2505.19988).
- Submission details: few-shot via masked question retrieval; generate 3 candidates; validate with SQL checks; majority vote over result sets [11](https://arxiv.org/pdf/2505.19988).