Profiling Approach Visualization (Paper 2505.19988)

Overview

The paper proposes automatic metadata extraction for text-to-SQL via three components: database profiling, query log analysis, and SQL-to-text generation [1](https://arxiv.org/pdf/2505.19988).

They emphasize that understanding database contents is the hardest part; once understood, writing SQL is straightforward [2](https://arxiv.org/pdf/2505.19988).

Profiling Schema Linking Few-shot & Candidate Selection

Profiling Workflow

Interactive diagram shows how raw tables become LLM-usable metadata.

Basic profiles include statistics and sketches (e.g., MinHash) to characterize fields; more complex constraints exist but the study restricts to basic profiles [3](https://arxiv.org/pdf/2505.19988).

Schema Linking with Profile Metadata

Long descriptions can overflow LLM context, so the system uses schema linking to keep only task-relevant fields, improving performance; perfect linking helps even more [4](https://arxiv.org/pdf/2505.19988).

The method leverages LSH indices over field contents to align literals from the generated SQL to candidate fields and iteratively refine the query [5](https://arxiv.org/pdf/2505.19988).

End-to-end System (BIRD Submission)

Their BIRD leaderboard submission used profiling + LLM summarization + schema linking, plus few-shot retrieval and multi-candidate generation with SQL validation and majority voting [6](https://arxiv.org/pdf/2505.19988).

Reported effects: profiling metadata outperforms supplied metadata; fused metadata best; and schema linking yields gains with or without hints (table shows up to ~77% with perfect linking + hints) [7](https://arxiv.org/pdf/2505.19988).

Notes & Examples

MinHash sketch m(f) stores K minima of hash functions over field values; resemblance is estimated by fraction of matching minima, enabling fast detection of large intersections between fields [8](https://arxiv.org/pdf/2505.19988).
English profiles are generated mechanically (e.g., NULL counts, distincts, min/max, value length), then summarized by an LLM into short/long descriptions; examples include identifying CDSCode semantics and JSON-typed fields [9](https://arxiv.org/pdf/2505.19988).
Example task shows aligning the literal 'Fresno County Office of Education' to the right field via LSH over field contents to fix the query [10](https://arxiv.org/pdf/2505.19988).
Submission details: few-shot via masked question retrieval; generate 3 candidates; validate with SQL checks; majority vote over result sets [11](https://arxiv.org/pdf/2505.19988).