Improving SQL Queries with Fine-Tuning
In the space of data analysis and database management, generating accurate SQL queries is a critical skill. However, even experienced data professionals can sometimes struggle with complex queries, especially when dealing with nested aggregations or intricate table relationships. This challenge has led to the development of AI-powered SQL generators, which promise to simplify the query-writing process.
Our journey begins with a seemingly simple question: "What is the percentage of orders that have more than one product?" While this query might appear straightforward at first glance, it requires careful consideration of data structure and SQL syntax to execute correctly.
In this blog post, we'll explore how two popular AI SQL generators - SQLCoder2 and SQLCoder-34B - attempted to tackle this problem, and why their solutions fell short. We'll then delve into the process of custom training a model to generate the correct query, highlighting the importance of tailored datasets and fine-tuning in achieving accurate results.
By sharing this experience, we aim to shed light on the current capabilities and limitations of AI in SQL generation, as well as demonstrate a practical approach to improving these tools for specific use cases. Whether you're a data analyst, a machine learning enthusiast, or simply curious about the intersection of AI and database querying, this post will provide valuable insights into the challenges and potential solutions in this field.
Join us as we unravel the complexities of SQL query generation and explore how custom AI training can lead to more reliable and efficient database interactions.
Initial Attempts with Open-Source SQL Generators
To address our question about the percentage of orders with more than one product, we turned to two popular AI-powered SQL generators: SQLCoder2 and SQLCoder-34B. Both models are known for their ability to generate SQL queries from natural language prompts. Let's examine how they performed with our specific task.
We presented SQLCoder2 & SQLCoder-34B with the following prompt:
What is the percentage of orders that have more than one product?
This straightforward question aimed to test the model's ability to interpret the request and generate an appropriate SQL query.
SQLCoder2 responded with the following SQL query:
SELECT CAST(COUNT(DISTINCT CASE WHEN COUNT(shopify_orderlineitem.id) > 1 THEN shopify_order.id END) AS FLOAT) / NULLIF(COUNT(DISTINCT shopify_order.id), 0) AS proportion
FROM shopify_order
JOIN shopify_orderlineitem ON shopify_order.id = shopify_orderlineitem.order_id;
Upon executing this query, we encountered the following error:
SQL Error [42803]: ERROR: aggregate function calls cannot be nested
This error indicates that SQLCoder2's attempt to nest COUNT functions is not valid in standard SQL syntax.
Next, we tried SQLCoder-34B, which produced the following query:
SELECT CAST(COUNT(DISTINCT o.id) filter (
WHERE COUNT(DISTINCT oi.product_id) > 1) AS FLOAT) / NULLIF(COUNT(DISTINCT o.id), 0) AS proportion
FROM shopify_order o
JOIN shopify_orderlineitem oi ON o.id = oi.order_id
GROUP BY o.id;
Unfortunately, this query also resulted in an error:
SQL Error [42803]: ERROR: aggregate functions are not allowed in FILTER
The error suggests that SQLCoder-34B's use of an aggregate function within a FILTER clause is not supported by SQL.
Discussion of limitations of these models
The attempts by both SQLCoder2 and SQLCoder-34B reveal several limitations in their current implementations:
- Lack of dialect-specific knowledge: Both models generated queries that aren't universally compatible across different SQL dialects. This highlights the need for AI models to be aware of specific SQL dialect constraints.
- Incorrect handling of nested aggregations: The models struggled with the concept of counting orders with multiple products, resulting in invalid nested aggregations.
- Overcomplication of the query: Both attempts used more complex structures (like CASE statements and FILTER clauses) than necessary, indicating a potential overfit to certain query patterns.
- Misinterpretation of the task: The models seemed to misunderstand the core requirement of identifying orders with more than one product, leading to overly complex and incorrect solutions.
- Lack of error checking: Neither model was able to validate its output against common SQL syntax rules, resulting in queries that would fail upon execution.
These limitations underscore the challenges in creating AI models that can generate correct and efficient SQL queries for a wide range of scenarios. They also highlight the potential benefits of custom training for specific use cases and database schemas.
In the next section, we'll explore how we addressed these limitations through custom training, resulting in a more accurate and efficient solution to our original question.
Custom Training Approach
After encountering limitations with existing SQL generators, we decided to take a custom approach to solve our specific query challenge. This involved preparing a tailored dataset and fine-tuning a model to better understand and generate the required SQL query.
Preparing a Custom Dataset
We created a custom dataset centred around our original prompt: "What is the percentage of orders that have more than one product?" To create our custom dataset, we leveraged the capabilities of GPT-3.5. This dataset included various permutations of this question, along with similar queries that required counting and percentage calculations across multiple tables. This approach allowed us to quickly produce a relevant and varied dataset tailored to our specific use case.
Our final dataset consisted of 196 entries, each containing a natural language question paired with its corresponding SQL query. We then split this dataset as follows:
- Training set: 80% of the data (approximately 157 entries)
- Testing set: 20% of the data (approximately 39 entries)
This split provides a substantial amount of data for training while reserving a significant portion for testing, helping to ensure the model's ability to generalize to new, unseen examples.
Training Process
The training process took a total of 43 minutes. This relatively short training time demonstrates the efficiency of fine-tuning an existing model compared to training from scratch, making it a practical approach for customizing AI models for specific tasks.
To assess the quality of our training, we plotted the training loss against the validation loss over the course of the training process. This visualization helps us understand how well the model is learning and whether it's overfitting or underfitting the data.
Our analysis of the plot revealed that the model achieved a good fit. This means:
- The training loss decreased steadily over time, indicating that the model was learning from the training data.
- The validation loss also decreased and closely followed the trend of the training loss, suggesting that the model was generalizing well to unseen data.
- There was no significant divergence between training and validation loss, which would have indicated overfitting.
This good fit gives us confidence that our custom-trained model has effectively learned to generate SQL queries for our specific use case without memorizing the training data or failing to capture important patterns.
By taking this custom training approach, we were able to address the limitations of existing models and create a solution tailored to our specific SQL generation needs. The next section will reveal the results of this training and the SQL query our model was able to produce.
Post-Training Results
After our custom training process, we achieved a significant improvement in the model's ability to generate the correct SQL query for our specific problem. Let's examine the successful query and analyze its structure and functionality.
Presentation of the successful SQL query
The custom-trained model produced the following SQL query:
select
(COUNT(*) * 100.0 / (
select
COUNT(*)
from
shopify_order))
from
shopify_order
where
id in (
select
order_id
from
shopify_orderlineitem
group by
order_id
having
COUNT(*) > 1);
Analysis of the query's structure and functionality
1. Overall Structure:
- The main query calculates the percentage
- Subquery in WHERE clause identifies orders with multiple products
2. Key Features:
- Uses COUNT(*) for efficient counting
- Converts ratio to percentage with * 100.0
- Avoids nested aggregations that caused previous errors
3. Functionality:
- Correctly identifies orders with more than one product
- Calculates percentage without risk of division by zero
- Solves the original problem accurately
4. Improvements:
- Overcomes issues from SQLCoder2 and SQLCoder-34B attempts
- Demonstrates proper handling of subqueries and aggregations
5. Potential Optimization:
- For very large datasets, consider replacing IN clause with EXISTS or JOIN for better performance
This query successfully solves the problem, showing the model's improved SQL comprehension and ability to handle complex queries without common errors.
Technical Details
Overview of training methodology
Our training methodology focused on fine-tuning an existing language model to specialize in SQL query generation. We employed a transfer learning approach, leveraging a pre-trained model as our starting point and then adapting it to our specific task using a custom dataset. This method allows us to benefit from the broad knowledge captured in the original model while tailoring its capabilities to our particular use case.
Tools and libraries used
We utilized a robust stack of modern machine-learning tools and libraries for this project:
1. PyTorch: As our primary deep learning framework, PyTorch provided the foundation for model manipulation and training.
2. Hugging Face Transformers: This library offered easy access to state-of-the-art pre-trained models and tools for fine-tuning.
3. PEFT (Parameter-Efficient Fine-Tuning): A library from Hugging Face that implements various efficient fine-tuning methods, including LoRA.
4. Datasets: Another Hugging Face library used for efficient data handling and preprocessing.
5. Accelerate: This library was used to enable distributed training and mixed precision, optimizing our training process.
Discussion of the LoRA (Low-Rank Adaptation) technique
LoRA is an efficient fine-tuning technique introduced in 2021 by Microsoft researchers. It addresses the challenge of adapting large language models to specific tasks without the need to retrain all parameters, which can be computationally expensive and prone to overfitting.
Key Concepts:
1. Low-Rank Matrices: LoRA works by adding pairs of rank-decomposition matrices to the weights of the original model. These matrices are initialized to a zero matrix, ensuring that the model starts from the pre-trained weights.
2. Trainable Parameters: During fine-tuning, only these low-rank matrices are updated, while the original model weights remain frozen. This significantly reduces the number of trainable parameters.
3. Adaptability: Despite the reduced parameter count, LoRA can adapt the model to new tasks effectively, often matching or exceeding the performance of full fine-tuning.
For more information : [check this paper]
Lessons Learned
Importance of custom datasets for specific use cases
Custom datasets are crucial for training AI models on domain-specific tasks. In our case, a tailored dataset focusing on complex SQL queries with nested aggregations significantly improved the model's performance. This approach ensures that the AI learns from relevant examples, leading to more accurate and contextually appropriate outputs.
Benefits of fine-tuning existing models
Fine-tuning pre-trained models, rather than training from scratch, offers several advantages:
1. Reduced training time (our process took only 43 minutes)
2. Lower computational resource requirements
3. Ability to leverage the general knowledge of large language models while adapting to specific tasks
4. Improved performance on niche problems that general models might struggle with
These lessons underscore the value of customization in AI development, especially for specialized applications like SQL query generation. By combining the power of existing models with targeted fine-tuning, we can create more effective and efficient AI tools for specific data analysis needs.
Suggested Reads- How To Use Open Source LLMs (Large Language Model)
Future Improvements and Research Directions
As we reflect on our journey of fine-tuning an AI model for SQL query generation, it's clear that while we've made significant progress, there's still room for improvement and expansion. This section explores potential enhancements to our model and discusses how our approach could be applied to other SQL-related tasks.
Potential ways to enhance the model further
1. Expanding the training dataset:
- Incorporate a wider variety of SQL queries, covering more complex scenarios and edge cases.
- Include queries from different database systems (e.g., MySQL, Oracle) to improve versatility.
2. Fine-tuning hyperparameters:
- Experiment with different learning rates, batch sizes, and epochs to optimize model performance.
3. Incorporating user feedback:
- Develop a system to collect and integrate user feedback on generated queries, creating a continuous improvement loop.
Applicability to other SQL-related tasks
1. Query optimization:
- Adapt the model to suggest performance improvements for existing queries.
- Train on execution plans to recommend optimal indexing strategies.
2. Schema design:
- Extend the model's capabilities to suggest database schema improvements based on query patterns.
3. SQL to natural language:
- Train the model to generate human-readable explanations of complex SQL queries.
4. Data cleaning and preprocessing:
- Adapt the model to generate SQL for common data cleaning tasks, such as handling null values or standardizing formats.
5. Temporal and spatial data queries:
- Expand the model's capabilities to handle specialized queries for time-series or geographic data.
6. Security and access control:
- Train the model to generate SQL that adheres to specific security policies or role-based access controls.
By pursuing these improvements and exploring new applications, we can continue to push the boundaries of AI-assisted SQL query generation.
Limitations of Our Approach
While our custom-trained model successfully generated the correct SQL query, it's important to acknowledge the limitations of this approach:
1. Dataset Specificity: The model was trained on a custom dataset tailored to this particular problem. This means it may not generalize well to significantly different SQL query tasks without additional training.
2. Time and Resource Intensive: Creating a custom dataset and training the model required significant time and computational resources. This approach may not be feasible for every SQL query challenge, especially in time-sensitive situations.
3. Limited Scope: Our solution addresses a specific type of SQL query. It may not be equipped to handle a wide range of SQL operations or complex database schemas without further training.
4. Maintenance and Updates: As database structures or requirements change, the model may need to be retrained or fine-tuned, which requires ongoing effort and expertise.
5. Dependency on Initial Models: Our approach is built upon existing models (SQLCoder2). The quality and limitations of these base models influence the final output, even after custom training.
Conclusion
We started with a challenging task: generating an SQL query to find the percentage of orders with multiple products. Off-the-shelf AI models like SQLCoder2 and SQLCoder-34B couldn't solve this problem correctly. By creating a custom dataset and fine-tuning a model, we successfully generated an accurate SQL query for our specific need.
This experience shows that while AI models for SQL generation are promising, they still have limitations. Custom training can significantly improve their performance for specific tasks. As AI continues to evolve, we can expect more accurate and versatile SQL generation tools. However, human expertise remains crucial in training these models and verifying their outputs. The future of database query generation likely involves a collaboration between AI assistance and human knowledge, leading to more efficient and accurate data analysis processes.