Small Models, Big Impact
With progress in the field of Large Language Models (LLMs) moving so fast, the race to develop the most powerful model is in full swing. Companies and researchers are pushing the limits, scaling models to unprecedented sizes. However, this growth brings significant challenges: training and running these massive models demands enormous amounts of hardware and energy, and their latency can make real-time, user-facing applications difficult to build.
Bigger isn’t always better. In this post, we’ll explore how Small Language Models (SLMs), with fewer than 10 billion parameters, offer a compelling alternative. While smaller in scale, SLMs can be fine-tuned for specialized tasks to deliver results that rival their larger counterparts. This approach balances performance and practicality by reducing the need for extensive resources and enabling faster, more sustainable AI deployments. Join us as we dive into the advantages of SLMs and see how they can shape the future of AI applications.
Our use-case: keyword extraction
At Joko, we’re building an AI assistant powered by an extensive product catalog to help users easily find and explore products that match their needs. This project comes with many challenges, most notably: how do we design the optimal pipeline to deliver the best user experience?
One of the most crucial steps is crafting the best search query for our product catalog by extracting keywords from the user’s natural language shopping query. Indeed, extracting the most relevant keywords enables us to perform accurate searches within our product catalog. For instance, consider the query: “I would like to buy blue running shoes from Nike, and I have a budget of $200.” From this, we need to identify and organize key information into a structured format. The relevant data could be extracted into a JSON structure like this:
{
    "name": "Name of the product",
    "brand": "Brand of the product",
    "size": "Size/dimensions of the product",
    "color": "Color of the product",
    "additional_keywords": "Other relevant keywords",
    "price_range": {
        "min": "Minimum price",
        "max": "Maximum price",
        "target": "Target or approximate price"
    }
}
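For the query above, a plausible extraction (illustrative values, not a canonical label) would be:

{
    "name": "running shoes",
    "brand": "Nike",
    "size": "",
    "color": "blue",
    "additional_keywords": "",
    "price_range": {"min": null, "max": 200, "target": null}
}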
By structuring user inputs this way, we can ensure that the most relevant products are found, offering a more efficient and personalized shopping experience.
Note that this type of structured search is often combined with vector search (in what is often referred to as “hybrid search”), but this topic is outside the scope of this article.
Out-of-the-box small models are not great at this task
How good are small open-source models at our target task? Before proceeding with fine-tuning, we need to determine whether it is truly necessary. To find out, we created a test dataset on which to evaluate the models' outputs, as well as an evaluation benchmark to get a quantitative idea of their performance without fine-tuning.
For the dataset, we generated the user queries along with the corresponding keyword extractions in JSON format. We used a frontier model to create these “gold standard” labels.
For the evaluation benchmark, we opted to use an LLM as a judge, an approach that we describe in detail in another post, inspired by this paper. This approach lets us efficiently evaluate model performance without needing human involvement, which, let's face it, would be super tedious. Comparing two JSON outputs manually takes time, and people can easily miss details or make mistakes. Instead, our LLM judge, GPT-4o, compares each output against the gold standard and assigns a grade out of 10. This method is highly scalable, enabling us to easily swap out GPT-4o for a different model or change the dataset as needed.
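To make this concrete, here is a minimal sketch of what such a judging call could look like with the openai Python client; the judge prompt wording and the grade parsing are illustrative assumptions, not our production setup.

import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a keyword extraction against a gold standard.
Gold standard JSON:
{gold}
Model output JSON:
{output}
Compare them field by field and reply with a single grade from 0 to 10."""

def judge(gold: dict, output: dict) -> float:
    # Ask GPT-4o to compare the model output against the gold label.
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            gold=json.dumps(gold), output=json.dumps(output))}],
    )
    # Assumes the judge replies with just the number.
    return float(response.choices[0].message.content.strip())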
With this evaluation benchmark in place, we tested a few SLMs with under 10B parameters. The highest grade we got was 6.7 with Llama 3 (8B), which shows there's definitely room for improvement and confirms the need for fine-tuning to get those perfect results!
We also evaluated the percentage of outputs respecting the JSON schema on the test set. The results varied significantly, ranging from 15% for Mistral 7B to 92% for Phi 3. However, our aim is for 100% of outputs to respect the schema. This highlights the need for fine-tuning, not only to enhance content quality but also to ensure perfect adherence to the desired format.
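For reference, measuring schema adherence can be as simple as trying to parse and validate each raw output. A minimal sketch using the jsonschema library, with an abbreviated, illustrative version of the expected schema:

import json
from jsonschema import validate, ValidationError

# Abbreviated, illustrative version of the expected schema.
SCHEMA = {
    "type": "object",
    "required": ["name", "brand", "size", "color",
                 "additional_keywords", "price_range"],
}

def schema_adherence(raw_outputs: list[str]) -> float:
    # Fraction of outputs that both parse as JSON and match the schema.
    valid = 0
    for raw in raw_outputs:
        try:
            validate(json.loads(raw), SCHEMA)
            valid += 1
        except (json.JSONDecodeError, ValidationError):
            pass
    return valid / len(raw_outputs)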
Refining the fine-tuning dataset
Having established the need for fine-tuning, our next step is to prepare and format our dataset for this process. For this, we need to ensure it follows a specific structure and formatting to guide the model. Here's a simple example (the user query happens to be in French: "Italian cookbook, max 40 euros"):
### Question: I need you to extract keywords from the given user query and
provide the output in a formatted JSON with the extracted information.
If the query is not related to shopping, all fields must be empty.
Here is the user query: "livre de cuisine italienne max 40 euros"
### Answer: {
    "name": "livre de cuisine",
    "brand": "",
    "size": "",
    "color": "",
    "price_range": {"min": null, "max": 40, "target": null},
    "additional_keywords": "italienne"
}
<|endoftext|>
We use "### Question" and "### Answer" to mark the beginning of the prompt and the beginning of the answer, respectively. These markers are crucial in helping the model understand the task, and they let us prevent overfitting by blocking backpropagation on the prompt tokens. Additionally, it's important to include an "end of text" token to signal when the model should stop generating text.
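If you train with the Hugging Face stack, one common way to implement this masking is trl's DataCollatorForCompletionOnlyLM, which labels every token before the answer marker with -100 so the loss ignores it. A minimal sketch (the tokenizer choice is illustrative, and API details vary across trl versions):

from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Tokens before "### Answer:" get label -100, so backpropagation
# only runs on the answer, never on the prompt.
collator = DataCollatorForCompletionOnlyLM(
    response_template="### Answer:", tokenizer=tokenizer
)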
When fine-tuning a model, each sample in the dataset contains both the input (the prompt) and the expected output. The prompt is the question in our example above, while the output is the answer. For models without fine-tuning, you generally need to provide detailed instructions in the prompt, including the expected JSON schema. When fine-tuning, however, an effective trick is to use a simplified prompt in your dataset: instead of spelling out every detail of the expected output in the prompt, you give a simple instruction like in our example and let the model learn the nuances of the output format during fine-tuning. This approach reduces token consumption during inference, since the model has already internalized the task and format.
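Concretely, each (query, label) pair can be rendered into a training string like the example above. A minimal sketch, with the instruction wording borrowed from that example (the helper name is ours):

import json

INSTRUCTION = (
    "I need you to extract keywords from the given user query and provide "
    "the output in a formatted JSON with the extracted information. "
    "If the query is not related to shopping, all fields must be empty."
)

def format_sample(query: str, label: dict, eos_token: str) -> str:
    # Render one fine-tuning sample with the markers described above.
    return (
        f"### Question: {INSTRUCTION}\n"
        f'Here is the user query: "{query}"\n'
        f"### Answer: {json.dumps(label, ensure_ascii=False)}\n"
        f"{eos_token}"
    )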
Regarding dataset size, there's no one-size-fits-all answer: research suggests that even a small number of examples can be effective. For example, OpenAI's fine-tuning documentation indicates that improvements can start with just 50 to 100 examples. In our case, we used 1,000 examples to ensure robust results.
The fine-tuning process
Now, onto the exciting part: fine-tuning. This is where the magic really happens 🪄.
After exploring various fine-tuning techniques, particularly those focused on efficiency, we settled on a LoRA (Low-Rank Adaptation) approach. LoRA simplifies fine-tuning by focusing only on a small set of task-specific parameters instead of updating the entire model. The key lies in matrix decomposition, where weight updates are broken down into low-rank matrices. This makes fine-tuning faster, lighter on resources, and still highly effective. To further enhance efficiency, we incorporated 4-bit quantization, inspired by the QLoRA paper. This technique allows us to fine-tune larger models on more modest hardware without sacrificing performance.
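To get a feel for the savings, take a single 4096 × 4096 weight matrix (a typical hidden size for a 7B model) and a LoRA rank of 16; the numbers below are purely illustrative:

d, k, r = 4096, 4096, 16            # weight matrix shape and LoRA rank
full_update = d * k                 # parameters in a full weight update
lora_update = d * r + r * k         # parameters in B (d x r) and A (r x k)
print(full_update, lora_update)     # 16777216 vs 131072
print(lora_update / full_update)    # ~0.0078, i.e. under 1% of the full update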
Together, these choices let us get the most out of smaller models while keeping both training time and computational cost low.
For our fine-tuning process, we chose the unsloth library. Unsloth offers a straightforward interface that simplifies the fine-tuning process, allowing us to focus on optimizing our model. Its flexibility supports various fine-tuning strategies, aligning perfectly with our need for scalability and performance. Moreover, its efficiency helps us save both time and computational resources.
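To give an idea of what this looks like in practice, here is a condensed sketch of an unsloth QLoRA setup; the model name, rank, and training arguments are illustrative rather than our exact configuration, and the exact API varies across unsloth/trl versions.

from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Load a 4-bit quantized base model (QLoRA-style).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach low-rank adapters: only these parameters are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                   # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Toy dataset; in practice, this holds the 1,000 formatted samples.
dataset = Dataset.from_dict(
    {"text": ['### Question: ...\n### Answer: {"name": ""}\n<|endoftext|>']}
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        num_train_epochs=3,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()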
One key takeaway is that fine-tuning can be surprisingly quick. With a well-prepared dataset, just a few epochs often lead to significant improvements. However, overdoing it can lead to overfitting, which is the last thing we want. So, we aim for that sweet spot between performance and generalization!
Finally, we paid special attention to hyperparameter tuning. We focused on tuning two key parameters: the number of epochs and the LoRA rank. For this, we built a small test set to evaluate different combinations of these parameters, selecting the configuration that yielded the highest performance on this test set.
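In sketch form, the sweep looks something like this; finetune and evaluate are hypothetical stand-ins for the training run and the LLM-judge benchmark described above.

from itertools import product

best_score, best_config = float("-inf"), None
for epochs, rank in product([1, 2, 3], [8, 16, 32]):     # illustrative grid
    model = finetune(num_epochs=epochs, lora_rank=rank)  # hypothetical helper
    score = evaluate(model, test_set)                    # hypothetical helper
    if score > best_score:
        best_score, best_config = score, (epochs, rank)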
Results
The results from our fine-tuning efforts are truly exciting! We tested a range of models, and while fine-tuning clearly enhanced their performance, two stood out: Mistral 7B and Llama 3 8B.
Mistral 7B emerged as the star, scoring an impressive 9.31 out of 10, with 100% of its outputs respecting the JSON schema. These findings highlight the power of fine-tuning smaller models. By focusing on specific tasks and leveraging techniques like LoRA and quantization, we've delivered impressive performance with a model that's a fraction of the size of some industry giants. This approach not only provides strong results but also offers significant benefits in terms of speed, cost, and scalability.
If you need to perform a specific language task and want a scalable, flexible solution, fine-tuning might be your answer! And if you've made it this far and are interested in hearing more about our challenges and research, don't hesitate to reach out: we're hiring! 👋