LLMs Judging LLMs: A New Evaluation Paradigm
In recent years, general benchmarks have become crucial for measuring the performance of large language models (LLMs). Widely used benchmarks like MMLU, IFEval, and GPQA provide a broad evaluation of how models perform across a variety of tasks. While these benchmarks are great for assessing overall capability, they tend to fall short when it comes to more specialized or custom tasks.
When faced with complex or ambiguous challenges, creating an automatic evaluation metric becomes tricky. The most reliable way to measure performance is often through human evaluation, where people rate the quality of a model’s outputs. However, as datasets grow, relying solely on human judgment becomes costly and difficult to scale.
A better solution is to use another LLM as a judge, automatically evaluating the performance of the first model. This approach not only streamlines the evaluation process but also scales far beyond what human evaluation allows, accelerates model comparisons, and delivers more nuanced results. Let’s dive into how we implemented this method and how it reshaped our evaluation framework.
Our custom task to evaluate: cleaning product names
At Joko, we are building a universal product catalog from all e-commerce pages that exist on the web. This catalog is designed to create a next-generation product discovery experience for our users, allowing them to find any product from any website. Building this catalog comes with many challenges, the main one being the standardization of data coming from many different sources. A common issue lies in product titles, which are often excessively long and filled with redundant information. If you’ve ever browsed product listings on Amazon, you know how messy these titles can get.
We clean product names for two main reasons: to enhance user experience by displaying concise names without redundant details, and to improve catalog search functionality, boosting both efficiency and user satisfaction.
Given the complexity of this task, which goes beyond what simple heuristics or traditional machine learning models can handle, we turned to LLMs for an effective solution. With millions of product names to clean daily, scalability is crucial. Consequently, we opted to fine-tune smaller LLMs that are both efficient and capable of handling this task. However, before selecting the optimal model for our needs, we needed a robust and reliable evaluation process. This is where the idea of using an LLM as a judge emerged.
Two methods to evaluate the output with LLMs
Using an LLM as a judge consists of leveraging the capabilities of an LLM, generally a frontier LLM, to assess the performance of less powerful models on a specific task. The key advantage of this approach is that it minimizes the need for human intervention, allowing for scalable benchmarks and quicker iterations. Moreover, the LLM judge can provide not only scores but also detailed explanations, making the evaluations more interpretable and insightful.
To implement this approach, we were inspired by the paper Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena and developed our own evaluation system. We explored two core methods:

- Pairwise comparison: the LLM judge is presented with a question and two answers, and is tasked with determining which one is better or declaring a tie.
- Reference-guided grading: the LLM judge is asked to assign a score to a single answer against a reference solution.

Our goal is to determine the best evaluation benchmark, which involves three factors:

- The method: whether to use pairwise comparison or reference-guided grading.
- The LLM judge: identifying which frontier model is most suitable.
- The prompt: the instructions given to the judge model.
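To make this search space concrete, here is a minimal sketch of how the three factors can be enumerated as candidate setups; the model names and prompt identifiers below are illustrative, not our exact configuration.

```python
from dataclasses import dataclass
from itertools import product

# The three factors of a judge setup described above. All values are illustrative.
@dataclass(frozen=True)
class JudgeSetup:
    method: str       # "pairwise" or "reference_guided"
    judge_model: str  # e.g. "gpt-4o" or "gpt-3.5-turbo"
    prompt_id: str    # identifier of the prompt variant

METHODS = ["pairwise", "reference_guided"]
JUDGE_MODELS = ["gpt-4o", "gpt-3.5-turbo", "llama-3-8b", "phi-3-3.8b"]
PROMPT_IDS = [f"prompt_{i}" for i in range(1, 7)]

# Every combination we may want to score by its agreement rate with human judges.
candidate_setups = [JudgeSetup(m, j, p) for m, j, p in product(METHODS, JUDGE_MODELS, PROMPT_IDS)]
```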
Selecting the right prompt is crucial to our process. Think of the prompt as the instruction manual for the LLM, guiding its responses. The quality of the prompt significantly affects how well the model understands and performs the task. This art of crafting prompts, known as prompt engineering, is all about optimizing the instructions to obtain the best results, in this case the most accurate and reliable evaluations.
To evaluate the best setup, we rely on a metric called agreement rate, which measures how often two judges reach the same decision. We aim for the setup whose decisions agree most often with those of human judges.
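In code, the metric is as simple as it sounds; here is a minimal sketch, assuming each judge’s verdicts are collected in a list of labels:

```python
def agreement_rate(judge_a: list[str], judge_b: list[str]) -> float:
    """Fraction of items on which two judges give the same verdict."""
    assert len(judge_a) == len(judge_b), "both judges must rate the same items"
    return sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)

# Example: the judges agree on 4 of the 5 verdicts, so the agreement rate is 0.8.
print(agreement_rate(["A", "B", "Tie", "A", "B"], ["A", "B", "Tie", "A", "A"]))
```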
In the following experiments, we test GPT-4o, GPT-3.5-Turbo, and smaller models like Llama 3 (8B) and Phi 3 (3.8B) as LLM judges to evaluate their effectiveness.
Pairwise comparison suffers from position bias
At first glance, pairwise comparison seems like an effective evaluation method, as it allows for a direct head-to-head comparison between two model outputs on the same task. However, one known limitation of pairwise comparison is position bias, where the LLM judge might favor the output in one position (either first or second) over the other, regardless of the content. This bias can significantly skew results, making the evaluation less reliable.
To test for position bias, we can conduct an experiment in which we present two answers to the LLM judge, and then swap the answers and ask the judge to evaluate them again. If the judge is making decisions purely based on content, the results should reverse when the positions are swapped. However, what we found was that position bias was quite prevalent across all the LLM judges we tested.
For instance, as shown in the table below, GPT-3.5-Turbo consistently preferred the answer presented in the second position (Answer B), regardless of which model’s output was placed there:
| Answer Choice | Setup 1 (Original) | Setup 2 (Swapped) |
|---|---|---|
| Answer A (first position) | 156 times | 181 times |
| Answer B (second position) | 230 times | 219 times |
| Answer C (tie) | 114 times | 100 times |
This position bias was not unique to GPT-3.5-Turbo; it was observed across all tested judge models. As this would significantly compromise the reliability of our evaluation process, we concluded that the pairwise comparison method wasn’t sufficiently robust for our needs.
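For reference, here is a minimal sketch of how such a swap test can be run with the OpenAI Python client; the `judge_pairwise` helper and its prompt are simplified illustrations, not our production code.

```python
from collections import Counter
from openai import OpenAI  # assumes the openai package is installed and an API key is configured

client = OpenAI()

def judge_pairwise(original: str, answer_a: str, answer_b: str,
                   judge_model: str = "gpt-3.5-turbo") -> str:
    """Ask the judge model to pick 'A', 'B', or 'Tie' for one product name (simplified prompt)."""
    prompt = (
        "You are judging a product-name-cleaning task.\n"
        f"Original product name: {original}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        'Which answer is the better cleaned name? Reply with exactly "A", "B", or "Tie".'
    )
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def swap_test(samples: list[dict]) -> tuple[Counter, Counter]:
    """Judge every pair twice, once in each order, and tally the verdicts for both setups."""
    original_order, swapped_order = Counter(), Counter()
    for s in samples:
        original_order[judge_pairwise(s["name"], s["model_1"], s["model_2"])] += 1
        swapped_order[judge_pairwise(s["name"], s["model_2"], s["model_1"])] += 1
    return original_order, swapped_order
```

If the judge decided purely on content, the A and B tallies would roughly swap between the two setups; a judge that keeps favoring B in both setups, as in the table above, is exhibiting position bias.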
Reference-guided grading: a more reliable approach
With reference-guided grading, we first generate a set of reference answers with a frontier model, also known as gold-standard labels, which represent the ideal cleaned product names. The LLM judge then evaluates the output of different models by comparing them to these reference answers and assigning a score on a scale from 1 to 10, 1 being the worst and 10 the best. This approach allows us to avoid providing detailed cleaning rules to the judge model. Instead, we leverage the reference as an evaluation scale.
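As an illustration, a reference-guided grading call can look like the sketch below; the prompt wording and the score parsing are simplified, not the exact ones we use.

```python
import re
from openai import OpenAI  # assumes the openai package is installed and an API key is configured

client = OpenAI()

def grade_against_reference(original: str, reference: str, candidate: str,
                            judge_model: str = "gpt-4o") -> int:
    """Ask the judge model to grade a candidate cleaned name from 1 to 10 against the reference."""
    prompt = (
        "You are judging a product-name-cleaning task.\n"
        f"Original product name: {original}\n"
        f"Reference cleaned name: {reference}\n"
        f"Candidate cleaned name: {candidate}\n"
        "Grade the candidate from 1 (worst) to 10 (best) by comparing it to the reference. "
        "Reply with the grade only."
    )
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    match = re.search(r"\d+", response.choices[0].message.content)
    return int(match.group()) if match else 0  # treat an unparseable reply as a failure (outside the 1-10 scale)
```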
Qualitative tests on this method show that it does not present any obvious flaws, unlike pairwise comparison. To obtain a quantitative evaluation of the quality of this method, we measure the agreement rate between the LLM judge and human evaluators.
We tested different combinations of LLM judge and prompt to determine which one led to the highest agreement rate. To evaluate these different combinations, we selected a subset of 100 product names and generated cleaned versions of these names using two models: Llama 3 and Phi 3. We then had these versions rated by LLM judges and also rated these results ourselves, thus playing the role of human judge. To compare the models, we converted the grades into pairwise comparisons. For instance, if Model A received a higher grade than Model B, we considered A to be preferred. The agreement rate simply measures the frequency at which the LLM’s decision matches ours.
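In code, this conversion from grades to pairwise decisions and the resulting agreement rate look like the sketch below; the grade lists are illustrative placeholders for the scores produced by the LLM judge and by us.

```python
def to_preference(grade_a: int, grade_b: int) -> str:
    """Convert two grades into a pairwise decision: 'A', 'B', or 'Tie'."""
    if grade_a > grade_b:
        return "A"
    if grade_b > grade_a:
        return "B"
    return "Tie"

def agreement_with_humans(llm_grades_a, llm_grades_b, human_grades_a, human_grades_b) -> float:
    """Share of product names where the LLM judge's preference matches the human preference."""
    llm_prefs = [to_preference(a, b) for a, b in zip(llm_grades_a, llm_grades_b)]
    human_prefs = [to_preference(a, b) for a, b in zip(human_grades_a, human_grades_b)]
    return sum(l == h for l, h in zip(llm_prefs, human_prefs)) / len(llm_prefs)

# Illustrative grades for three product names, cleaned by two models (A and B):
print(agreement_with_humans([8, 6, 9], [7, 7, 9], [9, 5, 8], [7, 7, 8]))  # -> 1.0
```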
We experimented with six different prompts, focusing on key variables such as including explanations alongside grades, varying the detail in grading instructions, and providing examples for clarity. For instance, Prompt 1 asked for an explanation with the grade and provided three grading instructions, while Prompt 6 included examples to guide the LLM judge. We don’t detail the prompts here to avoid making the demonstration too cumbersome, but if you would like more details on the exact prompts used, please don’t hesitate to contact us.
We show in the following table the agreement rates obtained for all the different combinations:
| Prompt | GPT-4o | GPT-3.5-Turbo | Llama 3 (8B) | Phi 3 (3.8B) |
|---|---|---|---|---|
| Prompt 1 | 86% | 60% | 62% | 60% |
| Prompt 2 | 76% | 62% | 56% | 52% |
| Prompt 3 | 72% | 57% | 42% | 50% |
| Prompt 4 | 68% | 52% | 57% | 32% |
| Prompt 5 | 66% | 50% | 26% | 30% |
| Prompt 6 | 78% | 70% | 52% | 60% |
As can be seen, the best-performing combination is Prompt 1 with GPT-4o, with an agreement rate of 86%. Since human agreement on complex tasks is typically around 80%, this is a great result that makes us confident the system is a reliable judge of our product-name-cleaning models’ performance. Note that it is not necessarily desirable to aim for a 100% agreement rate, since this would likely mean overfitting to our own personal biases.
This LLM judge now keeps an eye on our systems
By incorporating this method into our processes, we’ve automated the evaluation of our product-name-cleaning systems. This has allowed us to compare the performance of different models, monitor their scores while fine-tuning them on datasets built for this specific task, and identify the best model for our production pipelines. If you’re interested in this approach, see here for a deep-dive into another research direction we have launched to analyze user product queries.
This automated evaluation benchmark is now essential to ensure quality at scale in our product catalog and to improve the search experience within this catalog, enabling us to offer an ever-better user experience in the Joko app.
We’re at the forefront of applying AI technologies to solve real-world challenges, and if you’re interested in being part of this journey, we’re hiring! Reach out to us to learn more 👋