Web Pages and LLMs, a Match Made in Heaven

Large Language Models (LLMs) are revolutionizing our ability to understand web pages with unparalleled precision. Before delving into what ‘understanding web pages’ entails, let’s make one thing clear: this isn’t just another hype post about LLMs. What you’re about to read is the result of years of dedicated effort, guided by real-world use cases centered on user needs. Our approach has led to state-of-the-art outcomes, validated on the largest publicly available dataset of product web pages, so we’re pretty confident we’re on the money here! Let us walk you through this work and how we use this technology to radically transform the shopping experience - enjoy!

Back to basics

Let’s get back to the basics and explain what we actually mean by web page understanding and why it’s important for us. At Joko, we believe that today’s online shopping experience is fundamentally flawed. How many times have you found yourself juggling between dozens of tabs to find the right product? How many times have you felt completely lost in a jungle of information facing poorly designed interfaces? We’re convinced that this frustrating experience cannot be the future, and we devote all our energy to crafting the smoothest shopping experience available – effortless, with transparent information at your fingertips, and packed with unbeatable savings. We’re building the future of online shopping one feature at a time, and a lot of these features have deep underlying technical challenges that necessitate an understanding of web pages.

Take, say, the ability to track the price of a product across a number of merchants. To do this, you need to be able to fetch price information across merchants in close to real time, which is no easy task, as it entails automatically parsing product pages in a way that works on any product web page on the internet. I’ll spare you the list of all the features built on top of solving similar deeply technical problems; trust me, it’s long. There is, however, a lot of commonality between all of these features and technical challenges. They often boil down to one of two things: (i) extracting information from a web page, i.e. element extraction, and (ii) identifying an element on the web page so we can interact with it, i.e. element nomination. This is what we mean by the somewhat blanket term web page understanding.

Being the X (Twitter) addicts that we are, it was a no-brainer to turn to LLMs for solutions! More seriously, we’ve been eyeing the use of language models to tackle these challenges for some time now. However, it’s the recent major breakthroughs in this domain that have truly unlocked their potential, making them remarkably effective at solving our specific problems.

[Image: Extracting products from web pages automatically (AI-generated image)]

LLMs to the rescue

At this point, you might be wondering why we would even want to use LLMs for web page understanding in the first place. To understand why, it helps to know what a web page is and which other types of approaches exist. Fortunately, we’ve covered these points in a previous blog post - go check it out! Even if you haven’t perused that article, you are likely familiar with HTML, the foundational language for structuring and presenting content on a web page.

Since HTML (HyperText Markup Language) is a Language, it can potentially be parsed using Language Models! This idea, along with a few other factors, prompted us (LLM pun intended) to turn to language models. Our first observation was that web page structure is very noisy. We saw in previous experiments that simple classifiers working with basic features on HTML elements could beat fancier graph-ML algorithms on certain tasks. This taught us that incorporating the tree structure into the prediction task isn’t always beneficial. The second observation is that LLMs are great at extractive Question Answering (QA) tasks. The challenge is whether they can extract structured information from HTML, but we reasoned that this wasn’t too large a gap to bridge.

A workflow for web page element extraction with LLMs

Say I have a product web page and information I want to extract from it (e.g. the product price and name: “20.00” and “Amazing shoes” in the image below); let’s use an LLM to do this. Obviously, life is never this easy, and a number of challenges come up. The first challenge is that a web page is really quite long (hundreds of thousands of tokens on average), so you can’t feed it to a model directly. This is because, whatever you might read, current LLMs only work reliably with limited context windows, on the order of 10k tokens depending on the model. The second challenge is that there’s a lot of noise on a web page, and we don’t want this as input as it might lead to poor extraction results. The last challenge is how to process the output of the LLM. Can we make sure we get a structured output? Can we validate the fields and make sure the name is indeed a product name and the price is indeed a price? How can we solve all of these in one single workflow? Worry not, we did, and we’ll share some aspects of how. The pipeline depicted below is quite simple: we took each of the challenges listed above and tackled them one by one. Much of the “secret sauce” 🥘 of this approach is in the nitty-gritty details of the pre- and post-processing around the LLM.

[Image: Element extraction pipeline using LLMs]

You may be aware that a web page can be represented using a tree structure, through the Document Object Model (DOM). The first step of the pipeline involves pruning the DOM tree to make the corresponding HTML much shorter. The pruning itself is carried out by removing sub-trees with highly similar leaf nodes (such as long tables or lists). We thought about a few fancy metrics that could be used to quantify leaf node similarity but, as often happens, the simpler approaches gave the best results. The final pruning/cleaning is the result of many iterations to make sure we remove noise from the product page while keeping the information of interest such as name, price, description, etc.
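
To make the pruning step a little more concrete, here is a minimal sketch of the idea using BeautifulSoup. The similarity heuristic (a Jaccard index over the word sets of sibling sub-trees) and the thresholds are illustrative assumptions for this post, not the exact metric we converged on.

```python
from bs4 import BeautifulSoup, Tag


def leaf_signature(node: Tag) -> set[str]:
    """Crude signature of a sub-tree: the set of word tokens in its text."""
    return set(node.get_text(" ", strip=True).lower().split())


def looks_repetitive(node: Tag, min_children: int = 5, threshold: float = 0.6) -> bool:
    """True when a node has many children whose text looks near-identical
    (think long recommendation carousels, size tables, review feeds...)."""
    children = node.find_all(recursive=False)
    if len(children) < min_children:
        return False
    sigs = [leaf_signature(c) for c in children]
    ref = sigs[0]
    if not ref:
        return False
    similar = sum(1 for s in sigs[1:] if len(ref & s) / len(ref | s) >= threshold)
    return similar / (len(sigs) - 1) >= 0.8


def prune_dom(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Obvious noise first: scripts, styles, inline SVGs...
    for tag in soup(["script", "style", "svg", "noscript", "iframe"]):
        tag.decompose()
    # Then collapse repetitive sub-trees, keeping the first child as a sample.
    for node in list(soup.find_all(True)):
        if getattr(node, "decomposed", False):
            continue  # already removed as part of an ancestor
        if looks_repetitive(node):
            for extra in node.find_all(recursive=False)[1:]:
                extra.decompose()
    return str(soup)
```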

After cleaning, if the HTML fits in the LLM’s context window, you can simply ask it to extract the information you’re looking for in JSON format. If the HTML is still too long, we split it using an HTML text splitter. We can then feed each individual chunk to the LLM and ask for a structured JSON output with the information we want. If there are multiple chunks, we add another LLM call to synthesize the partial JSONs into a single coherent one before validating the output using the one and only pydantic.
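
Here’s a hedged sketch of that split-extract-merge-validate flow. `call_llm` is a placeholder for whichever chat or function-calling client you use, and the LangChain HTML splitter, the prompts and the `ProductInfo` schema are illustrative choices rather than our exact production setup.

```python
import json

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter
from pydantic import BaseModel, ValidationError


class ProductInfo(BaseModel):
    name: str | None = None
    price: str | None = None
    currency: str | None = None


def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to your LLM of choice and return its raw text answer."""
    raise NotImplementedError


EXTRACT_PROMPT = (
    "Extract the product name, price and currency from this HTML fragment. "
    "Answer with a single JSON object with keys name, price, currency "
    "(use null when a field is absent).\n\n{html}"
)

MERGE_PROMPT = (
    "Merge these partial JSON extractions of the same product page into one "
    "coherent JSON object with keys name, price, currency:\n\n{parts}"
)


def extract_product(pruned_html: str, max_chunk_chars: int = 12_000) -> ProductInfo:
    # Split only when the cleaned HTML is still too long for one call.
    if len(pruned_html) > max_chunk_chars:
        splitter = RecursiveCharacterTextSplitter.from_language(
            language=Language.HTML, chunk_size=max_chunk_chars, chunk_overlap=0
        )
        chunks = splitter.split_text(pruned_html)
    else:
        chunks = [pruned_html]

    partials = [call_llm(EXTRACT_PROMPT.format(html=chunk)) for chunk in chunks]
    merged = partials[0] if len(partials) == 1 else call_llm(
        MERGE_PROMPT.format(parts="\n".join(partials))
    )

    # Validation: the JSON must parse and match the expected schema.
    try:
        return ProductInfo(**json.loads(merged))
    except (json.JSONDecodeError, ValidationError) as err:
        raise ValueError(f"LLM output failed validation: {err}") from err
```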

Show me the money

Of course, each of these steps required multiple rounds of iteration to converge to something that is not only usable, but that outperforms previous approaches on the element extraction task on the Klarna Product Page dataset, our chosen benchmark. It’s important to mention that this benchmark is somewhat outdated (being 2 years old at the time of writing), yet there are few public datasets available for this task, and the Klarna Product Page dataset remains one of the most comprehensive benchmarks we could use. Moreover, our method proved superior not only on this very relevant public dataset but also on our own extensive internal datasets. An interesting point is our success with relatively “small” language models like GPT-3.5 fine-tuned for function calling. The function calling part was quite useful for us, as these models can output structured JSONs and we could leverage this for our extraction task. Below, you can get a sense of our results: our pipeline outperforms a traditional classifier-based approach as well as Klarna’s implementation of Google’s FREEDOM algorithm, which, to our knowledge, is the leading published method in this field.

[Image: Element extraction pipeline performance]

Oh, and since nothing beats a small demo, we’ve designed a web app for the purpose of this blog post to show the results of our approach on real web pages.

[Image: Web app showcasing element extraction capabilities]

This is all great, but the LLM aficionados among you may already be wondering “hey, isn’t this going to cost me truckloads of 💵?”. Unfortunately, you would be correct - cost and latency are some of the core issues when trying to use LLMs in a production setting. But, as presented below, we managed to mold our current pipeline into something more production-ready.

Production, production, production

Repeat after me (and picture Steve Ballmer while doing so): production, production, production, production, production, production …

The previous pipeline implies multiple LLM calls for every single web page. It’s great that it works well, but if we want to extract information from 10 million product pages this is going to be expensive (think around $5,000 per million web pages using GPT-3.5), whether we use a proprietary model or an open-source model. There’s also the slight problem that you need to wait for the LLM calls, which can take a few seconds for a single page. If you want to use the information straight away, that kind of latency is obviously unacceptable for a user.
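
To give a feel for where a figure like that comes from, here’s a back-of-the-envelope calculation. The token counts and prices below are illustrative assumptions (roughly in line with public GPT-3.5-turbo pricing at the time of writing), not our actual numbers.

```python
# Back-of-the-envelope cost of the "multiple LLM calls per page" pipeline.
input_price_per_1k = 0.0015    # $ per 1K input tokens (assumed)
output_price_per_1k = 0.002    # $ per 1K output tokens (assumed)
calls_per_page = 2             # extraction + synthesis calls (assumed average)
input_tokens_per_call = 1_500  # pruned HTML chunk + prompt (assumed)
output_tokens_per_call = 150   # a small JSON answer (assumed)

cost_per_page = calls_per_page * (
    input_tokens_per_call / 1000 * input_price_per_1k
    + output_tokens_per_call / 1000 * output_price_per_1k
)
print(f"${cost_per_page:.4f} per page, ~${cost_per_page * 1_000_000:,.0f} per million pages")
# -> $0.0051 per page, ~$5,100 per million pages
```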

To fix this, the next step is to call the LLM on only a few web pages and use that information well. To use the information well, we thought we should search for underlying structure in product web pages. Maybe we could exploit the fact that certain product web pages are very similar so as to limit the number of overall LLM calls. Then it became quite clear: web pages from the same merchant are usually similar. We validated this with an array of graph (tree) comparison metrics and wondered how we could best use this information.
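
To give a flavour of what such a comparison can look like, here is one very simple metric: a Jaccard index over the sets of root-to-node tag paths of two pages. This particular metric is an illustrative example, not the exact array of metrics we used.

```python
from bs4 import BeautifulSoup


def tag_paths(html: str) -> set[str]:
    """The set of root-to-node tag paths in a page's DOM tree."""
    soup = BeautifulSoup(html, "html.parser")
    paths = set()
    for node in soup.find_all(True):
        ancestors = [
            p.name for p in reversed(list(node.parents))
            if p.name not in (None, "[document]")
        ]
        paths.add("/".join(ancestors + [node.name]))
    return paths


def structural_similarity(html_a: str, html_b: str) -> float:
    """Jaccard index over the two pages' tag-path sets (1.0 = same skeleton)."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    return len(a & b) / len(a | b) if (a or b) else 1.0
```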

Two key steps were necessary. First, we needed to be able to find nodes of interest in the DOM trees of a few web pages per merchant (element nomination). This is where the LLM comes in, but as you may recall, for now the LLM is only extracting text information from HTML. To make this particular step work, we used our previous extraction pipeline but had to come up with some shenanigans and custom string similarity metrics to match LLM-extracted text to a node of interest in the DOM. Then, since trees from the same merchant are similar, we can try to generalize the node positions we’ve found to all the other web pages of that same merchant. The question becomes: what positional information generalizes to all web pages of a given merchant while being specific enough to point to a single node at test time? This is a delicate balance to find, and we converged to a final pipeline whose saucy details are kept just for us (and you, if you join us 😁).
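
To make the nomination step a bit less abstract, here is a minimal sketch of the matching idea: pick the DOM node whose text is closest to the LLM-extracted value, then record a reusable “address” for it. The difflib ratio and the tag-path/class signature below are illustrative stand-ins for our custom similarity metrics and positional encoding.

```python
from difflib import SequenceMatcher

from bs4 import BeautifulSoup, Tag


def nominate_node(html: str, extracted_value: str) -> tuple[Tag | None, float]:
    """Find the tightest DOM node whose text best matches the LLM-extracted value."""
    soup = BeautifulSoup(html, "html.parser")
    best, best_score = None, 0.0
    for node in soup.find_all(True):
        text = node.get_text(" ", strip=True)
        # Skip empty nodes and big containers: we want the tightest match.
        if not text or len(text) > 4 * len(extracted_value) + 20:
            continue
        score = SequenceMatcher(None, text.lower(), extracted_value.lower()).ratio()
        if score > best_score:
            best, best_score = node, score
    return best, best_score


def node_address(node: Tag) -> str:
    """A crude positional signature (tag path plus class hints) meant to
    transfer to other pages of the same merchant."""
    parts = []
    for ancestor in [node, *node.parents]:
        if getattr(ancestor, "name", None) in (None, "[document]"):
            continue
        classes = ".".join(ancestor.get("class", []))
        parts.append(f"{ancestor.name}[{classes}]" if classes else ancestor.name)
    return "/".join(reversed(parts))
```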

[Image: Element nomination using our previous extraction pipeline]

It works!

Finally, we put everything together after having tested the individual building blocks extensively. Results showed a slight performance lag on the element nomination side but still state-of-the-art extraction accuracy on certain elements (95% for the price and 97% for the name on the Klarna Product Page dataset). This implies that our approach sometimes points to DOM nodes that contain the correct information at test time but weren’t the ones labelled as the name or the price. However, this pipeline can generalize to any new field with zero effort AND costs very little to run. It costs very little because we only call the LLM on a few web pages per merchant; the rest is handled by our ability to leverage intra-merchant web page similarity by generalizing the node positional information we found at train time. We’re really getting the best of both worlds here, as we can rely on the zero-shot extractive capabilities of the LLM to make our pipeline as general and versatile as possible, without bearing the brunt of prohibitive costs that render an LLM-only pipeline unsuitable for production.

Congrats 👏 if you’ve made it all the way here - I hope you learned a thing or two. In any case, feel free to reach out with any comments you may have. Also, we’re hiring. If you’re interested in joining a fast-paced tech startup reinventing the shopping experience, writing coffee-fueled ☕ code and taking on big challenges that directly impact millions of users: come find us 😉.