Building a Smart Browser, One Graph at a Time
At Joko, our mission is to build a smart web browser and smart browser extensions to help our users preserve their purchasing power, buy more responsibly, and save time, all at once: we are transforming the online shopper’s experience with a tool designed to adapt to any merchant website.
Imagine having transparent carbon footprint information displayed next to the price for all products on the Internet, and being able to save these products if you want to get automatically notified when their price drops. Imagine having the best possible coupon automatically applied to your cart just before payment. Imagine being able to click “buy” on a product, and have our browser automatically select the cheapest seller for this exact product, and self-navigate the payment flow for you. Our goal is to offer a universal Amazon-like experience to our users, built on top of arbitrary merchant websites, with killer features allowing them to save money and time, while shopping more responsibly.
In order to achieve this goal, we need to overcome a great number of technical challenges. The main hurdle is that our browser needs a full-fledged machine understanding of the web, in order to comprehend arbitrary shopping webpages, and interact with them successfully. In this blog post, we present the Graph Machine Learning approach chosen by our research team to power this smart browser.
What does it mean for our browser to understand a webpage?
Let us give an example of a feature that we have already embedded in our browser and extensions: giving users the ability to save products while surfing the web. This requires our browser to detect when a visited webpage is a product webpage, so that the feature is activated on pages that showcase a single specific product. Conversely, we want to avoid activating the feature on e-commerce webpages that are not specific to a single product, such as a categories menu, or on non-e-commerce webpages such as Google results pages or Wikipedia articles. As a consequence, our browser needs to understand what type of page is being browsed.
Once we detect that a webpage corresponds to a product page, we need to extract useful data from this webpage in order to save meaningful properties about the product. Hence, our smart browser needs to extract information from the webpage such as the product’s name, brand and price. This requires our browser to understand what the different elements on a page are.
Since the Internet was designed for us humans, both these tasks - recognizing the page type and identifying its elements - are very intuitive for us to perform. But they are very challenging for a machine, especially if we consider all websites on the Internet!
Faced with such technical challenges, where do we start? And what even is a webpage in the first place?
The array of pixels that you see in your browser window is the result of several operations, which start by parsing the content retrieved from the web. This includes HTML (HyperText Markup Language), the standard language defining the structure and content of the document; CSS (Cascading Style Sheets), which defines the style rules; and JS (JavaScript), which modifies the document dynamically through scripting. It also includes external assets such as image files, and more advanced mechanisms like WebAssembly and WebGL for the most modern websites.
After parsing all this, the browser transforms the content into intermediate data structures, the most important being the DOM (Document Object Model), which is a tree where each node represents an element in the webpage, such as a document section, an image or a text element. From this internal representation, the browser is finally able to render the pixels of the webpage, by displaying each element at the correct position in the page. And this is how you can read this page!
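To make this concrete, here is a minimal sketch of the parsing step, written in Python rather than a browser’s C++ internals: the standard-library html.parser walks a small HTML string and prints the nested element structure that a browser would store as the DOM tree. Real browsers of course do far more work (CSS, JavaScript execution, layout, painting).

```python
# Minimal sketch: reveal the nested element structure of a small HTML string.
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Print each element with an indentation matching its depth in the tree."""
    def __init__(self):
        super().__init__()
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        print("  " * self.depth + f"<{tag}> {dict(attrs)}")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

TreePrinter().feed(
    "<html><body><h1>Running shoes</h1>"
    "<span class='price'>59.90 €</span></body></html>"
)
```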
The DOM tree, a key representation of a webpage
Let us delve into more detail on what the DOM tree is. In graph theory and computer science, a tree is a special case of a graph, i.e. a set of edges connecting a set of nodes, where each node can be connected to several children but must be connected to exactly one parent, except for the root node, which has no parent. In the DOM tree, the root represents the document itself, from which all elements descend. In addition, each node in the DOM tree is endowed with its corresponding attributes, such as its size, position in the page, color or style properties. The DOM tree is both the browser’s internal representation of the webpage and the API exposed for JavaScript to interact with the content.
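As a toy illustration (not our production code), the snippet below represents a tiny DOM tree as a graph whose nodes carry attributes and whose edges encode the parent/child relation, here using the networkx library.

```python
# Toy DOM tree viewed as a graph: nodes carry attributes, edges are parent -> child.
import networkx as nx

dom = nx.DiGraph()
dom.add_node("html", tag="html")
dom.add_node("body", tag="body")
dom.add_node("h1", tag="h1", text="Running shoes")
dom.add_node("span", tag="span", text="59.90 €", css_class="price")
dom.add_edges_from([("html", "body"), ("body", "h1"), ("body", "span")])

# Each node has exactly one parent, except the root: the graph is a rooted tree.
assert nx.is_arborescence(dom)
print(dom.nodes["span"])  # {'tag': 'span', 'text': '59.90 €', 'css_class': 'price'}
```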
For our machine learning algorithms, making sense of the DOM tree is key in order to achieve our goal of understanding webpages. For example in the product-saving feature discussed above, we need to determine whether a given DOM resembles a typical “product page” DOM, and if it does, which of its nodes are the elements of interest such as the product’s price and name.
For some merchant websites, there are heuristic ways to do this, such as finding a button element with a description reading “Add to cart”, or a text element with a label reading “Product price”. However, this will not generalize well to an arbitrary merchant on the Internet and to more complex detection tasks. For this reason, we need to develop machine learning methods that apply to DOM trees.
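For illustration, here is a rough sketch of such a heuristic, written with BeautifulSoup; the rule it encodes is merchant-specific and language-specific, which is exactly why this kind of approach breaks down at the scale of the whole Internet.

```python
# Brittle heuristic sketch: decide "product page" by looking for an "Add to cart" button.
from bs4 import BeautifulSoup

def looks_like_product_page(html: str) -> bool:
    soup = BeautifulSoup(html, "html.parser")
    # Fails as soon as the merchant uses different wording, another language,
    # or a non-button element for the same purpose.
    add_to_cart = soup.find(
        "button", string=lambda s: s and "add to cart" in s.lower()
    )
    return add_to_cart is not None
```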
Building a machine understanding of DOM trees
Graphs are widely used to store information about complex systems, from connections between people in social networks, to road connections between locations in maps, or to molecular interactions in chemistry. The analysis of these graphs is a wide subject of research and many Graph Machine Learning methods have been developed to extract information from them. In our product-saving example, we are interested in two specific tasks which are typical research problems in Graph Machine Learning.
The first is a task of graph classification: to identify whether the current webpage corresponds to a product webpage, we need to classify the corresponding DOM tree as a whole, assigning it either to the ‘product webpage’ category or to the ‘non-product webpage’ category.
This kind of graph classification is a task that you do naturally with real-life trees (as opposed to their mathematical counterparts), each time you are walking in the countryside. If you are a plant enthusiast, you are probably able to classify each tree, and give it a label such as “oak”, “pine” or “sequoia”.
The second is a task of node classification: we need to classify individual nodes in the tree, to match them to element categories such as “price element”, “product image” etc.
In your countryside walk, the node classification task is equivalent to picking a single one of the tree’s elements, and classifying it as a “leaf”, a “bud” or a “branch”.
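To make these two tasks concrete, here is a minimal, illustrative sketch (not our production architecture) using PyTorch Geometric: a shared GNN encoder over the DOM graph, a node-level head for element classification, and a pooled graph-level head for page classification. The constants and class names are placeholders.

```python
# Illustrative two-headed GNN: node classification + graph classification on a DOM graph.
import torch
from torch_geometric.nn import GCNConv, global_mean_pool

NUM_FEATURES, HIDDEN, NUM_NODE_CLASSES, NUM_PAGE_CLASSES = 64, 32, 5, 2  # placeholders

class DomGNN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(NUM_FEATURES, HIDDEN)
        self.conv2 = GCNConv(HIDDEN, HIDDEN)
        self.node_head = torch.nn.Linear(HIDDEN, NUM_NODE_CLASSES)   # "price", "image", ...
        self.graph_head = torch.nn.Linear(HIDDEN, NUM_PAGE_CLASSES)  # product page or not

    def forward(self, x, edge_index, batch):
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index).relu()
        node_logits = self.node_head(h)                              # one prediction per DOM node
        graph_logits = self.graph_head(global_mean_pool(h, batch))   # one prediction per page
        return node_logits, graph_logits
```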
The shape of our algorithms
The models which solve these tasks on graphs usually make use of two kinds of input. The first is the structure of the graph itself, as the position of each node with respect to its neighbors gives a lot of information on its role and nature. The second is the node features, i.e. the attributes carried by each node. In the context of webpages, these node features are particularly rich: they include the HTML tag, such as <body> or <title>, the position in the page, the style properties such as the background color, the hypertext link URL, and most importantly the content of the element, i.e. the text or image displayed by the DOM node.
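As an illustration, here is one simplified way (among many) to turn a DOM node’s raw attributes into a numeric feature vector, combining its tag, geometry and basic text statistics; a real feature set would be much richer, as discussed below.

```python
# Simplified node featurization: tag identity + geometry + basic text statistics.
import numpy as np

KNOWN_TAGS = ["div", "span", "img", "a", "button", "h1"]  # illustrative subset

def node_features(tag, x, y, width, height, text=""):
    tag_one_hot = [1.0 if tag == t else 0.0 for t in KNOWN_TAGS]
    geometry = [x, y, width, height]
    text_stats = [len(text), float(any(c.isdigit() for c in text))]
    return np.array(tag_one_hot + geometry + text_stats, dtype=np.float32)

print(node_features("span", x=120, y=340, width=80, height=20, text="59.90 €"))
```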
Because node features are so important in DOM trees, a fruitful research effort lies in feature engineering, in addition to employing state-of-the-art graph algorithms such as Graph Neural Networks (GNNs). Feature engineering consists of enhancing the features with pre-processing steps, in order to improve their expressivity and increase the model’s performance. Since webpage element features are so rich, many strategies can be applied. One example is to map the text and image content into a given embedding space, relying on natural language processing (NLP) and computer vision techniques.
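For instance, a simple sketch of that text-embedding step could use an off-the-shelf sentence encoder, as below; the model name is only an example, not necessarily the one we use in production.

```python
# Embed each node's text content into a fixed-size vector space.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example model
node_texts = ["Add to cart", "59.90 €", "Running shoes - Men"]
embeddings = encoder.encode(node_texts)  # one vector per text, 384-dimensional for this model
print(embeddings.shape)                  # (3, 384)
```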
Finally, note that our use cases impose hard constraints on how these solutions are implemented. For most tasks, we need to perform real-time inference, with a maximum latency of around 500 ms, in order to enrich our users’ browsing experience in the smoothest way possible. Moreover, most of our models need to be embedded in our app and browser extensions, which requires the computation to run comfortably on the user’s smartphone or laptop.
What are the results? And what are the next steps?
All these efforts are already yielding promising results as our latest solutions improve our products and beat the competition’s published results. We’re writing up more detailed notes about our results, so stay tuned!
And this is far from the end of the story: we are working on other challenges that we will share more about soon. They range from improving our current solutions, to NLP and computer vision investigations related to our work on node features, to applying NLP strategies directly to the webpage code, and to using reinforcement learning to automate user journeys. The engineering side of embedding these models inside our users’ browsers is also a key challenge.
Do reach out if you are interested in our research problems: we are hiring!