Mastering the Art of Search Systems: A Comprehensive Guide
A comprehensive guide from inception to production
After almost three years of working on projects around search and recommendation systems, I find it essential to recapitulate my thoughts on how to start building a search system without dying of analysis paralysis from all the trade-offs that exist.
In this blog post, I want to provide a comprehensive guide on how to start building a search system successfully. Furthermore, I’ll talk about the optimization strategies required to take these systems to production. Throughout the post I will use the structured approach of “Do it, do it better, and do it efficiently and cost-effectively” to explain the different phases.
While my primary focus will be on the pragmatic considerations of the backend, it’s crucial to acknowledge the significance of the frontend in delivering a good user experience and ensuring user satisfaction.
To illustrate the principles and techniques discussed, I will focus on two use cases: building a text search system for medical documents and building a text-to-image search system for social network posts. If neither use case is what you want to build, don’t worry: most of the points discussed apply to all search systems.
Do it 💻
In this phase the objective is to start with small datasets, validate the hypothesis, and obtain feedback as soon as possible, rather than building a system that works like a Swiss watch. Here it is important to work closely with the product team to decide on the unit of indexing, which means:
- For text search, are we going to index at the sentence level, paragraph level, or document level? (See the chunking sketch after this list.)
- For text-to-image search it is also important to determine the format of the content; some interesting questions are:
- Do we have only images or do we need to include videos too?
- If we have videos, how are we going to index them? Are we going to consider each frame as a unit? Combinations of frames?
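To make the “unit of indexing” concrete for the medical-document case, here is a minimal sketch of paragraph-level units; the field names and the blank-line split are illustrative assumptions, not a fixed schema:

```python
def paragraph_units(doc_id: str, text: str):
    """Split a document into paragraph-level units of indexing."""
    paragraphs = (p.strip() for p in text.split("\n\n"))
    for i, paragraph in enumerate(p for p in paragraphs if p):
        # each unit gets its own id but keeps a pointer to the parent document
        yield {"id": f"{doc_id}-p{i}", "doc_id": doc_id, "text": paragraph}
```

Switching to sentence or document level is just a matter of changing how the text is split, which is why it is worth settling this decision with the product team early.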
Once you have decided on the unit of indexing, we need to ask the question:
Do we need embeddings, or is a classical system based on BM25 enough at this phase?
This question is quite important because there are some trade-offs to consider.
- On the one hand, embeddings are a very powerful tool for representing unstructured information; they can handle typos and capture the semantics and syntax of the queries and documents.
- On the other hand, having good embeddings for search is hard: generic embeddings usually underperform a BM25 index on out-of-domain tasks (check Jo Kristian Bergum — AI-powered Semantic Search; A story of broken promises? for more details). That makes sense because, in the end, generic embeddings are trained on a corpus that probably does not contain our medical documents or social network posts.
So what does that mean for our beloved use cases?
- In the text search system we can start with BM25 and a simple exact-match ranking.
- In the text-to-image search we can use a model like CLIP, which represents both text and images in the same vector space, so we can compare them. That means our ranking expression should be based on a distance metric between vectors, like dot product or cosine similarity (see the sketch below).
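As a minimal sketch of that idea, using the sentence-transformers wrapper around a CLIP checkpoint (the image file and query text are made-up examples):

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# clip-ViT-B-32 maps both images and text into the same vector space
model = SentenceTransformer("clip-ViT-B-32")

img_emb = model.encode(Image.open("post.jpg"))        # hypothetical social-network image
txt_emb = model.encode("a dog surfing at the beach")  # user query

# the ranking expression: cosine similarity (dot product also works if vectors are normalized)
print(util.cos_sim(txt_emb, img_emb))
```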
Regardless of whether you choose to start with an embedding-based search or not, the most crucial part of this phase is to build the feedback mechanism that captures everything that happens in the system. That means two things:
- You need to start working with the product team on defining what is relevant. This is crucial because you will need an annotated dataset of tuples (query, document, score) to verify, at a minimum, that you don’t suffer regressions in future iterations. Hopefully, after many iterations, you will have enough samples to start training your learning-to-rank algorithms or your embeddings.
- You need to capture everything that your users do on your platform. That will help you create online metrics on how your system performs, and it will also provide interesting data points for training future ML models (see the event-schema sketch after this list). That means capturing:
- What are the queries that the users do on the system?
- Which are the results displayed on the search?
- Which results do the users click on?
- Which strong signals, like share, save, or print, do they click on?
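To make this concrete, here is a minimal sketch of an event schema for that feedback loop; the field names are assumptions for illustration, not a standard:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SearchEvent:
    query: str                                 # what the user searched for
    shown_doc_ids: list[str]                   # results displayed for that query
    clicked_doc_ids: list[str] = field(default_factory=list)
    # strong signals, e.g. {"share": [...], "save": [...], "print": [...]}
    strong_signals: dict[str, list[str]] = field(default_factory=dict)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

Logged over time, these events feed your online metrics and become the raw material for the annotated datasets mentioned above.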
Do it better 💪
In this phase the objective is to improve the system, which means focusing on filtering and ranking the correct documents. By the time you start working on this phase, you will have a decent amount of annotated documents that will help you avoid regressions and hopefully improve your ranking algorithms.
To start thinking about improving your embeddings you can check:
- Domain adaptation on SBERT models
- Jo Kristian Bergum — Boosting Ranking Performance with Minimal Supervision
There you can find techniques that require minimal to no annotations and help adapt embedding models to your specific domain. The main approaches described are based on (1) building datasets of tuples (query, relevant_document, irrelevant_document) with the help of LLMs and (2) adapting the language models to the target domain with MLM.
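As a rough sketch of approach (1), assuming you already have synthetic (query, relevant_document) pairs generated by an LLM for your own corpus, you could fine-tune a bi-encoder with in-batch negatives using sentence-transformers:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# placeholder synthetic pairs; in practice these come from an LLM run over your corpus
pairs = [
    ("symptoms of type 2 diabetes", "Patients often present with polyuria, polydipsia..."),
    ("treatment for mild hypertension", "First-line options include lifestyle changes..."),
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
train_examples = [InputExample(texts=[query, passage]) for query, passage in pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)

# in-batch negatives: the other passages in each batch act as irrelevant documents
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```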
Also, thanks to working with the product team on defining what is relevant, you can start thinking about learning-to-rank (LTR) algorithms. Maybe you can start by adding, among others, weights like exactness (BM25 or native rank) and recency, or consider hybrid approaches combining BM25 and embedding similarity, as in the sketch below.
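A minimal sketch of such a hybrid ranking expression, mixing min-max-normalized BM25 with cosine similarity; the weight alpha is something you would tune against your golden set:

```python
def hybrid_scores(bm25_scores: list[float], cosine_scores: list[float], alpha: float = 0.7) -> list[float]:
    """Blend BM25 and embedding similarity for the same list of candidates."""
    lo, hi = min(bm25_scores), max(bm25_scores)
    norm_bm25 = [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in bm25_scores]
    return [alpha * b + (1 - alpha) * c for b, c in zip(norm_bm25, cosine_scores)]
```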
Furthermore, you can start thinking about stronger signals like the number of likes on a social post, the quality of a medical document, or the safety of a post, among others. You can create models that provide scores for these signals, and that will help you improve your ranking.
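Once those signal scores exist per (query, document) pair, a learning-to-rank model can learn how to combine them. Here is a sketch using LightGBM’s LambdaRank objective; the feature layout and the tiny random data are purely illustrative:

```python
import numpy as np
import lightgbm as lgb

# one row per (query, document) pair; columns could be features such as
# [bm25, cosine_similarity, recency, num_likes, quality_score]
X = np.random.rand(8, 5)
y = np.array([2, 1, 0, 0, 1, 0, 2, 0])   # relevance labels from the golden set
groups = [4, 4]                          # two queries with four candidates each

ranker = lgb.LGBMRanker(objective="lambdarank", metric="ndcg",
                        n_estimators=50, min_child_samples=1)
ranker.fit(X, y, group=groups)

scores = ranker.predict(X[:4])           # higher score = rank the document earlier
```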
Do it efficiently and cost-effectively ⌚️
In this phase the objective is to continue improving your ranking/representation while focusing on making the system work like a Swiss watch and keeping it cost-effective. One of the biggest problems with a search system is its cost at scale. For example, imagine that your application needs to handle 6K queries per second over only 5M vectors, each with 768 dimensions represented as float32; that system will cost around $7,500 per month [https://github.com/vespa-cloud/vector-search#vespa-cloud---vector-search-price-examples]. Imagine scaling this system to 100M or 1B documents… The cost can become a problem quite fast 🥴.
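A quick back-of-envelope calculation makes the scale concrete; the memory figure is exact arithmetic, while the price comes from the Vespa Cloud example linked above:

```python
vectors = 5_000_000
dims = 768
bytes_per_value = 4  # float32

raw_gb = vectors * dims * bytes_per_value / 1e9
print(f"{raw_gb:.1f} GB of raw vector data per copy")  # ≈ 15.4 GB

# serving 6K QPS typically needs several replicas of that memory, plus the
# overhead of the ANN index itself, which is how the monthly bill adds up
```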
To mitigate some of these problems, there are paths we can follow at the cost of losing some precision in our system:
- Reduce the number of dimensions per vector. Instead of representing the document with 768 dimensions, why not use 384 dimensions or even fewer? To accomplish this we have techniques like distillation or dimensionality reduction.
- Another option is to use a type that requires fewer bytes. In the example above we use float32, which requires 4 bytes, while a format like bfloat16 only needs 2 bytes (more info on how to use this in Vespa here). A sketch of both ideas follows this list.
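Here is a minimal sketch of both ideas with numpy and scikit-learn; numpy has no bfloat16, so float16 stands in to show the 2-byte footprint, and PCA stands in for whatever dimensionality-reduction or distillation technique you choose:

```python
import numpy as np
from sklearn.decomposition import PCA

emb = np.random.rand(10_000, 768).astype(np.float32)
print(emb.nbytes / 1e6)                        # ≈ 30.7 MB for float32 x 768 dims

half_precision = emb.astype(np.float16)        # 2 bytes per value -> half the memory
print(half_precision.nbytes / 1e6)             # ≈ 15.4 MB

reduced = PCA(n_components=384).fit_transform(emb).astype(np.float32)
print(reduced.nbytes / 1e6)                    # ≈ 15.4 MB, at the cost of some precision
```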
Also, depending on the scale of your problem, you may need to introduce some kind of Approximate Nearest Neighbor (ANN) search. This approach builds an index during the indexation process and helps you search faster across millions of documents, but it comes with trade-offs: it requires more memory, and the results are approximate, so they may be incorrect for some queries. Some more info here.
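As a minimal sketch of an ANN index, here is hnswlib building an HNSW graph; the parameters are illustrative starting points, not tuned recommendations:

```python
import numpy as np
import hnswlib

dim, n = 384, 100_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)  # build-time memory/recall knobs
index.add_items(data, np.arange(n))

index.set_ef(50)                                     # query-time recall/latency trade-off
labels, distances = index.knn_query(data[:3], k=10)  # approximate top-10 neighbours
```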
At billion scale you may need an architecture more complex than a plain in-memory ANN index; check the SPANN-inspired approach that combines an in-memory HNSW index with disk-based vectors.
Finally, you may need multiple phases of ranking. That way you can speed up ranking by applying heavy rankers only to the top-K results from the previous phase. The value of K is a hyperparameter you should explore; it is a compromise between speed and quality.
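A sketch of that idea with a cheap first phase and a cross-encoder as the heavy second phase; the toy token-overlap retriever stands in for your real BM25/ANN retrieval:

```python
from sentence_transformers import CrossEncoder

heavy_ranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def cheap_retrieve(query: str, corpus: dict[str, str], top_k: int) -> list[str]:
    """Phase 1 stand-in: token overlap instead of BM25/ANN over the whole corpus."""
    q = set(query.lower().split())
    overlap = {doc_id: len(q & set(text.lower().split())) for doc_id, text in corpus.items()}
    return sorted(overlap, key=overlap.get, reverse=True)[:top_k]

def search(query: str, corpus: dict[str, str], k: int = 100, n: int = 10):
    candidates = cheap_retrieve(query, corpus, top_k=k)  # phase 1: cheap, whole corpus
    # phase 2: the expensive cross-encoder only scores the K candidates
    scores = heavy_ranker.predict([(query, corpus[d]) for d in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:n]
```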
Your day-to-day in this phase should focus on measuring, measuring, and measuring. Some questions that may help in this phase:
- How well are the queries performing?
- Do we have bottlenecks in any type of query?
- Can we reduce the memory footprint?
- How is ranking impacting the search system?
- Which ranking phase is taking the most time?
- Which ranking features take the most time to compute? This question is crucial because some ranking features are more expensive to compute, so it is better to move them to later phases of ranking.
Conclusions
In this article, I have proposed a structured approach for constructing effective search systems, offering key insights into the complex balance between quality, speed, and cost inherent in this exciting endeavour.
The main recommendation arising from this article is the need to build a golden set of tuples that encapsulate queries, corresponding documents, and assigned scores. This set acts as a safeguard against regressions and becomes crucial for future enhancements to the system.
Once that recommendation is satisfied, the many techniques discussed can be applied to improve the representation and ranking of the system. Finally, once the quality of the system is good enough, it is time to wear the optimization hat and reduce cost while maintaining quality.
Also, keep in mind that these systems can suffer from many of the problems that ML systems have, so monitoring, drift detection, and continuous retraining will be extremely important to keep the models working correctly.
If you enjoyed this article you may like Building an Image Content Search System Using Vespa, where I discuss and implement a simple image-to-image search system using Vespa as the vector database. As you may have noticed, while in this post I don’t focus on which database to use, I added some links on how Vespa can solve these problems. It is the vector database I am used to, and it offers enough flexibility and customization to solve many of the problems that arise with this kind of system.
I hope this post inspires you to embark on the journey of building search systems. Let me know about any points of improvement, and happy building!
About the author ✍🏻
Marcos Esteve is an ML Engineer. At work, Marcos focuses mainly on Natural Language Processing tasks. He is quite interested in multimodality, search tasks, Graph Neural Networks, and building data science apps. Contact him on LinkedIn or Twitter.