Applying custom similarity calculation in Elasticsearch
Previously, we discussed improving Elasticsearch autocomplete functionality. For most cases, that is enough. In our case, however, things turned out to be not quite that rosy, and we’ll have a look at why.
Setup
In our system we have a single document type per index. Since we have quite a lot of types, and both the number of types and their mappings can be changed by the end user, we end up with an unpredictable number of indices. Currently, this figure stands at 17 different indices. Naturally, documents are not distributed evenly across all types, meaning that some indices contain far more records than others.
A key feature of our project is the ability to perform a search across a logical group of indices. Combined with the data skew across indices, this leads to relevance implications that will become obvious once we discuss the problem in detail.
Another thing worth mentioning is that user search queries are rather specific. The aim of our project is to dissect a large number of documents to find one or two relevant ones, so we don’t expect vague queries like “Ukraine” that might yield thousands of documents. This will be important when we move on to the solution we propose.
The standard way of measuring similarity
By default, Elasticsearch uses BM25 similarity to rank search results. There are three main factors that determine a document’s score:
1. Term frequency (TF) — the more times a search term appears in the field we are searching, the more relevant that document is.
2. Inverse document frequency (IDF) — the more documents that contain the search term in the field we are searching, the less important that term is.
3. Field length — if a document contains the search term in a very short field (i.e. one with few words), it is more likely relevant than a document that contains the search term in a very long field (i.e. one with many words).
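For reference, the textbook BM25 score of a single term combines these factors roughly as follows (k1 and b are tunable constants, defaulting to 1.2 and 0.75 in Elasticsearch; Lucene’s actual implementation differs in constant factors that do not change the ordering):

score = idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * fieldLength / avgFieldLength))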
Let’s look more closely at IDF. Inverse document frequency is the part of the formula that rewards globally rare words: the rarer a word is across the whole corpus, the more important it is considered.
Let’s have a look at an extract from the explain output to get a better understanding of how it is calculated.
{
  "value": 0.5389965,
  "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
  "details": [
    {
      "value": 3,
      "description": "n, number of documents containing term",
      "details": []
    },
    {
      "value": 5,
      "description": "N, total number of documents with field",
      "details": []
    }
  ]
},
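Plugging these numbers into the formula (the log here is the natural logarithm) reproduces the reported value:

idf = log(1 + (5 - 3 + 0.5) / (3 + 0.5)) = log(1 + 2.5 / 3.5) ≈ log(1.714) ≈ 0.539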
This means that IDF depends on the document counts of the individual index. In our case, this is dangerous. Recall that we perform a search across multiple indices. Imagine two different indices, one with 2K documents and the other with 20K documents, each containing exactly one match for a given query. Because N differs between the indices while n is the same, the two matches receive different IDF values, so one of them (the one from the larger index, as it happens) is ranked as more relevant than the other, even though both match the query equally well!
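To make the skew concrete, here is a rough back-of-the-envelope calculation for that scenario, assuming the term occurs in exactly one document (n = 1) in each index:

idf_2K = log(1 + (2000 - 1 + 0.5) / (1 + 0.5)) ≈ log(1334) ≈ 7.2
idf_20K = log(1 + (20000 - 1 + 0.5) / (1 + 0.5)) ≈ log(13334) ≈ 9.5

The same query term contributes a noticeably different score depending purely on which index the matching document happens to live in.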
For the end user, however, the number of indices and the number of items in them is an irrelevant implementation detail. What the user cares about is how well the output result matches the query string input.
Discarding IDF
Taking into account that the user has no interest in the data skew across indices and that the queries are specific enough, we decided that IDF is irrelevant in our case. How can we discard it from the scoring formula?
The answer is scripted similarity.
Let’s reimplement the scoring formula with a constant IDF.
PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "discarded_idf": {
          "type": "scripted",
          "script": {
            "source": "double tf = Math.sqrt(doc.freq); double idf = 1.0; double norm = 1 / Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
          }
        }
      }
    }
  }
}
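Defining the similarity by itself does nothing until a field references it by name in the mappings. A minimal sketch of the mappings section that would go into the same index-creation request, assuming a hypothetical text field called title:

"mappings": {
  "properties": {
    "title": {
      "type": "text",
      "similarity": "discarded_idf"
    }
  }
}

With this in place, matches on title are scored by term frequency and field length only, so documents from small and large indices are ranked on an equal footing.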
Conclusion
Elasticsearch offers a wide range of tools to make your search queries return the most relevant results possible. Once you have exhausted the dedicated data types and the rich query DSL, you can also tweak the similarity calculation itself, either by using one of the predefined algorithms or by rolling your own.