Improving Elasticsearch-based autocomplete
Recently I investigated the autocomplete functionality of our system because there were a lot of complaints that it returned irrelevant results. The approach we had taken was pretty naive: our backend wrapped the query in wildcard symbols and executed it as a query_string query on the fields __title, title and commonInfo.RealName. The index we searched contained an entity with __title equal to "3 foxes", but the autocomplete query "3 foxes" suggested "BRN-3 / QCK 3 / 19 foxes", "AC d / 3 foxes" and "3 BRW / 1 foxes". The exact match was nowhere in sight!
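A minimal sketch of what such a naive request body might look like, assuming the backend simply surrounds the raw input with `*`. The helper name `build_naive_query` and the exact wrapping are assumptions for illustration, not our actual backend code. As far as I understand, query_string combines the resulting terms with OR by default, so the wrapped input behaves roughly like `*3 OR foxes*`, which would be consistent with the suggestions above.

```python
import json


def build_naive_query(text, fields, size=3):
    """Build a query_string request body that wraps the raw input in wildcards."""
    return {
        "size": size,
        "query": {
            "query_string": {
                # Assumption: the input is wrapped as-is, without any escaping.
                "query": f"*{text}*",
                "fields": list(fields),
            }
        },
    }


body = build_naive_query("3 foxes", ["__title", "title", "commonInfo.RealName"])
print(json.dumps(body, indent=2))
```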
So I chose "3 foxes" as my relevance baseline and turned my attention to the Elasticsearch queries designed for autocomplete.
Search as you type
As the name implies, search as you type seemed like a perfect fit for autocomplete. To start off, I changed the mapping of my __title field to search_as_you_type and ran the bool_prefix query straight from the documentation.
{
"_source": [
"__title"
],
"from": 0,
"size": 3,
"query": {
"multi_match": {
"query": "3 foxe",
"type": "bool_prefix",
"fields": [
"__title",
"__title.2gram",
"__title.3gram"
]
}
}
}
{
"took": 21,
"timed_out": false,
"_shards": {
"total": 33,
"successful": 33,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 272,
"relation": "eq"
},
"max_score": 4.5528774,
"hits": [
{
"_index": "data",
"_type": "_doc",
"_id": "cdf7f3aded8745d1827e9c92dea1e8b7",
"_score": 4.5528774,
"_source": {
"__title": "3 oxe/ 3 foxes"
}
},
{
"_index": "data",
"_type": "_doc",
"_id": "1a42873cead94f18a31d0b102b4fbdcd",
"_score": 4.285463,
"_source": {
"__title": "3 foxes"
}
},
{
"_index": "data",
"_type": "_doc",
"_id": "6f1588bbe1a440028af1de4337bf8fac",
"_score": 3.9906564,
"_source": {
"__title": "9 mfx/ 3 oxe/ 3 foxes"
}
}
]
}
}
The exact match came only second, which made me look closer at the search_as_you_type mapping and its n-gram fields. After reading for a while I learned that these n-grams are shingles, that is, sequences of adjacent words extracted from the text. Combined with the bool_prefix query, which matches the query terms in any order, they allow the words of an autocomplete query to be typed out of order. The downside is that the Elasticsearch cluster needs extra memory and storage for the n-grams, which may affect the cluster. And the fancy search_as_you_type mapping just means that the n-gram fields are created automatically.
Since typing words out of order wasn't my use case, I decided not to pursue it and to improve relevance at query time instead of at index time.
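To make the n-gram subfields more tangible, here is a rough sketch of word shingles, the kind of multi-word tokens such subfields hold. The function `shingles` is a hypothetical helper, and splitting on whitespace is a simplified stand-in for the real analyzer, so the title is written without slashes.

```python
def shingles(text, size):
    """Return every run of `size` adjacent words from `text`, in original order."""
    words = text.lower().split()  # simplified stand-in for the real analyzer
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


title = "9 mfx 3 oxe 3 foxes"
print(shingles(title, 2))  # two-word shingles
print(shingles(title, 3))  # three-word shingles
```

Every extra shingle size adds another set of tokens to the index for the same text, which is where the additional storage comes from.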
Match phrase prefix query
To boost the relevance of the exact match, I switched to the match phrase prefix query.
{
"_source": [
"__title"
],
"from": 0,
"size": 3,
"query": {
"match_phrase_prefix": {
"__title": {
"query": "3 foxe"
}
}
}
}
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 33,
"successful": 33,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 28,
"relation": "eq"
},
"max_score": 12.053555,
"hits": [
{
"_index": "data",
"_type": "_doc",
"_id": "1a42873cead94f18a31d0b102b4fbdcd",
"_score": 12.053555,
"_source": {
"__title": "3 foxes"
}
},
// omitted for brevity
]
}
}
That put the exact match on top. Since match_phrase_prefix doesn't support multiple fields, my first guess for searching the other fields as well was the plain old bool query.
{
"_source":[
"__title",
"title",
"commonInfo.RealNameShort"
],
"explain":false,
"from":0,
"size":3,
"query":{
"bool":{
"should":[
{
"match_phrase_prefix":{
"__title":{
"query":"3 foxe"
}
}
},
{
"match_phrase_prefix":{
"title":{
"query":"3 foxe"
}
}
},
{
"match_phrase_prefix":{
"commonInfo.RealNameShort":{
"query":"3 foxe"
}
}
}
]
}
}
}
{
"took": 13,
"timed_out": false,
"_shards": {
"total": 33,
"successful": 33,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 28,
"relation": "eq"
},
"max_score": 28.880083,
"hits": [
{
"_index": "data",
"_type": "_doc",
"_id": "15e4e503cc1d4284aeb34664cb61c5ae",
"_score": 28.880083,
"_source": {
"__title": "apt 3 foxes",
"commonInfo": {
"RealNameShort": "apt 3 foxes"
},
"title": "apartmetnt 3 foxes"
}
},
{
"_index": "data",
"_type": "_doc",
"_id": "83b2653a851c4ca19d3df0410ab1c41f",
"_score": 26.242756,
"_source": {
"__title": "rest/ 3 foxes",
"commonInfo": {
"RealNameShort": "rest/ 3 foxes"
},
"title": "restaraunt/ 3 foxes"
}
},
{
"_index": "data",
"_type": "_doc",
"_id": "1a42873cead94f18a31d0b102b4fbdcd",
"_score": 23.940828,
"_source": {
"__title": "3 foxes",
"title": "3 completely irrelevant to real name words",
"commonInfo": {
"RealNameShort": "3 foxes"
}
}
}
]
}
}
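The exact match dropped to third place, so I set "explain": true to understand why. Since the output is huge, I'll focus only on the important parts: first the topmost document ("in 1156" below), then the desired one ("in 1180").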
"value": 10.268458,
"description": "weight(__title:\"3 (foxe foxes)\" in 1156) [PerFieldSimilarity], result of:",
...
"value": 8.497357,
"description": "weight(title:\"3 foxes\" in 1156) [PerFieldSimilarity], result of:",
...
"value": 10.114267,
"description": "weight(commonInfo.RealNameShort:\"3 (foxes foxe)\" in 1156) [PerFieldSimilarity], result of:",
"value": 12.053555,
"description": "weight(__title:\"3 (foxe foxes)\" in 1180) [PerFieldSimilarity], result of:",
...
"value": 11.887274,
"description": "weight(commonInfo.RealNameShort:\"3 (foxes foxe)\" in 1180) [PerFieldSimilarity], result of:",
...
"value": 0.0,
"description": "match on required clause, product of:",
As we can see, the document with "3 foxes" in __title scores highest on the __title field. But since "apt 3 foxes" has a somewhat relevant match in every field of interest, its summed score outweighs the desired document. If only we could somehow order documents by their single most relevant match!
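The arithmetic can be checked with a small sketch that assumes a bool query simply sums the scores of its matching should clauses. The per-field values are copied from the explain output above; the names `apt_3_foxes`, `three_foxes` and `bool_should_score` are hypothetical.

```python
# Per-field scores copied from the explain output above.
apt_3_foxes = {
    "__title": 10.268458,
    "title": 8.497357,
    "commonInfo.RealNameShort": 10.114267,
}
three_foxes = {
    "__title": 12.053555,
    "commonInfo.RealNameShort": 11.887274,
}


def bool_should_score(field_scores):
    """Sum the scores of all matching should clauses."""
    return sum(field_scores.values())


print(round(bool_should_score(apt_3_foxes), 4))
print(round(bool_should_score(three_foxes), 4))
```

The sums agree with the _score values in the response up to rounding: three mediocre matches add up to more than two good ones.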
Disjunction max query
And indeed, the disjunction max query exists for exactly this case. Let's adapt the example from the docs:
{
"_source":[
"__title",
"title",
"commonInfo.RealNameShort"
],
"explain":false,
"from":0,
"size":3,
"query": {
"dis_max": {
"queries": [
{
"match_phrase_prefix":{
"__title":{
"query":"3 foxe"
}
}
},
{
"match_phrase_prefix":{
"title":{
"query":"3 foxe"
}
}
},
{
"match_phrase_prefix":{
"commonInfo.RealNameShort":{
"query":"3 foxe"
}
}
}
],
"tie_breaker": 0.7
}
}
}
{
"took": 13,
"timed_out": false,
"_shards": {
"total": 33,
"successful": 33,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 28,
"relation": "eq"
},
"max_score": 23.296595,
"hits": [
{
"_index": "data",
"_type": "_doc",
"_id": "15e4e503cc1d4284aeb34664cb61c5ae",
"_score": 23.296595,
"_source": {
"__title": "apt 3 foxes",
"commonInfo": {
"RealNameShort": "apt 3 foxes"
},
"title": "apartmetnt 3 foxes"
}
},
{
"_index": "data",
"_type": "_doc",
"_id": "83b2653a851c4ca19d3df0410ab1c41f",
"_score": 21.053097,
"_source": {
"__title": "rest/ 3 foxes",
"commonInfo": {
"RealNameShort": "rest/ 3 foxes"
},
"title": "restaraunt/ 3 foxes"
}
},
{
"_index": "data",
"_type": "_doc",
"_id": "1a42873cead94f18a31d0b102b4fbdcd",
"_score": 20.374645,
"_source": {
"__title": "3 foxes",
"title": "3 completely irrelevant to real name words",
"commonInfo": {
"RealNameShort": "3 foxes"
}
}
}
]
}
}
The order is still the same, only the scores are lower. To see why, we need to know what the tie_breaker parameter does. Let's tweak it to find out. First we'll set it to 1.
{
"took": 13,
"timed_out": false,
"_shards": {
"total": 33,
"successful": 33,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 28,
"relation": "eq"
},
"max_score": 28.880083,
"hits": [
{
"_index": "data",
"_type": "_doc",
"_id": "15e4e503cc1d4284aeb34664cb61c5ae",
"_score": 28.880083,
"_source": {
"__title": "apt 3 foxes",
"commonInfo": {
"RealNameShort": "apt 3 foxes"
},
"title": "apartmetnt 3 foxes"
}
},
{
"_index": "data",
"_type": "_doc",
"_id": "83b2653a851c4ca19d3df0410ab1c41f",
"_score": 26.242756,
"_source": {
"__title": "rest/ 3 foxes",
"commonInfo": {
"RealNameShort": "rest/ 3 foxes"
},
"title": "restaraunt/ 3 foxes"
}
},
{
"_index": "data",
"_type": "_doc",
"_id": "1a42873cead94f18a31d0b102b4fbdcd",
"_score": 23.940828,
"_source": {
"__title": "3 foxes",
"title": "3 completely irrelevant to real name words",
"commonInfo": {
"RealNameShort": "3 foxes"
}
}
}
]
}
}
With tie_breaker set to 1 the scores are exactly those of the bool query: every field contributes in full. Now let's set it to 0.
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 33,
"successful": 33,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 28,
"relation": "eq"
},
"max_score": 12.053555,
"hits": [
{
"_index": "data",
"_type": "_doc",
"_id": "1a42873cead94f18a31d0b102b4fbdcd",
"_score": 12.053555,
"_source": {
"__title": "3 foxes",
"title": "3 completely irrelevant to real name words",
"commonInfo": {
"RealNameShort": "3 foxes"
}
}
},
// omitted for brevity
]
}
}
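The three experiments can be reproduced with a short sketch of the dis_max scoring rule: the best clause score plus tie_breaker times the sum of the remaining clause scores. The per-field values come from the explain output earlier in the post; the function name `dis_max_score` and the variable names are hypothetical.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the sum of the other scores."""
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)


# Per-field scores copied from the explain output.
apt_3_foxes = [10.268458, 8.497357, 10.114267]
three_foxes = [12.053555, 11.887274]

for tie_breaker in (0.0, 0.7, 1.0):
    print(
        tie_breaker,
        round(dis_max_score(apt_3_foxes, tie_breaker), 4),
        round(dis_max_score(three_foxes, tie_breaker), 4),
    )
```

Only with tie_breaker at 0 does the single best field decide the order, which is what puts "3 foxes" on top.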
Conclusion
When implementing autocomplete with Elasticsearch, don't jump straight to the naive query_string approach. Explore the rich Elasticsearch query language first. The search_as_you_type mapping at index time might not be a silver bullet either, as its main aim is to handle search queries with out-of-order words by creating n-gram fields for you. So it might be sufficient to rely solely on query-time improvements, such as the bool_prefix query type if you want more lenient results or the match_phrase_prefix query type if you want stricter ones.
When combining autocomplete on multiple fields, you may use the dis_max query type. In that case, increasing the tie_breaker parameter increases the degree to which the fields other than the best-matching one influence the resulting score.
And finally, when in doubt about why query results don't match your expectations, you may resort to the "explain": true query parameter.
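Putting it together, the final request body could be generated with a helper along these lines. This is only a sketch: `build_autocomplete_query` and its parameters are hypothetical names, and it only builds the JSON body, leaving the HTTP call to whatever client you use.

```python
import json


def build_autocomplete_query(text, fields, tie_breaker=0.0, size=3):
    """Build a dis_max of match_phrase_prefix clauses, one per field."""
    clauses = [
        {"match_phrase_prefix": {field: {"query": text}}}
        for field in fields
    ]
    return {
        "_source": list(fields),
        "size": size,
        "query": {"dis_max": {"queries": clauses, "tie_breaker": tie_breaker}},
    }


body = build_autocomplete_query(
    "3 foxe", ["__title", "title", "commonInfo.RealNameShort"]
)
print(json.dumps(body, indent=2))
```

Keeping tie_breaker as a parameter makes it easy to experiment with values between 0 and 1 if the best field alone turns out to be too strict.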