N-grams
Overview
N-grams break text into smaller chunks, enabling search features like autocomplete. Elasticsearch provides N-gram tokenizers and token filters that index these chunks for partial matching.
- N-grams enable autocomplete and fuzzy search.
- Elasticsearch uses custom analyzers for N-grams.
- N-gram tokenizers index text for partial matching.
- N-gram indexing improves search speed and accuracy for partial matches.
Index-time with N-grams
N-grams help improve search features like autocomplete. Each N-gram size corresponds to a different token length used for matching text.
Consider the example term "star":
| N-gram | Description | Example tokens |
|---------|-------------|----------------|
| Unigram | Single tokens that represent individual characters or words | [ s, t, a, r ] |
| Bigram | Pairs of consecutive characters or words | [ st, ta, ar ] |
| Trigram | Sets of three consecutive characters or words | [ sta, tar ] |
| 4-gram | Sets of four consecutive characters or words | [ star ] |
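As a quick sketch (assuming a local Elasticsearch instance at 127.0.0.1:9200, as in the examples below), you can reproduce this tokenization with the `_analyze` API and an inline `ngram` tokenizer, where `min_gram` and `max_gram` bound the token lengths:

```
# Produce the bigrams of "star"; min_gram/max_gram of 2 yields [ st, ta, ar ]
curl -XGET 127.0.0.1:9200/_analyze?pretty \
-H "Content-Type: application/json" -d '
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 2
  },
  "text": "star"
}'
```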
Edge N-grams
Edge n-grams work differently from standard n-grams. Instead of breaking down a term into n-grams for the entire word, edge n-grams generate tokens only from the beginning of each term. This is especially useful for autocomplete scenarios, where you want to match partial words typed by the user.
For example, using edge n-grams on the word "star":
| N-gram | Description | Example tokens |
|---------|-------------|----------------|
| Unigram | Single character from the start of the term | [ s ] |
| Bigram | Two consecutive characters from the start | [ st ] |
| Trigram | Three consecutive characters from the start | [ sta ] |
| 4-gram | Four consecutive characters from the start | [ star ] |
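The same `_analyze` sketch works for edge n-grams; note that every token is a prefix of the term:

```
# Produce edge n-grams of "star" from length 1 to 4: [ s, st, sta, star ]
curl -XGET 127.0.0.1:9200/_analyze?pretty \
-H "Content-Type: application/json" -d '
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 4
  },
  "text": "star"
}'
```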
Indexing N-grams
N-gram indexing is used to optimize search, allowing for efficient partial matching and autocomplete results. Here's how to set it up in Elasticsearch:
- Create an Autocomplete Analyzer

  Set up a custom analyzer for autocomplete. The `edge_ngram` filter breaks the text into smaller chunks (e.g., `s`, `st`, `sta`) for fast matching.

```
curl -XPUT 127.0.0.1:9200/movies?pretty \
-H "Content-Type: application/json" -d '
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}'
```

- Apply the Analyzer During Indexing

  This command applies the custom `autocomplete` analyzer to the `title` field during indexing.

```
curl -XPUT '127.0.0.1:9200/movies/_mapping?pretty' \
-H "Content-Type: application/json" -d '
{
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete"
}
}
}'
```

- Use the Standard Analyzer on Queries

  To avoid splitting the query itself into n-grams, use the standard analyzer on the search side. This way the user's text is matched as-is against the n-grams that were indexed.

```
curl -XGET 127.0.0.1:9200/movies/_search?pretty \
-H "Content-Type: application/json" -d '
{
"query": {
"match": {
"title": {
"query": "sta",
"analyzer": "standard"
}
}
}
}'
```

  In this query, we use "sta" as the search term for "Star Wars", and by specifying the standard analyzer we ensure it is not broken down into n-grams.
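Alternatively, here is a sketch of baking this in with `search_analyzer`, so that queries against `title` use the standard analyzer without specifying it per request. This variant belongs in the initial mapping, since the index-time analyzer of an existing field cannot be changed:

```
# Hypothetical variant of the earlier mapping: search_analyzer sets the
# query-time analyzer, while "analyzer" still applies at index time
curl -XPUT '127.0.0.1:9200/movies/_mapping?pretty' \
-H "Content-Type: application/json" -d '
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "autocomplete",
      "search_analyzer": "standard"
    }
  }
}'
```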
- Use Completion Suggesters

  You can also pre-load lists of potential completions using completion suggesters. This allows you to efficiently suggest options like "star" as the user types.

```
curl -XPUT 127.0.0.1:9200/movies/_mapping?pretty \
-H "Content-Type: application/json" -d '
{
"properties": {
"title_suggest": {
"type": "completion"
}
}
}'
```
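A minimal usage sketch (the document ID and values here are illustrative, not part of the original steps): index a document with a value in `title_suggest`, then request completions for a prefix:

```
# Index a document into the completion field (illustrative values)
curl -XPUT 127.0.0.1:9200/movies/_doc/1?pretty \
-H "Content-Type: application/json" -d '
{
  "title_suggest": "Star Wars"
}'

# Ask the completion suggester for matches on the prefix "sta"
curl -XGET 127.0.0.1:9200/movies/_search?pretty \
-H "Content-Type: application/json" -d '
{
  "suggest": {
    "title_completion": {
      "prefix": "sta",
      "completion": {
        "field": "title_suggest"
      }
    }
  }
}'
```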
Lab: N-grams in Action
- Download the `movies.json` dataset.

- First, store the Elasticsearch endpoint and credentials in variables:

```
ELASTIC_ENDPOINT="https://your-elasticsearch-endpoint"
ELASTIC_USER="your-username"
ELASTIC_PW="your-password"
```

- Create the movies index with a custom autocomplete analyzer. This analyzer uses edge n-grams to enhance search functionality.

```
curl -s -u $ELASTIC_USER:$ELASTIC_PW \
-H 'Content-Type: application/json' \
-XPUT $ELASTIC_ENDPOINT:9200/movies?pretty -d '
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}' | jq
```

Output:

```
{
"acknowledged": true,
"shards_acknowledged": true,
"index": "movies"
}
```

- Test the autocomplete analyzer on the term "Sta" and check the n-gram tokenization:

```
curl -s -u $ELASTIC_USER:$ELASTIC_PW \
-H "Content-Type: application/json" \
-XGET $ELASTIC_ENDPOINT:9200/movies/_analyze?pretty -d '
{
"analyzer": "autocomplete",
"text": "Sta"
}' | jq
```

The output shows how the term "Sta" is split into tokens by the edge n-gram analyzer:

```
{
"tokens": [
{
"token": "s",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "st",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "sta",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
}
]
}
```

- Map the `autocomplete` analyzer to the `title` field of the movies index to ensure it's used during indexing:

```
curl -s -u $ELASTIC_USER:$ELASTIC_PW \
-H "Content-Type: application/json" \
-XPUT $ELASTIC_ENDPOINT:9200/movies/_mapping?pretty -d'
{
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete"
}
}
}' | jq
```

Output:

```
{
"acknowledged": true
}
```

- Import the `movies.json` dataset into Elasticsearch using the bulk API:

```
curl -s -u $ELASTIC_USER:$ELASTIC_PW \
-H 'Content-Type: application/json' \
-XPUT $ELASTIC_ENDPOINT:9200/_bulk?pretty \
--data-binary @movies.json | jq
```
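For reference, the bulk API expects newline-delimited JSON, with each document preceded by an action line; an excerpt of such a file might look like this (the field values are illustrative, assuming a typical movies schema):

```
{ "create": { "_index": "movies", "_id": "135569" } }
{ "id": "135569", "title": "Star Trek Beyond", "year": 2016, "genre": ["Action", "Adventure", "Sci-Fi"] }
```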
- Finally, execute a search query using the `standard` analyzer on the query side, while the `autocomplete` analyzer is used on the index side. This ensures the query itself is not split into n-grams:

```
curl -s -u $ELASTIC_USER:$ELASTIC_PW \
-H 'Content-Type: application/json' \
-XGET $ELASTIC_ENDPOINT:9200/movies/_search?pretty -d'
{
"query": {
"match": {
"title": {
"query": "sta",
"analyzer": "standard"
}
}
}
}' | jq
```

This will return all movie entries related to the "Star Wars" and "Star Trek" franchises, demonstrating the autocomplete functionality with the edge n-gram analyzer.