A presentation at PHP CE in Prague, Czechia by Philipp Krenn
Full-Text Search Internals Philipp Krenn @xeraa
Who uses a Database?
Who uses Search?
Apache Lucene Elasticsearch
Example These are <em>not</em> the droids you are looking for.
html_strip Char Filter These are not the droids you are looking for.
standard Tokenizer These are not the droids you are looking for
lowercase Token Filter these are not the droids you are looking for
stop Token Filter droids you looking
snowball Token Filter droid you look
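The analysis chain on these slides (char filter, tokenizer, token filters) can be sketched in a few lines of Python. This is only an illustration: the regex tokenizer and the `toy_stem` suffix stripper are simplified stand-ins (assumptions), not the real standard tokenizer or Snowball stemmer.

```python
import re

# Stop words copied from the "Stop Words" slide of this deck.
STOP_WORDS = set(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)

def html_strip(text):
    """Char filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def standard_tokenize(text):
    """Tokenizer: rough stand-in that keeps runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def toy_stem(token):
    """Token filter: toy suffix stripper, NOT the real Snowball algorithm."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = standard_tokenize(html_strip(text))
    tokens = [t.lower() for t in tokens]                  # lowercase filter
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop filter
    return [toy_stem(t) for t in tokens]                  # stemming filter

print(analyze("These are <em>not</em> the droids you are looking for."))
```

The order matters: lowercasing has to run before the stop filter, otherwise "These" would not be recognized as a stop word.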
Docker Compose
---
version: '2'
services:
  kibana:
    image: docker.elastic.co/kibana/kibana:$ELASTIC_VERSION
    links:
      - elasticsearch
    ports:
      - 5601:5601
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:$ELASTIC_VERSION
    volumes:
      - esdata1:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
volumes:
  esdata1:
    driver: local
GET /_analyze { "analyzer": "english", "text": "These are not the droids you are looking for." }
{ "tokens": [ { "token": "droid", "start_offset": 18, "end_offset": 24, "type": "<ALPHANUM>", "position": 4 }, { "token": "you", "start_offset": 25, "end_offset": 28, "type": "<ALPHANUM>", "position": 5 }, ... ] }
GET /_analyze { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "stop", "snowball" ], "text": "These are <em>not</em> the droids you are looking for." }
{ "tokens": [ { "token": "droid", "start_offset": 27, "end_offset": 33, "type": "<ALPHANUM>", "position": 4 }, { "token": "you", "start_offset": 34, "end_offset": 37, "type": "<ALPHANUM>", "position": 5 }, ... ] }
Stop Words a an and are as at be but by for if in into is it no not of on or such that the their then there these they this to was will with https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L44-L50
Always Use Stop Words?
To be, or not to be.
Languages Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Turkish, Thai
Language Rules English: Philipp's → philipp French: l'église → eglis German: äußerst → ausserst
More Language Plugins Core: ICU (Asian languages), Kuromoji (advanced Japanese), Phonetic, SmartCN, Stempel (Polish), Ukrainian Community: Hebrew, Vietnamese, Network Address Analysis, String2Integer,...
German GET /_analyze { "analyzer": "german", "text": "Das sind nicht die Droiden, nach denen du suchst." }
{ "tokens": [ { "token": "droid", "start_offset": 19, "end_offset": 26, "type": "<ALPHANUM>", "position": 4 }, { "token": "den", "start_offset": 33, "end_offset": 38, "type": "<ALPHANUM>", "position": 6 }, { "token": "such", "start_offset": 42, "end_offset": 48, "type": "<ALPHANUM>", "position": 8 } ] }
German with the English Analyzer da sind nicht die droiden nach denen du suchst
German Stop Words https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/german_stop.txt
Detect Languages https://github.com/spinscale/elasticsearch-ingest-langdetect
PUT _ingest/pipeline/langdetect-pipeline { "description": "A pipeline to detect languages", "processors": [ { "langdetect" : { "field" : "quote", "target_field" : "language" } } ] }
POST _ingest/pipeline/langdetect-pipeline/_simulate { "docs": [ { "_source": { "quote": "Das sind nicht die Droiden, nach denen du suchst." } } ] }
{ "docs": [ { "doc": { "_index": "_index", "_type": "_type", "_id": "_id", "_source": { "language": "de", "quote": "Das sind nicht die Droiden, nach denen du suchst." }, "_ingest": { "timestamp": "2018-10-26T00:06:42.320613Z" } } } ] }
Phonetic GET /_analyze { "tokenizer": "standard", "filter": [ { "type": "phonetic", "encoder": "beider_morse", "languageset": "any" } ], "text": "These are not the droids you are looking for." }
Phonetic ... drDts drits drots loknk... iou ari ori
Another Example Obi-Wan never told you what happened to your father.
Another Example obi wan never told you what happen your father
Another Example <b>No</b>. I am your father.
Another Example i am your father
Inverted Index
Term     ID 1   ID 2   ID 3
am       0      0      1[2]
droid    1[4]   0      0
father   0      1[9]   1[4]
happen   0      1[6]   0
i        0      0      1[1]
look     1[7]   0      0
never    0      1[2]   0
obi      0      1[0]   0
told     0      1[3]   0
wan      0      1[1]   0
what     0      1[5]   0
you      1[5]   1[4]   0
your     0      1[8]   1[3]
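A small Python sketch of how such a positional inverted index could be built and used for a phrase lookup. The token positions are taken from the Inverted Index slide; the function names `build_index` and `phrase_match` are hypothetical, and real Lucene postings are stored far more compactly.

```python
from collections import defaultdict

# Analyzed tokens with their original positions. Stop words are already
# removed, which is why the position numbers have gaps.
docs = {
    1: [("droid", 4), ("you", 5), ("look", 7)],
    2: [("obi", 0), ("wan", 1), ("never", 2), ("told", 3), ("you", 4),
        ("what", 5), ("happen", 6), ("your", 8), ("father", 9)],
    3: [("i", 1), ("am", 2), ("your", 3), ("father", 4)],
}

def build_index(docs):
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, pos in tokens:
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_match(index, terms):
    """Return ids of documents containing the terms at consecutive positions."""
    first, rest = terms[0], terms[1:]
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        for start in positions:
            if all(start + i + 1 in index.get(t, {}).get(doc_id, [])
                   for i, t in enumerate(rest)):
                hits.append(doc_id)
                break
    return sorted(hits)

index = build_index(docs)
print(index["you"])
print(phrase_match(index, ["i", "am", "your", "father"]))
```

Storing positions next to the document ids is what makes phrase queries possible without rereading the original text.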
To / The Index
PUT /starwars { "settings": { "number_of_shards": 1, "analysis": { "filter": { "my_synonym_filter": { "type": "synonym", "synonyms": [ "father,dad", "droid => droid,machine" ] } },
"analyzer": { "my_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "stop", "snowball", "my_synonym_filter" ] } } } },
"mappings": { "_doc": { "properties": { "quote": { "type": "text", "analyzer": "my_analyzer" } } } } }
Synonyms: at index time (synonym) or at query time (synonym_graph)
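The two synonym rule styles used in the index settings ("father,dad" and "droid => droid,machine") behave differently, which a rough Python sketch can show. The helpers `parse_synonyms` and `expand` are hypothetical; the real filter also places synonyms at the same token position, which this sketch ignores.

```python
def parse_synonyms(rules):
    """Turn synonym rules into a {term: [replacement terms]} mapping."""
    mapping = {}
    for rule in rules:
        if "=>" in rule:
            # Explicit mapping: the left side is replaced by the right side.
            left, right = rule.split("=>")
            targets = [t.strip() for t in right.split(",")]
            for source in left.split(","):
                mapping[source.strip()] = targets
        else:
            # Equivalent terms: each term expands to the whole group.
            group = [t.strip() for t in rule.split(",")]
            for term in group:
                mapping[term] = [term] + [t for t in group if t != term]
    return mapping

def expand(tokens, mapping):
    """Apply the mapping to a token stream, keeping unknown tokens as-is."""
    out = []
    for token in tokens:
        out.extend(mapping.get(token, [token]))
    return out

rules = ["father,dad", "droid => droid,machine"]
mapping = parse_synonyms(rules)
print(expand(["my", "father", "machin"], mapping))
```

Note that the explicit rule is one-directional: "droid" expands to "machine", but "machine" is not expanded back to "droid".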
GET /starwars/_mapping GET /starwars/_settings
PUT /starwars/_doc/1 { "quote": "These are <em>not</em> the droids you are looking for." } PUT /starwars/_doc/2 { "quote": "Obi-Wan never told you what happened to your father." } PUT /starwars/_doc/3 { "quote": "<b>No</b>. I am your father." }
GET /starwars/_doc/1 GET /starwars/_doc/1/_source
Multi Lingual Index PUT /starwars_en/_doc/1 Type Field { "quote_en": "...", "quote_de": "..." }
PS: Single Type per Index
POST /starwars/_search { "query": { "match_all": { } } }
{ "took": 1, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3, "max_score": 1, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 1, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, ...
POST /starwars/_search { "query": { "match": { "quote": "droid" } } }
{ "took": 2, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.39556286, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.39556286, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ] } }
POST /starwars/_search { "query": { "match": { "quote": "dad" } } }
... "hits": { "total": 2, "max_score": 0.41913947, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.41913947, "_source": { "quote": "<b>No</b>. I am your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.39291072, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] } }
POST /starwars/_doc/0/_explain { "query": { "match": { "quote": "dad" } } }
{ "_index": "starwars", "_type": "_doc", "_id": "0", "matched": false }
POST /starwars/_doc/1/_explain { "query": { "match": { "quote": "dad" } } }
{ "_index": "starwars", "_type": "_doc", "_id": "1", "matched": false, "explanation": { "value": 0, "description": "no matching term", "details": [] } }
POST /starwars/_doc/2/_explain { "query": { "match": { "quote": "dad" } } }
{ "_index": "starwars", "_type": "_doc", "_id": "2", "matched": true, "explanation": { ...
POST /starwars/_search { "query": { "match": { "quote": "machine" } } }
{ "took": 2, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 1, "max_score": 1.2499592, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.2499592, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ] } }
POST /starwars/_search { "query": { "match_phrase": { "quote": "I am your father" } } }
{ "took": 3, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1.5665855, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.5665855, "_source": { "quote": "<b>No</b>. I am your father." } } ] } }
POST /starwars/_search { "query": { "match_phrase": { "quote": { "query": "I am father", "slop": 1 } } } }
{ "took": 16, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.8327639, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.8327639, "_source": { "quote": "<b>No</b>. I am your father." } } ] } }
POST /starwars/_search { "query": { "match_phrase": { "quote": { "query": "I am not your father", "slop": 1 } } } }
{ "took": 5, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1.0409548, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.0409548, "_source": { "quote": "<b>No</b>. I am your father." } } ] } }
POST /starwars/_search { "query": { "match": { "quote": { "query": "van", "fuzziness": "AUTO" } } } }
{ "took": 14, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.18155496, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.18155496, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] } }
POST /starwars/_search { "query": { "match": { "quote": { "query": "ovi-van", "fuzziness": 1 } } } }
{ "took": 109, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.3798467, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.3798467, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] } }
FuzzyQuery History http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html Before: Brute force Now: Levenshtein Automaton
SELECT * FROM starwars WHERE quote LIKE "?an" OR quote LIKE "V?n" OR quote LIKE "Va?"
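Fuzzy matching is based on edit distance. The following Python sketch computes the plain Levenshtein distance with the classic dynamic-programming table (the brute-force idea, not Lucene's automaton) and mirrors the length thresholds of "fuzziness": "AUTO". The function names are hypothetical, and Lucene additionally counts a transposition as a single edit by default, which this sketch does not.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]

def auto_fuzziness(term):
    """Allowed edits for AUTO: 0 up to 2 chars, 1 for 3 to 5, 2 above."""
    n = len(term)
    if n <= 2:
        return 0
    if n <= 5:
        return 1
    return 2

def fuzzy_matches(query, term):
    return levenshtein(query, term) <= auto_fuzziness(query)

print(levenshtein("van", "wan"), fuzzy_matches("van", "wan"))
```

With these rules "van" still finds "wan", while a two-letter query would have to match exactly.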
Term Frequency / Inverse Document Frequency (TF/IDF) Search one term
BM25 Default in Elasticsearch 5.0 https://speakerdeck.com/elastic/improved-text-scoring-with-bm25
Term Frequency
Inverse Document Frequency
Field-Length Norm
POST /starwars/_search?explain=true { "query": { "match": { "quote": "father" } } }
... "_explanation": { "value": 0.41913947, "description": "weight(Synonym(quote:dad quote:father) in 0) [PerFieldSimilarity], result of:", "details": [ { "value": 0.41913947, "description": "score(doc=0,freq=2.0 = termFreq=2.0\n), product of:", "details": [ { "value": 0.2876821, "description": "idf(docFreq=1, docCount=1)", "details": [] }, { "value": 1.4569536, "description": "tfNorm, computed from:", "details": [ { "value": 2, "description": "termFreq=2.0", "details": [] }, ...
Score 0.41913947: i am your father 0.39291072: obi wan never told what happen your father you
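The numbers in the explain output can be reproduced with a short Python sketch of the BM25 formula as Lucene uses it. The idf inputs (docFreq=1, docCount=1) and termFreq=2.0 come from the explain output; the field length of 4 and average field length of 5 are assumptions chosen so that tfNorm lines up with the reported value.

```python
import math

def idf(doc_freq, doc_count):
    """Inverse document frequency: rare terms weigh more."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def tf_norm(freq, field_len, avg_field_len, k1=1.2, b=0.75):
    """Term frequency with saturation (k1) and field-length norm (b)."""
    length_factor = 1 - b + b * field_len / avg_field_len
    return freq * (k1 + 1) / (freq + k1 * length_factor)

# freq is 2.0 because the synonyms "dad" and "father" are counted together.
term_idf = idf(doc_freq=1, doc_count=1)
term_tf = tf_norm(freq=2.0, field_len=4, avg_field_len=5)  # lengths assumed
score = term_idf * term_tf
print(round(term_idf, 7), round(term_tf, 7), round(score, 7))
```

The shorter quote scores higher than the longer one for the same term frequency, which is the field-length norm at work.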
Vector Space Model Search multiple terms
Search your father
Coordination Factor Reward multiple terms
Search for 3 terms 1 term: 2 terms: 3 terms:
Practical Scoring Function Putting it all together
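$$\mathrm{score}(q,d) = \mathrm{queryNorm}(q) \cdot \mathrm{coord}(q,d) \cdot \sum_{t \in q} \big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \big)$$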
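A minimal Python sketch of the practical scoring function, assuming Lucene's classic TF/IDF definitions (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), norm = 1 / sqrt(number of terms in the field)). The helper names and sample documents are hypothetical, and a single boost is used for all terms to keep it short.

```python
import math

def tf(freq):
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    return 1 + math.log(num_docs / (doc_freq + 1))

def norm(num_terms):
    return 1 / math.sqrt(num_terms)

def score(query, doc, doc_freqs, num_docs, boost=1.0):
    """query: list of terms, doc: {term: freq}, doc_freqs: {term: docFreq}."""
    matched = [t for t in query if doc.get(t, 0) > 0]
    if not matched:
        return 0.0
    weights = [idf(doc_freqs.get(t, 0), num_docs) * boost for t in query]
    query_norm = 1 / math.sqrt(sum(w * w for w in weights))
    coord = len(matched) / len(query)   # reward matching more query terms
    field_norm = norm(sum(doc.values()))
    total = sum(tf(doc[t]) * idf(doc_freqs[t], num_docs) ** 2 * boost * field_norm
                for t in matched)
    return query_norm * coord * total

doc_freqs = {"your": 2, "father": 2}
both_terms = {"i": 1, "am": 1, "your": 1, "father": 1}
one_term = {"no": 1, "father": 1, "here": 1, "today": 1}  # made-up document
print(score(["your", "father"], both_terms, doc_freqs, num_docs=3))
print(score(["your", "father"], one_term, doc_freqs, num_docs=3))
```

Both sample documents have the same length, so the difference in score comes only from the number of matching terms and the coordination factor.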
Function Score Script, weight, random, field value, decay (geo or date)
POST /starwars/_search { "query": { "function_score": { "query": { "match": { "quote": "father" } }, "random_score": {} } } }
Compare Scores "100% perfect" vs a "50%" match
Don't do this. Seriously. Stop trying to think about your problem this way, it's not going to end well. (https://wiki.apache.org/lucene-java/ScoresAsPercentages)
GET /starwars/_analyze { "analyzer" : "my_analyzer", "text": "These are my father's machines." }
{ "tokens": [ { "token": "my", "start_offset": 10, "end_offset": 12, "type": "<ALPHANUM>", "position": 2 }, { "token": "father", "start_offset": 13, "end_offset": 21, "type": "<ALPHANUM>", "position": 3 }, { "token": "dad", "start_offset": 13, "end_offset": 21, "type": "SYNONYM", "position": 3 }, { "token": "machin", "start_offset": 22, "end_offset": 30, "type": "<ALPHANUM>", "position": 4 } ] }
PUT /starwars/_doc/4 { "quote": "These are my father's machines." }
POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }
"hits": { "total": 4, "max_score": 2.92523, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 2.92523, "_source": { "quote": "These are my father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.8617505, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...
2.92523 == 100%
DELETE /starwars/_doc/4 POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }
"hits": { "total": 3, "max_score": 1.2499592, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.2499592, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...
1.2499592 == 43% or 100%?
PUT /starwars/_doc/4 { "quote": "These droids are my father's father's machines." } POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }
"hits": { "total": 4, "max_score": 3.0068164, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 3.0068164, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.89701396, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...
3.0068164 == 103%?
PS: Shards Default? Effect on IDF?
Distributed Frequency Search GET starwars/_search?search_type=dfs_query_then_fetch { ... }
Don’t use dfs_query_then_fetch in production. It really isn’t required. (https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-is-broken.html)
More Search
POST /starwars/_search { "query": { "match": { "quote": "father" } }, "highlight": { "type": "unified", "pre_tags": [ "<tag>" ], "post_tags": [ "</tag>" ], "fields": { "quote": {} } } }
... "hits": { "total": 3, "max_score": 0.631961, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 0.631961, "_source": { "quote": "These droids are my father's father's machines." }, "highlight": { "quote": [ "These droids are my <tag>father's</tag> <tag>father's</tag> machines." ] } }, ...
Boolean Queries must must_not should filter
POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father" } }, "should": [ { "match": { "quote": "your" } }, { "match": { "quote": "obi" } } ] } } }
... "hits": { "total": 3, "max_score": 2.117857, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 2.117857, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.3856719, "_source": { "quote": "<b>No</b>. I am your father." } }, ...
POST /starwars/_search { "query": { "bool": { "filter": { "match": { "quote": "father" } }, "should": [ { "match": { "quote": "your" } }, { "match": { "quote": "obi" } } ] } } }
... "hits": { "total": 3, "max_score": 1.6694657, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 1.6694657, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.8317767, "_source": { "quote": "<b>No</b>. I am your father." } },
Named Queries & minimum_should_match
POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father" } }, "should": [ { "match": { "quote": { "query": "your", "_name": "quote-your" } } }, { "match": { "quote": { "query": "obi", "_name": "quote-obi" } } }, { "match": { "quote": { "query": "droid", "_name": "quote-droid" } } } ], "minimum_should_match": 2 } } }
... "hits": { "total": 1, "max_score": 2.117857, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 2.117857, "_source": { "quote": "Obi-Wan never told you what happened to your father." }, "matched_queries": [ "quote-obi", "quote-your" ] } ] } }
Boosting: >1 increase, <1 decrease, <0 punish (negative boosts are removed in 7.0)
POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father" } }, "should": [ { "match": { "quote": "your" } }, { "match": { "quote": { "query": "obi", "boost": 3 } } } ] } } }
... "hits": { "total": 3, "max_score": 4.2368493, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 4.2368493, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.3856719, "_source": { "quote": "<b>No</b>. I am your father." } }, ...
Search for father but prefer father father
POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father father" } } } } }
... "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 1.263922, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.1077905, "_source": { "quote": "<b>No</b>. I am your father." } },
POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father" } }, "should": { "match_phrase": { "quote": "father father" } } } } }
... "hits": { "total": 3, "max_score": 9.146545, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 9.146545, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.0454913, "_source": { "quote": "<b>No</b>. I am your father." } }, ...
Suggestion: suggest a similar text. Use the _search endpoint; the _suggest endpoint is deprecated since 5.0
POST /starwars/_search { "query": { "match": { "quote": "drui" } }, "suggest": { "my_suggestion" : { "text" : "drui", "term" : { "field" : "quote" } } } }
... "hits": { "total": 0, "max_score": null, "hits": [] }, "suggest": { "my_suggestion": [ { "text": "drui", "offset": 0, "length": 4, "options": [ { "text": "droid", "score": 0.5, "freq": 1 } ] } ] } }
Multiple Suggesters term phrase completion context
NGram Partial matches Edge Gram
GET /_analyze { "char_filter": [ "html_strip" ], "tokenizer": { "type": "ngram", "min_gram": "3", "max_gram": "3", "token_chars": [ "letter" ] }, "filter": [ "lowercase" ], "text": "These are <em>not</em> the droids you are looking for." }
{ "tokens": [ { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 }, { "token": "hes", "start_offset": 1, "end_offset": 4, "type": "word", "position": 1 }, { "token": "ese", "start_offset": 2, "end_offset": 5, "type": "word", "position": 2 }, { "token": "are", "start_offset": 6, "end_offset": 9, "type": "word", "position": 3 }, ...
GET /_analyze { "char_filter": [ "html_strip" ], "tokenizer": { "type": "edge_ngram", "min_gram": "1", "max_gram": "3", "token_chars": [ "letter" ] }, "filter": [ "lowercase" ], "text": "These are <em>not</em> the droids you are looking for." }
{ "tokens": [ { "token": "t", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 }, { "token": "th", "start_offset": 0, "end_offset": 2, "type": "word", "position": 1 }, { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 2 }, { "token": "a", "start_offset": 6, "end_offset": 7, "type": "word", "position": 3 }, { "token": "ar", "start_offset": 6, "end_offset": 8, "type": "word", "position": 4 }, ...
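The difference between the ngram and edge_ngram output can be sketched for a single word in Python. The function names are hypothetical, and the real tokenizers additionally track offsets and positions.

```python
def ngrams(word, min_gram, max_gram):
    """All substrings with a length between min_gram and max_gram."""
    return [word[i:i + n]
            for i in range(len(word))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(word)]

def edge_ngrams(word, min_gram, max_gram):
    """Only the prefixes with a length between min_gram and max_gram."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]

print(ngrams("these", 3, 3))       # trigrams anywhere in the word
print(edge_ngrams("these", 1, 3))  # prefixes only
```

Trigrams allow partial matches anywhere in a word, for example "fail" and "faith" share the gram "fai". Edge grams only match from the start of a word, which suits search-as-you-type.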
Combining Analyzers Reindex Store multiple times Tune BM25 Combine scores
BM25 Revisited
b: field length amplification (default 0.75)
k1: term frequency saturation (default 1.2)
PUT /starwars_v42 { "settings": { "number_of_shards": 1, "index": { "similarity": { "default": { "type": "BM25", "b": 0, "k1": 0 } } },
"analysis": { "filter": { "my_synonym_filter": { "type": "synonym", "synonyms": [ "father,dad", "droid => droid,machine" ] }, "my_ngram_filter": { "type": "ngram", "min_gram": "3", "max_gram": "3", "token_chars": [ "letter" ] } },
"analyzer": { "my_lowercase_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "whitespace", "filter": [ "lowercase" ] }, "my_full_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "stop", "snowball", "my_synonym_filter" ] },
"my_ngram_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "whitespace", "filter": [ "lowercase", "stop", "my_ngram_filter" ] } } } },
"mappings": { "_doc": { "properties": { "quote": { "type": "text", "fields": { "lowercase": { "type": "text", "analyzer": "my_lowercase_analyzer" }, "full": { "type": "text", "analyzer": "my_full_analyzer" }, "ngram": { "type": "text", "analyzer": "my_ngram_analyzer" } } } } } } }
POST /_reindex { "source": { "index": "starwars" }, "dest": { "index": "starwars_v42" } }
Aliases Atomic remove and add Point to multiple indices (read-only)
POST /_aliases { "actions": [ { "add": { "index": "starwars_v42", "alias": "starwars_extended" } } ] }
POST /starwars/_search { "query": { "match": { "quote": "droid" } } }
"hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 1.1533037, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.1295731, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ]
POST /starwars_extended/_search { "query": { "match": { "quote.full": "droid" } } }
"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 0.6931472, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "4", "_score": 0.6931472, "_source": { "quote": "These droids are my father's father's machines." } } ]
There are no "best" b and k1 values
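Why both hits on quote.full end up with the identical score 0.6931472 can be checked with a short Python sketch of BM25. It assumes a single shard with four documents of which two contain "droid"; the field lengths are made up for illustration.

```python
import math

def bm25(freq, doc_freq, doc_count, field_len, avg_field_len, k1, b):
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf_norm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * field_len / avg_field_len))
    return idf * tf_norm

# With k1 = 0 and b = 0 (as in starwars_v42) term frequency and field
# length drop out and only the idf is left.
short_doc = bm25(1, 2, 4, field_len=3, avg_field_len=5, k1=0, b=0)
long_doc = bm25(1, 2, 4, field_len=6, avg_field_len=5, k1=0, b=0)
print(short_doc, long_doc)

# With the defaults the shorter field scores higher again.
print(bm25(1, 2, 4, 3, 5, k1=1.2, b=0.75), bm25(1, 2, 4, 6, 5, k1=1.2, b=0.75))
```

Setting both parameters to zero turns the score into a pure "does the term occur" signal weighted by rarity, which may or may not be what a given use case needs.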
POST /starwars_extended/_search?explain=true { "query": { "multi_match": { "query": "obiwan", "fields": [ "quote", "quote.lowercase", "quote.full", "quote.ngram" ], "type": "most_fields" } } }
... "hits": { "total": 1, "max_score": 0.4912064, "hits": [ { "_shard": "[starwars_v42][2]", "_node": "BCDwzJ4WSw2dyoGLTzwlqw", "_index": "starwars_v42", "_type": "_doc", "_id": "2", "_score": 0.4912064, "_source": { "quote": "Obi-Wan never told you what happened to your father." }, ...
Whitespace Tokenizer "weight( Synonym(quote.ngram:biw quote.ngram:iwa quote.ngram:obi quote.ngram:wan) in 0) [PerFieldSimilarity], result of:"
POST /starwars_extended/_search { "query": { "multi_match": { "query": "you", "fields": [ "quote", "quote.lowercase^5", "quote.full", "quote.ngram" ], "type": "best_fields" } } }
"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 3.465736, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "2", "_score": 3.465736, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "3", "_score": 0.35667494, "_source": { "quote": "<b>No</b>. I am your father." } } ]
Multi Match Type
best_fields: Score of the best field (default)
cross_fields: All terms in at least one field
most_fields: Score sum of all fields
phrase
Different Analyzers for Indexing and Searching Per query In the mapping
POST /starwars_extended/_search { "query": { "match": { "quote.ngram": { "query": "the", "analyzer": "standard" } } } }
... "hits": [ { "_index": "starwars_extended", "_type": "_doc", "_id": "2", "_score": 0.38254172, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars_extended", "_type": "_doc", "_id": "3", "_score": 0.36165747, "_source": { "quote": "<b>No</b>. I am your father." } } ] ...
Edge Gram vs Trigram Test a setting before adding a field
Shingle Token Filter Shingles (token ngrams) from a token stream
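Shingles are n-grams on the token level instead of the character level. A minimal Python sketch, with a hypothetical `shingles` helper that mimics the filter's min/max shingle size and its default of also emitting the single tokens:

```python
def shingles(tokens, min_size=2, max_size=2, output_unigrams=True):
    """Join neighbouring tokens into word n-grams."""
    out = []
    for i, token in enumerate(tokens):
        if output_unigrams:
            out.append(token)
        for size in range(min_size, max_size + 1):
            if i + size <= len(tokens):
                out.append(" ".join(tokens[i:i + size]))
    return out

print(shingles(["these", "droids", "are"]))
print(shingles(["these", "droids", "are"], output_unigrams=False))
```

Because "these droids" and "droids are" become terms of their own, a document containing the words in this order gets extra matches compared to one that only contains the individual words.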
POST /starwars_extended/_close PUT /starwars_extended/_settings { "index": { "similarity": { "default": { "type": "BM25", "b": null, "k1": null } } },
"analysis": { "filter": { "my_edgegram_filter": { "type": "edge_ngram", "min_gram": 3, "max_gram": 10 }, "my_shingle_filter": { "type": "shingle", "min_shingle_size": 2, "max_shingle_size": 2 } },
"analyzer": { "my_edgegram_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "my_edgegram_filter" ] },
"my_shingle_analyzer": { "type": "custom", "tokenizer": "standard", "filter": [ "lowercase", "my_shingle_filter" ] } } } } POST /starwars_extended/_open
GET starwars_extended/_analyze { "text": "Father", "analyzer": "my_edgegram_analyzer" }
{ "tokens": [ { "token": "fat", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }, { "token": "fath", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }, { "token": "fathe", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }, { "token": "father", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 } ] }
PUT /starwars_extended/_doc/_mapping { "properties": { "quote": { "type": "text", "fields": { "edgegram": { "type": "text", "analyzer": "my_edgegram_analyzer", "search_analyzer": "standard" }, "shingle": { "type": "text", "analyzer": "my_shingle_analyzer" } } } } }
PUT /starwars_extended/_doc/5 { "quote": "I find your lack of faith disturbing." } PUT /starwars_extended/_doc/6 { "quote": "That... is your failure." }
GET /starwars_extended/_doc/5/_termvectors { "fields": [ "quote.edgegram" ], "offsets": true, "payloads": true, "positions": true, "term_statistics": true, "field_statistics": true }
{ "_index": "starwars_v42", "_type": "_doc", "_id": "5", "_version": 1, "found": true, "took": 3, "term_vectors": { "quote.edgegram": { "field_statistics": { "sum_doc_freq": 26, "doc_count": 2, "sum_ttf": 26 }, "terms": { "dis": { "doc_freq": 1, "ttf": 1, "term_freq": 1, "tokens": [ { "position": 6, "start_offset": 26, "end_offset": 36 } ] }, "dist": { "doc_freq": 1, "ttf": 1, ...
POST /starwars_extended/_search { "query": { "match": { "quote": "fail" } } }
POST /starwars_extended/_search { "query": { "match": { "quote.lowercase": "fail" } } }
POST /starwars_extended/_search { "query": { "match": { "quote.full": "fail" } } }
POST /starwars_extended/_search { "query": { "match": { "quote.ngram": "fail" } } }
"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "6", "_score": 1.8400999, "_source": { "quote": "That... is your failure." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "5", "_score": 1.442779, "_source": { "quote": "I find your lack of faith disturbing." } } ]
POST /starwars_extended/_search { "query": { "match": { "quote.edgegram": "fail" } } }
"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "6", "_score": 1.0114291, "_source": { "quote": "That... is your failure." } } ]
Updating Missing Fields Expensive
POST /starwars_extended/_update_by_query { "query": { "bool": { "must_not": { "exists": { "field": "quote.edgegram" } } } } }
Shingles: Context Should Matter
POST /starwars_extended/_search { "query": { "bool": { "must": { "match": { "quote.lowercase": "these droids are" } } } } }
"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 2.1837702, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "4", "_score": 2.137744, "_source": { "quote": "These droids are my father's father's machines." } } ]
POST /starwars_extended/_search { "query": { "bool": { "must": { "match": { "quote.shingle": "these droids are" } } } } }
"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "4", "_score": 3.1811738, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 2.6568544, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ]
Decompounding: common in German, Scandinavian languages, Finnish, Korean
PUT /decompound_en { "settings": { "number_of_shards": 1, "analysis": { "filter": { "british_decompounder": { "type": "hyphenation_decompounder", "hyphenation_patterns_path": "hyph/en_GB.xml", "word_list": [ "death", "star" ] } }, "analyzer": { "british_decompound": { "type": "custom", "tokenizer": "standard", "filter": [ "lowercase", "british_decompounder" ] } } } } }
GET /decompound_en/_analyze { "analyzer" : "british_decompound", "text" : "deathstar" }
{ "tokens": [ { "token": "deathstar", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 }, { "token": "death", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 }, { "token": "star", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 } ] }
German Dictionary (LGPL) https://github.com/uschindler/german-decompounder
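The idea of decompounding against a word list can be sketched in Python as a brute-force substring lookup. This is a simplification: the hyphenation_decompounder shown here additionally uses hyphenation patterns to find candidate split points, and the `decompound` helper and its parameters are hypothetical.

```python
def decompound(word, word_list, min_subword_size=2):
    """Return the original word plus every dictionary word found inside it."""
    tokens = [word]
    for start in range(len(word)):
        for end in range(start + min_subword_size, len(word) + 1):
            part = word[start:end]
            if part in word_list and part != word:
                tokens.append(part)
    return tokens

print(decompound("deathstar", {"death", "star"}))
```

Keeping the original compound as a token means that a search for the full word still matches, while the subwords make the document findable by its parts.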
PUT /decompound_de { "settings": { "number_of_shards": 1, "analysis": { "filter": { "german_decompounder": { "type": "hyphenation_decompounder", "word_list_path": "dictionary-de.txt", "hyphenation_patterns_path": "hyph/de_DR.xml", "only_longest_match": true, "min_subword_size": 4 }, "german_stemmer": { "type": "stemmer", "language": "light_german" } }, "analyzer": { "german_decompound": { "type": "custom", "tokenizer": "standard", "filter": [ "lowercase", "german_decompounder", "german_normalization", "german_stemmer" ] } } } } }
GET /decompound_de/_analyze { "analyzer" : "german_decompound", "text" : "Todesstern" }
{ "tokens": [ { "token": "todesst", "start_offset": 0, "end_offset": 10, "type": "<ALPHANUM>", "position": 0 }, { "token": "tod", "start_offset": 0, "end_offset": 10, "type": "<ALPHANUM>", "position": 0 }, { "token": "stern", "start_offset": 0, "end_offset": 10, "type": "<ALPHANUM>", "position": 0 } ] }
Without Word Lists https://github.com/jprante/elasticsearch-analysis-decompound
Indexing Formatting Tokenize Lowercase, Stop Words, Stemming Synonyms
Scoring Term Frequency Inverse Document Frequency Field-Length Norm Vector Space Model
Advanced Queries Highlighting Suggestions NGrams, Edge Grams Multiple Analyzers
Advanced Queries Reindex & Alias Update by Query Shingles Decompound
There is more...
Trainings https://training.elastic.co
Thank You! Questions? Philipp Krenn PS: Stickers @xeraa
Today’s applications are expected to provide powerful full-text search. But how does that work in general and how do I implement it on my site or in my application?
Actually, this is not as hard as it sounds at first. This talk covers:
The following code examples from the presentation can be tried out live.
Trying out various search features in Elasticsearch.