Powerful Full-Text Search with Elasticsearch Philipp Krenn @xeraa

Who uses a Database?

Who uses Search?

Developer

Question https://sli.do/xeraa Answer https://twitter.com/xeraa

Store

Apache Lucene Elasticsearch

Example These are <em>not</em> the droids you are looking for.

html_strip Char Filter These are not the droids you are looking for.

standard Tokenizer These are not the droids you looking for are

lowercase Token Filter these are not the droids looking for you are

stop Token Filter droids you looking

snowball Token Filter droid you look

Setup

https://cloud.elastic.co

Docker Compose —version: ‘2’ services: kibana: image: docker.elastic.co/kibana/kibana:$ELASTIC_VERSION links: - elasticsearch ports: - 5601:5601 elasticsearch: image: docker.elastic.co/elasticsearch/elasticsearch:$ELASTIC_VERSION volumes: - esdata1:/usr/share/elasticsearch/data ports: - 9200:9200 volumes: esdata1: driver: local

Analyze

GET /_analyze { “analyzer”: “english”, “text”: “These are not the droids you are looking for.” }

{ } “tokens”: [ { “token”: “droid”, “start_offset”: 18, “end_offset”: 24, “type”: “<ALPHANUM>”, “position”: 4 }, { “token”: “you”, “start_offset”: 25, “end_offset”: 28, “type”: “<ALPHANUM>”, “position”: 5 }, … ]

GET /_analyze { “char_filter”: [ “html_strip” ], “tokenizer”: “standard”, “filter”: [ “lowercase”, “stop”, “snowball” ], “text”: “These are <em>not</em> the droids you are looking for.” }

{ } “tokens”: [ { “token”: “droid”, “start_offset”: 27, “end_offset”: 33, “type”: “<ALPHANUM>”, “position”: 4 }, { “token”: “you”, “start_offset”: 34, “end_offset”: 37, “type”: “<ALPHANUM>”, “position”: 5 }, … ]

Stop Words a an and are as at be but by for if in into is it no not of on or such that the their then there these they this to was will with https://github.com/apache/lucene-solr/blob/master/lucene/ analysis/common/src/java/org/apache/lucene/analysis/en/ EnglishAnalyzer.java#L45-L52

Always Use Stop Words?

To be, or not to be.

Languages Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Turkish, Thai

Language Rules English: Philipp’s → philipp French: l’église → eglis German: äußerst → ausserst

More Language Plugins Core: ICU (Asian languages), Kuromoji (advanced Japanese), Phonetic, SmartCN, Stempel (Polish), Ukrainian Community: Hebrew, Vietnamese, Network Address Analysis, String2Integer,…

Polish To nie są droidy, których szukasz.

GET /_analyze { “analyzer”: “polish”, “text”: “To nie są droidy, których szukasz.” }

Polish droid szukać

Polish with the English Analyzer nie są droidi których szukasz

Polish Stop Words https://github.com/apache/lucene-solr/blob/master/lucene/ analysis/stempel/src/resources/org/apache/lucene/analysis/ pl/stopwords.txt

Detect Languages https://github.com/spinscale/ elasticsearch-ingest-langdetect

PUT /_ingest/pipeline/langdetect-pipeline { “description”: “A pipeline to detect languages”, “processors”: [ { “langdetect”: { “field”: “quote”, “target_field”: “language” } } ] }

GET /_ingest/pipeline/langdetect-pipeline/_simulate { “docs”: [ { “_source”: { “quote”: “To nie są droidy, których szukasz.” } } ] }

{ } “docs” : [ { “doc” : { “_index” : “_index”, “_type” : “_type”, “_id” : “_id”, “_source” : { “language” : “pl”, “quote” : “To nie są droidy, których szukasz.” }, “_ingest” : { “timestamp” : “2019-03-21T12:22:35.118896Z” } } } ]

Another Example Obi-Wan never told you what happened to your father.

Another Example obi wan never told you what happen your father

Another Example <b>No</b>. I am your father.

Another Example i am your father

Inverted Index am droid father happen i look never obi told wan what you your ID 1 0 1[4] 0 0 0 1[7] 0 0 0 0 0 1[5] 0 ID 2 0 0 1[9] 1[6] 0 0 1[2] 1[0] 1[3] 1[1] 1[5] 1[4] 1[8] ID 3 1[2] 0 1[4] 0 1[1] 0 0 0 0 0 0 0 1[3]

To / The Index

PUT /starwars { “settings”: { “number_of_shards”: 1, “analysis”: { “filter”: { “my_synonym_filter”: { “type”: “synonym”, “synonyms”: [ “father,dad”, “droid => droid,machine” ] } },

}, } “analyzer”: { “my_analyzer”: { “char_filter”: [ “html_strip” ], “tokenizer”: “standard”, “filter”: [ “lowercase”, “stop”, “snowball”, “my_synonym_filter” ] } }

} “mappings”: { “_doc”: { “properties”: { “quote”: { “type”: “text”, “analyzer”: “my_analyzer” } } } }

Synonyms Index synonym or query time synonym_graph

GET /starwars/_mapping GET /starwars/_settings

PUT /starwars/_doc/1 { “quote”: “These are <em>not</em> the droids you are looking for.” } PUT /starwars/_doc/2 { “quote”: “Obi-Wan never told you what happened to your father.” } PUT /starwars/_doc/3 { “quote”: “<b>No</b>. I am your father.” }

GET /starwars/_doc/1 GET /starwars/_doc/1/_source

Multi Lingual Index PUT /starwars_en/_doc/1 Field { “quote_en”: “…”, “quote_de”: “…” }

Search

POST /starwars/_search { “query”: { “match_all”: { } } }

{ “took”: 1, “timed_out”: false, “_shards”: { “total”: 5, “successful”: 5, “failed”: 0 }, “hits”: { “total”: 3, “max_score”: 1, “hits”: [ { “_index”: “starwars”, “_type”: “_doc”, “_id”: “2”, “_score”: 1, “_source”: { “quote”: “Obi-Wan never told you what happened to your father.” } }, …

POST /starwars/_search { “query”: { “match”: { “quote”: “droid” } } }

{ } “took”: 2, “timed_out”: false, “_shards”: { “total”: 5, “successful”: 5, “failed”: 0 }, “hits”: { “total”: 1, “max_score”: 0.39556286, “hits”: [ { “_index”: “starwars”, “_type”: “_doc”, “_id”: “1”, “_score”: 0.39556286, “_source”: { “quote”: “These are <em>not</em> the droids you are looking for.” } } ] }

POST /starwars/_search { “query”: { “match”: { “quote”: “dad” } } }

… “hits”: { “total”: 2, “max_score”: 0.41913947, “hits”: [ { “_index”: “starwars”, “_type”: “_doc”, “_id”: “3”, “_score”: 0.41913947, “_source”: { “quote”: “<b>No</b>. I am your father.” } }, { “_index”: “starwars”, “_type”: “_doc”, “_id”: “2”, “_score”: 0.39291072, “_source”: { “quote”: “Obi-Wan never told you what happened to your father.” } } ] } }

POST /starwars/_doc/0/_explain { “query”: { “match”: { “quote”: “dad” } } }

{ } “_index”: “starwars”, “_type”: “_doc”, “_id”: “0”, “matched”: false

POST /starwars/_doc/1/_explain { “query”: { “match”: { “quote”: “dad” } } }

{ } “_index”: “starwars”, “_type”: “_doc”, “_id”: “1”, “matched”: false, “explanation”: { “value”: 0, “description”: “no matching term”, “details”: [] }

POST /starwars/_doc/2/_explain { “query”: { “match”: { “quote”: “dad” } } }

{ “_index”: “starwars”, “_type”: “_doc”, “_id”: “2”, “matched”: true, “explanation”: { …

POST /starwars/_search { “query”: { “match”: { “quote”: “machine” } } }

{ } “took”: 2, “timed_out”: false, “_shards”: { “total”: 1, “successful”: 1, “skipped”: 0, “failed”: 0 }, “hits”: { “total”: 1, “max_score”: 1.2499592, “hits”: [ { “_index”: “starwars”, “_type”: “_doc”, “_id”: “1”, “_score”: 1.2499592, “_source”: { “quote”: “These are <em>not</em> the droids you are looking for.” } } ] }

POST /starwars/_search { “query”: { “match_phrase”: { “quote”: “I am your father” } } }

{ } “took”: 3, “timed_out”: false, “_shards”: { “total”: 5, “successful”: 5, “failed”: 0 }, “hits”: { “total”: 1, “max_score”: 1.5665855, “hits”: [ { “_index”: “starwars”, “_type”: “_doc”, “_id”: “3”, “_score”: 1.5665855, “_source”: { “quote”: “<b>No</b>. I am your father.” } } ] }

POST /starwars/_search { “query”: { “match_phrase”: { “quote”: { “query”: “I am father”, “slop”: 1 } } } }

{ } “took”: 16, “timed_out”: false, “_shards”: { “total”: 5, “successful”: 5, “failed”: 0 }, “hits”: { “total”: 1, “max_score”: 0.8327639, “hits”: [ { “_index”: “starwars”, “_type”: “_doc”, “_id”: “3”, “_score”: 0.8327639, “_source”: { “quote”: “<b>No</b>. I am your father.” } } ] }

POST /starwars/_search { “query”: { “match_phrase”: { “quote”: { “query”: “I am not your father”, “slop”: 1 } } } }

{ } “took”: 5, “timed_out”: false, “_shards”: { “total”: 5, “successful”: 5, “failed”: 0 }, “hits”: { “total”: 1, “max_score”: 1.0409548, “hits”: [ { “_index”: “starwars”, “_type”: “_doc”, “_id”: “3”, “_score”: 1.0409548, “_source”: { “quote”: “<b>No</b>. I am your father.” } } ] }

POST /starwars/_search { “query”: { “match”: { “quote”: { “query”: “van”, “fuzziness”: “AUTO” } } } }

{ } “took”: 14, “timed_out”: false, “_shards”: { “total”: 5, “successful”: 5, “failed”: 0 }, “hits”: { “total”: 1, “max_score”: 0.18155496, “hits”: [ { “_index”: “starwars”, “_type”: “_doc”, “_id”: “2”, “_score”: 0.18155496, “_source”: { “quote”: “Obi-Wan never told you what happened to your father.” } } ] }

POST /starwars/_search { “query”: { “match”: { “quote”: { “query”: “ovi-van”, “fuzziness”: 1 } } } }

{ } “took”: 109, “timed_out”: false, “_shards”: { “total”: 5, “successful”: 5, “failed”: 0 }, “hits”: { “total”: 1, “max_score”: 0.3798467, “hits”: [ { “_index”: “starwars”, “_type”: “_doc”, “_id”: “2”, “_score”: 0.3798467, “_source”: { “quote”: “Obi-Wan never told you what happened to your father.” } } ] }

FuzzyQuery History http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html Before: Brute force Now: Levenshtein Automaton

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

SELECT * FROM starwars WHERE quote LIKE “?an” OR quote LIKE “V?n” OR quote LIKE “Va?”

Score

Term Frequency / Inverse Document Frequency (TF/IDF) Search one term

BM25 Default in Elasticsearch 5.0 https://speakerdeck.com/elastic/improved-text-scoring-withbm25

Term Frequency

Inverse Document Frequency

Field-Length Norm

POST /starwars/_search?explain=true { “query”: { “match”: { “quote”: “father” } } }

… “_explanation”: { “value”: 0.41913947, “description”: “weight(Synonym(quote:dad quote:father) in 0) [PerFieldSimilarity], result of:”, “details”: [ { “value”: 0.41913947, “description”: “score(doc=0,freq=2.0 = termFreq=2.0\n), product of:”, “details”: [ { “value”: 0.2876821, “description”: “idf(docFreq=1, docCount=1)”, “details”: [] }, { “value”: 1.4569536, “description”: “tfNorm, computed from:”, “details”: [ { “value”: 2, “description”: “termFreq=2.0”, “details”: [] }, …

Score 0.41913947: i am your father 0.39291072: obi wan never told what happen your father you

Vector Space Model Search multiple terms

Search your father

Coordination Factor Reward multiple terms

Search for 3 terms 1 term: 2 terms: 3 terms:

Practical Scoring Function Putting it all together

score(q,d) = queryNorm(q) · coord(q,d) · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ) (t in q)

Function Score Script, weight, random, field value, decay (geo or date)

POST /starwars/_search { “query”: { “function_score”: { “query”: { “match”: { “quote”: “father” } }, “random_score”: {} } } }

Compare Scores “100% perfect” vs a “50%” match

Don’t do this. Seriously. Stop trying to think about your problem this way, it’s not going to end well. — https://wiki.apache.org/lucene-java/ ScoresAsPercentages

GET /starwars/_analyze { “analyzer” : “my_analyzer”, “text”: “These are my father’s machines.” }

{ “tokens”: [ { “token”: “my”, “start_offset”: 10, “end_offset”: 12, “type”: “<ALPHANUM>”, “position”: 2 }, { “token”: “father”, “start_offset”: 13, “end_offset”: 21, “type”: “<ALPHANUM>”, “position”: 3 }, { “token”: “dad”, “start_offset”: 13, “end_offset”: 21, “type”: “SYNONYM”, “position”: 3 }, { “token”: “machin”, “start_offset”: 22, “end_offset”: 30, “type”: “<ALPHANUM>”, “position”: 4 } ] }

PUT /starwars/_doc/4 { “quote”: “These are my father’s machines.” }

POST /starwars/_search { “query”: { “match”: { “quote”: “my father machine” } } }

“hits”: { “total”: 4, “max_score”: 2.92523, “hits”: [ { “_index”: “starwars”, “_type”: “_doc”, “_id”: “4”, “_score”: 2.92523, “_source”: { “quote”: “These are my father’s machines.” } }, { “_index”: “starwars”, “_type”: “_doc”, “_id”: “1”, “_score”: 0.8617505, “_source”: { “quote”: “These are <em>not</em> the droids you are looking for.” } }, …

2.92523 == 100%

DELETE /starwars/_doc/4 POST /starwars/_search { “query”: { “match”: { “quote”: “my father machine” } } }

“hits”: { “total”: 3, “max_score”: 1.2499592, “hits”: [ { “_index”: “starwars”, “_type”: “_doc”, “_id”: “1”, “_score”: 1.2499592, “_source”: { “quote”: “These are <em>not</em> the droids you are looking for.” } }, …

1.2499592 == 43% or 100%?

PUT /starwars/_doc/4 { “quote”: “These droids are my father’s father’s machines.” } POST /starwars/_search { “query”: { “match”: { “quote”: “my father machine” } } }

“hits”: { “total”: 4, “max_score”: 3.0068164, “hits”: [ { “_index”: “starwars”, “_type”: “_doc”, “_id”: “4”, “_score”: 3.0068164, “_source”: { “quote”: “These droids are my father’s father’s machines.” } }, { “_index”: “starwars”, “_type”: “_doc”, “_id”: “1”, “_score”: 0.89701396, “_source”: { “quote”: “These are <em>not</em> the droids you are looking for.” } }, …

3.0068164 == 103%?

More Search

Highlighting

POST /starwars/_search { “query”: { “match”: { “quote”: “father” } }, “highlight”: { “type”: “unified”, “pre_tags”: [ “<tag>” ], “post_tags”: [ “</tag>” ], “fields”: { “quote”: {} } } }

… “hits”: { “total”: 3, “max_score”: 0.631961, “hits”: [ { “_index”: “starwars”, “_type”: “_doc”, “_id”: “4”, “_score”: 0.631961, “_source”: { “quote”: “These droids are my father’s father’s machines.” }, “highlight”: { “quote”: [ “These droids are my <tag>father’s</tag> <tag>father’s</tag> machines.” ] } }, …

Boolean Queries must must_not should filter

POST /starwars/_search { “query”: { “bool”: { “must”: { “match”: { “quote”: “father” } }, “should”: [ { “match”: { “quote”: “your” } }, { “match”: { “quote”: “obi” } } ] } } }

… “hits”: { “total”: 3, “max_score”: 2.117857, “hits”: [ { “_index”: “starwars”, “_type”: “_doc”, “_id”: “2”, “_score”: 2.117857, “_source”: { “quote”: “Obi-Wan never told you what happened to your father.” } }, { “_index”: “starwars”, “_type”: “_doc”, “_id”: “3”, “_score”: 1.3856719, “_source”: { “quote”: “<b>No</b>. I am your father.” } }, …

POST /starwars/_search { “query”: { “bool”: { “filter”: { “match”: { “quote”: “father” } }, “should”: [ { “match”: { “quote”: “your” } }, { “match”: { “quote”: “obi” } } ] } } }

… “hits”: { “total”: 3, “max_score”: 1.6694657, “hits”: [ { “_index”: “starwars”, “_type”: “_doc”, “_id”: “2”, “_score”: 1.6694657, “_source”: { “quote”: “Obi-Wan never told you what happened to your father.” } }, { “_index”: “starwars”, “_type”: “_doc”, “_id”: “3”, “_score”: 0.8317767, “_source”: { “quote”: “<b>No</b>. I am your father.” } },

Named Queries & minimum_should_match

POST /starwars/_search { “query”: { “bool”: { “must”: { “match”: { “quote”: “father” } }, “should”: [ { “match”: { “quote”: { “query”: “your”, “_name”: “quote-your” } } }, { “match”: { “quote”: { “query”: “obi”, “_name”: “quote-obi” } } }, { “match”: { “quote”: { “query”: “droid”, “_name”: “quote-droid” } } } ], “minimum_should_match”: 2 } } }

… “hits”: { “total”: 1, “max_score”: 2.117857, “hits”: [ { “_index”: “starwars”, “_type”: “_doc”, “_id”: “2”, “_score”: 2.117857, “_source”: { “quote”: “Obi-Wan never told you what happened to your father.” }, “matched_queries”: [ “quote-obi”, “quote-your” ] } ] } }

Boosting >1 increase, <1 decrease, <0 punish <0 removed in 7.0

POST /starwars/_search { “query”: { “bool”: { “must”: { “match”: { “quote”: “father” } }, “should”: [ { “match”: { “quote”: “your” } }, { “match”: { “quote”: { “query”: “obi”, “boost”: 3 } } } ] } } }

… “hits”: { “total”: 3, “max_score”: 4.2368493, “hits”: [ { “_index”: “starwars”, “_type”: “_doc”, “_id”: “2”, “_score”: 4.2368493, “_source”: { “quote”: “Obi-Wan never told you what happened to your father.” } }, { “_index”: “starwars”, “_type”: “_doc”, “_id”: “3”, “_score”: 1.3856719, “_source”: { “quote”: “<b>No</b>. I am your father.” } }, …

Performance

Conclusion

Indexing Formatting Tokenize Lowercase, Stop Words, Stemming Synonyms

Scoring Term Frequency Inverse Document Frequency Field-Length Norm Vector Space Model

Advanced Queries Highlighting Boolean Boost Performance

https://www.elastic.co/blog/elastic-app-search-beta-released

Thank You Questions & Stickers Philipp Krenn @xeraa