Full-Text Search Internals

A presentation at PHP CE in October 2018 in Prague, Czechia by Philipp Krenn

Slide 1

Full-Text Search Internals Philipp Krenn @xeraa

Slide 2

Developer

Slide 3

Slide 4

Slide 5

Who uses a Database?

Slide 6

Who uses Search?

Slide 7

Slide 8

Store

Slide 9

Apache Lucene Elasticsearch

Slide 10

Slide 11

Example These are <em>not</em> the droids you are looking for.

Slide 12

html_strip Char Filter These are not the droids you are looking for.

Slide 13

standard Tokenizer These are not the droids you are looking for

Slide 14

lowercase Token Filter these are not the droids you are looking for

Slide 15

stop Token Filter droids you looking

Slide 16

snowball Token Filter droid you look
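The chain on the slides above (char filter, tokenizer, lowercase, stop words, stemming) can be sketched in plain Python. This is a toy illustration, not Lucene's implementation: the regular expressions, the function names, and the suffix-stripping `toy_stem` (a crude stand-in for Snowball) are assumptions made for this example. The stop word list is Lucene's English default.

```python
import re

# Lucene's default English stop words.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def html_strip(text):
    """Char filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text):
    """Tokenizer: keep runs of letters and digits, split on everything else."""
    return re.findall(r"[A-Za-z0-9]+", text)


def toy_stem(token):
    """Crude suffix stripping; a stand-in for Snowball, not the real algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Run the whole chain in the same order as the slides."""
    tokens = tokenize(html_strip(text))
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [toy_stem(t) for t in tokens]


print(analyze("These are <em>not</em> the droids you are looking for."))
# ['droid', 'you', 'look']
```

The order matters: the stop word check runs on lowercased tokens, and stemming only sees what the stop filter let through.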

Slide 17

Setup

Slide 18

https://cloud.elastic.co

Slide 19

Slide 20

Slide 21

Docker Compose
---
version: '2'
services:
  kibana:
    image: docker.elastic.co/kibana/kibana:$ELASTIC_VERSION
    links:
      - elasticsearch
    ports:
      - 5601:5601
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:$ELASTIC_VERSION
    volumes:
      - esdata1:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
volumes:
  esdata1:
    driver: local

Slide 22

Analyze

Slide 23

GET /_analyze { "analyzer": "english", "text": "These are not the droids you are looking for." }

Slide 24

{ "tokens": [ { "token": "droid", "start_offset": 18, "end_offset": 24, "type": "<ALPHANUM>", "position": 4 }, { "token": "you", "start_offset": 25, "end_offset": 28, "type": "<ALPHANUM>", "position": 5 }, ... ] }

Slide 25

GET /_analyze { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "stop", "snowball" ], "text": "These are <em>not</em> the droids you are looking for." }

Slide 26

{ "tokens": [ { "token": "droid", "start_offset": 27, "end_offset": 33, "type": "<ALPHANUM>", "position": 4 }, { "token": "you", "start_offset": 34, "end_offset": 37, "type": "<ALPHANUM>", "position": 5 }, ... ] }

Slide 27

Stop Words
a an and are as at be but by for if in into is it no not of on or such that the their then there these they this to was will with
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L44-L50

Slide 28

Always Use Stop Words?

Slide 29

To be, or not to be.

Slide 30

Languages Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Turkish, Thai

Slide 31

Language Rules English: Philipp's → philipp French: l'église → eglis German: äußerst → ausserst

Slide 32

More Language Plugins Core: ICU (Asian languages), Kuromoji (advanced Japanese), Phonetic, SmartCN, Stempel (Polish), Ukrainian Community: Hebrew, Vietnamese, Network Address Analysis, String2Integer,...

Slide 33

German GET /_analyze { "analyzer": "german", "text": "Das sind nicht die Droiden, nach denen du suchst." }

Slide 34

{ "tokens": [ { "token": "droid", "start_offset": 19, "end_offset": 26, "type": "<ALPHANUM>", "position": 4 }, { "token": "den", "start_offset": 33, "end_offset": 38, "type": "<ALPHANUM>", "position": 6 }, { "token": "such", "start_offset": 42, "end_offset": 48, "type": "<ALPHANUM>", "position": 8 } ] }

Slide 35

German with the English Analyzer da sind nicht die droiden nach denen du suchst

Slide 36

German Stop Words https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/german_stop.txt

Slide 37

Detect Languages https://github.com/spinscale/elasticsearch-ingest-langdetect

Slide 38

PUT _ingest/pipeline/langdetect-pipeline { "description": "A pipeline to detect languages", "processors": [ { "langdetect" : { "field" : "quote", "target_field" : "language" } } ] }

Slide 39

POST _ingest/pipeline/langdetect-pipeline/_simulate { "docs": [ { "_source": { "quote": "Das sind nicht die Droiden, nach denen du suchst." } } ] }

Slide 40

{ "docs": [ { "doc": { "_index": "_index", "_type": "_type", "_id": "_id", "_source": { "language": "de", "quote": "Das sind nicht die Droiden, nach denen du suchst." }, "_ingest": { "timestamp": "2018-10-26T00:06:42.320613Z" } } } ] }

Slide 41

Phonetic GET /_analyze { "tokenizer": "standard", "filter": [ { "type": "phonetic", "encoder": "beider_morse", "languageset": "any" } ], "text": "These are not the droids you are looking for." }

Slide 42

Phonetic ... drDts drits drots loknk... iou ari ori

Slide 43

Another Example Obi-Wan never told you what happened to your father.

Slide 44

Another Example obi wan never told you what happen your father

Slide 45

Another Example <b>No</b>. I am your father.

Slide 46

Another Example i am your father

Slide 47

Inverted Index
      am    droid  father  happen  i     look  never  obi   told  wan   what  you   your
ID 1  0     1[4]   0       0       0     1[7]  0      0     0     0     0     1[5]  0
ID 2  0     0      1[9]    1[6]    0     0     1[2]   1[0]  1[3]  1[1]  1[5]  1[4]  1[8]
ID 3  1[2]  0      1[4]    0       1[1]  0     0      0     0     0     0     0     1[3]
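The inverted index on this slide maps every term to the documents that contain it, with the token position in brackets. A minimal Python sketch of that structure, assuming the analyzed tokens and positions from the slides as input (the names `DOCS`, `build_inverted_index`, and `INDEX` are made up for this example; Lucene's on-disk format is far more compact):

```python
from collections import defaultdict

# Analyzed tokens per document as (term, position) pairs, taken from the slides.
DOCS = {
    1: [("droid", 4), ("you", 5), ("look", 7)],
    2: [("obi", 0), ("wan", 1), ("never", 2), ("told", 3), ("you", 4),
        ("what", 5), ("happen", 6), ("your", 8), ("father", 9)],
    3: [("i", 1), ("am", 2), ("your", 3), ("father", 4)],
}


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, position in tokens:
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


INDEX = build_inverted_index(DOCS)

# Print the terms in sorted order, like the columns of the table.
for term in sorted(INDEX):
    print(term, INDEX[term])
```

A search for one term is then a single dictionary lookup, for example `INDEX["father"]`, instead of a scan over every document.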

Slide 48

To / The Index

Slide 49

PUT /starwars { "settings": { "number_of_shards": 1, "analysis": { "filter": { "my_synonym_filter": { "type": "synonym", "synonyms": [ "father,dad", "droid => droid,machine" ] } },

Slide 50

"analyzer": { "my_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "stop", "snowball", "my_synonym_filter" ] } } } },

Slide 51

"mappings": { "_doc": { "properties": { "quote": { "type": "text", "analyzer": "my_analyzer" } } } } }

Slide 52

Synonyms
Index time: synonym
Query time: synonym_graph
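The two rule styles in `my_synonym_filter` behave differently: `father,dad` makes both terms equivalent, while `droid => droid,machine` only rewrites the left-hand side. A rough Python sketch of that expansion follows. The helper names `parse_rules` and `expand` are assumptions for this example, and unlike the real filter, which emits synonyms at the same token position, it simply appends them to a flat list.

```python
# "father,dad": equivalent terms, each one expands to both.
# "droid => droid,machine": explicit mapping, only the left side is rewritten.
RULES = ["father,dad", "droid => droid,machine"]


def parse_rules(rules):
    """Turn synonym rules into a dict of term -> list of replacement terms."""
    mapping = {}
    for rule in rules:
        if "=>" in rule:
            left, right = rule.split("=>")
            targets = [t.strip() for t in right.split(",")]
            for source in left.split(","):
                mapping[source.strip()] = targets
        else:
            terms = [t.strip() for t in rule.split(",")]
            for term in terms:
                mapping[term] = terms
    return mapping


def expand(tokens, mapping):
    """Replace each token by its synonyms; unmapped tokens pass through."""
    expanded = []
    for token in tokens:
        expanded.extend(mapping.get(token, [token]))
    return expanded


MAPPING = parse_rules(RULES)
print(expand(["i", "am", "your", "father"], MAPPING))  # father -> father, dad
print(expand(["droid", "you", "look"], MAPPING))       # droid -> droid, machine
print(expand(["machine"], MAPPING))                    # machine is not rewritten
```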

Slide 53

GET /starwars/_mapping GET /starwars/_settings

Slide 54

PUT /starwars/_doc/1 { "quote": "These are <em>not</em> the droids you are looking for." } PUT /starwars/_doc/2 { "quote": "Obi-Wan never told you what happened to your father." } PUT /starwars/_doc/3 { "quote": "<b>No</b>. I am your father." }

Slide 55

GET /starwars/_doc/1 GET /starwars/_doc/1/_source

Slide 56

Multi Lingual
Index: PUT /starwars_en/_doc/1
Type
Field: { "quote_en": "...", "quote_de": "..." }

Slide 57

PS: Single Type per Index

Slide 58

Search

Slide 59

POST /starwars/_search { "query": { "match_all": { } } }

Slide 60

GET vs POST

Slide 61

{ "took": 1, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3, "max_score": 1, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 1, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, ...

Slide 62

POST /starwars/_search { "query": { "match": { "quote": "droid" } } }

Slide 63

{ "took": 2, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.39556286, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.39556286, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ] } }

Slide 64

POST /starwars/_search { "query": { "match": { "quote": "dad" } } }

Slide 65

... "hits": { "total": 2, "max_score": 0.41913947, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.41913947, "_source": { "quote": "<b>No</b>. I am your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.39291072, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] } }

Slide 66

POST /starwars/_doc/0/_explain { "query": { "match": { "quote": "dad" } } }

Slide 67

{ "_index": "starwars", "_type": "_doc", "_id": "0", "matched": false }

Slide 68

POST /starwars/_doc/1/_explain { "query": { "match": { "quote": "dad" } } }

Slide 69

{ "_index": "starwars", "_type": "_doc", "_id": "1", "matched": false, "explanation": { "value": 0, "description": "no matching term", "details": [] } }

Slide 70

POST /starwars/_doc/2/_explain { "query": { "match": { "quote": "dad" } } }

Slide 71

{ "_index": "starwars", "_type": "_doc", "_id": "2", "matched": true, "explanation": { ...

Slide 72

POST /starwars/_search { "query": { "match": { "quote": "machine" } } }

Slide 73

{ "took": 2, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 1, "max_score": 1.2499592, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.2499592, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ] } }

Slide 74

POST /starwars/_search { "query": { "match_phrase": { "quote": "I am your father" } } }

Slide 75

{ "took": 3, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1.5665855, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.5665855, "_source": { "quote": "<b>No</b>. I am your father." } } ] } }

Slide 76

POST /starwars/_search { "query": { "match_phrase": { "quote": { "query": "I am father", "slop": 1 } } } }

Slide 77

{ "took": 16, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.8327639, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.8327639, "_source": { "quote": "<b>No</b>. I am your father." } } ] } }

Slide 78

POST /starwars/_search { "query": { "match_phrase": { "quote": { "query": "I am not your father", "slop": 1 } } } }

Slide 79

{ "took": 5, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1.0409548, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.0409548, "_source": { "quote": "<b>No</b>. I am your father." } } ] } }
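Phrase queries rely on the positions stored in the inverted index. The sketch below is a simplification of how sloppy phrases are evaluated, not Lucene's actual algorithm: it shifts every document position by the term's position in the query and accepts the document when the shifted values differ by at most `slop`. The function name and the data layout are assumptions for this example.

```python
from itertools import product

# Positions of the analyzed terms in document 3 ("<b>No</b>. I am your father.").
DOC3 = {"i": [1], "am": [2], "your": [3], "father": [4]}


def phrase_match(doc_positions, query, slop=0):
    """query is a list of (term, query_position) pairs.

    A candidate matches when, after shifting every document position by its
    query position, the shifted values differ by at most `slop`.
    """
    position_lists = []
    for term, _ in query:
        if term not in doc_positions:
            return False
        position_lists.append(doc_positions[term])
    for candidate in product(*position_lists):
        shifted = [pos - qpos for pos, (_, qpos) in zip(candidate, query)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


# "I am your father": exact phrase.
print(phrase_match(DOC3, [("i", 0), ("am", 1), ("your", 2), ("father", 3)]))
# "I am father": fails as an exact phrase, matches with slop 1.
print(phrase_match(DOC3, [("i", 0), ("am", 1), ("father", 2)]))
print(phrase_match(DOC3, [("i", 0), ("am", 1), ("father", 2)], slop=1))
# "I am not your father": the stop word "not" is dropped but leaves a gap at position 2.
print(phrase_match(DOC3, [("i", 0), ("am", 1), ("your", 3), ("father", 4)], slop=1))
```

The last call shows why the query with "not" still finds the quote: the removed stop word keeps its position, so the remaining terms are only one step away from the indexed phrase.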

Slide 80

POST /starwars/_search { "query": { "match": { "quote": { "query": "van", "fuzziness": "AUTO" } } } }

Slide 81

{ "took": 14, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.18155496, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.18155496, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] } }

Slide 82

POST /starwars/_search { "query": { "match": { "quote": { "query": "ovi-van", "fuzziness": 1 } } } }

Slide 83

{ "took": 109, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.3798467, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.3798467, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] } }

Slide 84

FuzzyQuery History http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html Before: Brute force Now: Levenshtein Automaton

Slide 85

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

Slide 86

SELECT * FROM starwars WHERE quote LIKE "?an" OR quote LIKE "V?n" OR quote LIKE "Va?"
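The brute-force approach mentioned under "FuzzyQuery History" can be sketched as follows: compute the edit distance between the query and every indexed term. This illustrates the old strategy only; it is not the Levenshtein automaton that Lucene uses now. The function names are made up for this example, and the term list is the one from the inverted index slide.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def fuzzy_terms(query, terms, max_edits):
    """Brute force: compare the query against every indexed term."""
    return [t for t in terms if levenshtein(query, t) <= max_edits]


TERMS = ["am", "droid", "father", "happen", "i", "look", "never",
         "obi", "told", "wan", "what", "you", "your"]

print(fuzzy_terms("van", TERMS, 1))  # ['wan']
print(fuzzy_terms("ovi", TERMS, 1))  # ['obi']
```

The cost grows with the number of terms in the index, which is the problem the automaton-based approach addresses.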

Slide 87

Score

Slide 88

Term Frequency / Inverse Document Frequency (TF/IDF) Search one term

Slide 89

BM25
Default since Elasticsearch 5.0
https://speakerdeck.com/elastic/improved-text-scoring-with-bm25

Slide 90

Term Frequency

Slide 91

Slide 92

Inverse Document Frequency

Slide 93

Slide 94

Field-Length Norm
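The three factors named on the last slides (term frequency, inverse document frequency, field-length norm) can be written down in a few lines. The formulas below follow the classic TF/IDF similarity as described in Elasticsearch: The Definitive Guide. The Python function names and the sample numbers are assumptions for illustration, and real Lucene stores the norm in a lossy one-byte encoding.

```python
import math


def tf(term_freq):
    """Term frequency: more occurrences help, with diminishing returns."""
    return math.sqrt(term_freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rare terms weigh more than common ones."""
    return 1 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms):
    """Field-length norm: a match in a short field counts more."""
    return 1 / math.sqrt(num_terms)


# Hypothetical numbers, only to show the direction of each factor.
print(tf(1), tf(4))                                # 1.0 2.0
print(idf(1000, 1), idf(1000, 500))                # rare term vs common term
print(field_length_norm(4), field_length_norm(16)) # 0.5 0.25
```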

Slide 95

POST /starwars/_search?explain=true { "query": { "match": { "quote": "father" } } }

Slide 96

... "_explanation": { "value": 0.41913947, "description": "weight(Synonym(quote:dad quote:father) in 0) [PerFieldSimilarity], result of:", "details": [ { "value": 0.41913947, "description": "score(doc=0,freq=2.0 = termFreq=2.0\n), product of:", "details": [ { "value": 0.2876821, "description": "idf(docFreq=1, docCount=1)", "details": [] }, { "value": 1.4569536, "description": "tfNorm, computed from:", "details": [ { "value": 2, "description": "termFreq=2.0", "details": [] }, ...
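The numbers in this explanation can be reproduced with the BM25 formulas that Lucene used at the time (k1 = 1.2 and b = 0.75 by default). The sketch below recomputes `idf` and `tfNorm`. The slide cuts off the field-length details, so `field_length=4` and `avg_field_length=5` are hypothetical values, chosen here only because a ratio of 0.8 yields the `tfNorm` shown.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """idf as printed in the explain output."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """tfNorm: term frequency, saturated by k1 and length-normalized by b."""
    length_ratio = field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))


idf = bm25_idf(doc_freq=1, doc_count=1)
# field_length=4 and avg_field_length=5 are hypothetical; the slide elides them.
tf_norm = bm25_tf_norm(freq=2, field_length=4, avg_field_length=5)

print(idf)            # close to the 0.2876821 in the explanation
print(tf_norm)        # close to the 1.4569536 in the explanation
print(idf * tf_norm)  # close to the score 0.41913947
```

Note that `docCount=1` shows the statistics of a single shard rather than of the whole index.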

Slide 97

Score 0.41913947: i am your father 0.39291072: obi wan never told you what happen your father

Slide 98

Vector Space Model Search multiple terms

Slide 99

Search your father

Slide 100

Slide 101

Coordination Factor Reward multiple terms

Slide 102

Search for 3 terms 1 term: 2 terms: 3 terms:

Slide 103

Practical Scoring Function Putting it all together

Slide 104

score(q,d) = queryNorm(q) · coord(q,d) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

Slide 105

Function Score Script, weight, random, field value, decay (geo or date)

Slide 106

POST /starwars/_search { "query": { "function_score": { "query": { "match": { "quote": "father" } }, "random_score": {} } } }

Slide 107

Compare Scores "100% perfect" vs a "50%" match

Slide 108

Don't do this. Seriously. Stop trying to think about your problem this way, it's not going to end well.
https://wiki.apache.org/lucene-java/ScoresAsPercentages

Slide 109

GET /starwars/_analyze { "analyzer" : "my_analyzer", "text": "These are my father's machines." }

Slide 110

{ "tokens": [ { "token": "my", "start_offset": 10, "end_offset": 12, "type": "<ALPHANUM>", "position": 2 }, { "token": "father", "start_offset": 13, "end_offset": 21, "type": "<ALPHANUM>", "position": 3 }, { "token": "dad", "start_offset": 13, "end_offset": 21, "type": "SYNONYM", "position": 3 }, { "token": "machin", "start_offset": 22, "end_offset": 30, "type": "<ALPHANUM>", "position": 4 } ] }

Slide 111

PUT /starwars/_doc/4 { "quote": "These are my father's machines." }

Slide 112

POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }

Slide 113

"hits": { "total": 4, "max_score": 2.92523, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 2.92523, "_source": { "quote": "These are my father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.8617505, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...

Slide 114

2.92523 == 100%

Slide 115

DELETE /starwars/_doc/4 POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }

Slide 116

"hits": { "total": 3, "max_score": 1.2499592, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.2499592, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...

Slide 117

1.2499592 == 43% or 100%?

Slide 118

PUT /starwars/_doc/4 { "quote": "These droids are my father's father's machines." } POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }

Slide 119

"hits": { "total": 4, "max_score": 3.0068164, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 3.0068164, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.89701396, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...

Slide 120

3.0068164 == 103%?

Slide 121

Slide 122

PS: Shards Default? Effect on IDF?

Slide 123

Distributed Frequency Search GET starwars/_search?search_type=dfs_query_then_fetch { ... }

Slide 124

Don’t use dfs_query_then_fetch in production. It really isn’t required.
https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-is-broken.html

Slide 125

More Search

Slide 126

Highlighting

Slide 127

POST /starwars/_search { "query": { "match": { "quote": "father" } }, "highlight": { "type": "unified", "pre_tags": [ "<tag>" ], "post_tags": [ "</tag>" ], "fields": { "quote": {} } } }

Slide 128

... "hits": { "total": 3, "max_score": 0.631961, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 0.631961, "_source": { "quote": "These droids are my father's father's machines." }, "highlight": { "quote": [ "These droids are my <tag>father's</tag> <tag>father's</tag> machines." ] } }, ...

Slide 129

Boolean Queries must must_not should filter

Slide 130

POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father" } }, "should": [ { "match": { "quote": "your" } }, { "match": { "quote": "obi" } } ] } } }

Slide 131

... "hits": { "total": 3, "max_score": 2.117857, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 2.117857, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.3856719, "_source": { "quote": "<b>No</b>. I am your father." } }, ...

Slide 132

POST /starwars/_search { "query": { "bool": { "filter": { "match": { "quote": "father" } }, "should": [ { "match": { "quote": "your" } }, { "match": { "quote": "obi" } } ] } } }

Slide 133

... "hits": { "total": 3, "max_score": 1.6694657, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 1.6694657, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.8317767, "_source": { "quote": "<b>No</b>. I am your father." } },

Slide 134

Named Queries & minimum_should_match

Slide 135

POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father" } }, "should": [ { "match": { "quote": { "query": "your", "_name": "quote-your" } } }, { "match": { "quote": { "query": "obi", "_name": "quote-obi" } } }, { "match": { "quote": { "query": "droid", "_name": "quote-droid" } } } ], "minimum_should_match": 2 } } }

Slide 136

... "hits": { "total": 1, "max_score": 2.117857, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 2.117857, "_source": { "quote": "Obi-Wan never told you what happened to your father." }, "matched_queries": [ "quote-obi", "quote-your" ] } ] } }

Slide 137

Boosting
>1 increase, <1 decrease, <0 punish
<0 removed in 7.0

Slide 138

POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father" } }, "should": [ { "match": { "quote": "your" } }, { "match": { "quote": { "query": "obi", "boost": 3 } } } ] } } }

Slide 139

... "hits": { "total": 3, "max_score": 4.2368493, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 4.2368493, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.3856719, "_source": { "quote": "<b>No</b>. I am your father." } }, ...

Slide 140

Search for father but prefer father father

Slide 141

POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father father" } } } } }

Slide 142

... "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 1.263922, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.1077905, "_source": { "quote": "<b>No</b>. I am your father." } },

Slide 143

POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father" } }, "should": { "match_phrase": { "quote": "father father" } } } } }

Slide 144

... "hits": { "total": 3, "max_score": 9.146545, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 9.146545, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.0454913, "_source": { "quote": "<b>No</b>. I am your father." } }, ...

Slide 145

Suggestion
Suggest a similar text
_search end point
_suggest deprecated since 5.0

Slide 146

POST /starwars/_search { "query": { "match": { "quote": "drui" } }, "suggest": { "my_suggestion" : { "text" : "drui", "term" : { "field" : "quote" } } } }

Slide 147

... "hits": { "total": 0, "max_score": null, "hits": [] }, "suggest": { "my_suggestion": [ { "text": "drui", "offset": 0, "length": 4, "options": [ { "text": "droid", "score": 0.5, "freq": 1 } ] } ] } }

Slide 148

Multiple Suggesters term phrase completion context

Slide 149

NGram Partial matches Edge Gram

Slide 150

GET /_analyze { "char_filter": [ "html_strip" ], "tokenizer": { "type": "ngram", "min_gram": "3", "max_gram": "3", "token_chars": [ "letter" ] }, "filter": [ "lowercase" ], "text": "These are <em>not</em> the droids you are looking for." }

Slide 151

{ "tokens": [ { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 }, { "token": "hes", "start_offset": 1, "end_offset": 4, "type": "word", "position": 1 }, { "token": "ese", "start_offset": 2, "end_offset": 5, "type": "word", "position": 2 }, { "token": "are", "start_offset": 6, "end_offset": 9, "type": "word", "position": 3 }, ...

Slide 152

GET /_analyze { "char_filter": [ "html_strip" ], "tokenizer": { "type": "edge_ngram", "min_gram": "1", "max_gram": "3", "token_chars": [ "letter" ] }, "filter": [ "lowercase" ], "text": "These are <em>not</em> the droids you are looking for." }

Slide 153

{ "tokens": [ { "token": "t", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 }, { "token": "th", "start_offset": 0, "end_offset": 2, "type": "word", "position": 1 }, { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 2 }, { "token": "a", "start_offset": 6, "end_offset": 7, "type": "word", "position": 3 }, { "token": "ar", "start_offset": 6, "end_offset": 8, "type": "word", "position": 4 }, ...
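The difference between the two tokenizers is easiest to see side by side: an ngram tokenizer slides a window over every word, while an edge ngram tokenizer only keeps prefixes. A small Python sketch follows; the function names are assumptions for this example, and the letter-only regular expression stands in for `"token_chars": [ "letter" ]`.

```python
import re


def words(text):
    """Keep runs of letters only, lowercased."""
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text)]


def ngrams(word, min_gram, max_gram):
    """All substrings of length min_gram..max_gram, ordered by start offset."""
    grams = []
    for start in range(len(word)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(word):
                grams.append(word[start:start + size])
    return grams


def edge_ngrams(word, min_gram, max_gram):
    """Only the prefixes of length min_gram..max_gram."""
    return [word[:size] for size in range(min_gram, max_gram + 1) if size <= len(word)]


text = "These are not the droids"
print([g for w in words(text) for g in ngrams(w, 3, 3)])
print([g for w in words(text) for g in edge_ngrams(w, 1, 3)])
```

Trigrams allow matches anywhere inside a word, edge ngrams only at its start, which makes the latter a common choice for search-as-you-type.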

Slide 154

Combining Analyzers Reindex Store multiple times Tune BM25 Combine scores

Slide 155

BM25 Revisited

Slide 156

https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables

Slide 157

b: field length amplification (default 0.75)
k1: term frequency saturation (default 1.2)

Slide 158

PUT /starwars_v42 { "settings": { "number_of_shards": 1, "index": { "similarity": { "default": { "type": "BM25", "b": 0, "k1": 0 } } },

Slide 159

"analysis": { "filter": { "my_synonym_filter": { "type": "synonym", "synonyms": [ "father,dad", "droid => droid,machine" ] }, "my_ngram_filter": { "type": "ngram", "min_gram": "3", "max_gram": "3", "token_chars": [ "letter" ] } },

Slide 160

"analyzer": { "my_lowercase_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "whitespace", "filter": [ "lowercase" ] }, "my_full_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "stop", "snowball", "my_synonym_filter" ] },

Slide 161

"my_ngram_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "whitespace", "filter": [ "lowercase", "stop", "my_ngram_filter" ] } } } },

Slide 162

"mappings": { "_doc": { "properties": { "quote": { "type": "text", "fields": { "lowercase": { "type": "text", "analyzer": "my_lowercase_analyzer" }, "full": { "type": "text", "analyzer": "my_full_analyzer" }, "ngram": { "type": "text", "analyzer": "my_ngram_analyzer" } } } } } } }

Slide 163

POST /_reindex { "source": { "index": "starwars" }, "dest": { "index": "starwars_v42" } }

Slide 164

Aliases Atomic remove and add Point to multiple indices (read-only)

Slide 165

POST /_aliases { "actions": [ { "add": { "index": "starwars_v42", "alias": "starwars_extended" } } ] }

Slide 166

POST /starwars/_search { "query": { "match": { "quote": "droid" } } }

Slide 167

"hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 1.1533037, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.1295731, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ]

Slide 168

POST /starwars_extended/_search { "query": { "match": { "quote.full": "droid" } } }

Slide 169

"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 0.6931472, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "4", "_score": 0.6931472, "_source": { "quote": "These droids are my father's father's machines." } } ]

Slide 170

There are no "best" b and k1 values
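Setting `b` and `k1` to 0, as in `starwars_v42`, switches off both length normalization and term frequency, so only the IDF part of BM25 remains. The sketch below shows the effect with the BM25 formula from the explain output. "droid" occurs in 2 of the 4 indexed quotes; the field lengths are hypothetical and only there to show that they stop mattering.

```python
import math


def bm25(freq, doc_freq, doc_count, field_length, avg_field_length, k1, b):
    """Single-term BM25 score: idf times the normalized term frequency."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    length_ratio = field_length / avg_field_length
    tf_norm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))
    return idf * tf_norm


# Hypothetical field lengths: one short and one long document.
short_doc = bm25(freq=1, doc_freq=2, doc_count=4, field_length=5,
                 avg_field_length=8, k1=0, b=0)
long_doc = bm25(freq=1, doc_freq=2, doc_count=4, field_length=12,
                avg_field_length=8, k1=0, b=0)
print(short_doc, long_doc)  # both equal ln(2), about 0.6931472

# With the defaults the shorter field scores higher again.
print(bm25(1, 2, 4, 5, 8, k1=1.2, b=0.75), bm25(1, 2, 4, 12, 8, k1=1.2, b=0.75))
```

This matches the identical `_score` of 0.6931472 for both hits of the `quote.full` search on the alias.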

Slide 171

POST /starwars_extended/_search?explain=true { "query": { "multi_match": { "query": "obiwan", "fields": [ "quote", "quote.lowercase", "quote.full", "quote.ngram" ], "type": "most_fields" } } }

Slide 172

... "hits": { "total": 1, "max_score": 0.4912064, "hits": [ { "_shard": "[starwars_v42][2]", "_node": "BCDwzJ4WSw2dyoGLTzwlqw", "_index": "starwars_v42", "_type": "_doc", "_id": "2", "_score": 0.4912064, "_source": { "quote": "Obi-Wan never told you what happened to your father." }, ...

Slide 173

Whitespace Tokenizer "weight( Synonym(quote.ngram:biw quote.ngram:iwa quote.ngram:obi quote.ngram:wan) in 0) [PerFieldSimilarity], result of:"

Slide 174

POST /starwars_extended/_search { "query": { "multi_match": { "query": "you", "fields": [ "quote", "quote.lowercase^5", "quote.full", "quote.ngram" ], "type": "best_fields" } } }

Slide 175

"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 3.465736, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "2", "_score": 3.465736, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "3", "_score": 0.35667494, "_source": { "quote": "<b>No</b>. I am your father." } } ]

Slide 176

Multi Match Type
best_fields: Score of the best field (default)
cross_fields: All terms in at least one field
most_fields: Score sum of all fields
phrase

Slide 177

Different Analyzers for Indexing and Searching Per query In the mapping

Slide 178

POST /starwars_extended/_search { "query": { "match": { "quote.ngram": { "query": "the", "analyzer": "standard" } } } }

Slide 179

... "hits": [ { "_index": "starwars_extended", "_type": "_doc", "_id": "2", "_score": 0.38254172, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars_extended", "_type": "_doc", "_id": "3", "_score": 0.36165747, "_source": { "quote": "<b>No</b>. I am your father." } } ] ...

Slide 180

Edge Gram vs Trigram Test a setting before adding a field

Slide 181

Shingle Token Filter Shingles (token ngrams) from a token stream
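Shingles are ngrams on the token level: neighbouring tokens are joined into one term, so word order starts to count. A minimal Python sketch with shingles of size 2, using the lowercased tokens of quotes 1 and 4 (the function name is an assumption; the Elasticsearch filter also emits the single tokens by default, which this sketch leaves out):

```python
def shingles(tokens, size=2):
    """Join each run of `size` consecutive tokens into one term."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


query = ["these", "droids", "are"]
doc_1 = ["these", "are", "not", "the", "droids", "you", "are", "looking", "for"]
doc_4 = ["these", "droids", "are", "my", "father's", "father's", "machines"]

print(shingles(query))
print(set(shingles(query)) & set(shingles(doc_1)))  # shared shingles with quote 1
print(set(shingles(query)) & set(shingles(doc_4)))  # shared shingles with quote 4
```

Both quotes contain all three words, but only quote 4 shares the two-word shingles with the query, which is the extra signal a shingle field adds.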

Slide 182

POST /starwars_extended/_close PUT /starwars_extended/_settings { "index": { "similarity": { "default": { "type": "BM25", "b": null, "k1": null } } },

Slide 183

"analysis": { "filter": { "my_edgegram_filter": { "type": "edge_ngram", "min_gram": 3, "max_gram": 10 }, "my_shingle_filter": { "type": "shingle", "min_shingle_size": 2, "max_shingle_size": 2 } },

Slide 184

"analyzer": { "my_edgegram_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "my_edgegram_filter" ] },

Slide 185

"my_shingle_analyzer": { "type": "custom", "tokenizer": "standard", "filter": [ "lowercase", "my_shingle_filter" ] } } } }
POST /starwars_extended/_open

Slide 186

GET starwars_extended/_analyze { "text": "Father", "analyzer": "my_edgegram_analyzer" }

Slide 187

{ "tokens": [ { "token": "fat", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }, { "token": "fath", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }, { "token": "fathe", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }, { "token": "father", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 } ] }

Slide 188

PUT /starwars_extended/_doc/_mapping { "properties": { "quote": { "type": "text", "fields": { "edgegram": { "type": "text", "analyzer": "my_edgegram_analyzer", "search_analyzer": "standard" }, "shingle": { "type": "text", "analyzer": "my_shingle_analyzer" } } } } }

Slide 189

PUT /starwars_extended/_doc/5 { "quote": "I find your lack of faith disturbing." } PUT /starwars_extended/_doc/6 { "quote": "That... is your failure." }

Slide 190

GET /starwars_extended/_doc/5/_termvectors { "fields": [ "quote.edgegram" ], "offsets": true, "payloads": true, "positions": true, "term_statistics": true, "field_statistics": true }

Slide 191

{ "_index": "starwars_v42", "_type": "_doc", "_id": "5", "_version": 1, "found": true, "took": 3, "term_vectors": { "quote.edgegram": { "field_statistics": { "sum_doc_freq": 26, "doc_count": 2, "sum_ttf": 26 }, "terms": { "dis": { "doc_freq": 1, "ttf": 1, "term_freq": 1, "tokens": [ { "position": 6, "start_offset": 26, "end_offset": 36 } ] }, "dist": { "doc_freq": 1, "ttf": 1, ...

Slide 192

POST /starwars_extended/_search { "query": { "match": { "quote": "fail" } } }

Slide 193

POST /starwars_extended/_search { "query": { "match": { "quote.lowercase": "fail" } } }

Slide 194

POST /starwars_extended/_search { "query": { "match": { "quote.full": "fail" } } }

Slide 195

POST /starwars_extended/_search { "query": { "match": { "quote.ngram": "fail" } } }

Slide 196

"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "6", "_score": 1.8400999, "_source": { "quote": "That... is your failure." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "5", "_score": 1.442779, "_source": { "quote": "I find your lack of faith disturbing." } } ]

Slide 197

POST /starwars_extended/_search { "query": { "match": { "quote.edgegram": "fail" } } }

Slide 198

"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "6", "_score": 1.0114291, "_source": { "quote": "That... is your failure." } } ]

Slide 199

Updating Missing Fields Expensive

Slide 200

POST /starwars_extended/_update_by_query { "query": { "bool": { "must_not": { "exists": { "field": "quote.edgegram" } } } } }

Slide 201

Shingles: Context Should Matter

Slide 202

POST /starwars_extended/_search { "query": { "bool": { "must": { "match": { "quote.lowercase": "these droids are" } } } } }

Slide 203

"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 2.1837702, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "4", "_score": 2.137744, "_source": { "quote": "These droids are my father's father's machines." } } ]

Slide 204

POST /starwars_extended/_search { "query": { "bool": { "must": { "match": { "quote.shingle": "these droids are" } } } } }

Slide 205

"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "4", "_score": 3.1811738, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 2.6568544, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ]

Slide 206

Decompounding Common in German, Scandinavian languages, Finnish, Korean

Slide 207

PUT /decompound_en { "settings": { "number_of_shards": 1, "analysis": { "filter": { "british_decompounder": { "type": "hyphenation_decompounder", "hyphenation_patterns_path": "hyph/en_GB.xml", "word_list": [ "death", "star" ] } }, "analyzer": { "british_decompound": { "type": "custom", "tokenizer": "standard", "filter": [ "lowercase", "british_decompounder" ] } } } } }

Slide 208

GET /decompound_en/_analyze { "analyzer" : "british_decompound", "text" : "deathstar" }

Slide 209

{ "tokens": [ { "token": "deathstar", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 }, { "token": "death", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 }, { "token": "star", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 } ] }
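The idea behind the output above can be sketched with a brute-force dictionary lookup: keep the original token and add every word from the word list that occurs inside it. This is closer in spirit to a plain dictionary decompounder than to the `hyphenation_decompounder` on the slides, which additionally restricts the candidate splits to hyphenation points. The function name and the `min_subword_size` default are assumptions for this example.

```python
def decompound(token, word_list, min_subword_size=2):
    """Return the token plus every dictionary word found inside it."""
    found = [token]
    for start in range(len(token)):
        for end in range(start + min_subword_size, len(token) + 1):
            part = token[start:end]
            if part in word_list and part != token:
                found.append(part)
    return found


print(decompound("deathstar", {"death", "star"}))
# ['deathstar', 'death', 'star']
```

Keeping the compound itself means a search for "deathstar" still matches directly, while "death" or "star" match through the added parts.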

Slide 210

German Dictionary (LGPL) https://github.com/uschindler/german-decompounder

Slide 211

PUT /decompound_de { "settings": { "number_of_shards": 1, "analysis": { "filter": { "german_decompounder": { "type": "hyphenation_decompounder", "word_list_path": "dictionary-de.txt", "hyphenation_patterns_path": "hyph/de_DR.xml", "only_longest_match": true, "min_subword_size": 4 }, "german_stemmer": { "type": "stemmer", "language": "light_german" } }, "analyzer": { "german_decompound": { "type": "custom", "tokenizer": "standard", "filter": [ "lowercase", "german_decompounder", "german_normalization", "german_stemmer" ] } } } } }

Slide 212

GET /decompound_de/_analyze { "analyzer" : "german_decompound", "text" : "Todesstern" }

Slide 213

{ "tokens": [ { "token": "todesst", "start_offset": 0, "end_offset": 10, "type": "<ALPHANUM>", "position": 0 }, { "token": "tod", "start_offset": 0, "end_offset": 10, "type": "<ALPHANUM>", "position": 0 }, { "token": "stern", "start_offset": 0, "end_offset": 10, "type": "<ALPHANUM>", "position": 0 } ] }

Slide 214

Without Word Lists https://github.com/jprante/elasticsearch-analysis-decompound

Slide 215

Performance

Slide 216

Slide 217

Slide 218

Conclusion

Slide 219

Indexing Formatting Tokenize Lowercase, Stop Words, Stemming Synonyms

Slide 220

Scoring Term Frequency Inverse Document Frequency Field-Length Norm Vector Space Model

Slide 221

Advanced Queries Highlighting Suggestions NGrams, Edge Grams Multiple Analyzers

Slide 222

Advanced Queries Reindex & Alias Update by Query Shingles Decompound

Slide 223

There is more...

Slide 224

Slide 225

Slide 226

Slide 227

Trainings https://training.elastic.co

Slide 228

Thank You! Questions? Philipp Krenn PS: Stickers @xeraa