Elastic Full-Text Search

A presentation at KohaCon19 in May 2019 in Dublin, Ireland by Philipp Krenn

Slide 1

Elastic Full-Text Search Philipp Krenn @xeraa

Slide 2

Developer

Slide 3

Slide 4

Store

Slide 5

Apache Lucene Elasticsearch

Slide 6

Slide 7

https://cloud.elastic.co

Slide 8

Slide 9

Slide 10

---
version: '2'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:$ELASTIC_VERSION
    environment:
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - discovery.type=single-node
    ulimits:
      memlock:
        soft: -1
        hard: -1
    mem_limit: 1g
    volumes:
      - esdata1:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
  kibana:
    image: docker.elastic.co/kibana/kibana:$ELASTIC_VERSION
    links:
      - elasticsearch
    ports:
      - 5601:5601
volumes:
  esdata1:
    driver: local

Slide 11

Slide 12

Example These are <em>not</em> the droids you are looking for.

Slide 13

html_strip Char Filter These are not the droids you are looking for.

Slide 14

standard Tokenizer These are not the droids you are looking for

Slide 15

lowercase Token Filter these are not the droids you are looking for

Slide 16

stop Token Filter droids you looking

Slide 17

snowball Token Filter droid you look
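
The chain of char filter, tokenizer and token filters can be imitated in a few lines of Python. This is only a sketch: `tokenize` is a rough stand-in for the `standard` tokenizer, and `toy_stem` is a hypothetical two-rule stemmer that is not the Snowball algorithm. The stop word list is the default English list used by Lucene's StandardAnalyzer.

```python
import re

# Default English stop words of Lucene's StandardAnalyzer.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def html_strip(text):
    """Char filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text):
    """Tokenizer: keep runs of letters and digits (rough stand-in for `standard`)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def toy_stem(token):
    """Hypothetical stemmer: strips 'ing' or a trailing 's'. NOT Snowball."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text):
    tokens = tokenize(html_strip(text))                   # char filter, then tokenizer
    tokens = [t.lower() for t in tokens]                  # lowercase token filter
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop token filter
    return [toy_stem(t) for t in tokens]                  # stemming token filter


print(analyze("These are <em>not</em> the droids you are looking for."))
# ['droid', 'you', 'look']
```

The order matters: lowercasing has to run before the stop filter, otherwise "These" would not match the lowercase stop word "these".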

Slide 18

Analyze

Slide 19

GET /_analyze
{
  "analyzer": "english",
  "text": "These are not the droids you are looking for."
}

Slide 20

{
  "tokens": [
    { "token": "droid", "start_offset": 18, "end_offset": 24, "type": "<ALPHANUM>", "position": 4 },
    { "token": "you", "start_offset": 25, "end_offset": 28, "type": "<ALPHANUM>", "position": 5 },
    …
  ]
}

Slide 21

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop", "snowball" ],
  "text": "These are <em>not</em> the droids you are looking for."
}

Slide 22

{
  "tokens": [
    { "token": "droid", "start_offset": 27, "end_offset": 33, "type": "<ALPHANUM>", "position": 4 },
    { "token": "you", "start_offset": 34, "end_offset": 37, "type": "<ALPHANUM>", "position": 5 },
    …
  ]
}

Slide 23

Stop Words
a an and are as at be but by for if in into is it no not of on or such that the their then there these they this to was will with
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardAnalyzer.java#L44-L50

Slide 24

Always Use Stop Words?

Slide 25

To be, or not to be.

Slide 26

Languages Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Turkish, Thai

Slide 27

More Language Plugins Core: ICU (Asian languages), Kuromoji (advanced Japanese), Phonetic, SmartCN, Stempel (better Polish stemming), Ukrainian (stemming) Community: Hebrew, Vietnamese, Network Address Analysis, String2Integer,…

Slide 28

Language Rules English: Philipp’s → philipp French: l’église → eglis German: äußerst → ausserst

Slide 29

German Das sind nicht die Droiden nach denen du suchst.

Slide 30

German droid den such

Slide 31

German with the English Analyzer

Slide 32

Another Example Obi-Wan never told you what happened to your father.

Slide 33

Another Example obi wan never told you what happen your father

Slide 34

Another Example <b>No</b>. I am your father.

Slide 35

Another Example i am your father

Slide 36

Inverted Index
      am      droid   father  happen  i       look    never   obi     told    wan     what    you     your
ID 1  0       1[4]    0       0       0       1[7]    0       0       0       0       0       1[5]    0
ID 2  0       0       1[9]    1[6]    0       0       1[2]    1[0]    1[3]    1[1]    1[5]    1[4]    1[8]
ID 3  1[2]    0       1[4]    0       1[1]    0       0       0       0       0       0       0       1[3]
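
The table maps each term to the documents that contain it, with the token position in brackets. A minimal Python sketch of building such an index follows. The token lists are the analyzed quotes from the earlier slides, and the `None` placeholders are an assumption of this sketch to keep positions aligned where stop words were removed.

```python
from collections import defaultdict

# Analyzed tokens per document. None marks a position whose token was
# removed by the stop filter, so the surviving positions stay unchanged.
docs = {
    1: [None, None, None, None, "droid", "you", None, "look", None],
    2: ["obi", "wan", "never", "told", "you", "what", "happen", None, "your", "father"],
    3: [None, "i", "am", "your", "father"],
}


def build_inverted_index(docs):
    """Return term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            if token is not None:
                index[token].setdefault(doc_id, []).append(position)
    return index


index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])
```

Looking up `index["father"]` returns `{2: [9], 3: [4]}`, the same entries as the `father` column of the table.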

Slide 37

To / The Index

Slide 38

PUT /starwars
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [ "father,dad", "droid => droid,machine" ]
        }
      },

Slide 39

      "analyzer": {
        "my_analyzer": {
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop", "snowball", "my_synonym_filter" ]
        }
      }
    }
  },

Slide 40

  "mappings": {
    "properties": {
      "quote": { "type": "text", "analyzer": "my_analyzer" }
    }
  }
}

Slide 41

Synonyms: at index time (synonym) or at query time (synonym_graph)
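
The two rule styles in `my_synonym_filter` behave differently: `father,dad` treats both terms as equivalent, while `droid => droid,machine` only rewrites the left side. A small Python sketch of that expansion follows. The `EQUIVALENT` and `EXPLICIT` tables and the `expand` function are hypothetical names for illustration, not part of Elasticsearch.

```python
# Hypothetical rule tables mirroring the two rules of my_synonym_filter.
EQUIVALENT = [{"father", "dad"}]            # "father,dad"
EXPLICIT = {"droid": ["droid", "machine"]}  # "droid => droid,machine"


def expand(token):
    """Return the tokens emitted at the position of `token`."""
    if token in EXPLICIT:
        return list(EXPLICIT[token])
    for group in EQUIVALENT:
        if token in group:
            # Original token first, then the other members of the group.
            return [token] + sorted(group - {token})
    return [token]


print(expand("father"))   # ['father', 'dad']
print(expand("droid"))    # ['droid', 'machine']
print(expand("machine"))  # ['machine']
```

Because the explicit rule is one-directional, "machine" is not expanded to "droid", whereas "dad" and "father" expand to each other.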

Slide 42

GET /starwars/_mapping
GET /starwars/_settings

Slide 43

PUT /starwars/_doc/1
{ "quote": "These are <em>not</em> the droids you are looking for." }

PUT /starwars/_doc/2
{ "quote": "Obi-Wan never told you what happened to your father." }

PUT /starwars/_doc/3
{ "quote": "<b>No</b>. I am your father." }

Slide 44

GET /starwars/_doc/1
GET /starwars/_source/1

Slide 45

Multi Lingual Index: PUT /starwars_en/_doc/1 Type Field: { “quote_en”: “…” }

Slide 46

Search

Slide 47

POST /starwars/_search
{
  "query": { "match_all": { } }
}

Slide 48

GET vs POST

Slide 49

{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 1,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." }
      },
      …

Slide 50

POST /starwars/_search
{
  "query": { "match": { "quote": "Droid" } }
}

Slide 51

{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.39556286,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.39556286,
        "_source": { "quote": "These are <em>not</em> the droids you are looking for." }
      }
    ]
  }
}

Slide 52

POST /starwars/_search
{
  "query": { "match": { "quote": "dad" } }
}

Slide 53

…
  "hits": {
    "total": 2,
    "max_score": 0.41913947,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.41913947,
        "_source": { "quote": "<b>No</b>. I am your father." }
      },
      {
        "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.39291072,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." }
      }
    ]
  }
}

Slide 54

POST /starwars/_explain/0
{
  "query": { "match": { "quote": "dad" } }
}

Slide 55

{
  "_index": "starwars",
  "_type": "_doc",
  "_id": "0",
  "matched": false
}

Slide 56

POST /starwars/_doc/1/_explain
{
  "query": { "match": { "quote": "dad" } }
}

Slide 57

{
  "_index": "starwars",
  "_type": "_doc",
  "_id": "1",
  "matched": false,
  "explanation": { "value": 0, "description": "no matching term", "details": [] }
}

Slide 58

POST /starwars/_doc/2/_explain
{
  "query": { "match": { "quote": "dad" } }
}

Slide 59

{
  "_index": "starwars",
  "_type": "_doc",
  "_id": "2",
  "matched": true,
  "explanation": {
    …

Slide 60

POST /starwars/_search
{
  "query": { "match": { "quote": "machine" } }
}

Slide 61

{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 1.2499592,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.2499592,
        "_source": { "quote": "These are <em>not</em> the droids you are looking for." }
      }
    ]
  }
}

Slide 62

POST /starwars/_search
{
  "query": { "match_phrase": { "quote": "I am your father" } }
}

Slide 63

{
  "took": 3,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 1.5665855,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.5665855,
        "_source": { "quote": "<b>No</b>. I am your father." }
      }
    ]
  }
}

Slide 64

POST /starwars/_search
{
  "query": {
    "match_phrase": { "quote": { "query": "I am father", "slop": 1 } }
  }
}

Slide 65

{
  "took": 16,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.8327639,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.8327639,
        "_source": { "quote": "<b>No</b>. I am your father." }
      }
    ]
  }
}

Slide 66

POST /starwars/_search
{
  "query": {
    "match_phrase": { "quote": { "query": "I am not your father", "slop": 1 } }
  }
}

Slide 67

{
  "took": 5,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 1.0409548,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.0409548,
        "_source": { "quote": "<b>No</b>. I am your father." }
      }
    ]
  }
}
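
Why both "I am father" and "I am not your father" match document 3 with a slop of 1 can be sketched with token positions. The model below is a simplification and not Lucene's actual phrase matcher: it shifts each document position by the term's position in the query and accepts the match if the shifts differ by at most `slop`. It assumes the removed stop word "not" still occupies a position in the query.

```python
from itertools import product


def sloppy_match(query, doc_positions, slop):
    """query: list of (term, position in query); doc_positions: term -> positions."""
    candidates = []
    for term, _ in query:
        if term not in doc_positions:
            return False
        candidates.append(doc_positions[term])
    for combo in product(*candidates):
        # Shift every document position by the term's position in the query.
        shifts = [doc_pos - q_pos for doc_pos, (_, q_pos) in zip(combo, query)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False


# Positions of document 3 ("<b>No</b>. I am your father.") after analysis.
doc3 = {"i": [1], "am": [2], "your": [3], "father": [4]}

exact = [("i", 0), ("am", 1), ("your", 2), ("father", 3)]
gap = [("i", 0), ("am", 1), ("father", 2)]               # "I am father"
extra = [("i", 0), ("am", 1), ("your", 3), ("father", 4)]  # "not" removed at position 2

print(sloppy_match(exact, doc3, 0))  # True
print(sloppy_match(gap, doc3, 0))    # False
print(sloppy_match(gap, doc3, 1))    # True
print(sloppy_match(extra, doc3, 1))  # True
```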

Slide 68

POST /starwars/_search
{
  "query": {
    "match": { "quote": { "query": "van", "fuzziness": "AUTO" } }
  }
}

Slide 69

{
  "took": 14,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.18155496,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.18155496,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." }
      }
    ]
  }
}

Slide 70

POST /starwars/_search
{
  "query": {
    "match": { "quote": { "query": "ovi-van", "fuzziness": 1 } }
  }
}

Slide 71

{
  "took": 109,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.3798467,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.3798467,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." }
      }
    ]
  }
}

Slide 72

FuzzyQuery History http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html Before: Brute force Now: Levenshtein Automaton

Slide 73

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

Slide 74

SELECT * FROM starwars
WHERE quote LIKE '?an' OR quote LIKE 'V?n' OR quote LIKE 'Va?'
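
The brute-force approach amounts to computing the edit distance between the query term and every term in the index. A minimal Python sketch of plain Levenshtein distance follows. This is the textbook dynamic-programming version, not the Levenshtein automaton Lucene uses now, and it does not treat a transposition as a single edit.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


# Terms of the inverted index from the example quotes.
terms = ["am", "droid", "father", "happen", "i", "look", "never",
         "obi", "told", "wan", "what", "you", "your"]

# Brute force: compare the query term against every indexed term.
print([t for t in terms if levenshtein("van", t) <= 1])  # ['wan']
```

Scanning every term like this is what becomes expensive on a large term dictionary.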

Slide 75

Scoring

Slide 76

Term Frequency / Inverse Document Frequency (TF/IDF) Search one term

Slide 77

BM25 Default since Elasticsearch 5.0 https://speakerdeck.com/elastic/improved-text-scoring-with-bm25

Slide 78

Term Frequency

Slide 79

Slide 80

Inverse Document Frequency

Slide 81

Slide 82

Field-Length Norm

Slide 83

POST /starwars/_search?explain=true
{
  "query": { "match": { "quote": "father" } }
}

Slide 84

…
"_explanation": {
  "value": 0.41913947,
  "description": "weight(Synonym(quote:dad quote:father) in 0) [PerFieldSimilarity], result of:",
  "details": [
    {
      "value": 0.41913947,
      "description": "score(doc=0,freq=2.0 = termFreq=2.0\n), product of:",
      "details": [
        { "value": 0.2876821, "description": "idf(docFreq=1, docCount=1)", "details": [] },
        {
          "value": 1.4569536,
          "description": "tfNorm, computed from:",
          "details": [
            { "value": 2, "description": "termFreq=2.0", "details": [] },
            …

Slide 85

Score
0.41913947: i am your father
0.39291072: obi wan never told you what happen your father
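
The numbers in the explain output can be approximated with the BM25 building blocks. The sketch below assumes the default parameters `k1 = 1.2` and `b = 0.75`, and the tfNorm form with `(k1 + 1)` in the numerator, which the explain output on the slide appears to use. The field lengths passed to `tf_norm` at the end are hypothetical, because the slide elides them.

```python
import math

K1 = 1.2   # assumed BM25 default
B = 0.75   # assumed BM25 default


def idf(doc_freq, doc_count):
    """Inverse document frequency as used by BM25."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_norm(freq, field_length, avg_field_length):
    """Term frequency, saturated by k1 and normalized by field length via b."""
    length_factor = 1 - B + B * field_length / avg_field_length
    return (freq * (K1 + 1)) / (freq + K1 * length_factor)


# idf(docFreq=1, docCount=1) from the explain output.
print(idf(1, 1))

# Multiplying by the tfNorm value reported on the slide gives the score.
print(idf(1, 1) * 1.4569536)

# Hypothetical field lengths, not taken from the slide.
print(tf_norm(2.0, field_length=4, avg_field_length=5))
```

The first value should land close to the 0.2876821 reported for idf, and the product close to the document score of 0.41913947.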

Slide 86

Vector Space Model Search multiple terms

Slide 87

Search your father

Slide 88

Slide 89

Coordination Factor Reward multiple terms

Slide 90

Search for 3 terms 1 term: 2 terms: 3 terms:

Slide 91

Practical Scoring Function Putting it all together

Slide 92

score(q,d) = queryNorm(q) · coord(q,d) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

Slide 93

Function Score Script, weight, random, field value, decay (geo or date)

Slide 94

POST /starwars/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "quote": "father" } },
      "random_score": {}
    }
  }
}

Slide 95

Compare Scores “100% perfect” vs a “50%” match

Slide 96

Don’t do this. Seriously. Stop trying to think about your problem this way, it’s not going to end well.
Source: https://wiki.apache.org/lucene-java/ScoresAsPercentages

Slide 97

GET /starwars/_analyze
{
  "analyzer": "my_analyzer",
  "text": "These are my father's machines."
}

Slide 98

{
  "tokens": [
    { "token": "my", "start_offset": 10, "end_offset": 12, "type": "<ALPHANUM>", "position": 2 },
    { "token": "father", "start_offset": 13, "end_offset": 21, "type": "<ALPHANUM>", "position": 3 },
    { "token": "dad", "start_offset": 13, "end_offset": 21, "type": "SYNONYM", "position": 3 },
    { "token": "machin", "start_offset": 22, "end_offset": 30, "type": "<ALPHANUM>", "position": 4 }
  ]
}

Slide 99

PUT /starwars/_doc/4
{ "quote": "These are my father's machines." }

Slide 100

POST /starwars/_search
{
  "query": { "match": { "quote": "my father machine" } }
}

Slide 101

"hits": {
  "total": 4,
  "max_score": 2.92523,
  "hits": [
    {
      "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 2.92523,
      "_source": { "quote": "These are my father's machines." }
    },
    {
      "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.8617505,
      "_source": { "quote": "These are <em>not</em> the droids you are looking for." }
    },
    …

Slide 102

2.92523 == 100%

Slide 103

DELETE /starwars/_doc/4

POST /starwars/_search
{
  "query": { "match": { "quote": "my father machine" } }
}

Slide 104

"hits": {
  "total": 3,
  "max_score": 1.2499592,
  "hits": [
    {
      "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.2499592,
      "_source": { "quote": "These are <em>not</em> the droids you are looking for." }
    },
    …

Slide 105

1.2499592 == 43% or 100%?

Slide 106

PUT /starwars/_doc/4
{ "quote": "These droids are my father's father's machines." }

POST /starwars/_search
{
  "query": { "match": { "quote": "my father machine" } }
}

Slide 107

"hits": {
  "total": 4,
  "max_score": 3.0068164,
  "hits": [
    {
      "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 3.0068164,
      "_source": { "quote": "These droids are my father's father's machines." }
    },
    {
      "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.89701396,
      "_source": { "quote": "These are <em>not</em> the droids you are looking for." }
    },
    …

Slide 108

3.0068164 == 103%?

Slide 109

Slide 110

Performance

Slide 111

Slide 112

Slide 113

More

Slide 114

POST /starwars/_search
{
  "query": { "match": { "quote": "father" } },
  "highlight": {
    "type": "unified",
    "pre_tags": [ "<tag>" ],
    "post_tags": [ "</tag>" ],
    "fields": { "quote": {} }
  }
}

Slide 115

…
"hits": [
  {
    "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.41913947,
    "_source": { "quote": "<b>No</b>. I am your father." },
    "highlight": { "quote": [ "<b>No</b>. I am your <tag>father</tag>." ] }
  },
  …

Slide 116

Boolean Queries must must_not should filter

Slide 117

POST /starwars/_search
{
  "query": {
    "bool": {
      "must": { "match": { "quote": "father" } },
      "should": [
        { "match": { "quote": "your" } },
        { "match": { "quote": "obi" } }
      ]
    }
  }
}

Slide 118

…
  "hits": {
    "total": 2,
    "max_score": 0.96268076,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.96268076,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." }
      },
      {
        "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.73245656,
        "_source": { "quote": "<b>No</b>. I am your father." }
      }
    ]
  }
}

Slide 119

POST /starwars/_search
{
  "query": {
    "bool": {
      "filter": { "match": { "quote": "father" } },
      "should": [
        { "match": { "quote": "your" } },
        { "match": { "quote": "obi" } }
      ]
    }
  }
}

Slide 120

…
  "hits": {
    "total": 2,
    "max_score": 0.56977004,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.56977004,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." }
      },
      {
        "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.31331712,
        "_source": { "quote": "<b>No</b>. I am your father." }
      }
    ]
  }
}

Slide 121

Named Queries & minimum_should_match

Slide 122

POST /starwars/_search
{
  "query": {
    "bool": {
      "must": { "match": { "quote": "father" } },
      "should": [
        { "match": { "quote": { "query": "your", "_name": "quote-your" } } },
        { "match": { "quote": { "query": "obi", "_name": "quote-obi" } } },
        { "match": { "quote": { "query": "droid", "_name": "quote-droid" } } }
      ],
      "minimum_should_match": 2
    }
  }
}

Slide 123

…
  "hits": {
    "total": 1,
    "max_score": 1.8154771,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 1.8154771,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." },
        "matched_queries": [ "quote-obi", "quote-your" ]
      }
    ]
  }
}

Slide 124

Boosting >1 increase, <1 decrease, <0 punish

Slide 125

POST /starwars/_search
{
  "query": {
    "bool": {
      "must": { "match": { "quote": "father" } },
      "should": [
        { "match": { "quote": "your" } },
        { "match": { "quote": { "query": "obi", "boost": 3 } } }
      ]
    }
  }
}

Slide 126

…
  "hits": {
    "total": 2,
    "max_score": 1.5324509,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 1.5324509,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." }
      },
      {
        "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.73245656,
        "_source": { "quote": "<b>No</b>. I am your father." }
      }
    ]
  }
}

Slide 127

Suggestion Suggest a similar text _search end point _suggest deprecated since 5.0

Slide 128

POST /starwars/_search
{
  "query": { "match": { "quote": "drui" } },
  "suggest": {
    "my_suggestion": {
      "text": "drui",
      "term": { "field": "quote" }
    }
  }
}

Slide 129

…
  "hits": { "total": 0, "max_score": null, "hits": [] },
  "suggest": {
    "my_suggestion": [
      {
        "text": "drui", "offset": 0, "length": 4,
        "options": [ { "text": "droid", "score": 0.5, "freq": 1 } ]
      }
    ]
  }
}

Slide 130

NGram Partial matches Trigram Edge Gram

Slide 131

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": {
    "type": "ngram",
    "min_gram": "3",
    "max_gram": "3",
    "token_chars": [ "letter" ]
  },
  "filter": [ "lowercase" ],
  "text": "These are <em>not</em> the droids you are looking for."
}

Slide 132

{
  "tokens": [
    { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "hes", "start_offset": 1, "end_offset": 4, "type": "word", "position": 1 },
    { "token": "ese", "start_offset": 2, "end_offset": 5, "type": "word", "position": 2 },
    { "token": "are", "start_offset": 6, "end_offset": 9, "type": "word", "position": 3 },
    …

Slide 133

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": "1",
    "max_gram": "3",
    "token_chars": [ "letter" ]
  },
  "filter": [ "lowercase" ],
  "text": "These are <em>not</em> the droids you are looking for."
}

Slide 134

{
  "tokens": [
    { "token": "t", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "th", "start_offset": 0, "end_offset": 2, "type": "word", "position": 1 },
    { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 2 },
    { "token": "a", "start_offset": 6, "end_offset": 7, "type": "word", "position": 3 },
    { "token": "ar", "start_offset": 6, "end_offset": 8, "type": "word", "position": 4 },
    …
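
The difference between the two tokenizers is easy to see on a single word. The following Python sketch generates n-grams and edge n-grams for one lowercase word. The function names are illustrative, and the sketch ignores offsets and positions.

```python
def ngrams(word, min_gram, max_gram):
    """All substrings of length min_gram..max_gram, from every start offset."""
    grams = []
    for start in range(len(word)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(word):
                grams.append(word[start:start + size])
    return grams


def edge_ngrams(word, min_gram, max_gram):
    """Only the prefixes of length min_gram..max_gram."""
    longest = min(max_gram, len(word))
    return [word[:size] for size in range(min_gram, longest + 1)]


print(ngrams("these", 3, 3))        # ['the', 'hes', 'ese']
print(edge_ngrams("these", 1, 3))   # ['t', 'th', 'the']
print(edge_ngrams("father", 3, 10)) # ['fat', 'fath', 'fathe', 'father']
```

Edge n-grams only cover prefixes, which suits search-as-you-type, while full n-grams also match in the middle of a word at the cost of many more terms.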

Slide 135

Combining Analyzers Reindex Store multiple times Combine scores

Slide 136

PUT /starwars_v42
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [ "droid,machine", "father,dad" ]
        },
        "my_ngram_filter": {
          "type": "ngram",
          "min_gram": "3",
          "max_gram": "3",
          "token_chars": [ "letter" ]
        }
      },

Slide 137

      "analyzer": {
        "my_lowercase_analyzer": {
          "char_filter": [ "html_strip" ],
          "tokenizer": "whitespace",
          "filter": [ "lowercase" ]
        },
        "my_full_analyzer": {
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop", "snowball", "my_synonym_filter" ]
        },

Slide 138

        "my_ngram_analyzer": {
          "char_filter": [ "html_strip" ],
          "tokenizer": "whitespace",
          "filter": [ "lowercase", "stop", "my_ngram_filter" ]
        }
      }
    }
  },

Slide 139

  "mappings": {
    "properties": {
      "quote": {
        "type": "text",
        "fields": {
          "lowercase": { "type": "text", "analyzer": "my_lowercase_analyzer" },
          "full": { "type": "text", "analyzer": "my_full_analyzer" },
          "ngram": { "type": "text", "analyzer": "my_ngram_analyzer" }
        }
      }
    }
  }
}

Slide 140

POST /_reindex
{
  "source": { "index": "starwars" },
  "dest": { "index": "starwars_v42" }
}

Slide 141

POST /_aliases
{
  "actions": [
    { "add": { "index": "starwars_v42", "alias": "starwars_extended" } }
  ]
}

Slide 142

Aliases Atomic remove and add Point to multiple indices (read-only)

Slide 143

POST /starwars_extended/_search?explain=true
{
  "query": {
    "multi_match": {
      "query": "obiwan",
      "fields": [ "quote", "quote.lowercase", "quote.full", "quote.ngram" ],
      "type": "most_fields"
    }
  }
}

Slide 144

…
  "hits": {
    "total": 1,
    "max_score": 0.4912064,
    "hits": [
      {
        "_shard": "[starwars_v42][2]",
        "_node": "BCDwzJ4WSw2dyoGLTzwlqw",
        "_index": "starwars_v42", "_type": "_doc", "_id": "2", "_score": 0.4912064,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." },
        …

Slide 145

Whitespace Tokenizer
"weight(Synonym(quote.ngram:biw quote.ngram:iwa quote.ngram:obi quote.ngram:wan) in 0) [PerFieldSimilarity], result of:"

Slide 146

POST /starwars_extended/_search
{
  "query": {
    "multi_match": {
      "query": "you",
      "fields": [ "quote", "quote.lowercase", "quote.full^5", "quote.ngram" ],
      "type": "best_fields"
    }
  }
}

Slide 147

"hits": [
  {
    "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 1.6022799,
    "_source": { "quote": "These are <em>not</em> the droids you are looking for." }
  },
  {
    "_index": "starwars_v42", "_type": "_doc", "_id": "2", "_score": 1.4997643,
    "_source": { "quote": "Obi-Wan never told you what happened to your father." }
  },
  {
    "_index": "starwars_v42", "_type": "_doc", "_id": "3", "_score": 0.38650417,
    "_source": { "quote": "<b>No</b>. I am your father." }
  }
]

Slide 148

Multi Match Type
best_fields: Score of the best field (default)
cross_fields: All terms in at least one field
most_fields: Score sum of all fields
phrase
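
How `best_fields` and `most_fields` combine per-field scores can be sketched in Python. The field scores below are hypothetical numbers for illustration. `tie_breaker` is an existing `multi_match` parameter that lets the non-winning fields contribute to `best_fields`, and this sketch assumes its default of 0.0.

```python
def best_fields(field_scores, tie_breaker=0.0):
    """Score of the best field, plus tie_breaker times the other field scores."""
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)


def most_fields(field_scores):
    """Sum of the scores of all matching fields."""
    return sum(field_scores)


# Hypothetical scores of one document in three fields.
scores = [0.2, 0.5, 0.1]

print(best_fields(scores))  # 0.5
print(most_fields(scores))
```

With `most_fields`, a document that matches in several differently analyzed sub-fields outranks one that matches in a single field, which is why it pairs well with the multi-field mapping used here.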

Slide 149

Different Analyzers for Indexing and Searching Per query In the mapping

Slide 150

POST /starwars_extended/_search
{
  "query": {
    "match": {
      "quote.ngram": { "query": "the", "analyzer": "standard" }
    }
  }
}

Slide 151

…
"hits": [
  {
    "_index": "starwars_extended", "_type": "_doc", "_id": "2", "_score": 0.38254172,
    "_source": { "quote": "Obi-Wan never told you what happened to your father." }
  },
  {
    "_index": "starwars_extended", "_type": "_doc", "_id": "3", "_score": 0.36165747,
    "_source": { "quote": "<b>No</b>. I am your father." }
  }
]
…

Slide 152

Edge Gram vs Trigram Extending a mapping Testing a custom mapping

Slide 153

POST /starwars_extended/_close

PUT /starwars_extended/_settings
{
  "analysis": {
    "filter": {
      "my_edgegram_filter": { "type": "edge_ngram", "min_gram": 3, "max_gram": 10 }
    },
    "analyzer": {
      "my_edgegram_analyzer": {
        "char_filter": [ "html_strip" ],
        "tokenizer": "standard",
        "filter": [ "lowercase", "my_edgegram_filter" ]
      }
    }
  }
}

POST /starwars_extended/_open

Slide 154

GET /starwars_extended/_analyze
{
  "text": "Father",
  "analyzer": "my_edgegram_analyzer"
}

Slide 155

{
  "tokens": [
    { "token": "fat", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 },
    { "token": "fath", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 },
    { "token": "fathe", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 },
    { "token": "father", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }
  ]
}

Slide 156

PUT /starwars_extended/_mapping
{
  "properties": {
    "quote": {
      "type": "text",
      "fields": {
        "edgegram": {
          "type": "text",
          "analyzer": "my_edgegram_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

Slide 157

PUT /starwars_extended/_doc/4
{ "quote": "I find your lack of faith disturbing." }

PUT /starwars_extended/_doc/5
{ "quote": "That… is your failure." }

Slide 158

GET /starwars_extended/_termvectors/4
{
  "fields": [ "quote.edgegram" ],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

Slide 159

{
  "_index": "starwars_v42",
  "_type": "_doc",
  "_id": "4",
  "_version": 1,
  "found": true,
  "took": 3,
  "term_vectors": {
    "quote.edgegram": {
      "field_statistics": { "sum_doc_freq": 26, "doc_count": 2, "sum_ttf": 26 },
      "terms": {
        "dis": {
          "doc_freq": 1, "ttf": 1, "term_freq": 1,
          "tokens": [ { "position": 6, "start_offset": 26, "end_offset": 36 } ]
        },
        "dist": {
          "doc_freq": 1, "ttf": 1,
          …

Slide 160

POST /starwars_extended/_search
{
  "query": { "match": { "quote": "fail" } }
}

Slide 161

POST /starwars_extended/_search
{
  "query": { "match": { "quote.lowercase": "fail" } }
}

Slide 162

POST /starwars_extended/_search
{
  "query": { "match": { "quote.full": "fail" } }
}

Slide 163

POST /starwars_extended/_search
{
  "query": { "match": { "quote.ngram": "fail" } }
}

Slide 164

…
"hits": {
  "total": 2,
  "max_score": 1.0135446,
  "hits": [
    {
      "_index": "starwars_v42", "_type": "_doc", "_id": "4", "_score": 1.0135446,
      "_source": { "quote": "I find your lack of faith disturbing." }
    },
    {
      "_index": "starwars_v42", "_type": "_doc", "_id": "5", "_score": 0.50476736,
      "_source": { "quote": "That… is your failure." }
    }
  ]
…

Slide 165

POST /starwars_extended/_search
{
  "query": { "match": { "quote.edgegram": "fail" } }
}

Slide 166

…
"hits": {
  "total": 1,
  "max_score": 0.39556286,
  "hits": [
    {
      "_index": "starwars_v42", "_type": "_doc", "_id": "5", "_score": 0.39556286,
      "_source": { "quote": "That… is your failure." }
    }
  ]
…

Slide 167

Conclusion

Slide 168

Indexing Formatting Tokenize Lowercase, Stop Words, Stemming Synonyms

Slide 169

Scoring Term Frequency Inverse Document Frequency Field-Length Norm Vector Space Model

Slide 170

Advanced Queries Highlighting NGrams & Edge Grams Multiple Analyzers Reindex & Alias

Slide 171

There is more Elastic Stack

Slide 172

Trainings https://training.elastic.co

Slide 173

Slide 174

Slide 175

Slide 176

Thank You! Questions? Philipp Krenn PS: Stickers @xeraa

Slide 177

The End