Full-Text Search Internals

A presentation at DataOps in June 2019 in Barcelona, Spain by Philipp Krenn

Slide 1

Slide 2

Who is using databases?

Slide 3

Who is using search?

Slide 4

Slide 5

Slide 6

Developer

Slide 7

Store

Slide 8

Apache Lucene Elasticsearch

Slide 9

https://cloud.elastic.co

Slide 10

Slide 11

---
version: '2'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:$ELASTIC_VERSION
    environment:
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - discovery.type=single-node
    ulimits:
      memlock:
        soft: -1
        hard: -1
    mem_limit: 1g
    volumes:
      - esdata1:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
  kibana:
    image: docker.elastic.co/kibana/kibana:$ELASTIC_VERSION
    links:
      - elasticsearch
    ports:
      - 5601:5601
volumes:
  esdata1:
    driver: local

Slide 12

Slide 13

Example These are <em>not</em> the droids you are looking for.

Slide 14

html_strip Char Filter These are not the droids you are looking for.

Slide 15

standard Tokenizer These are not the droids you are looking for

Slide 16

lowercase Token Filter these are not the droids you are looking for

Slide 17

stop Token Filter droids you looking

Slide 18

snowball Token Filter droid you look
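The chain on the preceding slides (char filter, tokenizer, lowercase, stop words, stemming) can be sketched in a few lines of Python. This is only an illustration: `toy_stem` is a hypothetical stand-in that strips a trailing "ing" or "s" and is not the Snowball algorithm, and the regular expressions only approximate what the html_strip char filter and the standard tokenizer do.

```python
import re

# Stop words as listed on the "Stop Words" slide.
STOP_WORDS = set(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)

def html_strip(text):
    """Char filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text):
    """Tokenizer: split into word tokens and remember each token's position."""
    return list(enumerate(re.findall(r"\w+", text)))

def toy_stem(token):
    """Hypothetical stand-in for Snowball: strips a trailing 'ing' or 's' only."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = tokenize(html_strip(text))
    tokens = [(pos, tok.lower()) for pos, tok in tokens]                   # lowercase
    tokens = [(pos, tok) for pos, tok in tokens if tok not in STOP_WORDS]  # stop
    return [(pos, toy_stem(tok)) for pos, tok in tokens]                   # stemming

print(analyze("These are <em>not</em> the droids you are looking for."))
# [(4, 'droid'), (5, 'you'), (7, 'look')]
```

Removed stop words leave gaps in the positions, which is why "droid" sits at position 4 rather than 0.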

Slide 19

Analyze

Slide 20

GET /_analyze
{
  "analyzer": "english",
  "text": "These are not the droids you are looking for."
}

Slide 21

{
  "tokens": [
    { "token": "droid", "start_offset": 18, "end_offset": 24, "type": "<ALPHANUM>", "position": 4 },
    { "token": "you", "start_offset": 25, "end_offset": 28, "type": "<ALPHANUM>", "position": 5 },
    …
  ]
}

Slide 22

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop", "snowball" ],
  "text": "These are <em>not</em> the droids you are looking for."
}

Slide 23

{
  "tokens": [
    { "token": "droid", "start_offset": 27, "end_offset": 33, "type": "<ALPHANUM>", "position": 4 },
    { "token": "you", "start_offset": 34, "end_offset": 37, "type": "<ALPHANUM>", "position": 5 },
    …
  ]
}

Slide 24

Stop Words
a an and are as at be but by for if in into is it no not of on or such that the their then there these they this to was will with
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardAnalyzer.java#L44-L50

Slide 25

Always Use Stop Words?

Slide 26

To be, or not to be.

Slide 27

Languages Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Turkish, Thai

Slide 28

More Language Plugins Core: ICU (Asian languages), Kuromoji (advanced Japanese), Phonetic, SmartCN, Stempel (better Polish stemming), Ukrainian (stemming) Community: Hebrew, Vietnamese, Network Address Analysis, String2Integer,…

Slide 29

Language Rules
English: Philipp’s → philipp
French: l’église → eglis
German: äußerst → ausserst

Slide 30

Spanish Éstos no son los androides que estáis buscando.

Slide 31

Spanish est android buscand

Slide 32

Spanish with the English Analyzer

Slide 33

Another Example Obi-Wan never told you what happened to your father.

Slide 34

Another Example obi wan never told you what happen your father

Slide 35

Another Example <b>No</b>. I am your father.

Slide 36

Another Example i am your father

Slide 37

Inverted Index
Term     ID 1    ID 2    ID 3
am       0       0       1[2]
droid    1[4]    0       0
father   0       1[9]    1[4]
happen   0       1[6]    0
i        0       0       1[1]
look     1[7]    0       0
never    0       1[2]    0
obi      0       1[0]    0
told     0       1[3]    0
wan      0       1[1]    0
what     0       1[5]    0
you      1[5]    1[4]    0
your     0       1[8]    1[3]
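A minimal sketch of how such a positional inverted index could be built. The analyzed tokens and positions are copied from the slides; the dict-of-dicts layout is an illustrative choice, not how Lucene stores postings on disk.

```python
from collections import defaultdict

# Analyzed tokens per document as (position, term), taken from the slides.
DOCS = {
    1: [(4, "droid"), (5, "you"), (7, "look")],
    2: [(0, "obi"), (1, "wan"), (2, "never"), (3, "told"), (4, "you"),
        (5, "what"), (6, "happen"), (8, "your"), (9, "father")],
    3: [(1, "i"), (2, "am"), (3, "your"), (4, "father")],
}

def build_inverted_index(docs):
    """Map every term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in tokens:
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)

index = build_inverted_index(DOCS)
for term in sorted(index):
    print(term, index[term])
```

Looking up a term is then a single dictionary access, for example `index["father"]` returns the documents and positions shown in the "father" row.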

Slide 38

To / The Index

Slide 39

PUT /starwars
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "father,dad",
            "droid => droid,machine"
          ]
        }
      },

Slide 40

      "analyzer": {
        "my_analyzer": {
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop", "snowball", "my_synonym_filter" ]
        }
      }
    }
  },

Slide 41

  "mappings": {
    "properties": {
      "quote": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
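The two synonym rule forms in this index definition behave differently: "father,dad" treats both terms as equivalent, while "droid => droid,machine" only rewrites "droid". A rough sketch of that behavior, assuming simple string rules; the function names are hypothetical and this is not the Lucene synonym filter itself.

```python
def parse_synonym_rules(rules):
    """Return {input_term: [output_terms]} for the two rule forms used above."""
    mapping = {}
    for rule in rules:
        if "=>" in rule:
            # Explicit mapping: only the left-hand terms are rewritten.
            left, right = rule.split("=>")
            outputs = [term.strip() for term in right.split(",")]
            for term in left.split(","):
                mapping[term.strip()] = outputs
        else:
            # Equivalent synonyms: every term expands to the whole group.
            group = [term.strip() for term in rule.split(",")]
            for term in group:
                mapping[term] = [term] + [other for other in group if other != term]
    return mapping

def apply_synonyms(tokens, mapping):
    """Emit synonyms at the same position as the original token."""
    return [(position, synonym)
            for position, term in tokens
            for synonym in mapping.get(term, [term])]

mapping = parse_synonym_rules(["father,dad", "droid => droid,machine"])
print(apply_synonyms([(3, "your"), (4, "father")], mapping))
print(apply_synonyms([(4, "droid")], mapping))
print(apply_synonyms([(0, "machine")], mapping))
```

Because "droid" is indexed as both "droid" and "machine", a later search for "machine" can find the droids quote even though "machine" itself is never expanded.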

Slide 42

PUT /starwars/_doc/1
{ "quote": "These are <em>not</em> the droids you are looking for." }
PUT /starwars/_doc/2
{ "quote": "Obi-Wan never told you what happened to your father." }
PUT /starwars/_doc/3
{ "quote": "<b>No</b>. I am your father." }

Slide 43

GET /starwars/_doc/1
GET /starwars/_source/1

Slide 44

Search

Slide 45

POST /starwars/_search
{
  "query": { "match_all": { } }
}

Slide 46

GET vs POST

Slide 47

{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 1,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." }
      },
      …

Slide 48

POST /starwars/_search
{
  "query": { "match": { "quote": "Droid" } }
}

Slide 49

{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.39556286,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.39556286,
        "_source": { "quote": "These are <em>not</em> the droids you are looking for." }
      }
    ]
  }
}

Slide 50

POST /starwars/_search
{
  "query": { "match": { "quote": "dad" } }
}

Slide 51

…
  "hits": {
    "total": 2,
    "max_score": 0.41913947,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.41913947,
        "_source": { "quote": "<b>No</b>. I am your father." }
      },
      {
        "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.39291072,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." }
      }
    ]
  }
}

Slide 52

POST /starwars/_search
{
  "query": { "match": { "quote": "machine" } }
}

Slide 53

{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 1.2499592,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.2499592,
        "_source": { "quote": "These are <em>not</em> the droids you are looking for." }
      }
    ]
  }
}

Slide 54

POST /starwars/_search
{
  "query": { "match_phrase": { "quote": "I am your father" } }
}

Slide 55

{
  "took": 3,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 1.5665855,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.5665855,
        "_source": { "quote": "<b>No</b>. I am your father." }
      }
    ]
  }
}

Slide 56

POST /starwars/_search
{
  "query": {
    "match_phrase": {
      "quote": { "query": "I am father", "slop": 1 }
    }
  }
}

Slide 57

{
  "took": 16,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.8327639,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.8327639,
        "_source": { "quote": "<b>No</b>. I am your father." }
      }
    ]
  }
}

Slide 58

POST /starwars/_search
{
  "query": {
    "match_phrase": {
      "quote": { "query": "I am not your father", "slop": 1 }
    }
  }
}

Slide 59

{
  "took": 5,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 1.0409548,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.0409548,
        "_source": { "quote": "<b>No</b>. I am your father." }
      }
    ]
  }
}
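Why do both slop queries match document 3? A simplified way to think about it: compare the position of each query term with its position in the document and check how far those shifts disagree. The sketch below assumes tokens are already analyzed into (position, term) pairs; it is a deliberately simplified model and not Lucene's actual sloppy phrase scorer.

```python
from itertools import product

def phrase_matches(query, doc, slop=0):
    """query and doc are lists of (position, term) after analysis."""
    if not query:
        return False
    doc_positions = {}
    for position, term in doc:
        doc_positions.setdefault(term, []).append(position)
    if any(term not in doc_positions for _, term in query):
        return False
    # Try every combination of occurrences, one per query term.
    for combo in product(*(doc_positions[term] for _, term in query)):
        shifts = [doc_pos - query_pos
                  for doc_pos, (query_pos, _) in zip(combo, query)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False

DOC_3 = [(1, "i"), (2, "am"), (3, "your"), (4, "father")]

# "I am father": one position is missing between "am" and "father".
query_gap = [(0, "i"), (1, "am"), (2, "father")]
print(phrase_matches(query_gap, DOC_3, slop=0))  # False
print(phrase_matches(query_gap, DOC_3, slop=1))  # True

# "I am not your father": the stop word "not" is removed but keeps its position.
query_not = [(0, "i"), (1, "am"), (3, "your"), (4, "father")]
print(phrase_matches(query_not, DOC_3, slop=1))  # True
```

In this model a slop of 1 tolerates one position of disagreement, which covers both the missing "your" and the removed "not".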

Slide 60

POST /starwars/_search
{
  "query": {
    "match": {
      "quote": { "query": "van", "fuzziness": "AUTO" }
    }
  }
}

Slide 61

{
  "took": 14,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.18155496,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.18155496,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." }
      }
    ]
  }
}

Slide 62

POST /starwars/_search
{
  "query": {
    "match": {
      "quote": { "query": "ovi-van", "fuzziness": 1 }
    }
  }
}

Slide 63

{
  "took": 109,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.3798467,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.3798467,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." }
      }
    ]
  }
}

Slide 64

FuzzyQuery History
http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html
Before: Brute force
Now: Levenshtein Automaton

Slide 65

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

Slide 66

SELECT * FROM starwars
WHERE quote LIKE "?an"
   OR quote LIKE "V?n"
   OR quote LIKE "Va?"
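The "brute force" approach mentioned on the FuzzyQuery History slide can be sketched as: compute the edit distance between the query term and every indexed term. The code below uses plain Levenshtein distance (no transpositions) and the default length thresholds of "fuzziness": "AUTO"; `fuzzy_lookup` is a hypothetical helper for illustration, not the Lucene implementation.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def auto_fuzziness(term):
    """Allowed edits for AUTO: 0 up to 2 chars, 1 up to 5 chars, 2 beyond."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2

def fuzzy_lookup(query_term, indexed_terms, max_edits):
    """Brute force: compare the query term against every term in the index."""
    return [term for term in indexed_terms
            if levenshtein(query_term, term) <= max_edits]

TERMS = ["am", "droid", "father", "happen", "i", "look", "never",
         "obi", "told", "wan", "what", "you", "your"]

print(fuzzy_lookup("van", TERMS, auto_fuzziness("van")))  # ['wan']
print(fuzzy_lookup("ovi", TERMS, 1))                      # ['obi']
```

Every lookup touches every term, which is what a Levenshtein automaton avoids.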

Slide 67

Scoring

Slide 68

Term Frequency / Inverse Document Frequency (TF/IDF) Search one term

Slide 69

BM25
Default since Elasticsearch 5.0
https://speakerdeck.com/elastic/improved-text-scoring-with-bm25

Slide 70

Term Frequency

Slide 71

Slide 72

Inverse Document Frequency

Slide 73

Slide 74

Field-Length Norm

Slide 75

POST /starwars/_search?explain=true
{
  "query": { "match": { "quote": "father" } }
}

Slide 76

…
"_explanation": {
  "value": 0.41913947,
  "description": "weight(Synonym(quote:dad quote:father) in 0) [PerFieldSimilarity], result of:",
  "details": [
    {
      "value": 0.41913947,
      "description": "score(doc=0,freq=2.0 = termFreq=2.0\n), product of:",
      "details": [
        { "value": 0.2876821, "description": "idf(docFreq=1, docCount=1)", "details": [] },
        {
          "value": 1.4569536,
          "description": "tfNorm, computed from:",
          "details": [
            { "value": 2, "description": "termFreq=2.0", "details": [] },
            …
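The numbers in this explain output can be reproduced with the BM25 formulas Lucene used at the time, with the default parameters k1 = 1.2 and b = 0.75. freq, docFreq and docCount come straight from the output above; the field length (4) and average field length (5) are assumptions chosen because they reproduce the tfNorm shown, since that part of the output is cut off.

```python
import math

K1 = 1.2   # BM25 defaults in Lucene
B = 0.75

def idf(doc_freq, doc_count):
    """Inverse document frequency: rarer terms weigh more."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def tf_norm(freq, field_length, avg_field_length, k1=K1, b=B):
    """Saturating term frequency, normalized by field length."""
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )

def bm25(freq, doc_freq, doc_count, field_length, avg_field_length):
    return idf(doc_freq, doc_count) * tf_norm(freq, field_length, avg_field_length)

# field_length=4 and avg_field_length=5 are assumed values (see lead-in).
print(round(idf(1, 1), 7))
print(round(tf_norm(2, 4, 5), 7))
print(round(bm25(2, 1, 1, 4, 5), 7))
```

The three printed values should land close to the idf, tfNorm and final score in the explain output, up to float rounding. Note that freq is 2 because the synonym filter indexed both "father" and "dad".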

Slide 77

Score
0.41913947: i am your father
0.39291072: obi wan never told you what happen your father

Slide 78

Vector Space Model Search multiple terms

Slide 79

Search your father

Slide 80

Slide 81

Coordination Factor Reward multiple terms

Slide 82

Search for 3 terms 1 term: 2 terms: 3 terms:

Slide 83

Practical Scoring Function Putting it all together

Slide 84

score(q,d) = queryNorm(q) · coord(q,d) · ∑_(t in q) ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
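A small sketch of this practical scoring function, using the definitions of Lucene's classic TF/IDF similarity (tf as the square root of the frequency, idf as 1 + ln(numDocs / (docFreq + 1)), the field-length norm as 1 / sqrt(number of terms)). Boosts are fixed at 1, the lossy norm encoding is ignored, and the three token lists are the analyzed quotes without synonyms, so the numbers are illustrative only.

```python
import math

DOCS = {
    1: ["droid", "you", "look"],
    2: ["obi", "wan", "never", "told", "you", "what", "happen", "your", "father"],
    3: ["i", "am", "your", "father"],
}

def tf(freq):
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms):
    return 1.0 / math.sqrt(num_terms)

def score(query_terms, doc_tokens, docs):
    num_docs = len(docs)
    idfs = {t: idf(sum(1 for d in docs.values() if t in d), num_docs)
            for t in query_terms}
    query_norm = 1.0 / math.sqrt(sum(v * v for v in idfs.values()))
    matched = [t for t in query_terms if t in doc_tokens]
    coord = len(matched) / len(query_terms)   # reward matching more terms
    boost = 1.0                               # t.getBoost(), not used here
    total = sum(tf(doc_tokens.count(t)) * idfs[t] ** 2 * boost
                * field_norm(len(doc_tokens)) for t in matched)
    return query_norm * coord * total

for doc_id, tokens in DOCS.items():
    print(doc_id, round(score(["your", "father"], tokens, DOCS), 4))
```

Documents 2 and 3 both contain "your" and "father", but the shorter document 3 scores higher because of the field-length norm.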

Slide 85

Function Score Script, weight, random, field value, decay (geo or date)

Slide 86

POST /starwars/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "quote": "father" } },
      "random_score": {}
    }
  }
}

Slide 87

Compare Scores “100% perfect” vs a “50%” match

Slide 88

Don’t do this. Seriously. Stop trying to think about your problem this way, it’s not going to end well.
Source: https://wiki.apache.org/lucene-java/ScoresAsPercentages

Slide 89

GET /starwars/_analyze
{
  "analyzer": "my_analyzer",
  "text": "These are my father's machines."
}

Slide 90

{
  "tokens": [
    { "token": "my", "start_offset": 10, "end_offset": 12, "type": "<ALPHANUM>", "position": 2 },
    { "token": "father", "start_offset": 13, "end_offset": 21, "type": "<ALPHANUM>", "position": 3 },
    { "token": "dad", "start_offset": 13, "end_offset": 21, "type": "SYNONYM", "position": 3 },
    { "token": "machin", "start_offset": 22, "end_offset": 30, "type": "<ALPHANUM>", "position": 4 }
  ]
}

Slide 91

PUT /starwars/_doc/4
{ "quote": "These are my father's machines." }

Slide 92

POST /starwars/_search
{
  "query": { "match": { "quote": "my father machine" } }
}

Slide 93

"hits": {
  "total": 4,
  "max_score": 2.92523,
  "hits": [
    {
      "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 2.92523,
      "_source": { "quote": "These are my father's machines." }
    },
    {
      "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.8617505,
      "_source": { "quote": "These are <em>not</em> the droids you are looking for." }
    },
    …

Slide 94

2.92523 == 100%

Slide 95

DELETE /starwars/_doc/4
POST /starwars/_search
{
  "query": { "match": { "quote": "my father machine" } }
}

Slide 96

"hits": {
  "total": 3,
  "max_score": 1.2499592,
  "hits": [
    {
      "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.2499592,
      "_source": { "quote": "These are <em>not</em> the droids you are looking for." }
    },
    …

Slide 97

1.2499592 == 43% or 100%?

Slide 98

PUT /starwars/_doc/4
{ "quote": "These droids are my father's father's machines." }
POST /starwars/_search
{
  "query": { "match": { "quote": "my father machine" } }
}

Slide 99

"hits": {
  "total": 4,
  "max_score": 3.0068164,
  "hits": [
    {
      "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 3.0068164,
      "_source": { "quote": "These droids are my father's father's machines." }
    },
    {
      "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.89701396,
      "_source": { "quote": "These are <em>not</em> the droids you are looking for." }
    },
    …

Slide 100

3.0068164 == 103%?

Slide 101

Slide 102

Performance

Slide 103

Slide 104

Slide 105

Conclusion

Slide 106

Indexing Formatting Tokenize Lowercase, Stop Words, Stemming Synonyms

Slide 107

Scoring Term Frequency Inverse Document Frequency Field-Length Norm Vector Space Model

Slide 108

Slide 109

Slide 110

Slide 111

Thank You! Questions? Philipp Krenn PS: Stickers @xeraa

Slide 112

The End

Slide 113

More

Slide 114

POST /starwars/_search
{
  "query": { "match": { "quote": "father" } },
  "highlight": {
    "type": "unified",
    "pre_tags": [ "<tag>" ],
    "post_tags": [ "</tag>" ],
    "fields": { "quote": {} }
  }
}

Slide 115

…
"hits": [
  {
    "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.41913947,
    "_source": { "quote": "<b>No</b>. I am your father." },
    "highlight": {
      "quote": [ "<b>No</b>. I am your <tag>father</tag>." ]
    }
  },
  …

Slide 116

Boolean Queries must must_not should filter

Slide 117

POST /starwars/_search
{
  "query": {
    "bool": {
      "must": { "match": { "quote": "father" } },
      "should": [
        { "match": { "quote": "your" } },
        { "match": { "quote": "obi" } }
      ]
    }
  }
}

Slide 118

…
  "hits": {
    "total": 2,
    "max_score": 0.96268076,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.96268076,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." }
      },
      {
        "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.73245656,
        "_source": { "quote": "<b>No</b>. I am your father." }
      }
    ]
  }
}

Slide 119

POST /starwars/_search
{
  "query": {
    "bool": {
      "filter": { "match": { "quote": "father" } },
      "should": [
        { "match": { "quote": "your" } },
        { "match": { "quote": "obi" } }
      ]
    }
  }
}

Slide 120

…
  "hits": {
    "total": 2,
    "max_score": 0.56977004,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.56977004,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." }
      },
      {
        "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.31331712,
        "_source": { "quote": "<b>No</b>. I am your father." }
      }
    ]
  }
}

Slide 121

Named Queries & minimum_should_match

Slide 122

POST /starwars/_search
{
  "query": {
    "bool": {
      "must": { "match": { "quote": "father" } },
      "should": [
        { "match": { "quote": { "query": "your", "_name": "quote-your" } } },
        { "match": { "quote": { "query": "obi", "_name": "quote-obi" } } },
        { "match": { "quote": { "query": "droid", "_name": "quote-droid" } } }
      ],
      "minimum_should_match": 2
    }
  }
}

Slide 123

…
  "hits": {
    "total": 1,
    "max_score": 1.8154771,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 1.8154771,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." },
        "matched_queries": [ "quote-obi", "quote-your" ]
      }
    ]
  }
}
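The matching logic of this bool query (ignoring scoring) can be sketched as set membership: the must clause has to match, and at least minimum_should_match of the named should clauses have to match as well. `evaluate_bool` and the term sets below are hypothetical helpers for illustration; the term sets are the analyzed quotes including synonyms.

```python
DOC_TERMS = {
    1: {"droid", "machine", "you", "look"},
    2: {"obi", "wan", "never", "told", "you", "what", "happen",
        "your", "father", "dad"},
    3: {"i", "am", "your", "father", "dad"},
}

def evaluate_bool(doc_terms, must, should, minimum_should_match=0):
    """Return (matches, matched_query_names) for one document.

    must:   list of required terms
    should: dict of {query_name: optional_term}
    """
    if not all(term in doc_terms for term in must):
        return False, []
    matched = sorted(name for name, term in should.items() if term in doc_terms)
    if len(matched) < minimum_should_match:
        return False, []
    return True, matched

should = {"quote-your": "your", "quote-obi": "obi", "quote-droid": "droid"}
for doc_id, terms in DOC_TERMS.items():
    print(doc_id, evaluate_bool(terms, ["father"], should, minimum_should_match=2))
```

Only document 2 satisfies the must clause and two of the three should clauses, and the names of the clauses that matched are what the response reports as matched_queries.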

Slide 124

Boosting >1 increase, <1 decrease, <0 punish

Slide 125

POST /starwars/_search
{
  "query": {
    "bool": {
      "must": { "match": { "quote": "father" } },
      "should": [
        { "match": { "quote": "your" } },
        { "match": { "quote": { "query": "obi", "boost": 3 } } }
      ]
    }
  }
}

Slide 126

…
  "hits": {
    "total": 2,
    "max_score": 1.5324509,
    "hits": [
      {
        "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 1.5324509,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." }
      },
      {
        "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.73245656,
        "_source": { "quote": "<b>No</b>. I am your father." }
      }
    ]
  }
}

Slide 127

Suggestion
Suggest a similar text
_search endpoint
_suggest deprecated since 5.0

Slide 128

POST /starwars/_search
{
  "query": { "match": { "quote": "drui" } },
  "suggest": {
    "my_suggestion": {
      "text": "drui",
      "term": { "field": "quote" }
    }
  }
}

Slide 129

…
  "hits": { "total": 0, "max_score": null, "hits": [] },
  "suggest": {
    "my_suggestion": [
      {
        "text": "drui",
        "offset": 0,
        "length": 4,
        "options": [
          { "text": "droid", "score": 0.5, "freq": 1 }
        ]
      }
    ]
  }
}

Slide 130

NGram Partial matches Trigram Edge Gram

Slide 131

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": {
    "type": "ngram",
    "min_gram": "3",
    "max_gram": "3",
    "token_chars": [ "letter" ]
  },
  "filter": [ "lowercase" ],
  "text": "These are <em>not</em> the droids you are looking for."
}

Slide 132

{
  "tokens": [
    { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "hes", "start_offset": 1, "end_offset": 4, "type": "word", "position": 1 },
    { "token": "ese", "start_offset": 2, "end_offset": 5, "type": "word", "position": 2 },
    { "token": "are", "start_offset": 6, "end_offset": 9, "type": "word", "position": 3 },
    …

Slide 133

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": "1",
    "max_gram": "3",
    "token_chars": [ "letter" ]
  },
  "filter": [ "lowercase" ],
  "text": "These are <em>not</em> the droids you are looking for."
}

Slide 134

{
  "tokens": [
    { "token": "t", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "th", "start_offset": 0, "end_offset": 2, "type": "word", "position": 1 },
    { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 2 },
    { "token": "a", "start_offset": 6, "end_offset": 7, "type": "word", "position": 3 },
    { "token": "ar", "start_offset": 6, "end_offset": 8, "type": "word", "position": 4 },
    …
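The difference between the two tokenizers is easy to see on a single word. A minimal sketch of both, operating on one already-lowercased word; the function names are hypothetical and word splitting is left out.

```python
def ngrams(word, min_gram, max_gram):
    """All substrings of length min_gram..max_gram, ordered by start offset."""
    return [word[start:start + size]
            for start in range(len(word))
            for size in range(min_gram, max_gram + 1)
            if start + size <= len(word)]

def edge_ngrams(word, min_gram, max_gram):
    """Only the substrings anchored at the start of the word."""
    return [word[:size] for size in range(min_gram, min(max_gram, len(word)) + 1)]

print(ngrams("these", 3, 3))        # ['the', 'hes', 'ese']
print(edge_ngrams("these", 1, 3))   # ['t', 'th', 'the']
print(edge_ngrams("father", 3, 10)) # ['fat', 'fath', 'fathe', 'father']
```

Trigrams allow matches anywhere inside a word, while edge grams only support prefix matches such as search-as-you-type.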

Slide 135

Combining Analyzers Reindex Store multiple times Combine scores

Slide 136

PUT /starwars_v42
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [ "droid,machine", "father,dad" ]
        },
        "my_ngram_filter": {
          "type": "ngram",
          "min_gram": "3",
          "max_gram": "3",
          "token_chars": [ "letter" ]
        }
      },

Slide 137

      "analyzer": {
        "my_lowercase_analyzer": {
          "char_filter": [ "html_strip" ],
          "tokenizer": "whitespace",
          "filter": [ "lowercase" ]
        },
        "my_full_analyzer": {
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop", "snowball", "my_synonym_filter" ]
        },

Slide 138

        "my_ngram_analyzer": {
          "char_filter": [ "html_strip" ],
          "tokenizer": "whitespace",
          "filter": [ "lowercase", "stop", "my_ngram_filter" ]
        }
      }
    }
  },

Slide 139

  "mappings": {
    "properties": {
      "quote": {
        "type": "text",
        "fields": {
          "lowercase": { "type": "text", "analyzer": "my_lowercase_analyzer" },
          "full": { "type": "text", "analyzer": "my_full_analyzer" },
          "ngram": { "type": "text", "analyzer": "my_ngram_analyzer" }
        }
      }
    }
  }
}

Slide 140

POST /_reindex
{
  "source": { "index": "starwars" },
  "dest": { "index": "starwars_v42" }
}

Slide 141

PUT _alias
{
  "actions": [
    { "add": { "index": "starwars_v42", "alias": "starwars_extended" } }
  ]
}

Slide 142

Aliases Atomic remove and add Point to multiple indices (read-only)

Slide 143

POST /starwars_extended/_search?explain=true
{
  "query": {
    "multi_match": {
      "query": "obiwan",
      "fields": [ "quote", "quote.lowercase", "quote.full", "quote.ngram" ],
      "type": "most_fields"
    }
  }
}

Slide 144

…
  "hits": {
    "total": 1,
    "max_score": 0.4912064,
    "hits": [
      {
        "_shard": "[starwars_v42][2]",
        "_node": "BCDwzJ4WSw2dyoGLTzwlqw",
        "_index": "starwars_v42", "_type": "_doc", "_id": "2", "_score": 0.4912064,
        "_source": { "quote": "Obi-Wan never told you what happened to your father." },
        …

Slide 145

Whitespace Tokenizer
"weight(Synonym(quote.ngram:biw quote.ngram:iwa quote.ngram:obi quote.ngram:wan) in 0) [PerFieldSimilarity], result of:"

Slide 146

POST /starwars_extended/_search
{
  "query": {
    "multi_match": {
      "query": "you",
      "fields": [ "quote", "quote.lowercase", "quote.full^5", "quote.ngram" ],
      "type": "best_fields"
    }
  }
}

Slide 147

"hits": [
  {
    "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 1.6022799,
    "_source": { "quote": "These are <em>not</em> the droids you are looking for." }
  },
  {
    "_index": "starwars_v42", "_type": "_doc", "_id": "2", "_score": 1.4997643,
    "_source": { "quote": "Obi-Wan never told you what happened to your father." }
  },
  {
    "_index": "starwars_v42", "_type": "_doc", "_id": "3", "_score": 0.38650417,
    "_source": { "quote": "<b>No</b>. I am your father." }
  }
]

Slide 148

Multi Match Type
best_fields: Score of the best field (default)
cross_fields: All terms in at least one field
most_fields: Score sum of all fields
phrase
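How best_fields and most_fields turn per-field scores into one document score, following the descriptions on this slide. The per-field scores and the function name are made up for illustration, and tie_breaker (0.0 by default for best_fields) is included only to show where the other fields could come back in.

```python
def combine_field_scores(field_scores, match_type="best_fields", tie_breaker=0.0):
    """Combine hypothetical per-field scores into a single document score."""
    scores = [score for score in field_scores.values() if score > 0]
    if not scores:
        return 0.0
    if match_type == "best_fields":
        best = max(scores)
        # Other matching fields only count if a tie_breaker is set.
        return best + tie_breaker * (sum(scores) - best)
    if match_type == "most_fields":
        return sum(scores)
    raise ValueError(f"unsupported type in this sketch: {match_type}")

# Made-up scores for one document across the sub-fields of "quote".
field_scores = {"quote": 0.5, "quote.lowercase": 0.5, "quote.full": 1.5, "quote.ngram": 0.0}

print(combine_field_scores(field_scores, "best_fields"))  # 1.5
print(combine_field_scores(field_scores, "most_fields"))  # 2.5
```

With best_fields a single strong field decides the score; with most_fields every additional matching field adds to it.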

Slide 149

Different Analyzers for Indexing and Searching Per query In the mapping

Slide 150

POST /starwars_extended/_search
{
  "query": {
    "match": {
      "quote.ngram": { "query": "the", "analyzer": "standard" }
    }
  }
}

Slide 151

…
"hits": [
  {
    "_index": "starwars_extended", "_type": "_doc", "_id": "2", "_score": 0.38254172,
    "_source": { "quote": "Obi-Wan never told you what happened to your father." }
  },
  {
    "_index": "starwars_extended", "_type": "_doc", "_id": "3", "_score": 0.36165747,
    "_source": { "quote": "<b>No</b>. I am your father." }
  }
]
…

Slide 152

Edge Gram vs Trigram Extending a mapping Testing a custom mapping

Slide 153

POST /starwars_extended/_close
PUT /starwars_extended/_settings
{
  "analysis": {
    "filter": {
      "my_edgegram_filter": { "type": "edge_ngram", "min_gram": 3, "max_gram": 10 }
    },
    "analyzer": {
      "my_edgegram_analyzer": {
        "char_filter": [ "html_strip" ],
        "tokenizer": "standard",
        "filter": [ "lowercase", "my_edgegram_filter" ]
      }
    }
  }
}
POST /starwars_extended/_open

Slide 154

GET starwars_extended/_analyze
{
  "text": "Father",
  "analyzer": "my_edgegram_analyzer"
}

Slide 155

{
  "tokens": [
    { "token": "fat", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 },
    { "token": "fath", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 },
    { "token": "fathe", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 },
    { "token": "father", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }
  ]
}

Slide 156

PUT /starwars_extended/_mapping
{
  "properties": {
    "quote": {
      "type": "text",
      "fields": {
        "edgegram": {
          "type": "text",
          "analyzer": "my_edgegram_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

Slide 157

PUT /starwars_extended/_doc/4
{ "quote": "I find your lack of faith disturbing." }
PUT /starwars_extended/_doc/5
{ "quote": "That… is your failure." }

Slide 158

GET /starwars_extended/_termvectors/4
{
  "fields": [ "quote.edgegram" ],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

Slide 159

{
  "_index": "starwars_v42", "_type": "_doc", "_id": "4", "_version": 1,
  "found": true,
  "took": 3,
  "term_vectors": {
    "quote.edgegram": {
      "field_statistics": { "sum_doc_freq": 26, "doc_count": 2, "sum_ttf": 26 },
      "terms": {
        "dis": {
          "doc_freq": 1, "ttf": 1, "term_freq": 1,
          "tokens": [ { "position": 6, "start_offset": 26, "end_offset": 36 } ]
        },
        "dist": {
          "doc_freq": 1, "ttf": 1,
          …

Slide 160

POST /starwars_extended/_search
{
  "query": { "match": { "quote": "fail" } }
}

Slide 161

POST /starwars_extended/_search
{
  "query": { "match": { "quote.lowercase": "fail" } }
}

Slide 162

POST /starwars_extended/_search
{
  "query": { "match": { "quote.full": "fail" } }
}

Slide 163

POST /starwars_extended/_search
{
  "query": { "match": { "quote.ngram": "fail" } }
}

Slide 164

…
"hits": {
  "total": 2,
  "max_score": 1.0135446,
  "hits": [
    {
      "_index": "starwars_v42", "_type": "_doc", "_id": "4", "_score": 1.0135446,
      "_source": { "quote": "I find your lack of faith disturbing." }
    },
    {
      "_index": "starwars_v42", "_type": "_doc", "_id": "5", "_score": 0.50476736,
      "_source": { "quote": "That… is your failure." }
    }
  ]
…

Slide 165

POST /starwars_extended/_search
{
  "query": { "match": { "quote.edgegram": "fail" } }
}

Slide 166

…
"hits": {
  "total": 1,
  "max_score": 0.39556286,
  "hits": [
    {
      "_index": "starwars_v42", "_type": "_doc", "_id": "5", "_score": 0.39556286,
      "_source": { "quote": "That… is your failure." }
    }
  ]
…