Full-Text Search Explained

A presentation at Barcamp Armenia in June 2018 in Yerevan, Armenia by Philipp Krenn

Slide 1

Full-Text Search Explained Philipp Krenn @xeraa

Slide 2

Developer !

Slide 3

Who uses a Database?

Slide 4

Who uses Search?

Slide 5

Slide 6

Slide 7

Store

Slide 8

Apache Lucene Elasticsearch

Slide 9

Slide 10

Example These are <em>not</em> the droids you are looking for.

Slide 11

html_strip Char Filter These are not the droids you are looking for.

Slide 12

standard Tokenizer These | are | not | the | droids | you | are | looking | for

Slide 13

lowercase Token Filter these | are | not | the | droids | you | are | looking | for

Slide 14

stop Token Filter droids | you | looking

Slide 15

snowball Token Filter droid | you | look
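
The chain of char filter, tokenizer, and token filters shown on the previous slides can be imitated in a few lines of Python. This is only a rough sketch: the regular expressions and the `toy_stem` helper are assumptions made for illustration, and `toy_stem` is a crude suffix stripper, not the Snowball algorithm that Elasticsearch uses.

```python
import re
from html import unescape

# English stop words as listed in the talk (Lucene's default set).
STOP_WORDS = set(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)

def html_strip(text):
    """Char filter: drop HTML tags and decode entities."""
    return unescape(re.sub(r"<[^>]+>", "", text))

def tokenize(text):
    """Rough stand-in for the standard tokenizer: runs of letters or digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def toy_stem(token):
    """Toy suffix stripping (hypothetical helper, NOT Snowball)."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = tokenize(html_strip(text))                   # char filter + tokenizer
    tokens = [t.lower() for t in tokens]                  # lowercase filter
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop filter
    return [toy_stem(t) for t in tokens]                  # stemming filter

print(analyze("These are <em>not</em> the droids you are looking for."))
```

The order matters: stop words are removed after lowercasing, so "These" is matched against the lowercase stop word list.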

Slide 16

Analyze

Slide 17

GET /_analyze {

"analyzer" : "english" ,

"text" : "These are not the droids you are looking for." }

Slide 18

{

"tokens" : [ {

"token" : "droid" ,

"start_offset" : 18 ,

"end_offset" : 24 ,

"type" : "<ALPHANUM>" ,

"position" : 4 }, {

"token" : "you" ,

"start_offset" : 25 ,

"end_offset" : 28 ,

"type" : "<ALPHANUM>" ,

"position" : 5 }, ... ] }

Slide 19

GET /_analyze {

"char_filter" : [

"html_strip" ],

"tokenizer" : "standard" ,

"filter" : [

"lowercase" ,

"stop" ,

"snowball" ],

"text" : "These are <em>not</em> the droids you are looking for." }

Slide 20

{

"tokens" : [ {

"token" : "droid" ,

"start_offset" : 27 ,

"end_offset" : 33 ,

"type" : "<ALPHANUM>" ,

"position" : 4 }, {

"token" : "you" ,

"start_offset" : 34 ,

"end_offset" : 37 ,

"type" : "<ALPHANUM>" ,

"position" : 5 }, ... ] }

Slide 21

Stop Words a an and are as at be but by for if in into is it no not of on or such that the their then there these they this to was will with https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardAnalyzer.java#L44-L50

Slide 22

Always Use Stop Words?

Slide 23

To be, or not to be.

Slide 24

Armenian Սրանք ձեր փնտրած դրոիդները չեն։

Slide 25

Armenian սրան | ձեր | փնտրած | դրոիդները | չեն

Slide 26

Armenian with the English Analyzer սրանք | ձեր | փնտրած | դրոիդները | չեն

Slide 27

Armenian Stop Words https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis/hy/stopwords.txt

Slide 28

Russian Это не те дроиды, которых ты ищешь.

Slide 29

Russian эт | те | дроид | котор | ищеш

Slide 30

Russian Stop Words https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/russian_stop.txt

Slide 31

Detecting Languages https://github.com/spinscale/elasticsearch-ingest-langdetect

Slide 32

Languages Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Turkish, Thai

Slide 33

More Language Plugins
Core: ICU (Asian languages), Kuromoji (advanced Japanese), Phonetic, SmartCN, Stempel (better Polish stemming), Ukrainian (stemming)
Community: Hebrew, Vietnamese, Network Address Analysis, String2Integer, ...

Slide 34

Language Rules
English: Philipp's → philipp
French: l'église → eglis
German: äußerst → ausserst

Slide 35

Another Example Obi-Wan never told you what happened to your father.

Slide 36

Another Example obi | wan | never | told | you | what | happen | your | father

Slide 37

Another Example <b>No</b>. I am your father.

Slide 38

Another Example i | am | your | father

Slide 39

Inverted Index

Term     ID 1   ID 2   ID 3
am       0      0      1[2]
droid    1[4]   0      0
father   0      1[9]   1[4]
happen   0      1[6]   0
i        0      0      1[1]
look     1[7]   0      0
never    0      1[2]   0
obi      0      1[0]   0
told     0      1[3]   0
wan      0      1[1]   0
what     0      1[5]   0
you      1[5]   1[4]   0
your     0      1[8]   1[3]
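
Each cell of the inverted index holds the term frequency in that document and, in brackets, the token position. A minimal Python sketch of how such a structure could be built from the analyzed tokens follows. The function name and the dictionary layout are assumptions for illustration; Lucene's on-disk format is far more compact.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} from (position, term) pairs."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in tokens:
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)

# Analyzed tokens of the three quotes, with the positions shown in the table.
docs = {
    1: [(4, "droid"), (5, "you"), (7, "look")],
    2: [(0, "obi"), (1, "wan"), (2, "never"), (3, "told"), (4, "you"),
        (5, "what"), (6, "happen"), (8, "your"), (9, "father")],
    3: [(1, "i"), (2, "am"), (3, "your"), (4, "father")],
}

index = build_inverted_index(docs)
print(index["father"])  # {2: [9], 3: [4]}
```

A term lookup is then a single dictionary access, which is why search does not need to scan the original documents.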

Slide 40

To / The Index

Slide 41

PUT /starwars {

"settings" : {

"number_of_shards" : 1 ,

"analysis" : {

"filter" : {

"my_synonym_filter" : {

"type" : "synonym" ,

"synonyms" : [

"father,dad" ,

"droid => droid,machine" ] } },

Slide 42

"analyzer" : {

"my_analyzer" : {

"char_filter" : [

"html_strip" ],

"tokenizer" : "standard" ,

"filter" : [

"lowercase" ,

"stop" ,

"snowball" ,

"my_synonym_filter" ] } } } },

Slide 43

"mappings" : {

"_doc" : {

"properties" : {

"quote" : {

"type" : "text" ,

"analyzer" : "my_analyzer" } } } } }

Slide 44

Synonyms: index time (synonym) or query time (synonym_graph)
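
The two rule styles used in my_synonym_filter behave differently: "father,dad" is an equivalence group, while "droid => droid,machine" is a one-way mapping. The Python sketch below imitates that expansion under simplifying assumptions: `parse_rules` and `expand` are hypothetical helpers, and unlike the real filter they emit synonyms as a flat list instead of stacking them at the same token position.

```python
def parse_rules(rules):
    """Turn Solr-style synonym rules into a {token: [replacements]} mapping."""
    mapping = {}
    for rule in rules:
        if "=>" in rule:
            # Explicit mapping: left-hand terms are replaced by the right-hand terms.
            left, right = rule.split("=>")
            targets = [t.strip() for t in right.split(",")]
            for source in left.split(","):
                mapping[source.strip()] = targets
        else:
            # Equivalence group: every term expands to the whole group.
            group = [t.strip() for t in rule.split(",")]
            for term in group:
                mapping[term] = group
    return mapping

def expand(tokens, mapping):
    """Replace each token by its synonyms; unknown tokens pass through."""
    out = []
    for token in tokens:
        out.extend(mapping.get(token, [token]))
    return out

mapping = parse_rules(["father,dad", "droid => droid,machine"])
print(expand(["droid"], mapping))    # ['droid', 'machine']
print(expand(["machine"], mapping))  # ['machine']
```

Because the second rule is one-way, "machine" is not expanded to "droid", while a document containing "droid" is still indexed under both terms.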

Slide 45

GET /starwars/_mapping

GET /starwars/_settings

Slide 46

PUT /starwars/_doc/1 {

"quote" : "These are <em>not</em> the droids you are looking for." }

PUT /starwars/_doc/2 {

"quote" : "Obi-Wan never told you what happened to your father." }

PUT /starwars/_doc/3 {

"quote" : "<b>No</b>. I am your father." }

Slide 47

GET /starwars/_doc/1

GET /starwars/_doc/1/_source

Slide 48

Multi Lingual Index PUT /starwars_en/_doc/1 Type Field { "quote_en": "...", "quote_de": "..." }

Slide 49

PS: Single Type per Index

Slide 50

Search

Slide 51

POST /starwars/_search {

"query" : {

"match_all" : { } } }

Slide 52

GET vs POST

Slide 53

{

"took" : 1 ,

"timed_out" : false ,

"_shards" : {

"total" : 5 ,

"successful" : 5 ,

"failed" : 0 },

"hits" : {

"total" : 3 ,

"max_score" : 1 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 1 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } }, ...

Slide 54

POST /starwars/_search {

"query" : {

"match" : {

"quote" : "droid" } } }

Slide 55

{

"took" : 2 ,

"timed_out" : false ,

"_shards" : {

"total" : 5 ,

"successful" : 5 ,

"failed" : 0 },

"hits" : {

"total" : 1 ,

"max_score" : 0.39556286 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "1" ,

"_score" : 0.39556286 ,

"_source" : {

"quote" : "These are <em>not</em> the droids you are looking for." } } ] } }

Slide 56

POST /starwars/_search {

"query" : {

"match" : {

"quote" : "dad" } } }

Slide 57

...

"hits" : {

"total" : 2 ,

"max_score" : 0.41913947 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 0.41913947 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } }, {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 0.39291072 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } } ] } }

Slide 58

POST /starwars/_doc/0/_explain {

"query" : {

"match" : {

"quote" : "dad" } } }

Slide 59

{

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "0" ,

"matched" : false }

Slide 60

POST /starwars/_doc/1/_explain {

"query" : {

"match" : {

"quote" : "dad" } } }

Slide 61

{

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "1" ,

"matched" : false ,

"explanation" : {

"value" : 0 ,

"description" : "no matching term" ,

"details" : [] } }

Slide 62

POST /starwars/_doc/2/_explain {

"query" : {

"match" : {

"quote" : "dad" } } }

Slide 63

{

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"matched" : true ,

"explanation" : { ...

Slide 64

POST /starwars/_search {

"query" : {

"match" : {

"quote" : "machine" } } }

Slide 65

{

"took" : 2 ,

"timed_out" : false ,

"_shards" : {

"total" : 1 ,

"successful" : 1 ,

"skipped" : 0 ,

"failed" : 0 },

"hits" : {

"total" : 1 ,

"max_score" : 1.2499592 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "1" ,

"_score" : 1.2499592 ,

"_source" : {

"quote" : "These are <em>not</em> the droids you are looking for." } } ] } }

Slide 66

POST /starwars/_search {

"query" : {

"match_phrase" : {

"quote" : "I am your father" } } }

Slide 67

{

"took" : 3 ,

"timed_out" : false ,

"_shards" : {

"total" : 5 ,

"successful" : 5 ,

"failed" : 0 },

"hits" : {

"total" : 1 ,

"max_score" : 1.5665855 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 1.5665855 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } } ] } }

Slide 68

POST /starwars/_search {

"query" : {

"match_phrase" : {

"quote" : {

"query" : "I am father" ,

"slop" : 1 } } } }

Slide 69

{

"took" : 16 ,

"timed_out" : false ,

"_shards" : {

"total" : 5 ,

"successful" : 5 ,

"failed" : 0 },

"hits" : {

"total" : 1 ,

"max_score" : 0.8327639 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 0.8327639 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } } ] } }

Slide 70

POST /starwars/_search {

"query" : {

"match_phrase" : {

"quote" : {

"query" : "I am not your father" ,

"slop" : 1 } } } }

Slide 71

{

"took" : 5 ,

"timed_out" : false ,

"_shards" : {

"total" : 5 ,

"successful" : 5 ,

"failed" : 0 },

"hits" : {

"total" : 1 ,

"max_score" : 1.0409548 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 1.0409548 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } } ] } }

Slide 72

POST /starwars/_search {

"query" : {

"match" : {

"quote" : {

"query" : "van" ,

"fuzziness" : "AUTO" } } } }

Slide 73

{

"took" : 14 ,

"timed_out" : false ,

"_shards" : {

"total" : 5 ,

"successful" : 5 ,

"failed" : 0 },

"hits" : {

"total" : 1 ,

"max_score" : 0.18155496 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 0.18155496 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } } ] } }

Slide 74

POST /starwars/_search {

"query" : {

"match" : {

"quote" : {

"query" : "ovi-van" ,

"fuzziness" : 1 } } } }

Slide 75

{

"took" : 109 ,

"timed_out" : false ,

"_shards" : {

"total" : 5 ,

"successful" : 5 ,

"failed" : 0 },

"hits" : {

"total" : 1 ,

"max_score" : 0.3798467 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 0.3798467 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } } ] } }

Slide 76

FuzzyQuery History
http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html
Before: Brute force
Now: Levenshtein Automaton

Slide 77

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
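
The fuzziness parameter in the earlier queries is an edit distance: "van" matches "wan" because one substitution turns one into the other. The sketch below computes that distance with the textbook dynamic-programming approach. It is only meant to make the metric concrete; it is not the Levenshtein automaton that Lucene builds, which avoids comparing the query against every term.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

print(levenshtein("van", "wan"))  # 1
print(levenshtein("ovi", "obi"))  # 1
```

Both query terms from the fuzzy examples are within distance 1 of an indexed term, which is why document 2 is returned.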

Slide 78

SELECT *
FROM starwars
WHERE quote LIKE '_an'
   OR quote LIKE 'V_n'
   OR quote LIKE 'Va_'

Slide 79

Scoring

Slide 80

Term Frequency / Inverse Document Frequency (TF/IDF) Search one term

Slide 81

BM25 Default since Elasticsearch 5.0 https://speakerdeck.com/elastic/improved-text-scoring-with-bm25

Slide 82

Term Frequency

Slide 83

Slide 84

Inverse Document Frequency

Slide 85

Slide 86

Field-Length Norm
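
The formula slides for these three factors did not survive as text. As a hedged reconstruction, the sketch below uses the classic Lucene TF/IDF definitions as they are commonly documented (square root of the term frequency, logarithmic inverse document frequency, inverse square root of the field length). Treat the function names and the exact constants as assumptions rather than the content of the original slides.

```python
import math

def term_frequency(freq):
    """More occurrences of a term in a field raise the score, with diminishing returns."""
    return math.sqrt(freq)

def inverse_document_frequency(num_docs, doc_freq):
    """Terms that appear in fewer documents weigh more."""
    return 1 + math.log(num_docs / (doc_freq + 1))

def field_length_norm(num_terms):
    """A match in a short field counts more than a match in a long one."""
    return 1 / math.sqrt(num_terms)

print(term_frequency(4))                  # 2.0
print(inverse_document_frequency(3, 2))   # 1.0
print(field_length_norm(4))               # 0.5
```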

Slide 87

POST /starwars/_search?explain=true {

"query" : {

"match" : {

"quote" : "father" } } }

Slide 88

... "_explanation" : {

"value" : 0.41913947 ,

"description" : "weight(Synonym(quote:dad quote:father) in 0) [PerFieldSimilarity], result of:" ,

"details" : [ {

"value" : 0.41913947 ,

"description" : "score(doc=0,freq=2.0 = termFreq=2.0 ), product of:" ,

"details" : [ {

"value" : 0.2876821 ,

"description" : "idf(docFreq=1, docCount=1)" ,

"details" : [] }, {

"value" : 1.4569536 ,

"description" : "tfNorm, computed from:" ,

"details" : [ {

"value" : 2 ,

"description" : "termFreq=2.0" ,

"details" : [] }, ...

Slide 89

Score
0.41913947: i | am | your | father
0.39291072: obi | wan | never | told | you | what | happen | your | father
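
The 0.41913947 score is the product of the idf and tfNorm values in the explanation output. The Python sketch below recomputes both with the standard BM25 formulas. The docCount, docFreq, and termFreq inputs come from the explanation; the field length ratio of 0.8 is an assumption, because the explanation is truncated before it shows the field length and the average field length.

```python
import math

def bm25_idf(doc_count, doc_freq):
    """BM25 inverse document frequency as reported by the explain output."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf_norm(freq, length_ratio, k1=1.2, b=0.75):
    """BM25 term frequency part; length_ratio = field length / average field length."""
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))

idf = bm25_idf(doc_count=1, doc_freq=1)             # values from the explanation
tf_norm = bm25_tf_norm(freq=2.0, length_ratio=0.8)  # ratio is an assumed value
score = idf * tf_norm

print(idf, tf_norm, score)
```

The term frequency is 2 although "father" occurs once in the quote, because the synonym filter indexed "dad" at the same position and the query treats both as one synonym term.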

Slide 90

Vector Space Model Search multiple terms

Slide 91

Search your father

Slide 92

Slide 93

Coordination Factor Reward multiple terms

Slide 94

Search for 3 terms 1 term: 2 terms: 3 terms:

Slide 95

Practical Scoring Function Putting it all together

Slide 96

score(q,d) = queryNorm(q) · coord(q,d) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

Slide 97

Function Score Script, weight, random, field value, decay (geo or date)

Slide 98

POST /starwars/_search {

"query" : {

"function_score" : {

"query" : {

"match" : {

"quote" : "father" } },

"random_score" : {} } } }

Slide 99

Compare Scores "100% perfect" vs a "50%" match

Slide 100

Don't do this. Seriously. Stop trying to think about your problem this way, it's not going to end well.
Source: https://wiki.apache.org/lucene-java/ScoresAsPercentages

Slide 101

GET /starwars/_analyze {

"analyzer" : "my_analyzer" ,

"text" : "These are my father's machines." }

Slide 102

{ "tokens" : [ {

"token" : "my" ,

"start_offset" : 10 ,

"end_offset" : 12 ,

"type" : "<ALPHANUM>" ,

"position" : 2 }, {

"token" : "father" ,

"start_offset" : 13 ,

"end_offset" : 21 ,

"type" : "<ALPHANUM>" ,

"position" : 3 }, {

"token" : "dad" ,

"start_offset" : 13 ,

"end_offset" : 21 ,

"type" : "SYNONYM" ,

"position" : 3 }, {

"token" : "machin" ,

"start_offset" : 22 ,

"end_offset" : 30 ,

"type" : "<ALPHANUM>" ,

"position" : 4 } ] }

Slide 103

PUT /starwars/_doc/4 {

"quote" : "These are my father's machines." }

Slide 104

POST /starwars/_search {

"query" : {

"match" : {

"quote" : "my father machine" } } }

Slide 105

"hits" : {

"total" : 4 ,

"max_score" : 2.92523 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "4" ,

"_score" : 2.92523 ,

"_source" : {

"quote" : "These are my father's machines." } }, {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "1" ,

"_score" : 0.8617505 ,

"_source" : {

"quote" : "These are <em>not</em> the droids you are looking for." } }, ...

Slide 106

2.92523 == 100%

Slide 107

DELETE /starwars/_doc/4

POST /starwars/_search {

"query" : {

"match" : {

"quote" : "my father machine" } } }

Slide 108

"hits" : {

"total" : 3 ,

"max_score" : 1.2499592 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "1" ,

"_score" : 1.2499592 ,

"_source" : {

"quote" : "These are <em>not</em> the droids you are looking for." } }, ...

Slide 109

1.2499592 == 43% or 100%?

Slide 110

PUT /starwars/_doc/4 {

"quote" : "These droids are my father's father's machines." }

POST /starwars/_search {

"query" : {

"match" : {

"quote" : "my father machine" } } }

Slide 111

"hits" : {

"total" : 4 ,

"max_score" : 3.0068164 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "4" ,

"_score" : 3.0068164 ,

"_source" : {

"quote" : "These droids are my father's father's machines." } }, {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "1" ,

"_score" : 0.89701396 ,

"_source" : {

"quote" : "These are <em>not</em> the droids you are looking for." } }, ...

Slide 112

3.0068164 == 103%?
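
The percentages on these slides come from dividing each top score by 2.92523, the score that was declared to be "100%". A short Python check of that arithmetic, using only the scores shown above (the helper name `as_percentage` is made up for this sketch):

```python
BASELINE = 2.92523  # top score while document 4 was "These are my father's machines."

def as_percentage(score, best=BASELINE):
    """Express a score relative to the score that was treated as a perfect match."""
    return round(100 * score / best)

print(as_percentage(1.2499592))  # 43
print(as_percentage(3.0068164))  # 103
```

The same query yields 43% after deleting a document and 103% after indexing a different one, which shows that scores are only comparable within a single result list.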

Slide 113

Slide 114

PS: Shards Default? Effect on IDF?

Slide 115

Distributed Frequency Search GET starwars/_search?search_type=dfs_query_then_fetch { ... }

Slide 116

Don’t use dfs_query_then_fetch in production. It really isn’t required.
Source: https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-is-broken.html

Slide 117

Performance

Slide 118

Slide 119

Slide 120

Conclusion

Slide 121

Indexing
Formatting
Tokenize
Lowercase, Stop Words, Stemming
Synonyms

Slide 122

Scoring
Term Frequency
Inverse Document Frequency
Field-Length Norm
Vector Space Model

Slide 123

Advanced Queries
Highlighting
NGrams & Edge Grams
Multiple Analyzers
Reindex & Alias

Slide 124

There is more Elastic Stack

Slide 125

Slide 126

Slide 127

https://cloud.elastic.co

Slide 128

Slide 129

Thank You! Questions? Philipp Krenn @xeraa PS: Stickers

Slide 130

More

Slide 131

POST /starwars/_search {

"query" : {

"match" : {

"quote" : "father" } },

"highlight" : {

"type" : "unified" ,

"pre_tags" : [

"<tag>" ],

"post_tags" : [

"</tag>" ],

"fields" : {

"quote" : {} } } }

Slide 132

... "hits" : {

"total" : 3 ,

"max_score" : 0.631961 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "4" ,

"_score" : 0.631961 ,

"_source" : {

"quote" : "These droids are my father's father's machines." },

"highlight" : {

"quote" : [

"These droids are my <tag>father's</tag> <tag>father's</tag> machines." ] } }, ...

Slide 133

Boolean Queries must | must_not | should | filter

Slide 134

POST /starwars/_search {

"query" : {

"bool" : {

"must" : {

"match" : {

"quote" : "father" } },

"should" : [ {

"match" : {

"quote" : "your" } }, {

"match" : {

"quote" : "obi" } } ] } } }

Slide 135

... "hits" : {

"total" : 3 ,

"max_score" : 2.117857 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 2.117857 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } }, {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 1.3856719 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } }, ...

Slide 136

POST /starwars/_search {

"query" : {

"bool" : {

"filter" : {

"match" : {

"quote" : "father" } },

"should" : [ {

"match" : {

"quote" : "your" } }, {

"match" : {

"quote" : "obi" } } ] } } }

Slide 137

... "hits" : {

"total" : 3 ,

"max_score" : 1.6694657 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 1.6694657 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } }, {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 0.8317767 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } },

Slide 138

Named Queries & minimum_should_match

Slide 139

POST /starwars/_search {

"query" : {

"bool" : {

"must" : {

"match" : { "quote" : "father" } },

"should" : [ {

"match" : {

"quote" : { "query" : "your" , "_name" : "quote-your" } } }, {

"match" : {

"quote" : { "query" : "obi" , "_name" : "quote-obi" } } }, {

"match" : {

"quote" : { "query" : "droid" , "_name" : "quote-droid" } } } ],

"minimum_should_match" : 2 } } }

Slide 140

...

"hits" : {

"total" : 1 ,

"max_score" : 2.117857 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 2.117857 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." },

"matched_queries" : [

"quote-obi" ,

"quote-your" ] } ] } }

Slide 141

Boosting

>1 increase, <1 decrease, <0 punish

Slide 142

POST /starwars/_search {

"query" : {

"bool" : {

"must" : {

"match" : {

"quote" : "father" } },

"should" : [ {

"match" : {

"quote" : "your" } }, {

"match" : {

"quote" : {

"query" : "obi" ,

"boost" : 3 } } } ] } } }

Slide 143

... "hits" : {

"total" : 3 ,

"max_score" : 4.2368493 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 4.2368493 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } }, {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 1.3856719 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } }, ...

Slide 144

Search for father, but prefer father father

Slide 145

POST /starwars/_search {

"query" : {

"bool" : {

"must" : {

"match" : {

"quote" : "father father" } } } } }

Slide 146

...

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "4" ,

"_score" : 1.263922 ,

"_source" : {

"quote" : "These droids are my father's father's machines." } }, {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 1.1077905 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } },

Slide 147

POST /starwars/_search {

"query" : {

"bool" : {

"must" : {

"match" : {

"quote" : "father father" } },

"should" : {

"match_phrase" : {

"quote" : "father father" } } } } }

Slide 148

...

"hits" : {

"total" : 3 ,

"max_score" : 3.3799262 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "4" ,

"_score" : 3.3799262 ,

"_source" : {

"quote" : "These droids are my father's father's machines." } }, {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 1.1077905 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } }, ...

Slide 149

Suggestion
Suggest a similar text
_search end point
_suggest deprecated since 5.0

Slide 150

POST /starwars/_search {

"query" : {

"match" : {

"quote" : "drui" } },

"suggest" : {

"my_suggestion" : {

"text" : "drui" ,

"term" : {

"field" : "quote" } } } }

Slide 151

...

"hits" : {

"total" : 0 ,

"max_score" : null ,

"hits" : [] },

"suggest" : {

"my_suggestion" : [ {

"text" : "drui" ,

"offset" : 0 ,

"length" : 4 ,

"options" : [ {

"text" : "droid" ,

"score" : 0.5 ,

"freq" : 1 } ] } ] } }

Slide 152

NGram Partial matches Edge Gram
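
An n-gram is every substring of a given length inside a word, while an edge gram only keeps the substrings anchored at the start of the word. The two helpers below are a minimal Python sketch of that idea for a single, already lowercased word; the function names are made up for illustration and the ordering of the real tokenizers' output may differ in detail.

```python
def ngrams(word, min_gram, max_gram):
    """All substrings of word with a length between min_gram and max_gram."""
    return [
        word[start:start + size]
        for start in range(len(word))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(word)
    ]

def edge_ngrams(word, min_gram, max_gram):
    """Prefixes of word with a length between min_gram and max_gram."""
    longest = min(max_gram, len(word))
    return [word[:size] for size in range(min_gram, longest + 1)]

print(ngrams("these", 3, 3))        # ['the', 'hes', 'ese']
print(edge_ngrams("these", 1, 3))   # ['t', 'th', 'the']
print(edge_ngrams("father", 3, 10)) # ['fat', 'fath', 'fathe', 'father']
```

N-grams allow matches anywhere inside a word at the cost of a much larger index; edge grams are the cheaper choice for prefix matching such as search-as-you-type.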

Slide 153

GET /_analyze {

"char_filter" : [

"html_strip" ],

"tokenizer" : {

"type" : "ngram" ,

"min_gram" : "3" ,

"max_gram" : "3" ,

"token_chars" : [

"letter" ] },

"filter" : [

"lowercase" ],

"text" : "These are <em>not</em> the droids you are looking for." }

Slide 154

{

"tokens" : [ {

"token" : "the" ,

"start_offset" : 0 ,

"end_offset" : 3 ,

"type" : "word" ,

"position" : 0 }, {

"token" : "hes" ,

"start_offset" : 1 ,

"end_offset" : 4 ,

"type" : "word" ,

"position" : 1 }, {

"token" : "ese" ,

"start_offset" : 2 ,

"end_offset" : 5 ,

"type" : "word" ,

"position" : 2 }, {

"token" : "are" ,

"start_offset" : 6 ,

"end_offset" : 9 ,

"type" : "word" ,

"position" : 3 }, ...

Slide 155

GET /_analyze {

"char_filter" : [

"html_strip" ],

"tokenizer" : {

"type" : "edge_ngram" ,

"min_gram" : "1" ,

"max_gram" : "3" ,

"token_chars" : [

"letter" ] },

"filter" : [

"lowercase" ],

"text" : "These are <em>not</em> the droids you are looking for." }

Slide 156

{

"tokens" : [ {

"token" : "t" ,

"start_offset" : 0 ,

"end_offset" : 1 ,

"type" : "word" ,

"position" : 0 }, {

"token" : "th" ,

"start_offset" : 0 ,

"end_offset" : 2 ,

"type" : "word" ,

"position" : 1 }, {

"token" : "the" ,

"start_offset" : 0 ,

"end_offset" : 3 ,

"type" : "word" ,

"position" : 2 }, {

"token" : "a" ,

"start_offset" : 6 ,

"end_offset" : 7 ,

"type" : "word" ,

"position" : 3 }, {

"token" : "ar" ,

"start_offset" : 6 ,

"end_offset" : 8 ,

"type" : "word" ,

"position" : 4 }, ...

Slide 157

Combining Analyzers Reindex Store multiple times Tune BM25 Combine scores

Slide 158

BM25 Revisited

Slide 159

https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables

Slide 160

b: field length amplification, default 0.75
k1: term frequency saturation, default 1.2

Slide 161

PUT /starwars_v42 {

"settings" : {

"number_of_shards" : 1 ,

"index" : {

"similarity" : {

"default" : {

"type" : "BM25" ,

"b" : 0 ,

"k1" : 0 } } },

Slide 162

"analysis" : {

"filter" : {

"my_synonym_filter" : {

"type" : "synonym" ,

"synonyms" : [

"droid,machine" ,

"father,dad" ] },

"my_ngram_filter" : {

"type" : "ngram" ,

"min_gram" : "3" ,

"max_gram" : "3" ,

"token_chars" : [

"letter" ] } },

Slide 163

"analyzer" : {

"my_lowercase_analyzer" : {

"char_filter" : [

"html_strip" ],

"tokenizer" : "whitespace" ,

"filter" : [

"lowercase" ] },

"my_full_analyzer" : {

"char_filter" : [

"html_strip" ],

"tokenizer" : "standard" ,

"filter" : [

"lowercase" ,

"stop" ,

"snowball" ,

"my_synonym_filter" ] },

Slide 164

"my_ngram_analyzer" : {

"char_filter" : [

"html_strip" ],

"tokenizer" : "whitespace" ,

"filter" : [

"lowercase" ,

"stop" ,

"my_ngram_filter" ] } } } },

Slide 165

"mappings" : {

"_doc" : {

"properties" : {

"quote" : {

"type" : "text" ,

"fields" : {

"lowercase" : {

"type" : "text" ,

"analyzer" : "my_lowercase_analyzer" },

"full" : {

"type" : "text" ,

"analyzer" : "my_full_analyzer" },

"ngram" : {

"type" : "text" ,

"analyzer" : "my_ngram_analyzer" } } } } } } }

Slide 166

POST /_reindex {

"source" : {

"index" : "starwars" },

"dest" : {

"index" : "starwars_v42" } }

Slide 167

Aliases
Atomic remove and add
Point to multiple indices (read-only)

Slide 168

POST /_aliases {

"actions" : [ {

"add" : {

"index" : "starwars_v42" ,

"alias" : "starwars_extended" } } ] }

Slide 169

POST /starwars/_search {

"query" : {

"match" : {

"quote" : "droid" } } }

Slide 170

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "4" ,

"_score" : 1.1533037 ,

"_source" : {

"quote" : "These droids are my father's father's machines." } }, {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "1" ,

"_score" : 1.1295731 ,

"_source" : {

"quote" : "These are <em>not</em> the droids you are looking for." } } ]

Slide 171

POST /starwars_extended/_search {

"query" : {

"match" : {

"quote.full" : "droid" } } }

Slide 172

"hits" : [ {

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "1" ,

"_score" : 0.6931472 ,

"_source" : {

"quote" : "These are <em>not</em> the droids you are looking for." } }, {

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "4" ,

"_score" : 0.6931472 ,

"_source" : {

"quote" : "These droids are my father's father's machines." } } ]

Slide 173

There are no "best" b and k1 values
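
With b and k1 both set to 0, as in the starwars_v42 settings, BM25 ignores field length and term frequency, so every matching document scores exactly the IDF of the term. That explains why both hits for "droid" score 0.6931472. The Python sketch below reproduces this; the document count of 4 and document frequency of 2 are inferred from the four indexed quotes and the two hits, and the term frequencies and length ratios are arbitrary illustrative values.

```python
import math

def bm25_score(freq, doc_count, doc_freq, length_ratio, k1, b):
    """Single-term BM25 score; length_ratio = field length / average field length."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf_norm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))
    return idf * tf_norm

# Two very different documents (assumed values) score the same when k1 = b = 0.
short_doc = bm25_score(freq=1, doc_count=4, doc_freq=2, length_ratio=0.5, k1=0, b=0)
long_doc = bm25_score(freq=3, doc_count=4, doc_freq=2, length_ratio=2.0, k1=0, b=0)
print(short_doc, long_doc)

# With the defaults, the same inputs are ranked differently.
print(bm25_score(1, 4, 2, 0.5, k1=1.2, b=0.75), bm25_score(3, 4, 2, 2.0, k1=1.2, b=0.75))
```

Setting both parameters to 0 is useful as a demonstration of what they control, not as a recommended configuration.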

Slide 174

POST /starwars_extended/_search?explain=true {

"query" : {

"multi_match" : {

"query" : "obiwan" ,

"fields" : [

"quote" ,

"quote.lowercase" ,

"quote.full" ,

"quote.ngram" ],

"type" : "most_fields" } } }

Slide 175

... "hits" : {

"total" : 1 ,

"max_score" : 0.4912064 ,

"hits" : [ {

"_shard" : "[starwars_v42][2]" ,

"_node" : "BCDwzJ4WSw2dyoGLTzwlqw" ,

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 0.4912064 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." }, ...

Slide 176

Whitespace Tokenizer "weight( Synonym(quote.ngram:biw quote.ngram:iwa quote.ngram:obi quote.ngram:wan) in 0) [PerFieldSimilarity], result of:"

Slide 177

POST /starwars_extended/_search {

"query" : {

"multi_match" : {

"query" : "you" ,

"fields" : [

"quote" ,

"quote.lowercase^5" ,

"quote.full" ,

"quote.ngram" ],

"type" : "best_fields" } } }

Slide 178

"hits" : [ {

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "1" ,

"_score" : 2.1939285 ,

"_source" : {

"quote" : "These are <em>not</em> the droids you are looking for." } }, {

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 2.1939285 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } }, {

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 0.1990188 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } } ]

Slide 179

Multi Match Type
best_fields: Score of the best field (default)
cross_fields: All terms in at least one field
most_fields: Score sum of all fields
phrase
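
The difference between best_fields and most_fields comes down to how the per-field scores are combined. The sketch below illustrates that combination step in Python. The per-field scores are invented example values and `combine` is a hypothetical helper; the `tie_breaker` parameter mirrors the multi_match option of the same name, which adds a fraction of the other fields' scores to the best one.

```python
def combine(field_scores, match_type, tie_breaker=0.0):
    """Combine per-field scores the way the multi_match types describe."""
    scores = [score for score in field_scores.values() if score > 0]
    if not scores:
        return 0.0
    if match_type == "most_fields":
        return sum(scores)  # every matching field contributes
    if match_type == "best_fields":
        best = max(scores)  # the best field wins, others only break ties
        return best + tie_breaker * (sum(scores) - best)
    raise ValueError(f"unsupported type: {match_type}")

# Invented per-field scores for one document.
field_scores = {"quote": 0.25, "quote.lowercase": 1.0, "quote.ngram": 0.5}

print(combine(field_scores, "best_fields"))                   # 1.0
print(combine(field_scores, "most_fields"))                   # 1.75
print(combine(field_scores, "best_fields", tie_breaker=0.5))  # 1.375
```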

Slide 180

Different Analyzers for Indexing and Searching
Per query
In the mapping

Slide 181

POST /starwars_extended/_search {

"query" : {

"match" : {

"quote.ngram" : {

"query" : "the" ,

"analyzer" : "standard" } } } }

Slide 182

... "hits" : [ {

"_index" : "starwars_extended" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 0.38254172 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } }, {

"_index" : "starwars_extended" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 0.36165747 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } } ] ...

Slide 183

Edge Gram vs Trigram Extending a mapping Testing a custom mapping

Slide 184

POST /starwars_extended/_close

PUT /starwars_extended/_settings {

"analysis" : {

"filter" : {

"my_edgegram_filter" : {

"type" : "edge_ngram" ,

"min_gram" : 3 ,

"max_gram" : 10 } },

"analyzer" : {

"my_edgegram_analyzer" : {

"char_filter" : [

"html_strip" ],

"tokenizer" : "standard" ,

"filter" : [

"lowercase" ,

"my_edgegram_filter" ] } } } }

POST /starwars_extended/_open

Slide 185

GET starwars_extended/_analyze {

"text" : "Father" ,

"analyzer" : "my_edgegram_analyzer" }

Slide 186

{

"tokens" : [ {

"token" : "fat" ,

"start_offset" : 0 ,

"end_offset" : 6 ,

"type" : "<ALPHANUM>" ,

"position" : 0 }, {

"token" : "fath" ,

"start_offset" : 0 ,

"end_offset" : 6 ,

"type" : "<ALPHANUM>" ,

"position" : 0 }, {

"token" : "fathe" ,

"start_offset" : 0 ,

"end_offset" : 6 ,

"type" : "<ALPHANUM>" ,

"position" : 0 }, {

"token" : "father" ,

"start_offset" : 0 ,

"end_offset" : 6 ,

"type" : "<ALPHANUM>" ,

"position" : 0 } ] }

Slide 187

PUT /starwars_extended/_doc/_mapping {

"properties" : {

"quote" : {

"type" : "text" ,

"fields" : {

"edgegram" : {

"type" : "text" ,

"analyzer" : "my_edgegram_analyzer" ,

"search_analyzer" : "standard" } } } } }

Slide 188

PUT /starwars_extended/_doc/4 {

"quote" : "I find your lack of faith disturbing." }

PUT /starwars_extended/_doc/5 {

"quote" : "That... is your failure." }

Slide 189

GET /starwars_extended/_doc/4/_termvectors {

"fields" : [

"quote.edgegram" ],

"offsets" : true ,

"payloads" : true ,

"positions" : true ,

"term_statistics" : true ,

"field_statistics" : true }

Slide 190

{

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "4" ,

"_version" : 1 ,

"found" : true ,

"took" : 3 ,

"term_vectors" : {

"quote.edgegram" : {

"field_statistics" : {

"sum_doc_freq" : 26 ,

"doc_count" : 2 ,

"sum_ttf" : 26 },

"terms" : {

"dis" : {

"doc_freq" : 1 ,

"ttf" : 1 ,

"term_freq" : 1 ,

"tokens" : [ {

"position" : 6 ,

"start_offset" : 26 ,

"end_offset" : 36 } ] },

"dist" : {

"doc_freq" : 1 ,

"ttf" : 1 , ...

Slide 191

POST /starwars_extended/_search {

"query" : {

"match" : {

"quote" : "fail" } } }

Slide 192

POST /starwars_extended/_search {

"query" : {

"match" : {

"quote.lowercase" : "fail" } } }

Slide 193

POST /starwars_extended/_search {

"query" : {

"match" : {

"quote.full" : "fail" } } }

Slide 194

POST /starwars_extended/_search {

"query" : {

"match" : {

"quote.ngram" : "fail" } } }

Slide 195

... "hits" : {

"total" : 2 ,

"max_score" : 1.0135446 ,

"hits" : [ {

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "4" ,

"_score" : 1.0135446 ,

"_source" : {

"quote" : "I find your lack of faith disturbing." } }, {

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "5" ,

"_score" : 0.50476736 ,

"_source" : {

"quote" : "That... is your failure." } } ] ...

Slide 196

POST /starwars_extended/_search {

"query" : {

"match" : {

"quote.edgegram" : "fail" } } }

Slide 197

... "hits" : {

"total" : 1 ,

"max_score" : 0.39556286 ,

"hits" : [ {

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "5" ,

"_score" : 0.39556286 ,

"_source" : {

"quote" : "That... is your failure." } } ] ...

Slide 198

Trainings https://training.elastic.co

Slide 199

The End