Elastic Full-Text Search in Liferay

A presentation at Liferay Symposium in June 2018 in Paris, France by Philipp Krenn

Slide 1

Slide 2

Who uses a Database?

Slide 3

Who uses Search?

Slide 4

Slide 5

Slide 6

Question https://sli.do/xeraa Answer https://twitter.com/xeraa

Slide 7

Store

Slide 8

Apache Lucene Elasticsearch

Slide 9

Slide 10

Example These are <em>not</em> the droids you are looking for.

Slide 11

html_strip Char Filter These are not the droids you are looking for.

Slide 12

standard Tokenizer These | are | not | the | droids | you | are | looking | for

Slide 13

lowercase Token Filter these | are | not | the | droids | you | are | looking | for

Slide 14

stop Token Filter droids | you | looking

Slide 15

snowball Token Filter droid | you | look
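The char filter, tokenizer and token filter steps above can be approximated in a few lines of Python. This is only a sketch: the regular expressions stand in for the html_strip char filter and the standard tokenizer, and toy_stem is a hypothetical stand-in for the snowball filter that strips just three suffixes.

```python
import re

# Default English stop words, as listed on the "Stop Words" slide.
STOP_WORDS = set(
    "a an and are as at be but by for if in into is it no not of on or "
    "such that the their then there these they this to was will with".split()
)


def strip_html(text):
    """Char filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text):
    """Tokenizer: split on non-word characters and record each token's position."""
    return [(token, position) for position, token in enumerate(re.findall(r"\w+", text))]


def toy_stem(token):
    """Toy stand-in for snowball: strips a trailing 'ing', 'ed' or 's' only."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    tokens = tokenize(strip_html(text))
    tokens = [(token.lower(), position) for token, position in tokens]  # lowercase filter
    tokens = [(t, p) for t, p in tokens if t not in STOP_WORDS]         # stop filter
    return [(toy_stem(t), p) for t, p in tokens]                        # stemming filter


print(analyze("These are <em>not</em> the droids you are looking for."))
```

Removed stop words leave gaps in the positions, which matters later for phrase queries.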

Slide 16

Analyze

Slide 17

GET /_analyze {

"analyzer" : "english" ,

"text" : "These are not the droids you are looking for." }

Slide 18

{

"tokens" : [ {

"token" : "droid" ,

"start_offset" : 18 ,

"end_offset" : 24 ,

"type" : "<ALPHANUM>" ,

"position" : 4 }, {

"token" : "you" ,

"start_offset" : 25 ,

"end_offset" : 28 ,

"type" : "<ALPHANUM>" ,

"position" : 5 }, ... ] }

Slide 19

GET /_analyze {

"char_filter" : [

"html_strip" ],

"tokenizer" : "standard" ,

"filter" : [

"lowercase" ,

"stop" ,

"snowball" ],

"text" : "These are <em>not</em> the droids you are looking for." }

Slide 20

{

"tokens" : [ {

"token" : "droid" ,

"start_offset" : 27 ,

"end_offset" : 33 ,

"type" : "<ALPHANUM>" ,

"position" : 4 }, {

"token" : "you" ,

"start_offset" : 34 ,

"end_offset" : 37 ,

"type" : "<ALPHANUM>" ,

"position" : 5 }, ... ] }

Slide 21

Stop Words a an and are as at be but by for if in into is it no not of on or such that the their then there these they this to was will with https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardAnalyzer.java#L44-L50

Slide 22

Always Use Stop Words?

Slide 23

To be, or not to be.
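Why this quote is a problem can be shown with a small sketch. It assumes a plain regex tokenizer and the default English stop word list from the earlier slide; every word of the quote is on that list, so nothing is left to index.

```python
import re

# Default English stop words, as listed on the "Stop Words" slide.
STOP_WORDS = set(
    "a an and are as at be but by for if in into is it no not of on or "
    "such that the their then there these they this to was will with".split()
)


def remove_stop_words(text):
    """Lowercase, split on non-word characters, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(remove_stop_words("To be, or not to be."))  # prints []
```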

Slide 24

French Ce ne sont pas ces droïdes là que vous recherchez.

Slide 25

French droïd | là | recherchez

Slide 26

French with the English Analyzer ce | ne | sont | pa | ce | droïd | là | que | vou | recherchez

Slide 27

French Stop Words https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/french_stop.txt

Slide 28

Detecting Languages https://github.com/spinscale/elasticsearch-ingest-langdetect

Slide 29

Languages Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Turkish, Thai

Slide 30

More Language Plugins Core : ICU (Asian languages), Kuromoji (advanced Japanese), Phonetic, SmartCN, Stempel (better Polish stemming), Ukrainian (stemming) Community : Hebrew, Vietnamese, Network Address Analysis, String2Integer,...

Slide 31

Language Rules
English: Philipp's → philipp
French: l'église → eglis
German: äußerst → ausserst

Slide 32

Another Example Obi-Wan never told you what happened to your father.

Slide 33

Another Example obi | wan | never | told | you | what | happen | your | father

Slide 34

Another Example <b>No</b>. I am your father.

Slide 35

Another Example i | am | your | father

Slide 36

Inverted Index

Term     ID 1    ID 2    ID 3
am       0       0       1[2]
droid    1[4]    0       0
father   0       1[9]    1[4]
happen   0       1[6]    0
i        0       0       1[1]
look     1[7]    0       0
never    0       1[2]    0
obi      0       1[0]    0
told     0       1[3]    0
wan      0       1[1]    0
what     0       1[5]    0
you      1[5]    1[4]    0
your     0       1[8]    1[3]
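A minimal sketch of how such an inverted index could be built in Python. The token and position lists are copied from the table above; the dictionary layout (term, then document ID, then list of positions) is an illustrative assumption, not how Lucene stores postings on disk.

```python
from collections import defaultdict

# Analyzed tokens with their positions, taken from the inverted index table.
DOCS = {
    1: [("droid", 4), ("you", 5), ("look", 7)],
    2: [("obi", 0), ("wan", 1), ("never", 2), ("told", 3), ("you", 4),
        ("what", 5), ("happen", 6), ("your", 8), ("father", 9)],
    3: [("i", 1), ("am", 2), ("your", 3), ("father", 4)],
}


def build_inverted_index(docs):
    """Map each term to the documents containing it and the positions inside each."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, position in tokens:
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


index = build_inverted_index(DOCS)
for term in sorted(index):
    print(term, index[term])
```

Looking up a term is then a single dictionary access, which is why search does not have to scan every document.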

Slide 37

To / The Index

Slide 38

PUT /starwars {

"settings" : {

"number_of_shards" : 1 ,

"analysis" : {

"filter" : {

"my_synonym_filter" : {

"type" : "synonym" ,

"synonyms" : [

"father,dad" ,

"droid => droid,machine" ] } },

Slide 39

"analyzer" : {

"my_analyzer" : {

"char_filter" : [

"html_strip" ],

"tokenizer" : "standard" ,

"filter" : [

"lowercase" ,

"stop" ,

"snowball" ,

"my_synonym_filter" ] } } } },

Slide 40

"mappings" : {

"_doc" : {

"properties" : {

"quote" : {

"type" : "text" ,

"analyzer" : "my_analyzer" } } } } }

Slide 41

Synonyms: at index time (synonym) or at query time (synonym_graph)
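The two rule styles used in my_synonym_filter can be modelled roughly as follows. This is a sketch, not the Lucene implementation: "father,dad" is treated as an equivalence (each term expands to both), while "droid => droid,machine" is an explicit one-way mapping. The helper names parse_synonyms and expand are hypothetical.

```python
def parse_synonyms(rules):
    """Turn synonym rules into a dict from source term to replacement terms."""
    mapping = {}
    for rule in rules:
        if "=>" in rule:
            # Explicit mapping: only the left-hand terms are rewritten.
            left, right = rule.split("=>")
            sources = [term.strip() for term in left.split(",")]
            targets = [term.strip() for term in right.split(",")]
        else:
            # Equivalent synonyms: every term expands to the whole group.
            sources = targets = [term.strip() for term in rule.split(",")]
        for source in sources:
            mapping[source] = targets
    return mapping


def expand(tokens, mapping):
    """Replace each token by its synonyms, or keep it if no rule applies."""
    expanded = []
    for token in tokens:
        expanded.extend(mapping.get(token, [token]))
    return expanded


SYNONYMS = parse_synonyms(["father,dad", "droid => droid,machine"])
print(expand(["droid", "you", "look"], SYNONYMS))
```

In this model "machine" is not expanded back to "droid", because the explicit mapping only rewrites its left-hand side.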

Slide 42

GET /starwars/_mapping
GET /starwars/_settings

Slide 43

PUT /starwars/_doc/1
{
  "quote": "These are <em>not</em> the droids you are looking for."
}

PUT /starwars/_doc/2
{
  "quote": "Obi-Wan never told you what happened to your father."
}

PUT /starwars/_doc/3
{
  "quote": "<b>No</b>. I am your father."
}

Slide 44

GET /starwars/_doc/1
GET /starwars/_doc/1/_source

Slide 45

Multi Lingual
Index: PUT /starwars_en/_doc/1
Type
Field: { "quote_en": "..." }

Slide 46

PS: Single Type per Index

Slide 47

Search

Slide 48

POST /starwars/_search {

"query" : {

"match_all" : { } } }

Slide 49

GET vs POST

Slide 50

{

"took" : 1 ,

"timed_out" : false ,

"_shards" : {

"total" : 5 ,

"successful" : 5 ,

"failed" : 0 },

"hits" : {

"total" : 3 ,

"max_score" : 1 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 1 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } }, ...

Slide 51

POST /starwars/_search {

"query" : {

"match" : {

"quote" : "droid" } } }

Slide 52

{

"took" : 2 ,

"timed_out" : false ,

"_shards" : {

"total" : 5 ,

"successful" : 5 ,

"failed" : 0 },

"hits" : {

"total" : 1 ,

"max_score" : 0.39556286 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "1" ,

"_score" : 0.39556286 ,

"_source" : {

"quote" : "These are <em>not</em> the droids you are looking for." } } ] } }

Slide 53

POST /starwars/_search {

"query" : {

"match" : {

"quote" : "dad" } } }

Slide 54

...

"hits" : {

"total" : 2 ,

"max_score" : 0.41913947 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 0.41913947 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } }, {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 0.39291072 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } } ] } }

Slide 55

POST /starwars/_doc/0/_explain {

"query" : {

"match" : {

"quote" : "dad" } } }

Slide 56

{

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "0" ,

"matched" : false }

Slide 57

POST /starwars/_doc/1/_explain {

"query" : {

"match" : {

"quote" : "dad" } } }

Slide 58

{

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "1" ,

"matched" : false ,

"explanation" : {

"value" : 0 ,

"description" : "no matching term" ,

"details" : [] } }

Slide 59

POST /starwars/_doc/2/_explain {

"query" : {

"match" : {

"quote" : "dad" } } }

Slide 60

{

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"matched" : true ,

"explanation" : { ...

Slide 61

POST /starwars/_search {

"query" : {

"match" : {

"quote" : "machine" } } }

Slide 62

{

"took" : 2 ,

"timed_out" : false ,

"_shards" : {

"total" : 1 ,

"successful" : 1 ,

"skipped" : 0 ,

"failed" : 0 },

"hits" : {

"total" : 1 ,

"max_score" : 1.2499592 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "1" ,

"_score" : 1.2499592 ,

"_source" : {

"quote" : "These are <em>not</em> the droids you are looking for." } } ] } }

Slide 63

POST /starwars/_search {

"query" : {

"match_phrase" : {

"quote" : "I am your father" } } }

Slide 64

{

"took" : 3 ,

"timed_out" : false ,

"_shards" : {

"total" : 5 ,

"successful" : 5 ,

"failed" : 0 },

"hits" : {

"total" : 1 ,

"max_score" : 1.5665855 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 1.5665855 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } } ] } }

Slide 65

POST /starwars/_search {

"query" : {

"match_phrase" : {

"quote" : {

"query" : "I am father" ,

"slop" : 1 } } } }

Slide 66

{

"took" : 16 ,

"timed_out" : false ,

"_shards" : {

"total" : 5 ,

"successful" : 5 ,

"failed" : 0 },

"hits" : {

"total" : 1 ,

"max_score" : 0.8327639 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 0.8327639 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } } ] } }

Slide 67

POST /starwars/_search {

"query" : {

"match_phrase" : {

"quote" : {

"query" : "I am not your father" ,

"slop" : 1 } } } }

Slide 68

{

"took" : 5 ,

"timed_out" : false ,

"_shards" : {

"total" : 5 ,

"successful" : 5 ,

"failed" : 0 },

"hits" : {

"total" : 1 ,

"max_score" : 1.0409548 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 1.0409548 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } } ] } }

Slide 69

POST /starwars/_search {

"query" : {

"match" : {

"quote" : {

"query" : "van" ,

"fuzziness" : "AUTO" } } } }

Slide 70

{

"took" : 14 ,

"timed_out" : false ,

"_shards" : {

"total" : 5 ,

"successful" : 5 ,

"failed" : 0 },

"hits" : {

"total" : 1 ,

"max_score" : 0.18155496 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 0.18155496 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } } ] } }

Slide 71

POST /starwars/_search {

"query" : {

"match" : {

"quote" : {

"query" : "ovi-van" ,

"fuzziness" : 1 } } } }

Slide 72

{

"took" : 109 ,

"timed_out" : false ,

"_shards" : {

"total" : 5 ,

"successful" : 5 ,

"failed" : 0 },

"hits" : {

"total" : 1 ,

"max_score" : 0.3798467 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 0.3798467 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } } ] } }

Slide 73

FuzzyQuery History http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html Before: Brute force Now: Levenshtein Automaton
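The brute-force approach can be sketched as computing the edit distance between the query and every term in the index. The code below uses plain Levenshtein distance with dynamic programming; it does not model the Levenshtein automaton, and it ignores transpositions, which Elasticsearch can optionally count as a single edit. The term list is taken from the inverted index slide.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def brute_force_fuzzy(query, terms, max_edits):
    """Compare the query against every indexed term: simple, but linear in the term count."""
    return [term for term in terms if levenshtein(query, term) <= max_edits]


TERMS = ["am", "droid", "father", "happen", "i", "look", "never",
         "obi", "told", "wan", "what", "you", "your"]
print(brute_force_fuzzy("van", TERMS, 1))
```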

Slide 74

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

Slide 75

SELECT *

FROM starwars

WHERE quote LIKE

"?an"

OR quote LIKE

"V?n"

OR quote LIKE

"Va?"

Slide 76

Scoring

Slide 77

Term Frequency / Inverse Document Frequency (TF/IDF) Search one term

Slide 78

BM25 Default since Elasticsearch 5.0 https://speakerdeck.com/elastic/improved-text-scoring-with-bm25

Slide 79

Term Frequency

Slide 80

Slide 81

Inverse Document Frequency

Slide 82

Slide 83

Field-Length Norm

Slide 84

POST /starwars/_search?explain=true {

"query" : {

"match" : {

"quote" : "father" } } }

Slide 85

... "_explanation" : {

"value" : 0.41913947 ,

"description" : "weight(Synonym(quote:dad quote:father) in 0) [PerFieldSimilarity], result of:" ,

"details" : [ {

"value" : 0.41913947 ,

"description" : "score(doc=0,freq=2.0 = termFreq=2.0 ), product of:" ,

"details" : [ {

"value" : 0.2876821 ,

"description" : "idf(docFreq=1, docCount=1)" ,

"details" : [] }, {

"value" : 1.4569536 ,

"description" : "tfNorm, computed from:" ,

"details" : [ {

"value" : 2 ,

"description" : "termFreq=2.0" ,

"details" : [] }, ...

Slide 86

Score
0.41913947: i | am | your | father
0.39291072: obi | wan | never | told | you | what | happen | your | father
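The idf and tfNorm factors from the explain output can be recomputed with the BM25 formulas. This is a sketch under stated assumptions: k1 = 1.2 and b = 0.75 are the BM25 defaults, and the field length of 4 against an average of 5 is a hypothetical pair, since the slide elides the actual lengths. With that ratio the numbers land close to the values shown.

```python
import math

K1 = 1.2   # BM25 term frequency saturation (default)
B = 0.75   # BM25 length normalization (default)


def idf(doc_freq, doc_count):
    """BM25 inverse document frequency: rarer terms weigh more."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_norm(freq, field_length, avg_field_length, k1=K1, b=B):
    """BM25 term frequency factor, dampened for longer-than-average fields."""
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_length / avg_field_length))


# docFreq=1, docCount=1 and termFreq=2 come from the explain output;
# field_length=4 and avg_field_length=5 are assumptions for illustration.
score = idf(1, 1) * tf_norm(2, field_length=4, avg_field_length=5)
print(round(idf(1, 1), 7), round(tf_norm(2, 4, 5), 7), round(score, 7))
```

The shorter quote scores higher for the same term, which is the field-length norm at work.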

Slide 87

Vector Space Model Search multiple terms

Slide 88

Search your father

Slide 89

Slide 90

Coordination Factor Reward multiple terms

Slide 91

Search for 3 terms 1 term: 2 terms: 3 terms:

Slide 92

Practical Scoring Function Putting it all together

Slide 93

score(q,d) = queryNorm(q) · coord(q,d) · ∑ (t in q) ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

Slide 94

Function Score Script, weight, random, field value, decay (geo or date)

Slide 95

POST /starwars/_search {

"query" : {

"function_score" : {

"query" : {

"match" : {

"quote" : "father" } },

"random_score" : {} } } }

Slide 96

Compare Scores "100% perfect" vs a "50%" match

Slide 97

Don't do this. Seriously. Stop trying to think about your problem this way, it's not going to end well. (https://wiki.apache.org/lucene-java/ScoresAsPercentages)

Slide 98

GET /starwars/_analyze {

"analyzer" : "my_analyzer" ,

"text" : "These are my father's machines." }

Slide 99

{ "tokens" : [ {

"token" : "my" ,

"start_offset" : 10 ,

"end_offset" : 12 ,

"type" : "<ALPHANUM>" ,

"position" : 2 }, {

"token" : "father" ,

"start_offset" : 13 ,

"end_offset" : 21 ,

"type" : "<ALPHANUM>" ,

"position" : 3 }, {

"token" : "dad" ,

"start_offset" : 13 ,

"end_offset" : 21 ,

"type" : "SYNONYM" ,

"position" : 3 }, {

"token" : "machin" ,

"start_offset" : 22 ,

"end_offset" : 30 ,

"type" : "<ALPHANUM>" ,

"position" : 4 } ] }

Slide 100

PUT /starwars/_doc/4
{
  "quote": "These are my father's machines."
}

Slide 101

POST /starwars/_search {

"query" : {

"match" : {

"quote" : "my father machine" } } }

Slide 102

"hits" : {

"total" : 4 ,

"max_score" : 2.92523 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "4" ,

"_score" : 2.92523 ,

"_source" : {

"quote" : "These are my father's machines." } }, {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "1" ,

"_score" : 0.8617505 ,

"_source" : {

"quote" : "These are <em>not</em> the droids you are looking for." } }, ...

Slide 103

2.92523 == 100%

Slide 104

DELETE /starwars/_doc/4

POST /starwars/_search
{
  "query": {
    "match": {
      "quote": "my father machine"
    }
  }
}

Slide 105

"hits" : {

"total" : 3 ,

"max_score" : 1.2499592 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "1" ,

"_score" : 1.2499592 ,

"_source" : {

"quote" : "These are <em>not</em> the droids you are looking for." } }, ...

Slide 106

1.2499592 == 43% or 100%?

Slide 107

PUT /starwars/_doc/4
{
  "quote": "These droids are my father's father's machines."
}

POST /starwars/_search
{
  "query": {
    "match": {
      "quote": "my father machine"
    }
  }
}

Slide 108

"hits" : {

"total" : 4 ,

"max_score" : 3.0068164 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "4" ,

"_score" : 3.0068164 ,

"_source" : {

"quote" : "These droids are my father's father's machines." } }, {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "1" ,

"_score" : 0.89701396 ,

"_source" : {

"quote" : "These are <em>not</em> the droids you are looking for." } }, ...

Slide 109

3.0068164 == 103%?

Slide 110

Slide 111

Performance

Slide 112

Slide 113

Slide 114

Conclusion

Slide 115

Indexing Formatting Tokenize Lowercase, Stop Words, Stemming Synonyms

Slide 116

Scoring Term Frequency Inverse Document Frequency Field-Length Norm Vector Space Model

Slide 117

Advanced Queries Highlighting NGrams & Edge Grams Multiple Analyzers Reindex & Alias

Slide 118

There is more Elastic Stack

Slide 119

Slide 120

Thank You! Questions? Philipp Krenn @xeraa PS: Stickers

Slide 121

More

Slide 122

POST /starwars/_search {

"query" : {

"match" : {

"quote" : "father" } },

"highlight" : {

"type" : "unified" ,

"pre_tags" : [

"<tag>" ],

"post_tags" : [

"</tag>" ],

"fields" : {

"quote" : {} } } }

Slide 123

... "hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 0.41913947 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." },

"highlight" : {

"quote" : [

"<b>No</b>. I am your <tag>father</tag>." ] } }, ...

Slide 124

Boolean Queries must

must_not

should

filter

Slide 125

POST /starwars/_search {

"query" : {

"bool" : {

"must" : {

"match" : {

"quote" : "father" } },

"should" : [ {

"match" : {

"quote" : "your" } }, {

"match" : {

"quote" : "obi" } } ] } } }

Slide 126

...

"hits" : {

"total" : 2 ,

"max_score" : 0.96268076 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 0.96268076 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } }, {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 0.73245656 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } } ] } }

Slide 127

POST /starwars/_search {

"query" : {

"bool" : {

"filter" : {

"match" : {

"quote" : "father" } },

"should" : [ {

"match" : {

"quote" : "your" } }, {

"match" : {

"quote" : "obi" } } ] } } }

Slide 128

...

"hits" : {

"total" : 2 ,

"max_score" : 0.56977004 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 0.56977004 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } }, {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 0.31331712 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } } ] } }

Slide 129

Named Queries & minimum_should_match
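How must, named should clauses and minimum_should_match interact can be modelled on plain term sets. This is only a rough sketch: the documents are reduced to the terms from the inverted index slide (synonyms omitted), each clause is a single-term lookup, scoring is ignored, and bool_match is a hypothetical helper.

```python
def bool_match(doc_terms, must, should, minimum_should_match=0):
    """Return the names of the matching should clauses, or None if the document is rejected."""
    if not all(term in doc_terms for term in must):
        return None
    matched = [name for name, term in should.items() if term in doc_terms]
    if len(matched) < minimum_should_match:
        return None
    return matched


DOCS = {
    1: {"droid", "you", "look"},
    2: {"obi", "wan", "never", "told", "you", "what", "happen", "your", "father"},
    3: {"i", "am", "your", "father"},
}
# Clause names as used in the named queries of this section.
SHOULD = {"quote-your": "your", "quote-obi": "obi", "quote-droid": "droid"}

for doc_id, terms in DOCS.items():
    print(doc_id, bool_match(terms, must=["father"], should=SHOULD, minimum_should_match=2))
```

Only document 2 satisfies the must clause and at least two should clauses; the returned names play the role of matched_queries.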

Slide 130

POST /starwars/_search {

"query" : {

"bool" : {

"must" : {

"match" : { "quote" : "father" } },

"should" : [ {

"match" : {

"quote" : { "query" : "your" , "_name" : "quote-your" } } }, {

"match" : {

"quote" : { "query" : "obi" , "_name" : "quote-obi" } } }, {

"match" : {

"quote" : { "query" : "droid" , "_name" : "quote-droid" } } } ],

"minimum_should_match" : 2 } } }

Slide 131

...

"hits" : {

"total" : 1 ,

"max_score" : 1.8154771 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 1.8154771 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." },

"matched_queries" : [

"quote-obi" ,

"quote-your" ] } ] } }

Slide 132

Boosting

>1 increase, <1 decrease, <0 punish

Slide 133

POST /starwars/_search {

"query" : {

"bool" : {

"must" : {

"match" : {

"quote" : "father" } },

"should" : [ {

"match" : {

"quote" : "your" } }, {

"match" : {

"quote" : {

"query" : "obi" ,

"boost" : 3 } } } ] } } }

Slide 134

...

"hits" : {

"total" : 2 ,

"max_score" : 1.5324509 ,

"hits" : [ {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 1.5324509 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } }, {

"_index" : "starwars" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 0.73245656 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } } ] } }

Slide 135

Suggestion
Suggest a similar text
Use the _search endpoint
The _suggest endpoint is deprecated since 5.0

Slide 136

POST /starwars/_search {

"query" : {

"match" : {

"quote" : "drui" } },

"suggest" : {

"my_suggestion" : {

"text" : "drui" ,

"term" : {

"field" : "quote" } } } }

Slide 137

...

"hits" : {

"total" : 0 ,

"max_score" : null ,

"hits" : [] },

"suggest" : {

"my_suggestion" : [ {

"text" : "drui" ,

"offset" : 0 ,

"length" : 4 ,

"options" : [ {

"text" : "droid" ,

"score" : 0.5 ,

"freq" : 1 } ] } ] } }

Slide 138

NGram Partial matches Trigram Edge Gram
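A minimal Python sketch of the two tokenizations, applied to a single lowercase word. The function names are hypothetical, and details such as token_chars handling and the exact emission order for ranges where min_gram differs from max_gram are not modelled.

```python
def ngrams(word, min_gram, max_gram):
    """All substrings of length min_gram..max_gram, sliding over the word."""
    grams = []
    for start in range(len(word)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(word):
                grams.append(word[start:start + size])
    return grams


def edge_ngrams(word, min_gram, max_gram):
    """Only the prefixes of length min_gram..max_gram, anchored at the start of the word."""
    return [word[:size] for size in range(min_gram, min(max_gram, len(word)) + 1)]


print(ngrams("these", 3, 3))       # trigrams
print(edge_ngrams("these", 1, 3))  # edge grams
```

Trigrams allow matches anywhere inside a word, while edge grams only support prefix matches such as search-as-you-type.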

Slide 139

GET /_analyze {

"char_filter" : [

"html_strip" ],

"tokenizer" : {

"type" : "ngram" ,

"min_gram" : "3" ,

"max_gram" : "3" ,

"token_chars" : [

"letter" ] },

"filter" : [

"lowercase" ],

"text" : "These are <em>not</em> the droids you are looking for." }

Slide 140

{

"tokens" : [ {

"token" : "the" ,

"start_offset" : 0 ,

"end_offset" : 3 ,

"type" : "word" ,

"position" : 0 }, {

"token" : "hes" ,

"start_offset" : 1 ,

"end_offset" : 4 ,

"type" : "word" ,

"position" : 1 }, {

"token" : "ese" ,

"start_offset" : 2 ,

"end_offset" : 5 ,

"type" : "word" ,

"position" : 2 }, {

"token" : "are" ,

"start_offset" : 6 ,

"end_offset" : 9 ,

"type" : "word" ,

"position" : 3 }, ...

Slide 141

GET /_analyze {

"char_filter" : [

"html_strip" ],

"tokenizer" : {

"type" : "edge_ngram" ,

"min_gram" : "1" ,

"max_gram" : "3" ,

"token_chars" : [

"letter" ] },

"filter" : [

"lowercase" ],

"text" : "These are <em>not</em> the droids you are looking for." }

Slide 142

{

"tokens" : [ {

"token" : "t" ,

"start_offset" : 0 ,

"end_offset" : 1 ,

"type" : "word" ,

"position" : 0 }, {

"token" : "th" ,

"start_offset" : 0 ,

"end_offset" : 2 ,

"type" : "word" ,

"position" : 1 }, {

"token" : "the" ,

"start_offset" : 0 ,

"end_offset" : 3 ,

"type" : "word" ,

"position" : 2 }, {

"token" : "a" ,

"start_offset" : 6 ,

"end_offset" : 7 ,

"type" : "word" ,

"position" : 3 }, {

"token" : "ar" ,

"start_offset" : 6 ,

"end_offset" : 8 ,

"type" : "word" ,

"position" : 4 }, ...

Slide 143

Combining Analyzers Reindex Store multiple times Combine scores

Slide 144

PUT /starwars_v42 {

"settings" : {

"number_of_shards" : 1 ,

"analysis" : {

"filter" : {

"my_synonym_filter" : {

"type" : "synonym" ,

"synonyms" : [

"droid,machine" ,

"father,dad" ] },

"my_ngram_filter" : {

"type" : "ngram" ,

"min_gram" : "3" ,

"max_gram" : "3" ,

"token_chars" : [

"letter" ] } },

Slide 145

"analyzer" : {

"my_lowercase_analyzer" : {

"char_filter" : [

"html_strip" ],

"tokenizer" : "whitespace" ,

"filter" : [

"lowercase" ] },

"my_full_analyzer" : {

"char_filter" : [

"html_strip" ],

"tokenizer" : "standard" ,

"filter" : [

"lowercase" ,

"stop" ,

"snowball" ,

"my_synonym_filter" ] },

Slide 146

"my_ngram_analyzer" : {

"char_filter" : [

"html_strip" ],

"tokenizer" : "whitespace" ,

"filter" : [

"lowercase" ,

"stop" ,

"my_ngram_filter" ] } } } },

Slide 147

"mappings" : {

"_doc" : {

"properties" : {

"quote" : {

"type" : "text" ,

"fields" : {

"lowercase" : {

"type" : "text" ,

"analyzer" : "my_lowercase_analyzer" },

"full" : {

"type" : "text" ,

"analyzer" : "my_full_analyzer" },

"ngram" : {

"type" : "text" ,

"analyzer" : "my_ngram_analyzer" } } } } } } }

Slide 148

POST /_reindex {

"source" : {

"index" : "starwars" },

"dest" : {

"index" : "starwars_v42" } }

Slide 149

POST /_aliases {

"actions" : [ {

"add" : {

"index" : "starwars_v42" ,

"alias" : "starwars_extended" } } ] }

Slide 150

Aliases Atomic remove and add Point to multiple indices (read-only)

Slide 151

POST /starwars_extended/_search?explain=true {

"query" : {

"multi_match" : {

"query" : "obiwan" ,

"fields" : [

"quote" ,

"quote.lowercase" ,

"quote.full" ,

"quote.ngram" ],

"type" : "most_fields" } } }

Slide 152

... "hits" : {

"total" : 1 ,

"max_score" : 0.4912064 ,

"hits" : [ {

"_shard" : "[starwars_v42][2]" ,

"_node" : "BCDwzJ4WSw2dyoGLTzwlqw" ,

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 0.4912064 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." }, ...

Slide 153

Whitespace Tokenizer "weight( Synonym(quote.ngram:biw quote.ngram:iwa quote.ngram:obi quote.ngram:wan) in 0) [PerFieldSimilarity], result of:"

Slide 154

POST /starwars_extended/_search {

"query" : {

"multi_match" : {

"query" : "you" ,

"fields" : [

"quote" ,

"quote.lowercase" ,

"quote.full^5" ,

"quote.ngram" ],

"type" : "best_fields" } } }

Slide 155

"hits" : [ {

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "1" ,

"_score" : 1.6022799 ,

"_source" : {

"quote" : "These are <em>not</em> the droids you are looking for." } }, {

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 1.4997643 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } }, {

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 0.38650417 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } } ]

Slide 156

Multi Match Type best_fields Score of the best field (default) cross_fields All terms in at least one field most_fields Score sum of all fields phrase
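The difference between best_fields and most_fields, as described above, comes down to how per-field scores are combined. The sketch below follows the slide's wording (best score versus sum of scores); the per-field scores are made-up numbers for illustration, and options such as tie_breaker or field boosts are left out.

```python
def combine(field_scores, match_type="best_fields"):
    """Combine per-field scores the way the slide describes the multi_match types."""
    scores = list(field_scores.values())
    if not scores:
        return 0.0
    if match_type == "best_fields":
        return max(scores)   # score of the single best matching field
    if match_type == "most_fields":
        return sum(scores)   # every matching field contributes
    raise ValueError("unsupported type in this sketch: " + match_type)


# Hypothetical per-field scores for one document.
SCORES = {"quote": 0.2, "quote.full": 0.5, "quote.ngram": 0.1}
print(combine(SCORES, "best_fields"))
print(combine(SCORES, "most_fields"))
```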

Slide 157

Different Analyzers for Indexing and Searching Per query In the mapping

Slide 158

POST /starwars_extended/_search {

"query" : {

"match" : {

"quote.ngram" : {

"query" : "the" ,

"analyzer" : "standard" } } } }

Slide 159

... "hits" : [ {

"_index" : "starwars_extended" ,

"_type" : "_doc" ,

"_id" : "2" ,

"_score" : 0.38254172 ,

"_source" : {

"quote" : "Obi-Wan never told you what happened to your father." } }, {

"_index" : "starwars_extended" ,

"_type" : "_doc" ,

"_id" : "3" ,

"_score" : 0.36165747 ,

"_source" : {

"quote" : "<b>No</b>. I am your father." } } ] ...

Slide 160

Edge Gram vs Trigram Extending a mapping Testing a custom mapping

Slide 161

POST /starwars_extended/_close

PUT /starwars_extended/_settings
{
  "analysis": {
    "filter": {
      "my_edgegram_filter": {
        "type": "edge_ngram",
        "min_gram": 3,
        "max_gram": 10
      }
    },
    "analyzer": {
      "my_edgegram_analyzer": {
        "char_filter": ["html_strip"],
        "tokenizer": "standard",
        "filter": ["lowercase", "my_edgegram_filter"]
      }
    }
  }
}

POST /starwars_extended/_open

Slide 162

GET starwars_extended/_analyze {

"text" : "Father" ,

"analyzer" : "my_edgegram_analyzer" }

Slide 163

{

"tokens" : [ {

"token" : "fat" ,

"start_offset" : 0 ,

"end_offset" : 6 ,

"type" : "<ALPHANUM>" ,

"position" : 0 }, {

"token" : "fath" ,

"start_offset" : 0 ,

"end_offset" : 6 ,

"type" : "<ALPHANUM>" ,

"position" : 0 }, {

"token" : "fathe" ,

"start_offset" : 0 ,

"end_offset" : 6 ,

"type" : "<ALPHANUM>" ,

"position" : 0 }, {

"token" : "father" ,

"start_offset" : 0 ,

"end_offset" : 6 ,

"type" : "<ALPHANUM>" ,

"position" : 0 } ] }

Slide 164

PUT /starwars_extended/_doc/_mapping {

"properties" : {

"quote" : {

"type" : "text" ,

"fields" : {

"edgegram" : {

"type" : "text" ,

"analyzer" : "my_edgegram_analyzer" ,

"search_analyzer" : "standard" } } } } }

Slide 165

PUT /starwars_extended/_doc/4
{
  "quote": "I find your lack of faith disturbing."
}

PUT /starwars_extended/_doc/5
{
  "quote": "That... is your failure."
}

Slide 166

GET /starwars_extended/_doc/4/_termvectors {

"fields" : [

"quote.edgegram" ],

"offsets" : true ,

"payloads" : true ,

"positions" : true ,

"term_statistics" : true ,

"field_statistics" : true }

Slide 167

{

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "4" ,

"_version" : 1 ,

"found" : true ,

"took" : 3 ,

"term_vectors" : {

"quote.edgegram" : {

"field_statistics" : {

"sum_doc_freq" : 26 ,

"doc_count" : 2 ,

"sum_ttf" : 26 },

"terms" : {

"dis" : {

"doc_freq" : 1 ,

"ttf" : 1 ,

"term_freq" : 1 ,

"tokens" : [ {

"position" : 6 ,

"start_offset" : 26 ,

"end_offset" : 36 } ] },

"dist" : {

"doc_freq" : 1 ,

"ttf" : 1 , ...

Slide 168

POST /starwars_extended/_search {

"query" : {

"match" : {

"quote" : "fail" } } }

Slide 169

POST /starwars_extended/_search {

"query" : {

"match" : {

"quote.lowercase" : "fail" } } }

Slide 170

POST /starwars_extended/_search {

"query" : {

"match" : {

"quote.full" : "fail" } } }

Slide 171

POST /starwars_extended/_search {

"query" : {

"match" : {

"quote.ngram" : "fail" } } }

Slide 172

... "hits" : {

"total" : 2 ,

"max_score" : 1.0135446 ,

"hits" : [ {

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "4" ,

"_score" : 1.0135446 ,

"_source" : {

"quote" : "I find your lack of faith disturbing." } }, {

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "5" ,

"_score" : 0.50476736 ,

"_source" : {

"quote" : "That... is your failure." } } ] ...

Slide 173

POST /starwars_extended/_search {

"query" : {

"match" : {

"quote.edgegram" : "fail" } } }

Slide 174

... "hits" : {

"total" : 1 ,

"max_score" : 0.39556286 ,

"hits" : [ {

"_index" : "starwars_v42" ,

"_type" : "_doc" ,

"_id" : "5" ,

"_score" : 0.39556286 ,

"_source" : {

"quote" : "That... is your failure." } } ] ...

Slide 175

The End