Full-Text Search Internals

A presentation at SAP Inside Track Vienna in November 2018 in Vienna, Austria by Philipp Krenn

Slide 1

Slide 1

Full-Text Search Internals Philipp Krenn @xeraa

Slide 2

Slide 2

Who uses a Database?

Slide 3

Slide 3

Who uses Search?

Slide 4

Slide 4

Slide 5

Slide 5

Slide 6

Slide 6

Slide 7

Slide 7

Developer

Slide 8

Slide 8

Store

Slide 9

Slide 9

Apache Lucene Elasticsearch

Slide 10

Slide 10

Slide 11

Slide 11

Example These are <em>not</em> the droids you are looking for.

Slide 12

Slide 12

html_strip Char Filter These are not the droids you are looking for.

Slide 13

Slide 13

standard Tokenizer These are not the droids you looking for are

Slide 14

Slide 14

lowercase Token Filter these are not the droids looking for you are

Slide 15

Slide 15

stop Token Filter droids you looking

Slide 16

Slide 16

snowball Token Filter droid you look

Slide 17

Slide 17

Analyze

Slide 18

Slide 18

GET /_analyze { "analyzer": "english", "text": "These are not the droids you are looking for." }

Slide 19

Slide 19

{ } "tokens": [ { "token": "droid", "start_offset": 18, "end_offset": 24, "type": "<ALPHANUM>", "position": 4 }, { "token": "you", "start_offset": 25, "end_offset": 28, "type": "<ALPHANUM>", "position": 5 }, ... ]

Slide 20

Slide 20

GET /_analyze { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "stop", "snowball" ], "text": "These are <em>not</em> the droids you are looking for." }

Slide 21

Slide 21

{ } "tokens": [ { "token": "droid", "start_offset": 27, "end_offset": 33, "type": "<ALPHANUM>", "position": 4 }, { "token": "you", "start_offset": 34, "end_offset": 37, "type": "<ALPHANUM>", "position": 5 }, ... ]

Slide 22

Slide 22

Stop Words a an and are as at be but by for if in into is it no not of on or such that the their then there these they this to was will with https://github.com/apache/lucene-solr/blob/master/lucene/ analysis/common/src/java/org/apache/lucene/analysis/en/ EnglishAnalyzer.java#L44-L50

Slide 23

Slide 23

Always Use Stop Words?

Slide 24

Slide 24

To be, or not to be.

Slide 25

Slide 25

Languages Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Turkish, Thai

Slide 26

Slide 26

Language Rules English: Philipp's → philipp French: l'église → eglis German: äußerst → ausserst

Slide 27

Slide 27

German GET /_analyze { "analyzer": "german", "text": "Das sind nicht die Droiden, nach denen du suchst." }

Slide 28

Slide 28

{ } "tokens": [ { "token": "droid", "start_offset": 19, "end_offset": 26, "type": "<ALPHANUM>", "position": 4 }, { "token": "den", "start_offset": 33, "end_offset": 38, "type": "<ALPHANUM>", "position": 6 }, { "token": "such", "start_offset": 42, "end_offset": 48, "type": "<ALPHANUM>", "position": 8 } ]

Slide 29

Slide 29

German with the English Analyzer da sind nicht die droiden denen du suchst nach

Slide 30

Slide 30

German Stop Words https://github.com/apache/lucene-solr/blob/master/lucene/ analysis/common/src/resources/org/apache/lucene/analysis/ snowball/german_stop.txt

Slide 31

Slide 31

Another Example Obi-Wan never told you what happened to your father.

Slide 32

Slide 32

Another Example obi wan never told you what happen your father

Slide 33

Slide 33

Another Example <b>No</b>. I am your father.

Slide 34

Slide 34

Another Example i am your father

Slide 35

Slide 35

Inverted Index am droid father happen i look never obi told wan what you your ID 1 0 1[4] 0 0 0 1[7] 0 0 0 0 0 1[5] 0 ID 2 0 0 1[9] 1[6] 0 0 1[2] 1[0] 1[3] 1[1] 1[5] 1[4] 1[8] ID 3 1[2] 0 1[4] 0 1[1] 0 0 0 0 0 0 0 1[3]

Slide 36

Slide 36

To / The Index

Slide 37

Slide 37

PUT /starwars { "settings": { "number_of_shards": 1, "analysis": { "filter": { "my_synonym_filter": { "type": "synonym", "synonyms": [ "father,dad", "droid => droid,machine" ] } },

Slide 38

Slide 38

}, } "analyzer": { "my_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "stop", "snowball", "my_synonym_filter" ] } }

Slide 39

Slide 39

} "mappings": { "_doc": { "properties": { "quote": { "type": "text", "analyzer": "my_analyzer" } } } }

Slide 40

Slide 40

Synonyms Index synonym or query time synonym_graph

Slide 41

Slide 41

GET /starwars/_mapping GET /starwars/_settings

Slide 42

Slide 42

PUT /starwars/_doc/1 { "quote": "These are <em>not</em> the droids you are looking for." } PUT /starwars/_doc/2 { "quote": "Obi-Wan never told you what happened to your father." } PUT /starwars/_doc/3 { "quote": "<b>No</b>. I am your father." }

Slide 43

Slide 43

GET /starwars/_doc/1 GET /starwars/_doc/1/_source

Slide 44

Slide 44

Search

Slide 45

Slide 45

POST /starwars/_search { "query": { "match_all": { } } }

Slide 46

Slide 46

{ "took": 1, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3, "max_score": 1, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 1, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, ...

Slide 47

Slide 47

POST /starwars/_search { "query": { "match": { "quote": "droid" } } }

Slide 48

Slide 48

{ } "took": 2, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.39556286, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.39556286, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ] }

Slide 49

Slide 49

POST /starwars/_search { "query": { "match": { "quote": "dad" } } }

Slide 50

Slide 50

... "hits": { "total": 2, "max_score": 0.41913947, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.41913947, "_source": { "quote": "<b>No</b>. I am your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.39291072, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] } }

Slide 51

Slide 51

POST /starwars/_doc/0/_explain { "query": { "match": { "quote": "dad" } } }

Slide 52

Slide 52

{ } "_index": "starwars", "_type": "_doc", "_id": "0", "matched": false

Slide 53

Slide 53

POST /starwars/_doc/1/_explain { "query": { "match": { "quote": "dad" } } }

Slide 54

Slide 54

{ } "_index": "starwars", "_type": "_doc", "_id": "1", "matched": false, "explanation": { "value": 0, "description": "no matching term", "details": [] }

Slide 55

Slide 55

POST /starwars/_doc/2/_explain { "query": { "match": { "quote": "dad" } } }

Slide 56

Slide 56

{ "_index": "starwars", "_type": "_doc", "_id": "2", "matched": true, "explanation": { ...

Slide 57

Slide 57

POST /starwars/_search { "query": { "match": { "quote": "machine" } } }

Slide 58

Slide 58

{ } "took": 2, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 1, "max_score": 1.2499592, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.2499592, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ] }

Slide 59

Slide 59

POST /starwars/_search { "query": { "match_phrase": { "quote": "I am your father" } } }

Slide 60

Slide 60

{ } "took": 3, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1.5665855, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.5665855, "_source": { "quote": "<b>No</b>. I am your father." } } ] }

Slide 61

Slide 61

POST /starwars/_search { "query": { "match_phrase": { "quote": { "query": "I am father", "slop": 1 } } } }

Slide 62

Slide 62

{ } "took": 16, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.8327639, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.8327639, "_source": { "quote": "<b>No</b>. I am your father." } } ] }

Slide 63

Slide 63

POST /starwars/_search { "query": { "match_phrase": { "quote": { "query": "I am not your father", "slop": 1 } } } }

Slide 64

Slide 64

{ } "took": 5, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1.0409548, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.0409548, "_source": { "quote": "<b>No</b>. I am your father." } } ] }

Slide 65

Slide 65

POST /starwars/_search { "query": { "match": { "quote": { "query": "van", "fuzziness": "AUTO" } } } }

Slide 66

Slide 66

{ } "took": 14, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.18155496, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.18155496, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] }

Slide 67

Slide 67

POST /starwars/_search { "query": { "match": { "quote": { "query": "ovi-van", "fuzziness": 1 } } } }

Slide 68

Slide 68

{ } "took": 109, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.3798467, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.3798467, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] }

Slide 69

Slide 69

FuzzyQuery History http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html Before: Brute force Now: Levenshtein Automaton

Slide 70

Slide 70

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

Slide 71

Slide 71

SELECT * FROM starwars WHERE quote LIKE "?an" OR quote LIKE "V?n" OR quote LIKE "Va?"

Slide 72

Slide 72

Score

Slide 73

Slide 73

Term Frequency / Inverse Document Frequency (TF/IDF) Search one term

Slide 74

Slide 74

BM25 Default in Elasticsearch 5.0 https://speakerdeck.com/elastic/improved-text-scoring-withbm25

Slide 75

Slide 75

Term Frequency

Slide 76

Slide 76

Slide 77

Slide 77

Inverse Document Frequency

Slide 78

Slide 78

Slide 79

Slide 79

Field-Length Norm

Slide 80

Slide 80

POST /starwars/_search?explain=true { "query": { "match": { "quote": "father" } } }

Slide 81

Slide 81

... "_explanation": { "value": 0.41913947, "description": "weight(Synonym(quote:dad quote:father) in 0) [PerFieldSimilarity], result of:", "details": [ { "value": 0.41913947, "description": "score(doc=0,freq=2.0 = termFreq=2.0\n), product of:", "details": [ { "value": 0.2876821, "description": "idf(docFreq=1, docCount=1)", "details": [] }, { "value": 1.4569536, "description": "tfNorm, computed from:", "details": [ { "value": 2, "description": "termFreq=2.0", "details": [] }, ...

Slide 82

Slide 82

Score 0.41913947: i am your father 0.39291072: obi wan never told what happen your father you

Slide 83

Slide 83

Vector Space Model Search multiple terms

Slide 84

Slide 84

Search your father

Slide 85

Slide 85

Slide 86

Slide 86

Coordination Factor Reward multiple terms

Slide 87

Slide 87

Search for 3 terms 1 term: 2 terms: 3 terms:

Slide 88

Slide 88

Practical Scoring Function Putting it all together

Slide 89

Slide 89

score(q,d) = queryNorm(q) · coord(q,d) · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ) (t in q)

Slide 90

Slide 90

Function Score Script, weight, random, field value, decay (geo or date)

Slide 91

Slide 91

POST /starwars/_search { "query": { "function_score": { "query": { "match": { "quote": "father" } }, "random_score": {} } } }

Slide 92

Slide 92

Compare Scores "100% perfect" vs a "50%" match

Slide 93

Slide 93

Don't do this. Seriously. Stop trying to think about your problem this way, it's not going to end well. — https://wiki.apache.org/lucene-java/ ScoresAsPercentages

Slide 94

Slide 94

GET /starwars/_analyze { "analyzer" : "my_analyzer", "text": "These are my father's machines." }

Slide 95

Slide 95

{ "tokens": [ { "token": "my", "start_offset": 10, "end_offset": 12, "type": "<ALPHANUM>", "position": 2 }, { "token": "father", "start_offset": 13, "end_offset": 21, "type": "<ALPHANUM>", "position": 3 }, { "token": "dad", "start_offset": 13, "end_offset": 21, "type": "SYNONYM", "position": 3 }, { "token": "machin", "start_offset": 22, "end_offset": 30, "type": "<ALPHANUM>", "position": 4 } ] }

Slide 96

Slide 96

PUT /starwars/_doc/4 { "quote": "These are my father's machines." }

Slide 97

Slide 97

POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }

Slide 98

Slide 98

"hits": { "total": 4, "max_score": 2.92523, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 2.92523, "_source": { "quote": "These are my father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.8617505, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...

Slide 99

Slide 99

2.92523 == 100%

Slide 100

Slide 100

DELETE /starwars/_doc/4 POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }

Slide 101

Slide 101

"hits": { "total": 3, "max_score": 1.2499592, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.2499592, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...

Slide 102

Slide 102

1.2499592 == 43% or 100%?

Slide 103

Slide 103

PUT /starwars/_doc/4 { "quote": "These droids are my father's father's machines." } POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }

Slide 104

Slide 104

"hits": { "total": 4, "max_score": 3.0068164, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 3.0068164, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.89701396, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...

Slide 105

Slide 105

3.0068164 == 103%?

Slide 106

Slide 106

Slide 107

Slide 107

Performance

Slide 108

Slide 108

Slide 109

Slide 109

Slide 110

Slide 110

Conclusion

Slide 111

Slide 111

Indexing Formatting Tokenize Lowercase, Stop Words, Stemming Synonyms

Slide 112

Slide 112

Scoring Term Frequency Inverse Document Frequency Field-Length Norm Vector Space Model

Slide 113

Slide 113

There is more...