How Does Full-text Search Work Under the Hood

A presentation at BigData Vilnius in November 2018 in Antakalnio g. 17, Vilnius 10312, Lithuania by Philipp Krenn

Slide 1

Slide 1

How Does Full-Text Search Work under the Hood Philipp Krenn @xeraa

Slide 2

Slide 2

Who uses a Database?

Slide 3

Slide 3

Who uses Search?

Slide 4

Slide 4

Slide 5

Slide 5

Slide 6

Slide 6

Slide 7

Slide 7

Developer

Slide 8

Slide 8

Store

Slide 9

Slide 9

Apache Lucene Elasticsearch

Slide 10

Slide 10

Slide 11

Slide 11

Example These are <em>not</em> the droids you are looking for.

Slide 12

Slide 12

html_strip Char Filter These are not the droids you are looking for.

Slide 13

Slide 13

standard Tokenizer These are not the droids you looking for are

Slide 14

Slide 14

lowercase Token Filter these are not the droids looking for you are

Slide 15

Slide 15

stop Token Filter droids you looking

Slide 16

Slide 16

snowball Token Filter droid you look

Slide 17

Slide 17

Setup

Slide 18

Slide 18

https://cloud.elastic.co

Slide 19

Slide 19

Slide 20

Slide 20

Slide 21

Slide 21

Docker Compose --version: '2' services: kibana: image: docker.elastic.co/kibana/kibana:$ELASTIC_VERSION links: - elasticsearch ports: - 5601:5601 elasticsearch: image: docker.elastic.co/elasticsearch/elasticsearch:$ELASTIC_VERSION volumes: - esdata1:/usr/share/elasticsearch/data ports: - 9200:9200 volumes: esdata1: driver: local

Slide 22

Slide 22

Analyze

Slide 23

Slide 23

GET /_analyze { "analyzer": "english", "text": "These are not the droids you are looking for." }

Slide 24

Slide 24

{ } "tokens": [ { "token": "droid", "start_offset": 18, "end_offset": 24, "type": "<ALPHANUM>", "position": 4 }, { "token": "you", "start_offset": 25, "end_offset": 28, "type": "<ALPHANUM>", "position": 5 }, ... ]

Slide 25

Slide 25

GET /_analyze { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "stop", "snowball" ], "text": "These are <em>not</em> the droids you are looking for." }

Slide 26

Slide 26

{ } "tokens": [ { "token": "droid", "start_offset": 27, "end_offset": 33, "type": "<ALPHANUM>", "position": 4 }, { "token": "you", "start_offset": 34, "end_offset": 37, "type": "<ALPHANUM>", "position": 5 }, ... ]

Slide 27

Slide 27

Stop Words a an and are as at be but by for if in into is it no not of on or such that the their then there these they this to was will with https://github.com/apache/lucene-solr/blob/master/lucene/ analysis/common/src/java/org/apache/lucene/analysis/en/ EnglishAnalyzer.java#L44-L50

Slide 28

Slide 28

Always Use Stop Words?

Slide 29

Slide 29

To be, or not to be.

Slide 30

Slide 30

Lithuanian Tai ne tie droidai, kurių ieškote.

Slide 31

Slide 31

Lithuanian ne tie droid kur iešk

Slide 32

Slide 32

Languages Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Turkish, Thai

Slide 33

Slide 33

Language Rules English: Philipp's → philipp French: l'église → eglis German: äußerst → ausserst

Slide 34

Slide 34

More Language Plugins Core: ICU (Asian languages), Kuromoji (advanced Japanese), Phonetic, SmartCN, Stempel (Polish), Ukrainian Community: Hebrew, Vietnamese, Network Address Analysis, String2Integer,...

Slide 35

Slide 35

Lithuanian with the English Analyzer tai ne tie droidai kurių ieškot

Slide 36

Slide 36

Lithuanian Stop Words https://github.com/apache/lucene-solr/blob/master/lucene/ analysis/common/src/resources/org/apache/lucene/analysis/ lt/stopwords.txt

Slide 37

Slide 37

Detect Languages https://github.com/spinscale/ elasticsearch-ingest-langdetect

Slide 38

Slide 38

PUT _ingest/pipeline/langdetect-pipeline { "description": "A pipeline to detect languages", "processors": [ { "langdetect" : { "field" : "quote", "target_field" : "language" } } ] }

Slide 39

Slide 39

POST _ingest/pipeline/langdetect-pipeline/_simulate { "docs": [ { "_source": { "quote": "Tai ne tie droidai, kurių ieškote." } } ] }

Slide 40

Slide 40

{ } "docs": [ { "doc": { "_index": "_index", "_type": "_type", "_id": "_id", "_source": { "language": "lt", "quote": "Tai ne tie droidai, kurių ieškote." }, "_ingest": { "timestamp": "2018-10-26T00:06:42.320613Z" } } } ]

Slide 41

Slide 41

Phonetic GET /_analyze { "tokenizer": "standard", "filter": [ { "type": "phonetic", "encoder": "beider_morse", "languageset": "any" } ], "text": "These are not the droids you are looking for." }

Slide 42

Slide 42

Phonetic ... drDts drits drots loknk... iou ari ori

Slide 43

Slide 43

Another Example Obi-Wan never told you what happened to your father.

Slide 44

Slide 44

Another Example obi wan never told you what happen your father

Slide 45

Slide 45

Another Example <b>No</b>. I am your father.

Slide 46

Slide 46

Another Example i am your father

Slide 47

Slide 47

Inverted Index am droid father happen i look never obi told wan what you your ID 1 0 1[4] 0 0 0 1[7] 0 0 0 0 0 1[5] 0 ID 2 0 0 1[9] 1[6] 0 0 1[2] 1[0] 1[3] 1[1] 1[5] 1[4] 1[8] ID 3 1[2] 0 1[4] 0 1[1] 0 0 0 0 0 0 0 1[3]

Slide 48

Slide 48

To / The Index

Slide 49

Slide 49

PUT /starwars { "settings": { "number_of_shards": 1, "analysis": { "filter": { "my_synonym_filter": { "type": "synonym", "synonyms": [ "father,dad", "droid => droid,machine" ] } },

Slide 50

Slide 50

}, } "analyzer": { "my_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "stop", "snowball", "my_synonym_filter" ] } }

Slide 51

Slide 51

} "mappings": { "_doc": { "properties": { "quote": { "type": "text", "analyzer": "my_analyzer" } } } }

Slide 52

Slide 52

Synonyms Index synonym or query time synonym_graph

Slide 53

Slide 53

GET /starwars/_mapping GET /starwars/_settings

Slide 54

Slide 54

PUT /starwars/_doc/1 { "quote": "These are <em>not</em> the droids you are looking for." } PUT /starwars/_doc/2 { "quote": "Obi-Wan never told you what happened to your father." } PUT /starwars/_doc/3 { "quote": "<b>No</b>. I am your father." }

Slide 55

Slide 55

GET /starwars/_doc/1 GET /starwars/_doc/1/_source

Slide 56

Slide 56

Search

Slide 57

Slide 57

POST /starwars/_search { "query": { "match_all": { } } }

Slide 58

Slide 58

{ "took": 1, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3, "max_score": 1, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 1, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, ...

Slide 59

Slide 59

POST /starwars/_search { "query": { "match": { "quote": "droid" } } }

Slide 60

Slide 60

{ } "took": 2, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.39556286, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.39556286, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ] }

Slide 61

Slide 61

POST /starwars/_search { "query": { "match": { "quote": "dad" } } }

Slide 62

Slide 62

... "hits": { "total": 2, "max_score": 0.41913947, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.41913947, "_source": { "quote": "<b>No</b>. I am your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.39291072, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] } }

Slide 63

Slide 63

POST /starwars/_doc/0/_explain { "query": { "match": { "quote": "dad" } } }

Slide 64

Slide 64

{ } "_index": "starwars", "_type": "_doc", "_id": "0", "matched": false

Slide 65

Slide 65

POST /starwars/_doc/1/_explain { "query": { "match": { "quote": "dad" } } }

Slide 66

Slide 66

{ } "_index": "starwars", "_type": "_doc", "_id": "1", "matched": false, "explanation": { "value": 0, "description": "no matching term", "details": [] }

Slide 67

Slide 67

POST /starwars/_doc/2/_explain { "query": { "match": { "quote": "dad" } } }

Slide 68

Slide 68

{ "_index": "starwars", "_type": "_doc", "_id": "2", "matched": true, "explanation": { ...

Slide 69

Slide 69

POST /starwars/_search { "query": { "match": { "quote": "machine" } } }

Slide 70

Slide 70

{ } "took": 2, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 1, "max_score": 1.2499592, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.2499592, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ] }

Slide 71

Slide 71

POST /starwars/_search { "query": { "match_phrase": { "quote": "I am your father" } } }

Slide 72

Slide 72

{ } "took": 3, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1.5665855, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.5665855, "_source": { "quote": "<b>No</b>. I am your father." } } ] }

Slide 73

Slide 73

POST /starwars/_search { "query": { "match_phrase": { "quote": { "query": "I am father", "slop": 1 } } } }

Slide 74

Slide 74

{ } "took": 16, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.8327639, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.8327639, "_source": { "quote": "<b>No</b>. I am your father." } } ] }

Slide 75

Slide 75

POST /starwars/_search { "query": { "match_phrase": { "quote": { "query": "I am not your father", "slop": 1 } } } }

Slide 76

Slide 76

{ } "took": 5, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1.0409548, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.0409548, "_source": { "quote": "<b>No</b>. I am your father." } } ] }

Slide 77

Slide 77

POST /starwars/_search { "query": { "match": { "quote": { "query": "van", "fuzziness": "AUTO" } } } }

Slide 78

Slide 78

{ } "took": 14, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.18155496, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.18155496, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] }

Slide 79

Slide 79

POST /starwars/_search { "query": { "match": { "quote": { "query": "ovi-van", "fuzziness": 1 } } } }

Slide 80

Slide 80

{ } "took": 109, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.3798467, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.3798467, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] }

Slide 81

Slide 81

FuzzyQuery History http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html Before: Brute force Now: Levenshtein Automaton

Slide 82

Slide 82

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

Slide 83

Slide 83

SELECT * FROM starwars WHERE quote LIKE "?an" OR quote LIKE "V?n" OR quote LIKE "Va?"

Slide 84

Slide 84

Score

Slide 85

Slide 85

Term Frequency / Inverse Document Frequency (TF/IDF) Search one term

Slide 86

Slide 86

BM25 Default in Elasticsearch 5.0 https://speakerdeck.com/elastic/improved-text-scoring-withbm25

Slide 87

Slide 87

Term Frequency

Slide 88

Slide 88

Slide 89

Slide 89

Inverse Document Frequency

Slide 90

Slide 90

Slide 91

Slide 91

Field-Length Norm

Slide 92

Slide 92

POST /starwars/_search?explain=true { "query": { "match": { "quote": "father" } } }

Slide 93

Slide 93

... "_explanation": { "value": 0.41913947, "description": "weight(Synonym(quote:dad quote:father) in 0) [PerFieldSimilarity], result of:", "details": [ { "value": 0.41913947, "description": "score(doc=0,freq=2.0 = termFreq=2.0\n), product of:", "details": [ { "value": 0.2876821, "description": "idf(docFreq=1, docCount=1)", "details": [] }, { "value": 1.4569536, "description": "tfNorm, computed from:", "details": [ { "value": 2, "description": "termFreq=2.0", "details": [] }, ...

Slide 94

Slide 94

Score 0.41913947: i am your father 0.39291072: obi wan never told what happen your father you

Slide 95

Slide 95

Vector Space Model Search multiple terms

Slide 96

Slide 96

Search your father

Slide 97

Slide 97

Slide 98

Slide 98

Coordination Factor Reward multiple terms

Slide 99

Slide 99

Search for 3 terms 1 term: 2 terms: 3 terms:

Slide 100

Slide 100

Practical Scoring Function Putting it all together

Slide 101

Slide 101

score(q,d) = queryNorm(q) · coord(q,d) · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ) (t in q)

Slide 102

Slide 102

Function Score Script, weight, random, field value, decay (geo or date)

Slide 103

Slide 103

POST /starwars/_search { "query": { "function_score": { "query": { "match": { "quote": "father" } }, "random_score": {} } } }

Slide 104

Slide 104

Compare Scores "100% perfect" vs a "50%" match

Slide 105

Slide 105

Don't do this. Seriously. Stop trying to think about your problem this way, it's not going to end well. — https://wiki.apache.org/lucene-java/ ScoresAsPercentages

Slide 106

Slide 106

GET /starwars/_analyze { "analyzer" : "my_analyzer", "text": "These are my father's machines." }

Slide 107

Slide 107

{ "tokens": [ { "token": "my", "start_offset": 10, "end_offset": 12, "type": "<ALPHANUM>", "position": 2 }, { "token": "father", "start_offset": 13, "end_offset": 21, "type": "<ALPHANUM>", "position": 3 }, { "token": "dad", "start_offset": 13, "end_offset": 21, "type": "SYNONYM", "position": 3 }, { "token": "machin", "start_offset": 22, "end_offset": 30, "type": "<ALPHANUM>", "position": 4 } ] }

Slide 108

Slide 108

PUT /starwars/_doc/4 { "quote": "These are my father's machines." }

Slide 109

Slide 109

POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }

Slide 110

Slide 110

"hits": { "total": 4, "max_score": 2.92523, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 2.92523, "_source": { "quote": "These are my father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.8617505, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...

Slide 111

Slide 111

2.92523 == 100%

Slide 112

Slide 112

DELETE /starwars/_doc/4 POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }

Slide 113

Slide 113

"hits": { "total": 3, "max_score": 1.2499592, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.2499592, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...

Slide 114

Slide 114

1.2499592 == 43% or 100%?

Slide 115

Slide 115

PUT /starwars/_doc/4 { "quote": "These droids are my father's father's machines." } POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }

Slide 116

Slide 116

"hits": { "total": 4, "max_score": 3.0068164, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 3.0068164, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.89701396, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...

Slide 117

Slide 117

3.0068164 == 103%?

Slide 118

Slide 118

Slide 119

Slide 119

Performance

Slide 120

Slide 120

Slide 121

Slide 121

Slide 122

Slide 122

Conclusion

Slide 123

Slide 123

Indexing Formatting Tokenize Lowercase, Stop Words, Stemming Synonyms

Slide 124

Slide 124

Scoring Term Frequency Inverse Document Frequency Field-Length Norm Vector Space Model

Slide 125

Slide 125

https://www.elastic.co/blog/elastic-app-search-beta-released

Slide 126

Slide 126

Thank You Questions & Stickers Philipp Krenn @xeraa