A presentation at CrunchConf in Budapest, Hungary by Philipp Krenn
Search and Beyond Philipp Krenn @xeraa
Developer
This is not a Training https://training.elastic.co
Agenda
09:00 - 10:40: Intro & Architecture & Search
10:40 - 11:00: Coffee break
11:00 - 12:20: More Search
12:20 - 13:05: Lunch
13:05 - 15:00: Monitoring
15:00 - 15:20: Coffee break
15:20 - 17:00: More Monitoring & Q&A
Elastic Stack Architecture
$ curl http://localhost:9200 { "name" : "elasticsearch-hot", "cluster_name" : "metrics-cluster", "cluster_uuid" : "06nHPLLgTrmZEpYli6JW5w", "version" : { "number" : "6.5.0", "build_flavor" : "default", "build_type" : "tar", "build_hash" : "c53b7d3", "build_date" : "2018-11-08T21:28:50.577384Z", "build_snapshot" : false, "lucene_version" : "7.5.0", "minimum_wire_compatibility_version" : "5.6.0", "minimum_index_compatibility_version" : "5.0.0" }, "tagline" : "You Know, for Search" }
https://db-engines.com/en/ranking
Only accept features that scale. (https://github.com/elastic/engineering/blob/master/development_constitution.md)
Horizontal Scaling Shards Replication Writes & Reads
Exhibit A: A JSON Document { "name": "Elasticsearch", "author": "Shay Banon", "stable_version": "6.5.0", "preview_version": "7.0.0-alpha1" }
Exhibit B: A cURL Command $ curl -XPOST -i localhost:9200/databases/nosql -d ' { "name": "Elasticsearch", "author": "Shay Banon", "stable_version": "6.5.0", "preview_version": "7.0.0-alpha1" }'
Exhibit B: A cURL Command HTTP/1.1 201 Created Location: /databases/nosql/AVfD8XQaeuK3k1LGtT8- content-type: application/json; charset=UTF-8 content-length: 162 { "_index":"databases", "_type":"nosql", "_id":"AVfD8XQaeuK3k1LGtT8-", "_version":1, "result":"created", "_shards": { "total":2, "successful":1, "failed":0 }, "created":true }
Exhibit C: A Console Command POST /databases/nosql { "name": "Elasticsearch", "author": "Shay Banon", "stable_version": "6.4.2", "preview_version": "7.0.0-alpha2" }
Exhibit C: A Console Command { "_index": "databases", "_type": "nosql", "_id": "AVfD6ukyeuK3k1LGtSwT", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 1, "failed": 0 }, "created": true }
Single node $ cd elasticsearch-<version> $ ./bin/elasticsearch
Cluster No more broadcasting
Node Types
Master-eligible node Default by default
Data node Default by default
Client node Smart load balancer
Ingest node Parse and enrich
ML node Machine learning
Discovery
Zen discovery
Peer to peer network Unicast Who to contact Ping Discover each other
One master node Elected or joined to
Three dedicated master nodes for production
Healthcheck Master pings all nodes, they report back
discovery.zen.no_master_block write | all
discovery.zen.minimum_master_nodes
Split brain
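A minimal sketch of the quorum arithmetic behind discovery.zen.minimum_master_nodes, assuming the commonly documented rule of a strict majority of master-eligible nodes. The function name is hypothetical and only illustrates the calculation; it is not part of Elasticsearch.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Smallest strict majority of the master-eligible nodes (assumed rule: n // 2 + 1)."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# With two master-eligible nodes the quorum equals the node count,
# so losing either node blocks master election.
for nodes in (1, 2, 3, 5):
    print(nodes, "master-eligible ->", minimum_master_nodes(nodes))
```

With three dedicated master nodes the quorum is two, so one node can fail without either side of a network partition electing a second master.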
GET /_cluster/health green yellow red
Alternative discovery Azure, EC2, GCE
Indexing a Document
Document Unique combination: _index _type _id PS: Types will be removed
POST /databases/nosql vs PUT /databases/nosql/elasticsearch
Autogenerated ID 20 characters, URL-safe, Base64-encoded, GUID strings AVfD6ukyeuK3k1LGtSwT
Consistent hashing Before 2.0: djb2 Ignoring _routing
unsigned long hash(unsigned char *str) { unsigned long hash = 5381; int c; while (c = *str++) hash = ((hash << 5) + hash) + c; /* hash * 33 + c */ return hash; }
Consistent hashing Current default: murmur3 https://github.com/elastic/elasticsearch/blob/5.4/core/src/main/java/org/elasticsearch/common/hash/MurmurHash3.java
Consistent hashing Better distribution 100,000 incremental IDs https://github.com/elastic/elasticsearch/pull/7954
Consistent hashing 3 shards murmur3 [33185, 33347, 33468] djb2 [30100, 30000, 39900]
Consistent hashing 5 shards murmur3 [19933, 19964, 19940, 20030, 20133] djb2 [20000, 20000, 20000, 20000, 20000]
Consistent hashing 33 shards murmur3 [2999, 3096, 2930, 2986, 3070, 3093, 3023, 3052, 3112, 2940, 3036, 2985, 3031, 3048, 3127, 2961, 2901, 3105, 3041, 3130, 3013, 3035, 3031, 3019, 3008, 3022, 3111, 3086, 3016, 2996, 3075, 2945, 2977] djb2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 900, 900, 900, 900, 1000, 1000, 10000, 10000, 10000, 10000, 9100, 9100, 9100, 9100, 9000, 9000, 0, 0, 0, 0, 0, 0]
Shard decision shard = hash(doc_id) % (num_of_primary_shards)
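The shard decision can be sketched in a few lines. This is an illustration only: it uses djb2 (the pre-2.0 hash shown earlier) truncated to 32 bits, which is an assumption, and the helper names are hypothetical. It does not claim to reproduce the distribution numbers on the slides.

```python
def djb2(key: str) -> int:
    """djb2 string hash: hash * 33 + byte, starting from 5381."""
    value = 5381
    for byte in key.encode("utf-8"):
        value = (value * 33 + byte) & 0xFFFFFFFF  # 32-bit truncation is an assumption
    return value


def shard_for(doc_id: str, num_primary_shards: int) -> int:
    """shard = hash(doc_id) % num_of_primary_shards"""
    return djb2(doc_id) % num_primary_shards


def distribution(num_docs: int, num_primary_shards: int) -> list:
    """Count how many incremental IDs land on each shard."""
    counts = [0] * num_primary_shards
    for i in range(num_docs):
        counts[shard_for(str(i), num_primary_shards)] += 1
    return counts


print(distribution(100_000, 3))
print(distribution(100_000, 33))
```

Because the shard is a modulo of the primary shard count, changing that count would send existing IDs to different shards.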
Write Coordinating Node, Hash, Primary, Replica(s)
Get & Aggregate Coordinating Node, Hash, Shard
Optimistic concurrency control _version
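A toy in-memory sketch of optimistic concurrency control with a version counter. The class and exception names are hypothetical; it only illustrates the idea that a write carrying a stale version is rejected instead of silently overwriting a newer document.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, expected_version=None):
        """Store a document; if expected_version is given it must match the current one."""
        current_version, _ = self._docs.get(doc_id, (0, None))
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, current is {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, source)
        return new_version


store = VersionedStore()
store.index("1", {"name": "Elasticsearch"})                       # creates version 1
store.index("1", {"name": "Elasticsearch 2"}, expected_version=1)  # accepted, version 2
try:
    store.index("1", {"name": "stale write"}, expected_version=1)  # rejected
except VersionConflict as error:
    print("conflict:", error)
```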
Flaky nodes index.unassigned.node_left.delayed_timeout: 1m
Sequence numbers Quick recovery in 6.0
Lucene Segment
Lucene at work
Segments are immutable
index.refresh_interval 7.0: index.search.idle.after
Tombstone file Marks deleted documents
Merge Combine (and clean up) segments
Visualize merges http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
Searching
Search Coordinating Node, Query then Fetch
Benchmarks Fair Reproducible Close to Production
Full-Text Search
Who uses a Database?
Who uses Search?
Store
Apache Lucene Elasticsearch
Example These are <em>not</em> the droids you are looking for.
html_strip Char Filter These are not the droids you are looking for.
standard Tokenizer These are not the droids you are looking for
lowercase Token Filter these are not the droids you are looking for
stop Token Filter droids you looking
snowball Token Filter droid you look
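The chain above (char filter, tokenizer, token filters) can be sketched as plain functions. This is a toy stand-in, not how Lucene implements it: the regex tokenizer and especially toy_stem are simplified assumptions in place of the standard tokenizer and the snowball stemmer. The stop word list is the English list shown later in this deck.

```python
import re

STOP_WORDS = set(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def html_strip(text: str) -> str:
    """Char filter: drop HTML tags (simplified)."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text: str) -> list:
    """Tokenizer: split on non-word characters (simplified)."""
    return re.findall(r"\w+", text)


def toy_stem(token: str) -> str:
    """Toy stemmer covering only '-ing' and plural '-s'; NOT the snowball algorithm."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text: str) -> list:
    tokens = tokenize(html_strip(text))
    tokens = [t.lower() for t in tokens]                  # lowercase token filter
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop token filter
    return [toy_stem(t) for t in tokens]                  # stemming token filter


print(analyze("These are <em>not</em> the droids you are looking for."))
```

The order matters: lowercasing runs before the stop filter so that "These" is recognized as the stop word "these".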
Analyze
GET /_analyze { "analyzer": "english", "text": "These are not the droids you are looking for." }
{ "tokens": [ { "token": "droid", "start_offset": 18, "end_offset": 24, "type": "<ALPHANUM>", "position": 4 }, { "token": "you", "start_offset": 25, "end_offset": 28, "type": "<ALPHANUM>", "position": 5 }, ... ] }
GET /_analyze { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "stop", "snowball" ], "text": "These are <em>not</em> the droids you are looking for." }
{ "tokens": [ { "token": "droid", "start_offset": 27, "end_offset": 33, "type": "<ALPHANUM>", "position": 4 }, { "token": "you", "start_offset": 34, "end_offset": 37, "type": "<ALPHANUM>", "position": 5 }, ... ] }
Stop Words a an and are as at be but by for if in into is it no not of on or such that the their then there these they this to was will with https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L44-L50
Always Use Stop Words?
To be, or not to be.
Languages Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Turkish, Thai
Language Rules English: Philipp's → philipp French: l'église → eglis German: äußerst → ausserst
More Language Plugins Core: ICU (Asian languages), Kuromoji (advanced Japanese), Phonetic, SmartCN, Stempel (Polish), Ukrainian Community: Hebrew, Vietnamese, Network Address Analysis, String2Integer,...
German GET /_analyze { "analyzer": "german", "text": "Das sind nicht die Droiden, nach denen du suchst." }
{ "tokens": [ { "token": "droid", "start_offset": 19, "end_offset": 26, "type": "<ALPHANUM>", "position": 4 }, { "token": "den", "start_offset": 33, "end_offset": 38, "type": "<ALPHANUM>", "position": 6 }, { "token": "such", "start_offset": 42, "end_offset": 48, "type": "<ALPHANUM>", "position": 8 } ] }
German with the English Analyzer da sind nicht die droiden nach denen du suchst
German Stop Words https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/german_stop.txt
Detect Languages https://github.com/spinscale/elasticsearch-ingest-langdetect
PUT _ingest/pipeline/langdetect-pipeline { "description": "A pipeline to detect languages", "processors": [ { "langdetect" : { "field" : "quote", "target_field" : "language" } } ] }
POST _ingest/pipeline/langdetect-pipeline/_simulate { "docs": [ { "_source": { "quote": "Das sind nicht die Droiden, nach denen du suchst." } } ] }
{ "docs": [ { "doc": { "_index": "_index", "_type": "_type", "_id": "_id", "_source": { "language": "de", "quote": "Das sind nicht die Droiden, nach denen du suchst." }, "_ingest": { "timestamp": "2018-10-26T00:06:42.320613Z" } } } ] }
Phonetic GET /_analyze { "tokenizer": "standard", "filter": [ { "type": "phonetic", "encoder": "beider_morse", "languageset": "any" } ], "text": "These are not the droids you are looking for." }
Phonetic ... drDts drits drots loknk... iou ari ori
Another Example Obi-Wan never told you what happened to your father.
Another Example obi wan never told you what happen your father
Another Example <b>No</b>. I am your father.
Another Example i am your father
Inverted Index
Term     ID 1   ID 2   ID 3
am       0      0      1[2]
droid    1[4]   0      0
father   0      1[9]   1[4]
happen   0      1[6]   0
i        0      0      1[1]
look     1[7]   0      0
never    0      1[2]   0
obi      0      1[0]   0
told     0      1[3]   0
wan      0      1[1]   0
what     0      1[5]   0
you      1[5]   1[4]   0
your     0      1[8]   1[3]
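A minimal sketch of how such an inverted index with positions could be built. The token lists are written out by hand from the analyzed examples above, with None marking a removed stop word so the remaining tokens keep their original positions. The function name is hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs: dict) -> dict:
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            if token is None:  # removed stop word: keep the position gap
                continue
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    1: [None, None, None, None, "droid", "you", None, "look", None],
    2: ["obi", "wan", "never", "told", "you", "what", "happen", None, "your", "father"],
    3: [None, "i", "am", "your", "father"],
}

index = build_inverted_index(docs)
print(index["father"])  # documents and positions containing "father"
print(index["you"])
```

A term lookup is then a single dictionary access, and the stored positions are what make phrase queries possible.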
To / The Index
PUT /starwars { "settings": { "number_of_shards": 1, "analysis": { "filter": { "my_synonym_filter": { "type": "synonym", "synonyms": [ "father,dad", "droid => droid,machine" ] } },
"analyzer": { "my_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "stop", "snowball", "my_synonym_filter" ] } } } },
"mappings": { "_doc": { "properties": { "quote": { "type": "text", "analyzer": "my_analyzer" } } } } }
Synonyms Index synonym or query time synonym_graph
GET /starwars/_mapping GET /starwars/_settings
PUT /starwars/_doc/1 { "quote": "These are <em>not</em> the droids you are looking for." } PUT /starwars/_doc/2 { "quote": "Obi-Wan never told you what happened to your father." } PUT /starwars/_doc/3 { "quote": "<b>No</b>. I am your father." }
GET /starwars/_doc/1 GET /starwars/_doc/1/_source
Multi Lingual Index PUT /starwars_en/_doc/1 Type Field { "quote_en": "...", "quote_de": "..." }
PS: Single Type per Index
Search
POST /starwars/_search { "query": { "match_all": { } } }
GET vs POST
{ "took": 1, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3, "max_score": 1, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 1, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, ...
POST /starwars/_search { "query": { "match": { "quote": "droid" } } }
{ "took": 2, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.39556286, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.39556286, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ] } }
POST /starwars/_search { "query": { "match": { "quote": "dad" } } }
... "hits": { "total": 2, "max_score": 0.41913947, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.41913947, "_source": { "quote": "<b>No</b>. I am your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.39291072, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] } }
POST /starwars/_doc/0/_explain { "query": { "match": { "quote": "dad" } } }
{ "_index": "starwars", "_type": "_doc", "_id": "0", "matched": false }
POST /starwars/_doc/1/_explain { "query": { "match": { "quote": "dad" } } }
{ "_index": "starwars", "_type": "_doc", "_id": "1", "matched": false, "explanation": { "value": 0, "description": "no matching term", "details": [] } }
POST /starwars/_doc/2/_explain { "query": { "match": { "quote": "dad" } } }
{ "_index": "starwars", "_type": "_doc", "_id": "2", "matched": true, "explanation": { ...
POST /starwars/_search { "query": { "match": { "quote": "machine" } } }
{ "took": 2, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 1, "max_score": 1.2499592, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.2499592, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ] } }
POST /starwars/_search { "query": { "match_phrase": { "quote": "I am your father" } } }
{ "took": 3, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1.5665855, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.5665855, "_source": { "quote": "<b>No</b>. I am your father." } } ] } }
POST /starwars/_search { "query": { "match_phrase": { "quote": { "query": "I am father", "slop": 1 } } } }
{ "took": 16, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.8327639, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.8327639, "_source": { "quote": "<b>No</b>. I am your father." } } ] } }
POST /starwars/_search { "query": { "match_phrase": { "quote": { "query": "I am not your father", "slop": 1 } } } }
{ "took": 5, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1.0409548, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.0409548, "_source": { "quote": "<b>No</b>. I am your father." } } ] } }
POST /starwars/_search { "query": { "match": { "quote": { "query": "van", "fuzziness": "AUTO" } } } }
{ "took": 14, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.18155496, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.18155496, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] } }
POST /starwars/_search { "query": { "match": { "quote": { "query": "ovi-van", "fuzziness": 1 } } } }
{ "took": 109, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.3798467, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.3798467, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] } }
FuzzyQuery History http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html Before: Brute force Now: Levenshtein Automaton
http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
SELECT * FROM starwars WHERE quote LIKE "?an" OR quote LIKE "V?n" OR quote LIKE "Va?"
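To make the "brute force" approach concrete, here is a sketch that computes the edit distance between the query and every term in a vocabulary. It uses the textbook dynamic-programming Levenshtein distance, not the Levenshtein automaton that Lucene uses now, and it ignores transpositions. The function names and the sample vocabulary are assumptions for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def fuzzy_terms(query: str, vocabulary: list, max_edits: int = 1) -> list:
    """Brute force: compare the query against every indexed term."""
    return sorted(t for t in vocabulary if levenshtein(query, t) <= max_edits)


vocabulary = ["obi", "wan", "never", "told", "father", "droid"]
print(fuzzy_terms("van", vocabulary))
print(fuzzy_terms("ovi", vocabulary))
```

The cost of this scan grows with the size of the term dictionary, which is the problem the automaton-based approach addresses.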
Score
Term Frequency / Inverse Document Frequency (TF/IDF) Search one term
BM25 Default in Elasticsearch 5.0 https://speakerdeck.com/elastic/improved-text-scoring-with-bm25
Term Frequency
Inverse Document Frequency
Field-Length Norm
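A sketch of the three classic TF/IDF factors, assuming the formulas commonly documented for Lucene's classic similarity (square root of the term frequency, a logarithmic inverse document frequency, and an inverse square root of the field length). The function names are hypothetical and the exact constants should be treated as an assumption.

```python
import math


def term_frequency(freq: int) -> float:
    """More occurrences of the term in the field score higher, with diminishing returns."""
    return math.sqrt(freq)


def inverse_document_frequency(num_docs: int, doc_freq: int) -> float:
    """Terms that appear in fewer documents score higher."""
    return 1 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms: int) -> float:
    """A match in a shorter field scores higher."""
    return 1 / math.sqrt(num_terms)


# A term occurring once in a 4-term field, present in 1 of 3 documents.
print(term_frequency(1), inverse_document_frequency(3, 1), field_length_norm(4))
```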
POST /starwars/_search?explain=true { "query": { "match": { "quote": "father" } } }
... "_explanation": { "value": 0.41913947, "description": "weight(Synonym(quote:dad quote:father) in 0) [PerFieldSimilarity], result of:", "details": [ { "value": 0.41913947, "description": "score(doc=0,freq=2.0 = termFreq=2.0\n), product of:", "details": [ { "value": 0.2876821, "description": "idf(docFreq=1, docCount=1)", "details": [] }, { "value": 1.4569536, "description": "tfNorm, computed from:", "details": [ { "value": 2, "description": "termFreq=2.0", "details": [] }, ...
Score 0.41913947: i am your father 0.39291072: obi wan never told you what happen your father
Vector Space Model Search multiple terms
Search your father
Coordination Factor Reward multiple terms
Search for 3 terms 1 term: 2 terms: 3 terms:
Practical Scoring Function Putting it all together
score(q,d) = queryNorm(q) · coord(q,d) · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ) (t in q)
Function Score Script, weight, random, field value, decay (geo or date)
POST /starwars/_search { "query": { "function_score": { "query": { "match": { "quote": "father" } }, "random_score": {} } } }
Compare Scores "100% perfect" vs a "50%" match
Don't do this. Seriously. Stop trying to think about your problem this way, it's not going to end well. (https://wiki.apache.org/lucene-java/ScoresAsPercentages)
GET /starwars/_analyze { "analyzer" : "my_analyzer", "text": "These are my father's machines." }
{ "tokens": [ { "token": "my", "start_offset": 10, "end_offset": 12, "type": "<ALPHANUM>", "position": 2 }, { "token": "father", "start_offset": 13, "end_offset": 21, "type": "<ALPHANUM>", "position": 3 }, { "token": "dad", "start_offset": 13, "end_offset": 21, "type": "SYNONYM", "position": 3 }, { "token": "machin", "start_offset": 22, "end_offset": 30, "type": "<ALPHANUM>", "position": 4 } ] }
PUT /starwars/_doc/4 { "quote": "These are my father's machines." }
POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }
"hits": { "total": 4, "max_score": 2.92523, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 2.92523, "_source": { "quote": "These are my father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.8617505, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...
2.92523 == 100%
DELETE /starwars/_doc/4 POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }
"hits": { "total": 3, "max_score": 1.2499592, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.2499592, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...
1.2499592 == 43% or 100%?
PUT /starwars/_doc/4 { "quote": "These droids are my father's father's machines." } POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }
"hits": { "total": 4, "max_score": 3.0068164, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 3.0068164, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.89701396, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...
3.0068164 == 103%?
PS: Shards Default? Effect on IDF?
Distributed Frequency Search GET starwars/_search?search_type=dfs_query_then_fetch { ... }
Don’t use dfs_query_then_fetch in production. It really isn’t required. (https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-is-broken.html)
More Search
Highlighting
POST /starwars/_search { "query": { "match": { "quote": "father" } }, "highlight": { "type": "unified", "pre_tags": [ "<tag>" ], "post_tags": [ "</tag>" ], "fields": { "quote": {} } } }
... "hits": { "total": 3, "max_score": 0.631961, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 0.631961, "_source": { "quote": "These droids are my father's father's machines." }, "highlight": { "quote": [ "These droids are my <tag>father's</tag> <tag>father's</tag> machines." ] } }, ...
Boolean Queries must must_not should filter
POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father" } }, "should": [ { "match": { "quote": "your" } }, { "match": { "quote": "obi" } } ] } } }
... "hits": { "total": 3, "max_score": 2.117857, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 2.117857, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.3856719, "_source": { "quote": "<b>No</b>. I am your father." } }, ...
POST /starwars/_search { "query": { "bool": { "filter": { "match": { "quote": "father" } }, "should": [ { "match": { "quote": "your" } }, { "match": { "quote": "obi" } } ] } } }
... "hits": { "total": 3, "max_score": 1.6694657, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 1.6694657, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.8317767, "_source": { "quote": "<b>No</b>. I am your father." } },
Named Queries & minimum_should_match
POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father" } }, "should": [ { "match": { "quote": { "query": "your", "_name": "quote-your" } } }, { "match": { "quote": { "query": "obi", "_name": "quote-obi" } } }, { "match": { "quote": { "query": "droid", "_name": "quote-droid" } } } ], "minimum_should_match": 2 } } }
... "hits": { "total": 1, "max_score": 2.117857, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 2.117857, "_source": { "quote": "Obi-Wan never told you what happened to your father." }, "matched_queries": [ "quote-obi", "quote-your" ] } ] } }
Boosting >1 increase, <1 decrease, <0 punish <0 removed in 7.0
POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father" } }, "should": [ { "match": { "quote": "your" } }, { "match": { "quote": { "query": "obi", "boost": 3 } } } ] } } }
... "hits": { "total": 3, "max_score": 4.2368493, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 4.2368493, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.3856719, "_source": { "quote": "<b>No</b>. I am your father." } }, ...
Search for father but prefer father father
POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father father" } } } } }
... "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 1.263922, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.1077905, "_source": { "quote": "<b>No</b>. I am your father." } },
POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father" } }, "should": { "match_phrase": { "quote": "father father" } } } } }
... "hits": { "total": 3, "max_score": 9.146545, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 9.146545, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.0454913, "_source": { "quote": "<b>No</b>. I am your father." } }, ...
Suggestion Suggest a similar text _search end point _suggest deprecated since 5.0
POST /starwars/_search { "query": { "match": { "quote": "drui" } }, "suggest": { "my_suggestion" : { "text" : "drui", "term" : { "field" : "quote" } } } }
... "hits": { "total": 0, "max_score": null, "hits": [] }, "suggest": { "my_suggestion": [ { "text": "drui", "offset": 0, "length": 4, "options": [ { "text": "droid", "score": 0.5, "freq": 1 } ] } ] } }
Multiple Suggesters term phrase completion context
NGram Partial matches Edge Gram
GET /_analyze { "char_filter": [ "html_strip" ], "tokenizer": { "type": "ngram", "min_gram": "3", "max_gram": "3", "token_chars": [ "letter" ] }, "filter": [ "lowercase" ], "text": "These are <em>not</em> the droids you are looking for." }
{ "tokens": [ { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 }, { "token": "hes", "start_offset": 1, "end_offset": 4, "type": "word", "position": 1 }, { "token": "ese", "start_offset": 2, "end_offset": 5, "type": "word", "position": 2 }, { "token": "are", "start_offset": 6, "end_offset": 9, "type": "word", "position": 3 }, ...
GET /_analyze { "char_filter": [ "html_strip" ], "tokenizer": { "type": "edge_ngram", "min_gram": "1", "max_gram": "3", "token_chars": [ "letter" ] }, "filter": [ "lowercase" ], "text": "These are <em>not</em> the droids you are looking for." }
{ "tokens": [ { "token": "t", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 }, { "token": "th", "start_offset": 0, "end_offset": 2, "type": "word", "position": 1 }, { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 2 }, { "token": "a", "start_offset": 6, "end_offset": 7, "type": "word", "position": 3 }, { "token": "ar", "start_offset": 6, "end_offset": 8, "type": "word", "position": 4 }, ...
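A sketch of what the two tokenizers produce for a single word. The helper names are hypothetical, and the real tokenizers also handle token_chars, offsets, and positions, which are left out here.

```python
def ngrams(word: str, min_gram: int, max_gram: int) -> list:
    """All substrings of length min_gram..max_gram, ordered by start offset."""
    return [
        word[start:start + size]
        for start in range(len(word))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(word)
    ]


def edge_ngrams(word: str, min_gram: int, max_gram: int) -> list:
    """Only the prefixes of length min_gram..max_gram."""
    return [word[:size] for size in range(min_gram, min(max_gram, len(word)) + 1)]


print(ngrams("these", 3, 3))       # trigrams, as in the ngram example
print(edge_ngrams("these", 1, 3))  # prefixes, as in the edge_ngram example
```

Edge n-grams fit prefix matching such as search-as-you-type, while full n-grams also match in the middle of a word at the cost of a larger index.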
Combining Analyzers Reindex Store multiple times Tune BM25 Combine scores
BM25 Revisited
https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
b: field length amplification, default 0.75
k1: term frequency saturation, default 1.2
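A sketch of how b and k1 enter the BM25 score of a single term, assuming the formula as documented for Lucene 7.x (including the k1 + 1 factor in the numerator). The function names are hypothetical. With docFreq=1 and docCount=1 the idf matches the 0.2876821 shown in the explain output earlier in this deck.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """Inverse document frequency as used by BM25."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term frequency part: k1 controls saturation, b controls length normalization."""
    if freq == 0:
        return 0.0
    length_factor = 1 - b + b * (doc_len / avg_doc_len)
    return freq * (k1 + 1) / (freq + k1 * length_factor)


def bm25_term_score(freq, doc_freq, doc_count, doc_len, avg_doc_len, k1=1.2, b=0.75):
    return bm25_idf(doc_freq, doc_count) * bm25_tf_norm(freq, doc_len, avg_doc_len, k1, b)


# With k1 = 0 and b = 0, term frequency and field length no longer matter:
# every matching document gets the plain idf.
print(bm25_term_score(freq=1, doc_freq=2, doc_count=4, doc_len=9, avg_doc_len=7, k1=0, b=0))
print(bm25_term_score(freq=2, doc_freq=2, doc_count=4, doc_len=7, avg_doc_len=7, k1=0, b=0))
```

Assuming four documents of which two contain the term, both calls give ln(2), about 0.6931472, which is consistent with the identical scores returned for quote.full on the starwars_v42 index.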
PUT /starwars_v42 { "settings": { "number_of_shards": 1, "index": { "similarity": { "default": { "type": "BM25", "b": 0, "k1": 0 } } },
"analysis": { "filter": { "my_synonym_filter": { "type": "synonym", "synonyms": [ "father,dad", "droid => droid,machine" ] }, "my_ngram_filter": { "type": "ngram", "min_gram": "3", "max_gram": "3", "token_chars": [ "letter" ] } },
"analyzer": { "my_lowercase_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "whitespace", "filter": [ "lowercase" ] }, "my_full_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "stop", "snowball", "my_synonym_filter" ] },
"my_ngram_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "whitespace", "filter": [ "lowercase", "stop", "my_ngram_filter" ] } } } },
"mappings": { "_doc": { "properties": { "quote": { "type": "text", "fields": { "lowercase": { "type": "text", "analyzer": "my_lowercase_analyzer" }, "full": { "type": "text", "analyzer": "my_full_analyzer" }, "ngram": { "type": "text", "analyzer": "my_ngram_analyzer" } } } } } } }
POST /_reindex { "source": { "index": "starwars" }, "dest": { "index": "starwars_v42" } }
Aliases Atomic remove and add Point to multiple indices (read-only)
PUT _alias { "actions": [ { "add": { "index": "starwars_v42", "alias": "starwars_extended" } } ] }
POST /starwars/_search { "query": { "match": { "quote": "droid" } } }
"hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 1.1533037, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.1295731, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ]
POST /starwars_extended/_search { "query": { "match": { "quote.full": "droid" } } }
"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 0.6931472, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "4", "_score": 0.6931472, "_source": { "quote": "These droids are my father's father's machines." } } ]
There are no "best" b and k1 values
POST /starwars_extended/_search?explain=true { "query": { "multi_match": { "query": "obiwan", "fields": [ "quote", "quote.lowercase", "quote.full", "quote.ngram" ], "type": "most_fields" } } }
... "hits": { "total": 1, "max_score": 0.4912064, "hits": [ { "_shard": "[starwars_v42][2]", "_node": "BCDwzJ4WSw2dyoGLTzwlqw", "_index": "starwars_v42", "_type": "_doc", "_id": "2", "_score": 0.4912064, "_source": { "quote": "Obi-Wan never told you what happened to your father." }, ...
Whitespace Tokenizer "weight( Synonym(quote.ngram:biw quote.ngram:iwa quote.ngram:obi quote.ngram:wan) in 0) [PerFieldSimilarity], result of:"
POST /starwars_extended/_search { "query": { "multi_match": { "query": "you", "fields": [ "quote", "quote.lowercase^5", "quote.full", "quote.ngram" ], "type": "best_fields" } } }
"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 3.465736, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "2", "_score": 3.465736, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "3", "_score": 0.35667494, "_source": { "quote": "<b>No</b>. I am your father." } } ]
Multi Match Type
best_fields: Score of the best field (default)
cross_fields: All terms in at least one field
most_fields: Score sum of all fields
phrase
Different Analyzers for Indexing and Searching Per query In the mapping
POST /starwars_extended/_search { "query": { "match": { "quote.ngram": { "query": "the", "analyzer": "standard" } } } }
... "hits": [ { "_index": "starwars_extended", "_type": "_doc", "_id": "2", "_score": 0.38254172, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars_extended", "_type": "_doc", "_id": "3", "_score": 0.36165747, "_source": { "quote": "<b>No</b>. I am your father." } } ] ...
Edge Gram vs Trigram Test a setting before adding a field
Shingle Token Filter Shingles (token ngrams) from a token stream
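A sketch of shingle generation over an already tokenized stream. The function name is hypothetical, and it emits only the shingles; the real filter can additionally keep the single tokens.

```python
def shingles(tokens: list, size: int = 2, separator: str = " ") -> list:
    """Join every run of `size` adjacent tokens into one term."""
    return [
        separator.join(tokens[start:start + size])
        for start in range(len(tokens) - size + 1)
    ]


print(shingles(["these", "droids", "are", "my", "father's", "machines"]))
print(shingles(["these", "are", "not", "the", "droids"]))
```

Only the first stream contains the shingle "these droids", so a query for that word pair can favor documents where the two words are adjacent.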
POST /starwars_extended/_close PUT /starwars_extended/_settings { "index": { "similarity": { "default": { "type": "BM25", "b": null, "k1": null } } },
"analysis": { "filter": { "my_edgegram_filter": { "type": "edge_ngram", "min_gram": 3, "max_gram": 10 }, "my_shingle_filter": { "type": "shingle", "min_shingle_size": 2, "max_shingle_size": 2 } },
"analyzer": { "my_edgegram_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "my_edgegram_filter" ] },
"my_shingle_analyzer": { "type": "custom", "tokenizer": "standard", "filter": [ "lowercase", "my_shingle_filter" ] } } } } POST /starwars_extended/_open
GET starwars_extended/_analyze { "text": "Father", "analyzer": "my_edgegram_analyzer" }
{ "tokens": [ { "token": "fat", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }, { "token": "fath", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }, { "token": "fathe", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }, { "token": "father", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 } ] }
PUT /starwars_extended/_doc/_mapping { "properties": { "quote": { "type": "text", "fields": { "edgegram": { "type": "text", "analyzer": "my_edgegram_analyzer", "search_analyzer": "standard" }, "shingle": { "type": "text", "analyzer": "my_shingle_analyzer" } } } } }
PUT /starwars_extended/_doc/5 { "quote": "I find your lack of faith disturbing." } PUT /starwars_extended/_doc/6 { "quote": "That... is your failure." }
GET /starwars_extended/_doc/5/_termvectors { "fields": [ "quote.edgegram" ], "offsets": true, "payloads": true, "positions": true, "term_statistics": true, "field_statistics": true }
{ "_index": "starwars_v42", "_type": "_doc", "_id": "5", "_version": 1, "found": true, "took": 3, "term_vectors": { "quote.edgegram": { "field_statistics": { "sum_doc_freq": 26, "doc_count": 2, "sum_ttf": 26 }, "terms": { "dis": { "doc_freq": 1, "ttf": 1, "term_freq": 1, "tokens": [ { "position": 6, "start_offset": 26, "end_offset": 36 } ] }, "dist": { "doc_freq": 1, "ttf": 1, ...
POST /starwars_extended/_search { "query": { "match": { "quote": "fail" } } }
POST /starwars_extended/_search { "query": { "match": { "quote.lowercase": "fail" } } }
POST /starwars_extended/_search { "query": { "match": { "quote.full": "fail" } } }
POST /starwars_extended/_search { "query": { "match": { "quote.ngram": "fail" } } }
"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "6", "_score": 1.8400999, "_source": { "quote": "That... is your failure." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "5", "_score": 1.442779, "_source": { "quote": "I find your lack of faith disturbing." } } ]
POST /starwars_extended/_search { "query": { "match": { "quote.edgegram": "fail" } } }
"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "6", "_score": 1.0114291, "_source": { "quote": "That... is your failure." } } ]
Updating Missing Fields Expensive
POST /starwars_extended/_update_by_query { "query": { "bool": { "must_not": { "exists": { "field": "quote.edgegram" } } } } }
Shingles: Context Should Matter
POST /starwars_extended/_search { "query": { "bool": { "must": { "match": { "quote.lowercase": "these droids are" } } } } }
"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 2.1837702, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "4", "_score": 2.137744, "_source": { "quote": "These droids are my father's father's machines." } } ]
POST /starwars_extended/_search { "query": { "bool": { "must": { "match": { "quote.shingle": "these droids are" } } } } }
"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "4", "_score": 3.1811738, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 2.6568544, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ]
Decompounding Commonly in German, Scandinavian languages, Finnish, Korean
PUT /decompound_en { "settings": { "number_of_shards": 1, "analysis": { "filter": { "british_decompounder": { "type": "hyphenation_decompounder", "hyphenation_patterns_path": "hyph/en_GB.xml", "word_list": [ "death", "star" ] } }, "analyzer": { "british_decompound": { "type": "custom", "tokenizer": "standard", "filter": [ "lowercase", "british_decompounder" ] } } } } }
GET /decompound_en/_analyze { "analyzer" : "british_decompound", "text" : "deathstar" }
{ "tokens": [ { "token": "deathstar", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 }, { "token": "death", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 }, { "token": "star", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 } ] }
German Dictionary (LGPL) https://github.com/uschindler/german-decompounder
PUT /decompound_de { "settings": { "number_of_shards": 1, "analysis": { "filter": { "german_decompounder": { "type": "hyphenation_decompounder", "word_list_path": "dictionary-de.txt", "hyphenation_patterns_path": "hyph/de_DR.xml", "only_longest_match": true, "min_subword_size": 4 }, "german_stemmer": { "type": "stemmer", "language": "light_german" } }, "analyzer": { "german_decompound": { "type": "custom", "tokenizer": "standard", "filter": [ "lowercase", "german_decompounder", "german_normalization", "german_stemmer" ] } } } } }
GET /decompound_de/_analyze { "analyzer" : "german_decompound", "text" : "Todesstern" }
{ "tokens": [ { "token": "todesst", "start_offset": 0, "end_offset": 10, "type": "<ALPHANUM>", "position": 0 }, { "token": "tod", "start_offset": 0, "end_offset": 10, "type": "<ALPHANUM>", "position": 0 }, { "token": "stern", "start_offset": 0, "end_offset": 10, "type": "<ALPHANUM>", "position": 0 } ] }
Without Word Lists https://github.com/jprante/elasticsearch-analysis-decompound
Performance
Conclusion
Indexing Formatting Tokenize Lowercase, Stop Words, Stemming Synonyms
Scoring Term Frequency Inverse Document Frequency Field-Length Norm Vector Space Model
Advanced Queries Highlighting Suggestions NGrams, Edge Grams Multiple Analyzers
Advanced Queries Reindex & Alias Update by Query Shingles Decompound
Monitor Your Apps with the
Disclaimer I build highly monitored Hello World apps
Agenda Monitor Java (preconfigured) Some Security Monitor PHP (configure yourself)
Code https://github.com/xeraa/microservice-monitoring
Cloud
Java Application
Simple No discovery, load-balancing,...
Monitor Java
Kibana Monitoring Overview of the Elastic Stack components
Metricbeat System [Metricbeat System] Overview and [Metricbeat System] Host overview dashboards See the memory spike every 5min
Time Series Visual Builder Sum of system.memory.actual.used.bytes Sum of system.process.memory.rss.bytes grouped by the term system.process.name and moved to the negative y-axis with a Math step
Packetbeat Call /, /good, /bad, and /foobar [Packetbeat] Overview, [Packetbeat] Flows, [Packetbeat] HTTP, and [Packetbeat] DNS Tunneling dashboards
Packetbeat Raw events in Discover Process enrichment for nginx, Java, and the APM server
Filebeat Modules [Filebeat Nginx] Access and error logs, [Filebeat System] Syslog dashboard, and [Osquery Result] Compliance pack dashboards
Filebeat Raw events in Discover /good: MDC logging under json.name and the context view for one log message meta.* and host.* information
Filebeat /bad and /null: Stacktraces by filtering down on application:java and json.severity:ERROR Visualize json.stack_hash
Heartbeat Heartbeat HTTP monitoring dashboard Stop and start the frontend application while auto refreshing
Metricbeat nginx [Metricbeat Nginx] Overview dashboard
Metricbeat HTTP /health and /metrics endpoints Collected information in Discover
Metricbeat JMX Same data Visualize the heap usage: jolokia.metrics.memory.heap_usage.used divided by the max of jolokia.metrics.memory.heap_usage.max
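The heap usage calculation from this slide is a simple ratio. A minimal Python sketch, assuming an event that carries both Jolokia fields; the sample values are hypothetical and not taken from the demo.

```python
def heap_usage_ratio(used_bytes, max_bytes):
    """Share of the maximum heap that is currently in use (0.0 to 1.0)."""
    # JMX reports a max of -1 when no maximum is defined, so guard against it.
    if max_bytes <= 0:
        raise ValueError("max heap must be a positive number of bytes")
    return used_bytes / max_bytes


# Hypothetical sample event: 256 MiB used out of a 1 GiB heap
event = {
    "jolokia.metrics.memory.heap_usage.used": 268_435_456,
    "jolokia.metrics.memory.heap_usage.max": 1_073_741_824,
}

ratio = heap_usage_ratio(
    event["jolokia.metrics.memory.heap_usage.used"],
    event["jolokia.metrics.memory.heap_usage.max"],
)
print(f"{ratio:.0%}")
```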
Annotations Add changes from the events index
Some Security
Filebeat Modules [Filebeat Auditd] Audit Events, [Filebeat System] New users and groups, and [Filebeat System] Sudo commands dashboards
https://github.com/linux-audit "auditd is the userspace component to the Linux Auditing System. It's responsible for writing audit records to the disk. Viewing the logs is done with the ausearch or aureport utilities."
Auditd Monitors File and network access System calls Commands run by a user Security events
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/security_guide/chap-system_auditing
Understanding Logs https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/security_guide/sec-understanding_audit_log_files
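Audit log records are a record type followed by `key=value` pairs, with the timestamp and event ID packed into the `msg` field. A minimal Python sketch of how such a record can be taken apart; the sample line is illustrative and its values are made up, and `parse_audit_record` is a hypothetical helper, not part of auditd or Auditbeat.

```python
import re
import shlex

# Illustrative record in the audit.log format; the values are made up.
LINE = (
    'type=SYSCALL msg=audit(1364481363.243:24287): arch=c000003e syscall=2 '
    'success=no exit=-13 auid=1000 uid=1000 comm="cat" exe="/bin/cat" '
    'key="sshd_config"'
)


def parse_audit_record(line):
    """Split a record into its key=value fields plus timestamp and event ID."""
    # shlex.split keeps quoted values together and strips the quotes.
    fields = dict(token.partition("=")[::2] for token in shlex.split(line))
    match = re.match(r"audit\((\d+\.\d+):(\d+)\):", fields["msg"])
    if match is None:
        raise ValueError("unexpected msg field: " + fields["msg"])
    fields["timestamp"] = float(match.group(1))
    fields["event_id"] = int(match.group(2))
    return fields


record = parse_audit_record(LINE)
print(record["type"], record["event_id"], record["comm"], record["success"])
```

Auditbeat does this parsing (and the correlation of related records by event ID) for you; the sketch only shows what is inside one raw line.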
Auditbeat [Auditbeat Auditd] Overview dashboard
Fail SSH ssh elastic-user@xeraa.wtf with a bad password [Filebeat System] SSH login attempts dashboard
Success ssh elastic-user@xeraa.wtf with a good password Run service nginx restart and pick the elastic-admin user
Audit Event [Auditbeat Auditd] Executions dashboard filter elastic-user
Audit Event cat /etc/passwd Filter for tags is developers-passwdread in Discover
Power Abuse ssh elastic-admin@xeraa.wtf sudo cat /home/elastic-user/secret.txt Tag power-abuse in Discover
File Integrity Change something in /var/www/html/index.html [Auditbeat File Integrity] Overview dashboard
Monitor PHP
Heartbeat Add HTTP port 88
Packetbeat Add HTTP on port 88
Metricbeat php-fpm - module: php_fpm metricsets: ["pool"] period: 10s status_path: "/status" hosts: ["http://localhost"]
Filebeat Collect /var/www/html/silverstripe/ logs/*.json
More
Alerting Gold License and part of the Elastic Cloud
Machine Learning Anomaly Detection of Time Series Data Platinum License and part of the Elastic Cloud
Conclusion
System metrics & network Filebeat modules & Auditbeat Application logs
Uptime Application metrics Request tracing
Code https://github.com/xeraa/microservice-monitoring
Questions? Philipp Krenn PS: Sticker @xeraa
Elasticsearch is the most widely used full-text search engine, but it is also very common for logging, metrics, and analytics. This workshop shows you what the rage is all about:
And we will do all of that live, since it is so easy and much more interactive that way.
The following code examples from the presentation can be tried out live.
The following resources were mentioned during the presentation or are useful additional information.