Search and Beyond Philipp Krenn @xeraa

Developer

This is not a Training https://training.elastic.co

Agenda
09:00 - 10:40: Intro & Architecture & Search
10:40 - 11:00: Coffee break
11:00 - 12:20: More Search
12:20 - 13:05: Lunch
13:05 - 15:00: Monitoring
15:00 - 15:20: Coffee break
15:20 - 17:00: More Monitoring & Q&A

Elastic Stack Architecture

$ curl http://localhost:9200 { "name" : "elasticsearch-hot", "cluster_name" : "metrics-cluster", "cluster_uuid" : "06nHPLLgTrmZEpYli6JW5w", "version" : { "number" : "6.5.0", "build_flavor" : "default", "build_type" : "tar", "build_hash" : "c53b7d3", "build_date" : "2018-11-08T21:28:50.577384Z", "build_snapshot" : false, "lucene_version" : "7.5.0", "minimum_wire_compatibility_version" : "5.6.0", "minimum_index_compatibility_version" : "5.0.0" }, "tagline" : "You Know, for Search" }

https://db-engines.com/en/ranking

Only accept features that scale. - https://github.com/elastic/engineering/blob/master/development_constitution.md

Horizontal Scaling Shards Replication Writes & Reads

Exhibit A: A JSON Document { "name": "Elasticsearch", "author": "Shay Banon", "stable_version": "6.5.0", "preview_version": "7.0.0-alpha1" }

Exhibit B: A cURL Command $ curl -XPOST -i localhost:9200/databases/nosql -d ' { "name": "Elasticsearch", "author": "Shay Banon", "stable_version": "6.5.0", "preview_version": "7.0.0-alpha1" }'

Exhibit B: A cURL Command HTTP/1.1 201 Created Location: /databases/nosql/AVfD8XQaeuK3k1LGtT8- content-type: application/json; charset=UTF-8 content-length: 162 { "_index":"databases", "_type":"nosql", "_id":"AVfD8XQaeuK3k1LGtT8-", "_version":1, "result":"created", "_shards": { "total":2, "successful":1, "failed":0 }, "created":true }

Exhibit C: A Console Command POST /databases/nosql { "name": "Elasticsearch", "author": "Shay Banon", "stable_version": "6.4.2", "preview_version": "7.0.0-alpha2" }

Exhibit C: A Console Command { "_index": "databases", "_type": "nosql", "_id": "AVfD6ukyeuK3k1LGtSwT", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 1, "failed": 0 }, "created": true }

Single node $ cd elasticsearch-<version> $ ./bin/elasticsearch

Cluster No more broadcasting

Node Types

Master-eligible node Default by default

Data node Default by default

Client node Smart load balancer

Ingest node Parse and enrich

ML node Machine learning

Discovery

Zen discovery

Peer to peer network Unicast Who to contact Ping Discover each other

One master node Elected or joined to

Three dedicated master nodes for production

Healthcheck Master pings all nodes, they report back

discovery.zen.no_master_block write | all

discovery.zen.minimum_master_nodes

Split brain
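The usual guard against split brain is to set discovery.zen.minimum_master_nodes to a majority (quorum) of the master-eligible nodes. A minimal sketch of that arithmetic follows; the helper name `quorum` is made up for illustration and is not part of Elasticsearch.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest majority of the master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


# With three dedicated masters, two must agree, so one of them may fail.
# With two masters the quorum is also two, so losing either one blocks the cluster.
for n in (1, 2, 3, 5):
    print(n, "master-eligible ->", quorum(n))
```

This is one way to read the "three dedicated master nodes" recommendation: three is the smallest count that tolerates the loss of one node while still reaching a majority.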

GET /_cluster/health green yellow red

Alternative discovery Azure, EC2, GCE

Indexing a Document

Document Unique combination: _index _type _id PS: Types will be removed

POST /databases/nosql vs PUT /databases/nosql/elasticsearch

Autogenerated ID 20 characters, URL-safe, Base64-encoded, GUID strings AVfD6ukyeuK3k1LGtSwT

Consistent hashing Before 2.0: djb2 Ignoring _routing

unsigned long hash(unsigned char *str) { unsigned long hash = 5381; int c; while ((c = *str++)) hash = ((hash << 5) + hash) + c; /* hash * 33 + c */ return hash; }

Consistent hashing Current default: murmur3 https://github.com/elastic/elasticsearch/blob/5.4/core/src/main/java/org/elasticsearch/common/hash/MurmurHash3.java

Consistent hashing Better distribution 100,000 incremental IDs https://github.com/elastic/elasticsearch/pull/7954

Consistent hashing 3 shards murmur3 [33185, 33347, 33468] djb2 [30100, 30000, 39900]

Consistent hashing 5 shards murmur3 [19933, 19964, 19940, 20030, 20133] djb2 [20000, 20000, 20000, 20000, 20000]

Consistent hashing 33 shards murmur3 [2999, 3096, 2930, 2986, 3070, 3093, 3023, 3052, 3112, 2940, 3036, 2985, 3031, 3048, 3127, 2961, 2901, 3105, 3041, 3130, 3013, 3035, 3031, 3019, 3008, 3022, 3111, 3086, 3016, 2996, 3075, 2945, 2977] djb2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 900, 900, 900, 900, 1000, 1000, 10000, 10000, 10000, 10000, 9100, 9100, 9100, 9100, 9000, 9000, 0, 0, 0, 0, 0, 0]

Shard decision shard = hash(doc_id) % (num_of_primary_shards)
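The shard decision can be tried out with a few lines of Python. This is only a sketch: it uses the djb2 hash (the pre-2.0 one) rather than murmur3, wraps the hash to 32 bits as an assumption, and the function names `shard_for` and `distribution` are hypothetical. It is not expected to reproduce the exact per-shard counts from the slides.

```python
def djb2(key: str) -> int:
    """djb2 string hash (hash * 33 + c), wrapped to 32 bits (an assumption)."""
    h = 5381
    for byte in key.encode("utf-8"):
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def shard_for(doc_id: str, num_primary_shards: int) -> int:
    """shard = hash(doc_id) % num_of_primary_shards"""
    return djb2(doc_id) % num_primary_shards


def distribution(num_docs: int, num_primary_shards: int) -> list:
    """Count how many incremental IDs (0, 1, 2, ...) land on each shard."""
    counts = [0] * num_primary_shards
    for i in range(num_docs):
        counts[shard_for(str(i), num_primary_shards)] += 1
    return counts


print(distribution(100_000, 3))
print(distribution(100_000, 33))
```

Because the number of primary shards is the modulus, changing it would send existing IDs to different shards.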

Write Coordinating Node, Hash, Primary, Replica(s)

Get & Aggregate Coordinating Node, Hash, Shard

Optimistic concurrency control _version

Flaky nodes index.unassigned.node_left.delayed_timeout: 1m

Sequence numbers Quick recovery in 6.0

Lucene Segment

Lucene at work

Segments are immutable

index.refresh_interval 7.0: index.search.idle.after

Tombstone file Marks deleted documents

Merge Combine (and clean up) segments

Visualize merges http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Searching

Search Coordinating Node, Query then Fetch
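Query then Fetch can be pictured as a scatter-gather in two rounds. The sketch below is a toy model with made-up in-memory "shards" and scores; none of the names are Elasticsearch APIs.

```python
import heapq

# Hypothetical in-memory shards: doc ID -> (score, source).
SHARDS = [
    {"1": (0.9, "doc one"), "4": (0.2, "doc four")},
    {"2": (0.7, "doc two")},
    {"3": (0.8, "doc three"), "5": (0.1, "doc five")},
]


def query_phase(shard: dict, size: int) -> list:
    """Each shard returns only (score, doc_id) pairs for its local top hits."""
    pairs = ((score, doc_id) for doc_id, (score, _source) in shard.items())
    return heapq.nlargest(size, pairs)


def search(size: int) -> list:
    # Query: the coordinating node collects the local top hits of every shard.
    candidates = []
    for shard_no, shard in enumerate(SHARDS):
        for score, doc_id in query_phase(shard, size):
            candidates.append((score, doc_id, shard_no))
    top = heapq.nlargest(size, candidates)
    # Fetch: load the full documents only for the global winners.
    return [(doc_id, score, SHARDS[shard_no][doc_id][1])
            for score, doc_id, shard_no in top]


print(search(size=2))
```

The point of the two rounds is that only IDs and scores cross the network for every shard, while full documents are fetched just for the final hits.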

Benchmarks Fair Reproducible Close to Production

Full-Text Search

Who uses a Database?

Who uses Search?

Store

Apache Lucene Elasticsearch

Example These are <em>not</em> the droids you are looking for.

html_strip Char Filter These are not the droids you are looking for.

standard Tokenizer These are not the droids you are looking for

lowercase Token Filter these are not the droids you are looking for

stop Token Filter droids you looking

snowball Token Filter droid you look
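The chain of char filter, tokenizer, and token filters can be imitated in plain Python. This is a rough sketch, not how Lucene implements it: the tokenizer is a simple regular expression, and `toy_stem` is a made-up stand-in that only strips two suffixes instead of running the real Snowball algorithm.

```python
import re

# English stop word set used by Lucene's English analyzer.
STOP_WORDS = set(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def html_strip(text: str) -> str:
    """Char filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text: str) -> list:
    """Tokenizer: split on non-word characters (a rough stand-in)."""
    return re.findall(r"\w+", text)


def toy_stem(token: str) -> str:
    """Token filter: toy suffix stripping, NOT the real Snowball stemmer."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text: str) -> list:
    tokens = tokenize(html_strip(text))                   # char filter + tokenizer
    tokens = [t.lower() for t in tokens]                  # lowercase filter
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop filter
    return [toy_stem(t) for t in tokens]                  # stemming filter


print(analyze("These are <em>not</em> the droids you are looking for."))
# -> ['droid', 'you', 'look']
```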

Analyze

GET /_analyze { "analyzer": "english", "text": "These are not the droids you are looking for." }

{ "tokens": [ { "token": "droid", "start_offset": 18, "end_offset": 24, "type": "<ALPHANUM>", "position": 4 }, { "token": "you", "start_offset": 25, "end_offset": 28, "type": "<ALPHANUM>", "position": 5 }, ... ] }

GET /_analyze { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "stop", "snowball" ], "text": "These are <em>not</em> the droids you are looking for." }

{ "tokens": [ { "token": "droid", "start_offset": 27, "end_offset": 33, "type": "<ALPHANUM>", "position": 4 }, { "token": "you", "start_offset": 34, "end_offset": 37, "type": "<ALPHANUM>", "position": 5 }, ... ] }

Stop Words a an and are as at be but by for if in into is it no not of on or such that the their then there these they this to was will with https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L44-L50

Always Use Stop Words?

To be, or not to be.

Languages Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Turkish, Thai

Language Rules English: Philipp's → philipp French: l'église → eglis German: äußerst → ausserst

More Language Plugins Core: ICU (Asian languages), Kuromoji (advanced Japanese), Phonetic, SmartCN, Stempel (Polish), Ukrainian Community: Hebrew, Vietnamese, Network Address Analysis, String2Integer,...

German GET /_analyze { "analyzer": "german", "text": "Das sind nicht die Droiden, nach denen du suchst." }

{ "tokens": [ { "token": "droid", "start_offset": 19, "end_offset": 26, "type": "<ALPHANUM>", "position": 4 }, { "token": "den", "start_offset": 33, "end_offset": 38, "type": "<ALPHANUM>", "position": 6 }, { "token": "such", "start_offset": 42, "end_offset": 48, "type": "<ALPHANUM>", "position": 8 } ] }

German with the English Analyzer da sind nicht die droiden nach denen du suchst

German Stop Words https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/german_stop.txt

Detect Languages https://github.com/spinscale/elasticsearch-ingest-langdetect

PUT _ingest/pipeline/langdetect-pipeline { "description": "A pipeline to detect languages", "processors": [ { "langdetect" : { "field" : "quote", "target_field" : "language" } } ] }

POST _ingest/pipeline/langdetect-pipeline/_simulate { "docs": [ { "_source": { "quote": "Das sind nicht die Droiden, nach denen du suchst." } } ] }

{ "docs": [ { "doc": { "_index": "_index", "_type": "_type", "_id": "_id", "_source": { "language": "de", "quote": "Das sind nicht die Droiden, nach denen du suchst." }, "_ingest": { "timestamp": "2018-10-26T00:06:42.320613Z" } } } ] }

Phonetic GET /_analyze { "tokenizer": "standard", "filter": [ { "type": "phonetic", "encoder": "beider_morse", "languageset": "any" } ], "text": "These are not the droids you are looking for." }

Phonetic ... drDts drits drots loknk... iou ari ori

Another Example Obi-Wan never told you what happened to your father.

Another Example obi wan never told you what happen your father

Another Example <b>No</b>. I am your father.

Another Example i am your father

Inverted Index
term     ID 1   ID 2   ID 3
am       0      0      1[2]
droid    1[4]   0      0
father   0      1[9]   1[4]
happen   0      1[6]   0
i        0      0      1[1]
look     1[7]   0      0
never    0      1[2]   0
obi      0      1[0]   0
told     0      1[3]   0
wan      0      1[1]   0
what     0      1[5]   0
you      1[5]   1[4]   0
your     0      1[8]   1[3]
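The inverted index is essentially a map from each term to the documents (and positions) where it occurs. A minimal sketch that rebuilds it from the already analyzed tokens of the three quotes; the variable and function names are made up for illustration.

```python
from collections import defaultdict

# Analyzed tokens with their positions for the three example quotes.
# Gaps in the positions are where stop words were removed.
ANALYZED = {
    1: [("droid", 4), ("you", 5), ("look", 7)],
    2: [("obi", 0), ("wan", 1), ("never", 2), ("told", 3), ("you", 4),
        ("what", 5), ("happen", 6), ("your", 8), ("father", 9)],
    3: [("i", 1), ("am", 2), ("your", 3), ("father", 4)],
}


def build_inverted_index(analyzed: dict) -> dict:
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, tokens in analyzed.items():
        for term, position in tokens:
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


index = build_inverted_index(ANALYZED)
print(sorted(index))      # the sorted term dictionary
print(index["father"])    # {2: [9], 3: [4]}
print(index["you"])       # {1: [5], 2: [4]}
```

A term lookup is then a single dictionary access, and the stored positions are what phrase queries rely on.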

To / The Index

PUT /starwars { "settings": { "number_of_shards": 1, "analysis": { "filter": { "my_synonym_filter": { "type": "synonym", "synonyms": [ "father,dad", "droid => droid,machine" ] } },

"analyzer": { "my_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "stop", "snowball", "my_synonym_filter" ] } } } },

"mappings": { "_doc": { "properties": { "quote": { "type": "text", "analyzer": "my_analyzer" } } } } }

Synonyms Index synonym or query time synonym_graph
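The two synonym rules used in the index settings behave differently: "father,dad" makes both terms equivalent, while "droid => droid,machine" only rewrites the left-hand side. A small sketch of that expansion logic, with hypothetical helper names and no claim to match Lucene's implementation:

```python
def parse_synonyms(rules: list) -> dict:
    """Turn Solr-style synonym rules into a term -> replacement-terms mapping."""
    mapping = {}
    for rule in rules:
        if "=>" in rule:
            # Explicit mapping: the left side is replaced by the right side.
            left, right = rule.split("=>")
            targets = [t.strip() for t in right.split(",")]
            for source in left.split(","):
                mapping[source.strip()] = targets
        else:
            # Equivalent terms: each one expands to the whole group.
            group = [t.strip() for t in rule.split(",")]
            for term in group:
                mapping[term] = group
    return mapping


SYNONYMS = parse_synonyms(["father,dad", "droid => droid,machine"])


def expand(token: str) -> list:
    """Return the token's synonyms, or the token itself if no rule applies."""
    return SYNONYMS.get(token, [token])


print(expand("dad"))      # ['father', 'dad']
print(expand("droid"))    # ['droid', 'machine']
print(expand("machine"))  # ['machine']
```

Since "droid" is expanded when a quote is indexed, a later search for "machine" can still find the droid quote even though "machine" itself is not expanded.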

GET /starwars/_mapping GET /starwars/_settings

PUT /starwars/_doc/1 { "quote": "These are <em>not</em> the droids you are looking for." } PUT /starwars/_doc/2 { "quote": "Obi-Wan never told you what happened to your father." } PUT /starwars/_doc/3 { "quote": "<b>No</b>. I am your father." }

GET /starwars/_doc/1 GET /starwars/_doc/1/_source

Multi Lingual Index PUT /starwars_en/_doc/1 Type Field { "quote_en": "...", "quote_de": "..." }

PS: Single Type per Index

Search

POST /starwars/_search { "query": { "match_all": { } } }

GET vs POST

{ "took": 1, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3, "max_score": 1, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 1, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, ...

POST /starwars/_search { "query": { "match": { "quote": "droid" } } }

{ "took": 2, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.39556286, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.39556286, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ] } }

POST /starwars/_search { "query": { "match": { "quote": "dad" } } }

... "hits": { "total": 2, "max_score": 0.41913947, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.41913947, "_source": { "quote": "<b>No</b>. I am your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.39291072, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] } }

POST /starwars/_doc/0/_explain { "query": { "match": { "quote": "dad" } } }

{ "_index": "starwars", "_type": "_doc", "_id": "0", "matched": false }

POST /starwars/_doc/1/_explain { "query": { "match": { "quote": "dad" } } }

{ "_index": "starwars", "_type": "_doc", "_id": "1", "matched": false, "explanation": { "value": 0, "description": "no matching term", "details": [] } }

POST /starwars/_doc/2/_explain { "query": { "match": { "quote": "dad" } } }

{ "_index": "starwars", "_type": "_doc", "_id": "2", "matched": true, "explanation": { ...

POST /starwars/_search { "query": { "match": { "quote": "machine" } } }

{ "took": 2, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 1, "max_score": 1.2499592, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.2499592, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ] } }

POST /starwars/_search { "query": { "match_phrase": { "quote": "I am your father" } } }

{ "took": 3, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1.5665855, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.5665855, "_source": { "quote": "<b>No</b>. I am your father." } } ] } }

POST /starwars/_search { "query": { "match_phrase": { "quote": { "query": "I am father", "slop": 1 } } } }

{ "took": 16, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.8327639, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.8327639, "_source": { "quote": "<b>No</b>. I am your father." } } ] } }

POST /starwars/_search { "query": { "match_phrase": { "quote": { "query": "I am not your father", "slop": 1 } } } }

{ "took": 5, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1.0409548, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.0409548, "_source": { "quote": "<b>No</b>. I am your father." } } ] } }

POST /starwars/_search { "query": { "match": { "quote": { "query": "van", "fuzziness": "AUTO" } } } }

{ "took": 14, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.18155496, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.18155496, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] } }

POST /starwars/_search { "query": { "match": { "quote": { "query": "ovi-van", "fuzziness": 1 } } } }

{ "took": 109, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.3798467, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 0.3798467, "_source": { "quote": "Obi-Wan never told you what happened to your father." } } ] } }

FuzzyQuery History http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html Before: Brute force Now: Levenshtein Automaton

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata

SELECT * FROM starwars WHERE quote LIKE '_an' OR quote LIKE 'V_n' OR quote LIKE 'Va_'
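The brute-force variant of fuzzy matching can be sketched as computing the edit distance between the query and every term in the dictionary. The code below is plain Levenshtein distance without transpositions, and the term list is simply the analyzed terms of the three example quotes; a Levenshtein automaton avoids exactly this full scan.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def fuzzy_terms(query: str, terms: list, max_edits: int) -> list:
    """Brute force: compare the query against every term in the dictionary."""
    return [t for t in terms if levenshtein(query, t) <= max_edits]


TERMS = ["am", "droid", "father", "happen", "i", "look", "never",
         "obi", "told", "wan", "what", "you", "your"]

print(fuzzy_terms("van", TERMS, max_edits=1))  # ['wan']
print(levenshtein("ovi", "obi"))               # 1
```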

Score

Term Frequency / Inverse Document Frequency (TF/IDF) Search one term

BM25 Default in Elasticsearch 5.0 https://speakerdeck.com/elastic/improved-text-scoring-with-bm25

Term Frequency

Inverse Document Frequency

Field-Length Norm

POST /starwars/_search?explain=true { "query": { "match": { "quote": "father" } } }

... "_explanation": { "value": 0.41913947, "description": "weight(Synonym(quote:dad quote:father) in 0) [PerFieldSimilarity], result of:", "details": [ { "value": 0.41913947, "description": "score(doc=0,freq=2.0 = termFreq=2.0\n), product of:", "details": [ { "value": 0.2876821, "description": "idf(docFreq=1, docCount=1)", "details": [] }, { "value": 1.4569536, "description": "tfNorm, computed from:", "details": [ { "value": 2, "description": "termFreq=2.0", "details": [] }, ...
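The numbers in the explain output can be recomputed by hand. The sketch below uses the BM25 formulas as implemented in Lucene 7.x (idf times tfNorm, with the defaults k1 = 1.2 and b = 0.75). docFreq, docCount, and termFreq are taken from the output; the field length of 4 and average field length of 5 are assumptions, since that part of the output is elided, chosen because they are consistent with the tfNorm shown.

```python
import math


def idf(doc_freq: int, doc_count: int) -> float:
    """Inverse document frequency: rarer terms weigh more."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_norm(freq: float, field_len: float, avg_field_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """Saturating term frequency, normalized by relative field length."""
    length_norm = 1 - b + b * field_len / avg_field_len
    return freq * (k1 + 1) / (freq + k1 * length_norm)


def bm25(freq, doc_freq, doc_count, field_len, avg_field_len, k1=1.2, b=0.75):
    return idf(doc_freq, doc_count) * tf_norm(freq, field_len, avg_field_len, k1, b)


# docFreq=1, docCount=1, termFreq=2 come from the explain output;
# field_len=4 and avg_field_len=5 are assumed values.
print(idf(1, 1))
print(tf_norm(2, field_len=4, avg_field_len=5))
print(bm25(2, 1, 1, field_len=4, avg_field_len=5))
```

Lucene computes with 32-bit floats, so the last digits of the results may differ slightly from the explain output.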

Score 0.41913947: i am your father 0.39291072: obi wan never told you what happen your father

Vector Space Model Search multiple terms

Search your father

Coordination Factor Reward multiple terms

Search for 3 terms 1 term: 2 terms: 3 terms:

Practical Scoring Function Putting it all together

score(q,d) = queryNorm(q) · coord(q,d) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

Function Score Script, weight, random, field value, decay (geo or date)

POST /starwars/_search { "query": { "function_score": { "query": { "match": { "quote": "father" } }, "random_score": {} } } }

Compare Scores "100% perfect" vs a "50%" match

Don't do this. Seriously. Stop trying to think about your problem this way, it's not going to end well. - https://wiki.apache.org/lucene-java/ScoresAsPercentages

GET /starwars/_analyze { "analyzer" : "my_analyzer", "text": "These are my father's machines." }

{ "tokens": [ { "token": "my", "start_offset": 10, "end_offset": 12, "type": "<ALPHANUM>", "position": 2 }, { "token": "father", "start_offset": 13, "end_offset": 21, "type": "<ALPHANUM>", "position": 3 }, { "token": "dad", "start_offset": 13, "end_offset": 21, "type": "SYNONYM", "position": 3 }, { "token": "machin", "start_offset": 22, "end_offset": 30, "type": "<ALPHANUM>", "position": 4 } ] }

PUT /starwars/_doc/4 { "quote": "These are my father's machines." }

POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }

"hits": { "total": 4, "max_score": 2.92523, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 2.92523, "_source": { "quote": "These are my father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.8617505, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...

2.92523 == 100%

DELETE /starwars/_doc/4 POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }

"hits": { "total": 3, "max_score": 1.2499592, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.2499592, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...

1.2499592 == 43% or 100%?

PUT /starwars/_doc/4 { "quote": "These droids are my father's father's machines." } POST /starwars/_search { "query": { "match": { "quote": "my father machine" } } }

"hits": { "total": 4, "max_score": 3.0068164, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 3.0068164, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 0.89701396, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, ...

3.0068164 == 103%?

PS: Shards Default? Effect on IDF?

Distributed Frequency Search GET starwars/_search?search_type=dfs_query_then_fetch { ... }

Don’t use dfs_query_then_fetch in production. It really isn’t required. - https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-is-broken.html

More Search

Highlighting

POST /starwars/_search { "query": { "match": { "quote": "father" } }, "highlight": { "type": "unified", "pre_tags": [ "<tag>" ], "post_tags": [ "</tag>" ], "fields": { "quote": {} } } }

... "hits": { "total": 3, "max_score": 0.631961, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 0.631961, "_source": { "quote": "These droids are my father's father's machines." }, "highlight": { "quote": [ "These droids are my <tag>father's</tag> <tag>father's</tag> machines." ] } }, ...

Boolean Queries must must_not should filter

POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father" } }, "should": [ { "match": { "quote": "your" } }, { "match": { "quote": "obi" } } ] } } }

... "hits": { "total": 3, "max_score": 2.117857, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 2.117857, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.3856719, "_source": { "quote": "<b>No</b>. I am your father." } }, ...

POST /starwars/_search { "query": { "bool": { "filter": { "match": { "quote": "father" } }, "should": [ { "match": { "quote": "your" } }, { "match": { "quote": "obi" } } ] } } }

... "hits": { "total": 3, "max_score": 1.6694657, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 1.6694657, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 0.8317767, "_source": { "quote": "<b>No</b>. I am your father." } },

Named Queries & minimum_should_match

POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father" } }, "should": [ { "match": { "quote": { "query": "your", "_name": "quote-your" } } }, { "match": { "quote": { "query": "obi", "_name": "quote-obi" } } }, { "match": { "quote": { "query": "droid", "_name": "quote-droid" } } } ], "minimum_should_match": 2 } } }

... "hits": { "total": 1, "max_score": 2.117857, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 2.117857, "_source": { "quote": "Obi-Wan never told you what happened to your father." }, "matched_queries": [ "quote-obi", "quote-your" ] } ] } }

Boosting >1 increase, <1 decrease, <0 punish (<0 removed in 7.0)

POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father" } }, "should": [ { "match": { "quote": "your" } }, { "match": { "quote": { "query": "obi", "boost": 3 } } } ] } } }

... "hits": { "total": 3, "max_score": 4.2368493, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "2", "_score": 4.2368493, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.3856719, "_source": { "quote": "<b>No</b>. I am your father." } }, ...

Search for father but prefer father father

POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father father" } } } } }

... "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 1.263922, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.1077905, "_source": { "quote": "<b>No</b>. I am your father." } },

POST /starwars/_search { "query": { "bool": { "must": { "match": { "quote": "father" } }, "should": { "match_phrase": { "quote": "father father" } } } } }

... "hits": { "total": 3, "max_score": 9.146545, "hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 9.146545, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "3", "_score": 1.0454913, "_source": { "quote": "<b>No</b>. I am your father." } }, ...

Suggestion Suggest a similar text _search end point _suggest deprecated since 5.0

POST /starwars/_search { "query": { "match": { "quote": "drui" } }, "suggest": { "my_suggestion" : { "text" : "drui", "term" : { "field" : "quote" } } } }

... "hits": { "total": 0, "max_score": null, "hits": [] }, "suggest": { "my_suggestion": [ { "text": "drui", "offset": 0, "length": 4, "options": [ { "text": "droid", "score": 0.5, "freq": 1 } ] } ] } }

Multiple Suggesters term phrase completion context

NGram Partial matches Edge Gram

GET /_analyze { "char_filter": [ "html_strip" ], "tokenizer": { "type": "ngram", "min_gram": "3", "max_gram": "3", "token_chars": [ "letter" ] }, "filter": [ "lowercase" ], "text": "These are <em>not</em> the droids you are looking for." }

{ "tokens": [ { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 }, { "token": "hes", "start_offset": 1, "end_offset": 4, "type": "word", "position": 1 }, { "token": "ese", "start_offset": 2, "end_offset": 5, "type": "word", "position": 2 }, { "token": "are", "start_offset": 6, "end_offset": 9, "type": "word", "position": 3 }, ...

GET /_analyze { "char_filter": [ "html_strip" ], "tokenizer": { "type": "edge_ngram", "min_gram": "1", "max_gram": "3", "token_chars": [ "letter" ] }, "filter": [ "lowercase" ], "text": "These are <em>not</em> the droids you are looking for." }

{ "tokens": [ { "token": "t", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 }, { "token": "th", "start_offset": 0, "end_offset": 2, "type": "word", "position": 1 }, { "token": "the", "start_offset": 0, "end_offset": 3, "type": "word", "position": 2 }, { "token": "a", "start_offset": 6, "end_offset": 7, "type": "word", "position": 3 }, { "token": "ar", "start_offset": 6, "end_offset": 8, "type": "word", "position": 4 }, ...
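The difference between the two tokenizers is easy to see on a single word. A minimal sketch in Python; the function names are made up and the real tokenizers additionally track offsets and positions.

```python
def ngrams(word: str, min_gram: int, max_gram: int) -> list:
    """All substrings of length min_gram..max_gram, ordered by start offset."""
    grams = []
    for start in range(len(word)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(word):
                grams.append(word[start:start + size])
    return grams


def edge_ngrams(word: str, min_gram: int, max_gram: int) -> list:
    """Only the substrings anchored at the start of the word."""
    longest = min(max_gram, len(word))
    return [word[:size] for size in range(min_gram, longest + 1)]


print(ngrams("these", 3, 3))         # ['the', 'hes', 'ese']
print(edge_ngrams("these", 1, 3))    # ['t', 'th', 'the']
print(edge_ngrams("father", 3, 10))  # ['fat', 'fath', 'fathe', 'father']
```

Edge n-grams only help with prefix matches (search as you type), while full n-grams also match in the middle of a word at the cost of many more terms.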

Combining Analyzers Reindex Store multiple times Tune BM25 Combine scores

BM25 Revisited

https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables

b: field length amplification (default 0.75)
k1: term frequency saturation (default 1.2)

PUT /starwars_v42 { "settings": { "number_of_shards": 1, "index": { "similarity": { "default": { "type": "BM25", "b": 0, "k1": 0 } } },

"analysis": { "filter": { "my_synonym_filter": { "type": "synonym", "synonyms": [ "father,dad", "droid => droid,machine" ] }, "my_ngram_filter": { "type": "ngram", "min_gram": "3", "max_gram": "3", "token_chars": [ "letter" ] } },

"analyzer": { "my_lowercase_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "whitespace", "filter": [ "lowercase" ] }, "my_full_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "stop", "snowball", "my_synonym_filter" ] },

"my_ngram_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "whitespace", "filter": [ "lowercase", "stop", "my_ngram_filter" ] } } } },

"mappings": { "_doc": { "properties": { "quote": { "type": "text", "fields": { "lowercase": { "type": "text", "analyzer": "my_lowercase_analyzer" }, "full": { "type": "text", "analyzer": "my_full_analyzer" }, "ngram": { "type": "text", "analyzer": "my_ngram_analyzer" } } } } } } }

POST /_reindex { "source": { "index": "starwars" }, "dest": { "index": "starwars_v42" } }

Aliases Atomic remove and add Point to multiple indices (read-only)

POST /_aliases { "actions": [ { "add": { "index": "starwars_v42", "alias": "starwars_extended" } } ] }

POST /starwars/_search { "query": { "match": { "quote": "droid" } } }

"hits": [ { "_index": "starwars", "_type": "_doc", "_id": "4", "_score": 1.1533037, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars", "_type": "_doc", "_id": "1", "_score": 1.1295731, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ]

POST /starwars_extended/_search { "query": { "match": { "quote.full": "droid" } } }

"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 0.6931472, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "4", "_score": 0.6931472, "_source": { "quote": "These droids are my father's father's machines." } } ]
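Both hits get the same score because starwars_v42 was created with b = 0 and k1 = 0: term frequency and field length no longer matter and only the IDF is left. A sketch of that effect, assuming the index holds 4 quotes of which 2 contain the term; the length ratios are made-up values to show that they have no influence.

```python
import math


def bm25(freq, doc_freq, doc_count, length_ratio, k1, b):
    """BM25 for one term; length_ratio is field length / average field length."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf_norm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * length_ratio))
    return idf * tf_norm


# Assumed: 4 documents, 2 of them contain the term.
short_doc = bm25(freq=1, doc_freq=2, doc_count=4, length_ratio=0.8, k1=0, b=0)
long_doc = bm25(freq=2, doc_freq=2, doc_count=4, length_ratio=1.3, k1=0, b=0)
print(short_doc, long_doc)  # both equal ln(2)

# With the defaults the two documents score differently again.
print(bm25(1, 2, 4, 0.8, k1=1.2, b=0.75), bm25(2, 2, 4, 1.3, k1=1.2, b=0.75))
```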

There are no "best" b and k1 values

POST /starwars_extended/_search?explain=true { "query": { "multi_match": { "query": "obiwan", "fields": [ "quote", "quote.lowercase", "quote.full", "quote.ngram" ], "type": "most_fields" } } }

... "hits": { "total": 1, "max_score": 0.4912064, "hits": [ { "_shard": "[starwars_v42][2]", "_node": "BCDwzJ4WSw2dyoGLTzwlqw", "_index": "starwars_v42", "_type": "_doc", "_id": "2", "_score": 0.4912064, "_source": { "quote": "Obi-Wan never told you what happened to your father." }, ...

Whitespace Tokenizer "weight( Synonym(quote.ngram:biw quote.ngram:iwa quote.ngram:obi quote.ngram:wan) in 0) [PerFieldSimilarity], result of:"

POST /starwars_extended/_search { "query": { "multi_match": { "query": "you", "fields": [ "quote", "quote.lowercase^5", "quote.full", "quote.ngram" ], "type": "best_fields" } } }

"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 3.465736, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "2", "_score": 3.465736, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "3", "_score": 0.35667494, "_source": { "quote": "<b>No</b>. I am your father." } } ]

Multi Match Type best_fields Score of the best field (default) cross_fields All terms in at least one field most_fields Score sum of all fields phrase

Different Analyzers for Indexing and Searching Per query In the mapping

POST /starwars_extended/_search { "query": { "match": { "quote.ngram": { "query": "the", "analyzer": "standard" } } } }

... "hits": [ { "_index": "starwars_extended", "_type": "_doc", "_id": "2", "_score": 0.38254172, "_source": { "quote": "Obi-Wan never told you what happened to your father." } }, { "_index": "starwars_extended", "_type": "_doc", "_id": "3", "_score": 0.36165747, "_source": { "quote": "<b>No</b>. I am your father." } } ] ...

Edge Gram vs Trigram Test a setting before adding a field

Shingle Token Filter Shingles (token ngrams) from a token stream
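Shingles are n-grams on the token level instead of the character level. A minimal sketch with a hypothetical `shingles` helper; it only returns the shingles themselves, whereas the real shingle token filter also emits the original single tokens by default.

```python
def shingles(tokens: list, min_size: int = 2, max_size: int = 2) -> list:
    """Join every run of min_size..max_size neighboring tokens into one term."""
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


print(shingles(["these", "droids", "are"]))
# -> ['these droids', 'droids are']
print(shingles(["these", "are", "not", "the", "droids"]))
# -> ['these are', 'are not', 'not the', 'the droids']
```

Only the first quote produces "these droids" and "droids are", which is how word order and context can feed into the score.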

POST /starwars_extended/_close PUT /starwars_extended/_settings { "index": { "similarity": { "default": { "type": "BM25", "b": null, "k1": null } } },

"analysis": { "filter": { "my_edgegram_filter": { "type": "edge_ngram", "min_gram": 3, "max_gram": 10 }, "my_shingle_filter": { "type": "shingle", "min_shingle_size": 2, "max_shingle_size": 2 } },

"analyzer": { "my_edgegram_analyzer": { "char_filter": [ "html_strip" ], "tokenizer": "standard", "filter": [ "lowercase", "my_edgegram_filter" ] },

"my_shingle_analyzer": { "type": "custom", "tokenizer": "standard", "filter": [ "lowercase", "my_shingle_filter" ] } } } } POST /starwars_extended/_open

GET starwars_extended/_analyze { "text": "Father", "analyzer": "my_edgegram_analyzer" }

{ "tokens": [ { "token": "fat", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }, { "token": "fath", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }, { "token": "fathe", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 }, { "token": "father", "start_offset": 0, "end_offset": 6, "type": "<ALPHANUM>", "position": 0 } ] }

PUT /starwars_extended/_doc/_mapping { "properties": { "quote": { "type": "text", "fields": { "edgegram": { "type": "text", "analyzer": "my_edgegram_analyzer", "search_analyzer": "standard" }, "shingle": { "type": "text", "analyzer": "my_shingle_analyzer" } } } } }

PUT /starwars_extended/_doc/5 { "quote": "I find your lack of faith disturbing." } PUT /starwars_extended/_doc/6 { "quote": "That... is your failure." }

GET /starwars_extended/_doc/5/_termvectors { "fields": [ "quote.edgegram" ], "offsets": true, "payloads": true, "positions": true, "term_statistics": true, "field_statistics": true }

{ "_index": "starwars_v42", "_type": "_doc", "_id": "5", "_version": 1, "found": true, "took": 3, "term_vectors": { "quote.edgegram": { "field_statistics": { "sum_doc_freq": 26, "doc_count": 2, "sum_ttf": 26 }, "terms": { "dis": { "doc_freq": 1, "ttf": 1, "term_freq": 1, "tokens": [ { "position": 6, "start_offset": 26, "end_offset": 36 } ] }, "dist": { "doc_freq": 1, "ttf": 1, ...

POST /starwars_extended/_search { "query": { "match": { "quote": "fail" } } }

POST /starwars_extended/_search { "query": { "match": { "quote.lowercase": "fail" } } }

POST /starwars_extended/_search { "query": { "match": { "quote.full": "fail" } } }

POST /starwars_extended/_search { "query": { "match": { "quote.ngram": "fail" } } }

"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "6", "_score": 1.8400999, "_source": { "quote": "That... is your failure." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "5", "_score": 1.442779, "_source": { "quote": "I find your lack of faith disturbing." } } ]

POST /starwars_extended/_search { "query": { "match": { "quote.edgegram": "fail" } } }

"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "6", "_score": 1.0114291, "_source": { "quote": "That... is your failure." } } ]

Updating Missing Fields Expensive

POST /starwars_extended/_update_by_query { "query": { "bool": { "must_not": { "exists": { "field": "quote.edgegram" } } } } }

Shingles: Context Should Matter

POST /starwars_extended/_search { "query": { "bool": { "must": { "match": { "quote.lowercase": "these droids are" } } } } }

"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 2.1837702, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "4", "_score": 2.137744, "_source": { "quote": "These droids are my father's father's machines." } } ]

POST /starwars_extended/_search { "query": { "bool": { "must": { "match": { "quote.shingle": "these droids are" } } } } }

"hits": [ { "_index": "starwars_v42", "_type": "_doc", "_id": "4", "_score": 3.1811738, "_source": { "quote": "These droids are my father's father's machines." } }, { "_index": "starwars_v42", "_type": "_doc", "_id": "1", "_score": 2.6568544, "_source": { "quote": "These are <em>not</em> the droids you are looking for." } } ]
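Why does the shingle field rank quote 4 first? A minimal Python sketch of two-word shingles, assuming a simple lowercase tokenizer and a shingle size of 2. The helper names are hypothetical and the markup in quote 1 is left out; the real index settings may differ.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def shingles(tokens, size=2):
    """Word n-grams: every run of `size` neighboring tokens joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


query = shingles(tokenize("these droids are"))
quotes = {
    "1": "These are not the droids you are looking for.",
    "4": "These droids are my father's father's machines.",
}
for doc_id, quote in quotes.items():
    shared = set(query) & set(shingles(tokenize(quote)))
    print(doc_id, sorted(shared))
# 1 []
# 4 ['droids are', 'these droids']
```

Quote 1 contains all three words but none of the word pairs. It presumably still appears in the result above because single words are indexed as well; the shingle token filter emits unigrams by default.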

Decompounding Commonly in German, Scandinavian languages, Finnish, Korean

PUT /decompound_en { "settings": { "number_of_shards": 1, "analysis": { "filter": { "british_decompounder": { "type": "hyphenation_decompounder", "hyphenation_patterns_path": "hyph/en_GB.xml", "word_list": [ "death", "star" ] } }, "analyzer": { "british_decompound": { "type": "custom", "tokenizer": "standard", "filter": [ "lowercase", "british_decompounder" ] } } } } }

GET /decompound_en/_analyze { "analyzer" : "british_decompound", "text" : "deathstar" }

{ "tokens": [ { "token": "deathstar", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 }, { "token": "death", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 }, { "token": "star", "start_offset": 0, "end_offset": 9, "type": "<ALPHANUM>", "position": 0 } ] }
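How do the extra tokens come about? A minimal Python sketch of word-list decompounding. Note the swap: this is a brute-force substring lookup, not the hyphenation-pattern approach of hyphenation_decompounder, and the function name and size limits are assumptions for illustration.

```python
def decompound(token, word_list, min_subword_size=2, max_subword_size=15):
    """Return the original token followed by every word-list entry found inside it."""
    words = {word.lower() for word in word_list}
    token = token.lower()
    parts = [token]  # the original token is always kept
    for start in range(len(token)):
        for size in range(min_subword_size, max_subword_size + 1):
            end = start + size
            if end > len(token):
                break
            candidate = token[start:end]
            if candidate in words:
                parts.append(candidate)
    return parts


print(decompound("deathstar", ["death", "star"]))
# ['deathstar', 'death', 'star']
```

All parts share the position of the original token, so a search for "star" can find "deathstar".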

German Dictionary (LGPL) https://github.com/uschindler/german-decompounder

PUT /decompound_de { "settings": { "number_of_shards": 1, "analysis": { "filter": { "german_decompounder": { "type": "hyphenation_decompounder", "word_list_path": "dictionary-de.txt", "hyphenation_patterns_path": "hyph/de_DR.xml", "only_longest_match": true, "min_subword_size": 4 }, "german_stemmer": { "type": "stemmer", "language": "light_german" } }, "analyzer": { "german_decompound": { "type": "custom", "tokenizer": "standard", "filter": [ "lowercase", "german_decompounder", "german_normalization", "german_stemmer" ] } } } } }

GET /decompound_de/_analyze { "analyzer" : "german_decompound", "text" : "Todesstern" }

{ "tokens": [ { "token": "todesst", "start_offset": 0, "end_offset": 10, "type": "<ALPHANUM>", "position": 0 }, { "token": "tod", "start_offset": 0, "end_offset": 10, "type": "<ALPHANUM>", "position": 0 }, { "token": "stern", "start_offset": 0, "end_offset": 10, "type": "<ALPHANUM>", "position": 0 } ] }

Without Word Lists https://github.com/jprante/elasticsearch-analysis-decompound

Performance

Conclusion

Indexing Formatting Tokenize Lowercase, Stop Words, Stemming Synonyms

Scoring Term Frequency Inverse Document Frequency Field-Length Norm Vector Space Model

Advanced Queries Highlighting Suggestions NGrams, Edge Grams Multiple Analyzers

Advanced Queries Reindex & Alias Update by Query Shingles Decompound

Monitor Your Apps with the Elastic Stack

Disclaimer I build highly monitored Hello World apps

Agenda Monitor Java (preconfigured) Some Security Monitor PHP (configure yourself)

Code https://github.com/xeraa/microservice-monitoring

Cloud

Java Application

Simple No discovery, load-balancing,...

Monitor Java

Kibana Monitoring Overview of the Elastic Stack components

Metricbeat System [Metricbeat System] Overview and [Metricbeat System] Host overview dashboards See the memory spike every 5min

Time Series Visual Builder Sum of system.memory.actual.used.bytes Sum of system.process.memory.rss.bytes grouped by the term system.process.name and moved to the negative y-axis with a Math step

Packetbeat Call /, /good, /bad, and /foobar [Packetbeat] Overview, [Packetbeat] Flows, [Packetbeat] HTTP, and [Packetbeat] DNS Tunneling dashboards

Packetbeat Raw events in Discover Process enrichment for nginx, Java, and the APM server

Filebeat Modules [Filebeat Nginx] Access and error logs, [Filebeat System] Syslog dashboard, and [Osquery Result] Compliance pack dashboards

Filebeat Raw events in Discover /good: MDC logging under json.name and the context view for one log message meta.* and host.* information

Filebeat /bad and /null: Stacktraces by filtering down on application:java and json.severity:ERROR Visualize json.stack_hash

Heartbeat Heartbeat HTTP monitoring dashboard Stop and start the frontend application while auto refreshing

Metricbeat nginx [Metricbeat Nginx] Overview dashboard

Metricbeat HTTP /health and /metrics endpoints Collected information in Discover

Metricbeat JMX Same data Visualize the heap usage: jolokia.metrics.memory.heap_usage.used divided by the max of jolokia.metrics.memory.heap_usage.max

Annotations Add changes from the events index

Some Security

Filebeat Modules [Filebeat Auditd] Audit Events, [Filebeat System] New users and groups, and [Filebeat System] Sudo commands dashboards

https://github.com/linux-audit "auditd is the userspace component to the Linux Auditing System. It's responsible for writing audit records to the disk. Viewing the logs is done with the ausearch or aureport utilities."

Auditd Monitors File and network access System calls Commands run by a user Security events

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/security_guide/chap-system_auditing

Understanding Logs https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/security_guide/sec-understanding_audit_log_files

Auditbeat [Auditbeat Auditd] Overview dashboard

Fail SSH ssh elastic-user@xeraa.wtf with a bad password [Filebeat System] SSH login attempts dashboard

Success ssh elastic-user@xeraa.wtf with a good password Run service nginx restart and pick the elastic-admin user

Audit Event [Auditbeat Auditd] Executions dashboard filter elastic-user

Audit Event cat /etc/passwd Filter for tags is developers-passwdread in Discover

Power Abuse ssh elastic-admin@xeraa.wtf sudo cat /home/elastic-user/secret.txt Tag power-abuse in Discover

File Integrity Change something in /var/www/html/index.html [Auditbeat File Integrity] Overview dashboard

Monitor PHP

Heartbeat Add HTTP on port 88

Packetbeat Add HTTP on port 88

Metricbeat php-fpm
- module: php_fpm
  metricsets: ["pool"]
  period: 10s
  status_path: "/status"
  hosts: ["http://localhost"]

Filebeat Collect /var/www/html/silverstripe/logs/*.json

More

Alerting Gold License and part of the Elastic Cloud

Machine Learning Anomaly Detection of Time Series Data Platinum License and part of the Elastic Cloud

Conclusion

System metrics & network Filebeat modules & Auditbeat Application logs

Uptime Application metrics Request tracing

Code https://github.com/xeraa/microservice-monitoring

Questions? Philipp Krenn @xeraa PS: Sticker