Make Your Data FABulous

A presentation at Voxxed Days Minsk in May 2018 in Minsk, Belarus by Philipp Krenn

Slide 1

Make Your Data FABulous
Philipp Krenn
@xeraa

Slide 2

Developer Advocate

Slide 3

ViennaDB
Papers We Love Vienna

Slide 4

What is the perfect datastore solution?

Slide 5

It depends...

Slide 6

Pick your tradeoffs

Slide 7

Slide 8

CAP Theorem

Slide 9

Slide 10

Consistent
"[...] a total order on all operations such that each operation looks as if it were completed at a single instant."

Slide 11

Available
"[...] every request received by a non-failing node in the system must result in a response."

Slide 12

Partition Tolerant
"[...] the network will be allowed to lose arbitrarily many messages sent from one node to another."

Slide 13

https://berb.github.io/diploma-thesis/original/061_challenge.html

Slide 14

Robinson Crusoe

Slide 15

/dev/null breaks CAP: effect of write are always consistent, it's always available, and all replicas are consistent even during partitions. — https://twitter.com/ashic/status/591511683987701760

Slide 16

FAB Theory

Slide 17

Fast
Near real-time instead of batch processing

Slide 18

Accurate
Exact instead of approximate results

Slide 19

Big
Parallel computing tools are needed to handle the data

Slide 20

The 42 V's of Big Data and Data Science
https://www.elderresearch.com/company/blog/42-v-of-big-data

Slide 21

Slide 22

Slide 23

Fast ✅ Big ✅ Accurate ❔

Slide 24

Slide 25

Shard
Unit of scale

Slide 26

Slide 27

The evil wizard Mondain had attempted to gain control over Sosaria by trapping its essence in a crystal. When the Stranger at the end...

Slide 28

...of Ultima I defeated Mondain and shattered the crystal, the crystal shards each held a refracted copy of Sosaria.
— http://www.raphkoster.com/2009/01/08/database-sharding-came-from-uo/

Slide 29

Slide 30

Slide 31

version: '2'
services:
  kibana:
    image: docker.elastic.co/kibana/kibana:6.2.4
    links:
      - elasticsearch
    ports:
      - 5601:5601
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.2.4
    volumes:
      - esdata1:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
volumes:
  esdata1:
    driver: local

Slide 32

Terms Aggregation

Slide 33

Word      Count    Word        Count
Luke         64    Droid          13
R2           31    3PO            13
Alderaan     20    Princess       12
Kenobi       19    Ben            11
Obi-Wan      18    Vader          11
Droids       16    Han            10
Blast        15    Jedi           10
Imperial     15    Sandpeople     10

Slide 34

PUT starwars
{
  "settings" : {
    "number_of_shards" : 5,
    "number_of_replicas" : 0
  }
}

Slide 35

{ "index" : { "_index" : "starwars", "_type" : "_doc", "routing" : "0" } }
{ "word" : "Luke" }
{ "index" : { "_index" : "starwars", "_type" : "_doc", "routing" : "1" } }
{ "word" : "Luke" }
{ "index" : { "_index" : "starwars", "_type" : "_doc", "routing" : "2" } }
{ "word" : "Luke" }
{ "index" : { "_index" : "starwars", "_type" : "_doc", "routing" : "3" } }
{ "word" : "Luke" }
...
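
The bulk body above alternates between an action line carrying a custom routing value and the document itself. Below is a minimal Python sketch of generating such a body. The helper name `bulk_lines` is hypothetical, and assigning routing values round-robin is an assumption: the slides do not say how the routing values were chosen.

```python
import json


def bulk_lines(word, count, routing_values, index="starwars", doc_type="_doc"):
    """Build newline-delimited bulk lines for `count` copies of `word`.

    Hypothetical helper: routing values are assigned round-robin, which is
    an assumption and not necessarily how the original data set was built.
    """
    lines = []
    for i in range(count):
        routing = str(routing_values[i % len(routing_values)])
        action = {"index": {"_index": index, "_type": doc_type, "routing": routing}}
        lines.append(json.dumps(action))
        lines.append(json.dumps({"word": word}))
    return lines


if __name__ == "__main__":
    # The bulk API expects a newline after the last line as well.
    body = "\n".join(bulk_lines("Luke", 4, range(4))) + "\n"
    print(body)
```

Each document produces exactly two lines, so the body for four documents has eight lines.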

Slide 36

Slide 37

GET starwars/_search
{
  "query" : {
    "match" : {
      "word" : "Luke"
    }
  }
}

Slide 38

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 64,
    "max_score" : 3.2049506,
    "hits" : [
      {
        "_index" : "starwars",
        "_type" : "_doc",
        "_id" : "0vVdy2IBkmPuaFRg659y",
        "_score" : 3.2049506,
        "_routing" : "1",
        "_source" : {
          "word" : "Luke"
        }
      },
      ...

Slide 39

GET starwars/_search
{
  "aggs" : {
    "most_common" : {
      "terms" : {
        "field" : "word.keyword",
        "size" : 1
      }
    }
  },
  "size" : 0
}

Slide 40

{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 288,
    "max_score" : 0,
    "hits" : []
  },
  "aggregations" : {
    "most_common" : {
      "doc_count_error_upper_bound" : 10,
      "sum_other_doc_count" : 232,
      "buckets" : [
        {
          "key" : "Luke",
          "doc_count" : 56
        }
      ]
    }
  }
}

Slide 41

Slide 42

{ "index" : { "_index" : "starwars", "_type" : "_doc", "routing" : "0" } }
{ "word" : "Luke" }
{ "index" : { "_index" : "starwars", "_type" : "_doc", "routing" : "1" } }
{ "word" : "Luke" }
...
{ "index" : { "_index" : "starwars", "_type" : "_doc", "routing" : "9" } }
{ "word" : "Luke" }
{ "index" : { "_index" : "starwars", "_type" : "_doc", "routing" : "0" } }
{ "word" : "Luke" }
{ "index" : { "_index" : "starwars", "_type" : "_doc", "routing" : "0" } }
{ "word" : "Luke" }
...

Slide 43

Routing
shard# = hash(_routing) % #primary_shards
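
The routing formula can be tried out with a few lines of Python. This is only a sketch: `shard_for` is a hypothetical helper, and `zlib.crc32` stands in for the hash function, which is an assumption made for illustration. Elasticsearch uses its own hash, so the shard numbers printed here will not match a real cluster.

```python
import zlib
from collections import Counter


def shard_for(routing, num_primary_shards):
    """Map a routing value to a shard number: hash(_routing) % #primary_shards.

    zlib.crc32 is only a stand-in hash for illustration. Elasticsearch uses
    its own hash function, so real shard numbers will differ.
    """
    return zlib.crc32(str(routing).encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    # Ten routing values spread over five primary shards, as in the slides.
    placement = Counter(shard_for(r, 5) for r in range(10))
    for shard in sorted(placement):
        print(f"shard {shard}: {placement[shard]} routing value(s)")
```

Because the shard number depends on the number of primary shards, the same routing value can land on a different shard once that number changes, and nothing guarantees that ten routing values spread evenly over five shards.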

Slide 44

GET _cat/shards?index=starwars&v

index    shard prirep state   docs store ip         node
starwars 3     p      STARTED   58 6.4kb 172.19.0.2 Q88C3vO
starwars 4     p      STARTED   26 5.2kb 172.19.0.2 Q88C3vO
starwars 2     p      STARTED   71 6.9kb 172.19.0.2 Q88C3vO
starwars 1     p      STARTED   63 6.6kb 172.19.0.2 Q88C3vO
starwars 0     p      STARTED   70 6.7kb 172.19.0.2 Q88C3vO

Slide 45

(Sub) Results Per Shard
shard_size = (size * 1.5 + 10)
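
As a quick worked example of the formula above, the following Python sketch computes the default number of terms each shard returns for a few values of `size`. The function name is hypothetical, and truncating the result to an integer is an assumption of this sketch.

```python
def default_shard_size(size):
    """Default number of terms requested from each shard.

    Follows the formula on the slide, shard_size = size * 1.5 + 10.
    Truncating to an integer is an assumption of this sketch.
    """
    return int(size * 1.5 + 10)


if __name__ == "__main__":
    for size in (1, 10, 100):
        print(size, default_shard_size(size))
```

Under that assumption, a request with size 1 asks every shard for its top 11 terms, so any term ranked lower than that on a shard is not counted for that shard.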

Slide 46

How Many?
Results per shard
Results for aggregation

Slide 47

"doc_count_error_upper_bound": 10
"sum_other_doc_count": 232
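
To see where an undercount and its error bound can come from, here is a small Python simulation of merging per-shard top terms. It is a simplified sketch, not Elasticsearch's implementation: the function names are hypothetical, the per-shard counts are made up, and the bound is modelled as the count of the last term each shard returned.

```python
from collections import Counter


def top_terms(counts, limit):
    """Return the `limit` most frequent (term, count) pairs, ties broken by term."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


def merged_terms_agg(shards, size, shard_size):
    """Merge per-shard top terms in the way a coordinating node might.

    Returns (buckets, global_error_bound), where each bucket is
    (term, doc_count, doc_count_error_upper_bound).
    """
    partials = []
    for counts in shards:
        top = top_terms(counts, shard_size)
        # A shard that returned all of its terms cannot have hidden anything.
        cutoff = top[-1][1] if top and len(counts) > shard_size else 0
        partials.append((dict(top), cutoff))

    merged = Counter()
    for top, _ in partials:
        merged.update(top)

    buckets = []
    for term, doc_count in top_terms(merged, size):
        # Worst case: every shard that did not report the term held as many
        # documents for it as the last term that shard did report.
        error = sum(cutoff for top, cutoff in partials if term not in top)
        buckets.append((term, doc_count, error))

    global_error = sum(cutoff for _, cutoff in partials)
    return buckets, global_error


if __name__ == "__main__":
    # Made-up per-shard counts, not the data set from the talk.
    shards = [
        {"Luke": 10, "R2": 3, "Kenobi": 2},
        {"Luke": 1, "R2": 5, "Kenobi": 4},
    ]
    print(merged_terms_agg(shards, size=1, shard_size=2))
    print(merged_terms_agg(shards, size=1, shard_size=3))
```

In this toy data, "Luke" is ranked third on the second shard, so with a shard_size of 2 its single document there is never reported and the merged count is too low. Once shard_size covers every term on every shard, the count is exact and the bound drops to zero.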

Slide 48

GET starwars/_search
{
  "aggs" : {
    "most_common" : {
      "terms" : {
        "field" : "word.keyword",
        "size" : 1,
        "show_term_doc_count_error" : true
      }
    }
  },
  "size" : 0
}

Slide 49

"aggregations" : {
  "most_common" : {
    "doc_count_error_upper_bound" : 10,
    "sum_other_doc_count" : 232,
    "buckets" : [
      {
        "key" : "Luke",
        "doc_count" : 56,
        "doc_count_error_upper_bound" : 9
      }
    ]
  }
}

Slide 50

GET starwars/_search
{
  "aggs" : {
    "most_common" : {
      "terms" : {
        "field" : "word.keyword",
        "size" : 1,
        "shard_size" : 20,
        "show_term_doc_count_error" : true
      }
    }
  },
  "size" : 0
}

Slide 51

"aggregations" : {
  "most_common" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 224,
    "buckets" : [
      {
        "key" : "Luke",
        "doc_count" : 64,
        "doc_count_error_upper_bound" : 0
      }
    ]
  }
}

Slide 52

Inverse Document Frequency

Slide 53

GET starwars/_search
{
  "query" : {
    "match" : {
      "word" : "Luke"
    }
  }
}

Slide 54

...
{
  "_index" : "starwars",
  "_type" : "_doc",
  "_id" : "0vVdy2IBkmPuaFRg659y",
  "_score" : 3.2049506,
  "_routing" : "1",
  "_source" : {
    "word" : "Luke"
  }
},
{
  "_index" : "starwars",
  "_type" : "_doc",
  "_id" : "2PVdy2IBkmPuaFRg659y",
  "_score" : 3.2049506,
  "_routing" : "7",
  "_source" : {
    "word" : "Luke"
  }
},
{
  "_index" : "starwars",
  "_type" : "_doc",
  "_id" : "0_Vdy2IBkmPuaFRg659y",
  "_score" : 3.1994843,
  "_routing" : "2",
  "_source" : {
    "word" : "Luke"
  }
},
...

Slide 55

Slide 56

Term Frequency / Inverse Document Frequency (TF/IDF)

Slide 57

BM25
Default since Elasticsearch 5.0

Slide 58

Term Frequency

Slide 59

Slide 60

Inverse Document Frequency

Slide 61

Slide 62

Field-Length Norm

Slide 63

Query Then Fetch

Slide 64

Query

Slide 65

Fetch

Slide 66

DFS Query Then Fetch
Distributed Frequency Search
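
The difference between shard-local and global term statistics can be illustrated with the inverse document frequency part of BM25. The Python sketch below uses the IDF formula from Lucene's BM25 similarity. The shard sizes are taken from the `_cat/shards` output in this talk, but the number of "Luke" documents per shard is invented for illustration (only the total of 64 is known), and the term-frequency and field-length parts of the score are left out.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """IDF component as used by Lucene's BM25 similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


if __name__ == "__main__":
    # (documents on shard, documents containing the term) for shards 0 to 4.
    # Shard sizes come from the _cat/shards output; the per-shard term
    # counts are invented and only add up to the known total of 64.
    shards = [(70, 20), (63, 12), (71, 14), (58, 10), (26, 8)]

    # Query Then Fetch: every shard scores with its own local statistics.
    for shard, (n_docs, n_term) in enumerate(shards):
        print(f"shard {shard}: local idf = {bm25_idf(n_docs, n_term):.4f}")

    # DFS Query Then Fetch: statistics are combined before scoring.
    total_docs = sum(n for n, _ in shards)
    total_term = sum(t for _, t in shards)
    print(f"global idf = {bm25_idf(total_docs, total_term):.4f}")
```

With local statistics, identical documents can receive different scores depending on the shard they live on. With combined statistics, every shard uses the same IDF.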

Slide 67

GET starwars/_search?search_type=dfs_query_then_fetch
{
  "query" : {
    "match" : {
      "word" : "Luke"
    }
  }
}

Slide 68

{
  "_index" : "starwars",
  "_type" : "_doc",
  "_id" : "0fVdy2IBkmPuaFRg659y",
  "_score" : 1.5367417,
  "_routing" : "0",
  "_source" : {
    "word" : "Luke"
  }
},
{
  "_index" : "starwars",
  "_type" : "_doc",
  "_id" : "2_Vdy2IBkmPuaFRg659y",
  "_score" : 1.5367417,
  "_routing" : "0",
  "_source" : {
    "word" : "Luke"
  }
},
{
  "_index" : "starwars",
  "_type" : "_doc",
  "_id" : "3PVdy2IBkmPuaFRg659y",
  "_score" : 1.5367417,
  "_routing" : "0",
  "_source" : {
    "word" : "Luke"
  }
},
...

Slide 69

Slide 70

Slide 71

Often a Non-Issue or Minor Issue
Lots of documents
Even distribution

Slide 72

Don’t use dfs_query_then_fetch in production. It really isn’t required.
— https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-is-broken.html

Slide 73

Single Shard
Default in 7.0

Slide 74

Simon Says
Use a single shard until it blows up

Slide 75

PUT starwars/_settings
{
  "settings" : {
    "index.blocks.write" : true
  }
}

Slide 76

POST starwars/_shrink/starwars_single
{
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 0
  }
}
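
Shrinking only works for certain target shard counts. The Python sketch below lists them, assuming the rule from the shrink documentation that the target number of primary shards must be a factor of the source number. The helper name is hypothetical, and it leaves out the source count itself because that would not shrink anything.

```python
def valid_shrink_targets(source_shards):
    """Shard counts an index with `source_shards` primaries can shrink to.

    Assumes the documented shrink rule that the target count must be a
    factor of the source count. The source count itself is excluded.
    """
    return [n for n in range(1, source_shards) if source_shards % n == 0]


if __name__ == "__main__":
    for source in (5, 8, 12):
        print(source, valid_shrink_targets(source))
```

For the five-shard starwars index, a single shard is the only valid target under this rule, because 5 is prime.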

Slide 77

GET starwars_single/_search
{
  "query" : {
    "match" : {
      "word" : "Luke"
    }
  },
  "_source" : false
}

Slide 78

{
  "_index" : "starwars_single",
  "_type" : "_doc",
  "_id" : "0fVdy2IBkmPuaFRg659y",
  "_score" : 1.5367417,
  "_routing" : "0"
},
{
  "_index" : "starwars_single",
  "_type" : "_doc",
  "_id" : "2_Vdy2IBkmPuaFRg659y",
  "_score" : 1.5367417,
  "_routing" : "0"
},
{
  "_index" : "starwars_single",
  "_type" : "_doc",
  "_id" : "3PVdy2IBkmPuaFRg659y",
  "_score" : 1.5367417,
  "_routing" : "0"
},

Slide 79

GET starwars_single/_search
{
  "aggs" : {
    "most_common" : {
      "terms" : {
        "field" : "word.keyword",
        "size" : 1
      }
    }
  },
  "size" : 0
}

Slide 80

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 288,
    "max_score" : 0,
    "hits" : []
  },
  "aggregations" : {
    "most_common" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 224,
      "buckets" : [
        {
          "key" : "Luke",
          "doc_count" : 64
        }
      ]
    }
  }
}

Slide 81

Slide 82

Conclusion

Slide 83

Tradeoffs...

Slide 84

Consistent, Available, Partition Tolerant
Fast, Accurate, Big

Slide 85

Questions?
Philipp Krenn
@xeraa
PS: Stickers