Roman Nikitchenko, 09.10.2014 
BIG.DATA 
Technology scope
Any real big data is 
just about 
DIGITAL LIFE 
FOOTPRINT 
www.vitech.com.ua 2
BIG DATA is not about the data. 
It is about OUR ABILITY TO HANDLE IT. 
www.vitech.com.ua 3
Our stack 
What is our stack 
of big data 
technologies? 
Some of our 
specifics 
But we are 
always special, 
aren't we? 
A couple of 
buzzwords 
Arguments for 
meetings with 
management ;-) 
www.vitech.com.ua 4
YARN 
Linear scalability: 2 
times more power costs 
2 times more money 
No natural keys so load 
balancing is perfect 
No 'special' hardware 
so staging is closer to 
production. 
www.vitech.com.ua 5
HADOOP magic is here! 
www.vitech.com.ua 6
● Hadoop is an open 
source framework for 
big data: both 
distributed storage 
and processing. 
● Hadoop is reliable and 
fault tolerant without 
relying on hardware 
for these properties. 
● Hadoop has unique 
horizontal scalability: 
currently from a 
single computer up to 
thousands of cluster 
nodes. 
What is 
it? 
What is 
HADOOP? 
www.vitech.com.ua 7
What is HADOOP INDEED? 
Why hadoop? 
[Slide graphic: a wall of repeated "BIG DATA" tiles illustrating scale.] 
www.vitech.com.ua 8
SIMPLE BUT 
RELIABLE 
● A really big amount of data 
stored in a reliable manner. 
● Storage is simple, 
recoverable and cheap 
(relatively). 
● The same goes for 
processing power. 
www.vitech.com.ua 9
COMPLEX INSIDE, 
SIMPLE OUTSIDE 
● Complexity is buried 
inside. Most of the really 
complex operations are 
handled by the engine. 
● The interface is remote and 
compatible between 
versions, so clients are 
relatively safe against 
implementation changes. 
www.vitech.com.ua 10
DECENTRALIZED 
● No single point of failure 
(almost). 
● Scalable as close to linear as 
possible. 
● No manual actions needed 
to recover from failures. 
www.vitech.com.ua 11
Hadoop 
historical 
top view 
● HDFS serves as file 
system layer 
● MapReduce originally 
served as distributed 
processing framework. 
● The native client API 
is Java, but there are 
lots of alternatives. 
● This is only the initial 
architecture; it is 
now more complex. 
www.vitech.com.ua 12
HDFS 
top 
view 
HDFS is... scalable 
● The namenode is the 
'management' 
component. It keeps a 
'directory' of which file 
blocks are stored 
where. 
● Actual work is 
performed by data 
nodes. 
www.vitech.com.ua 13
HDFS is... reliable 
● Files are stored in large enough blocks. Every block is 
replicated to several data nodes. 
● Replication is tracked by the namenode. Clients only locate 
blocks using the namenode; the actual load is taken by 
datanodes. 
● A datanode failure leads to replication recovery. The 
namenode can be backed by a standby scheme. 
www.vitech.com.ua 14
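The namenode's replica bookkeeping can be sketched in a few lines of plain Java. This is our own toy model, not actual HDFS code; the class and method names are ours:

```java
import java.util.*;

// Namenode-style bookkeeping sketch: which datanodes hold each block,
// and which blocks fall below the replication target when a node dies.
class BlockDirectory {
    private final Map<String, Set<String>> blockToNodes = new HashMap<>();
    private final int replicationTarget;

    BlockDirectory(int replicationTarget) {
        this.replicationTarget = replicationTarget;
    }

    void addReplica(String blockId, String node) {
        blockToNodes.computeIfAbsent(blockId, b -> new HashSet<>()).add(node);
    }

    // On datanode failure: forget its replicas and report every block that
    // now needs re-replication -- recovery is automatic bookkeeping, not backups.
    List<String> nodeFailed(String node) {
        List<String> underReplicated = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : blockToNodes.entrySet()) {
            if (e.getValue().remove(node) && e.getValue().size() < replicationTarget) {
                underReplicated.add(e.getKey());
            }
        }
        return underReplicated;
    }
}
```

The "no backups" stance on the next slide rests on exactly this: as long as the directory notices under-replication, the cluster re-replicates by itself.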
NO BACKUPS 
www.vitech.com.ua 15
MapReduce is... 
● A two-step data processing model: map (transform) and then 
reduce. Really nice for doing things in a distributed manner. 
● A large class of jobs can be adapted to it, but not all of them. 
www.vitech.com.ua 16
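The map/shuffle/reduce flow can be sketched in plain Java. This is a toy model, not the Hadoop MapReduce API; all names here are ours:

```java
import java.util.*;
import java.util.stream.*;

// Toy word count showing the two phases: map emits (word, 1) pairs,
// the "framework" groups pairs by key, and reduce aggregates each group.
class WordCount {
    // Map phase: one input line -> a list of (key, value) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce phase: all values collected for one key -> a single result.
    static int reduce(String key, List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    // The "framework" part: shuffle pairs by key, then reduce per key.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> e : map(line)) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((k, v) -> result.put(k, reduce(k, v)));
        return result;
    }
}
```

In real Hadoop, map and reduce run on different nodes and the shuffle goes over the network; the shape of the job, however, is exactly this.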
BIG DATA 
processing 
requires 
DISTRIBUTION 
LOAD HAS TO BE 
SHARED 
● Work is to be 
balanced. 
● Work can be shared 
in accordance with 
data placement. 
● Work is to be 
balanced to reflect 
resource balance. 
www.vitech.com.ua 17
DATA LOCALITY 
TOOLS ARE TO BE CLOSE 
TO WORK PLACE 
● Process data with 
MapReduce on the 
same nodes it is 
stored on. 
● Distributed storage 
— distributed 
processing. 
www.vitech.com.ua 18
DISTRIBUTION 
+ LOCALITY 
TOGETHER THEY GO! 
[Slide diagram: YOUR DATA split into partitions ("share it"); WORK TO DO 
runs locally on each partition ("do it locally"); results merge into a 
JOINED RESULT.] 
Data partitioning drives 
work sharing. Good 
partitioning — good 
scalability. 
www.vitech.com.ua 19
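The "good partitioning — good scalability" point can be made concrete with a hypothetical partitioner sketch (ours, not the actual Hadoop Partitioner interface):

```java
// Hypothetical helper showing the core idea: a deterministic
// key -> partition mapping, so the same key always lands on the same
// partition, and work can follow the data placement.
class KeyPartitioner {
    private final int partitions;

    KeyPartitioner(int partitions) {
        this.partitions = partitions;
    }

    // floorMod keeps the result in [0, partitions) even for negative hashes.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions);
    }
}
```

A skewed key distribution defeats this scheme: if most keys hash to one partition, one worker gets most of the load, which is why key design matters.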
Now with resource management 
● A new component (YARN) forms the resource management 
layer and completes a real distributed data OS. 
● MapReduce is from now on only one among other YARN 
applications. 
www.vitech.com.ua 20
Why is YARN SO 
important? 
● Better resource balance for 
heterogeneous clusters 
and multiple applications. 
● Dynamic applications over 
static services. 
● A much wider application 
model than simple 
MapReduce. Things like 
Spark or Tez. 
www.vitech.com.ua 21
The world's first ever 
DATA OS 
A 10,000-node computer... 
Recent technology changes are focused on 
higher scale: better resource usage and 
control, lower MTTR, higher security, 
redundancy, fault tolerance. 
www.vitech.com.ua 22
Hadoop: 
don't do it 
yourself 
www.vitech.com.ua 23
Choose your destiny! We did. 
● HortonWorks is 'barely open source'. Innovative, but 
'running too fast': most of their key technologies are not 
so mature yet. 
● Cloudera is stable enough but not stale. Hadoop 2.3 with 
YARN, HBase 0.98.x. Balance. Spark 1.x is a bold move! 
● MapR focuses on performance per node, but they are 
slightly outdated in terms of functionality and their 
distribution costs money. For cases where node 
performance is a high priority. 
www.vitech.com.ua 24
HBase 
motivation 
But Hadoop is... 
● Designed for throughput, 
not for latency. 
● HDFS blocks are expected 
to be large. There is an issue 
with lots of small files. 
● Write once, read many 
times ideology. 
● MapReduce is not so 
flexible, and neither is any 
database built on top of it. 
● How about realtime? 
www.vitech.com.ua 25
HBase 
motivation 
BUT WE 
OFTEN 
NEED... 
LATENCY, SPEED and all 
Hadoop properties. 
www.vitech.com.ua 26
High layer applications 
Resource management 
YARN 
Distributed file system 
www.vitech.com.ua 27
Logical data model 
[Slide diagram: a Table split into Regions; each row is a Row Key plus 
columns grouped under Family #1, Family #2, ...] 
● Data is placed in tables. 
● Tables are split into regions based on row key ranges. 
● Every table row is identified by a unique row key. 
● Every row consists of columns. 
● Columns are grouped into families. 
www.vitech.com.ua 28
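The logical model above can be sketched as nested sorted maps — a common mental model for HBase. This toy class is ours, not the HBase client API:

```java
import java.util.*;

// row key -> column family -> qualifier -> value,
// with a "region" being a contiguous row-key range.
class LogicalTable {
    private final NavigableMap<String, Map<String, Map<String, String>>> rows = new TreeMap<>();

    void put(String rowKey, String family, String qualifier, String value) {
        rows.computeIfAbsent(rowKey, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .put(qualifier, value);
    }

    String get(String rowKey, String family, String qualifier) {
        return rows.getOrDefault(rowKey, Map.of())
                   .getOrDefault(family, Map.of())
                   .get(qualifier);
    }

    // A "region": all rows with keys in [fromKey, toKey). Because the outer
    // map is sorted by row key, a region is just a contiguous slice of it.
    SortedMap<String, Map<String, Map<String, String>>> regionSlice(String fromKey, String toKey) {
        return rows.subMap(fromKey, toKey);
    }
}
```

The sorted outer map is the important part: row-key ordering is what makes range-based region splits (and range scans) cheap.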
Real data model 
[Slide diagram: a Table split into Regions; each column family is stored 
as its own HFile of (Row key, Column, Value, TS) records.] 
● Data is stored in HFiles. 
● Families are stored on disk in separate files. 
● Row keys are indexed in memory. 
● A column includes key, qualifier, value and timestamp. 
● No column limit. 
● Storage is block based. 
● Delete is just another marker record. 
● Periodic compaction is required. 
www.vitech.com.ua 29
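The delete-marker and compaction bullets can be sketched as an append-only cell log. This is our own simplification, not HBase internals:

```java
import java.util.*;

// Why a delete is "just another marker record": cells are append-only like
// HFile records, reads resolve the newest version per key, and a tombstone
// newer than any put hides the value until compaction removes both.
class CellStore {
    record Cell(String rowKey, String qualifier, String value, long ts, boolean tombstone) {}

    private final List<Cell> cells = new ArrayList<>(); // append-only log

    void put(String row, String q, String value, long ts) {
        cells.add(new Cell(row, q, value, ts, false));
    }

    void delete(String row, String q, long ts) {
        cells.add(new Cell(row, q, null, ts, true)); // marker record only
    }

    // Read-time resolution: newest cell wins; a tombstone yields "not found".
    Optional<String> get(String row, String q) {
        return cells.stream()
                .filter(c -> c.rowKey().equals(row) && c.qualifier().equals(q))
                .max(Comparator.comparingLong(Cell::ts))
                .filter(c -> !c.tombstone())
                .map(Cell::value);
    }

    // "Compaction": physically drop shadowed cells and tombstones.
    void compact() {
        Map<String, Cell> newest = new LinkedHashMap<>();
        for (Cell c : cells) {
            String key = c.rowKey() + "/" + c.qualifier();
            Cell prev = newest.get(key);
            if (prev == null || prev.ts() < c.ts()) newest.put(key, c);
        }
        cells.clear();
        for (Cell c : newest.values()) {
            if (!c.tombstone()) cells.add(c);
        }
    }
}
```

This is also why compaction is "required": without it, every overwrite and delete keeps growing the files.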
HBase: infrastructure view 
[Slide diagram: Client, Zookeeper, Master (META DATA) and four Region 
servers (RS).] 
● Zookeeper coordinates distributed elements and is the 
primary contact point for clients. 
● The Master server keeps metadata and manages data 
distribution over Region servers. 
● Region servers manage data table regions. 
● Clients locate the master through ZooKeeper, then the 
needed regions through the master. 
● Clients communicate directly with region servers for data. 
www.vitech.com.ua 30
Together with HDFS 
[Slide diagram: Client, Zookeeper, Master (META DATA) and NameNode; racks 
of Region servers (RS) co-located with HDFS data nodes (DN).] 
● Zookeeper coordinates distributed elements and is the 
primary contact point for clients. 
● The Master server keeps metadata and manages data 
distribution over Region servers. 
● Region servers manage data table regions. 
● The actual data storage service, including replication, is on 
HDFS data nodes. 
● Clients locate the master through ZooKeeper, then the 
needed regions through the master. 
● Clients communicate directly with region servers for data. 
www.vitech.com.ua 31
DATA LAKE 
Take as much data 
about your business 
processes as you can. 
The more data you 
have, the more value 
you can get from it. 
www.vitech.com.ua 32
Apache 
ZooKeeper 
… because coordinating 
distributed systems is a Zoo 
www.vitech.com.ua 33
Apache 
ZooKeeper 
We use this guy: 
● As a part of Hadoop / 
HBase infrastructure 
● To coordinate MapReduce 
job tasks 
www.vitech.com.ua 34
Apache 
Spark 
● A better MapReduce, with at least some 
MapReduce elements able to be reused. 
● Dynamic, faster to start up and does not need 
anything from the cluster. 
● New job models. Not only Map and Reduce. 
● Results can be passed through memory, 
including the final one. 
www.vitech.com.ua 35
SOLR is just about search 
[Slide diagram: INDEX UPDATE and INDEX QUERY flows with search responses.] 
An index update request is 
analyzed, tokenized, 
transformed... and the 
same goes for queries. 
● SOLR indexes documents. What is stored in the 
SOLR index is not what you index. SOLR is NOT 
A STORAGE, ONLY AN INDEX. 
● But it can index ANYTHING. A search result is a 
document ID. 
www.vitech.com.ua 36
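The "index, not storage" point can be sketched as a toy inverted index. This is our own illustration, not SOLR/Lucene code:

```java
import java.util.*;
import java.util.stream.*;

// Two points from the slide made concrete: the SAME analysis chain runs at
// index time and at query time, and the index maps terms back to document
// IDs only -- it stores no documents.
class TinyIndex {
    private final Map<String, Set<String>> termToDocIds = new HashMap<>();

    // A minimal "analysis chain": lowercase, then split into tokens.
    static List<String> analyze(String text) {
        return Arrays.stream(text.toLowerCase().split("[^a-z0-9]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    void index(String docId, String text) {
        for (String term : analyze(text)) {
            termToDocIds.computeIfAbsent(term, t -> new HashSet<>()).add(docId);
        }
    }

    // Queries go through the same analyze() call; the result is document IDs,
    // never documents -- fetching those is the storage layer's job (e.g. HBase).
    Set<String> search(String query) {
        Set<String> result = null;
        for (String term : analyze(query)) {
            Set<String> hits = termToDocIds.getOrDefault(term, Set.of());
            if (result == null) result = new HashSet<>(hits);
            else result.retainAll(hits);
        }
        return result == null ? Set.of() : result;
    }
}
```

Sharing the analysis chain between indexing and querying is what makes "BIG" match "big"; an index built with one chain and queried with another silently misses documents.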
● HBase handles online user data change 
requests. 
● The NGData Lily indexer handles the stream of changes 
and transforms it into SOLR index change 
requests. 
● Indexes are built in SOLR, so HBase data is 
searchable. 
www.vitech.com.ua 37
ENTERPRISE DATA HUB 
Don't ruin your existing data warehouse. 
Just extend it with a new, centralized big 
data storage through a data migration 
solution. 
www.vitech.com.ua 38
HBase: Data and search integration 
[Slide diagram: Client -> HBase cluster (HBase regions over HDFS) -> 
REPLICATION -> Lily HBase NRT indexer -> SOLR cloud, with search 
requests and responses over HTTP and Apache Zookeeper coordinating 
all parts.] 
● The user just puts (or deletes) data. 
● Replication can be set up down to column family level. 
● The Lily HBase NRT indexer translates data changes into 
SOLR index updates. 
● The SOLR cloud finally provides search. 
● HDFS serves the low level file system. 
● Apache Zookeeper does all the coordination. 
www.vitech.com.ua 39
Questions and discussion 
www.vitech.com.ua 40