Big Data, Simple and Fast:
Addressing the Shortcomings of Hadoop
Jacek Kruszelnicki, Numatica Corporation
E-mail: j a c e k@numatica.com (remove spaces)
Phone: 781 756 8064
Scalable Software. Real-time Big Data Analytics.

Presenter
Jacek Kruszelnicki
President & CEO, Numatica Corporation
Jacek:
• 20+ years of shipping enterprise software
• Biased towards simplicity/design elegance. Vendor-neutral.
• Author, mentor & conference speaker
• M.S. Computer Science, focused on business value of IT
Focus:
• Distributed, high-throughput, low-latency software
• Real-time (“Fast”) Big Data
Services:
• IT Strategy and Planning
• Software Architecture & Design
• Proof of Concept, Quick Start programs
• Software Development
• Training, Seminars
Copyright (C) 2014 Numatica Corporation. All Rights Reserved.

Agenda
• Hadoop is not a universal, or inexpensive, Big Data stack
• Technical requirements for a flexible Big/Fast Data stack
• Solutions thought to be alternatives to Hadoop
• Why In-Memory Data Grids are a good fit for Big/Fast Data
• How Hazelcast meets the Big/Fast Data requirements
• Focus on architectures, but demo/code samples provided
“Any intelligent fool can make things bigger and more complex... It takes a touch of genius - and a lot of courage to move in the opposite direction.” - E.F. Schumacher

Not on Agenda
• Big Data Tutorial
• Hadoop or any other technology tutorial
• In-depth overview of Big Data market
• Big Data Analytics -> Business Insights
• CAP Theorem discussion

Fast Data - Business Drivers
Operational Intelligence:
• Analyze the stream of business activities and external stimuli on the fly
• React to them (preferably) instantaneously
• Real-time data stream processing is critical; otherwise business value is lost
Real-time Big (Fast) Data Analytics examples:
• Dynamic pricing (e-commerce)
• High-frequency trading
• Network security threats
• Credit card fraud prevention
• Factory floor data collection, RFID
• Mobile infrastructure, machine-to-machine (M2M) applications
• Prescriptive or location-based applications
• Real-time dashboards, alerts, and reports

Big Data: The 3 Vs Re-arranged
Common definition: Data sets too large and complex to process using standard DB tools and data processing applications (in short: “Will not fit into MS Excel”)
Volume
• Few Googles and Facebooks out there
• Typical data < 100 TB
Variety
• Structured, semi-structured
• At-Rest, In-Motion
• Variety of formats
Velocity
• Millions of events per second
• Need to re-run analytics frequently (dashboards, etc.)
• Frequently changing data (at rest or in motion)
• Stale data provides little business value
• Also need to do off-line processing (machine learning)

Big...and Fast
Data In-Motion:
• Rapidly changing, massive stream(s) of data (events)
• Multiple sources, formats
• Needs to be processed in-flight
• Events persisted or not, depending on business value
Data At-Rest:
• Data already persisted, change notification
• Multiple formats
• Re-ingest, re-process
Software stacks to support both “In-Motion” and “At-Rest” Big Data are needed.

“Big Elephant in the room...”
• Currently the large-scale data analysis tool of choice
• De facto standard, almost synonymous with Big Data
• “Nobody ever got fired for using Hadoop”
Problems:
• Very complex/expensive to deploy and run -> high TCO
• MapReduce slow, batch-oriented, “intrusive”
• No stream processing
• No support for in-memory processing
• Closely coupled with HDFS (third-party solutions of varying quality)
• SQL-on-Hadoop limited and slow
http://hortonworks.com/blog/install-hadoop-windows-hortonworks-data-platform-2-0/
http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html/
The Hadoop Stack:

Hadoop... and the kitchen sink
[Hadoop stack diagram] 19 shown, up to 24
Not shown:
• JobTracker, TaskTracker, ...
• Avro (serialization)
• Chukwa (logs, incremental)
• EMR
• BigTop
• Spark
• Impala (vs Hive)
Violates the first rule of distributed programming: DO NOT DISTRIBUTE (unnecessarily).

Hadoop 2.0 - High TCO remains
Improved:
• YARN, MapReduce separated
• Removed NameNode/JobTracker as SPOF?
Unchanged:
• More complexity
• Low-level abstractions
• Still Master/Slave

Hadoop 2.0 - High TCO remains
Hadoop-based stack example from the Web. Very complicated:
[Diagram: Berkeley Data Analytics Stack (BDAS)]

Hadoop Disillusionment Phase?
• Hadoop hyped as a universal, disruptive big data solution
• MapReduce (MR) is on its way out at Google.
  http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/
• Data scientists: 76% felt Hadoop is too slow and takes too much effort to program.
  http://www.cio.com/article/2449814/big-data/data-scientists-frustrated-by-data-variety-find-hadoop-limiting.html

Some Hadoop Alternatives
Hadoop “extensions”
• Spark
• Shark (Hive on Spark)
• Storm
• Kafka
Still need many of the Hadoop modules: HDFS, YARN, Pig, Hive, HBase, JobTracker, TaskTracker, ...
MPP DBs
• Teradata
• Vertica
• Greenplum
Proprietary, complex, VERY expensive, query language not Turing complete
NoSQL DBs
• Column: Cassandra, HBase, etc.
• Document/K-V: MongoDB, CouchDB, Riak, etc.
• In-Memory: VoltDB, etc.
No ACID/referential integrity, no triggers, no foreign keys
MongoDB/CouchDB: JSON-centered; Cassandra: complex
Query language proprietary, subset of SQL
VoltDB: precoded stored procedures, no ad-hoc queries
Not Turing complete
In-Memory Data Grids
• Coherence (commercial)
• Hazelcast (OSS)
• Terracotta (commercial)
• GemFire (commercial)
• GridGain (OSS/commercial)
• GigaSpaces XAP (OSS)

Big+Fast Alternatives
Storm/Kafka
• Intrusive (introduces exotic abstractions): Streams, Spouts, Bolts, Tasks, Workers, Stream Groups, Topologies
• Low-level: shell scripting, Clojure code here and there
• Complex admin: everything requires ZooKeeper, Nimbus is a SPOF
https://news.ycombinator.com/item?id=8416455

Big+Fast Alternatives part 2
Lambda Architecture example from the Web. Lots of moving parts:
[Diagram: Lambda Architecture]

Big, Fast Stack - Technical Requirements
A successful Big + Fast Data stack should have:
• Architectural simplicity and elegance [lowers Total Cost of Ownership (TCO)]
• Low transactional and data latency [response <1s, “fresh” or real-time data]
• High throughput, overall up- and out-scalability
• High-throughput Data Stream/Complex Event Processing
• SQL-like querying, ACID, Txs (including XA)
• Distributed Execution Framework
• High Availability and Fault Tolerance
• Stateless, decentralized, elastic cluster management

In-Memory Grids for Big And Fast Data
• Technology moving forward (6x sequential, 100Kx random):
  - RAM is the new disk (ns)
  - Disk is the new tape (ms)
• Databases (RDBMS, NoSQL) are not enough
  - SQL or SQL-like languages are not Turing complete
  - Object-oriented/functional/parallel programming abstractions needed
• In-memory data is volatile/perishable?
  - In-memory data can be persisted, if so desired
  - No need to achieve archival durability
  - Most Big Data brought in is already stored somewhere else
  - Stream data may not be relevant for long, typically limited retention time
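The point that in-memory data can be persisted when desired can be sketched as a write-through/read-through wrapper around a map. The following is only an illustration in plain JDK Java: `BackingStore`, `FakeStore` and `WriteThroughSketch` are hypothetical names invented for this example, and the toy store stands in for whatever backing storage (RDBMS, file system, and so on) a real grid would be configured with.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: an in-memory map that persists writes to a backing store.
class WriteThroughSketch {

    // Assumed persistence hook; stands in for an RDBMS, file system, etc.
    interface BackingStore {
        void store(String key, String value);
        String load(String key); // returns null when the key is absent
    }

    // Toy store backed by a HashMap so the example runs without a database.
    static final class FakeStore implements BackingStore {
        final Map<String, String> rows = new HashMap<>();
        public void store(String key, String value) { rows.put(key, value); }
        public String load(String key) { return rows.get(key); }
    }

    private final Map<String, String> memory = new ConcurrentHashMap<>();
    private final BackingStore store;

    WriteThroughSketch(BackingStore store) {
        this.store = store;
    }

    // Write-through: persist first, then update the in-memory copy.
    void put(String key, String value) {
        store.store(key, value);
        memory.put(key, value);
    }

    // Read-through: serve from memory, fall back to the store on a miss.
    String get(String key) {
        return memory.computeIfAbsent(key, store::load);
    }

    // Simulates losing the in-memory copy, for example after a restart.
    void evictAll() {
        memory.clear();
    }

    static String demo() {
        WriteThroughSketch grid = new WriteThroughSketch(new FakeStore());
        grid.put("session-1", "active");
        grid.evictAll();
        return grid.get("session-1"); // reloaded from the backing store
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Persisting before updating memory keeps the store authoritative; a write-behind variant would trade that guarantee for lower write latency.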

Why Hazelcast?
The most intellectually elegant distributed in-memory data/compute grid.
• Distributed Execution Framework (extension of Java’s ExecutorService)
• Distributed queries (SQL/Predicate), data affinity (execution on a specific node)
• Elastic cluster management, Java API included
• Auto-discovery of nodes and re-balancing
• Peer-to-peer architecture, no single point of failure
• Implements Java APIs (Map, List, Set, Queue, Lock) in a distributed manner
• Minimalism as design aesthetic:
  - Non-intrusive
  - No dependencies
  - 3.1 MB single jar library
• Apache License 2.0, commercial extensions + support
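Because the grid is described as implementing the standard Java interfaces (Map, Queue, ExecutorService), application code can be written against those interfaces alone. The sketch below is a single-JVM stand-in that uses only JDK classes: the map and executor are created locally, whereas in a Hazelcast deployment they would presumably be obtained from the cluster (that wiring is not shown and is an assumption). The class name, method names and sample keys are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Local stand-in: code written purely against java.util.concurrent interfaces.
class GridInterfacesSketch {

    // Hypothetical task: count the keys that start with a given prefix.
    static int countKeysWithPrefix(ConcurrentMap<String, Integer> map, String prefix) {
        int count = 0;
        for (String key : map.keySet()) {
            if (key.startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }

    // Submits one Callable per prefix to an ExecutorService and sums the results.
    static int countInParallel(ConcurrentMap<String, Integer> map, List<String> prefixes) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String prefix : prefixes) {
                Callable<Integer> task = () -> countKeysWithPrefix(map, prefix);
                futures.add(executor.submit(task));
            }
            int total = 0;
            for (Future<Integer> future : futures) {
                total += future.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    static int demo() {
        ConcurrentMap<String, Integer> hits = new ConcurrentHashMap<>();
        hits.put("us:columbus", 3);
        hits.put("us:boston", 5);
        hits.put("cn:beijing", 2);
        return countInParallel(hits, Arrays.asList("us:", "cn:"));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Nothing in `countInParallel` depends on where the map or the executor lives, which is the "non-intrusive" property the slide emphasizes.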

Hazelcast vs Hadoop
Java8 + Hazelcast (single 3.1 MB jar)            | Java + Hadoop
Pluggable persistence (HDFS, MapR, RDBMS)        | HDFS
MapReduce                                        | MapReduce
Data manipulation & querying                     | Hive (not OLTP)
In-memory parallel processing                    | Spark
Stream processing                                | Storm
Messaging                                        | Kafka
Scalable and Elastic                             | ZooKeeper
Cluster Management                               | ZooKeeper, YARN, Mesos
Elastic, simple (automatic cluster re-balancing) | N/A
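Both columns list MapReduce, so a small illustration of the map/reduce idea on in-memory data may help. This is a single-JVM sketch using only Java 8 streams; it is not the Hazelcast or Hadoop MapReduce API, and the class name, the location format (country code as the last comma-separated field) and the sample data are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Map/reduce over in-memory data with Java 8 streams (single JVM only).
class InMemoryMapReduceSketch {

    // Map step: extract the country code, assumed to be the last comma-separated field.
    // Reduce step: count occurrences per country code, sorted by key.
    static Map<String, Long> countByCountry(List<String> locations) {
        return locations.parallelStream()
                .map(loc -> loc.substring(loc.lastIndexOf(',') + 1).trim())
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "Columbus, OH, US", "Boston, MA, US", "Beijing, CN"); // made-up sample data
        System.out.println(countByCountry(sample));
    }
}
```

The parallel stream spreads the work across local cores only; distributing the same computation across cluster nodes is what a grid or Hadoop adds on top.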

Considerations
• RAM currently maxes out at ~640 GB/server (256 GB?)
• RAM still more expensive than SSDs and HDDs
• Garbage collection
Mitigations:
• Cost and capacity limitations will disappear over time
• Off-heap memory and specialized JVMs (Azul, etc.)
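To make the off-heap idea concrete: data held in a direct buffer typically lives outside the garbage-collected heap, so the collector does not have to trace it. The sketch below uses only the JDK's `ByteBuffer.allocateDirect`; it illustrates the general technique and makes no claim about how Hazelcast or Azul implement off-heap storage. `OffHeapSketch` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;

// Fixed-width store of long values kept in a direct (off-heap) buffer.
class OffHeapSketch {
    private final ByteBuffer buffer;

    OffHeapSketch(int slots) {
        // allocateDirect requests native memory rather than a heap-backed array.
        this.buffer = ByteBuffer.allocateDirect(slots * Long.BYTES);
    }

    void put(int slot, long value) {
        buffer.putLong(slot * Long.BYTES, value); // absolute write at a byte offset
    }

    long get(int slot) {
        return buffer.getLong(slot * Long.BYTES); // absolute read at a byte offset
    }

    boolean isOffHeap() {
        return buffer.isDirect();
    }

    public static void main(String[] args) {
        OffHeapSketch store = new OffHeapSketch(1024);
        store.put(7, 1_234_567_890_123L);
        System.out.println(store.get(7) + " offHeap=" + store.isOffHeap());
    }
}
```

The trade-off is that objects must be serialized into bytes on the way in and out, which is why this approach suits large, long-lived data sets better than short-lived objects.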

From Technologies to Platforms
In-Memory Data/Compute Grids are not enough:
• Backing storage: RDBMS, NoSQL, HDFS, GlusterFS, XFS
• Enterprise Message Store(s)
• Multi-Source Data Harvesting, Ingestion, Transformation
• Data Abstraction, Modeling, Querying, Visualization

The Demo: Clickstream Analytics + Intrusion Detection
[Architecture diagram; labels recovered from the slide]
A. Existing Data:
• .csv files
• GeoDataLoader
• GeoLocation Map (5M entries)
B. Real-time Events:
• Traffic Generator
• JSON/Http (“fire and forget”)
• UserActionQueue
Node 1 (JVM):
• UserAction QueueProcessor: 1. take()  2. resolveGeo()  3. putIfAbsent()
• Geo lookup example: “201.35.217.36”, “Columbus, OH, US”
• UserActionHistory Map (includes location)
Analytics clients:
• “In-Motion” Analytics Client: “countryCode=‘CN’”
• “At Rest” Analytics Client: “timestamp > now() - 30 minutes”
Backing Storage (optional): RDBMS (PostgreSQL)
Other labels: InsightMagic; JSON/Http; Analytics & Visualization; Data Abstraction & Virtualization
Messaging should be a separate cluster
Features:
• SOA Architecture
• Data Ingestion
• Complex Event Processing
• Dimension, Fact Maps
• In-Motion, At-Rest Analytics
• Backing Storage
• ACID + Txs available
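The three processor steps on the demo slide (take(), resolveGeo(), putIfAbsent()) and the two analytics queries can be sketched in a single JVM with JDK collections. Everything below is hypothetical: the demo's real classes are not shown on the slide, so the event fields, the IP-to-country-code lookup, the "??" fallback and the sample data are assumptions, and local JDK queues and maps stand in for the grid's distributed ones.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Collectors;

// Single-JVM sketch of the demo flow; all names and sample data are hypothetical.
class ClickstreamSketch {

    // Assumed event shape: the slide does not show the demo's real classes.
    static final class UserAction {
        final String id;
        final String ip;
        final long timestampMillis;
        final String countryCode; // null until resolveGeo() fills it in

        UserAction(String id, String ip, long timestampMillis, String countryCode) {
            this.id = id;
            this.ip = ip;
            this.timestampMillis = timestampMillis;
            this.countryCode = countryCode;
        }
    }

    final BlockingQueue<UserAction> userActionQueue = new LinkedBlockingQueue<>();
    final Map<String, String> geoLocationMap = new ConcurrentHashMap<>(); // IP -> country code
    final ConcurrentMap<String, UserAction> userActionHistory = new ConcurrentHashMap<>();
    final List<String> inMotionMatches = new CopyOnWriteArrayList<>();

    // Step 2: enrich the event from the dimension map ("??" for unknown IPs is an assumption).
    UserAction resolveGeo(UserAction raw) {
        String country = geoLocationMap.getOrDefault(raw.ip, "??");
        return new UserAction(raw.id, raw.ip, raw.timestampMillis, country);
    }

    // One pass of the queue processor: 1. take(), 2. resolveGeo(), 3. putIfAbsent().
    // Returns false if the thread was interrupted while waiting.
    boolean processOne() {
        try {
            UserAction enriched = resolveGeo(userActionQueue.take());
            boolean isNew = userActionHistory.putIfAbsent(enriched.id, enriched) == null;
            // "In-Motion" check, evaluated as each new event arrives: countryCode = 'CN'.
            if (isNew && "CN".equals(enriched.countryCode)) {
                inMotionMatches.add(enriched.id);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // "At Rest" query over stored history: timestamp > now() - 30 minutes.
    List<String> recentIds(long nowMillis) {
        long cutoff = nowMillis - 30L * 60 * 1000;
        return userActionHistory.values().stream()
                .filter(a -> a.timestampMillis > cutoff)
                .map(a -> a.id)
                .sorted()
                .collect(Collectors.toList());
    }

    static String demo() {
        long now = 10_000_000L; // fixed clock so the result is repeatable
        ClickstreamSketch grid = new ClickstreamSketch();
        grid.geoLocationMap.put("10.0.0.1", "US");
        grid.geoLocationMap.put("10.0.0.2", "CN");
        grid.userActionQueue.add(new UserAction("a1", "10.0.0.1", now - 5 * 60_000L, null));
        grid.userActionQueue.add(new UserAction("a2", "10.0.0.2", now - 45 * 60_000L, null));
        grid.userActionQueue.add(new UserAction("a3", "10.0.0.2", now - 60_000L, null));
        for (int i = 0; i < 3; i++) {
            grid.processOne();
        }
        return grid.recentIds(now) + " " + grid.inMotionMatches;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The dimension map (geo lookup) and the fact map (action history) mirror the "Dimension, Fact Maps" feature on the slide; the in-motion check runs per event, while the at-rest query scans what has already been stored.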

Q & A
“Intellectuals solve problems. Geniuses prevent them.” 
-- Albert Einstein 
More questions? Feel free to contact me at: 
Jacek Kruszelnicki, Numatica Corporation 
E-mail: j a c e k@numatica.com (remove spaces) 
Phone: 781 756 8064
