Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop

Jacek Kruszelnicki, Numatica Corporation
E-mail: j a c e k@numatica.com (remove spaces)
Phone: 781 756 8064
Scalable Software. Real-time Big Data Analytics.

Presenter
Jacek Kruszelnicki
President & CEO, Numatica Corporation
Jacek:
 20+ years of shipping enterprise software
 Biased towards simplicity/design elegance. Vendor-neutral.
 Author, mentor & conference speaker
 M.S. Computer Science, focused on business value of IT
Focus:
 Distributed, high-throughput, low-latency software
 Real-time (“Fast”) Big Data
Services:
 IT Strategy and Planning
 Software Architecture & Design
 Proof of Concept, Quick Start programs
 Software Development
 Training, Seminars
Copyright (C) 2014 Numatica Corporation. All Rights Reserved.

Agenda
 Hadoop is not a universal, or inexpensive, Big Data stack
 Technical requirements for a flexible Big/Fast Data stack
 Solutions thought to be alternatives to Hadoop
 Why In-Memory Data Grids are a good fit for Big/Fast Data
 How Hazelcast meets the Big/Fast Data requirements
 Focus on architectures, but demo/code samples provided
“Any intelligent fool can make things bigger and more complex... It takes a touch of genius - and a lot of courage to move in the opposite direction.” - E.F. Schumacher

Not on Agenda
 Big Data Tutorial
 Hadoop or any other technology tutorial
 In-depth overview of Big Data market
 Big Data Analytics -> Business Insights
 CAP Theorem discussion

Fast Data - Business Drivers
Operational Intelligence:
 Analyze the stream of business activities and external stimuli on the fly
 React to them (preferably) instantaneously
 Real-time data stream processing is critical; otherwise business value is lost
Real-time Big (Fast) Data Analytics examples:
 Dynamic pricing (e-commerce)
 High-frequency trading
 Network security threats
 Credit card fraud prevention
 Factory floor data collection, RFID
 Mobile infrastructure, machine-to-machine (M2M) applications
 Prescriptive or location-based applications
 Real-time dashboards, alerts, and reports

Big Data – The 3 Vs Re-arranged
Common definition: Data sets too large and complex to process using standard DB tools and data processing applications (in short: “Will not fit into MS Excel”)
Volume
 Few Googles and Facebooks out there
 Typical data < 100 TB
Variety
 Structured, semi-structured
 At-Rest, In-Motion
 Variety of formats
Velocity
 Millions of events per second
 Need to re-run analytics frequently (dashboards, etc.)
 Frequently changing data (at rest or in motion)
 Stale data provides little business value
 Also need to do off-line processing (machine learning)

Big...and Fast
Data In-Motion:
 Rapidly changing, massive stream(s) of data (events)
 Multiple sources, formats
 Needs to be processed in-flight
 Events persisted or not, depending on business value
Data At-Rest:
 Data already persisted, change notification
 Multiple formats
 Re-ingest, re-process
Software stacks to support “In-Motion” and “At-Rest” Big Data are needed.

“Big Elephant in the room...”
 Currently the large-scale data analysis tool of choice
 De facto standard, almost synonymous with Big Data
 “Nobody ever got fired for using Hadoop”
Problems:
 Very complex/expensive to deploy and run -> high TCO
 MapReduce slow, batch-oriented, “intrusive”
 No stream processing
 No support for in-memory processing
 Closely coupled with HDFS (third-party solutions of varying quality)
 SQL-on-Hadoop limited and slow
http://hortonworks.com/blog/install-hadoop-windows-hortonworks-data-platform-2-0/
http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html/
The Hadoop Stack:

Hadoop... and the kitchen sink
Not shown:
JobTracker, TaskTracker, ...
Avro (Serialization)
Chukwa (logs, incremental)
EMR
BigTop
Spark
Impala (vs Hive)
19 shown, up to 24
Violates first rule of distributed programming: DO NOT DISTRIBUTE (unnecessarily).

Hadoop 2.0 - High TCO remains
Improved:
YARN, MapReduce separated
Removed NameNode/JobTracker as SPOF?
Unchanged:
More complexity
Low-level abstractions
Still Master/Slave

Hadoop 2.0 - High TCO remains
Hadoop-based stack example from the Web. Very complicated:
Berkeley Big-data Analytics Stack (BDAS)

Hadoop Disillusionment Phase?
Hadoop hyped as a universal, disruptive big data solution
MapReduce (MR) is on its way out at Google.
http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/
Data Scientists: 76% felt Hadoop is too slow and takes too much effort to program
http://www.cio.com/article/2449814/big-data/data-scientists-frustrated-by-data-variety-find-hadoop-limiting.html

Some Hadoop Alternatives
Hadoop “extensions”
 Spark
 Shark (Hive on Spark)
 Storm
 Kafka
Still need many of Hadoop's modules: HDFS, YARN, Pig, Hive, HBase, JobTracker, TaskTracker, ...
MPP DBs
 Teradata
 Vertica
 Greenplum
Proprietary, complex, VERY expensive, query language not Turing complete
NoSQL DBs
Column: Cassandra, HBase, etc.
Document/K-V: MongoDB, CouchDB, Riak, etc.
In-Memory: VoltDB, etc.
No ACID/referential integrity
No triggers, foreign keys
Mongo/Couch -> JSON-centered, Cassandra - complex
Query language proprietary, subset of SQL
VoltDB – precoded stored procs, no ad-hoc
Not Turing complete
In-Memory Data Grids
 Coherence (commercial)
 Hazelcast (OSS)
 Terracotta (commercial)
 Gemfire (commercial)
 GridGain (OSS/commercial)
 Gigaspaces XAP (OSS)

Big+Fast Alternatives
Storm/Kafka
 Intrusive (introduces exotic abstractions): Streams, Spouts, Bolts, Tasks, Workers, Stream Groups, Topologies
 Low-level: Shell scripting, Clojure code here and there
 Complex Admin: Everything requires ZooKeeper, Nimbus is a SPOF
https://news.ycombinator.com/item?id=8416455

Big+Fast Alternatives part 2
Lambda Architecture example from the Web. Lots of moving parts:

Big, Fast Stack - Technical Requirements
A successful Big + Fast Data stack should have:
Architectural simplicity and elegance [lowers Total Cost of Ownership (TCO)]
Low transactional and data latency [response < 1 s, “fresh” or real-time data]
High throughput, with both scale-up and scale-out scalability
High-throughput Data Stream/Complex Event Processing
SQL-like querying, ACID, transactions (including XA)
Distributed Execution Framework
High Availability and Fault Tolerance
Stateless, decentralized, elastic cluster management

In-Memory Grids for Big And Fast Data
 Technology moving forward (RAM vs. disk: roughly 6x faster sequential access, 100,000x faster random access):
 RAM is the new disk (ns)
 DISK is the new tape (ms)
 Databases (RDBMS, NoSQL) are not enough
 SQL or SQL-like languages are not Turing complete
 Object-oriented/functional/parallel programming abstractions needed
 In-memory data is volatile/perishable?
 In-memory data can be persisted, if so desired
 No need to achieve archival durability.
 Most Big Data brought in already stored somewhere else
 Stream data may not be relevant for long, typically limited retention time
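
The limited-retention point can be sketched in a few lines of plain Java. This is a minimal single-JVM illustration, not grid code: the class name `RetentionSketch`, its methods, and the 30-minute window in the usage note are assumptions made for the example, and a real grid would normally handle expiry through its own configuration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: keep stream events in memory only for a fixed retention window.
final class RetentionSketch {
    // Event id -> arrival time in milliseconds.
    private final Map<String, Long> arrivalMillis = new ConcurrentHashMap<>();
    private final long retentionMillis;

    RetentionSketch(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    // Remember when an event arrived.
    void record(String eventId, long nowMillis) {
        arrivalMillis.put(eventId, nowMillis);
    }

    // Drop every event older than the retention window and return how many remain.
    int evictExpired(long nowMillis) {
        arrivalMillis.entrySet().removeIf(e -> nowMillis - e.getValue() > retentionMillis);
        return arrivalMillis.size();
    }

    boolean contains(String eventId) {
        return arrivalMillis.containsKey(eventId);
    }
}
```

With a 30-minute window, an event recorded at minute 0 is dropped by an eviction pass at minute 31, while one recorded at minute 29 survives.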

Why Hazelcast?
The most intellectually elegant distributed in-memory data/compute grid.
Distributed Execution Framework (extension of Java’s ExecutorService)
Distributed Queries (SQL/Predicate), data affinity (execution on a specific node)
Elastic cluster management, Java API included
Auto-discovery of nodes and re-balancing
Minimalism as design aesthetic:
Non-intrusive
No dependencies
3.1 MB single-jar library
Apache License 2.0, commercial extensions + support
Implements Java APIs (Map, List, Set, Queue, Lock) in a distributed manner
Peer-to-peer architecture, no single point of failure
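
To make the "extension of Java's ExecutorService" point concrete, here is a minimal sketch that uses only `java.util.concurrent` in a single JVM. It is not Hazelcast code. The class `ExecutorSketch` and the task `CountAbove` are hypothetical names. The task is `Serializable` on the assumption that a grid has to ship it to another node. In a cluster, the local thread pool would be replaced by an executor obtained from the grid.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Single-JVM stand-in for a distributed executor (illustration only).
final class ExecutorSketch {

    // Counts how many values in one slice of the data exceed a threshold.
    static final class CountAbove implements Callable<Integer>, Serializable {
        private static final long serialVersionUID = 1L;
        private final int[] slice;
        private final int threshold;

        CountAbove(int[] slice, int threshold) {
            this.slice = slice;
            this.threshold = threshold;
        }

        @Override
        public Integer call() {
            int count = 0;
            for (int v : slice) {
                if (v > threshold) count++;
            }
            return count;
        }
    }

    // Splits the data into at most `parts` slices (parts >= 1), runs one task
    // per slice, and sums the partial counts.
    static int countAbove(int[] data, int threshold, int parts) {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            int chunk = (data.length + parts - 1) / parts;
            for (int start = 0; start < data.length; start += chunk) {
                int end = Math.min(start + chunk, data.length);
                futures.add(pool.submit(
                        new CountAbove(Arrays.copyOfRange(data, start, end), threshold)));
            }
            int total = 0;
            for (Future<Integer> f : futures) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The calling code only sees `submit` and `Future`, which is why moving from a local pool to a cluster-wide executor need not change the task itself.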

Hazelcast vs Hadoop
Java8 + Hazelcast (single 3.1 MB jar)             | Java + Hadoop
Pluggable persistence (HDFS, MapR, RDBMS)         | HDFS
MapReduce                                         | MapReduce
Data manipulation & querying                      | Hive (not OLTP)
In-memory parallel processing                     | Spark
Stream processing                                 | Storm
Messaging                                         | Kafka
Scalable and Elastic                              | ZooKeeper
Cluster Management                                | ZooKeeper, YARN, Mesos
Elastic, simple (automatic cluster re-balancing)  | N/A

Considerations
RAM currently maxes out at ~640GB/server (256GB?)
RAM still more expensive than SSDs and HDDs
Garbage collection
Cost and capacity limitations will disappear over time
Off-heap memory and specialized JVMs (Azul, etc.)

From Technologies to Platforms
In-Memory Data/Compute Grids not enough:
 Backing Storage: RDBMS, NoSQL, HDFS, GlusterFS, XFS
 Enterprise Message Store(s)
 Multi-Source Data Harvesting, Ingestion, Transformation
 Data Abstraction, Modeling, Querying, Visualization

The Demo: Clickstream Analytics + Intrusion Detection
A. Existing Data
.csv files
GeoDataLoader
GeoLocation Map (5M entries): “201.35.217.36” -> “Columbus,OH, US”
B. Real-time Events
Traffic Generator
JSON/Http (“fire and forget”)
UserActionQueue
Node 1 (JVM): UserAction QueueProcessor
1. take()
2. resolveGeo()
3. putIfAbsent()
UserActionHistory Map (includes location)
Backing Storage (optional): RDBMS (PostgreSQL)
InsightMagic (JSON/Http)
Analytics & Visualization
Data Abstraction & Virtualization
“In-Motion” Analytics Client: “countryCode=‘CN’”
“At Rest” Analytics Client: “timestamp > now() - 30 minutes”
Features:
SOA Architecture
Data Ingestion
Complex Event Processing
Dimension, Fact Maps
In-Motion, At-Rest Analytics
Backing Storage
ACID + Txs available
Messaging should be a separate cluster
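
The three numbered steps on the demo slide (take(), resolveGeo(), putIfAbsent()) can be sketched with the standard `java.util.concurrent` collections. This is a single-JVM illustration under stated assumptions, not the demo's actual source: the class `ClickstreamSketch`, the `UserAction` fields, and the helper methods are hypothetical. The two query strings from the slide are approximated with plain Java lambdas.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Collectors;

// Hypothetical single-JVM sketch of the demo's ingest path and its two kinds of query.
final class ClickstreamSketch {

    // Assumed event shape; the slides do not show the real classes.
    static final class UserAction {
        final String id;
        final String ip;
        final long timestampMillis;
        final String location;     // null until the queue processor enriches the event
        final String countryCode;  // derived from the location

        UserAction(String id, String ip, long timestampMillis, String location, String countryCode) {
            this.id = id;
            this.ip = ip;
            this.timestampMillis = timestampMillis;
            this.location = location;
            this.countryCode = countryCode;
        }
    }

    // Dimension map (IP -> location), event queue, and fact map (id -> enriched event).
    private final Map<String, String> geoLocation = new ConcurrentHashMap<>();
    private final BlockingQueue<UserAction> userActionQueue = new LinkedBlockingQueue<>();
    private final ConcurrentMap<String, UserAction> userActionHistory = new ConcurrentHashMap<>();

    void loadGeo(String ip, String location) {
        geoLocation.put(ip, location);
    }

    void publish(String id, String ip, long timestampMillis) {
        userActionQueue.add(new UserAction(id, ip, timestampMillis, null, null));
    }

    String resolveGeo(String ip) {
        return geoLocation.getOrDefault(ip, "unknown");
    }

    // Processes exactly one queued event; blocks if the queue is empty.
    void processNext() {
        try {
            UserAction raw = userActionQueue.take();                      // 1. take()
            String location = resolveGeo(raw.ip);                         // 2. resolveGeo()
            String country = location.substring(location.lastIndexOf(',') + 1).trim();
            userActionHistory.putIfAbsent(raw.id,                         // 3. putIfAbsent()
                    new UserAction(raw.id, raw.ip, raw.timestampMillis, location, country));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Stand-in for the slide's country-code query.
    List<String> idsFromCountry(String countryCode) {
        return userActionHistory.values().stream()
                .filter(a -> a.countryCode.equals(countryCode))
                .map(a -> a.id).sorted().collect(Collectors.toList());
    }

    // Stand-in for the slide's "timestamp > now() - 30 minutes" query.
    List<String> idsSince(long cutoffMillis) {
        return userActionHistory.values().stream()
                .filter(a -> a.timestampMillis > cutoffMillis)
                .map(a -> a.id).sorted().collect(Collectors.toList());
    }
}
```

Because the sketch is written against `BlockingQueue` and `ConcurrentMap`, the processing logic would not need to change if those fields were backed by distributed implementations of the same interfaces.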

Q & A
“Intellectuals solve problems. Geniuses prevent them.”
-- Albert Einstein
More questions? Feel free to contact me at:
Jacek Kruszelnicki, Numatica Corporation
E-mail: j a c e k@numatica.com (remove spaces)
Phone: 781 756 8064
