Enabling Key Business Advantage from Big Data
through Advanced Ingest Processing
Ronald S. Indeck, PhD
President and Founder
VelociData, Inc.
Solving the Need for Speed in
Big DataOps
Today’s Discussion
• Motivations for Advanced Processing
• Total Data Challenges
• Economical Parallelism for IT is Arriving
• Heterogeneous System Architectures (HSA)
• HSA Implementation and Business Benchmarks
• Questions
Big Data?
The Urgency for Gaining Answers in Seconds
Companies that Embrace Analytics
Accelerate Performance
“Value Integrators” achieve higher
business performance:
‒ 20 times the EBITDA growth
‒ 50% more revenue growth
• “Large-scale data gathering and analytics are quickly becoming a new frontier of competitive
differentiation” – HBR
• The challenge for IT is to economically provide real-time, quality data that supports business analytics
and meets time-bound service-level requirements while data volumes double every 12 months
Analytics is creating a
competitive advantage
Recognizing “Total Data” Challenges
• Bloor: Databases are more than adequate for the use cases they are
designed to support
• Consider Big Data AND Relational, not OR … think “Total Data”
• The critical unsolved challenge is breaking Total Data flow bottlenecks
• Total Data challenges
• Data volumes exploding
• Data velocity and variety growing
• Data must quickly move between disparate systems
• Processing high volumes on mainframes is expensive
• No spare resources for critical encryption / masking
• Improving or measuring data quality is challenging
Conventional Approaches
• Add more cores and memory to the existing platform
• Push processing into MPP (Teradata, Netezza, …)
• Change the infrastructure (Oracle Exadata, …)
• Use distributed platforms (Hadoop, ...)
These require new skills, time, capital, management, support,
risk … and fail to truly solve the Total Data flow problem
Parallelism in IT Processing is Compelling
• Amdahl’s Law caps overall speedup at the serial fraction of the work (see the sketch below)
• High Performance Computing history
• Systems were expensive
• Unique tools and training were required
• Scaling performance is often sub-linear
• Issues with timing and thread synchronization
HPC has struggled for 40 years to deliver widespread accessibility, largely because of cost and poor
abstractions, development tools, and design environments
If we could just deliver accessibility at an affordable cost …
• Hardware is now becoming inexpensive
• Application development improvements are still needed to enable productivity
→ Abstract the parallelism away by making streaming the implementation paradigm
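A back-of-envelope illustration of why scaling is sub-linear (plain Python; the 95% parallel fraction is illustrative, not a measured figure):

```python
# Amdahl's Law: if a fraction p of the work parallelizes across n workers,
# overall speedup = 1 / ((1 - p) + p / n).
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for n in (2, 8, 64, 1024):
    # Even with 95% of the work parallelized, speedup saturates near 20x.
    print(n, round(amdahl_speedup(0.95, n), 1))
```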
Complementary Approach: Heterogeneous System Architecture
• Leverage a variety of compute resources
• Not just parallel threads on identical resources
• Right resources at the right times
• Functional elements use appropriate processing components where needed
• Accommodate stream processing
• Source → processing → target
• Streaming data model enables pipelining and data-flow acceleration (a minimal sketch follows this list)
• Embrace fine-grained pipeline / functional parallelism
• Especially data / direct parallelism
• Separate latency and throughput
• Engineered system
• Manage thread, memory, and resource timing and contention
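A minimal sketch of the streaming/pipelining idea in plain Python (names are illustrative, not VelociData's API): each stage consumes records as they arrive, so downstream work overlaps upstream I/O.

```python
import csv, io

def source(fileobj):
    # Stage 1: stream records one at a time instead of loading the whole file.
    yield from csv.reader(fileobj)

def transform(records):
    # Stage 2: normalize fields in flight (illustrative transformation).
    for rec in records:
        yield [field.strip().upper() for field in rec]

def sink(records):
    # Stage 3: hand records to the target as they emerge from the pipeline.
    for rec in records:
        print(rec)

data = io.StringIO("alice, x\nbob, y\n")
sink(transform(source(data)))  # source -> processing -> target
```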
Heterogeneous System Architecture
• Standard CPUs: general purpose, “not bad at everything”; good branch prediction and fast access to
large memory
• Graphics boards (GPUs): thousands of cores performing very specific tasks; excellent at matrix and
floating-point work
• FPGA coprocessors: fully customizable with extreme opportunities for parallelism; excel at bit
manipulation for regex, cryptography, searching, …
Example: Risk Modeling Application
• Compute “value at risk” for a portfolio of 1024 stocks
• Evaluate using Monte Carlo simulation with Brownian-motion random walks (a toy sketch follows below)
• Execute 1 million trials and aggregate results; 1 trial equals 1024 random walks
• Double-precision computation
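For orientation, a toy CPU-only version of this computation (a sketch assuming NumPy, geometric Brownian motion, and made-up market parameters; the appliance's actual kernels are not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
n_stocks, n_trials, horizon = 1024, 10_000, 1.0   # small trial count for the sketch
mu, sigma, s0 = 0.05, 0.2, 100.0                  # illustrative market parameters

# One Brownian-motion walk per stock per trial (1 trial = 1024 random walks).
z = rng.standard_normal((n_trials, n_stocks))
end_prices = s0 * np.exp((mu - 0.5 * sigma**2) * horizon
                         + sigma * np.sqrt(horizon) * z)

pnl = end_prices.sum(axis=1) - n_stocks * s0      # portfolio P&L per trial
var_99 = -np.percentile(pnl, 1)                   # 99% value at risk
print(f"99% VaR: {var_99:,.0f}")
```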
Example: Risk Modeling Performance Results
• Baseline (CPU-only): 450 thousand walks/second → 37 minutes to execute 1 billion walks
• FPGA + GPU + CPU: 140 million walks/second → 6 seconds for 1 billion walks
• Speedup of 370x
• Other financial Monte Carlo simulations behave similarly
*First use of GPU, FPGA, and CPU in one application
[Diagram: application stages 1-3 mapped across an FPGA, a graphics engine, and a chip multiprocessor]
Stream Processing as an HSA Appliance
• Bundles software, firmware, and hardware into an appliance
• Delivers the right compute resource (CPU, GPU, or FPGA) to the right process at the right time
• Uses other system resources effectively
• High-level abstraction: no need to code, re-train, or acquire new skillsets
• Promotes stream processing for real-time action
• Sources → processing → targets
• Streaming data model enables pipelining for data-flow acceleration
Example: VelociData Solution Palette

| VelociData Suite | VelociData Solution | Examples | Conventional (records/second) | VelociData (records/second) |
|---|---|---|---|---|
| Data Transformation | Lookup and Replace | Data enrichment by populating fields from a master file, dictionary translations, etc. (e.g., CP → Cardiopulmonologist) | 3,000-6,000 | 600,000 |
| Data Transformation | Type Conversions | XML → Fixed; Binary → Char; date/time formats | 1,000-2,000 | 800,000 |
| Data Transformation | Format Conversions | Rearrange, add, drop, merge, split, and resize fields to change layouts | 1,000-10,000 | 650,000 |
| Data Transformation | Key Generation | Hash multiple field values into a unique key, e.g., SHA-2 (see the sketch after this table) | 3,000-20,000 | > 1,000,000 |
| Data Transformation | Data Masking | Obfuscate data for non-production uses: persistent or dynamic; format-preserving; AES-256 | 500-10,000 | > 1,000,000 |
| Data Quality | USPS Address Processing | Standardization, verification, and cleansing (CASS certification in process) | 600-2,000 | 400,000 |
| Data Quality | Domain Set Validation | Validate a value against a list of acceptable values (e.g., all product codes at a retailer; all countries in the world) | 1,000-3,000 | 750,000 |
| Data Quality | Field Content Validation | Validate based on patterns such as emails, dates, and phone numbers | 1,000-3,000 | > 1,000,000 |
| Data Quality | Field Content Validation | Data type validation and bounds checking | 3,000-6,000 | > 1,000,000 |
| Data Platform Conversion | Mainframe Data Conversion | Copybook parsing and data layout discovery; EBCDIC, COMP, COMP-3, … → ASCII, integer, float, … | 200-800 | > 200,000 |
| Data Sort | Accelerated Data Sort | Sort data using complex sort keys from multiple fields within records | 7,000-20,000 | 1,000,000 |

Results are system-dependent; figures are intended as order-of-magnitude comparisons.
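As a rough illustration of the key-generation row (a sketch with hypothetical field names, not VelociData's implementation), hashing several field values into a single SHA-2 key:

```python
import hashlib

def make_key(*fields: str) -> str:
    # Join fields with a separator that cannot appear inside a value,
    # then hash with SHA-256 (a SHA-2 family function) for a fixed-size key.
    payload = "\x1f".join(fields).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Hypothetical customer-record fields.
print(make_key("DOE", "JOHN", "1970-01-01", "63130"))
```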
Example of Common ETL Bottlenecks
[Diagram: sources (CSV, mainframe, XML, RDBMS, social media, sensor, Hadoop) feed an ETL server
running Tasks #1-#8 against a staging DB in the Extract-Transform-Load flow; several Transform-stage
tasks are marked as candidates for acceleration; targets include Hadoop, the ETL server, a data
warehouse, database appliances, BI tools, and the cloud]
Example ETL Processes Offloaded
[Diagram: Tasks #1-#5 are offloaded from the ETL server, which keeps Tasks #6-#8 and the staging DB;
existing input interfaces (CSV, mainframe, XML, RDBMS, social media, sensor, Hadoop) are kept,
bottlenecks are removed, ETL server workload is reduced, and total processing time improves; targets
remain Hadoop, the ETL server, a data warehouse, database appliances, BI tools, and the cloud]
Example Mainframe-to-Hadoop Workflow
• Simple, configuration-driven workflow (a toy sketch follows below)
• Sample shows Mainframe → HDFS
• Data are validated, cleansed, reformatted, enriched, …, along the way
• Enables landing analytics-ready data as fast as it can move across the wire
• Workflow can also run in reverse to return processed data to the mainframe

Mainframe Input → Validation → Key Generation → Formatter → Lookup → Address Standardization → CSV Out
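A toy sketch of what "configuration-driven" can mean here (stage names mirror the workflow above; the stage registry and config format are hypothetical, not VelociData's):

```python
# Each stage is a record-to-record function; a config lists the stage order.
def validate(rec):  assert rec.get("id"), "missing id"; return rec
def key_gen(rec):   rec["key"] = f'{rec["id"]}-{rec.get("zip", "")}'; return rec
def formatter(rec): rec["name"] = rec["name"].title(); return rec

STAGES = {"validate": validate, "key_gen": key_gen, "format": formatter}
CONFIG = ["validate", "key_gen", "format"]   # workflow defined as data, not code

def run(records, config):
    for rec in records:
        for stage in config:
            rec = STAGES[stage](rec)
        yield rec

rows = [{"id": "42", "name": "ada lovelace", "zip": "63130"}]
print(list(run(rows, CONFIG)))
```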
Wire-rate Platform Integration
Enable fast data access between systems:
• MPP platforms (e.g., Teradata): format and improve data for ready insertion into data-analytics
architectures; VelociData enables real-time data access by Teradata for operational analytics
• ETL server: preprocess data for fast movement into and out of data-integration tools
• Mainframe: conversion into and out of EBCDIC and packed-decimal formats (a decoding sketch
follows this list)
• Hadoop: convert data to ASCII and improve quality in flight; VelociData feeds Hadoop pre-processed,
quality data for real-time BI efforts
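For context on the mainframe conversions above, a minimal sketch of decoding EBCDIC text and COMP-3 packed decimal in Python (assuming the common cp037 EBCDIC code page; real copybook-driven conversion handles many more cases):

```python
def ebcdic_to_ascii(raw: bytes) -> str:
    # cp037 is one common EBCDIC code page; real jobs pick the right one.
    return raw.decode("cp037")

def unpack_comp3(raw: bytes, scale: int = 0) -> float:
    # COMP-3: two decimal digits per byte; the final nibble is the sign
    # (0xD = negative, 0xC/0xF = positive).
    digits = "".join(f"{b:02x}" for b in raw)
    value, sign = int(digits[:-1]), digits[-1]
    if sign == "d":
        value = -value
    return value / (10 ** scale)

print(ebcdic_to_ascii(b"\xc8\xc5\xd3\xd3\xd6"))   # -> "HELLO"
print(unpack_comp3(b"\x12\x34\x5c", scale=2))     # -> 123.45
```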
Enabling Three Layers of Data Access
Sources: sensors, weblogs, transactions, mainframe, Hadoop, social media, RDBMS, …
• VelociData delivers Hadoop pre-processed, quality data to keep “the lake” clean
• VelociData enables real-time data access for immediate analytics and visualization
• VelociData feeds databases and warehouses pre-analytic, aggregated data for operational analytics
Wire-rate transformations and convergence of fresh and historical data
Accessing Real-time and Historical Data
• Real-time analysis for competitive advantage: enabling the speed of business to match business
opportunities
• Integrating historical data for operational excellence: informing traditional BI with real-time inputs
[Diagram: conventional batch-oriented BI, real-time operational analytics, iterative modeling,
business excellence]
Stream Processing AND Hadoop
Leveraging stream processing with batch-oriented Hadoop
• Access to more data for analytics
• Process data on ingest (also land raw data if desired)
• Transformation
• Cleansing
• Security
• Never read a COBOL copybook again
• Stream sort for integrating data, aggregation, and dedupe (see the sketch below)
• …
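A small illustration of stream sort and dedupe (a sketch merging pre-sorted shards with Python's heapq; the keys and shard layout are hypothetical):

```python
import heapq
from itertools import groupby

# Pre-sorted shards arriving from different sources (hypothetical keys).
shard_a = [("k1", "mainframe"), ("k3", "mainframe")]
shard_b = [("k1", "weblog"), ("k2", "weblog")]

# Merge the sorted streams without materializing them, then dedupe by key,
# keeping the first record seen for each key.
merged = heapq.merge(shard_a, shard_b, key=lambda r: r[0])
deduped = (next(group) for _, group in groupby(merged, key=lambda r: r[0]))
print(list(deduped))  # [('k1', 'mainframe'), ('k2', 'weblog'), ('k3', 'mainframe')]
```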
Examples of Data Challenges Being Solved
• Pharmaceutical discovery query is reduced from 8 days to 20 minutes
• Retailer now integrates full customer data from in-store, on-line, and mobile sources in
real-time (processing 50,000 records/s, up from 100/s)
• Property casualty company shortens by five-fold a daily task of processing 540 million
records to enable more accurate real-time quoting
• Credit card company reduces mainframe costs and improves analytics performance by
integrating historical and fresh data into Hadoop at line rates
• Financial processing network masks 5 million fields/s of production data to sell
opportunity information to retailers
• To enable better customer support, a health benefits provider shortens a data
integration process from 16 hours to 45 seconds
• Billions of records with multi-field keys are sorted at nearly a million records/s for
analytics and data quality
• USPS address standardization at 10 billion/hour for data cleansing on ingest
Thank You!
www.velocidata.com info@velocidata.com
Questions?