Enabling Key Business Advantage from Big Data
through Advanced Ingest Processing
Ronald S. Indeck, PhD
President and Founder
VelociData, Inc.
Solving the Need for Speed in
Big DataOps
Today’s Discussion
• Motivations for Advanced Processing
• Total Data Challenges
• Economical Parallelism for IT is Arriving
• Heterogeneous System Architectures (HSA)
• HSA Implementation and Business Benchmarks
• Questions
Big Data?
The Urgency for Gaining Answers in Seconds
Companies that Embrace Analytics
Accelerate Performance
“Value Integrators” achieve higher
business performance:
‒ 20 times the EBITDA growth
‒ 50% more revenue growth
• “Large-scale data gathering and analytics are quickly becoming a new frontier of competitive
differentiation” – HBR
• The challenge for IT is to economically provide real-time, quality data that supports business analytics
and meets time-bound service-level requirements while data volumes double every 12 months
Analytics is creating a
competitive advantage
Recognizing “Total Data” Challenges
• Bloor: Databases are more than adequate for the use cases they are
designed to support
• Consider Big Data AND Relational, not OR … think “Total Data”
• The critical unsolved challenge is breaking Total Data flow bottlenecks
• Total Data challenges
• Data volumes exploding
• Data velocity and variety growing
• Data must quickly move between disparate systems
• Processing high volumes on mainframes is expensive
• No spare resources for critical encryption / masking
• Improving or measuring data quality is challenging
Conventional Approaches
• Add more cores and memory to the existing platform
• Push processing into MPP (Teradata, Netezza, …)
• Change the infrastructure (Oracle Exadata, …)
• Use distributed platforms (Hadoop, ...)
These require new skills, time, capital, management, support,
risk … and fail to truly solve the Total Data flow problem
Parallelism in IT Processing is Compelling
• Amdahl’s Law caps overall speedup at the serial fraction of the work (see the sketch below)
• High Performance Computing history
• Systems were expensive
• Unique tools and training were required
• Scaling performance is often sub-linear
• Issues with timing and thread synchronization
HPC has struggled for 40 years to deliver widespread accessibility, largely because of cost and poor
abstractions, development tools, and design environments
If we could just deliver accessibility at an affordable cost …
• Hardware is now becoming inexpensive
• Application development improvements are still needed to enable productivity
→ Abstract the parallelism away by making streaming the implementation paradigm
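A back-of-envelope illustration of why scaling is sub-linear (plain Python; the 95% parallel fraction is illustrative, not a measured figure):

```python
# Amdahl's Law: if a fraction p of the work parallelizes across n workers,
# overall speedup = 1 / ((1 - p) + p / n).
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for n in (2, 8, 64, 1024):
    # Even with 95% of the work parallelized, speedup saturates near 20x.
    print(n, round(amdahl_speedup(0.95, n), 1))
```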
Complementary Approach: Heterogeneous System Architecture
• Leverage a variety of compute resources
• Not just parallel threads on identical resources
• Right resources at the right times
• Functional elements use appropriate processing components where needed
• Accommodate stream processing
• Source → processing → target
• Streaming data model enables pipelining and data-flow acceleration (a minimal sketch follows this list)
• Embrace fine-grained pipeline / functional parallelism
• Especially data / direct parallelism
• Separate latency and throughput
• Engineered system
• Manage thread, memory, and resource timing and contention
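A minimal sketch of the streaming/pipelining idea in plain Python (names are illustrative, not VelociData's API): each stage consumes records as they arrive, so downstream work overlaps upstream I/O.

```python
import csv, io

def source(fileobj):
    # Stage 1: stream records one at a time instead of loading the whole file.
    yield from csv.reader(fileobj)

def transform(records):
    # Stage 2: normalize fields in flight (illustrative transformation).
    for rec in records:
        yield [field.strip().upper() for field in rec]

def sink(records):
    # Stage 3: hand records to the target as they emerge from the pipeline.
    for rec in records:
        print(rec)

data = io.StringIO("alice, x\nbob, y\n")
sink(transform(source(data)))  # source -> processing -> target
```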
Heterogeneous System Architecture
• Standard CPUs: general purpose, “not bad at everything”; good branch prediction and fast access to
large memory
• Graphics boards (GPUs): thousands of cores performing very specific tasks; excellent at matrix and
floating-point work
• FPGA coprocessors: fully customizable with extreme opportunities for parallelism; excel at bit
manipulation for regex, cryptography, searching, …
Example: Risk Modeling Application
• Compute “value at risk” for a portfolio of 1024 stocks
• Evaluate using Monte Carlo simulation with Brownian-motion random walks (a toy sketch follows below)
• Execute 1 million trials and aggregate results; 1 trial equals 1024 random walks
• Double-precision computation
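For orientation, a toy CPU-only version of this computation (a sketch assuming NumPy, geometric Brownian motion, and made-up market parameters; the appliance's actual kernels are not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
n_stocks, n_trials, horizon = 1024, 10_000, 1.0   # small trial count for the sketch
mu, sigma, s0 = 0.05, 0.2, 100.0                  # illustrative market parameters

# One Brownian-motion walk per stock per trial (1 trial = 1024 random walks).
z = rng.standard_normal((n_trials, n_stocks))
end_prices = s0 * np.exp((mu - 0.5 * sigma**2) * horizon
                         + sigma * np.sqrt(horizon) * z)

pnl = end_prices.sum(axis=1) - n_stocks * s0      # portfolio P&L per trial
var_99 = -np.percentile(pnl, 1)                   # 99% value at risk
print(f"99% VaR: {var_99:,.0f}")
```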
Example: Risk Modeling Performance Results
• Baseline (CPU-only): 450 thousand walks/second → 37 minutes to execute 1 billion walks
• FPGA + GPU + CPU: 140 million walks/second → 6 seconds for 1 billion walks
• Speedup of 370x
• Other financial Monte Carlo simulations behave similarly
*First use of GPU, FPGA, and CPU in one application
[Diagram: application stages 1-3 mapped across an FPGA, a graphics engine, and a chip multiprocessor]
Stream Processing as an HSA Appliance
• Bundles software, firmware, and hardware into an appliance
• Delivers the right compute resource (CPU, GPU, or FPGA) to the right process at the right time
• Uses other system resources effectively
• High-level abstraction: no need to code, re-train, or acquire new skillsets
• Promotes stream processing for real-time action
• Sources → processing → targets
• Streaming data model enables pipelining for data-flow acceleration
Example: VelociData Solution Palette

| VelociData Suite | VelociData Solution | Examples | Conventional (records/second) | VelociData (records/second) |
|---|---|---|---|---|
| Data Transformation | Lookup and Replace | Data enrichment by populating fields from a master file, dictionary translations, etc. (e.g., CP → Cardiopulmonologist) | 3,000-6,000 | 600,000 |
| Data Transformation | Type Conversions | XML → Fixed; Binary → Char; date/time formats | 1,000-2,000 | 800,000 |
| Data Transformation | Format Conversions | Rearrange, add, drop, merge, split, and resize fields to change layouts | 1,000-10,000 | 650,000 |
| Data Transformation | Key Generation | Hash multiple field values into a unique key, e.g., SHA-2 (see the sketch after this table) | 3,000-20,000 | > 1,000,000 |
| Data Transformation | Data Masking | Obfuscate data for non-production uses: persistent or dynamic; format-preserving; AES-256 | 500-10,000 | > 1,000,000 |
| Data Quality | USPS Address Processing | Standardization, verification, and cleansing (CASS certification in process) | 600-2,000 | 400,000 |
| Data Quality | Domain Set Validation | Validate a value against a list of acceptable values (e.g., all product codes at a retailer; all countries in the world) | 1,000-3,000 | 750,000 |
| Data Quality | Field Content Validation | Validate based on patterns such as emails, dates, and phone numbers | 1,000-3,000 | > 1,000,000 |
| Data Quality | Field Content Validation | Data type validation and bounds checking | 3,000-6,000 | > 1,000,000 |
| Data Platform Conversion | Mainframe Data Conversion | Copybook parsing and data layout discovery; EBCDIC, COMP, COMP-3, … → ASCII, integer, float, … | 200-800 | > 200,000 |
| Data Sort | Accelerated Data Sort | Sort data using complex sort keys from multiple fields within records | 7,000-20,000 | 1,000,000 |

Results are system-dependent; figures are intended as order-of-magnitude comparisons.
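As a rough illustration of the key-generation row (a sketch with hypothetical field names, not VelociData's implementation), hashing several field values into a single SHA-2 key:

```python
import hashlib

def make_key(*fields: str) -> str:
    # Join fields with a separator that cannot appear inside a value,
    # then hash with SHA-256 (a SHA-2 family function) for a fixed-size key.
    payload = "\x1f".join(fields).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Hypothetical customer-record fields.
print(make_key("DOE", "JOHN", "1970-01-01", "63130"))
```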
Example of Common ETL Bottlenecks
[Diagram: sources (CSV, mainframe, XML, RDBMS, social media, sensor, Hadoop) feed an ETL server
running Tasks #1-#8 against a staging DB in the Extract-Transform-Load flow; several Transform-stage
tasks are marked as candidates for acceleration; targets include Hadoop, the ETL server, a data
warehouse, database appliances, BI tools, and the cloud]
Example ETL Processes Offloaded
[Diagram: Tasks #1-#5 are offloaded from the ETL server, which keeps Tasks #6-#8 and the staging DB;
existing input interfaces (CSV, mainframe, XML, RDBMS, social media, sensor, Hadoop) are kept,
bottlenecks are removed, ETL server workload is reduced, and total processing time improves; targets
remain Hadoop, the ETL server, a data warehouse, database appliances, BI tools, and the cloud]
Example Mainframe-to-Hadoop Workflow
• Simple, configuration-driven workflow (a toy sketch follows below)
• Sample shows Mainframe → HDFS
• Data are validated, cleansed, reformatted, enriched, …, along the way
• Enables landing analytics-ready data as fast as it can move across the wire
• Workflow can also run in reverse to return processed data to the mainframe

Mainframe Input → Validation → Key Generation → Formatter → Lookup → Address Standardization → CSV Out
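A toy sketch of what "configuration-driven" can mean here (stage names mirror the workflow above; the stage registry and config format are hypothetical, not VelociData's):

```python
# Each stage is a record-to-record function; a config lists the stage order.
def validate(rec):  assert rec.get("id"), "missing id"; return rec
def key_gen(rec):   rec["key"] = f'{rec["id"]}-{rec.get("zip", "")}'; return rec
def formatter(rec): rec["name"] = rec["name"].title(); return rec

STAGES = {"validate": validate, "key_gen": key_gen, "format": formatter}
CONFIG = ["validate", "key_gen", "format"]   # workflow defined as data, not code

def run(records, config):
    for rec in records:
        for stage in config:
            rec = STAGES[stage](rec)
        yield rec

rows = [{"id": "42", "name": "ada lovelace", "zip": "63130"}]
print(list(run(rows, CONFIG)))
```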
Wire-rate Platform Integration
Enable fast data access between systems:
• MPP platforms (e.g., Teradata): format and improve data for ready insertion into data-analytics
architectures; VelociData enables real-time data access by Teradata for operational analytics
• ETL server: preprocess data for fast movement into and out of data-integration tools
• Mainframe: conversion into and out of EBCDIC and packed-decimal formats (a decoding sketch
follows this list)
• Hadoop: convert data to ASCII and improve quality in flight; VelociData feeds Hadoop pre-processed,
quality data for real-time BI efforts
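For context on the mainframe conversions above, a minimal sketch of decoding EBCDIC text and COMP-3 packed decimal in Python (assuming the common cp037 EBCDIC code page; real copybook-driven conversion handles many more cases):

```python
def ebcdic_to_ascii(raw: bytes) -> str:
    # cp037 is one common EBCDIC code page; real jobs pick the right one.
    return raw.decode("cp037")

def unpack_comp3(raw: bytes, scale: int = 0) -> float:
    # COMP-3: two decimal digits per byte; the final nibble is the sign
    # (0xD = negative, 0xC/0xF = positive).
    digits = "".join(f"{b:02x}" for b in raw)
    value, sign = int(digits[:-1]), digits[-1]
    if sign == "d":
        value = -value
    return value / (10 ** scale)

print(ebcdic_to_ascii(b"\xc8\xc5\xd3\xd3\xd6"))   # -> "HELLO"
print(unpack_comp3(b"\x12\x34\x5c", scale=2))     # -> 123.45
```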
Enabling Three Layers of Data Access
Sources: sensors, weblogs, transactions, mainframe, Hadoop, social media, RDBMS, …
• VelociData delivers Hadoop pre-processed, quality data to keep “the lake” clean
• VelociData enables real-time data access for immediate analytics and visualization
• VelociData feeds databases and warehouses pre-analytic, aggregated data for operational analytics
Wire-rate transformations and convergence of fresh and historical data
Accessing Real-time and Historical Data
• Real-time analysis for competitive advantage: enabling the speed of business to match business
opportunities
• Integrating historical data for operational excellence: informing traditional BI with real-time inputs
[Diagram: conventional batch-oriented BI, real-time operational analytics, iterative modeling,
business excellence]
Stream Processing AND Hadoop
Leveraging stream processing with batch-oriented Hadoop
• Access to more data for analytics
• Process data on ingest (also land raw data if desired)
• Transformation
• Cleansing
• Security
• Never read a COBOL copybook again
• Stream sort for integrating data, aggregation, and dedupe (see the sketch below)
• …
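A small illustration of stream sort and dedupe (a sketch merging pre-sorted shards with Python's heapq; the keys and shard layout are hypothetical):

```python
import heapq
from itertools import groupby

# Pre-sorted shards arriving from different sources (hypothetical keys).
shard_a = [("k1", "mainframe"), ("k3", "mainframe")]
shard_b = [("k1", "weblog"), ("k2", "weblog")]

# Merge the sorted streams without materializing them, then dedupe by key,
# keeping the first record seen for each key.
merged = heapq.merge(shard_a, shard_b, key=lambda r: r[0])
deduped = (next(group) for _, group in groupby(merged, key=lambda r: r[0]))
print(list(deduped))  # [('k1', 'mainframe'), ('k2', 'weblog'), ('k3', 'mainframe')]
```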
Examples of Data Challenges Being Solved
• Pharmaceutical discovery query is reduced from 8 days to 20 minutes
• Retailer now integrates full customer data from in-store, on-line, and mobile sources in
real-time (processing 50,000 records/s, up from 100/s)
• Property casualty company shortens by five-fold a daily task of processing 540 million
records to enable more accurate real-time quoting
• Credit card company reduces mainframe costs and improves analytics performance by
integrating historical and fresh data into Hadoop at line rates
• Financial processing network masks 5 million fields/s of production data to sell
opportunity information to retailers
• To enable better customer support, a health benefits provider shortens a data
integration process from 16 hours to 45 seconds
• Billions of records with multi-field keys are sorted at nearly a million records/s for
analytics and data quality
• USPS address standardization at 10 billion/hour for data cleansing on ingest
Thank You!
www.velocidata.com info@velocidata.com
Questions?