Confidential © 2015 Actian Corporation1
Keys to the Kingdom:
SQL in Hadoop
Emma McGrattan, SVP Engineering
June 9th, 2015

Getting to Know Emma McGrattan
I’ve worked with data since before it was “Big”
I have no sense of rhythm and was refunded all my money by the Arthur Murray School of Dance because they just couldn’t teach me to dance
Upon “retirement” I dream of opening a bar/bistro in Dublin, Amsterdam or Paris
I’d like to like olives, they’re so sophisticated, but I just can’t stand the taste
I once won an open mic comedy night and would love to do stand-up
I was born and raised in Ireland but became an American citizen in 2011
I would rather walk 150ft to my car to get my phone to control my lights and thermostats than do it the old-fashioned way
I’ve almost drowned four times
I joined Mensa at age 12
I once announced on-stage at Comdex that I’d like to be a pornographer – I meant to say “photographer” – honestly I did!

Who is Actian?
$140M Revenues + Profitable
10,000+ Customers
Global Presence: 8 world-wide offices, 7x24 multinational support model
End to End Big Data Platform. Disruptive Price Performance.
“Actian Analytics demonstrates that the company now has an impressive range of offerings that have been rebranded and combined in a pretty comprehensive framework.” 451 Research
“Actian is now very powerfully positioned in the big data and data analytics markets.” Bloor Group
Collect the Keys; Win an Apple Watch
How to Collect the Keys

SQL In Hadoop Landscape
[Figure: landscape chart. Axis labels: Maturity (SQL support, ACID, reliability, security, connectivity, performance) and Hadoop Integration. Scale labels: Low, High, “connections”, Native. Labeled groups: “wrapped legacy”, “from scratch”, and Mature & Integrated + End-to-End.]
Actian Vortex – Architecture

Actian Vector – Unmatched Innovation
Data processed and time (cycles) to process, by tier:
DISK: 10GB, millions of cycles
RAM: 2-3GB, 150-250 cycles
CHIP: 40-400MB, 2-20 cycles
1. Vector Processing: Single Instruction, Multiple Data
2. Exploiting Chip Cache: process data on chip – not in RAM
3. 2nd Gen Column Store: limit I/O; efficient real time updates
4. Smarter Compression: maximize throughput; vectorized decompression
5. Storage Indexes: quickly identify candidate data blocks; minimize IO
6. YARN Integration: intelligent block placement; dynamic resource management…
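The storage indexes on this slide (described in the editor's notes as min-max indexes on disk blocks) can be illustrated with a minimal Python sketch. It is not Actian's implementation: the `Block` layout, the function names, and the block size of 100 values are all assumptions made for illustration. It only shows why a per-block minimum and maximum lets a range query skip blocks that cannot contain matching rows.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One column block plus its min/max summary (hypothetical layout)."""
    values: List[int]
    min_value: int
    max_value: int


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into fixed-size blocks and record min/max per block."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def candidate_blocks(blocks: List[Block], low: int, high: int) -> List[int]:
    """Return indexes of blocks whose [min, max] range overlaps [low, high]."""
    return [i for i, b in enumerate(blocks)
            if b.max_value >= low and b.min_value <= high]


def range_scan(blocks: List[Block], low: int, high: int) -> List[int]:
    """Read only the candidate blocks, then filter the rows inside them."""
    hits = []
    for i in candidate_blocks(blocks, low, high):
        hits.extend(v for v in blocks[i].values if low <= v <= high)
    return hits


# Roughly ordered data (for example, a date key) prunes well.
column = list(range(0, 1000))
blocks = build_blocks(column, block_size=100)
touched = candidate_blocks(blocks, 250, 320)   # only blocks 2 and 3 are read
result = range_scan(blocks, 250, 320)
```

With ordered data, 2 of the 10 blocks are read. With completely random data every block's range would overlap the predicate and nothing would be skipped, which matches the caveat in the notes ("when data is not completely random").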

Highest Performing SQL in Hadoop
Up to 30X Faster Than Impala
[Figure: bar chart, Actian+HDP2.1 vs Cloudera Impala, on the “Impala Subset” of TPC-DS at Scale Factor 3000 (3TB). Vertical axis: # times faster than Impala, 0 to 35. Horizontal axis: queries Q3, Q7, Q19, Q27, Q34, Q42, Q43, Q46, Q52, Q53, Q55, Q59, Q63, Q65, Q68, Q73, Q79, Q89, Q98, and Average. Series: Impala, Actian.]
Both executed on the same hardware and software environment: 5-node cluster with 64GB of RAM per node and 24x1TB hard disks
Background to the “Impala Subset” of the TPC-DS benchmark can be found here: http://blog.cloudera.com/blog/2015/01/impala-performance-dbms-class-speed/
The SQL Behind the Actian Numbers

The Impala Equivalent Uses “Hints”
Note the use of partition keys

Trickle Update Support
The Achilles Heel of Hadoop
■ The design paradigm for HDFS is for data to be written once and read ever after.
■ Appending updated records to the end of a column/table or rewriting the entire table significantly impacts system performance.
The Solution – Positional Delta Trees
■ Enable on-line updates without impacting read performance
■ Keep track of the tuple position of Inserts/Modifies/Deletes
■ Designed to make merging these updates fast by providing the tuple positions where differences have to be applied at update time.
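The positional idea behind these bullets can be shown with a small Python sketch. It is a deliberate simplification and not the actual Positional Delta Tree: a real PDT is a tree structure, while this sketch uses plain dictionaries, and the class and method names are hypothetical. The point it illustrates is that updates are recorded against tuple positions in the read-only stable data and merged in during a scan, without comparing sort keys and without rewriting the stable data.

```python
from typing import Any, Dict, Iterator, List, Set


class PositionalDeltas:
    """Updates keyed by position in the stable (read-only) data.

    Simplification: plain dictionaries stand in for the tree structure.
    """

    def __init__(self) -> None:
        self.inserts: Dict[int, List[Any]] = {}  # rows placed before stable position p
        self.deletes: Set[int] = set()           # stable positions that are deleted
        self.modifies: Dict[int, Any] = {}       # stable position -> replacement value

    def insert_before(self, pos: int, value: Any) -> None:
        self.inserts.setdefault(pos, []).append(value)

    def delete(self, pos: int) -> None:
        self.deletes.add(pos)
        self.modifies.pop(pos, None)

    def modify(self, pos: int, value: Any) -> None:
        if pos not in self.deletes:
            self.modifies[pos] = value


def merged_scan(stable: List[Any], deltas: PositionalDeltas) -> Iterator[Any]:
    """Yield the current table image; the stable data is never rewritten."""
    for pos in range(len(stable) + 1):
        # Inserts positioned before this stable tuple (or at the end of the table).
        yield from deltas.inserts.get(pos, [])
        if pos == len(stable):
            break
        if pos in deltas.deletes:
            continue
        yield deltas.modifies.get(pos, stable[pos])


stable = [10, 20, 30, 40]
deltas = PositionalDeltas()
deltas.insert_before(1, 15)   # new row between 10 and 20
deltas.delete(2)              # remove 30
deltas.modify(3, 41)          # change 40 to 41
deltas.insert_before(4, 50)   # append at the end
current = list(merged_scan(stable, deltas))  # [10, 15, 20, 41, 50]
```

The merge loop looks only at positions, never at key values, which is the property the editor's notes credit for the lower I/O and CPU cost compared with value-based merging.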
Data Security
Access Control
Role Separation
■ System Administrator & Database Administrator should not have access to all data
Security Auditing*
■ Ability to audit who issued what query from where and when
Encryption
■ Data at rest – minimal performance impact*
■ Data in motion
■ File system
Overcoming Challenges with YARN
■ YARN not designed for long running
processes
■ Impossible to grow and shrink
resources on the fly
■ Requirement to dynamically set priority
of the workload
■ Identify data node failure and
dynamically reconfigure cluster
■ Intelligent block replica placement
And for my next trick….

Highest Performing and Fully Industrialized SQL in Hadoop
Fully ACID compliant – brings transactional integrity to Hadoop to prevent inaccurate results
Full ANSI SQL 92 support – enables use of ALL standard BI tools and apps
Native DBMS Security – authentication, user and role-based security, data protection, and encryption
Open APIs – allow read access to our block format
Highly Performant – up to 30x faster than our closest competitor, Impala
Mature, proven planner and fastest optimizer – ensure customers can maximize use of nodes, CPU, memory and cache
Hadoop distribution agnostic – avoids vendor lock-in and provides customer flexibility
Native in-Hadoop YARN – manages Hadoop resources automatically to prevent inefficiencies
Collaborative architecture – query native Hadoop file formats (like Parquet) without ingestion
Update Capability – provides ability to update without impacting read performance
Highest Concurrency – allows your customers to have simultaneous users and tasks run without long wait times

Customer Analytics on Hadoop in the Fitness Industry
Actian Vortex
Challenge: Leverage wealth of data available from fitness tracking tools and devices to better understand customers, increase wallet share, and find cross-sell and up-sell opportunities.
Obstacles: Existing CRM system unable to handle the amount and type of data. Wanted to use Hadoop but lacked in-house Hadoop skills.
Solution: Actian Vortex
Results:
• Timely, accurate reports
• Ability to leverage existing SQL skills without having to learn new tools
• Eliminated need for aggregating or sampling
• True ad hoc analysis

Market Risk Analytics on Hadoop
Actian Vortex Technical Evaluation: Target vs. Actual
Loading – Target: 2 billion risk data points in 6 hours (~100k/sec). Actual: 1 hour 40 min (333K per sec).
Full Day Aggregation – Target: <30 seconds. Actual: 6 sec on 5 node cluster; 2 sec on 10 node cluster.
Filtered Aggregation – Target: hierarchy dimension on 1 million data points in <15 sec. Actual: sub second response times.
Large Data Volumes – Target: store 80 days (160 billion rows) of data. Actual: stored 100 days (200 billion rows) with linear scaling.
Horizontal Scalability – Target: up to 10 billion rows per day. Actual: textbook scalability as nodes added to cluster.
Drill Up / Drill Down – Target: <2 sec. Actual: <1 sec.
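The loading and volume figures on this slide can be checked with a few lines of Python. This is only an arithmetic sketch. It assumes the "Actual" load covered the same 2 billion data points as the target, which the slide does not state outright but which is consistent with 333K per second over 1 hour 40 minutes. The function name is hypothetical.

```python
def rows_per_second(rows: int, hours: float = 0, minutes: float = 0) -> float:
    """Average load rate for a given row count and elapsed time."""
    seconds = hours * 3600 + minutes * 60
    return rows / seconds


ROWS = 2_000_000_000  # 2 billion risk data points (assumed for both columns)

target_rate = rows_per_second(ROWS, hours=6)              # about 92.6K rows/sec
actual_rate = rows_per_second(ROWS, hours=1, minutes=40)  # about 333K rows/sec
speedup = actual_rate / target_rate                       # 3.6x faster than target

# Both volume figures imply the same daily load of 2 billion rows.
rows_per_day_target = 160_000_000_000 / 80
rows_per_day_actual = 200_000_000_000 / 100
```

The target rate works out to roughly 92.6K rows per second, which the slide rounds to ~100k/sec, and both storage figures correspond to 2 billion rows per day.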
Download Vortex Today!



Editor's Notes

  • #7: internationalization
  • #9:
    1: We use vectorized processing to exploit modern CPU architecture. We execute one operation at a time on a vector of data, which allows for tight inner code loops without branching. This way, we can use SIMD instructions and, because of the lack of branching, make sure the CPU pipelines are not thrashed. A vector is typically 1024 rows of a single column, so it’s a manageable amount of data while the overhead per row is still negligible.
    2: A vector will fit in the CPU cache together with the code for a particular operation, so all execution is in-cache.
    3: To feed this engine with enough data, we’re also applying the vectorized paradigm to the storage subsystem. First of all, we’re using a column store, so only relevant columns are read from disk. Data is stored in blocks of typically 512MB and a single block contains only data from a single column (there are exceptions). Blocks of different columns can be interleaved per block, but typically more than one block of the same column is grouped. To keep the stable storage fast and defragmented, we use in-memory overlays to store updates to the data. These overlays are automatically flushed to stable storage when needed.
    4: The blocks are stored compressed on-disk. We’ve got a number of lightweight compression algorithms and the most efficient one is chosen per block, depending on the data characteristics. The decompression takes place per vector and can be done in the CPU cache, which neatly ties in with our in-cache execution. We have a buffer manager that predicts what blocks are needed when and makes sure no blocks that will be used in the near future are evicted from the buffer cache.
    5: We have min-max indexes on the disk blocks, so when data is not completely random we can narrow down the ranges of blocks we need to read from disk, per column.
    All in all, the execution engine is able to do about 1.5GB/s per core, and high-end I/O subsystems are able to keep up with this.
  • #10:
    Execution
      Subset of TPC-DS as chosen by Impala
      Data size is 3TB (SF3000)
      Executed on 5-node “rushcluster” in Austin
      Both Impala and Vector numbers are on the same hardware
    Comparison with Impala
      Verified that Impala plans are sensible
      Currently observed average speedup is 11x
      Optimal query plans (manually written) give us 16x speedup
        These are real numbers! We executed manual plans directly
        Changes in the cost model would get us to this performance
    Performance improvements
      Cost model changes will get us to 16x speedup
      Pipeline of query execution changes
        Well into H2
        Estimated to get us 2x improvement
      So, estimated speedup vs Impala would be ~30x (no guarantees)
    Planning to run TPC-H SF1000 and SF3000
      With all planned improvements (end of the year) we should be able to beat the EXASOL cluster numbers.
  • #11: Mention that the store_sales table has 8.6 BILLION rows
  • #13: Key advantages of the PDT over value-based merging are: Positional merging needs less I/O than value-based merging, because the sort keys do not need to be read. Positional merging is less CPU intensive than value-based merging, especially when the sort-key is a compound key and/or non-numerical attributes are part of the sort-key.
  • #17: A leading sports apparel retailer has used analytics, provided by a 3rd party customer relationship marketing service, to promote brand loyalty and identify targeted sales opportunities. Under Armour wanted to turn to analytics to get closer to their customers, increase wallet share and find cross-sell and up-sell opportunities. This is a traditional retailer looking for new ways to leverage the wealth of information available to them (especially in fitness tracking tools and devices). They were outsourcing data, which limited flexibility and slowed results.
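
Note #9 above describes vector-at-a-time execution: one operation applied to a vector of about 1024 values from a single column. The execution model can be sketched in Python as follows. This is an illustration only, not Actian's engine: pure Python cannot show SIMD instructions or CPU cache effects, and the primitive names and the example query are hypothetical. What it does show is how a query is broken into simple per-vector primitives instead of being interpreted row by row.

```python
from typing import Iterator, List

VECTOR_SIZE = 1024  # rows per vector, the typical size quoted in note #9


def vectors(column: List[float], size: int = VECTOR_SIZE) -> Iterator[List[float]]:
    """Cut a column into consecutive vectors of at most `size` values."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def multiply(a: List[float], b: List[float]) -> List[float]:
    """Primitive: one arithmetic operation over whole vectors, no per-row branching."""
    return [x * y for x, y in zip(a, b)]


def select_less_than(v: List[float], bound: float) -> List[int]:
    """Primitive: build a selection vector holding the qualifying positions."""
    return [i for i, x in enumerate(v) if x < bound]


def sum_selected(v: List[float], selection: List[int]) -> float:
    """Primitive: aggregate only the selected positions."""
    return sum(v[i] for i in selection)


def revenue_for_small_quantities(price, quantity, max_quantity) -> float:
    """SUM(price * quantity) WHERE quantity < max_quantity, one vector at a time."""
    total = 0.0
    for p_vec, q_vec in zip(vectors(price), vectors(quantity)):
        selection = select_less_than(q_vec, max_quantity)
        product = multiply(p_vec, q_vec)
        total += sum_selected(product, selection)
    return total


price = [2.0] * 3000
quantity = [float(i % 10) for i in range(3000)]
vectorized = revenue_for_small_quantities(price, quantity, 5)
row_at_a_time = sum(p * q for p, q in zip(price, quantity) if q < 5)
```

Both approaches return the same total. The difference is that each primitive runs a tight loop over one vector, which is the shape of work that a compiled engine can map onto SIMD instructions and keep inside the CPU cache.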