SQL in Hadoop: To Boldly Go Where No Data Warehouse Has Gone Before
Emma McGrattan
SVP Engineering, Actian Corp
Who is Actian?
• $140M Revenues + Profitable
• 10,000+ Customers
• Global Presence: 8 worldwide offices, 24x7 multinational support model
“Fast becoming a big data powerhouse to challenge the market.” - Forrester
“Actian is now very powerfully positioned in the big data and analytics markets.” - Bloor
[Figure: Actian Vortex platform diagram. Labels: Actian Management Console; SOURCE DATA (Databases / Marts, Warehouses, Cloud / SaaS Applications, Structured & Unstructured Data, Enterprise Applications); DATA PLATFORM (DataFlow, Elastic Data Prep, Vector in Hadoop, SQL Analytics, Predictive Analytics, Graph Analytics, SPARQLverse); ANALYTIC APPS (Financial Services, Health Care, Other Verticals); APPLICATION DEV (Application Development and Tools); SQL; Java, C/C++, Python; INFRASTRUCTURE; Deployment Options; Library of Analytic Blueprints.]
Actian Vector in Hadoop Architecture
[Figure: architecture diagram. SQL processing: a client application submits a SQL query; labels along this path are SQL parser, parsed tree, Optimizer, query plan, Cross compiler and X100 algebra. Master node (X100): Distributed rewriter, Builder, Execution engine and Buffer manager, with an annotated query tree, an operator tree, and data / data request arrows to HDFS through I/O. Worker node [1..n] (datanodes, X100): Rewriter, Builder, Execution engine and Buffer manager, with an annotated query tree, a partial operator tree, and data / data request arrows to HDFS through I/O. MPI arrows (inter-node communication) carry the annotated tree, the partial result set and the result. Also shown: HDFS namenode, HDFS datanode, and an X100 process on top of HDFS on each node.]
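The "partial result set" and "result" labels on the MPI arrows suggest that each worker computes part of an answer and the master node combines the parts. A minimal sketch of that pattern, assuming an AVG-style aggregate; the function names are hypothetical, plain Python lists stand in for HDFS-resident partitions, and no MPI is involved. It illustrates the general idea, not Actian's implementation:

```python
from typing import List, Tuple

# A partial result for AVG: (sum of values, number of rows) for one partition.
Partial = Tuple[int, int]


def worker_partial(partition: List[int]) -> Partial:
    """What one worker node would compute over its local partition."""
    return sum(partition), len(partition)


def master_merge(partials: List[Partial]) -> float:
    """What the master node would do with the partial result sets it receives."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    if count == 0:
        raise ValueError("no rows to aggregate")
    return total / count


def distributed_avg(partitions: List[List[int]]) -> float:
    """Simulate the whole flow in one process: one partial per worker, then merge."""
    return master_merge([worker_partial(p) for p in partitions])
```

Shipping (sum, count) pairs rather than per-worker averages keeps the merged answer exact even when partitions differ in size.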
Vector: Built for Warp Speed
        Time/Cycles to Process    Data Processed
DISK    Millions                  40-400MB
RAM     150-250                   2-3GB
CHIP    2-20                      10GB
• Vectorized End-to-End
  - Single Instruction Multiple Data
• Update Capability
  - Limit I/O
  - Efficient real time updates
  - Update & Delete individual records
• Smart Compression
  - Maximize throughput
  - Vectorized decompression
• Exploiting Chip Cache
  - Process data on chip, not in RAM
• YARN Integration
  - Intelligent Block Placement
  - Dynamic Resource Management
  - …
• Storage Indexes
  - Quickly identify candidate data blocks
  - Minimize I/O
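The "Storage Indexes" point can be illustrated with a per-block min/max summary, one common way to identify candidate blocks. The slide does not say which summary Vector keeps, so the min/max choice and the names `Block`, `build_blocks` and `scan_range` are assumptions made for this sketch:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    values: List[int]
    min_value: int  # smallest value stored in this block
    max_value: int  # largest value stored in this block


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into blocks and record each block's min and max."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return values in [low, high] plus the number of blocks actually read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        # Skip blocks whose [min, max] range cannot overlap the predicate.
        if block.max_value < low or block.min_value > high:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if low <= v <= high)
    return matches, blocks_read
```

With a column of 100 sorted values in blocks of 10, a predicate covering 25 to 34 touches only 2 of the 10 blocks. The benefit shrinks when values are scattered, because more block ranges then overlap the predicate.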
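The "Vectorized End-to-End" and "Exploiting Chip Cache" points describe operators that work on small batches (vectors) of column values rather than on one row at a time, so that each batch stays close to the CPU. A rough sketch of that control flow follows. Pure Python does not issue SIMD instructions, so this shows only the shape of vector-at-a-time execution; the batch size of 1024 and all function names are assumptions:

```python
from typing import Iterable, Iterator, List

VECTOR_SIZE = 1024  # hypothetical batch size, small enough to stay cache-resident


def scan(column: List[int], vector_size: int = VECTOR_SIZE) -> Iterator[List[int]]:
    """Yield the column as fixed-size vectors instead of single values."""
    for start in range(0, len(column), vector_size):
        yield column[start:start + vector_size]


def select_greater_than(vectors: Iterable[List[int]], threshold: int) -> Iterator[List[int]]:
    """Filter operator: one call handles a whole vector in a tight loop."""
    for vec in vectors:
        yield [v for v in vec if v > threshold]


def sum_vectors(vectors: Iterable[List[int]]) -> int:
    """Aggregate operator: consume vectors and accumulate a running total."""
    total = 0
    for vec in vectors:
        total += sum(vec)
    return total


def query_sum_where_gt(column: List[int], threshold: int) -> int:
    """SELECT SUM(x) WHERE x > threshold, expressed as a pipeline of vector operators."""
    return sum_vectors(select_greater_than(scan(column), threshold))
```

The per-call overhead of each operator is paid once per vector rather than once per value, which is the property a native engine exploits with SIMD instructions and cache-sized batches.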
Vector: To Boldly Go…
[Chart: Speedup vs Impala on the “Impala Subset” of TPC-DS queries at Scale Factor 3000 (3TB). X-axis: queries Q3, Q7, Q19, Q27, Q34, Q42, Q43, Q46, Q52, Q53, Q55, Q59, Q63, Q65, Q68, Q73, Q79, Q89, Q98. Y-axis: number of times faster than Impala (0 to 35). Series: Impala, Actian.]
16x faster on average
Both executed on the same hardware and software environment: a 5-node cluster with 64GB of RAM per node and 24 x 1TB hard disks.
The SQL Behind the Actian Numbers
The Impala Equivalent Uses “Hints”
Note the use of partition keys
Trickle Update Support
• The Kobayashi Maru of Hadoop
  - The design paradigm for HDFS is for data to be written once and read ever after.
  - Appending updated records to the end of a column/table, or rewriting the entire table, significantly impacts system performance.
• The Solution: Positional Delta Trees
  - Enable online updates without impacting read performance.
  - Keep track of the tuple positions of inserts, modifies and deletes.
  - Designed to make merging these updates fast by providing, at update time, the tuple positions where differences have to be applied.
Positional Delta Trees
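The core idea, keeping updates as position-tagged differences and merging them into the unchanged base data during a scan, can be sketched as follows. This is a deliberately simplified illustration: it uses flat dictionaries rather than a tree, addresses every delta by its position in the stable (on-disk) data, and all names are hypothetical. It is not Actian's implementation:

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional, Set


@dataclass
class Delta:
    kind: str                    # "ins", "mod" or "del"
    pos: int                     # position in the stable (unmodified) data
    value: Optional[str] = None  # new value for "ins" and "mod"


def merge_scan(stable: List[str], deltas: List[Delta]) -> Iterator[str]:
    """Scan the stable data, applying positional deltas on the fly.

    "ins" places a value before the stable tuple at pos (pos == len(stable)
    appends), "mod" replaces the tuple at pos, "del" skips it. The stable
    data itself is never rewritten.
    """
    inserts: Dict[int, List[str]] = {}
    modifies: Dict[int, str] = {}
    deletes: Set[int] = set()
    for d in deltas:
        if d.kind == "ins":
            inserts.setdefault(d.pos, []).append(d.value)
        elif d.kind == "mod":
            modifies[d.pos] = d.value
        elif d.kind == "del":
            deletes.add(d.pos)
        else:
            raise ValueError(f"unknown delta kind: {d.kind}")

    for pos, value in enumerate(stable):
        yield from inserts.get(pos, [])   # tuples inserted before this position
        if pos in deletes:
            continue                      # deleted tuple: emit nothing
        yield modifies.get(pos, value)    # modified value if present, else original
    yield from inserts.get(len(stable), [])  # tuples appended at the end
```

Because each delta already carries its position, the scan needs only a position lookup per tuple and no key comparison, and readers of the stable data are unaffected until the deltas are eventually folded in.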
Data Security
• Access Control
• Role Separation
  - System Administrator & Database Administrator should not have access to all data
• Security Auditing
  - Ability to audit who accessed, or attempted to access, what and when
• Encryption
  - Data at rest: minimize performance impact by enabling at column level
  - Data in motion
  - File system
It’s SQL in Hadoop, Jim, but not as we know it
Highest Performing and Fully Industrialized SQL in Hadoop
• Fully ACID compliant - brings transactional integrity to Hadoop to prevent inaccurate results
• Full ANSI SQL-92 support - enables use of ALL standard BI tools and apps
• Native DBMS Security - authentication, user and role-based security, data protection, and encryption
• Open APIs - allow read access to our block format
• Highly Performant - up to 30x faster than our closest competitor, Impala
• Mature, proven planner and fastest optimizer - ensure customers can maximize number of nodes, CPU, memory and cache
• Hadoop distribution agnostic - avoids vendor lock-in and provides customer flexibility
• Native in-Hadoop YARN - manages Hadoop resources automatically to prevent inefficiencies
• Collaborative architecture - query native Hadoop file formats (like Parquet) without ingestion
• Update Capability - provides ability to update without impacting read performance
• Highest Concurrency - allows your customers to have simultaneous users and tasks run without long wait times
Beam it Down, Scotty!