© 2013 IBM Corporation
SQL on Hadoop - 12th Swiss Big Data User Group Meeting, 3rd of July, 2014, ETH Zurich
Romeo Kienzler
IBM Center of Excellence for Data Science, Cognitive Systems and BigData
(A joint-venture between IBM Research Zurich and IBM Innovation Center DACH)
Source: http://www.kdnuggets.com/2012/04/data-science-history.jpg
DataScience at present
● Tools (http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html)
  ● SQL (42%)
  ● R (33%)
  ● Python (26%)
  ● Excel (25%)
  ● Java, Ruby, C++ (17%)
  ● SPSS, SAS (9%)
● Limitations (Single Node usage)
  ● Main Memory
  ● CPU <> Main Memory Bandwidth
  ● CPU
  ● Storage <> Main Memory Bandwidth (either Single node or SAN)
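To make the storage bandwidth limit concrete, here is a back-of-the-envelope sketch in Python: the time to scan a data set once is roughly its size divided by the aggregate read bandwidth. The 200 MB/s per-node bandwidth and the 20-node cluster are illustrative assumptions, not figures from this talk, and the model assumes perfectly parallel I/O.

```python
def scan_seconds(data_bytes: float, mb_per_s_per_node: float, nodes: int = 1) -> float:
    """Estimate the time to read a data set once, assuming perfectly parallel I/O."""
    aggregate_bytes_per_s = mb_per_s_per_node * 1e6 * nodes
    return data_bytes / aggregate_bytes_per_s

THREE_TB = 3e12            # 3 TB in decimal units
ASSUMED_MB_PER_S = 200.0   # hypothetical per-node read bandwidth

single = scan_seconds(THREE_TB, ASSUMED_MB_PER_S, nodes=1)
cluster = scan_seconds(THREE_TB, ASSUMED_MB_PER_S, nodes=20)

print(f"1 node:   {single / 3600:.1f} h")
print(f"20 nodes: {cluster / 60:.1f} min")
```

Under these assumptions a single node needs hours for one full scan, while spreading the same data over many nodes brings it down to minutes.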
Data Science on Hadoop
SQL (42%)
R (33%)
Python (26%)
Excel (25%)
Java, Ruby, C++ (17%)
SPSS, SAS (9%)
[Figure labels: Data Science, Hadoop]
SQL on Hadoop
● IBM BigSQL (ANSI 2011 compliant, part of IBM BigInsights)
● HIVE, Presto
● Cloudera Impala
● Lingual
● Shark
● ...
[Figure labels: SQL, Hadoop]
Two types of SQL Engines
● Type I
  ● Compiler and Optimizer SQL->MapReduce
● Type II
  ● Brings own distributed execution engine on Data Nodes
  ● Brings own Task Scheduler
● The Hadoop SQL Ecosystem is evolving very fast
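A Type I engine turns a query into MapReduce phases. The following plain-Python sketch (no Hadoop involved) shows how a query such as "select count(hour), hour from trace group by hour order by hour" could be expressed as map, shuffle and reduce steps. The function names and the sample rows are hypothetical; a real engine such as Hive generates far more elaborate plans.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, int]

def map_phase(rows: Iterable[Row]) -> Iterator[Tuple[int, int]]:
    """Map: emit (group key, 1) for every input row."""
    for row in rows:
        yield row["hour"], 1

def shuffle(pairs: Iterable[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Shuffle: collect all values per key (Hadoop does this via disk and network)."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups: Dict[int, List[int]]) -> List[Tuple[int, int]]:
    """Reduce: aggregate each group; sorting the keys mimics ORDER BY hour."""
    return [(key, sum(values)) for key, values in sorted(groups.items())]

# Hypothetical sample of the trace table (only the column the query needs).
trace = [{"hour": 9}, {"hour": 10}, {"hour": 9}, {"hour": 11}, {"hour": 9}]

result = reduce_phase(shuffle(map_phase(trace)))
print(result)  # [(9, 3), (10, 1), (11, 1)]
```

A Type II engine skips this translation and runs the aggregation in its own long-lived processes on the data nodes.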
Hive
● Runs on top of MapReduce
● → Type I
Source: http://cdn.venublog.com/wp-content/uploads/2013/07/hive-1.jpg
Lingual
● ANSI SQL Layer on top of Cascading
● Cascading
  ● Java API to express DAGs
  ● Runs on top of MapReduce
● → Type I
Limits of MapReduce
● Disk writes between Map and Reduce
● Slow for computations which depend on previously computed values
● JOINs are very slow and difficult to implement
● Only sequential data access
● Only tuple-wise data access
● Map-Side joins have sort and size constraints
● Reduce-Side joins require secondary sorting of values
● ...
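The two join strategies named above can be sketched in plain Python. This is an illustration under simplifying assumptions, not Hadoop code: the reduce-side variant tags each record with its source and relies on a secondary sort so that one side arrives first within each key, while the map-side variant shown here is the in-memory (broadcast) form, which only works if the smaller table fits in memory. All function names and sample data are hypothetical.

```python
from itertools import groupby
from typing import Dict, List, Tuple

Record = Tuple[int, str]  # (join key, payload)

def reduce_side_join(left: List[Record], right: List[Record]) -> List[Tuple[int, str, str]]:
    """Tag records by source, sort by (key, tag), then join within each key group."""
    tagged = [(k, 0, v) for k, v in left] + [(k, 1, v) for k, v in right]
    # Secondary sort: within a key, left records (tag 0) come before right records (tag 1).
    tagged.sort(key=lambda rec: (rec[0], rec[1]))
    out = []
    for key, group in groupby(tagged, key=lambda rec: rec[0]):
        left_values: List[str] = []
        for _, tag, value in group:
            if tag == 0:
                left_values.append(value)  # only the left side is buffered
            else:
                out.extend((key, lv, value) for lv in left_values)
    return out

def map_side_join(small: List[Record], large: List[Record]) -> List[Tuple[int, str, str]]:
    """Build a hash table from the small side and stream the large side past it."""
    lookup: Dict[int, List[str]] = {}
    for k, v in small:
        lookup.setdefault(k, []).append(v)
    return [(k, sv, lv) for k, lv in large for sv in lookup.get(k, [])]

# Hypothetical sample data.
departments = [(1, "sales"), (2, "ops")]
employees = [(1, "ann"), (2, "bob"), (1, "cy"), (3, "dee")]

joined_reduce = reduce_side_join(departments, employees)
joined_map = map_side_join(departments, employees)
print(joined_reduce)
print(sorted(joined_map) == sorted(joined_reduce))
```

Both functions return the same set of joined rows; they differ in what has to be sorted, shuffled or held in memory, which is where the constraints listed above come from.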
Impala (Type II)
http://blog.cloudera.com/blog/wp-content/uploads/2012/10/impala.png
Presto (Type II)
https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920
Spark / Shark (Type II)
Source: http://bighadoop.files.wordpress.com/2014/04/spark-architecture.png
BigSQL V3.0 (Type II)
Like in Spark, MapReduce has been kicked out :)
(No JobTracker, no TaskTracker, but HDFS/GPFS remains)
BigSQL V3.0 – Architecture
Putting the story together….
Big SQL shares a common SQL dialect with DB2
Big SQL shares the same client drivers with DB2
BigSQL V3.0 – Performance
● Query rewrites
  ● Exhaustive query rewrite capabilities
  ● Leverages additional metadata such as constraints and nullability
● Optimization
  ● Statistics and heuristic driven query optimization
  ● Query optimizer based upon decades of IBM RDBMS experience
● Tools and metrics
  ● Highly detailed explain plans and query diagnostic tools
  ● Extensive number of available performance metrics
SELECT ITEM_DESC, SUM(QUANTITY_SOLD),
       AVG(PRICE), AVG(COST)
FROM PERIOD, DAILY_SALES, PRODUCT, STORE
WHERE PERIOD.PERKEY = DAILY_SALES.PERKEY
  AND PRODUCT.PRODKEY = DAILY_SALES.PRODKEY
  AND STORE.STOREKEY = DAILY_SALES.STOREKEY
  AND CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012'
  AND STORE_NUMBER = '03'
  AND CATEGORY = 72
GROUP BY ITEM_DESC
[Figure: the query passes through query transformation (dozens of query transformations) and then access plan generation (hundreds or thousands of access plan options). Candidate join trees over Store, Product, Daily Sales and Period are shown, built from NLJOIN, HSJOIN and ZZJOIN operators.]
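A rough count shows why even a four-table query yields hundreds of access plan options. The Python sketch below enumerates only left-deep join orders over the four tables of the query, combined with the three join operators named on the slide. It is a naive illustration, not how the Big SQL optimizer works: a real optimizer also considers other tree shapes and access paths, and prunes orders that would join tables without a join predicate.

```python
from itertools import permutations, product
from typing import Iterator, List, Tuple

TABLES = ["PERIOD", "DAILY_SALES", "PRODUCT", "STORE"]
JOIN_METHODS = ["NLJOIN", "HSJOIN", "ZZJOIN"]  # operators named on the slide

def left_deep_plans(tables: List[str], methods: List[str]) -> Iterator[Tuple[tuple, tuple]]:
    """Yield (join order, join method per join) for every left-deep plan."""
    for order in permutations(tables):
        # n tables need n - 1 joins, each with an independent choice of method.
        for ops in product(methods, repeat=len(tables) - 1):
            yield order, ops

plans = list(left_deep_plans(TABLES, JOIN_METHODS))
print(len(plans))  # 4! join orders * 3**3 method choices = 648
print(plans[0])
```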
BigSQL V3.0 – Performance
You are substantially faster if you don't use MapReduce
IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution to successfully run ALL 99 TPC-DS queries and ALL 22 TPC-H queries without modification.
Source: http://www.ibmbigdatahub.com/blog/big-deal-about-infosphere-biginsights-v30-big-sql
BigSQL V3.0 – Query Federation
[Figure: a Head Node running Big SQL, and four Compute Nodes, each running a Task Tracker, a Data Node and Big SQL.]
BigSQL V1.0 – Demo (small)
● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB Data, ~60.937.500.000 rows (middle, Innovation Center Zurich)
● 0.7 PB Data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)
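The three row counts are consistent with a linear extrapolation from the small data set. The short Python check below assumes decimal units (1 GB = 10⁹ bytes) and a constant average row size; under those assumptions it reproduces the quoted numbers.

```python
SMALL_BYTES = 32e9          # 32 GB, decimal units assumed
SMALL_ROWS = 650_000_000    # row count quoted for the small data set

bytes_per_row = SMALL_BYTES / SMALL_ROWS  # average row size implied by the small set

def estimated_rows(data_bytes: float) -> float:
    """Scale the row count linearly with the data size."""
    return data_bytes / bytes_per_row

print(f"{bytes_per_row:.1f} bytes/row")
print(f"3 TB   -> {estimated_rows(3e12):.6e} rows")
print(f"0.7 PB -> {estimated_rows(0.7e15):.6e} rows")
```

The implied average row size is about 49 bytes, which is plausible for four integers and two short strings stored as delimited text.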
BigSQL V1.0 – Demo (small)
CREATE EXTERNAL TABLE trace (
  hour integer, employeeid integer,
  departmentid integer, clientid integer,
  date string, timestamp string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/biadmin/32Gtest';
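An external table over delimited text applies the schema only when the files are read. The Python sketch below imitates that idea for the column list above: each line is split on the field delimiter and the values are cast to the declared types. The sample lines are hypothetical, since the talk does not show the contents of the data files, and the parser is a simplification of what a real engine does.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Column names and types taken from the CREATE EXTERNAL TABLE statement.
SCHEMA: List[Tuple[str, Callable]] = [
    ("hour", int), ("employeeid", int), ("departmentid", int),
    ("clientid", int), ("date", str), ("timestamp", str),
]

def parse_rows(text: str, delimiter: str = ",") -> Iterator[Dict[str, object]]:
    """Apply the schema at read time to every delimited line."""
    for line in text.splitlines():
        fields = line.split(delimiter)
        yield {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}

# Hypothetical file content in the declared layout.
sample = "9,17,3,42,2014-07-03,09:15:00\n10,18,3,43,2014-07-03,10:02:31\n"

rows = list(parse_rows(sample))
print(rows[0]["hour"], rows[1]["clientid"])  # 9 43
```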
BigSQL V1.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace1;
+----------+
| |
+----------+
| 11416740 |
+----------+
1 row in results(first row: 39.78s; total: 39.78s)
BigSQL V1.0 – Demo (small)
select count(hour), hour from trace group by hour order by hour
30 rows in results(first row: 37.98s; total: 37.99s)
BigSQL V1.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace1 t3 inner
join trace2 t4 on t3.hour=t4.hour;
+--------+
| |
+--------+
| 477340 |
+--------+
1 row in results(first row: 32.24s; total: 32.25s)
BigSQL V3.0 – Demo (small)
CREATE HADOOP TABLE trace3 (
hour int, employeeid int,
departmentid int,clientid int,
date varchar(30), timestamp varchar(30) )
row format delimited
fields terminated by '|'
stored as textfile;
BigSQL V3.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace3;
+----------+
| 1 |
+----------+
| 12014733 |
+----------+
1 row in results(first row: 2.94s; total: 2.95s)
BigSQL V3.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace3 t3 inner
join trace4 t4 on t3.hour=t4.hour;
+--------+
| 1 |
+--------+
| 504360 |
+--------+
1 row in results(first row: 0.79s; total: 0.80s)
BigSQL V3.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(hour), hour from trace3
group by hour order by hour;
29 rows in results(first row: 1.88s; total: 1.89s)
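The total times reported in the V1.0 and V3.0 demos can be set side by side with a few lines of Python. The timings are copied from the slides; the comparison is only indicative, because the two versions ran against different tables (trace1/trace2 versus trace3/trace4) with slightly different row counts.

```python
# (query, BigSQL V1.0 total seconds, BigSQL V3.0 total seconds), as shown in the demos
TIMINGS = [
    ("count(*)", 39.78, 2.95),
    ("join on hour", 32.25, 0.80),
    ("group by hour", 37.99, 1.89),
]

def speedup(old_seconds: float, new_seconds: float) -> float:
    """Ratio of old to new run time; values above 1 mean the new run is faster."""
    return old_seconds / new_seconds

for name, v1, v3 in TIMINGS:
    print(f"{name:<14} {v1:>6.2f}s -> {v3:>5.2f}s  ({speedup(v1, v3):.1f}x)")
```

On these numbers the V3.0 runs are roughly 13 times faster for the count, 40 times for the join and 20 times for the grouped count.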
Questions?
http://www.ibm.com/software/data/bigdata/
BigInsights free VM and Installer for non-commercial use:
ibm.co/quickstart
Twitter: @RomeoKienzler, @IBMEcosystem_DE, @IBM_ISV_Alps
