1
Extending Analytic Reach:
From The Warehouse to The Data Lake
Mike Limcaco | CTO
2017 Big Data Day LA
University of Southern California | 2017-08-06
2
(Most) Data is Dark
http://bit.ly/2k4fDJQ
3
4
Big Data
Enormous Data
5
Warehouse
(e.g. Amazon Redshift)
Vast Object Storage Domain
(e.g. Data Lake on Amazon S3)
6
Warehouse
(e.g. Amazon Redshift)
Vast Object Storage Domain
(e.g. Data Lake on Amazon S3)
The Virtual Warehouse
7
The Emerging Analytics Architecture (AWS)
Storage | Serverless Compute | Data Processing
• Amazon S3: Datalake Storage
• AWS Glue Data Catalog: Hive-compatible Metastore
• Amazon Kinesis: Streaming
• Amazon Redshift Spectrum: Warehouse-Datalake Bridge
• AWS Lambda: Triggered Code
• Amazon Redshift: PB-scale MPP Warehouse
• Amazon Athena: SQL as a Service
• Amazon EMR: Hadoop as a Service
• AWS Glue: ETL
8
The Emerging Analytics Architecture (AWS)
• Amazon S3: Datalake Storage
• Amazon Redshift Spectrum: Warehouse-Datalake Bridge
• Amazon Redshift: PB-scale MPP Warehouse
• Amazon EMR: Hadoop as a Service
9
Pick one …
Hadoop (e.g. EMR)
• Direct access to object store (S3)
• Scale out to thousands of nodes
• Open Data Formats
• Popular big data frameworks
• Developer-friendly
SQL-Based Warehousing (e.g. Amazon Redshift)
• Fast local disk performance
• Sophisticated query optimization
• Join-optimized
• Familiar DW/BI workflows
11
Amazon Redshift Spectrum
Run SQL queries against S3
• Leverages Amazon Redshift's advanced cost-based optimizer
• Pushes down projections, filters, aggregations and join reduction
• Dynamic partition pruning to minimize data processed
• Automated parallelization of query execution against S3 data
• Efficient join processing with the Amazon Redshift cluster
[Diagram labels: App, SQL Client, JDBC/ODBC, Amazon Redshift (MPP Warehouse), Spectrum Bridge, HTTP, Amazon S3 Storage (Data Lake Object Storage)]
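The partition pruning bullet above can be illustrated with a small sketch. It assumes Hive-style `key=value` prefixes in S3 object keys (for example `year=2015/`); the key layout, the `parse_partition` and `prune_partitions` helpers, and the sample keys are all hypothetical and are not taken from the talk or from Spectrum's internals.

```python
from typing import Callable, Dict, Iterable, List


def parse_partition(key: str) -> Dict[str, str]:
    """Extract Hive-style key=value segments from an object key."""
    parts: Dict[str, str] = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts


def prune_partitions(keys: Iterable[str],
                     predicate: Callable[[Dict[str, str]], bool]) -> List[str]:
    """Keep only the object keys whose partition values satisfy the predicate."""
    return [k for k in keys if predicate(parse_partition(k))]


# Hypothetical object keys under a year-partitioned events prefix.
object_keys = [
    "events/year=1994/part-0000.parquet",
    "events/year=1999/part-0000.parquet",
    "events/year=2013/part-0000.parquet",
    "events/year=2015/part-0000.parquet",
]

# Rough equivalent of WHERE year >= 2013: the older objects are never read.
to_scan = prune_partitions(object_keys, lambda p: int(p["year"]) >= 2013)
print(to_scan)
```

The point of the sketch is only that the filter is evaluated against partition metadata before any file is opened, which is what keeps the amount of data processed small.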
12
Amazon Redshift Spectrum
The Enormous
Virtual Warehouse
13-19
Query Flow
[Diagram labels, repeated on each slide: JDBC/ODBC, Amazon Redshift (nodes 1 2 3 4 ... N), Spectrum, HTTP, Amazon S3 (Exabyte-scale object storage), Data Catalog / Apache Hive Metastore]
1. Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY…
2. Query is optimized and compiled; the plan is sent to all compute nodes
3. Compute nodes dynamically prune partitions based on Catalog info
4. Spectrum nodes scan S3, then project, filter, and aggregate
5. Final aggregations and joins on local tables are done in-cluster
6. Results are sent back to the client
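Steps 4 and 5 of the query flow can be mimicked in miniature. The sketch below is only an analogy in plain Python: each worker computes a partial COUNT(*) per group over its own slice of rows, and a final merge combines the partial results. The names `partial_count` and `merge_counts`, the row shape, and the sample rows are hypothetical; nothing here reflects how Spectrum is actually implemented.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Row = Tuple[str, str]  # (country, track); a hypothetical row shape


def partial_count(rows: List[Row]) -> Counter:
    """Scan one slice and aggregate locally: COUNT(*) GROUP BY country."""
    return Counter(country for country, _track in rows)


def merge_counts(partials: List[Counter]) -> Counter:
    """Final aggregation: sum the per-slice partial counts."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Each inner list stands in for the data assigned to one scanning worker.
slices = [
    [("USA", "Wannabe"), ("USA", "Ice Ice Baby")],
    [("Finland", "Songbird")],
    [("USA", "Suit and Tie"), ("Finland", "Wannabe")],
]

# Scatter: workers aggregate their slices in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_count, slices))

# Gather: only the small partial results are merged centrally.
result = merge_counts(partials)
print(dict(result))
```

Aggregating close to the data means only small per-group counts travel back for the final step, rather than every scanned row.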
Demo
21
Domain Model
[Diagram labels: Data Pond, Data Lake, Dimensions (three boxes), Facts (Online), Facts (Archive), Data (RAW), Data (Other)]
22
LastFM Music Streaming Events
Horizontal Partitioning

User Profile
Datetime   User_ID   Country
2007       Mike      USA
2008       Jack      Finland

Streaming Events (RECENT)
Datetime       User_ID   Track          Artist
2015 5:00pm    Alice     Songbird       Kenny G
2013 11:14pm   Mike      Suit and Tie   Justin Timberlake

Streaming Events (ARCHIVE)
Datetime       User_ID   Track          Artist
1999 5:15pm    Mike      Ice Ice Baby   Vanilla Ice
1994 4:48pm    Mike      Wannabe        Spice Girls

Colder (toward ARCHIVE)
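A minimal sketch of the horizontal split shown on this slide, assuming events are routed by year against a cutoff. The cutoff year (2010), the `Event` type, and the `split_by_age` helper are hypothetical; the slides do not say where the RECENT/ARCHIVE boundary sits.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    year: int
    user_id: str
    track: str
    artist: str


def split_by_age(events: List[Event],
                 cutoff_year: int) -> Tuple[List[Event], List[Event]]:
    """Return (recent, archive): rows at or after the cutoff stay online,
    older rows are treated as cold."""
    recent = [e for e in events if e.year >= cutoff_year]
    archive = [e for e in events if e.year < cutoff_year]
    return recent, archive


# Sample rows taken from the slide's tables.
events = [
    Event(2015, "Alice", "Songbird", "Kenny G"),
    Event(2013, "Mike", "Suit and Tie", "Justin Timberlake"),
    Event(1999, "Mike", "Ice Ice Baby", "Vanilla Ice"),
    Event(1994, "Mike", "Wannabe", "Spice Girls"),
]

# Assumed cutoff; chosen only so the split matches the slide's two tables.
recent, archive = split_by_age(events, cutoff_year=2010)
print([e.year for e in recent], [e.year for e in archive])
```

Both halves keep the same columns; only the rows differ, which is what allows the two tables to be queried together later.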
23
Colder
24
Dimensions
FACTS (Online)
Facts
(ARCHIVE)
Amazon S3
Redshift
Spectrum
Glue Catalog
Athena
25
Register EXTERNAL S3 Table
-- Hive-style DDL: exposes the Parquet files under the S3 prefix as a table
create external table lastfm_music_streaming_events
(
  userid      string,
  datetime    timestamp,
  artist_id   string,
  artist_name string,
  track_id    string,
  track_name  string
)
stored as parquet
location 's3://my-archived-facts/lastfm/parquet/events/';
26
Query Redshift ONLINE Data
SELECT
  u.country, COUNT(*) AS plays, 'REDSHIFT' AS source
FROM
  lastfm_users u,
  lastfm_music_streaming_events s
WHERE
  u.userid = s.userid
GROUP BY
  u.country
27
28
Query Redshift ONLINE + ARCHIVED S3 Data
-- Local Redshift Tables
SELECT …
FROM
  lastfm_users u,
  lastfm_music_streaming_events s
WHERE
  u.userid = s.userid
…
UNION
-- External S3 Data
SELECT …
FROM
  lastfm_users u,
  datalake.lastfm_music_streaming_events dl
WHERE
  u.userid = dl.userid
…
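The combined query can be tried locally with a stand-in. The sketch below uses Python's built-in `sqlite3` module, not Redshift: a second in-memory database attached under the name `datalake` plays the role of the external schema. The SELECT lists that are elided on the slide are filled in here by analogy with the ONLINE-only query, which is an assumption, as are the `'S3'` source label, the simplified table columns, and the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the external schema: a separate database attached as "datalake".
conn.execute("ATTACH DATABASE ':memory:' AS datalake")

conn.executescript("""
CREATE TABLE lastfm_users (userid TEXT, country TEXT);
CREATE TABLE lastfm_music_streaming_events (userid TEXT, track_name TEXT);
CREATE TABLE datalake.lastfm_music_streaming_events (userid TEXT, track_name TEXT);

INSERT INTO lastfm_users VALUES ('Mike', 'USA'), ('Jack', 'Finland');
-- Online (recent) events live in the local table.
INSERT INTO lastfm_music_streaming_events VALUES ('Mike', 'Suit and Tie');
-- Archived events live in the attached "datalake" database.
INSERT INTO datalake.lastfm_music_streaming_events VALUES
    ('Mike', 'Ice Ice Baby'), ('Mike', 'Wannabe'), ('Jack', 'Wannabe');
""")

# SELECT lists are assumed; the slide elides them.
query = """
SELECT u.country, COUNT(*) AS plays, 'REDSHIFT' AS source
FROM lastfm_users u, lastfm_music_streaming_events s
WHERE u.userid = s.userid
GROUP BY u.country
UNION
SELECT u.country, COUNT(*) AS plays, 'S3' AS source
FROM lastfm_users u, datalake.lastfm_music_streaming_events dl
WHERE u.userid = dl.userid
GROUP BY u.country
"""

# Sort in Python so the output order does not depend on the engine.
rows = sorted(conn.execute(query).fetchall())
for row in rows:
    print(row)
conn.close()
```

Because UNION removes duplicate rows, the constant `source` column keeps the two halves distinct; UNION ALL would skip the deduplication step when both halves should always be kept.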
29
TL;DR
31
Summary
• Online warehousing can participate in extended data lake operations
• External tables in Internet-scale object storage (S3) can be shared between:
  - Hadoop workloads (EMR)
  - Serverless SQL as a Service (Athena)
  - SQL-based MPP Warehousing (Redshift)
• You can readily tap extra capacity, concurrency, and throughput via Amazon Redshift Spectrum
Thank You
mike@agilisium.com
contact@agilisium.com
careers@agilisium.com
2629 Townsgate Road Suite 235
Westlake Village, CA 91361

