Extending Analytic Reach - From The Warehouse to The Data Lake by Mike Limcaco

1
Extending Analytic Reach:
From The Warehouse to The Data Lake
Mike Limcaco | CTO
2017 Big Data Day LA
University of Southern California | 2017-08-06

2
(Most) Data is Dark
http://bit.ly/2k4fDJQ

5
Warehouse
(e.g. Amazon Redshift)
Vast Object Storage Domain
(e.g. Data Lake on Amazon S3)

6
Warehouse
Vast Object Storage Domain
(e.g. Data Lake on Amazon S3)
The Virtual Warehouse

7
The Emerging Analytics Architecture (AWS)
Storage | Serverless Compute | Data Processing
Amazon S3: Datalake Storage
AWS Glue Data Catalog: Hive-compatible Metastore
Amazon Kinesis: Streaming
Amazon Redshift Spectrum: Warehouse-Datalake Bridge
AWS Lambda: Triggered Code
Amazon Redshift: PB-scale MPP Warehouse
Amazon Athena: SQL as a Service
Amazon EMR: Hadoop as a Service
AWS Glue: ETL

8
The Emerging Analytics Architecture (AWS)
Amazon S3: Datalake Storage
Warehouse-Datalake Bridge
Amazon Redshift: PB-scale MPP Warehouse
Amazon EMR: Hadoop as a Service

9
Pick one …
Hadoop (e.g. EMR)
• Direct access to object store (S3)
• Scale out to thousands of nodes
• Open Data Formats
• Popular big data frameworks
• Developer-friendly
SQL-Based Warehousing
• Fast local disk performance
• Sophisticated query optimization
• Join-optimized
• Familiar DW/BI workflows

11
Run SQL queries against S3
• Leverages the Amazon Redshift advanced cost-based optimizer
• Pushes down projections, filters, aggregations and join reduction
• Dynamic partition pruning to minimize data processed
• Automated parallelization of query execution against S3 data
• Efficient join processing with the Amazon Redshift cluster
[Diagram: App / SQL Client, JDBC/ODBC, Amazon Redshift (MPP Warehouse), Spectrum (Bridge), HTTP, Amazon S3 Storage (Data Lake Object Storage)]
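Note (illustrative, not from the original deck): a minimal sketch of a query shaped to benefit from the partition pruning and pushdown listed above. The schema `datalake` is assumed to be an external schema; the table `events_by_year`, the partition column `event_year`, the local table `users`, and the bucket path are hypothetical names chosen for illustration.

```sql
-- Hypothetical external table partitioned by year (all names and paths are assumptions).
create external table datalake.events_by_year (
  userid     varchar(64),
  track_name varchar(256)
)
partitioned by (event_year int)
stored as parquet
location 's3://example-bucket/events/';

-- Each partition is registered with its own S3 prefix.
alter table datalake.events_by_year
add partition (event_year = 1999)
location 's3://example-bucket/events/event_year=1999/';

-- The filter on the partition column allows other partitions to be skipped;
-- only the referenced columns need to be read from the Parquet files,
-- and the join against the local table happens in the Redshift cluster.
select u.country, count(*) as plays
from datalake.events_by_year e
join users u on u.userid = e.userid
where e.event_year = 1999
group by u.country;
```

How much work is actually pushed down depends on the query and the data layout, so the plan is worth checking rather than assuming.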

12
Run SQL queries against S3
[Diagram: App / SQL Client, JDBC/ODBC, The Enormous Virtual Warehouse]

13
Query Flow
[Diagram: client connection (JDBC/ODBC), Amazon Redshift, nodes 1, 2, 3, 4 ... N, Spectrum, HTTP, Amazon S3 (exabyte-scale object storage), Data Catalog / Apache Hive Metastore]

14
Query Flow
Step 1: Query
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY…
[Diagram as on slide 13, step 1 marked]

15
Query Flow
Step 2: Query optimized & compiled. Plan sent to all Compute Nodes
[Diagram as on slide 13, step 2 marked]

16
Query Flow
Step 3: Compute nodes dynamically prune partitions based on Catalog info
[Diagram as on slide 13, step 3 marked]

17
Query Flow
Step 4: Spectrum nodes scan S3, then project, filter, and aggregate
[Diagram as on slide 13, step 4 marked]

18
Query Flow
Step 5: Final aggregations and joins on local tables done in-cluster
[Diagram as on slide 13, step 5 marked]

19
Query Flow
Step 6: Results sent back to client
[Diagram as on slide 13, step 6 marked]
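Note (illustrative, not from the original deck): one way to observe the query flow above is to inspect the plan and then the S3-side statistics. This sketch assumes the external table `datalake.lastfm_music_streaming_events` used later in the deck is already registered. The system view `svl_s3query_summary` and the column names below are as I recall them from the Redshift documentation and should be checked against your cluster version.

```sql
-- Show the plan without running the query. Plan steps whose names begin
-- with "S3" are expected to represent work done against S3 data.
explain
select count(*)
from datalake.lastfm_music_streaming_events
group by artist_name;

-- After running the query itself, summarize the S3 side of the
-- most recent query in this session.
select query, elapsed,
       s3_scanned_rows, s3_scanned_bytes,
       s3query_returned_rows, s3query_returned_bytes,
       files
from svl_s3query_summary
where query = pg_last_query_id();
```

Comparing scanned bytes with returned bytes gives a rough sense of how much filtering and aggregation happened before data reached the cluster.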

21
Domain Model
[Diagram labels: Dimensions (x3), Facts (Online), Data Pond, Data Lake, Data (RAW), Facts (Archive), Data (Other)]

22
LastFM Music Streaming Events
Horizontal Partitioning
User Profile
  Datetime  User_ID  Country
  2007      Mike     USA
  2008      Jack     Finland
Streaming Events (RECENT)
  Datetime      User_ID  Track         Artist
  2015 5:00pm   Alice    Songbird      Kenny G
  2013 11:14pm  Mike     Suit and Tie  Justin Timberlake
Streaming Events (ARCHIVE), colder
  Datetime      User_ID  Track         Artist
  1999 5:15pm   Mike     Ice Ice Baby  Vanilla Ice
  1994 4:48pm   Mike     Wannabe       Spice Girls

24
[Diagram labels: Dimensions, FACTS (Online), Facts (ARCHIVE), Amazon S3, Redshift, Spectrum, Glue Catalog, Athena]

25
Register EXTERNAL S3 Table
create external table lastfm_music_streaming_events
(
    userid      string,
    datetime    timestamp,
    artist_id   string,
    artist_name string,
    track_id    string,
    track_name  string
)
stored as parquet
location 's3://my-archived-facts/lastfm/parquet/events/';
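Note (illustrative, not from the original deck): for Redshift to see a table registered in the shared catalog, an external schema is typically mapped to the catalog database first. This is a minimal sketch. The schema name `datalake` matches the later query slide, while the catalog database name `lastfm_archive` and the IAM role ARN are placeholders, not values from the talk.

```sql
-- Map a Redshift external schema onto a data catalog database.
-- 'lastfm_archive' and the role ARN are hypothetical placeholders.
create external schema datalake
from data catalog
database 'lastfm_archive'
iam_role 'arn:aws:iam::123456789012:role/ExampleSpectrumRole'
create external database if not exists;

-- List the external tables visible through that schema.
select schemaname, tablename, location
from svv_external_tables
where schemaname = 'datalake';
```

Tables registered in that catalog database, for example through Athena or Glue, should then be addressable from Redshift as `datalake.<table_name>`.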

26
Query Redshift ONLINE Data
SELECT
    u.country, COUNT(*) AS plays, 'REDSHIFT' AS source
FROM
    lastfm_users u,
    lastfm_music_streaming_events s
WHERE
    u.userid = s.userid
GROUP BY
    u.country

28
Query Redshift ONLINE + ARCHIVED S3 Data
-- Local Redshift Tables
SELECT …
FROM
    lastfm_users u,
    lastfm_music_streaming_events s
WHERE
    u.userid = s.userid
…
UNION
-- External S3 Data
SELECT …
FROM
    lastfm_users u,
    datalake.lastfm_music_streaming_events dl
WHERE
    u.userid = dl.userid
…
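Note (illustrative, not from the original deck): the slide elides the select lists, so the following is only one possible shape for a combined query, assuming both branches count plays per country as in the ONLINE-only query. It uses UNION ALL rather than UNION because UNION removes duplicate rows, which could silently merge identical per-country counts from the two branches. The `source` literals and the outer aggregation are assumptions.

```sql
-- Sketch: total plays per country across online (local) and archived (S3) facts.
select country, sum(plays) as plays
from (
    -- Local Redshift tables
    select u.country, count(*) as plays, 'REDSHIFT' as source
    from lastfm_users u
    join lastfm_music_streaming_events s on u.userid = s.userid
    group by u.country

    union all

    -- External S3 data, reached through the external schema
    select u.country, count(*) as plays, 'S3' as source
    from lastfm_users u
    join datalake.lastfm_music_streaming_events dl on u.userid = dl.userid
    group by u.country
) combined
group by country
order by plays desc;
```

Dropping the outer aggregation and keeping the `source` column instead would show the online and archived counts side by side.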

31
Summary
• Online warehousing can participate in extended data lake operations
• External tables in Internet-scale object storage (S3) can be shared between
    • Hadoop workloads (EMR)
    • Serverless SQL as a Service (Athena)
    • SQL-based MPP Warehousing (Redshift)
• You can readily tap extra capacity, concurrency, throughput via

Thank You
mike@agilisium.com
contact@agilisium.com
careers@agilisium.com
2629 Townsgate Road Suite 235
Westlake Village, CA 91361
