Building The Enterprise Data Lake
Important Considerations Before You Jump In
December 8, 2015

Today’s Presenters
▪  Mark Madsen, Industry Analyst, Third Nature (@markmadsen)
▪  Craig Stewart, Sr. Dir. Product Management, SnapLogic (@01Badger)
▪  Erin Curtis, Sr. Dir. Product Marketing, SnapLogic (@erncrts)

Building the Enterprise Data Lake
Considerations before you jump in
December, 2015
Mark Madsen
www.ThirdNature.net
@markmadsen1
  
What This Session Isn’t
SQL... SQL! SQL? SQL

The craft model of information delivery does not scale
  
© Third Nature, Inc.
  
So we shifted to data publishing
Industrialized data delivery for self-service access.
  
Events and sensors are a relatively new data source
Sensor data doesn’t fit well with current methods of modeling, collection and storage, or with the technology to process and analyze it.
  
There’s lots of other new data involved
  
You can store this data in an RDBMS, but…
  
These sorts of things slow user requests down
Conclusion: any methodology built on the premise that you must know and model all the data first is untenable
  
Analytics embiggens data volume problems
Many of the processing problems are O(n^2) or worse, so moderate data can be a problem for scale-up platforms
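To make the scaling claim concrete, here is a minimal sketch in Python. It assumes a hypothetical all-pairs comparison (for example, matching every record against every other record) as a stand-in for an O(n^2) analytic workload. The function name and the record counts are illustrative assumptions, not taken from the presentation.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Illustrative record counts (assumed, not from the slides).
    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(f"{n:>9,} records -> {pairwise_comparisons(n):>15,} comparisons")
```

Growing the data 10x grows the work roughly 100x. Under this assumption, 100,000 records already imply about 5 billion comparisons, which is why a data set that looks moderate in row count can outgrow a single scale-up server.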
  
Old market says: There’s nothing wrong with what you have, just keep buying new products from us
  
The emerging big data market has an answer…
  
The data lake
  
Views of the lake
Is the business vs supports the business?
Application vs infrastructure?
  
The naïve idea of a data lake leads to predictable results
  
You can’t install Hadoop and hope it solves all the problems
Big data no 2
  
The answer isn’t just technology, it’s architecture
  
In the DW world both data and processing are bounded
▪  Schema
▪  No consideration for feedback loops and change
▪  Processing only happens here
▪  Carefully controlled access here
▪  Nobody here creates new information
▪  Sources few and well understood
▪  Complex DI is controlled by IT
▪  Schemas are few and designed
▪  Tools are authorized, few in number and kind
▪  One way flow
This is a monolithic, layered architecture
  
In the big data world flow is unbounded and continuous
▪  Feedback loops allowed
▪  End-of-analysis dataset may be start of a BI dataset
▪  Continuous data integration and delivery
▪  Files are back as both input and storage
▪  Minimal barrier of / control on collection
▪  Areas of provisioned data
▪  Any shape in, rectangles out
This needs a distributed service architecture
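One way to picture "any shape in, rectangles out" is the sketch below. It assumes the incoming data are nested, JSON-like records with differing fields, and that the "rectangle" is a table with one column per flattened field. The function names, the dotted column naming and the sample events are hypothetical choices for illustration.

```python
from __future__ import annotations

from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with dots."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_rectangle(records: list[dict[str, Any]]) -> tuple[list[str], list[list[Any]]]:
    """Return (columns, rows); fields a record lacks are filled with None."""
    flat_records = [flatten(r) for r in records]
    columns = sorted({name for r in flat_records for name in r})
    rows = [[r.get(col) for col in columns] for r in flat_records]
    return columns, rows


if __name__ == "__main__":
    # Two made-up sensor events with different shapes.
    events = [
        {"device": {"id": "a1"}, "temp": 21.5},
        {"device": {"id": "b2", "fw": "1.2"}, "humidity": 40},
    ]
    columns, rows = to_rectangle(events)
    print(columns)
    for row in rows:
        print(row)
```

The columns are discovered from the data at read time rather than designed up front, which is the contrast with the bounded DW flow on the previous slide.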
  
Deconstructing data environments
There are three things happening in a data warehouse:
▪  Data acquisition
▪  Data management
▪  Data delivery
Isolate them from one another, allow read-write use, and you are on the path.
[Diagram: Data Warehouse]

Data lake subsystems / components
The acquisition component allows any data to be collected at any latency. The management component allows some data to be standardized and integrated. The access component provides access at any latency and via any means an application chooses. Processing can be done to any data at any time from any area.
▪  Data Acquisition (Collect & Store): incremental, batch, one-time copy, real time
▪  Data Management (Process & Integrate)
▪  Data Access (Deliver & Use)
▪  Data Lake Platform Services
▪  Data storage
In reality, you are building three systems, not one. Avoid the monolith.
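The separation into three systems can be sketched as three small classes that share nothing except a storage interface. This is an illustrative assumption about how the decoupling might look in code, not the presenter's design: the class names, the area names "raw" and "standardized", and the in-memory storage are all hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable


class Storage:
    """Shared data storage, organized into named areas (in memory here)."""

    def __init__(self) -> None:
        self._areas: dict[str, list[dict]] = defaultdict(list)

    def append(self, area: str, record: dict) -> None:
        self._areas[area].append(dict(record))

    def read(self, area: str) -> list[dict]:
        return [dict(r) for r in self._areas[area]]


class Acquisition:
    """Collect & store: accepts any record, with no model required up front."""

    def __init__(self, storage: Storage) -> None:
        self.storage = storage

    def collect(self, record: dict) -> None:
        self.storage.append("raw", record)


class Management:
    """Process & integrate: standardizes some of the raw data."""

    def __init__(self, storage: Storage, standardize: Callable[[dict], dict | None]) -> None:
        self.storage = storage
        self.standardize = standardize

    def run(self) -> int:
        """Standardize what can be standardized; return how many records were."""
        count = 0
        for record in self.storage.read("raw"):
            cleaned = self.standardize(record)
            if cleaned is not None:
                self.storage.append("standardized", cleaned)
                count += 1
        return count


class Access:
    """Deliver & use: reads from whichever area the application chooses."""

    def __init__(self, storage: Storage) -> None:
        self.storage = storage

    def query(self, area: str, predicate: Callable[[dict], bool] = lambda r: True) -> list[dict]:
        return [r for r in self.storage.read(area) if predicate(r)]
```

Because each class depends only on the storage interface, any one of them could be replaced or scaled without touching the other two. Note that only some records reach the standardized area, while access can still read the raw area directly.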
  
Data lake functions depend on platform services
▪  Data Acquisition (Collect & Store)
▪  Data Management (Process & Integrate)
▪  Data Access (Deliver & Use)
Platform services needed:
▪  Base Platform Services
▪  Data Movement
▪  Metadata
▪  Data Persistence
▪  Workflow Management
▪  Processing Engines
▪  Dataflow Services
▪  Data Curation
▪  Data Access Services
  
DATA ARCHITECTURE
We’re so focused on the light switch that we’re not talking about the light
  
Decouple the Data Architecture
The core of the data lake isn’t a database or HDFS, it’s the data architecture that the tools implement.
We need a data architecture that is not limiting:
▪  Deals with change easily and at scale
▪  Does not enforce requirements and models up front
▪  Does not limit the format or structure of data
▪  Assumes the range of data latencies in and out, from streaming to one-time bulk
  
Food supply chain: an analogy for data
Multiple contexts of use, differing quality levels
You need to keep the original because, just like baking, you can’t unmake dough once it’s mixed.
  
Data architecture is required by the services, and vice versa
▪  Raw data in an immutable storage area
▪  Standardized or enhanced data
▪  Common or usage-specific data
▪  Transient data
▪  Data Acquisition (Collect & Store)
▪  Data Management (Process & Integrate)
▪  Data Access (Deliver & Use)
▪  Platform Services
  
The data areas map (mostly) to functional areas of the lake
Collection can’t be limited by database scale and latency. Immutability, persistence and concurrency are required.
▪  Collect: incremental, batch, one-time copy, real time
▪  Manage & Integrate
▪  Process, Deliver, Use
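The immutability requirement for collected data can be illustrated with a write-once store. The sketch below is one possible approach and an assumption of this example, not something the slides prescribe: it keys each object by the SHA-256 hash of its bytes, so existing content can never be overwritten. It runs in memory only and does not address the persistence or concurrency requirements.

```python
import hashlib


class ImmutableRawStore:
    """Write-once store: objects are keyed by the SHA-256 of their bytes."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, payload: bytes) -> str:
        """Store payload and return its key."""
        key = hashlib.sha256(payload).hexdigest()
        # Same bytes always map to the same key, so a repeat write is a no-op
        # and existing content is never overwritten.
        self._objects.setdefault(key, payload)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)
```

There is deliberately no update or delete method. A corrected record is written as a new object with a new key, and the original stays available for reprocessing.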
  
Stages, not layers
Some tools require specific repositories or models. Others can reach in to get what they need. Do not enforce a single access point or model.
  
The geography has been redefined
The box IT created:
• not any data, rigidly typed data
• not any form, tabular rows and columns of typed data
• not any latency, persist what the DB can keep up with
• not any process, only queries
The digital world was diminished to only what’s inside the box until we forgot the box was there.
  
Layered data architecture
The DW assumed a single flat model of data, DB in the center.
The data lake enables new ways to organize data:
▪  Raw – straight from the source
▪  Enhanced – cleaned, standardized
▪  Integrated – modeled, augmented, ~semi-persistent
▪  Derived – analytic output, pattern based sets, ephemeral
Implies a new technology architecture and data modeling approaches.
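The four ways of organizing data listed above can be sketched as labeled areas, with each dataset recording which area it lives in and what it was produced from. Everything here is a hypothetical illustration: the `Dataset` fields, the `enhance` function and its trim-and-lowercase cleaning rule are assumptions, chosen only to show a raw dataset feeding an enhanced one without being altered.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Area(Enum):
    RAW = "raw"                # straight from the source
    ENHANCED = "enhanced"      # cleaned, standardized
    INTEGRATED = "integrated"  # modeled, augmented, semi-persistent
    DERIVED = "derived"        # analytic output, ephemeral


@dataclass(frozen=True)
class Dataset:
    name: str
    area: Area
    rows: tuple
    source: Optional[str] = None  # name of the dataset this one was built from


def enhance(raw: Dataset) -> Dataset:
    """Trim and lower-case string values; the raw dataset is left untouched."""
    cleaned = tuple(
        {k: v.strip().lower() if isinstance(v, str) else v for k, v in row.items()}
        for row in raw.rows
    )
    return Dataset(name=f"{raw.name}_clean", area=Area.ENHANCED, rows=cleaned, source=raw.name)


if __name__ == "__main__":
    raw = Dataset("orders", Area.RAW, ({"city": "  Boston "}, {"city": "DENVER", "qty": 2}))
    clean = enhance(raw)
    print(clean.area.value, clean.source, clean.rows)
```

Keeping the source name on each dataset is one simple way to trace an enhanced or derived set back to the raw data it came from.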
  
The data lake enables evolutionary design for data
Evolutionary design is required because data needs change. You need a system not for stability – we have that in the DW – but for evolution and change, the data lake.
▪  Data Acquisition (Collect & Store): incremental, batch, one-time copy, real time
▪  Data Management (Process & Integrate)
▪  Data Access (Deliver & Use)
▪  Data Lake Platform Services
▪  Data storage
You can’t build this all at once. You need to grow it over time.
  
Away from “one throat to choke”, back to best of breed
Tight coupling leads to efficient reuse and standardization, and to slow changes.
In a rapidly evolving market, componentized architectures, modularity and loose coupling are preferable to monolithic stacks, single-vendor architectures and tight coupling.
Architecture, not blueprints: there is no single answer. It depends on your goals and starting position.
  
Questions?
“When a new technology rolls over you, you're either part of the steamroller or part of the road.” – Stewart Brand
  
CC Image Attributions
Thanks to the people who supplied the creative commons licensed images used in this presentation:
donuts_4_views.jpg - http://www.flickr.com/photos/le_hibou/76718773/
glass_buildings.jpg - http://www.flickr.com/photos/erikvanhannen/547701721
  
About the Presenter
Mark Madsen is president of Third Nature, a consulting and advisory firm focused on analytics, business intelligence and data management. Mark is an award-winning author, architect and CTO. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes, and a member of the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
  
About Third Nature
Third Nature is a consulting and advisory firm focused on new and emerging technology and practices in information strategy, analytics, business intelligence and data management. If your question is related to data, analytics, information strategy and technology infrastructure then you’re at the right place.
Our goal is to help organizations solve problems using data. We offer education, consulting and research services to support business and IT organizations as well as technology vendors.
We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in strategy and architecture, so we look at emerging technologies and markets, evaluating how technologies are applied to solve problems rather than evaluating product features.
  
About SnapLogic
SnapLogic helps enterprises connect data and applications faster
•  Anything: apps | APIs | things | data
•  Anytime: batch | streaming | real-time
•  Anywhere: on premises | in the cloud

Modern Architecture: Hybrid and Elastic
•  Streams: No data is stored/cached
•  Secure: 100% standards-based
•  Elastic: Scales out & handles data and app integration use cases
[Diagram: Metadata; Data; Databases; On Prem Apps; Big Data; Cloud Apps and Data; Cloud-Based Designer, Manager, Dashboard; Cloudplex; Groundplex; Hadooplex; Sparkplex; Firewall]

Data Lake
•  On Prem Apps and Data: ERP, CRM, RDBMS
•  Cloud Apps and Data: CRM, HCM, Social
•  IoT Data: Sensors, Wearables, Devices
•  Data Acquisition (Ingest): Collect and integrate data from multiple sources. Batch, Streaming, Real-time. Kafka, Sqoop, Flume. HDFS, AWS S3, MS Azure Blob
•  Data Management (Prepare): Add information and improve data. Spark, Python, Scala, Java, R, Pig
•  Data Access (Deliver): Organize and prepare data for visualization. HDFS, AWS S3, MS Azure Blob, Hive. Impala, HiveSQL, SparkSQL
•  Schedule and manage: Oozie, Ambari
•  Lakeshore Data Mart: MS Azure, AWS Redshift, …
•  BI / Analytics: Tableau, MS PowerBI / Azure, AWS QuickSight

The Modern Data Lake Powered by SnapLogic
•  On Prem Apps and Data: ERP, CRM, RDBMS
•  Cloud Apps and Data: CRM, HCM, Social
•  IoT Data: Sensors, Wearables, Devices
•  Data Acquisition (Ingest): Collect and integrate data from multiple sources. Batch, Streaming, Real-time. SnapLogic pipelines with standard mode execution
•  Data Management (Prepare): Sort, Aggregate, Join, Merge, Transform. SnapLogic abstracts and operationalizes with SnapReduce or Spark pipelines
•  Data Access (Deliver): Organize and prepare data for visualization. SnapLogic pipelines with standard mode execution
•  Schedule and manage: SnapLogic
•  Lakeshore Data Mart: MS Azure, AWS Redshift, …
•  BI / Analytics: Tableau, MS PowerBI / Azure, AWS QuickSight

Thank You
Watch SnapLogic in action:
video/snaplogic.com

Contact us:
info@snaplogic.com

Follow us on Twitter:
@SnapLogic


Building the Enterprise Data Lake - Important Considerations Before You Jump In

  • 1. Building The Enterprise Data Lake
 Important Considerations Before You Jump In December 8, 2015
  • 2. Building The Enterprise Data Lake Today’s Presenters Mark Madsen Industry Analyst Third Nature @markmadsen Craig Stewart Sr. Dir. Product Management SnapLogic @01Badger Erin Curtis Sr. Dir. Product Marketing SnapLogic @erncrts
  • 3. Building  the   Enterprise  Data  Lake   Considera6ons  before  you   jump  in             December,  2015     Mark  Madsen   www.ThirdNature.net   @markmadsen1  
  • 4. What  This  Session  Isn’t   SQL.. . SQL! SQL? SQL
  • 5. The  craB  model  of  informa6on  delivery  does  not  scale  
  • 6. ©  Third  Nature,  Inc.   So  we  shiBed  to  data  publishing   Industrialized  data  delivery  for  self-­‐service  access.  
  • 7. Events  and  sensors  are  a  rela6vely  new  data  source   Sensor  data  doesn’t  fit  well  with  current  methods  of  modeling,   collecEon  and  storage,  or  with  the  technology  to  process  and  analyze  it.  
  • 8. There’s  lots  of  other  new  data  involved  
  • 9. ©  Third  Nature,  Inc.   You  can  store  this  data  in  an  RDBMS,  but…  
  • 10. These  sorts  of  things  slow  user  requests  down   Conclusion:  any  methodology  built  on  the  premise  that  you   must  know  and  model  all  the  data  first  is  untenable    
  • 11. ©  Third  Nature,  Inc.   Analy6cs  embiggens  data  volume  problems   Many  of  the  processing  problems  are  O(n2)  or  worse,  so   moderate  data  can  be  a  problem  for  scale-­‐up  plaOorms  
  • 12. ©  Third  Nature,  Inc.   Old  market  says:  There’s  nothing  wrong  with  what   you  have,  just  keep  buying  new  products  from  us  
  • 13. The  emerging  big  data  market  has  an  answer…  
  • 14. ©  Third  Nature,  Inc.   The  data  lake  
  • 15. ©  Third  Nature,  Inc.   Views  of  the  lake   Is  the  business  vs  supports  the  business?   ApplicaEon  vs  infrastructure?  
  • 16. ©  Third  Nature,  Inc.   The  naïve  idea  of  a  data  lake  leads  to  predictable  results
  • 17. ©  Third  Nature,  Inc.   You  can’t  install  Hadoop  and  hope  it  solves  all  the  problems   Big  data  no  2  
  • 18. Slide 18 The  answer  isn’t  just  technology,  it’s  architecture  
  • 19. Schema In  the  DW  world  both  data  and  processing  are  bounded   No consideration for feedback loops and change Processing only happens here Carefully controlled access here Nobodyherecreates newinformation Sources few and well understood Complex DI is controlled by IT Schemas are few and designed Tools are authorized, few in number and kind One way flow This  is  a  monolithic,  layered  architecture  
  • 20. ©  Third  Nature,  Inc.   In  the  big  data  world  flow  is  unbounded  and  con6nuous   Feedback loops allowed End-of-analysis dataset may be start of a BI dataset Continuous data integration and delivery Files are back as both input and storage Minimal barrier of / control on collection Areas of provisioned data Any shape in, rectangles out This  needs  a  distributed  service  architecture  
  • 21. ©  Third  Nature,  Inc.   Deconstruc6ng  data  environments   There  are  three   things  happening  in  a   data  warehouse:   ▪  Data  acquisiEon   ▪  Data  management   ▪  Data  delivery   Isolate  them  from  one   another,  allow  read-­‐ write  use,  and  you  are   on  the  path.   Data Warehouse
  • 22. Data  lake  subsystems  /  components   The  acquisi6on  component  allows  any  data  to  be  collected  at  any  latency.  The   management    component  allows  some  data  to  be  standardized  and  integrated.  The   access  component  provides  access  at  any  latency  and  via  any  means  an  applica6on   chooses.  Processing  can  be  done  to  any  data  at  any  6me  from  any  area.   Data  AcquisiEon   Collect  &  Store   Incremental   Batch   One-­‐Eme  copy   Real  Eme   Data  Lake  PlaOorm  Services   Data  Management   Process  &  Integrate   Data  Access   Deliver  &  Use   Data  storage   In  reality,  you  are  building  three  systems,  not  one.  Avoid  the  monolith.  
  • 23. ©  Third  Nature,  Inc.   Data  lake  func6ons  depend  on  plaUorm  services   Base Platform Services Data Movement MetadataData Persistence Workflow Management Processing Engines Dataflow Services Data Curation Data Access Services Data  AcquisiEon   Collect  &  Store   Data  Management   Process  &  Integrate   Data  Access   Deliver  &  Use   PlaOorm  services  needed  
  • 24. DATA  ARCHITECTURE   We’re  so  focused  on  the  light  switch  that  we’re  not   talking  about  the  light  
  • 25. ©  Third  Nature,  Inc.   Decouple  the  Data  Architecture   The  core  of  the  data  lake  isn’t  a  database  or  HDFS,   it’s  the  data  architecture  that  the  tools  implement.     We  need  a  data  architecture  that  is  not  limiEng:   ▪  Deals  with  change  easily  and  at  scale   ▪  Does  not  enforce  requirements  and  models  up  front   ▪  Does  not  limit  the  format  or  structure  of  data   ▪  Assumes  the  range  of  data  latencies  in  and  out,  from   streaming  to  one-­‐Eme  bulk  
  • 26. ©  Third  Nature,  Inc.   Food  supply  chain:  an  analogy  for  data   MulEple  contexts  of  use,  differing  quality  levels                   You  need  to  keep  the  original  because  just  like  baking,   you  can’t  unmake  dough  once  it’s  mixed.  
  • 27. ©  Third  Nature,  Inc.   Data  architecture  is  required  by  the  services,  and  vice  versa   Raw data in an immutable storage area Standardized or enhanced data Common or usage- specific data Transient data Data  AcquisiEon   Collect  &  Store   PlaOorm  Services   Data  Access   Deliver  &  Use   Data  Management   Process  &  Integrate  
  • 28. ©  Third  Nature,  Inc.   The  data  areas  map  (mostly)  to  func6onal  areas  of  the  lake   CollecEon  can’t  be  limited  by  database  scale  and  latency.   Immutability,  persistence  and  concurrency  are  required.   Incremental   Collect   Batch   One-­‐Eme  copy   Real  Eme   Manage    &  Integrate   Process,  Deliver,  Use  
  • 29. ©  Third  Nature,  Inc.   Stages,  not  layers   Some  tools  require  specific  repositories  or  models.   Others  can  reach  in  to  get  what  they  need.  Do  not   enforce  a  single  access  point  or  model.  
  • 30. The geography has been redefined. The box IT created:
      ▪ not any data, rigidly typed data
      ▪ not any form, tabular rows and columns of typed data
      ▪ not any latency, persist what the DB can keep up with
      ▪ not any process, only queries
    The digital world was diminished to only what’s inside the box until we forgot the box was there.
  • 31. Layered data architecture. The DW assumed a single flat model of data, DB in the center. The data lake enables new ways to organize data:
      ▪ Raw: straight from the source
      ▪ Enhanced: cleaned, standardized
      ▪ Integrated: modeled, augmented, ~semi-persistent
      ▪ Derived: analytic output, pattern-based sets, ephemeral
    Implies a new technology architecture and data modeling approaches.
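The four data areas above can be sketched as a chain of functions, each reading from the previous area and leaving it untouched. This is an illustrative sketch only; the sample readings, the `device_sites` lookup, and the function names are assumptions made for the example.

```python
from statistics import mean

# Raw: straight from the source, kept unchanged (hypothetical readings).
raw = [
    {"device": " T-1 ", "temp_c": "21.5"},
    {"device": "t-1", "temp_c": "22.5"},
    {"device": "T-2", "temp_c": "19.0"},
]

# Hypothetical reference data used to augment the readings.
device_sites = {"t-1": "plant-a", "t-2": "plant-b"}


def enhance(records):
    """Enhanced: cleaned and standardized, one output row per raw row."""
    return [
        {"device": r["device"].strip().lower(), "temp_c": float(r["temp_c"])}
        for r in records
    ]


def integrate(records, sites):
    """Integrated: modeled and augmented with reference data."""
    return [{**r, "site": sites[r["device"]]} for r in records]


def derive(records):
    """Derived: analytic output, cheap to recompute, so it can be ephemeral."""
    by_site = {}
    for r in records:
        by_site.setdefault(r["site"], []).append(r["temp_c"])
    return {site: mean(values) for site, values in by_site.items()}


enhanced = enhance(raw)
integrated = integrate(enhanced, device_sites)
derived = derive(integrated)
```

Since `raw` is never modified, any later area can be rebuilt from it when cleaning rules or models change.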
  • 32. The data lake enables evolutionary design for data. Evolutionary design is required because data needs change. You need a system not for stability (we have that in the DW) but for evolution and change: the data lake. Data Acquisition (Collect & Store): incremental, batch, one-time copy, real time. Data Management (Process & Integrate). Data Access (Deliver & Use). Data Lake Platform Services. Data storage. You can’t build this all at once. You need to grow it over time.
  • 33. Away from “one throat to choke”, back to best of breed. Tight coupling leads to efficient reuse and standardization, and to slow changes. In a rapidly evolving market, componentized architectures, modularity and loose coupling are favorable over monolithic stacks, single-vendor architectures and tight coupling. Architecture, not blueprints: there is no single answer. It depends on your goals and starting position.
  • 34. Questions? “When a new technology rolls over you, you're either part of the steamroller or part of the road.” – Stewart Brand
  • 35. CC Image Attributions. Thanks to the people who supplied the Creative Commons licensed images used in this presentation:
      ▪ donuts_4_views.jpg - http://www.flickr.com/photos/le_hibou/76718773/
      ▪ glass_buildings.jpg - http://www.flickr.com/photos/erikvanhannen/547701721
  • 36. About the Presenter. Mark Madsen is president of Third Nature, a consulting and advisory firm focused on analytics, business intelligence and data management. Mark is an award-winning author, architect and CTO. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes, and a member of the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
  • 37. About Third Nature. Third Nature is a consulting and advisory firm focused on new and emerging technology and practices in information strategy, analytics, business intelligence and data management. If your question is related to data, analytics, information strategy and technology infrastructure then you’re at the right place. Our goal is to help organizations solve problems using data. We offer education, consulting and research services to support business and IT organizations as well as technology vendors. We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in strategy and architecture, so we look at emerging technologies and markets, evaluating how technologies are applied to solve problems rather than evaluating product features.
  • 39. SnapLogic helps enterprises connect data and applications faster.
      ▪ Anything: apps | APIs | things | data
      ▪ Anytime: batch | streaming | real-time
      ▪ Anywhere: on premises | in the cloud
  • 40. Modern Architecture: Hybrid and Elastic.
      ▪ Streams: No data is stored/cached
      ▪ Secure: 100% standards-based
      ▪ Elastic: Scales out & handles data and app integration use cases
    Diagram labels: Cloud-Based Designer, Manager, Dashboard; Metadata; Data; Firewall; Cloudplex, Groundplex, Hadooplex, Sparkplex; Databases, On Prem Apps, Big Data, Cloud Apps and Data.
  • 41. Data Lake (diagram).
      ▪ Sources: On Prem Apps and Data (ERP, CRM, RDBMS); Cloud Apps and Data (CRM, HCM, Social); IoT Data (Sensors, Wearables, Devices)
      ▪ Data Acquisition (Ingest): Collect and integrate data from multiple sources. HDFS, AWS S3, MS Azure Blob
      ▪ Data Management (Prepare): Add information and improve data. Spark, Python, Scala, Java, R, Pig
      ▪ Data Access (Deliver): Organize and prepare data for visualization. HDFS, AWS S3, MS Azure Blob, Hive
      ▪ Targets: Lakeshore Data Mart (MS Azure, AWS Redshift, …); BI / Analytics (Tableau, MS PowerBI / Azure, AWS QuickSight)
      ▪ Modes: Batch, Streaming, Real-time
      ▪ Schedule and manage: Oozie, Ambari
      ▪ Other tools shown: Kafka, Sqoop, Flume; Impala, HiveSQL, SparkSQL
  • 42. The Modern Data Lake Powered by SnapLogic (diagram).
      ▪ Sources: On Prem Apps and Data (ERP, CRM, RDBMS); Cloud Apps and Data (CRM, HCM, Social); IoT Data (Sensors, Wearables, Devices)
      ▪ Data Acquisition (Ingest): Collect and integrate data from multiple sources. SnapLogic pipelines with standard mode execution
      ▪ Data Management (Prepare): Sort, Aggregate, Join, Merge, Transform. SnapLogic abstracts and operationalizes with SnapReduce or Spark pipelines
      ▪ Data Access (Deliver): Organize and prepare data for visualization. SnapLogic pipelines with standard mode execution
      ▪ Targets: Lakeshore Data Mart (MS Azure, AWS Redshift, …); BI / Analytics (Tableau, MS PowerBI / Azure, AWS QuickSight)
      ▪ Modes: Batch, Streaming, Real-time (SnapLogic Pipelines)
      ▪ Schedule and manage: SnapLogic
  • 43. Thank You. Watch SnapLogic in action: video/snaplogic.com. Contact us: info@snaplogic.com. Follow us on Twitter: @SnapLogic