1
Scaling ETL
with Hadoop
Gwen Shapira, Solutions Architect
@gwenshap
gshapira@cloudera.com
2
ETL is…
• Extracting data from outside sources
• Transforming it to fit operational needs
• Loading it into the end target
• (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)
3
Hadoop Is…
• HDFS – Replicated, distributed data storage
• Map-Reduce – Batch oriented data processing at
scale. Parallel programming platform.
4
The Ecosystem
• High level languages and abstractions
• File, relational and streaming data integration
• Process Orchestration and Scheduling
• Libraries for data wrangling
• Low latency query language
5
6
Why ETL with Hadoop?
Got unstructured data?
• Traditional ETL:
• Text
• CSV
• XLS
• XML
• Hadoop:
• HTML
• XML, RSS
• JSON
• Apache Logs
• Avro, Protobuf, ORC, Parquet
• Compression
• Office, OpenDocument, iWork
• PDF, EPUB, RTF
• Midi, MP3
• JPEG, Tiff
• Java Classes
• Mbox, RFC822
• Autocad
• TrueType Parser
• HDF / NetCDF
8
Replace ETL Clusters
• Informatica is:
• Easy to program
• Complete ETL solution
• Hadoop is:
• Cheaper
• MUCH more flexible
• Faster?
• More scalable?
• You can have both
9
Data Warehouse Offloading
• DWH resources are
EXPENSIVE
• Reduce storage costs
• Release CPU capacity
• Scale
• Better tools
10
What I often see
• ETL Cluster
• ELT in DWH
• ETL in Hadoop
11
• ETL Cluster
• ETL Cluster with some Hadoop
12
We’ll Discuss:
Technologies Speed & Scale Tips & Tricks
Extract
Transform
Load
Workflow
13
14
Extract
Let me count the ways
• From Databases: Sqoop
• Log Data: Flume + CDK
• Or just write data to HDFS
15
Sqoop – The Balancing Act
16
Scale Sqoop Slowly
• Balance between:
• Maximizing network utilization
• Minimizing database impact
• Start with smallish table (1-10G)
• 1 mapper, 2 mappers, 4 mappers
• Where’s the bottleneck?
• vmstat, iostat, mpstat, netstat, iptraf
17
When Loading Files:
Same principles apply:
• Parallel Copy
• Add Parallelism
• Find Bottlenecks
• Resolve them
• Avoid Self-DDOS
18
Scaling Sqoop
• Split column - match index or partitions
• Compression
• Drivers + Connectors + Direct mode
• Incremental import
19
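Sqoop parallelizes by dividing the split column's value range among mappers, which is why matching the split column to an index or partition scheme matters. A rough Python sketch of that range computation (function name hypothetical, not Sqoop's actual code):

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide [min_id, max_id] into roughly equal ranges, one per mapper."""
    size = (max_id - min_id + 1) / num_mappers
    ranges = []
    lo = min_id
    for i in range(num_mappers):
        hi = min_id + round(size * (i + 1)) - 1
        if i == num_mappers - 1:
            hi = max_id  # last mapper absorbs any rounding remainder
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

print(split_ranges(1, 100, 4))  # → [(1, 25), (26, 50), (51, 75), (76, 100)]
```

If the split column's values are skewed (e.g. most ids cluster at the high end), some mappers get nearly all the work, which is another reason to pick the column carefully.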
Ingest Tips
• Use file system tricks to ensure consistency
• Directory structure:
/intent
/category
/application (optional)
/dataset
/partitions
/files
• Examples:
/data/fraud/txs/2011-01-01/20110101-00.avro
/group/research/model-17/training-tx/part-00000.txt
/user/gshapira/scratch/surge/
20
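A convention like this is easier to keep consistent when paths are generated in code rather than by hand. A minimal Python sketch (function name hypothetical) producing paths like the examples above:

```python
from datetime import date

def dataset_path(intent, category, dataset, day):
    """Build a partition path following the /intent/category/dataset/partition convention."""
    return "/{}/{}/{}/{}".format(intent, category, dataset, day.isoformat())

print(dataset_path("data", "fraud", "txs", date(2011, 1, 1)))
# → /data/fraud/txs/2011-01-01
```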
Ingest Tips
• External tables in Hive
• Keep raw data
• Trigger workflows on
file arrival
21
22
Transform
Endless Possibilities
• Map Reduce
(in any language)
• Hive (i.e. SQL)
• Pig
• R
• Shell scripts
• Plain old Java
• Morphlines
23
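"Map Reduce (in any language)" usually means Hadoop Streaming, where the mapper and reducer are plain scripts reading sorted key-value lines from stdin. A toy Python sketch of the two phases, run in-process here for illustration:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (word, 1) for every word, like a streaming mapper writing "word\t1".
    for word in line.split():
        yield (word.lower(), 1)

def reducer(pairs):
    # Pairs arrive grouped by key, as the shuffle/sort phase guarantees.
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["extract transform load", "transform load load"]
pairs = sorted(p for line in lines for p in mapper(line))  # stand-in for the shuffle
print(dict(reducer(pairs)))  # → {'extract': 1, 'load': 3, 'transform': 2}
```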
Prototype
24
Partitioning
• Hive
• Directory Structure
• Pre-filter
• Adds metadata
25
Tune Data Structures
• Joins are expensive
• Disk space is not
• De-normalize
• Store same data in
multiple formats
26
Map-Reduce
• Assembly language of data processing
• Simple things are hard, hard things are possible
• Use for:
• Optimization: Do in one MR job what Hive does in 3
• Optimization: Partition the data just right
• GeoSpatial
• Mahout – Map/Reduce machine learning
27
Parallelism – Unit of Work
• Amdahl’s Law
• Small Units
• That stay small
• One user?
• One day?
• Ten square meters?
28
Remember the Basics
• X reduce output means 3X disk IO and 2X network IO
• Fewer jobs = fewer reduces = less IO = faster and scalier
• Know your network and disk throughput
• Have rough idea of ops-per-second
29
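The 3X/2X figures follow from HDFS replication: with the default replication factor of 3, each block of reduce output is written to three disks, and two of those copies travel over the network. A back-of-the-envelope sketch:

```python
def hdfs_write_cost(output_gb, replication=3):
    """Estimate disk and network IO for writing reduce output to HDFS."""
    disk_gb = output_gb * replication           # one copy per replica
    network_gb = output_gb * (replication - 1)  # the local write skips the network
    return disk_gb, network_gb

print(hdfs_write_cost(10))  # → (30, 20)
```

Combined with known disk and network throughput, this gives a quick sanity check on whether a job is IO-bound before profiling anything.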
Instrumentation
• Optimize the right things
• Right jobs
• Right hardware
• 90% of the time –
it's not the hardware
30
Tips
Avoid the
“old data warehouse guy”
syndrome
31
Tips
• Slowly changing dimensions:
• Load changes
• Merge
• And swap
• Store intermediate results:
• Performance
• Debugging
• Store Source/Derived relation
32
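The load-changes / merge / swap pattern for slowly changing dimensions can be sketched in miniature (a dict-based toy, not a real Hive merge):

```python
def merge_dimension(current, changes, key="id"):
    """Load changes, merge with the current table, and return a new table to swap in."""
    merged = {row[key]: row for row in current}
    merged.update({row[key]: row for row in changes})  # changed/new rows win
    return list(merged.values())  # swap this in atomically, e.g. by renaming a directory

current = [{"id": 1, "city": "Tel Aviv"}, {"id": 2, "city": "Austin"}]
changes = [{"id": 2, "city": "Portland"}, {"id": 3, "city": "London"}]
print(sorted(merge_dimension(current, changes), key=lambda r: r["id"]))
```

The point of the swap step is that readers only ever see the old or the new table, never a half-merged one.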
Fault and Rebuild
• Tier 0 – raw data
• Tier 1 – cleaned data
• Tier 2 – transformations, lookups and denormalization
• Tier 3 - Aggregations
33
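The tiers compose: each one is derived from the tier below it, so a fault in any tier can be repaired by recomputing from Tier 0. A toy Python illustration of the idea:

```python
# Each tier is a pure function of the tier below it, so any tier can be
# rebuilt from the raw data after a fault.
raw = ["  3 ", "bad", "5", "4"]                             # Tier 0: raw data
cleaned = [s.strip() for s in raw if s.strip().isdigit()]   # Tier 1: cleaned data
enriched = [{"value": int(s), "squared": int(s) ** 2} for s in cleaned]  # Tier 2: derived
total = sum(row["value"] for row in enriched)               # Tier 3: aggregation
print(total)  # → 12
```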
Never Metadata I didn’t like
• Metadata is small data
• That changes frequently
• Solutions:
• Oozie
• Directory structure / Partitions
• Outside of HDFS
• HBase
• Cloudera Navigator
34
A Few Words about Real-Time ETL
• What does it even mean?
• Fast reporting?
• No delay from OLTP to DWH?
• Micro-batches make more sense:
• Aggregation
• Economy of scale
• Late data happens
• Near-line solutions
35
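Micro-batches handle late data gracefully because an old partition's aggregate can simply be updated when stragglers arrive in a later batch. A toy Python sketch:

```python
from collections import defaultdict

def micro_batch_counts(batches):
    """Aggregate per-hour counts across micro-batches; late events update their hour's bucket."""
    counts = defaultdict(int)
    for batch in batches:
        for hour, n in batch:
            counts[hour] += n  # a late record for an old hour is just re-added
    return dict(counts)

batches = [[("09:00", 5), ("10:00", 2)],
           [("10:00", 3), ("09:00", 1)]]  # second batch carries late 09:00 data
print(micro_batch_counts(batches))  # → {'09:00': 6, '10:00': 5}
```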
36
Load
Technologies
• Sqoop
• Fuse-DFS
• Oracle Connectors
• NoSQLs
37
Scaling
• Sqoop works better for some DBs
• Fuse-FS and DB tools give more control
• Load in parallel
• Directly to FS
• Partitions
• Do all formatting on Hadoop
• Do you REALLY need to load that?
38
How not to Load
• Most Hadoop customers don’t load data in bulk
• History can stay in Hadoop
• Load only aggregated data
• Or computation results – recommendations, reports.
• Most queries can run in Hadoop
• BI tools often run in Hadoop
39
40
Workflow Management
Tools
• Oozie
• Azkaban
• Pentaho Kettle
• Talend
• Informatica
41
} Oozie, Azkaban – Native Hadoop
} Pentaho Kettle, Talend – Kinda Open Source
Scaling Challenges
• Keeping track of:
• Code Components
• Metadata
• Integrations and Adapters
• Reports, results, artifacts
• Scheduling and Orchestration
• Cohesive System View
• Life Cycle
• Instrumentation, Measurement and Monitoring
42
My Toolbox
• Hue + Oozie:
• Scheduling + Orchestration
• Cohesive system view
• Process repository
• Some metadata
• Some instrumentation
• Cloudera Manager for monitoring
• … and way too many home grown scripts
43
Hue + Oozie
44
45
— Josh Wills
A lot is still missing
• Metadata solution
• Instrumentation and monitoring solution
• Lifecycle management
• Lineage tracking
Contributions are welcome
46
47


Editor's Notes

  • #25: Start with a portion of your data for fast iterations. Prototype with Impala / streaming. Start high level – tune as you go.
  • #29: Pregnancy takes 9 months, no matter how many women are assigned to it. But – elephants are pregnant for two years!
  • #32: 50% of Hadoop systems end up managed by the data warehouse team. And they want to do things as they always did – Kimball methodologies, time dimensions, etc. Not always a good idea, not always scalable, not always possible.