First Steps Towards a Data Lake:
Insight from Southwest Power Pool
Srinivas Kolluru and Russell Mason
10/26/2016
Background
• Data stored in historical data store for data analysis and reporting
• Near real-time data feeds
• ETL data feeds
• More data than initial expectations
• Almost 50% of data is less frequently used
9/23/2016 World of Watson 2016
Proof-of-Concept (PoC) Evaluation Criteria
Near-term requirements – Offload less frequently used data
• SQL query access to the data
• Minimal changes to the business queries
Long-term vision – Data Lake infrastructure
• ETL and real-time data feeds
• Transactional and analytics workloads
• Cost-effective model
• Ability to add compute and storage incrementally
• Long-term partnership
Proof-of-Concept (PoC) Evaluation
• Evaluated three different vendor technologies
• On-site PoC with three months of data and a sample of business queries
• Six-week evaluation of each vendor technology
• Chose the BigInsights product
  • Minimal business query modifications
  • Support for Netezza functions
  • Federated query capabilities between Netezza and BigInsights
  • Partnership
Data Lake Vision and Phase 1 Implementation
Data Lake Vision
[Diagram labels: data sources (ETL Data, DB Data, Streaming, API Data, BLOB Data); Landing Zone; Data Zone; Data Extraction, Data Refinement, Data Discovery, Data Analytics; Metadata, Security, Quality, Governance, Monitoring; Business Users, IT Users]
Data Lake Implementation
• Phase 1 – Offload less frequently used data; provide SQL query and federated query capabilities
• Phase 2 – Query performance improvements, transactional capabilities, and security controls
• Phase 3 – ETL data, streaming data, and governance
• Phase 4 – Real-time and BI analytics
Phase 1 – Data Offload Process
[Diagram labels: Source System; Landing Zone (ORC); Data Zone (ORC); data export and CRC check validation; ORC table import and CRC check validation; Metadata, Security, Monitoring]
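Both the export step and the import step in the offload process end with a CRC check validation. The deck does not show how the checksums are computed, so the following is only a minimal sketch of the idea, assuming CRC-32 over a delimiter-joined row; the function names (`row_crc`, `export_with_crc`, `validate_import`) and the sample rows are hypothetical.

```python
import zlib


def row_crc(row, delimiter="|"):
    """Return a CRC-32 checksum for one row (a sequence of field values)."""
    # Serialize fields deterministically; None becomes an empty field.
    # Caveat of this sketch: None and "" serialize identically, and values
    # containing the delimiter are not escaped.
    text = delimiter.join("" if value is None else str(value) for value in row)
    return zlib.crc32(text.encode("utf-8"))


def export_with_crc(rows):
    """Export side: pair every row with the checksum computed at the source."""
    return [(tuple(row), row_crc(row)) for row in rows]


def validate_import(exported):
    """Import side: recompute each checksum; return indexes of mismatched rows."""
    return [i for i, (row, crc) in enumerate(exported) if row_crc(row) != crc]


# Hypothetical sample data: (id, last update timestamp, value)
source_rows = [
    (1, "2016-01-15 08:00:00", 42.5),
    (2, "2016-02-03 13:30:00", None),
]

exported = export_with_crc(source_rows)
clean_result = validate_import(exported)

# Simulate a single-character corruption of row 0 in transit.
corrupted = list(exported)
corrupted[0] = ((1, "2016-01-15 08:00:00", 42.6), exported[0][1])
corrupt_result = validate_import(corrupted)

print(clean_result)    # []
print(corrupt_result)  # [0]
```

Computing the checksum once at the source and again after the load lets the two sides be compared row by row without trusting the transfer mechanism.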
Architecture and Design Decisions
• Custom export process instead of using Fluid Query or Sqoop for data transfer
• Fluid Query: additional burden on the source system
• Sqoop: performance challenges with the JDBC approach
• Limitations with incremental data pulls with the external table approach
• Limitations with using database views
• Difficulty with data validation checks
• Difficulty with partitioning data based on a record's last update date
• Operational controls to minimize the impact on the source system
Architecture and Design Decisions
• Data integrity
  • Row-level CRC checks in the data export and import processes
• ORC format
  • Monthly data partitioning
  • Better compression observed with ORC than with Parquet (query performance on large data sets has not been compared)
• Separate HDFS storage cluster
  • Scale storage independently of compute
  • Align with the IT strategy and processes
  • Expose the same set of files through multiple file system protocols
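The design decisions mention monthly data partitioning and, separately, partitioning by a record's last update date. As a rough illustration only, the sketch below assumes the monthly partition key is derived from that last update timestamp; the deck does not confirm this, and the names `month_partition` and `partition_rows` and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def month_partition(last_update):
    """Return a 'YYYY-MM' partition key for a last-update timestamp."""
    return last_update.strftime("%Y-%m")


def partition_rows(rows, ts_index):
    """Group rows by the month of the timestamp found at position ts_index."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[month_partition(row[ts_index])].append(row)
    return dict(partitions)


# Hypothetical sample data: (id, last update timestamp, value)
rows = [
    (1, datetime(2016, 1, 15, 8, 0), 42.5),
    (2, datetime(2016, 1, 31, 23, 59), 40.1),
    (3, datetime(2016, 2, 1, 0, 0), 39.8),
]

parts = partition_rows(rows, ts_index=1)
for key in sorted(parts):
    print(key, len(parts[key]))
# 2016-01 2
# 2016-02 1
```

With one partition per month, a query restricted to a date range only needs to read the partitions whose keys fall inside that range.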
Phase 1 Deployment Model
[Diagram labels: IBM BigInsights Cluster with Nodes #1 to #5, Hadoop, and a Postgres Metadata Database; HDFS Storage Cluster with Nodes #1 to #3]
BigInsights Component Deployment (Tentative)
Node #1: ZooKeeper Server, Hive Metastore, HiveServer2, Knox Gateway, WebHCat Server, BigInsights Home Server, BigSheets Master, Big SQL Worker, Metrics Monitor, and Node Manager.
Node #2: ZooKeeper Server, History Server, Hive Metastore, HiveServer2, Metrics Collector, Active Resource Manager, WebHCat Server, Big SQL Worker, and Node Manager.
Node #3: ZooKeeper Server, App Timeline Server, Active HBase Master, Hive Metastore, HiveServer2, Oozie Server, Standby Resource Manager, WebHCat Server, Big SQL Worker, Metrics Monitor, and Node Manager.
Node #4: Big SQL Head, Data Server Manager, Hive Metastore, HiveServer2, WebHCat Server, Metrics Monitor, and Node Manager.
Node #5: Big SQL Secondary Head, ZooKeeper Server, Hive Metastore, HiveServer2, WebHCat Server, Region Server, HBase REST Server, Metrics Monitor, and Node Manager.
Phase 1 Production Deployment Model
[Diagram labels: Data Center 1 and Data Center 2, each with an IBM BigInsights Cluster (Nodes #1 to #5, Hadoop, Postgres Metadata Database) and an HDFS Storage Cluster (Nodes #1 to #3)]
Challenges
• Active-active deployment across the data centers
• Evaluating the IBM Big Replicate product
• Training for support teams
• Daylight saving time challenges
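The slide does not say which daylight saving issues were encountered. One common difficulty with time-stamped data is that a local wall-clock time repeats when clocks fall back, so two distinct instants share the same local timestamp. The sketch below illustrates this with fixed UTC offsets; the choice of US Central time and the date are assumptions made for the example, not details from the deck.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for US Central time (an assumption for illustration).
CDT = timezone(timedelta(hours=-5), "CDT")  # daylight time
CST = timezone(timedelta(hours=-6), "CST")  # standard time

# On 2016-11-06 clocks moved from 02:00 CDT back to 01:00 CST,
# so the wall-clock time 01:30 occurred twice.
first = datetime(2016, 11, 6, 1, 30, tzinfo=CDT)
second = datetime(2016, 11, 6, 1, 30, tzinfo=CST)

# Without offset information the two readings are indistinguishable.
same_wall_clock = first.replace(tzinfo=None) == second.replace(tzinfo=None)

# Aware datetimes are compared as instants, which are one hour apart.
gap = second - first

print(same_wall_clock)                             # True
print(gap)                                         # 1:00:00
print(first.astimezone(timezone.utc).isoformat())  # 2016-11-06T06:30:00+00:00
print(second.astimezone(timezone.utc).isoformat()) # 2016-11-06T07:30:00+00:00
```

Storing timestamps in UTC, or together with their offset, keeps such rows distinct during export, validation, and partitioning.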
Thank You

