10 Amazing Things To Do With a Hadoop-Based Data Lake
Strata Conference New York 2014
Greg Chase, Director, Product Marketing, Pivotal Software
© 2014 Pivotal Software, Inc. All rights reserved.
Pivotal Business Data Lake Architecture
[Architecture diagram. Tiers shown: Sources, Ingestion Tier, Distillation Tier, Processing Tier, Insights Tier, Action Tier, Unified Operations Tier. Sources: clickstream, sensor data, weblogs, network data, CRM data, ERP data. Components named: Pivotal HD (unstructured and structured data), HAWQ/Greenplum, GemFire XD, Spring XD, Sqoop, Flume, HBase, MapReduce, Hive, Pig, Oozie, Command Center, query interfaces, GemFire, RabbitMQ, Redis, Pivotal CF.]
1. Store Massive Data Sets
Scale-out: use commodity hardware and storage.
[Diagram: Rack 1, Rack 2, Rack 3, … Rack n]
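As a rough illustration of the scale-out idea on this slide, the sketch below estimates how usable capacity grows as nodes are added. It assumes the HDFS default replication factor of 3 and ignores filesystem overhead and reserved space; the function name and the node sizes are hypothetical, not figures from the presentation.

```python
def usable_capacity_tb(nodes, raw_tb_per_node, replication=3):
    """Estimate usable cluster capacity in TB.

    Assumption: every block is stored `replication` times (HDFS defaults
    to 3) and overhead such as reserved space is ignored.
    """
    if nodes < 0 or raw_tb_per_node < 0 or replication < 1:
        raise ValueError("invalid cluster parameters")
    return nodes * raw_tb_per_node / replication


if __name__ == "__main__":
    # Hypothetical cluster: racks of 10 nodes, 24 TB of raw disk per node.
    for racks in (1, 2, 3):
        print(racks, "rack(s):", usable_capacity_tb(racks * 10, 24.0), "TB usable")
```

Under these assumptions capacity grows linearly with node count, which is the property the slide points to.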
2. Mix Disparate Data Sources
Schema flexibility: absorb different data types from data sources.
[Diagram inputs: sensor data, CRM data, website click streams]
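One way to picture schema flexibility is "schema on read": raw bytes are stored untouched, and structure is applied only when the data is read. The sketch below is a minimal, in-memory illustration of that idea using only the Python standard library; the file names, contents, and the `read_with_schema` helper are hypothetical and do not involve any HDFS API.

```python
import csv
import io
import json

# Raw payloads kept as-is, the way a schema-less store would hold them.
# File names and contents are made up for illustration.
RAW_FILES = {
    "sensor.jsonl": b'{"device": "s1", "temp_c": 21.5}\n{"device": "s2", "temp_c": 19.0}\n',
    "crm.csv": b"customer_id,name\n42,Ada\n43,Grace\n",
}


def read_with_schema(name, payload):
    """Apply structure at read time, chosen by file extension."""
    text = payload.decode("utf-8")
    if name.endswith(".jsonl"):
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if name.endswith(".csv"):
        return list(csv.DictReader(io.StringIO(text)))
    raise ValueError("no reader registered for " + name)


if __name__ == "__main__":
    for name, payload in RAW_FILES.items():
        print(name, read_with_schema(name, payload))
```

The storage layer never needs to know the formats; only the readers do, so new source types can be added without changing what is already stored.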
3. Ingest Bulk Data
Scalable open source tools for batch loading data (batch and microbatch).
Flume
  • Event driven
  • Any source
Spring XD
  • Bulk load
  • With processing
  • With analytics
  • Any source
Sqoop
  • Bulk load
  • RDBMS
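The difference between one large batch and recurring microbatches can be sketched in a few lines. The generator below groups an incoming stream of records into small fixed-size batches; it is a generic illustration only, and `micro_batches` is a hypothetical helper, not part of Flume, Spring XD, or Sqoop.

```python
def micro_batches(events, batch_size):
    """Yield lists of at most `batch_size` events, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch


if __name__ == "__main__":
    for batch in micro_batches(range(5), 2):
        print("loading batch:", batch)
```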
4. Ingest High-Velocity Data
Capture all volatile data. Apply structure.
Spring XD
  • Bulk load
  • Real-time ingest
  • With processing
  • With analytics
  • Any source
Pivotal GemFire XD
  • Advanced DB operations
  • Consistency
  • Reliable persistence
  • Convert to structured
[Diagram: streaming data]
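One example of an "advanced DB operation" on a stream is dropping duplicates that arrive within a window of time. The sketch below shows the general technique in plain Python; it is not the GemFire XD or Spring XD API, the class name is hypothetical, and it assumes events arrive in non-decreasing timestamp order.

```python
from collections import OrderedDict


class WindowDeduplicator:
    """Drop events whose key was already accepted within `window_s` seconds."""

    def __init__(self, window_s):
        self.window_s = window_s
        # key -> timestamp of the last accepted event, oldest first.
        self._last_seen = OrderedDict()

    def accept(self, key, timestamp):
        """Return True if the event is new within the window, False if duplicate.

        Assumes timestamps never go backwards.
        """
        # Evict keys whose accepted event has fallen out of the window.
        while self._last_seen:
            oldest_key, oldest_ts = next(iter(self._last_seen.items()))
            if timestamp - oldest_ts < self.window_s:
                break
            self._last_seen.popitem(last=False)
        if key in self._last_seen:
            return False
        self._last_seen[key] = timestamp
        return True


if __name__ == "__main__":
    dedup = WindowDeduplicator(window_s=10)
    for key, ts in [("a", 0), ("a", 5), ("b", 6), ("a", 10)]:
        print(key, ts, "accepted" if dedup.accept(key, ts) else "duplicate")
```

Memory stays bounded by the number of distinct keys seen inside one window, which is why this kind of check suits an in-memory tier placed in front of Hadoop.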
5. Apply Structure to Unstructured / Semi-Structured Data
Flexible processing of different data types.
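A simple case of applying structure is an ETL step that turns raw web log lines into typed records. The sketch below assumes the logs use the Common Log Format; the regular expression, field names, and `parse_log_line` helper are illustrative choices rather than anything prescribed by the presentation.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Turn one raw log line into a structured record, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "host": fields["host"],
        "time": datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z"),
        "method": fields["method"],
        "path": fields["path"],
        "status": int(fields["status"]),
        # A size of "-" means no body was sent.
        "size": 0 if fields["size"] == "-" else int(fields["size"]),
    }


if __name__ == "__main__":
    sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
    print(parse_log_line(sample))
```

Records in this shape can be loaded into a table and later joined or aggregated, while lines that do not match are returned as `None` so they can be set aside rather than silently dropped.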
6. Make Data Available for MPP SQL Analysis
Fast processing for advanced analytics in many supported HDFS formats.
[Diagram: Hadoop Cluster showing Name Node, Resource Manager, HAWQ Master, and four Data Nodes, each with a Node Manager and HAWQ Segment(s)]
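The master-and-segments layout in the diagram follows a general scatter-gather pattern: each segment aggregates its local slice of the data, and the master merges the partial results. The sketch below imitates that pattern for an average per group. It runs sequentially in one process, uses hypothetical function and column names, and says nothing about how HAWQ actually plans or executes queries.

```python
from collections import defaultdict


def segment_aggregate(rows):
    """Work done locally on one segment: partial SUM and COUNT per region."""
    partial = defaultdict(lambda: [0.0, 0])
    for row in rows:
        acc = partial[row["region"]]
        acc[0] += row["amount"]
        acc[1] += 1
    return partial


def master_combine(partials):
    """Work done on the master: merge partials, then finish AVG = SUM / COUNT."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for region, (total, count) in partial.items():
            totals[region][0] += total
            totals[region][1] += count
    return {region: total / count for region, (total, count) in totals.items()}


def mpp_average(rows, num_segments=4):
    """Spread rows round-robin over segments, aggregate locally, then combine."""
    segments = [rows[i::num_segments] for i in range(num_segments)]
    return master_combine(segment_aggregate(seg) for seg in segments)


if __name__ == "__main__":
    sales = [
        {"region": "east", "amount": 10.0},
        {"region": "west", "amount": 20.0},
        {"region": "east", "amount": 30.0},
    ]
    print(mpp_average(sales))
```

Segments ship sums and counts rather than raw rows, so the amount of data moved to the master depends on the number of groups, not the number of rows.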
7. Achieve Data Integration
Create multi-dimensional analytical models.
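A first step toward an integrated model is often to join two data sets on a shared key and check whether they move together. The sketch below joins two hypothetical daily series and computes a Pearson correlation coefficient by hand; the data, key names, and helper functions are invented for illustration, and a correlation found this way does not by itself establish causation.

```python
from math import sqrt


def join_on_key(left, right):
    """Inner-join two dicts on their keys, returning (key, left, right) tuples."""
    return [(key, left[key], right[key]) for key in sorted(left) if key in right]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical daily sensor temperatures and daily sales, keyed by date.
    temps = {"2014-10-01": 10.0, "2014-10-02": 20.0, "2014-10-03": 30.0}
    sales = {"2014-10-01": 100.0, "2014-10-02": 200.0, "2014-10-03": 300.0, "2014-10-04": 50.0}
    joined = join_on_key(temps, sales)
    r = pearson([t for _, t, _ in joined], [s for _, _, s in joined])
    print("days joined:", len(joined), "correlation:", r)
```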
8. Improve Machine Learning & Predictive Analytics
Richer, deeper data sets for accurate predictive analytics.
[Diagram: HAWQ Master with HAWQ Segment(s)]
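To see why richer, deeper data sets matter, compare what an aggregated summary shows with what the raw detail shows. In the sketch below, a hypothetical sensor reports one reading per minute with a single spike; the hourly average barely moves, while the detail makes the spike easy to find. The data and function names are invented for illustration.

```python
from statistics import mean


def hourly_summary(readings):
    """Warehouse-style aggregate: one average per hour."""
    by_hour = {}
    for minute, value in readings:
        by_hour.setdefault(minute // 60, []).append(value)
    return {hour: mean(values) for hour, values in by_hour.items()}


def find_spikes(readings, threshold):
    """Detail-level check that is only possible if raw readings were kept."""
    return [(minute, value) for minute, value in readings if value > threshold]


if __name__ == "__main__":
    # Two hours of per-minute readings at 20.0, with one spike at minute 75.
    readings = [(minute, 80.0 if minute == 75 else 20.0) for minute in range(120)]
    print("hourly averages:", hourly_summary(readings))
    print("spikes in detail:", find_spikes(readings, threshold=50.0))
```

A model trained only on the hourly averages would never see the event that a model trained on the detail could learn from.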
9. Deploy Real-Time Automation at Scale
Respond in real-time, at scale. Archive history in Hadoop.
[Diagram: Web Apps, Pivotal GemFire XD (In-Memory)]
10. Achieve Continuous Innovation at Scale
  • Deploy automation at scale
  • Capture and store all data
  • Analyze to discover insights & algorithms
[Diagram: sensor data, CRM data, website click streams; HAWQ Master and HAWQ Segment(s); In-Memory; Web Apps]
Increase Value Derived from Data With a Data Lake
Business Value:
  • Store massive data sets
  • Mix disparate data
  • Ingest bulk data
  • Ingest high-velocity data
  • Apply structure
  • Enable MPP analysis
  • Achieve data integration
  • Improve predictive analytics
  • Deploy real-time automation at scale
  • Achieve continuous innovation
For more information on Pivotal Big Data Suite, visit Pivotal.io/big-data


Editor's Notes

  • #4: This is an architecture of a Business Data Lake. It is centered around Hadoop-based storage. It includes tools and components for ingesting data from different kinds of data sources, processing data for analytics and insights, and for supporting applications that utilize data, implement insights, and contribute data back to the data lake. In this presentation, we will look at the various components of a business data lake architecture, and show how to use it to maximize the value of your company’s data.
  • #5: Let’s first look at why Hadoop and HDFS make a lot of sense for a data lake.
  • #6: Hadoop’s underlying storage layer, the Hadoop Distributed File System (HDFS), is a distributed file system that supports very large clusters. This means your data storage can, in theory, grow as large as you need. You simply add more nodes to the cluster as you need more space.
  • #7: HDFS is schema-less, which means it can store files of any type and format. This is great for storing unstructured or semi-structured data, as well as non-relational data formats such as binary streams from sensors, image data, and machine logs. It’s also just fine for storing structured, relational, tabular data.
  • #8: When your data storage can take any kind of data from any kind of source, getting that data loaded and stored can be a challenge. This is why implementing a data lake requires a wide selection of ingest tools.
  • #9: Batch loading can be achieved with a variety of tools, depending on the sources involved. Sqoop, for example, is great for handling large batch loads, and can even pull data from legacy databases. On the other hand, if your bulk loading operation needs some additional processing, such as transforming data from one format to another or creating metadata, and if you also want to create analytics, then another open source tool, Spring XD, is available and provides the scale and flexibility to handle your specific needs. Microbatch loads, in other words smaller but recurring batch loads such as data change deltas or event-triggered updates, are handled well by Flume.
  • #10: Storing high-velocity data into Hadoop is a different challenge altogether, since your source can vary in volume as well as speed. If ensuring you store all the data is paramount, you need tools that can capture and queue data at any scale or volume until the Hadoop cluster is able to store it. A data lake based on Pivotal Big Data Suite has two tools built for these use cases, and they can work together. Spring XD can scale to handle data streaming in real time, and provides the same processing and analytics capabilities it offers for bulk loads. Pivotal GemFire XD can work with Spring XD to provide advanced database operations, such as searching for duplicates within a window of time, and allows you to ensure consistency of data in writes. Since it’s a SQL-based database, it’s also great for helping convert or add structure to ingested data.
  • #11: Once you have the ability to store and load data into your data lake, the next step is deriving business value by processing the data, gaining insights, and taking action on them.
  • #12: It’s great that one can get any kind of data into an HDFS data store. However, to conduct advanced analytics on it, you often need to make it accessible to structure-based analysis tools. This kind of processing may involve direct transformation of file types, or it might simply mean analyzing the files and creating metadata about them. This can be done on ingest with some of the tools described, or after the data has been stored in Hadoop. Examples might be transforming binary image formats into RDBMS tables to enable large-scale image processing, or even simple ETL processes on web logs so that they can later be turned into fact tables.
  • #13: Once you have structure applied to your data, it’s possible to leverage SQL-based tools to do fast processing on your data for advanced analytics and data science. Only HAWQ provides full analytic SQL support on Hadoop with massively parallel processing. This allows you to enjoy very high performance when leveraging advanced analytics functions in MADlib, as well as when using analytics applications such as SAS.
  • #14: With structure applied to your data, and the ability to deploy advanced analytics, you can now start doing some very powerful investigation, all of it supported by Hadoop. By discovering relationships between seemingly unrelated data sets, it’s possible to find correlations and potential causation, and create multi-dimensional analytical models that have higher precision in predictive analytics.
  • #15: Since HDFS allows you to store as much data as you want at very low cost, it’s possible to store larger, detailed data sets such as time series feeds and application logs. In traditional data warehousing, ETL processes aggregate and summarize this information to facilitate reporting, and lose detail in the process. By saving the detail, it’s possible to run machine learning algorithms on the data to help build more accurate predictive analytics.
  • #16: Distributed in-memory databases such as Pivotal GemFire XD make it possible to deploy real-time, data-driven automation at scale. This means you can deploy applications that respond to and process incoming streaming data, such as Internet of Things applications, or that support large-scale mobile-web applications. You want to create intelligent user experiences, and provide smart automation and processing in the backend. You also want to be able to capture and store detailed logging of all interactions for further analysis.
  • #17: The ability to deploy automation at scale, capture and store all data, and analyze to discover insights and algorithms is an ongoing process of continuous improvement and innovation.
  • #18: Using the full capabilities of a data lake, from storing massive data sets to achieving continuous innovation, allows your company to maximize the business value it generates from its data.