@joe_Caserta@GreatLakesBI
Architecting for Big Data:
Trends, Tips, and Deployment Options
Joe Caserta
President
Caserta Concepts
Top 20 Big Data
Consulting - CIO Review
Joe Caserta Timeline
Launched Big Data practice
Co-author, with Ralph Kimball, The
Data Warehouse ETL Toolkit (Wiley)
Dedicated to Data Warehousing,
Business Intelligence since 1996
Began consulting in database programming and data modeling
25+ years hands-on experience building database solutions
Founded Caserta Concepts in NYC
Web log analytics solution published
in Intelligent Enterprise
Formalized Alliances / Partnerships –
System Integrators
Partnered with Big Data vendors
Cloudera, Hortonworks, IBM, Cisco,
Datameer, Basho, and more…
Launched Training practice, teaching
data concepts world-wide
Laser focus on extending Data
Warehouses with Big Data solutions
1986
2004
1996
2009
2001
2010
2013
Launched Big Data Warehousing
Meetup in NYC ~ 1,500 Members
2012
2014
Established best practices for big
data ecosystem implementation –
Healthcare, Finance, Insurance
Top 20 Most Powerful
Big Data consulting firms
Dedicated to Data Governance
Techniques on Big Data (Innovation)
About Caserta Concepts
• Technology services company with expertise in data
analysis:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Digital Marketing
• Financial Services / Insurance
• Healthcare / Higher Education
• Established in 2001:
• Increased growth year-over-year
• Industry recognized work force
• Strategy, Implementation, Analytics
• Writing, Education, Mentoring
• Data Science & Analytics
• Cloud Computing
• Data Interaction & Visualization
The Evolution of Enterprise Data
[Diagram. Sources: Sales, Marketing, Finance, Others… Traditional side: ETL into the Enterprise Data Warehouse, serving Traditional BI (Ad-Hoc/Canned Reporting). Big Data side: a Horizontally Scalable Environment, Optimized for Analytics, comprising the Big Data Lake and Big Data Analytics on the Hadoop Distributed File System (HDFS, nodes N1 to N5), with Spark, MapReduce, Pig/Hive, NoSQL Databases, ETL, Data Exploration, and Data Science.]
Tools and Technologies
Best Practices
Data Warehousing/
ETL/Data Integration
BI/Visualization/
Analytics
Big Data Analytics
The ones you need to know…
• Hadoop Distributions: Cloudera, Hortonworks, MapR, Pivotal-HD, IBM
• Tools:
  • Hive: Map data to structures and use SQL-like queries
  • Pig: Data transformation language for big data
  • Sqoop: Extracts data from external sources and loads it into Hadoop
  • Spark: General-purpose cluster computing framework
  • Storm: Real-time ETL
• NoSQL:
  • Document: MongoDB, CouchDB
  • Graph: Neo4j, Titan
  • Key Value: Riak, Redis
  • Columnar: Cassandra, HBase
  • Search: Lucene, Solr, Elasticsearch
• Languages: Python (with SciPy), Java, R, Scala
Why are we Changing?
• Advertising: Real-time interactive queries on massive audience datasets in the cloud
• Global analytics on the cloud: Integrate SAP implementations from across the globe into a single cloud solution
• Recommendation Engines: “You chose… you might also like…”
• Real-Time: Aggregation, Monitoring & Alerting on events at extremely high message rates… ~1M msgs/sec
• Big Data Warehouse: Extending EDW with Hadoop; governing data from the “lake” to the EDW
Personal/Commercial Banking
Investment/Trading Bank
World-wide beauty company
Cable Television
Audience-based Advertising
Components of Data Governance
• Organization: This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc.
• Metadata: Definitions, lineage (where does this data come from), business definitions, technical metadata
• Privacy/Security: Identify and control sensitive data, regulatory compliance
• Data Quality and Monitoring: Data must be complete and correct. Measure, improve, certify
• Business Process Integration: Policies around data frequency, source availability, etc.
• Master Data Management: Ensure consistent business-critical data, e.g. Members, Providers, Agents, etc.
• Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving
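The "Data Quality and Monitoring" component (complete and correct; measure, improve, certify) can be illustrated with a small completeness check. This is only a sketch and not a method from the slides: the record layout, the field names, and the 0.95 certification threshold are hypothetical.

```python
def completeness(records, required_fields):
    """Measure: fraction of records with a non-empty value, per required field."""
    records = list(records)
    if not records:
        return {field: 0.0 for field in required_fields}
    scores = {}
    for field in required_fields:
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        scores[field] = filled / len(records)
    return scores


def certify(scores, threshold=0.95):
    """Certify: every field must meet the (hypothetical) completeness threshold."""
    return all(score >= threshold for score in scores.values())


# Hypothetical master-data records (members and their providers).
members = [
    {"member_id": "M1", "provider_id": "P9", "state": "NY"},
    {"member_id": "M2", "provider_id": "", "state": "NJ"},
    {"member_id": "M3", "provider_id": "P4", "state": None},
    {"member_id": "M4", "provider_id": "P4", "state": "CT"},
]

scores = completeness(members, ["member_id", "provider_id", "state"])
certified = certify(scores)
```

Here `provider_id` and `state` each have one missing value out of four, so the data set would not be certified until those gaps are fixed (the "improve" step).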
For Big Data
• Add Big Data to overall framework and assign responsibility
• Add data scientists to the Stewardship program
• Assign stewards to new data sets (Twitter, call center logs, etc.)
• Graph databases are more flexible than relational
• Lower latency service required
• Distributed data quality and matching algorithms
• Data Quality and Monitoring (probably home grown, Drools?)
• Quality checks not only SQL: machine learning, Pig and MapReduce
• Acting on large dataset quality checks may require distribution
• Larger scale
• New datatypes
• Integrate with Hive Metastore, HCatalog, home grown tables
• Secure and mask multiple data types (not just tabular)
• Deletes are less common (unless there is a regulatory requirement)
• Take advantage of compression and archiving (like AWS Glacier)
• Data detection and masking on unstructured data upon ingest
• Near-zero latency, DevOps, core component of business operations
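As a rough illustration of "data detection and masking on unstructured data upon ingest", the sketch below scans free text for two kinds of sensitive values and masks them before storage. The patterns (US-style SSNs and email addresses), the mask strings, and the function name `mask_on_ingest` are assumptions chosen for the example; a real deployment would need a far broader detection set.

```python
import re

# Hypothetical detectors: a US-style SSN and a simple email pattern.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def mask_on_ingest(text):
    """Return the masked text plus a count of what was detected."""
    findings = {
        "ssn": len(SSN.findall(text)),
        "email": len(EMAIL.findall(text)),
    }
    masked = SSN.sub("***-**-****", text)
    masked = EMAIL.sub("<email>", masked)
    return masked, findings


# Hypothetical call-center note arriving in the landing area.
note = "Caller 123-45-6789 asked to update jane.doe@example.com on file."
masked, findings = mask_on_ingest(note)
```

The returned `findings` counts could feed the metadata catalog, so stewards can see which incoming data sets carry sensitive values.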
The Big Data Pyramid
• Data has different governance demands at each tier.
• Only the top tier of the pyramid is fully governed and ready for Enterprise BI.
Tiers, from top to bottom:
• Big Data Warehouse: User community arbitrary queries and reporting. Fully data governed (trusted).
• Data Science Workspace: Agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts. Metadata → Catalog; ILM → who has access, how long do we “manage it”; Data Quality and Monitoring → monitoring of completeness of data.
• Data Lake – Integrated Sandbox: Data is ready to be turned into information: organized, well defined, complete. Metadata → Catalog; ILM → who has access, how long do we “manage it”; Data Quality and Monitoring → monitoring of completeness of data.
• Landing Area – Source Data in “Full Fidelity”: Raw machine data collection, collect everything. Metadata → Catalog; ILM → who has access, how long do we “manage it”.
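The pyramid can be read as a mapping from tier to the governance controls expected at that tier. The sketch below encodes that reading; the tier-to-control assignment is inferred from the slide's layout, and the identifiers (`Tier`, `REQUIRED_CONTROLS`, the control names) are hypothetical.

```python
from enum import IntEnum


class Tier(IntEnum):
    LANDING_AREA = 1
    DATA_LAKE = 2
    DATA_SCIENCE_WORKSPACE = 3
    BIG_DATA_WAREHOUSE = 4


# The seven governance components, as short hypothetical identifiers.
ALL_CONTROLS = frozenset({
    "organization", "metadata_catalog", "privacy_security",
    "data_quality_monitoring", "business_process_integration",
    "master_data_management", "ilm",
})

# Assumed reading of the slide: governance demands grow toward the top tier.
REQUIRED_CONTROLS = {
    Tier.LANDING_AREA: frozenset({"metadata_catalog", "ilm"}),
    Tier.DATA_LAKE: frozenset({"metadata_catalog", "ilm", "data_quality_monitoring"}),
    Tier.DATA_SCIENCE_WORKSPACE: frozenset({"metadata_catalog", "ilm", "data_quality_monitoring"}),
    Tier.BIG_DATA_WAREHOUSE: ALL_CONTROLS,
}


def missing_controls(tier, applied):
    """Controls the tier expects that the data set does not yet have."""
    return set(REQUIRED_CONTROLS[tier]) - set(applied)


def ready_for_enterprise_bi(tier, applied):
    """Only fully governed data in the top tier is exposed to Enterprise BI."""
    return tier == Tier.BIG_DATA_WAREHOUSE and not missing_controls(tier, applied)


gaps = missing_controls(Tier.DATA_LAKE, {"metadata_catalog"})
```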
• The Big Data movement breaks the relational database
barrier and enables analysis on massive amounts of
structured and unstructured data.
• NoSQL puts the value of SQL-based relational databases
into question. This disruption is forging a new road for
scalable data analytics.
• The value of legacy Business Intelligence comes into
question.
• Rather than forcing data users to become technologists, BI
must make data analysis available to the masses.
BI is About to be Disrupted!
• The role of the ‘Business Analyst’, the primary user of the
BI tool, is being replaced by two types of data users:
1. Highly technical Data Scientists
2. Non-technical Business Persons
• New analytics (BI) platforms must be created to
accommodate the new users. We see these very distinct
users using very different technologies.
• Perhaps legacy BI tools will not go away, but the market is
absolutely about to be disrupted.
Who Does BI Today?
• Data Scientists have deep technical knowledge
• They enjoy writing code and mining data
• The best way to serve a data scientist is to provide access
to raw data and then get out of their way.
Empower the Data Scientist
What does a Data Scientist Do, Anyway?
• Writes really cool and sophisticated algorithms that impact the way the business runs.
• NOT
• Much of the time of a Data Scientist is spent:
  • Searching for the data they need
  • Making sense of the data
  • Figuring out why the data looks the way it does and assessing its validity
  • Cleaning up all the garbage within the data so it represents the true business
  • Combining events with reference data to give them context
  • Correlating event data with other events
• Finally, they write algorithms to perform mining, clustering and predictive analytics – the sexy stuff.
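Two of the chores listed above, cleaning up garbage and combining events with reference data, can be sketched in a few lines. The event fields, the provider lookup table, and the cleaning rules are hypothetical; the sketch only shows the shape of the work.

```python
def clean(events):
    """Drop events lacking an id or provider key; normalize the key's format."""
    cleaned = []
    for event in events:
        provider = (event.get("provider_id") or "").strip().upper()
        if event.get("event_id") and provider:
            cleaned.append({**event, "provider_id": provider})
    return cleaned


def enrich(events, reference):
    """Attach reference attributes to each event; report events with no match."""
    enriched, unmatched = [], []
    for event in events:
        ref = reference.get(event["provider_id"])
        if ref is None:
            unmatched.append(event)
        else:
            enriched.append({**event, **ref})
    return enriched, unmatched


# Hypothetical reference data and raw events.
reference = {"P4": {"provider_name": "Northside Clinic", "provider_state": "NY"}}
events = [
    {"event_id": 1, "provider_id": " p4 ", "type": "claim"},
    {"event_id": 2, "provider_id": None, "type": "claim"},
    {"event_id": 3, "provider_id": "P7", "type": "visit"},
]

enriched, unmatched = enrich(clean(events), reference)
```

Event 2 is discarded during cleaning, event 1 gains its provider context, and event 3 is set aside as unmatched so someone can investigate why the reference data is incomplete.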
• Business users don’t have, and don’t want to have,
technical wherewithal to interact with ‘data’.
• “We have a business to run! Programming should be done by
people in rooms with no windows.”
• “I need information at my fingertips and I should not need a PhD in
SQL to get it.”
• “It’s a myth that BI tools will solve my problems, I still need IT to get
new reports. This is unacceptable.”
• Every business professional on the planet knows how to
search for needed information via a Google search bar.
• Business people want to be able to ‘Google’ their
corporate data for the information they need.
Empower the Business Person
The Future of BI (if the Business gets its way)…
Navigating Data in BI…
Facets created automatically based on relevant data
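One way to picture facets being "created automatically based on relevant data" is to count the distinct values of each field in the current result set and keep only fields with a small number of values. A search engine such as Solr provides faceting natively; the in-memory sketch below only illustrates the idea, and the row layout and the `max_values` cutoff are assumptions.

```python
from collections import Counter


def build_facets(rows, max_values=10):
    """Return {field: [(value, count), ...]} for fields with few distinct values."""
    facets = {}
    fields = sorted({key for row in rows for key in row})
    for field in fields:
        counts = Counter(row[field] for row in rows if field in row)
        # A field with one value cannot narrow results; one with too many
        # values (for example an amount) makes a poor navigation facet.
        if 1 < len(counts) <= max_values:
            facets[field] = counts.most_common()
    return facets


# Hypothetical result rows for a sales search.
rows = [
    {"region": "Northeast", "product": "Records", "amount": 120.0},
    {"region": "Northeast", "product": "Perfume", "amount": 80.5},
    {"region": "West", "product": "Records", "amount": 42.0},
]

facets = build_facets(rows, max_values=2)
```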
• During normal BI implementations, much time is spent (or wasted) on selecting the best way to graphically represent a set of metrics.
• We can embed algorithms that are statistically proven to best represent information depending on the type of question being asked.
• The user should be able to preview and change from the default infographic as easily as clicking ‘next’ on a Yahoo! Slideshow.
Why do we make it so difficult?
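A default chart plus a "next" button can be sketched as a ranked list of chart types per kind of question. The rule table below is a made-up heuristic for illustration, not the statistically proven selection the slide refers to; the dimension kinds (`time`, `geo`, `category`) and chart names are assumptions.

```python
def ranked_charts(dimension_kinds):
    """Rank chart types for a question, most suitable first (illustrative rules)."""
    kinds = set(dimension_kinds)
    if "time" in kinds:
        return ["line", "area", "bar"]
    if "geo" in kinds:
        return ["choropleth_map", "bar", "table"]
    if len(dimension_kinds) >= 2:
        return ["heatmap", "grouped_bar", "table"]
    return ["bar", "pie", "table"]


class ChartPreview:
    """Starts on the default chart; next() cycles through the alternatives."""

    def __init__(self, charts):
        self._charts = list(charts)
        self._index = 0

    @property
    def current(self):
        return self._charts[self._index]

    def next(self):
        self._index = (self._index + 1) % len(self._charts)
        return self.current


# "Sales by state by customer age": one geographic and one categorical dimension.
preview = ChartPreview(ranked_charts(["geo", "category"]))
default_chart = preview.current
second_choice = preview.next()
```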
Imagine the Possibilities…
[Mock search screen. Search box: “Lady Gaga sales by state by customer age”, with a “Go!” button. User: joe@casertaconcepts.com. Facets: Region (Northeast, Midwest, South, West); Product (Records, Perfume, Clothes, Performances); Dates (2009 to 2013). Action: DOWNLOAD TO EXCEL.]
Building the Future of BI (Hint: it’s Big Data)
• Angular
  • Modern web application framework
  • Developed and supported by Google
  • Bootstrap used for Mobile
• D3.js
  • JavaScript library for data visualization
  • Exposes the full capabilities of CSS3, HTML5 and SVG; extremely fast
  • Supports large datasets and dynamic behaviors for interaction
• Python
  • The “glue” that brings other components together
  • The ‘engine’ that transforms search strings into queries
  • Integrated with the Customer Metadata repository
• Solr
  • Full-text and faceted-search engine and database
  • This is the backbone of the application
• Cassandra
  • Customer Metadata repository. Stores all business rules (default facets, etc.) and user preferences (default graph types, etc.)
  • Cassandra may not be the ultimate selection
• AWS
  • Amazon Web Services
  • Product is a zero-footprint cloud-based solution
  • User experience is the same as Googling info
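The Python "engine" that turns a search string into a query could look roughly like the sketch below, using the search from the mock screen. It is an assumption-laden illustration, not the product's actual parser: the vocabulary tables, the index field names (`sales_amount`, `customer_state`, `customer_age`), and both function names are hypothetical. The output uses Solr's standard `q`, `facet`, and `facet.pivot` request parameters.

```python
# Hypothetical vocabulary; in the described design this would come from the
# customer metadata repository.
KNOWN_MEASURES = {"sales": "sales_amount"}
KNOWN_DIMENSIONS = {
    "state": "customer_state",
    "customer age": "customer_age",
    "region": "region",
}


def parse_search(text):
    """Split 'keywords measure by dim by dim' into a structured query."""
    head, *groupings = [part.strip() for part in text.lower().split(" by ")]
    words = head.split()
    measure = None
    if words and words[-1] in KNOWN_MEASURES:
        measure = KNOWN_MEASURES[words.pop()]
    group_by = [KNOWN_DIMENSIONS[g] for g in groupings if g in KNOWN_DIMENSIONS]
    return {"keywords": " ".join(words), "measure": measure, "group_by": group_by}


def to_solr_params(query):
    """Map the structured query to Solr request parameters.

    Aggregating the measure itself is left out of this sketch.
    """
    params = {"q": query["keywords"] or "*:*"}
    if query["group_by"]:
        params["facet"] = "true"
        params["facet.pivot"] = ",".join(query["group_by"])
    return params


query = parse_search("Lady gaga sales by state by customer age")
params = to_solr_params(query)
```

Unrecognized grouping terms are silently dropped here; a production engine would more likely surface them to the user as suggestions.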
Innovation is the only sustainable
competitive advantage a company can
have.
Closing Thought
Challenge the status quo!
Thank You & Questions
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
@joe_Caserta
