@joe_Caserta#BDWMeetup
Topic: Predictive Analytics
Big Data Warehousing
June 3, 2015
Presented by:
About Caserta Concepts
• Technology innovation company with expertise in data analysis:
  • Big Data Solutions
  • Data Warehousing
  • Business Intelligence
• Solve highly complex business data challenges
• Award-winning solutions
• Business Transformation
• Maximize Data Value
• Industry Recognized Workforce
• Core focus in the following industries:
  • eCommerce / Retail / Marketing
  • Financial Services / Insurance
  • Healthcare / Ad Tech / Higher Ed
• Services
  • Strategy, Roadmap, Implementation
  • Data Science & Analytics
  • Data on the Cloud
  • Data Interaction & Visualization
Why do we need predictive analytics today?
The Progression of Analytics
• Descriptive Analytics: What happened? (Reports)
• Diagnostic Analytics: Why did it happen? (Correlations)
• Predictive Analytics: What will happen? (Predictions)
• Prescriptive Analytics: How can we make it happen? (Recommendations)
Chart axes: Data Analytics Sophistication, Business Value
Source: Gartner
Today’s Data Environment
[Architecture diagram. Labels recovered from the figure:]
• Sources: Enrollments, Claims, Finance, Others…
• Traditional side: ETL, Traditional EDW, Traditional BI, Ad-Hoc/Canned Reporting
• Big Data Lake: Horizontally Scalable Environment - Optimized for Analytics
  • Hadoop Distributed File System (HDFS) on nodes N1, N2, N3, N4, N5
  • Spark, MapReduce, Pig/Hive
  • NoSQL Databases
• Other labels: ETL, Ad-Hoc Query, Canned Reporting, Big Data Analytics, Data Science
The Big Data Pyramid
Tiers (top to bottom):
• Big Data Warehouse
• Data Science Workspace
• Data Lake – Integrated Sandbox
• Landing Area – Source Data in “Full Fidelity”
Tier annotations (as recovered from the diagram):
• Raw machine data collection, collect everything
• Data is ready to be turned into information: organized, well defined, complete.
• Agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts
• Fully Data Governed (trusted): user community arbitrary queries and reporting
Governance annotations (repeated alongside the tiers):
• Metadata → Catalog
• ILM → who has access, how long do we “manage it”
• Data Quality and Monitoring → monitoring of completeness of data
Notes:
• Data has different governance demands at each tier.
• Only the top tier of the pyramid is fully governed.
• We refer to this as the Trusted tier of the Big Data Warehouse.
Data Science Subway Map
Notable Predictive Analytic Tools
Open Source Tools:
scikit-learn
KNIME
OpenNN
Orange
R
Weka
GNU Octave
Apache Mahout
Commercial Tools:
Alpine Data Labs
BIRT Analytics
Angoss KnowledgeSTUDIO
IBM SPSS Statistics and IBM SPSS Modeler
KXEN Modeler
Mathematica
MATLAB
Minitab
Oracle Data Mining (ODM)
Pervasive
Predixion Software
RapidMiner
RCASE
Most Popular:
SAS
SPSS
Statistica
R
The Data Scientist Winning Trifecta
• Modern Data Engineering/Data Preparation
• Domain Knowledge/Business Expertise
• Advanced Mathematics/Statistics
Easier to Find Than an Awesome Data Scientist
Modern Data Engineering
Advanced Mathematics / Statistics
Domain and Outcome Sensibility
Are there Standards?
CRISP-DM: Cross-Industry Standard Process for Data Mining
1. Business Understanding
• Solve a single business problem
2. Data Understanding
• Discovery
• Data Munging
• Cleansing Requirements
3. Data Preparation
• ETL
4. Modeling
• Evaluate various models
• Iterative experimentation
5. Evaluation
• Does the model achieve business objectives?
6. Deployment
• PMML; application integration; data platform; Excel
1. Business Understanding
In this initial phase of the project we will need to speak to humans.
• It would be premature to jump into the data, or to begin selecting the appropriate model(s) or algorithm(s)
• Understand the project objective
• Review the business requirements
• The output of this phase will be the conversion of business requirements into a preliminary technical design (decision model) and plan.
Since this is an iterative process, this phase will be revisited throughout the entire process.
2. Data Understanding
• Data Discovery → understand where the data you need comes from
• Data Profiling → interrogate the data at the entity level; understand the key entities and fields that are relevant to the analysis
• Cleansing Requirements → understand data quality, data density, skew, etc.
• Data Munging → collocate, blend, and analyze data for early insights! Valuable information can be obtained from simple group-by and aggregate queries, and even more with SQL Jujitsu!
Significant iteration between the Business Understanding and Data Understanding phases.
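The profiling and munging ideas on this slide can be sketched with plain SQL. This is a minimal illustration, not the presenter's tooling: the `claims` table, its columns, and the sample rows are all made up, and Python's built-in `sqlite3` module stands in for whatever query engine is actually available. It measures data density (the share of non-null values per column) and runs one simple group-by aggregate.

```python
import sqlite3

# Hypothetical sample data: (claim_id, state, amount). Not from the deck.
ROWS = [
    (1, "NY", 120.0),
    (2, "NY", None),     # missing amount
    (3, "NJ", 80.0),
    (4, None, 200.0),    # missing state
    (5, "NJ", 40.0),
]


def profile_claims(rows):
    """Return (density per column, total amount per state) for the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE claims (claim_id INTEGER, state TEXT, amount REAL)")
    conn.executemany("INSERT INTO claims VALUES (?, ?, ?)", rows)

    # Data density: COUNT(column) skips NULLs, COUNT(*) counts every row.
    total, n_state, n_amount = conn.execute(
        "SELECT COUNT(*), COUNT(state), COUNT(amount) FROM claims"
    ).fetchone()
    density = {"state": n_state / total, "amount": n_amount / total}

    # A simple group-by aggregate for early insight.
    totals = dict(conn.execute(
        "SELECT state, SUM(amount) FROM claims "
        "WHERE state IS NOT NULL GROUP BY state ORDER BY state"
    ).fetchall())
    conn.close()
    return density, totals


density, totals = profile_claims(ROWS)
```

Low density on a field that the analysis depends on is the kind of finding that would feed back into the cleansing requirements.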
3. Data Preparation
ETL (Extract, Transform, Load)
90+% of Data Science time goes into Data Preparation!
• Select required entities/fields
• Address Data Quality issues: missing or incomplete values, whitespace, bad data points
• Join/Enrich disparate datasets
• Transform/Aggregate data for intended use:
  • Sample
  • Aggregate
  • Pivot
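The preparation steps listed on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `orders` rows, the `segments` lookup, and the helper names `clean`, `enrich`, and `pivot` are hypothetical, and dropping bad rows is just one possible policy (imputing or flagging them are others).

```python
from collections import defaultdict

# Hypothetical raw inputs; names and values are made up for illustration.
orders = [
    {"customer": " alice ", "month": "2015-01", "amount": "100"},
    {"customer": "bob", "month": "2015-01", "amount": ""},      # missing value
    {"customer": "alice", "month": "2015-02", "amount": "50"},
    {"customer": "bob ", "month": "2015-02", "amount": "n/a"},  # bad data point
    {"customer": "bob", "month": "2015-02", "amount": "30"},
]
segments = {"alice": "retail", "bob": "wholesale"}


def clean(rows):
    """Trim whitespace and drop rows whose amount is missing or not numeric."""
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # one possible policy: skip the defective row
        yield {"customer": row["customer"].strip(),
               "month": row["month"],
               "amount": amount}


def enrich(rows, lookup):
    """Join each row to a customer-segment lookup."""
    for row in rows:
        yield {**row, "segment": lookup.get(row["customer"], "unknown")}


def pivot(rows):
    """Aggregate amounts into a segment-by-month table."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row["segment"]][row["month"]] += row["amount"]
    return {segment: dict(months) for segment, months in table.items()}


prepared = pivot(enrich(clean(orders), segments))
```

Writing each step as a small generator keeps the stages composable, so a step can be swapped out when a model needs a different preparation technique.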
Data Quality and Monitoring
• Build a robust data quality subsystem:
  • Metadata and error event facts
  • Orchestration
  • Based on The Data Warehouse ETL Toolkit
• Each error instance of each data quality check is captured
• Implemented as a subsystem after ingestion
• Each fact stores the unique identifier of the defective source row
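The error event idea on this slide can be sketched as follows. This is a loose illustration, not the schema from the book: the `ErrorEvent` fields, the check names, and the sample rows are assumptions. The point it shows is that every failed check on every row produces one fact carrying the identifier of the defective source row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """One row of a hypothetical error event fact table."""
    check_name: str     # which data quality check failed
    source_row_id: int  # unique identifier of the defective source row
    detail: str         # snapshot of the offending row, for triage


# Hypothetical checks: name -> predicate that returns True when the row passes.
CHECKS = {
    "amount_not_null": lambda row: row.get("amount") is not None,
    "amount_non_negative": lambda row: row.get("amount") is None or row["amount"] >= 0,
    "state_is_two_letters": lambda row: isinstance(row.get("state"), str)
    and len(row["state"]) == 2,
}


def run_checks(rows, checks=CHECKS):
    """Run every check against every row; record one event per failure."""
    events = []
    for row in rows:
        for name, passes in checks.items():
            if not passes(row):
                events.append(ErrorEvent(name, row["id"], repr(row)))
    return events


rows = [
    {"id": 1, "state": "NY", "amount": 10.0},
    {"id": 2, "state": "New York", "amount": -5.0},
    {"id": 3, "state": "NJ", "amount": None},
]
events = run_checks(rows)
```

Because the events are plain records, they can be loaded into a table and aggregated by check or by source to monitor quality over time.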
4. Modeling
The Lovers of Algebra & Statistics
• Evaluate various models/algorithms
  • Classification
  • Clustering
  • Regression
  • Many others…
• Tune parameters
• Iterative experimentation
• Different models may require different data preparation techniques (e.g., sparse vector format)
• Additionally, we may discover the need for additional data points, or uncover additional data quality issues!
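Parameter tuning through iterative experimentation can be sketched with a tiny k-nearest-neighbours classifier scored on a hold-out set. This is a toy illustration, not a recommendation of any algorithm: the data points, the labels, and the candidate values of `k` are made up, and a real project would more likely use one of the libraries named earlier in the deck.

```python
from collections import Counter

# Hypothetical toy data: ((x, y), label). The point (2, 2) carries a noisy label.
TRAIN = [((1, 1), "a"), ((2, 1), "a"), ((1, 2), "a"), ((2, 2), "b"),
         ((6, 6), "b"), ((7, 6), "b"), ((6, 7), "b")]
HOLDOUT = [((2, 3), "a"), ((6, 5), "b"), ((3, 2), "a"), ((7, 7), "b")]


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def knn_predict(train, point, k):
    """Classify `point` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: squared_distance(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def holdout_accuracy(train, holdout, k):
    """Fraction of hold-out points classified correctly for a given k."""
    hits = sum(knn_predict(train, x, k) == y for x, y in holdout)
    return hits / len(holdout)


# One experiment per candidate parameter value; keep the best-scoring one.
scores = {k: holdout_accuracy(TRAIN, HOLDOUT, k) for k in (1, 3, 7)}
best_k = max(scores, key=scores.get)
```

On this toy data a small `k` is misled by the noisy point and a large `k` simply predicts the majority class, which is why several settings are tried rather than one.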
What to use When?
5. Evaluation
What problem are we trying to solve again?
• Our final solution needs to be evaluated against the original Business Understanding
• Did we meet our objectives?
• Did we address all issues?
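One way to make "did we meet our objectives?" concrete is to state the business objective as metric thresholds and check the model against them. The sketch below assumes a binary classifier and hypothetical targets (`min_recall`, `min_precision`); the labels and predictions are made up.

```python
def confusion_counts(actual, predicted):
    """Count true positives, false positives, and false negatives."""
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)
    fp = sum((not a) and p for a, p in pairs)
    fn = sum(a and (not p) for a, p in pairs)
    return tp, fp, fn


def meets_objectives(actual, predicted, min_recall, min_precision):
    """Compare recall and precision with thresholds agreed with the business."""
    tp, fp, fn = confusion_counts(actual, predicted)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "passed": recall >= min_recall and precision >= min_precision,
    }


# Hypothetical hold-out labels and model predictions.
actual = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False]
report = meets_objectives(actual, predicted, min_recall=0.7, min_precision=0.8)
```

Here the model clears the recall target but misses the precision target, so by the stated objectives it would go back to the modeling phase.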
6. Deployment
Engineering Time!
• It’s time for the work products of data science to “graduate” from “new insights” to real applications.
• Processes must be hardened, repeatable, and generally perform well too!
• Full Data Governance applied
• PMML (Predictive Model Markup Language): XML-based interchange format
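To give a feel for what an XML-based model interchange looks like, the sketch below writes a small linear regression as a PMML-style document using Python's built-in `xml.etree.ElementTree`. It is an illustration only: the field names and coefficients are hypothetical, the element layout follows the general shape of a PMML 4.2 `RegressionModel`, and the output has not been validated against the PMML schema.

```python
import xml.etree.ElementTree as ET


def linear_model_to_pmml(intercept, coefficients, target):
    """Build a minimal PMML-style document for a linear regression (unvalidated)."""
    pmml = ET.Element("PMML", version="4.2", xmlns="http://www.dmg.org/PMML-4_2")
    ET.SubElement(pmml, "Header", description="Illustrative export")

    # Declare every field the model reads or produces.
    dictionary = ET.SubElement(pmml, "DataDictionary")
    for name in [*coefficients, target]:
        ET.SubElement(dictionary, "DataField", name=name,
                      optype="continuous", dataType="double")

    model = ET.SubElement(pmml, "RegressionModel", functionName="regression")
    schema = ET.SubElement(model, "MiningSchema")
    for name in coefficients:
        ET.SubElement(schema, "MiningField", name=name)
    ET.SubElement(schema, "MiningField", name=target, usageType="target")

    # The fitted parameters: one intercept plus one coefficient per predictor.
    table = ET.SubElement(model, "RegressionTable", intercept=str(intercept))
    for name, coefficient in coefficients.items():
        ET.SubElement(table, "NumericPredictor", name=name,
                      coefficient=str(coefficient))
    return ET.tostring(pmml, encoding="unicode")


# Hypothetical model: field names and numbers are made up.
pmml_xml = linear_model_to_pmml(
    1.5, {"tenure_months": 0.2, "monthly_spend": -0.05}, target="churn_score"
)
```

The appeal of an interchange format like this is that the scoring application only needs a PMML consumer, not the tool the model was trained in.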
Some Thoughts
• Big Science requires the convergence of data governance, advanced data engineering, math and statistics, and business smarts
• Data Science must be guided by best practices and standards
• Tools and techniques must ultimately be platform agnostic (portable)
• Work with experts that have done it before!
Thank You
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
@joe_Caserta
Swag Giveaway
RAFFLE!!!

Editor's Notes

  • #6: Reports → correlations → predictions → recommendations
  • #14: Data science is not about Hadoop, but it is about modern data engineering. Think polyglot persistence: the right tool for the job. Visualization can be Tableau, Excel, ggplot2, or d3.js. Or anything.
  • #19: Exploration tools: Trifacta, Paxata, Python, Pig, Hive, Waterline, HCatalog, Hive metastore, Solr
  • #23: Paco Nathan made one of these, too.
  • #25: Cascading, Zementis: Meetup on June 3