Mark Rittman, Independent Analyst, MJR Analytics
DATA INTEGRATION AND DATA WAREHOUSING
FOR CLOUD, BIG DATA AND IOT: 

WHAT’S NEW, WHAT’S COMING … AND WHAT’S MISSING RIGHT NOW
BIG DATA WORLD, LONDON
London, March 2017
About The Presenter
•Oracle ACE Director, Independent Analyst
•Past ODTUG Exec Board Member + Oracle Scene Editor
•Author of two books on Oracle BI
•Co-founder & CTO of Rittman Mead
•15+ Years in Oracle BI, DW, ETL + now Big Data
•Host of the Drill to Detail Podcast (www.drilltodetail.com)
•Based in Brighton & work in London, UK
A BIT OF HISTORY…
Data Warehousing Back in the Mid-2000s
•Data warehouses provided a unified view of the business
•Single place to store key data and metrics
•Joined-up view of the business
•Aggregates and conformed dimensions
•ETL routines to load, cleanse and conform data
•BI tools for simple, guided access to information
•Tabular data access using SQL-generating tools
•Drill paths, hierarchies, facts, attributes
•Fast access to pre-computed aggregates
•Packaged BI for fast-start ERP analytics
[Diagram: source systems (Core ERP Platform, Retail, Banking, Call Center, E-Commerce, CRM; databases labelled Oracle, MongoDB, Sybase, IBM DB/2, MS SQL Server) feed a Data Warehouse made up of an ODS / Foundation Layer and an Access & Performance Layer, which serves Business Intelligence Tools]
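As a rough illustration of "fast access to pre-computed aggregates", the sketch below builds a tiny star schema in an in-memory SQLite database and has BI-style queries read from a summary table instead of the fact table. The table and column names (`dim_product`, `fact_sales`, `agg_sales_by_category`) are invented for this example and do not come from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension and fact tables (hypothetical names, for illustration only)
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, sale_date TEXT, amount REAL);

    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO fact_sales VALUES
        (1, '2006-01-01', 10.0),
        (1, '2006-01-02', 15.0),
        (2, '2006-01-01', 40.0);

    -- Pre-computed aggregate: an ETL routine would rebuild this after each load
    CREATE TABLE agg_sales_by_category AS
        SELECT d.category, SUM(f.amount) AS total_amount
        FROM fact_sales f
        JOIN dim_product d USING (product_id)
        GROUP BY d.category;
""")


def sales_by_category(connection):
    """Answer the report from the small aggregate table, not the fact table."""
    cursor = connection.execute(
        "SELECT category, total_amount FROM agg_sales_by_category ORDER BY category"
    )
    return dict(cursor.fetchall())


totals = sales_by_category(conn)
print(totals)
```

The trade-off is the usual one: queries become cheap, but the aggregate is only as fresh as the last ETL run.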
How Traditional RDBMS Data Warehousing Scaled-Up
•Shared-Everything Architectures (e.g. Oracle RAC, Exadata)
•Shared-Nothing Architectures (e.g. Teradata, Netezza)
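To make the shared-nothing idea concrete, here is a minimal sketch in which each node owns a disjoint slice of the rows, chosen by hashing a key, and a query is answered by combining per-node partial results. The function names and the CRC32-based placement are assumptions made for this example; this is not how Teradata or Netezza distribute data internally.

```python
from zlib import crc32


def node_for(key, node_count):
    """Pick the owning node for a key with a stable hash (illustrative choice)."""
    return crc32(key.encode("utf-8")) % node_count


def load(rows, node_count):
    """Distribute (customer, amount) rows so that each node holds its own slice."""
    nodes = [[] for _ in range(node_count)]
    for customer, amount in rows:
        nodes[node_for(customer, node_count)].append((customer, amount))
    return nodes


def total_sales(nodes):
    """Each node sums its local rows; only the small partial sums are combined."""
    partials = [sum(amount for _customer, amount in node) for node in nodes]
    return sum(partials)


rows = [("alice", 10), ("bob", 5), ("alice", 7), ("carol", 3)]
nodes = load(rows, node_count=3)
grand_total = total_sales(nodes)
print(grand_total)
```

Because placement depends only on the key, all rows for one customer land on the same node, which is what lets per-customer work run without cross-node traffic.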
Around the Same Time…
•Google needed to store and query their vast amount of server log files
•And wanted to do so using cheap, commodity hardware
•Google File System and MapReduce designed together for this use
Google File System + MapReduce Key Innovations
•GFS optimised for particular task at hand - computing PageRank for sites
•Streaming reads for PageRank calcs, block writes for crawler whole-site dumps
•Master node only holds metadata
•Stops client/master I/O being bottleneck, also acts as traffic controller for clients
•Simple design, optimised for specific Google need
•MapReduce focused on simple computations on an abstraction framework
•Select & filter (MAP) and reduce (aggregate) functions, easy to distribute on cluster
•MapReduce abstracted cluster compute, GFS abstracted cluster storage
•Projects that inspired Apache Hadoop + HDFS
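The select-and-filter (map) then aggregate (reduce) pattern can be sketched in a few lines of single-process Python. This is only a toy model of the programming style, with an assumed three-field log format (`host path status`); the real framework's contribution is running the same two functions across many machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(log_line):
    """MAP: select and filter. Emit (status, 1) for well-formed lines only."""
    parts = log_line.split()
    if len(parts) == 3:
        _host, _path, status = parts
        yield status, 1


def shuffle(pairs):
    """Group mapped values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """REDUCE: aggregate all values that share a key."""
    return key, reduce(lambda a, b: a + b, values)


def run_job(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


logs = [
    "host1 /index.html 200",
    "host2 /missing 404",
    "malformed line",
    "host1 /about.html 200",
]
counts = run_job(logs)
print(counts)
```

Since `map_phase` looks at one line at a time and `reduce_phase` at one key at a time, both are easy to distribute across a cluster.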
Hadoop’s Original Appeal to Data Warehouse Owners
•A way of storing (non-relational) data cheaply, in an easily expandable way
•Gave us a way of scaling beyond TB-size without paying $$$
•First use-cases were offline storage, active archive of data
Data Warehousing and ETL Needed Some Agility
•Driven by pace of business, and user demands for more agility and control
•Traditional IT-governed data loading not always appropriate
•Not all data needed to be modelled right-away
•Not all data suited storing in tabular form
•New ways of analyzing data beyond SQL
•Graph analysis
•Machine learning
Hadoop 2.0 - Enabling Multiple Query Engines
•Hadoop started by being synonymous with MapReduce, and Java coding
•But YARN (Yet Another Resource Negotiator) broke this dependency
•Hadoop now just handles resource management
•Multiple different query engines can run against data in-place
•General-purpose (e.g. MapReduce)
•Graph processing
•Machine Learning
•Real-Time Processing
Advent of Schema-on-Read, and Data Lakes
•Storing data in the format it arrived in, and then applying schema at query time
•Suits data that may be analysed in different ways by different tools
•In addition, some datatypes may have schema embedded in file format
•Key benefit - fast arriving data of unknown value can get to users earlier
•Made possible by tools such as Apache Hive + SerDes, Apache Drill and self-describing file formats, HDFS storage
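A minimal sketch of schema-on-read, assuming the raw data is newline-style JSON: the records are stored exactly as they arrived, and each consumer projects and casts only the fields it cares about at query time. The field names and the two "schemas" are made up for the example; Hive SerDes and Drill do this at a very different scale.

```python
import json

# Raw events kept in the format they arrived in; fields vary from record to record.
RAW_EVENTS = [
    '{"user": "alice", "amount": "12.50", "device": "ios"}',
    '{"user": "bob", "amount": "7"}',
    '{"user": "carol", "amount": "3.25", "referrer": "email"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> cast function) while reading, not while loading."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }


# Two tools read the same raw data with different schemas (hypothetical examples).
billing_schema = {"user": str, "amount": float}
marketing_schema = {"user": str, "device": str, "referrer": str}

billing_rows = list(read_with_schema(RAW_EVENTS, billing_schema))
marketing_rows = list(read_with_schema(RAW_EVENTS, marketing_schema))
total = sum(row["amount"] for row in billing_rows)
print(total)
```

Nothing had to be modelled before the data was stored; the cost is that every reader repeats the parsing and has to cope with missing fields.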
Data Warehousing Circa 2010 : The “Data Lake”
•Data now landed in Hadoop clusters, NoSQL databases and Cloud Storage
•Flexible data storage platform with cheap storage, flexible schema support + compute
•Solves the problem of how to store new types of data + choose best time/way to process it
•Hadoop/NoSQL increasingly used for all store/transform/query tasks
[Diagram: data reservoir on a Hadoop Platform. Sources: Operational Data (Transactions, Customer Master Data), Unstructured Data (Voice + Chat Transcripts), Data streams. Data Transfer: File Based Integration, Stream Based Integration, ETL Based Integration. Data Reservoir: Raw Customer Data (data stored in the original format, usually files, such as SS7, ASN.1, JSON etc.) and Mapped Customer Data (data sets produced by mapping and transforming raw data). Data Factory. Discovery & Development Labs: safe & secure discovery and development environment, data sets and samples, models and programs. Data Access: Business Intelligence Tools, Marketing / Sales Applications (Models, Machine Learning, Segments)]
DATA WAREHOUSING & BIG DATA TODAY…
On-Premise Big Data Analytics Hits Its Limits
•On-premise Hadoop, even with simple resilient clustering, will hit limits
•Clusters can reach 5000+ nodes, need to scale-up for demand peaks etc
•Scale limits are encountered way beyond those for DWs…
•… but future is elastically-scaled, query and compute-as-a-service
Oracle Big Data Cloud Compute Edition: free $300 developer credit at https://cloud.oracle.com/en_US/tryit
Elastically-Scalable Data Warehouse-as-a-Service
•New generation of big data platform services from Google, Amazon, Oracle
•Combines three key innovations from earlier technologies:
•Organising of data into tables and columns (from RDBMS DWs)
•Massively-scalable and distributed storage and query (from Big Data)
•Elastically-scalable Platform-as-a-Service (from Cloud)
Example Architecture : Google BigQuery
BigQuery : Big Data Meets Data Warehousing
•And things come full-circle … analytics typically requires tabular data
•Google BigQuery based on DremelX massively-parallel query engine
•But stores data in columnar form and provides a SQL interface
•Solves the problem of providing DW-like functionality at scale, as-a-service
•This is the future … ;-)
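Why columnar storage suits analytic SQL can be shown with a toy example: once rows are pivoted into one list per column, an aggregate query only has to read the columns it names. The data and function names below are invented for illustration, and this is not a description of BigQuery's actual storage format or of the Dremel engine.

```python
rows = [
    {"country": "UK", "device": "ios", "revenue": 10.0},
    {"country": "UK", "device": "android", "revenue": 5.0},
    {"country": "US", "device": "ios", "revenue": 20.0},
]


def to_columns(row_data):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for row in row_data:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def revenue_by_country(columns):
    """Roughly: SELECT country, SUM(revenue) FROM t GROUP BY country.

    Only the 'country' and 'revenue' columns are touched; 'device' is never read.
    """
    totals = {}
    for country, revenue in zip(columns["country"], columns["revenue"]):
        totals[country] = totals.get(country, 0.0) + revenue
    return totals


columns = to_columns(rows)
totals = revenue_by_country(columns)
print(totals)
```

On wide tables with many columns, skipping the unread ones is where most of the scan savings come from.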
DATAFLOW PIPELINES ARE THE NEW ETL…
New ways to do BI
MACHINE LEARNING & SEARCH FOR “AUTOMAGIC” SCHEMA DISCOVERY
New ways to do BI
Google GOODS - Catalog + Search At Google-Scale
•By definition there's lots of data in a big data system ... so how do you find the data you want?
•Google's own internal solution - GOODS ("Google Dataset Search")
•Uses crawler to discover new datasets
•ML classification routines to infer domain
•Data provenance and lineage
•Indexes and catalogs 26bn datasets
•Other users, vendors also have solutions
•Oracle Big Data Discovery
•Datameer
•Platfora
•Cloudera Navigator
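The crawl-then-index part of a dataset catalog can be sketched at a very small scale: walk a directory, record each CSV file's column names as metadata, and search that metadata. Everything here (the file names, the `crawl` and `search` helpers, CSV as the only format) is an assumption for the example; it leaves out the ML classification and lineage tracking that the slide attributes to GOODS.

```python
import csv
import os
import tempfile


def crawl(root):
    """Walk a directory tree and record the header row of every CSV file found."""
    catalog = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".csv"):
                path = os.path.join(dirpath, name)
                with open(path, newline="") as handle:
                    header = next(csv.reader(handle), [])
                catalog[path] = {"columns": header}
    return catalog


def search(catalog, column):
    """Return the paths of datasets whose metadata mentions the given column."""
    return sorted(path for path, meta in catalog.items() if column in meta["columns"])


with tempfile.TemporaryDirectory() as root:
    # Two hypothetical datasets, created only so the crawler has something to find.
    samples = [
        ("orders.csv", "order_id,customer_id,amount"),
        ("customers.csv", "customer_id,name"),
    ]
    for name, header in samples:
        with open(os.path.join(root, name), "w", newline="") as handle:
            handle.write(header + "\n")

    catalog = crawl(root)
    matches = [os.path.basename(path) for path in search(catalog, "customer_id")]

print(matches)
```

The point of the pattern is that users query the catalog, not the storage system, so discovery stays fast however many datasets exist.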
A NEW TAKE ON BI…
Web-Based Data Analysis Notebooks
•Came out of the data science movement, as a way to "show workings"
•A set of reproducible steps that tell a story about the data
•as well as being a better command-line environment for data analysis
•One example is Jupyter, evolution of the IPython notebook
•supports PySpark, Pandas etc
•See also Apache Zeppelin
AND EMERGING OPEN-SOURCE BI TOOLS AND PLATFORMS
wp-content/uploads/2016/05/paper.pdf
… Which Is What I’m Working On Right Now
THANK YOU
