Enterprise Data Science
Frank Kienle
Big Data Overview
1. Understand the business
2. Understand data
3. Prepare data
4. Model
5. Evaluation
6. Deployment
CRISP-DM Value Process
Data are individual units of information.
We store more and more data, which leads to Big Data.
Data to Big Data
Erik Larson, Harper's Magazine:
'The keepers of big data say they do it for the consumer's benefit. But data have a way of being used for purposes other than originally intended.'
(Reality today: private data is becoming commoditized)
Big Data definition, 1989
Doug Laney, Gartner, 2001:
'3-D Data Management: Controlling Data Volume, Velocity and Variety'
Big Data definition, 2001
Source: avyamuthanna.files.wordpress.com/2013/01/picture-11.png
Big Data is any data that is expensive to manage and hard to extract value from.
(Source: Michael Franklin, Director of the AMPLab (Algorithms, Machines and People Lab), University of California, Berkeley)
Extracting value out of big data is all about predicting the future based on observation of the past.
Big Data today: it's all about value
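The simplest possible form of "predicting the future based on observation of the past" is a trend line fitted to past observations and extrapolated one step ahead. A minimal least-squares sketch in plain Python (the numbers are illustrative, not from the slides):

```python
# Least-squares line fit by hand: predict the next value from past observations.
xs = [0, 1, 2, 3]          # past time steps (illustrative data)
ys = [2.0, 4.1, 5.9, 8.0]  # observed values

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

next_value = slope * 4 + intercept  # extrapolate one step into the future
print(round(next_value, 2))
```

At enterprise scale the fitting happens over far richer models, but the principle is the same: a model of the past, applied to the future.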
Big Data: the four V's. Source: https://www.ibmbigdatahub.com/infographic/four-vs-big-data
handling (big) data is an art - not a value
§ up to 75 control devices in each BMW
§ ~ 1,000 individual configurations possible
§ ~ 1 GByte functional software, 15 GByte data in the car
§ ~ 2,000 customer functions implemented
§ ~ 12,000 onboard error storage memories
§ daily up to 60,000 diagnosis processes worldwide
§ centralized data storage and organization
§ data fusion and data mining for quality assurance and a better understanding of realistic environments
Source: Bitkom BMW keynote talk
source: pixabay
Tracking the data in a car can have benefits, but comes with security / privacy challenges.
See lecture on ethical challenges.
Big Data Sources: Car black boxes
source: Los Angeles Times
A gas turbine has up to 1,000 sensors.
§ Each sensor can (theoretically) produce data in the millisecond range
§ example real-life setup:
§ averages are stored per second (history kept for one year)
§ often a long history is available, e.g. back to the year 2000 as 5-minute averages
IoT Sensor Data example: Gas Turbine
source: pixabay
Realistic scenario: store tuples (timestamp, value)
• new sensors will be introduced, sensors might change

Theoretical data stream storage, gas turbine example:
(timestamp, value) = 64 Byte × 1,000 sensors →
► 64 kByte per 20 ms
► 3.2 MByte per 1 s
► 276 GByte per 1 day
► 100.9 TByte per 1 year
► × 100 engines in one data center → ~10 PByte per 1 year

Reality:
► 1 year stored as 1 s averages: 200 GByte
► 10 years stored as 5 min averages: ~7 TByte
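The theoretical figures follow directly from the tuple size and the sampling interval; a quick back-of-the-envelope check in Python:

```python
# Back-of-the-envelope storage estimate for the gas turbine example:
# 1,000 sensors, one 64-byte (timestamp, value) tuple per sensor every 20 ms.
SENSORS = 1000
TUPLE_BYTES = 64
SAMPLE_INTERVAL_S = 0.020  # 20 ms

bytes_per_sample = SENSORS * TUPLE_BYTES                 # 64 kByte per 20 ms
bytes_per_second = bytes_per_sample / SAMPLE_INTERVAL_S  # 3.2 MByte per second
bytes_per_day = bytes_per_second * 86_400
bytes_per_year = bytes_per_day * 365

print(f"per sample: {bytes_per_sample / 1e3:.1f} kB")
print(f"per second: {bytes_per_second / 1e6:.1f} MB")
print(f"per day:    {bytes_per_day / 1e9:.1f} GB")
print(f"per year:   {bytes_per_year / 1e12:.1f} TB")
print(f"100 engines, 1 year: {100 * bytes_per_year / 1e15:.1f} PB")
```

The "reality" figures are much smaller only because averaging (per second, per 5 minutes) replaces the raw 20 ms stream.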
Big Data Landscape - Data Lake Architecture
Components overview and terminology
(data spectrum: unstructured → semi-structured → mostly structured)
The data lake is one part in the overall data-to-value path:
• Source: Raw (big) data typically comes from many different sources and has many different data types (twitter, www, social, sensors, mobile, payments, transactions, transport, video, pictures, voice).
• Manage: A data lake is a storage repository that holds a vast amount of (big) data in its native format and provides intelligent (semi-structured) access until it is needed.
• Value: The value of data is delivered via enterprise systems / UX components, with the overall goal to perform data-driven decisions.
Stages of Data in a Data Lake – High-Level Architecture
The data flow and the technology, tools and programming used depend on the data type and the final application layer.

Data flow: Data Sources (business systems) → Ingestion → Raw Data → Transform / Curate → Enriched Data → Delivery → Applications & Visualizations

Stage               What                                       How
Ingestion           File transfer, RDB import, REST APIs       Stream or batch transfer
Raw Data            Initial raw data storage                   Distributed storage (e.g. Hadoop)
Transform / Curate  Cleansing / transform for purpose          Distributed storage (e.g. Hadoop)
Enriched Data       Add semantics, searchable, anonymized, …   Databases for purpose
Delivery            Semantic data access                       On-request data services

simplified data lake data path
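The ingestion → curation → delivery path above can be sketched in a few lines of plain Python (the zone names and records are illustrative; real deployments would use distributed storage such as HDFS and purpose-built databases):

```python
# Minimal data-lake-style pipeline sketch: land raw records,
# curate them for a purpose, then serve them on request.
raw_zone = []       # stands in for distributed raw storage (e.g. HDFS)
curated_zone = {}   # stands in for a purpose-built database

def ingest(records):
    """Ingestion: land records in their native (raw) form."""
    raw_zone.extend(records)

def curate():
    """Transform / curate: cleanse and index for semantic access."""
    for rec in raw_zone:
        value = rec.get("value")
        if value is None:          # cleansing: drop incomplete records
            continue
        curated_zone.setdefault(rec["sensor"], []).append(value)

def deliver(sensor):
    """Delivery: on-request data service over the enriched data."""
    values = curated_zone.get(sensor, [])
    return sum(values) / len(values) if values else None

ingest([{"sensor": "t1", "value": 410.0},
        {"sensor": "t1", "value": 414.0},
        {"sensor": "t2", "value": None}])
curate()
print(deliver("t1"))   # average of the curated t1 readings
```

Note that the raw zone keeps everything in native form; only the curated zone is shaped for the consuming application.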
Exemplary high-level walk-through to extract, store and deliver trend information

Data Source → Data Lake (storing and mining relevant information) → Final Presentation
• Data Source: WWW sources; large-scale web crawlers download all links found and save the webpages ((semi-)unstructured or raw data).
• Mining big data: search and mine the saved data to extract semantics (relevance), yielding the relevant internet webpages for a topic.
• Information retrieval: a structured (graph) database of trends allows for easy access (clean, structured data).
• Final Presentation: trend report and drill-down boards.
source: trends.google
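A miniature version of this crawl-and-mine path, using only the Python standard library (the pages and the word-count "trend database" are illustrative stand-ins for a large-scale crawler and a graph database):

```python
from html.parser import HTMLParser
from collections import Counter

class LinkAndTextParser(HTMLParser):
    """Extract outgoing links and visible text from a saved webpage."""
    def __init__(self):
        super().__init__()
        self.links, self.words = [], []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href"]
    def handle_data(self, data):
        self.words += data.lower().split()

# "Saved webpages" from the crawler (inlined here instead of downloaded).
saved_pages = {
    "page1.html": '<p>big data trends</p><a href="page2.html">more</a>',
    "page2.html": "<p>data lakes and data pipelines</p>",
}

trend_db = Counter()   # stands in for the structured trend database
crawl_graph = {}       # which page links where

for url, html in saved_pages.items():
    parser = LinkAndTextParser()
    parser.feed(html)
    crawl_graph[url] = parser.links
    trend_db.update(parser.words)

print(trend_db.most_common(1))   # most frequent term across the pages
```

The crawl graph and term counts are exactly the kind of semi-structured intermediate data a real trend pipeline would persist in the lake before building the final report.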
A data-to-value architecture is composed of many building blocks:
• Data sources and data ingestion
• Data Storage
• Data Access / Pipelines
• Value Delivery
• Business Application
• Functional Layer
• Data Governance
• Deployment / Physical
Depending on the data type and the final business application, different elements are utilized (raw data input → valuable data output).
A data lake is often a fundamental part of the data-to-value stack and focuses on the technical management of big data. The (often) focus of data lake architectures within this stack:
• Data sources and data ingestion
• Data Storage
• Data Access / Pipelines
• Value Delivery
The surrounding layers (Business Application, Functional Layer, Data Governance, Deployment / Physical) remain part of the overall stack. Depending on the data type and the final business application, different elements are utilized (raw data input → valuable data output).
Data Lake high-level architecture, with different possibilities to store, process, and deliver valuable information:
• Data Sources: Unstructured (text, emails, documents; video, media; voice, music, sound); Semi-structured (XML, JSON, sensor, IoT); Structured (databases, ERP core)
• Data Ingestion: Stream, Batch, Hybrid
• Data Storage: Relational (row based, column based); Non-relational (graph DB, document DB, key-value)
• Data Access / Pipelines: Stream, Batch, Interactive
• Value Delivery: Descriptive, Predictive, Prescriptive; Visualizations, Interfaces
• Business Application: Operational, Tactical, Strategic
• Data Governance: Availability, Data Security, Compliance & Controls, Roles & Responsibility, Data Quality, Reporting, …
• Deployment: On premise, Cloud, Hybrid; Application life cycle
Depending on the data type and the final business application, different elements are utilized.
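The row-based vs. column-based distinction in the storage layer can be illustrated with two toy in-memory layouts (not a real database; the records are illustrative):

```python
# Toy illustration of row-based vs. column-based storage layouts.
records = [
    {"sensor": "t1", "ts": 1, "value": 410.0},
    {"sensor": "t1", "ts": 2, "value": 414.0},
    {"sensor": "t2", "ts": 1, "value": 380.0},
]

# Row store: one contiguous record per entity -> good for point lookups.
row_store = records

# Column store: one array per attribute -> good for scans and aggregates.
col_store = {key: [r[key] for r in records] for key in records[0]}

# An analytical query such as "average value" touches only one column here:
avg = sum(col_store["value"]) / len(col_store["value"])
print(avg)
```

This is why analytical workloads tend toward column-based storage while transactional workloads tend toward row-based storage.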
The design of a data pipeline / data lake depends on the business, technical and non-functional requirements:
• Business requirements: Why do we need this?
• Data requirements: Which data are needed?
• Technical requirements: How to realize it?
• Non-functional requirements: What constraints?
These requirements cut across the same layer stack (data storage, data access / pipelines, value delivery, business application, governance, deployment).
The design of a data pipeline / data lake depends on the business, technical and non-functional requirements; example questions to be answered:

Business requirements (Why do we need this?):
• Who is the customer (internal, external)?
• How does it help, and in which situation / process?
• Which value do we expect?
• When we improve quality by x%, which benefit do we expect?

Non-functional requirements (What constraints?):
• Which service level does the solution have (on request, 99% uptime)?
• Where is the data allowed to be stored, e.g. GDPR?
• Who has access to the application / data?
• How is the support organized?
• Which security level is granted?

Technical requirements (How to realize it?):
• How to visualize / serve the results / back-integration?
• How does the application provide the result, e.g. which technical interface?
• How is the data stored, and what are the latency requirements for read / write?
• How to ensure a test / productive setup?
• Where do we compute, and which libraries?
• Which algorithms serve the requirements best?
For each layer in the data stack many different vendors and applications exist.
(layers: Data Storage; Data Access / Pipelines; Value Delivery; Business Application; Functional Layer; Deployment / Physical)

Focus: managing big data and data pipelines
• Infrastructure and hardware for Big Data
• Big Data distributions (e.g. Hadoop)
• Components for data management (distributed data systems, in-memory databases, …)

Focus: extracting value
• Full business SaaS services
• Toolboxes, visualization
• Workflow enablement
Nearly all technical Big Data / Data Lake stacks are based on the (open source) Hadoop ecosystem.

Component   Description
HDFS        The Hadoop Distributed File System.
Mahout      Machine learning on the HDFS system.
Zookeeper   A centralized service for maintaining synchronization and group services.
Yarn        Hadoop's resource manager and job scheduler.
HBase       The Hadoop database.
Pig         A high-level data-flow language and execution framework for parallel computation.
Spark SQL   A module for structured and semi-structured data processing.
Hive        A data warehouse infrastructure supporting data summarization, query, and analysis.
Sqoop       A tool to move data from an RDBMS to Hadoop.
Flume       A service for moving log data into Hadoop.

Component layers (focus in stack):
• Ingestion: Flume (unstructured or semi-structured data), Sqoop (structured data)
• Data Storage: HDFS (Hadoop Distributed File System), HBase
• Data Access / Pipelines: MapReduce framework, Apache Oozie (workflow), Hive (DW system), Pig Latin (data analysis), Mahout (machine learning)
• Coordination across all layers: Zookeeper
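The MapReduce paradigm at the heart of this stack can be sketched without a cluster; a single-process Python stand-in for the map, shuffle and reduce phases (Hadoop distributes the same logic over many machines):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit (key, 1) for every word, like a Hadoop mapper."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key, then sum the counts per key."""
    by_key = sorted(pairs, key=itemgetter(0))   # the "shuffle" step
    for key, group in groupby(by_key, key=itemgetter(0)):
        yield (key, sum(count for _, count in group))

lines = ["big data", "data lake", "big data platform"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts["data"])   # "data" occurs three times
```

The key property is that both phases operate on independent (key, value) pairs, which is what lets Hadoop parallelize them across a cluster.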
Nearly all Big Data / Data Lake stacks are based on the (open source) Hadoop ecosystem. However, only enterprise Big Data platforms ensure professional management.

Component   Description
Ambari      An open operational framework for provisioning, managing and monitoring Apache Hadoop clusters.
HDFS        The Hadoop Distributed File System.
Zookeeper   A centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services.
Yarn        Hadoop's resource manager and job scheduler.
HBase       The Hadoop database.
Pig         A high-level data-flow language and execution framework for parallel computation.
Spark SQL   A module for structured and semi-structured data processing.
Hive        A data warehouse infrastructure supporting data summarization, query, and analysis.
Sqoop       A tool to move data from an RDBMS to Hadoop.
Flume       A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data into Hadoop.
Kafka       A high-throughput, distributed, publish-subscribe messaging system.
Visualization tool examples for data scientists
(some practical tools/libraries; the purpose defines the tool)

• Rapid prototyping: Python (Matplotlib), R (Shiny). Open-source programming languages with active community participation, quick results, and must-know skills for a data scientist.
• Web: D3.js + derivatives. Focus on interactive data visualizations in web browsers; a JavaScript library for manipulating documents based on data.
• General purpose: Excel. Most often used by nearly everybody for visualization due to its powerful capabilities and market penetration.
• Professional visual exploration: Tableau, Qlik, MS PowerBI. Professional interactive visualization tools with a focus on quick insights, with the goal to provide business intelligence (BI) for an enterprise.

Focus in stack: Visualization
Libraries / Algorithms / Programming / Tools
(some practical tools/libraries; the purpose defines the selection)

• Statistics / machine learning: Python + R. The two most important languages for data science (there exist many more).
• (Big) data processing: Spark + SQL. Query languages and stream/batch processing paradigms with easy access to managed big data (there exist many more).
• General purpose: Excel. Worldwide the most used tool for data processing / calculation purposes, with powerful capabilities (mostly not known).
• Tool providers, statistics/ML: SAS, RapidMiner, Knime, Matlab, … Professional tools with the goal to provide packaged, maintained and easily consumable analytics for professional and citizen data scientists.

Focus in stack: Functional Layer, Data Pipelines
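The declarative, SQL-style access that Spark SQL or Hive provide at scale can be illustrated on a single machine with the stdlib sqlite3 module (a stand-in, not Spark itself; the table and values are illustrative):

```python
import sqlite3

# In-memory table standing in for a curated data-lake table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
con.executemany("INSERT INTO readings VALUES (?, ?)",
                [("t1", 410.0), ("t1", 414.0), ("t2", 380.0)])

# The same kind of aggregate query Spark SQL / Hive would run at scale.
avg_by_sensor = dict(con.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor"))
print(avg_by_sensor)
```

The point of such query languages is that the same declarative statement works whether the table holds three rows or three billion; only the engine underneath changes.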
Big Data Landscape 2012
Big Data Landscape v 3.0 by Sub-Categories (source kdnuggets.com)