Designing Big Data Pipelines
Claudio Ardagna, Paolo Ceravolo, Ernesto Damiani
Big Data
• Huge amounts of data are generated and collected every minute (e.g., by sensors)
• 1.7 million billion bytes of data, over 6 megabytes for each human (2016)
• 2.5 quintillion bytes of data are created each day
• The trend is rapidly accelerating with the growth of the Internet of Things (IoT): 200 billion connected devices by 2020
• Low-latency access to huge distributed data sources has become a value proposition
• Business intelligence applications require proper big data analysis and management functionality
What to do with these data?
• Aggregation and Statistics
  • Data warehouse and OLAP
• Indexing, Searching, and Querying
  • Keyword-based search
  • Pattern matching (XML/RDF)
• Knowledge discovery
  • Data Mining
  • Statistical Modeling
  • Machine Learning
The Big Data difference
• Classic analytics assume:
  • Standard data models/formats
  • Reasonable volumes
  • Loose deadlines
• Problem: the five Vs (volume, velocity, variety, veracity, value) jeopardise these assumptions, unless we sample or summarize
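As one illustration of the "sample" option, the sketch below uses reservoir sampling to keep a fixed-size uniform sample from a stream whose length is not known in advance. The slides do not name a sampling technique; the function name `reservoir_sample`, the sample size, and the synthetic readings are assumptions made for this example.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep a uniform random sample of at most k items from a stream."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Hypothetical sensor readings, generated lazily so they never sit in memory.
readings = (x * 0.5 for x in range(100_000))
subset = reservoir_sample(readings, k=100)
```

The sample fits in memory regardless of the stream size, so classic analytics can then run on `subset` at the cost of some accuracy.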
The data analytics pipeline
Processing Models: batch vs stream
• Batch
  • Receive, accumulate, then compute (data lake)
• Stream
  • Compute while receiving (data flow)
• Same questions, different algorithms
• Both different from “mouse” computations
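To make "same questions, different algorithms" concrete, here is a minimal sketch that answers one question (what is the mean?) in both styles. The function names and the sample data are assumptions for illustration, not part of the original slides.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Batch style: all values have been accumulated before computing."""
    return sum(values) / len(values)


def stream_mean(values: Iterable[float]) -> Iterator[float]:
    """Stream style: update the answer as each value arrives."""
    count = 0
    mean = 0.0
    for x in values:
        count += 1
        # Incremental update: no need to store past values.
        mean += (x - mean) / count
        yield mean


data = [4.0, 8.0, 6.0, 2.0]
running = list(stream_mean(data))  # one intermediate answer per arrival
```

The batch version needs the whole data set at hand, while the streaming version keeps only a counter and the current mean, and both agree once every value has been seen.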
Hurdles in Adoption of Big Data Technologies
• Complex Architecture
• Lack of Standardization
• Regulatory Barriers
• Violation of Data Access
• Sharing & Custody Regulation
• High Cost of Legal Clearance
Big Data As a Service
• A set of automatic tools and a methodology that allow customers to design and deploy a full Big Data pipeline addressing their goals
How to design a Big Data Pipeline
1. Define a Business Value
2. Identify the Data Sources
3. Define the Data Flow
4. Study Data Protection Directives
5. Define Visualization, Reporting and Interaction
6. Select Data Preparation Stages
7. Identify Processing Requirements
8. Select Analytics
9. Define the Data Processing Flow
Big Data Pipeline Areas
• Ingestion and representation: specify how data are represented (NoSQL, graph-based, relational, extended relational, markup-based, hybrid)
• Preparation: specify how to prepare data for analytics (anonymize, reduce dimensions, hash)
• Processing: specify how data will be routed and parallelized, and how the analytics will be computed (parallel batch, stream, hybrid)
• Analytics: specify the expected outcome (descriptive, prescriptive, predictive)
• Display and reporting: specify the display and reporting of the results (scalar, multi-dimensional)
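The preparation options above (anonymize, reduce dimensions, hash) can be sketched for a single record as follows. This is an illustrative assumption, not the method used in the slides: the field names, the `SECRET_KEY` placeholder, and the helpers `pseudonymize` and `prepare` are hypothetical. Note that a keyed hash gives pseudonymization rather than full anonymization, since whoever holds the key can recompute the mapping.

```python
import hashlib
import hmac
from typing import Dict, Iterable

# Hypothetical key; in practice it would come from a secret store.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def prepare(record: Dict[str, str], id_field: str, keep: Iterable[str]) -> Dict[str, str]:
    """Keep only the listed fields and hash the identifier."""
    out = {name: record[name] for name in keep}  # crude dimension reduction
    out[id_field] = pseudonymize(record[id_field])
    return out


# Hypothetical raw record.
raw = {
    "user_id": "alice@example.org",
    "city": "Milan",
    "reading": "21.5",
    "notes": "free text",
}
prepared = prepare(raw, id_field="user_id", keep=["city", "reading"])
```

Because the digest is deterministic for a given key, records from the same user can still be joined after preparation without exposing the original identifier.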
Model Driven Approach
• Abstract the typical procedural models (e.g., data pipeline) implemented in big data frameworks
• Develop model transformations to translate modelling decisions into actual provisioning
Declarative Models
Procedural Models
Deployment Models
(Non-)Functional Goals: Service goals of Big Data Pipeline
What the BDA should achieve and how to achieve objectives
How the BDA process should work
Declarative Model
• Specifies non-functional/functional goals
• Single model addressing all aspects of big data pipelines: preparation, representation, analytics, processing, display and reporting
• Aspects of different areas may impact the same procedural model template
• Some goals map directly to Service Level Agreements (SLAs); others need a transformation function to map to SLAs
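The difference between goals that map directly to SLAs and goals that need a transformation function can be sketched as below. Every goal name, SLA term, and threshold here is a hypothetical placeholder chosen for illustration; the slides do not define a concrete vocabulary.

```python
from typing import Callable, Dict, Union

Value = Union[str, int, float]

# Hypothetical thresholds for a qualitative latency goal.
LATENCY_MS = {"low": 100, "medium": 1000, "high": 10000}

# Goals listed here need a transformation; all others are copied as-is.
TRANSFORMS: Dict[str, Callable[[Value], Dict[str, Value]]] = {
    "latency": lambda level: {"max_latency_ms": LATENCY_MS[str(level)]},
}


def to_slas(goals: Dict[str, Value]) -> Dict[str, Value]:
    """Turn declarative goals into SLA terms."""
    slas: Dict[str, Value] = {}
    for name, value in goals.items():
        transform = TRANSFORMS.get(name)
        if transform is None:
            slas[name] = value  # direct mapping
        else:
            slas.update(transform(value))  # mapped through a function
    return slas


slas = to_slas({"availability_pct": 99.9, "latency": "low"})
```

Here `availability_pct` is already an SLA term and passes through unchanged, whereas the qualitative `latency` goal is translated into a measurable bound.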
Procedural Model
• Contains all information needed for running the analytics
• Simple to map declarative goals onto procedures
• Platform independent
• Specified as procedural templates (alternatives)
• Procedural templates correspond to defined goals
• May need additional input from final users of big data services
• Templates express the competences of data scientists and data technologists
• Declarative models are used to select the (set of) proper templates
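Selecting procedural templates from a declarative model could look like the sketch below. The `Template` class, the catalog entries, and the matching rule (a template qualifies when it satisfies every stated goal) are assumptions made for illustration; the slides do not specify how selection is implemented.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Template:
    """A hypothetical procedural template and the goals it can satisfy."""
    name: str
    provides: Dict[str, str]


# Hypothetical catalog of alternative templates.
CATALOG: List[Template] = [
    Template("batch-descriptive", {"processing": "parallel batch", "analytics": "descriptive"}),
    Template("stream-predictive", {"processing": "stream", "analytics": "predictive"}),
    Template("batch-predictive", {"processing": "parallel batch", "analytics": "predictive"}),
]


def select_templates(goals: Dict[str, str], catalog: List[Template]) -> List[Template]:
    """Return every template that satisfies all declared goals."""
    return [
        t for t in catalog
        if all(t.provides.get(area) == value for area, value in goals.items())
    ]


# A declarative model that only constrains the analytics area.
loose = [t.name for t in select_templates({"analytics": "predictive"}, CATALOG)]
# Adding a processing goal narrows the set of alternatives.
strict = [
    t.name
    for t in select_templates({"analytics": "predictive", "processing": "stream"}, CATALOG)
]
```

An under-specified declarative model returns several alternatives, which is where additional input from the final user would be needed to pick one.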
Deployment Model
• Specifies how procedural models are to be instantiated in a ready-to-be-deployed architecture
• Drives analytics execution in real scenarios
• To be defined for each application
• Platform dependent
Methodology again
Declarative Model Specification
Service Selection
Procedural Model Definition
Workflow Compiler
Deployment Model Execution
Declarative Specifications
Service Catalog
Service Composition Repository
Deployment Configurations
MBDAaaS Platform
Big Data Platform
To code-based
To recipes

BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Damiani)
