Bloor Group Hosted Webinar:
How Do You Build Data Pipelines
that Are Agile, Automated, and Accurate?
Dave Wells, Eckerson
dwells@eckerson.com
Fernanda Tavares, Syncsort
ftavares@syncsort.com
© Eckerson Group, 2019 www.eckerson.com
How Do You Build Data Pipelines
that Are Agile, Automated, and Accurate?
Dave Wells
dwells@eckerson.com
Data Pipelines – Moving Data through the Ecosystem
[Diagram: Legacy Data, OLTP and ERP, and Big Data Sources feed the Data Lake, Data Warehouse, Data Marts, and Analytic Sandboxes, which serve Data Scientists and Data Analysts, Business Analysts and BI Users, and Report Writers, spanning managed services to self-service.]
• We have a lot of data and lots of variety
• We move data around a lot
• We do a lot of processing
• We store data in many places
• We use data in many ways
• Data is used by many people and many applications
• Demand for data engineers and data pipeline developers outstrips the supply
Data Silos – Data, Data Everywhere …
mainframe legacy systems
unix & .net legacy
legacy data warehouses
Data Value – The Demand for Fast Data
Fast ingestion, processing, and delivery of large data volumes is a key to data value.
As data ages, its value for business decisions and actions may diminish rapidly.
[Chart: value of data vs. age of data, declining from high value at real time to low value at high latency, across real-time, micro-batch, mini-batch, daily batch, weekly batch, and monthly batch delivery.]
Fast Data – Real Time and Streaming Data
Stages: data acquisition, data processing, data archiving, publishing & visualization.
[Diagram: connect to a Sensor Data Stream; parse and filter into Event Data; enrich event data into Event Data with Context; archive to the Data Warehouse and analyze. Outputs: Reports (historical analysis & trends), Dashboards (real time monitoring), and Alerts (mobile & email).]
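The connect, parse & filter, enrich flow above can be sketched as a chain of Python generators. This is a simplified illustration: the sensor readings, field names, and context lookup table are all invented for the example.

```python
# Minimal sketch of the acquisition/processing stages: connect, parse & filter, enrich.
# The sensor readings and the context lookup table are invented for illustration.

RAW_STREAM = ["s1,21.5", "s2,bad", "s1,22.0", "s3,19.8"]   # stand-in for a sensor feed
CONTEXT = {"s1": "boiler room", "s3": "loading dock"}       # stand-in enrichment data

def connect(stream):
    """Yield raw messages from the (simulated) sensor connection."""
    yield from stream

def parse_and_filter(messages):
    """Parse CSV-like messages into events, dropping malformed readings."""
    for msg in messages:
        sensor_id, _, value = msg.partition(",")
        try:
            yield {"sensor": sensor_id, "reading": float(value)}
        except ValueError:
            continue  # filter out unparseable readings

def enrich(events, context):
    """Attach location context to each event (Event Data with Context)."""
    for event in events:
        event["location"] = context.get(event["sensor"], "unknown")
        yield event

pipeline = enrich(parse_and_filter(connect(RAW_STREAM)), CONTEXT)
events = list(pipeline)
```

Because each stage is a generator, events flow through one at a time, which is the shape real stream processors give you at much larger scale.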
Fast Data – Streaming vs. ETL
[Diagram: ETL runs extract, transform, load from origin to destination. CDC with queue: CDC at the origin feeds a message queue of changes; messages are selected and changes applied at the destination. CDC with streaming: CDC at the origin emits a data stream of changes; events are parsed and changes selected for the destination.]
ETL – Scheduled batch processing. Inherently high latency.
CDC with queue – Scheduled or triggered mini- or micro-batch. Low to very low latency.
CDC with streaming – Triggered by data change events. Very low latency to real time.
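The "apply changes" step shared by both CDC patterns can be sketched as a small change applier. The event shapes and the dict-as-table model are invented for illustration; real destinations are databases or warehouses.

```python
# Sketch of applying a stream of CDC change events to a destination table,
# here modeled as a dict keyed by primary key. Event shapes are illustrative.

def apply_changes(destination, change_stream):
    """Apply insert/update/delete events, in order, to the destination."""
    for event in change_stream:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            destination[key] = event["row"]
        elif op == "delete":
            destination.pop(key, None)
    return destination

table = {1: {"name": "Ada"}}
changes = [
    {"op": "insert", "key": 2, "row": {"name": "Grace"}},
    {"op": "update", "key": 1, "row": {"name": "Ada L."}},
    {"op": "delete", "key": 2},
]
result = apply_changes(table, changes)
```

Order matters here: applying the same events out of order would leave the destination in a different state, which is why CDC pipelines preserve event ordering per key.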
Data Pipelines – A Macro View
Data Processing
• ETL / ELT
• Stream Processing
• CDC + Streaming
• etc.
Data Persistence
• Data Lakes
• Warehouses
• Sandboxes
• etc.
Data Services
• API / RPC
• SOAP / REST
• Virtualization
• etc.
[Diagram: data (relational, NoSQL, geospatial, XML, JSON, CSV, etc.) flows through pipes (data flows) and operations (transformations) and is exposed via RPC, API, virtualization, SOAP/REST, etc. to applications, data analysts, and data scientists. Skills span software engineering and database engineering.]
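The pipe (data flow) and operation (transformation) vocabulary maps naturally onto function composition. A sketch, with invented operations, of threading a data flow through a sequence of transformations:

```python
from functools import reduce

# A pipeline as a sequence of operations (transformations) applied to a
# data flow. The specific operations below are invented for illustration.

def pipeline(data, *operations):
    """Thread the data flow through each operation in turn."""
    return reduce(lambda flow, op: op(flow), operations, data)

records = [{"amt": "12.5"}, {"amt": "7"}, {"amt": "-3"}]

parse = lambda flow: ({**r, "amt": float(r["amt"])} for r in flow)
keep_positive = lambda flow: (r for r in flow if r["amt"] > 0)
total = lambda flow: sum(r["amt"] for r in flow)

result = pipeline(records, parse, keep_positive, total)
```

Each operation takes a flow and returns a flow (or a final value), so operations can be reused and recombined across pipelines.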
Pipeline Components
[Diagram: a pipeline runs from Origin to Destination via a Dataflow, supported by Storage, Processing, Workflow, Monitoring, and the underlying Technology.]
Destination – Purpose and End Point
Stores:
• Staging
• Warehouse
• Data Mart
• MDM
• ODS
• Data Lake
• Sandbox
Applications:
• Reporting
• OLAP
• Scorecards
• Dashboards
• Exploration
• Analytics
Challenges of cloud, multi-cloud, and
hybrid data destinations.
Challenges of differently structured
data and NoSQL databases.
Origin – Data Supply and Begin Point
Sources:
• Mainframe
• OLTP
• Web
• 3rd Party
• Social Media
• Machine
• Geospatial
Challenges of data silos on different
platforms – multi-cloud, hybrid, etc.
Complexity of legacy data sources –
mainframe, VSAM, ERP systems, etc.
Data Flow – Data in Motion
Storage: temporary files, staging tables, data warehouse, data mart, operational data store, master data repository.
Processing: extract-transform-load, map-reduce, extract-load-transform, connect / abstract / publish, sample / blend / format.
Workflow: scheduling, execution, failover, distribution, verification.
Monitoring: health check, performance logging, debugging.
Need to collect metadata and support
data lineage and traceability.
Reuse of workflows, dataflows, and
consistent use of patterns.
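The metadata and lineage needs called out above can be illustrated by tagging each record with its origin and the operations that touched it. Field names and operations here are invented for the example.

```python
# Sketch: attach lineage metadata to records as they move through a dataflow,
# so each record carries its origin and the operations applied to it.

def with_lineage(record, origin):
    """Wrap a record with an empty lineage trail starting at its origin."""
    return {"data": record, "lineage": {"origin": origin, "ops": []}}

def apply_op(wrapped, op_name, fn):
    """Apply a transformation and record it in the lineage trail."""
    wrapped["data"] = fn(wrapped["data"])
    wrapped["lineage"]["ops"].append(op_name)
    return wrapped

rec = with_lineage({"amount": "42"}, origin="orders_db.orders")
rec = apply_op(rec, "cast_amount", lambda d: {**d, "amount": float(d["amount"])})
rec = apply_op(rec, "add_currency", lambda d: {**d, "currency": "USD"})
```

With the trail attached, traceability questions ("where did this value come from, and what transformed it?") become simple lookups rather than archaeology.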
Data Storage – Data at Rest
Storage: temporary files, staging tables, data lake, data warehouse, master data repository, analytics sandbox.
Challenges of data storage on different
platforms – multi-cloud, hybrid, etc.
Challenges of differently structured
data and NoSQL databases.
Processing – Adding Value and Creating Data Products
Processing: extract-transform-load, map-reduce, extract-load-transform, connect / abstract / publish, sample / blend / format.
Streaming steps: connect & ingest, parse & filter, enrich events, persist, alert.
Processing at data locations.
Processing at the edge of the network.
Technology changes & future proofing.
Workflow – Managing Process Execution
Workflow: scheduling, execution, failover, distribution, verification.
Operationalization
Orchestration
Real Time Execution
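The scheduling, execution, and failover responsibilities above can be sketched as a tiny task runner that retries a failing step and can fall back to an alternate. The tasks themselves are invented; orchestration tools do this with persistence, dependencies, and distribution.

```python
# Sketch of workflow execution with retry and failover. A step is retried a
# few times; if it keeps failing, an optional fallback step runs instead.

def run_step(step, retries=2, fallback=None):
    last_error = None
    for _ in range(retries + 1):
        try:
            return step()
        except Exception as err:   # real workflows catch narrower error types
            last_error = err
    if fallback is not None:
        return fallback()          # failover path
    raise last_error

attempts = {"n": 0}

def flaky_extract():
    """Invented step that fails twice before succeeding."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("source unavailable")
    return ["row1", "row2"]

rows = run_step(flaky_extract, retries=2)
```

The verification responsibility would sit after `run_step`: checking row counts or checksums before declaring the step complete.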
Monitoring – Managing Pipeline Health
Monitoring: health check, performance logging, debugging.
Speed and throughput.
Audits, balancing, and controls.
Fault tolerance and error handling.
Oversight and administration.
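Health checks and performance logging like those above can be sketched as a wrapper that times each pipeline stage and records its outcome. Stage names are invented; real monitoring would ship these metrics to a logging or observability system.

```python
import time

# Sketch: wrap pipeline stages to collect per-stage health and timing metrics,
# the raw material for health checks, performance logging, and debugging.

metrics = []

def monitored(name, fn, *args):
    """Run a stage, recording its status and duration in the metrics log."""
    start = time.perf_counter()
    status = "failed"
    try:
        result = fn(*args)
        status = "ok"
        return result
    finally:
        metrics.append({
            "stage": name,
            "status": status,
            "seconds": round(time.perf_counter() - start, 6),
        })

data = monitored("extract", lambda: [1, 2, 3])
data = monitored("transform", lambda rows: [r * 10 for r in rows], data)
```

Because the wrapper records a "failed" entry even when a stage raises, the metrics log doubles as an error audit trail for oversight and administration.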
Summary – Data Pipeline Scope and Complexity
Dataflow: origin to destination.
Storage: temporary files, staging tables, data lake, data warehouse, master data repository, analytics sandbox.
Processing: extract-transform-load, map-reduce, extract-load-transform, connect / abstract / publish, sample / blend / format; streaming: connect & ingest, parse & filter, enrich events, persist, alert.
Workflow: scheduling, execution, failover, distribution, verification.
Monitoring: health check, performance logging, debugging.
Technology: Hadoop, Databases, ETL, Automation, Virtualization, Analytics, Cataloging, Data Preparation …
Modern Analytics – Survival of Analytic Models
Lifecycle stages: frame the problem, get & prepare data, train the model, test the model, deploy the model, operate the model, act on analytics.
[Chart: Analytics Survival Rate: number of models declines across lifecycle stages, driven by data pipeline challenges; data quality & data understanding; high error rate / many false positives; data issues / lack of acceptance; reliable data supply / adapting to changes; lack of trust / lack of understanding.]
By some estimates, 80% of analytic models that are built and tested are never deployed and operationalized. Data pipelines and data preparation are often the bottlenecks that delay delivery of analytics. Speed and agility are essential pipeline characteristics.
The Data Pipeline Imperative
Data Silos
Cloud, Multi-Cloud, and Hybrid
Complex Legacy Data Sources
Streaming and Fast Data
Shortage of Data Engineers
Analytics Failure to Launch
AGILE
AUTOMATED
ACCURATE
Fast Data Matters!
Source: Nucleus Research, Guidebook: Measuring the Half Life of Data
• The value of data diminishes rapidly
as the age of the data increases
• Tactical and operational value decline
by 50% within a few minutes
• Strategic value is diminished by 50%
in just 1 hour
• Fast data and real-time information
are critical for decision makers at
all levels of the business
[Chart axes: rate of diminished value vs. minutes of data latency.]
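One way to model the "half life of data" cited above is exponential decay: value halves every half-life period. A sketch; the one-hour half-life echoes the strategic-value figure on the slide, while the initial value of 100 is arbitrary, not from the study.

```python
# Exponential-decay model of data value: value halves each half-life period.
# The 1-hour strategic half-life echoes the slide; the initial value is arbitrary.

def data_value(minutes_old, half_life_minutes, initial_value=100.0):
    """Value of data after minutes_old, given a half-life in minutes."""
    return initial_value * 0.5 ** (minutes_old / half_life_minutes)

strategic_now = data_value(0, half_life_minutes=60)    # fresh data, full value
strategic_1h = data_value(60, half_life_minutes=60)    # halved after one hour
strategic_2h = data_value(120, half_life_minutes=60)   # quartered after two
```

Shorter half-lives (minutes, for tactical and operational decisions) make the curve far steeper, which is the argument for low-latency delivery.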
Dave Wells
dwells@eckerson.com
Achieving Success Building
Agile Data Pipelines
Fernanda Tavares
Fact:
Successful data pipelines are fueled by data that your organization trusts and that is delivered when it is needed, in real time.
Financial services company
Improving customer experiences
About
Multinational financial services
company offering:
• Banking
• Group plans
• Insurance
• Investment solutions
Business Drivers
Top down mandate to improve
customer experience for competitive
differentiation
Why was this mandate triggered?
• End customer: Requesting real-time
access to account value changes
• Business: Improved visibility and
access to customer financial data
Goal
Deliver on customer experience
mandate using advanced analytics
Challenges to achieving the customer experience mandate:
Choosing the right technology: needed to quickly integrate new technologies to improve CX.
Ensuring data delivered is trusted and complete: complicated environments made it difficult to trust data that was delivered.
Capturing changes to data in real-time: valuable data trapped in disparate systems hindering real-time delivery.
Identifying the right data: important mainframe data had to be incorporated into agile data pipelines feeding the customer experience.
Identifying the right data
• Connect applications together, leveraging the existing
transactional capabilities of the current application
platform, and the wealth of new capabilities of the cloud
• Feed analytics with up-to-date information so your
business runs on current insight
• Port workloads to less-expensive, strategic platforms
Legacy data can provide a treasure-trove of information that can
transform your business when leveraged via a streaming paradigm
The importance
of legacy data
Your traditional systems
– including mainframes, IBM i
servers & data warehouses –
adapt and deliver increasing
value with each new
technology wave
91% of executives predict long-term viability of the mainframe as the platform continues evolving to meet digital business demands.
>100k companies today use IBM i technology to run significant workloads & power critical business applications.
$1.65 trillion invested by enterprise IT to support data warehouse & analytics workloads over the past decade.
BMC 12th Annual Mainframe Research Results – Nov. 2017
Wikibon “10-Year Worldwide Enterprise IT Spending 2008-2017”
Capturing changes to data in real-time
• Consider that faster data delivery may break current data pipeline structures
• Look for solutions that insulate your organization against the underlying
complexities of your technology stack
Tracking and detection of data changes needs to happen as close to real time as possible
• Select solutions that guarantee data delivery and have reliable
transfer of information
• Look at how streaming platforms such as Kafka could be
leveraged to feed real-time delivery
• Assess how your overall cloud strategy can support real-time
data delivery
Ensuring data delivered is trusted and complete
• Big data projects require massive scalability and low latency
• Data transformations such as mapping, matching, linking, merging, and deduplication are key to providing data pipelines with trusted, actionable data
As issues with trust continue to grow, it is imperative to build
data quality into your agile data pipelines
• Establish data processes that reinforce the business's trust in the data they are working with
• Make sure data quality solutions provide a single, complete and accurate view of the customer
• Compliance – Know your data, and ensure its accuracy to
meet industry and government regulations
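Matching and deduplication as described above can be sketched with a normalized match key. The normalization rules here are invented for illustration; commercial data quality tools use far richer matching (fuzzy, probabilistic, reference data).

```python
# Sketch of record matching/deduplication: normalize fields into a match key,
# then keep one record per key. Normalization rules here are illustrative.

def match_key(record):
    """Build a crude match key from normalized name and postcode."""
    name = record["name"].strip().lower().replace(".", "")
    postcode = record["postcode"].replace(" ", "").upper()
    return (name, postcode)

def deduplicate(records):
    seen = {}
    for rec in records:
        seen.setdefault(match_key(rec), rec)   # keep first record per key
    return list(seen.values())

customers = [
    {"name": "J. Smith ", "postcode": "ab1 2cd"},
    {"name": "j smith", "postcode": "AB12CD"},   # duplicate of the first
    {"name": "A. Jones", "postcode": "ZZ9 9ZZ"},
]
unique = deduplicate(customers)
```

A production matcher would also merge the duplicate records' attributes into a single golden record rather than simply keeping the first.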
Choosing the right technology
Keep in mind the following must-haves:
• Guaranteed data delivery and fault tolerance
• Scalability and performance: as your business grows, so will the demands on your streaming pipeline
• Lightweight footprint and impact: when replicating from the systems that run your business, ensure the solution won't negatively impact your running applications
• Agile deployment and design capabilities: look for solutions with a design once, deploy anywhere approach, whether on-prem, in the cloud, or in hybrid environments
Syncsort's Connect CDC brings together existing and new investments
Stream real-time application data from traditional systems to mission-critical business applications and analytics platforms that demand the most up-to-date information for accurate insights.
• Traditional systems examples: mainframes, EDWs, IBM i
• Business application examples: fraud detection, hotel reservations, mobile banking, etc.
Connect CDC product highlights
Self-service data integration through browser-based interface
• Design, deploy and monitor real-time change replication from a variety of traditional systems (mainframe, RDBMS, EDW) to next-generation distributed streaming platforms like Apache Kafka
Guaranteed data delivery
• Fault tolerant
• Protects against loss of data if a connection is temporarily lost
• Keeps track of exactly where data transfer left off and automatically restarts at that exact point – no manual intervention
• No missing or duplicate data!
Maintain metadata integrity
• Integrates with Kafka schema registry
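The "restarts at the exact point where transfer left off" behavior described above is commonly built on offset checkpointing. A sketch of the general idea; the checkpoint store and record shapes are invented, and Connect CDC's actual mechanism is not shown here.

```python
# Sketch of checkpointed delivery: persist the offset of the last applied
# change so a restart resumes exactly there -- no missing or duplicate rows.

checkpoint = {"offset": 0}          # stand-in for a durable checkpoint store
delivered = []

def deliver(changes, fail_after=None):
    """Apply changes from the checkpointed offset; optionally crash mid-run."""
    for offset in range(checkpoint["offset"], len(changes)):
        if fail_after is not None and offset == fail_after:
            raise ConnectionError("connection temporarily lost")
        delivered.append(changes[offset])
        checkpoint["offset"] = offset + 1   # commit after each applied change

changes = ["c0", "c1", "c2", "c3"]
try:
    deliver(changes, fail_after=2)      # crashes before applying c2
except ConnectionError:
    pass
deliver(changes)                        # restart resumes at offset 2
```

Committing the checkpoint only after a change is applied is what rules out missing rows; committing before applying would risk losing the in-flight change on a crash.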
Trillium Discovery for Big Data
Key Outcomes
• Reduce the time for business analysts to discover and understand data on Big Data platforms
• Profile in place, no need to move or copy data
• Scalable
• Secure
Trillium Quality for Big Data
Key Outcomes
• Run natively on distributed Big Data frameworks including Spark, Hadoop, and MapReduce
• Design once, deploy anywhere
• Match and link any data entity – trusted single view
• No coding or tuning required
• Out-of-the-box, best-in-class quality and enrichment
Data pipeline architecture
[Diagram: sources (Mainframe, IBM i, traditional Enterprise Data Warehouse (EDW), RDBMS) flow through Syncsort Data Integration and Data Quality Solutions to destinations: Big Data Cluster, Enterprise Data Hub, Data Lake, and Cloud Platforms.]
GOALS ACHIEVED
• Centralization of multiple disparate data assets into one repository for analytics at scale (building the data lake)
• Legacy system retirement, data and process offloading for cost savings, data archiving, modernizing legacy applications
Key Capabilities:
• Profile and validate data
• Cleanse, enrich, & match data
• Build real-time data pipelines
• Change data capture
How Do You Build Data Pipelines that Are Agile, Automated, and Accurate?

More Related Content

PPTX
5 Things that Make Hadoop a Game Changer
PPT
Data Lakehouse Symposium | Day 1 | Part 2
PDF
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
PDF
Creating a Modern Data Architecture
PPTX
Solving Performance Problems on Hadoop
PPTX
How to build a successful Data Lake
PDF
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
PDF
A Reference Architecture for ETL 2.0
5 Things that Make Hadoop a Game Changer
Data Lakehouse Symposium | Day 1 | Part 2
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Creating a Modern Data Architecture
Solving Performance Problems on Hadoop
How to build a successful Data Lake
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
A Reference Architecture for ETL 2.0

What's hot (20)

PDF
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
PDF
Hadoop and the Data Warehouse: When to Use Which
PDF
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
PDF
Designing the Next Generation Data Lake
PDF
Building the Enterprise Data Lake: A look at architecture
PDF
5 Steps for Architecting a Data Lake
PDF
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
PPTX
Hadoop Powers Modern Enterprise Data Architectures
PPTX
Webinar - Data Lake Management: Extending Storage and Lifecycle of Data
PPTX
Data Vault Automation at the Bijenkorf
PDF
So You Want to Build a Data Lake?
PDF
Planing and optimizing data lake architecture
PPTX
The Future of Data Warehousing and Data Integration
PDF
The Hidden Value of Hadoop Migration
PDF
Big Data for Managers: From hadoop to streaming and beyond
PDF
Webinar future dataintegration-datamesh-and-goldengatekafka
PPTX
Big Data 2.0: ETL & Analytics: Implementing a next generation platform
PDF
Data Lake Architecture
PPTX
Deploying a Governed Data Lake
PPTX
The Future of Data Warehousing: ETL Will Never be the Same
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Hadoop and the Data Warehouse: When to Use Which
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Designing the Next Generation Data Lake
Building the Enterprise Data Lake: A look at architecture
5 Steps for Architecting a Data Lake
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
Hadoop Powers Modern Enterprise Data Architectures
Webinar - Data Lake Management: Extending Storage and Lifecycle of Data
Data Vault Automation at the Bijenkorf
So You Want to Build a Data Lake?
Planing and optimizing data lake architecture
The Future of Data Warehousing and Data Integration
The Hidden Value of Hadoop Migration
Big Data for Managers: From hadoop to streaming and beyond
Webinar future dataintegration-datamesh-and-goldengatekafka
Big Data 2.0: ETL & Analytics: Implementing a next generation platform
Data Lake Architecture
Deploying a Governed Data Lake
The Future of Data Warehousing: ETL Will Never be the Same
Ad

Similar to How Do You Build Data Pipelines that Are Agile, Automated, and Accurate? (20)

PPTX
Bridging Legacy Systems and Cloud Data Platforms to Unlock Valuable Enterpris...
PDF
Best practices in data ops
PPTX
Best Practices in DataOps: How to Create Agile, Automated Data Pipelines
PPTX
Building the enterprise data architecture
PPTX
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
PDF
Accelerate and modernize your data pipelines
PDF
Trends in Enterprise Advanced Analytics
PDF
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
PDF
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
PDF
What makes an effective data team?
PDF
Performance management capability
PDF
Driving Business Value Through Agile Data Assets
PPTX
Data summit connect fall 2020 - rise of data ops
PPTX
The Data Warehouse is NOT Dead
PDF
Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...
PPTX
Who changed my data? Need for data governance and provenance in a streaming w...
PPTX
Navigating the World of User Data Management and Data Discovery
PDF
Data in Motion - Data at Rest - Hortonworks a Modern Architecture
PDF
Big Data LDN 2018: AGILE DATA MASTERING: THE RIGHT APPROACH FOR DATAOPS
PDF
How 3 trends are shaping analytics and data management
Bridging Legacy Systems and Cloud Data Platforms to Unlock Valuable Enterpris...
Best practices in data ops
Best Practices in DataOps: How to Create Agile, Automated Data Pipelines
Building the enterprise data architecture
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Accelerate and modernize your data pipelines
Trends in Enterprise Advanced Analytics
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
What makes an effective data team?
Performance management capability
Driving Business Value Through Agile Data Assets
Data summit connect fall 2020 - rise of data ops
The Data Warehouse is NOT Dead
Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...
Who changed my data? Need for data governance and provenance in a streaming w...
Navigating the World of User Data Management and Data Discovery
Data in Motion - Data at Rest - Hortonworks a Modern Architecture
Big Data LDN 2018: AGILE DATA MASTERING: THE RIGHT APPROACH FOR DATAOPS
How 3 trends are shaping analytics and data management
Ad

More from Precisely (20)

PDF
The Future of Automation: AI, APIs, and Cloud Modernization.pdf
PDF
Unlock new opportunities with location data.pdf
PDF
Reimagining Insurance: Connected Data for Confident Decisions.pdf
PDF
Introducing Syncsort™ Storage Management.pdf
PDF
Enable Enterprise-Ready Security on IBM i Systems.pdf
PDF
A Day in the Life of Location Data - Turning Where into How.pdf
PDF
Get More from Fiori Automation - What’s New, What Works, and What’s Next.pdf
PDF
Solving the CIO’s Dilemma: Speed, Scale, and Smarter SAP Modernization.pdf
PDF
Solving the Data Disconnect: Why Success Hinges on Pre-Linked Data.pdf
PDF
Cooking Up Clean Addresses - 3 Ways to Whip Messy Data into Shape.pdf
PDF
Building Confidence in AI & Analytics with High-Integrity Location Data.pdf
PDF
SAP Modernization Strategies for a Successful S/4HANA Journey.pdf
PDF
Precisely Demo Showcase: Powering ServiceNow Discovery with Precisely Ironstr...
PDF
The 2025 Guide on What's Next for Automation.pdf
PDF
Outdated Tech, Invisible Expenses – How Data Silos Undermine Operational Effi...
PDF
Modernización de SAP: Maximizando el Valor de su Migración a SAP S/4HANA.pdf
PDF
Outdated Tech, Invisible Expenses – The Hidden Cost of Disconnected Data Syst...
PDF
Migration vers SAP S/4HANA: Un levier stratégique pour votre transformation d...
PDF
Outdated Tech, Invisible Expenses: The Hidden Cost of Poor Data Integration o...
PDF
The Changing Compliance Landscape in 2025.pdf
The Future of Automation: AI, APIs, and Cloud Modernization.pdf
Unlock new opportunities with location data.pdf
Reimagining Insurance: Connected Data for Confident Decisions.pdf
Introducing Syncsort™ Storage Management.pdf
Enable Enterprise-Ready Security on IBM i Systems.pdf
A Day in the Life of Location Data - Turning Where into How.pdf
Get More from Fiori Automation - What’s New, What Works, and What’s Next.pdf
Solving the CIO’s Dilemma: Speed, Scale, and Smarter SAP Modernization.pdf
Solving the Data Disconnect: Why Success Hinges on Pre-Linked Data.pdf
Cooking Up Clean Addresses - 3 Ways to Whip Messy Data into Shape.pdf
Building Confidence in AI & Analytics with High-Integrity Location Data.pdf
SAP Modernization Strategies for a Successful S/4HANA Journey.pdf
Precisely Demo Showcase: Powering ServiceNow Discovery with Precisely Ironstr...
The 2025 Guide on What's Next for Automation.pdf
Outdated Tech, Invisible Expenses – How Data Silos Undermine Operational Effi...
Modernización de SAP: Maximizando el Valor de su Migración a SAP S/4HANA.pdf
Outdated Tech, Invisible Expenses – The Hidden Cost of Disconnected Data Syst...
Migration vers SAP S/4HANA: Un levier stratégique pour votre transformation d...
Outdated Tech, Invisible Expenses: The Hidden Cost of Poor Data Integration o...
The Changing Compliance Landscape in 2025.pdf

Recently uploaded (20)

PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Electronic commerce courselecture one. Pdf
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Approach and Philosophy of On baking technology
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Machine learning based COVID-19 study performance prediction
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPT
Teaching material agriculture food technology
PDF
NewMind AI Weekly Chronicles - August'25-Week II
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PPTX
Cloud computing and distributed systems.
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PPTX
A Presentation on Artificial Intelligence
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
MIND Revenue Release Quarter 2 2025 Press Release
Network Security Unit 5.pdf for BCA BBA.
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Electronic commerce courselecture one. Pdf
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
Dropbox Q2 2025 Financial Results & Investor Presentation
Mobile App Security Testing_ A Comprehensive Guide.pdf
Approach and Philosophy of On baking technology
Chapter 3 Spatial Domain Image Processing.pdf
Machine learning based COVID-19 study performance prediction
Building Integrated photovoltaic BIPV_UPV.pdf
MYSQL Presentation for SQL database connectivity
Reach Out and Touch Someone: Haptics and Empathic Computing
Teaching material agriculture food technology
NewMind AI Weekly Chronicles - August'25-Week II
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Cloud computing and distributed systems.
20250228 LYD VKU AI Blended-Learning.pptx
A Presentation on Artificial Intelligence
Encapsulation_ Review paper, used for researhc scholars
MIND Revenue Release Quarter 2 2025 Press Release

How Do You Build Data Pipelines that Are Agile, Automated, and Accurate?

  • 1. Bloor Group Hosted Webinar: How Do You Build Data Pipelines that Are Agile, Automated, and Accurate? Dave Wells, Eckerson dwells@eckerson.com Fernanda Tavares, Syncsort ftavares@syncsort.com
  • 2. © Eckerson Group, 2019 www.eckerson.com How Do You Build Data Pipelines that Are Agile, Automated, and Accurate? Dave Wells dwells@eckerson.com
  • 3. © Eckerson Group, 2019 www.eckerson.com Data Pipelines – Moving Data through the Ecosystem 3 Data Lake Data Warehouse Analytic Sandboxes Legacy Data OLTP and ERP Big Data Sources Data Scientists and Data Analysts Business Analysts and BI Users Report Writers selfservicemanagedservices Data Marts • We have a lot of data and lots of variety • We move data around a lot • We do a lot of processing • We store data in many places • We use data in many ways • Data is used by many people and many applications • Demand for data engineers and data pipeline developers outstrips the supply
  • 4. © Eckerson Group, 2019 www.eckerson.com Data Silos – Data, Data Everywhere … 4 mainframe legacy systems unix & .net legacy legacy data warehouses
  • 5. © Eckerson Group, 2019 www.eckerson.com Data Value – The Demand for Fast Data 5 Fast ingestion, processing, and delivery of large data volumes is a key to data value. As data ages its value for business decisions and actions may diminish rapidly. valueofdata age of data low value high value real time high latency real time micro-batch mini-batch daily batch weekly batch monthly batch © Eckerson Group, 2019 www.eckerson.com Data Value – The Demand for Fast Data 4 Fast ingestion, processing, and delivery of large data volumes is a key to data value. As data ages its value for business decisions and actions may diminish rapidly. valueofdata age of data low value high value real time high latency real time micro-batch mini-batch daily batch weekly batch monthly batch
  • 6. © Eckerson Group, 2019 www.eckerson.com Fast Data – Real Time and Streaming Data 6 data acquisition data processing data archiving publishing & visualization Reports (historical analysis & trends) Sensor Data Stream connect parseandfilter Event Data enricheventdata Event Data with Context Dashboards (real time monitoring) archive Data Warehouse analyze Alerts (mobile & email)
  • 7. © Eckerson Group, 2019 www.eckerson.com Fast Data – Streaming vs. ETL 7 CDCorigin data stream of changes destination parse events & select changes changes message queue of changes destinationselect messages apply changes extract transform loadorigin destination ETL – Scheduled batch processing. Inherently high latency. CDC with queue – Scheduled or triggered mini- or micro-batch. Low to very low latency. CDC with streaming – Triggered by data change events. Very low latency to real time.
  • 8. © Eckerson Group, 2019 www.eckerson.com Data Pipelines – A Macro View 8 Data Processing • ETL / ELT • Stream Processing • CDC + Streaming • etc. Data Persistence • Data Lakes • Warehouses • Sandboxes • etc. Data Services • API / RPC • SOAP / REST • Virtualization • etc. RPC API virtualization SOAP/REST Etc. relational, NoSQL, geospatial, etc. XML, JSON, CSV, etc. pipe (data flow) operation (transformation) Applications Data Analysts Data Scientists Data software engineering database engineering software engineering
  • 9. Pipeline Components — Origin, Destination, Dataflow, Storage, Processing, Workflow, Monitoring, Technology.
  • 10. Destination – Purpose and End Point — Destinations include data stores (staging, warehouse, data mart, MDM, ODS, data lake, sandbox) and applications (reporting, OLAP, scorecards, dashboards, exploration, analytics). Challenges: cloud, multi-cloud, and hybrid data destinations; differently structured data and NoSQL databases.
  • 11. Origin – Data Supply and Begin Point — Origins include sources (mainframe, OLTP, web, 3rd party, social media, machine, geospatial) and data stores (staging, warehouse, data mart, MDM, ODS, data lake, sandbox). Challenges: data silos on different platforms (multi-cloud, hybrid, etc.); complexity of legacy data sources (mainframe, VSAM, ERP systems, etc.).
  • 12. Data Flow – Data in Motion — The dataflow moves data from origin to destination through storage (temporary files, staging tables, data warehouse, data mart, operational data store, master data repository) and processing (extract, transform, load; map, reduce; extract, load, transform; connect, abstract, publish; sample, blend, format), managed by workflow (scheduling, execution, distribution, verification, failover) and monitoring (health check, performance, logging, debugging). Key needs: collect metadata and support data lineage and traceability; reuse workflows and dataflows, with consistent use of patterns.
  • 13. Data Storage – Data at Rest — Storage includes temporary files, staging tables, data lake, data warehouse, master data repository, and analytics sandbox. Challenges: data storage on different platforms (multi-cloud, hybrid, etc.); differently structured data and NoSQL databases.
  • 14. Processing – Adding Value and Creating Data Products — Processing spans batch patterns (extract, transform, load; map, reduce; extract, load, transform; connect, abstract, publish; sample, blend, format) and streaming patterns (connect & ingest, parse & filter, enrich events, persist, alert). Considerations: processing at data locations; processing at the edge of the network; technology changes and future-proofing.
  • 15. Workflow – Managing Process Execution — Workflow covers scheduling, execution, distribution, verification, and failover. Key concerns: operationalization, orchestration, and real-time execution.
  • 16. Monitoring – Managing Pipeline Health — Monitoring covers health checks, performance, logging, and debugging. Key concerns: speed and throughput; audits, balancing, and controls; fault tolerance and error handling; oversight and administration.
  • 17. Summary – Data Pipeline Scope and Complexity — A pipeline spans origins (legacy, transaction, web, 3rd party, social media, machine, and geospatial sources, plus existing data stores), dataflow, storage (temporary files, staging tables, data lake, data warehouse, master data repository, analytics sandbox), processing (batch and streaming patterns), workflow (scheduling, execution, distribution, verification, failover), monitoring (health check, performance, logging, debugging), and destinations (data stores and applications such as reporting, OLAP, scorecards, dashboards, exploration, and analytics). Technology: Hadoop, databases, ETL, automation, virtualization, analytics, cataloging, data preparation, and more.
  • 18. Modern Analytics – Survival of Analytic Models — Lifecycle stages: frame the problem, get & prepare data, train the model, test the model, deploy the model, operate the model, act on analytics. [Chart: analytics survival rate, number of models vs. lifecycle stage.] Data pipeline challenges along the way: data quality & data understanding; high error rate / many false positives; data issues / lack of acceptance; reliable data supply / adapting to changes; lack of trust / lack of understanding. By some estimates, 80% of analytic models that are built and tested are never deployed and operationalized. Data pipelines and data preparation are often the bottlenecks that delay delivery of analytics. Speed and agility are essential pipeline characteristics.
  • 19. The Data Pipeline Imperative — Drivers: data silos; cloud, multi-cloud, and hybrid; complex legacy data sources; streaming and fast data; shortage of data engineers; analytics failure to launch. The imperative: AGILE, AUTOMATED, ACCURATE.
  • 20. Fast Data Matters! — Source: Nucleus Research, Guidebook: Measuring the Half Life of Data. • The value of data diminishes rapidly as the age of the data increases • Tactical and operational value decline by 50% within a few minutes • Strategic value is diminished by 50% in just 1 hour • Fast data and real-time information are critical for decision makers at all levels of the business. [Chart: rate of diminished value vs. minutes of data latency.]
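The half-life figures above imply an exponential-decay view of data value. As a quick illustration (the 60-minute strategic half-life is from the slide; modeling it as clean exponential decay is an assumption for the example):

```python
# Exponential-decay model of data value: value halves every half-life.
# The half-life numbers come from the slide; the formula is illustrative.

def remaining_value(initial_value, age_minutes, half_life_minutes):
    """Value remaining after age_minutes, given a half-life in minutes."""
    return initial_value * 0.5 ** (age_minutes / half_life_minutes)

# Strategic value with a 60-minute half-life:
print(remaining_value(100.0, 60, 60))   # 50.0 — half the value after one hour
print(remaining_value(100.0, 120, 60))  # 25.0 — a quarter after two hours
```

The same function with a half-life of a few minutes reproduces the tactical/operational curve, which is why latency measured in minutes, not hours, is the relevant scale for operational decisions.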
  • 21. © Eckerson Group, 2019 www.eckerson.com Dave Wells dwells@eckerson.com
  • 22. Achieving Success Building Agile Data Pipelines Fernanda Tavares
  • 23. Fact: Successful data pipelines are fueled by data that your organization trusts and that is delivered in real time, when the business needs it
  • 24. Financial services company: improving customer experiences — About: multinational financial services company offering banking, group plans, insurance, and investment solutions. Business drivers: top-down mandate to improve customer experience for competitive differentiation. Why was this mandate triggered? • End customer: requesting real-time access to account value changes • Business: improved visibility and access to customer financial data. Goal: deliver on the customer experience mandate using advanced analytics
  • 25. Challenges to achieving the customer experience mandate — Choosing the right technology: needed to quickly integrate new technologies to improve CX. Ensuring data delivered is trusted and complete: complicated environments made it difficult to trust the data that was delivered. Capturing changes to data in real-time: valuable data trapped in disparate systems hindered real-time delivery. Identifying the right data: important mainframe data had to be incorporated into agile data pipelines feeding the customer experience.
  • 26. Identifying the right data — Legacy data can provide a treasure-trove of information that can transform your business when leveraged via a streaming paradigm. • Connect applications together, leveraging the existing transactional capabilities of the current application platform and the wealth of new capabilities of the cloud • Feed analytics with up-to-date information so your business runs on current insight • Port workloads to less-expensive, strategic platforms
  • 27. The importance of legacy data — Your traditional systems – including mainframes, IBM i servers & data warehouses – adapt and deliver increasing value with each new technology wave. 91% of executives predict long-term viability of the mainframe as the platform continues evolving to meet digital business demands. >100k companies today use IBM i technology to run significant workloads & power critical business applications. $1.65 trillion invested by enterprise IT to support data warehouse & analytics workloads over the past decade. Sources: BMC 12th Annual Mainframe Research Results – Nov. 2017; Wikibon "10-Year Worldwide Enterprise IT Spending 2008-2017"
  • 28. Capturing changes to data in real-time — Tracking and detection of data changes needs to happen as close to real time as possible. • Consider that faster data delivery may break current data pipeline structures • Look for solutions that insulate your organization against the underlying complexities of your technology stack • Select solutions that guarantee data delivery and have reliable transfer of information • Look at how streaming platforms such as Kafka could be leveraged to feed real-time delivery • Assess how your overall cloud strategy can support real-time data delivery
  • 29. Ensuring data delivered is trusted and complete — As issues with trust continue to grow, it is imperative to build data quality into your agile data pipelines. • Big data projects require massive scalability and low latency • Data transformations such as mapping, matching, linking, merging, and deduplication are key to providing data pipelines with trusted, actionable data • Establish data processes that reinforce the business's trust in the data they are working with • Make sure data quality solutions provide a single, complete, and accurate view of the customer • Compliance – know your data, and ensure its accuracy to meet industry and government regulations
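The matching and deduplication step mentioned above can be illustrated with a naive normalized-key dedup. Real data quality engines (including commercial tools like those discussed here) use fuzzy matching and survivorship rules; everything in this sketch is a deliberate simplification:

```python
# Naive deduplication by normalized key: lowercase the name + email and
# strip whitespace/punctuation, then keep the first record per key.
# Real matching engines use fuzzy comparison; this is only an illustration.
import re

def normalize(record):
    """Build a crude match key from name + email."""
    key = (record["name"] + record["email"]).lower()
    return re.sub(r"[^a-z0-9@.]", "", key)   # drop spaces and punctuation

def deduplicate(records):
    seen, unique = set(), []
    for record in records:
        key = normalize(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)            # first occurrence survives
    return unique

records = [
    {"name": "Ann Lee",  "email": "ann@example.com"},
    {"name": "ann  lee", "email": "Ann@Example.com"},   # duplicate of the first
    {"name": "Bob Ray",  "email": "bob@example.com"},
]
print(len(deduplicate(records)))  # 2
```

The "first occurrence survives" rule is the simplest possible survivorship policy; production tools instead score candidate matches and merge the best attributes from each duplicate.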
  • 30. Choosing the right technology — Keep in mind the following must-haves: • Guaranteed data delivery and fault tolerance • Scalability and performance – as your business grows, so will the demands on your streaming pipeline • Lightweight footprint and impact – when replicating from the systems that run your business, ensure the solution won't negatively impact your running applications • Agile deployment and design capabilities – look for solutions with a design-once, deploy-anywhere approach, whether on-prem, in the cloud, or in hybrid environments
  • 31. Syncsort's Connect CDC brings together existing and new investments — Stream real-time application data from traditional systems to mission-critical business applications and analytics platforms that demand the most up-to-date information for accurate insights. • Traditional systems examples: mainframes, EDWs, IBM i • Business application examples: fraud detection, hotel reservations, mobile banking, etc.
  • 32. Connect CDC product highlights — Self-service data integration through a browser-based interface: design, deploy, and monitor real-time change replication from a variety of traditional systems (mainframe, RDBMS, EDW) to next-generation distributed streaming platforms like Apache Kafka. Guaranteed data delivery: fault tolerant; protects against loss of data if a connection is temporarily lost; keeps track of exactly where data transfer left off and automatically restarts at that exact point – no manual intervention; no missing or duplicate data! Maintain metadata integrity: integrates with Kafka schema registry.
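The guaranteed-delivery behavior described above — tracking exactly where transfer left off and restarting at that point — can be sketched as a checkpointed reader. This generic offset-commit pattern is how many replication tools achieve it; it is not a description of Connect CDC's actual implementation:

```python
# Generic checkpoint/restart sketch: commit the offset only after a record
# is applied, and resume from the committed offset after a failure.
# This is an illustration of the pattern, not Connect CDC internals.

class CheckpointedReader:
    def __init__(self, source, checkpoint=0):
        self.source = source          # ordered list of change records
        self.offset = checkpoint      # last committed position

    def deliver(self, apply, fail_after=None):
        """Apply records from the saved offset; optionally simulate a crash."""
        while self.offset < len(self.source):
            if fail_after is not None and self.offset == fail_after:
                raise ConnectionError("connection lost")
            apply(self.source[self.offset])
            self.offset += 1          # commit only after apply succeeds

source = ["c1", "c2", "c3", "c4"]
applied = []
reader = CheckpointedReader(source)
try:
    reader.deliver(applied.append, fail_after=2)   # crash before record 3
except ConnectionError:
    pass
reader.deliver(applied.append)                     # restart from checkpoint
print(applied)  # ['c1', 'c2', 'c3', 'c4'] — no missing or duplicate records
```

Committing the offset after the apply (rather than before) is what prevents gaps; preventing duplicates additionally requires that apply-plus-commit behave atomically, which real products handle with transactional or idempotent writes.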
  • 33. Trillium Discovery for Big Data — Key outcomes: • Reduce the time for business analysts to discover and understand data on Big Data platforms • Profile in place – no need to move or copy data • Scalable • Secure
  • 34. Trillium Quality for Big Data — Key outcomes: • Run natively on distributed Big Data frameworks including Spark, Hadoop, and MapReduce • Design once, deploy anywhere • Match and link any data entity – trusted single view • No coding or tuning required • Out-of-the-box, best-in-class quality and enrichment
  • 35. Data pipeline architecture — Sources: mainframe, IBM i, traditional enterprise data warehouse (EDW), RDBMS. Syncsort data integration and data quality solutions sit in the middle; key capabilities: profile and validate data; cleanse, enrich, & match data; build real-time data pipelines; change data capture. Destinations: big data cluster, enterprise data hub, data lake, cloud platforms. Goals achieved: • Centralization of multiple disparate data assets into one repository for analytics at scale (building the data lake) • Legacy system retirement, data and process offloading for cost savings, data archiving, modernizing legacy applications