Democratizing Data
Architecting Terabyte Common Data Models and Configuration Driven Pipelines For AI Platforms
Cindy Mottershead, AI Architect, Blackbaud
Shiran Algai, Senior Manager of Software Development, Blackbaud
Agenda
Shiran Algai
▪ Problem Statement
▪ Architecture Journey
Cindy Mottershead
▪ Architecture decisions
▪ Common Data Model
▪ Configuration Driven Pipeline
▪ Transformation building blocks
▪ AI Feedback Loop
We are the world’s leading cloud software company powering social good.
▪ Millions of users in over 100 countries
▪ The world’s 18th largest SaaS applications provider*
▪ Fortune 56 Companies Changing the World*
*2017
Problem
▪ Data is very siloed
▪ Similar entities are described entirely differently by every product
▪ Bringing on new sources continues to compound the issue
▪ Data is frequently entered slightly differently for same entity
▪ Engineering teams are unable to leverage data to drive insight and help our customers solve their problems
▪ AI ETL cycle far too long
▪ AI ability to explore data extremely limited
First Steps
▪ Had beginnings of a few data lake projects, but scattered a bit throughout organization
▪ Built consensus and momentum toward a common delta lake
▪ Started on MS tooling in Azure (Data Factory, U-SQL run by Data Lake Analytics jobs, etc.)
▪ Leverage as many Azure PaaS tools as possible
▪ Batch only
▪ Picked a small "bore hole through the mountain" approach
Pivoting
▪ Painful adding new readers for different sources not natively supported (Avro, Parquet)
▪ Gaps in Azure data tooling for our specific use cases
▪ Desire for batch AND streaming through a similar path
▪ Need the ability to compact records, recreating legacy datasets in the platform
▪ Ability to hire data engineers in the market easily
Solution
Data Platform Ecosystem
▪ Delta Lake
▪ Azure Data Lake Store
▪ Data Catalog Service
▪ Lake Authorization Service
▪ Ingestion Service
▪ Output Service
▪ Async Messaging Contract Broker Service
Architecture diagram (labels recovered from the slide):
▪ Service A, Service B, … 82 more
▪ Service Bus Topic
▪ Async Contract Broker (ACB) Service: stores message schemas; prevents breaking schema changes
▪ Data Catalog: uses ACB as a source for new catalog entries
▪ Ingestion Service: automatically subscribes to new and existing topics
▪ Lake: Staging Zone; Raw Zone (compacted daily); Trusted Zone (CDM tables)
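The Async Contract Broker is described as storing message schemas and preventing breaking schema changes. The slides do not say how compatibility is decided, so the following is only a minimal sketch under an assumed rule: removing a field or changing its type is breaking, while adding a field is allowed. The `Schema` representation and the `breaking_changes` and `register` helpers are hypothetical names for illustration.

```python
from typing import Dict, List

# Hypothetical schema representation: field name -> type name.
Schema = Dict[str, str]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List the reasons `new` would break consumers of `old` (assumed rule)."""
    problems = []
    for field, old_type in old.items():
        if field not in new:
            problems.append(f"removed field: {field}")
        elif new[field] != old_type:
            problems.append(f"type change: {field} {old_type} -> {new[field]}")
    return problems


def register(registry: Dict[str, Schema], topic: str, new: Schema) -> None:
    """Store the schema for a topic, rejecting breaking changes."""
    problems = breaking_changes(registry.get(topic, {}), new)
    if problems:
        raise ValueError("; ".join(problems))
    registry[topic] = new
```

With this rule, registering `{"id": "string", "email": "string"}` after `{"id": "string"}` succeeds, while a later schema that drops `id` raises `ValueError`.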
Common Data Models
▪ Downstream services + Data Scientists all leverage same common models, accelerating development
▪ Common defined structure
▪ Consistent Naming of tables, structures, fields
▪ Consistent across all applications and application types
▪ Manage multiple data sources
▪ Remove complexities & specifics of source systems
▪ Shows the data “As is” (natural values)
▪ Provides common groupings & coding of data values (derived values)
▪ Integrated with Value-Added Services
CDM_Person
Common Data Model Input
▪ Thousands of relational tables
▪ CSV, JSON, Parquet, Avro, etc. formatted input files
▪ Normalized and denormalized input
▪ Nested objects
▪ SQL Server, MariaDB, Oracle, flat files
▪ Change events
Configuration Driven Pipeline
▪ Common Id
▪ Metadata Map
▪ Pipeline
▪ Transformations
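The four elements above (Common Id, Metadata Map, Pipeline, Transformations) can be illustrated with a small sketch in which the pipeline code stays generic and each source is described only by configuration. Everything specific here is an assumption for illustration: the `METADATA_MAP` layout, the source and column names, and deriving the common id by hashing the source key fields. The slides do not state how the Common Id is actually produced.

```python
import hashlib
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]

# Hypothetical metadata map: one entry per source describing its path to the CDM.
METADATA_MAP: Dict[str, Dict[str, Any]] = {
    "crm_constituent": {
        "target": "CDM_Person",
        "id_fields": ["source_system", "constituent_id"],
        "columns": {"first_nm": "first_name", "last_nm": "last_name"},
    },
}


def common_id(row: Row, id_fields: List[str]) -> str:
    """Derive a stable id from the source key fields (assumed scheme)."""
    key = "|".join(str(row[f]) for f in id_fields)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def run_pipeline(source: str, rows: Iterable[Row],
                 metadata_map: Dict[str, Dict[str, Any]] = METADATA_MAP) -> List[Row]:
    """Map source rows to CDM rows using only the configuration entry."""
    cfg = metadata_map[source]
    out = []
    for row in rows:
        record = {dst: row.get(src) for src, dst in cfg["columns"].items()}
        record["common_id"] = common_id(row, cfg["id_fields"])
        record["cdm_table"] = cfg["target"]
        out.append(record)
    return out
```

Under this design, onboarding a new source means adding an entry to the map rather than writing new pipeline code.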
Transformation building blocks
a) Filters
b) View
c) One to One (with SQL transform, with Lookup)
d) One row to Many Rows (unpivot)
e) Many rows to array in one column
f) Aggregations
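Several of these building blocks can be sketched over plain lists of dictionaries to show what each shape of transformation does. This is an illustrative sketch, not the presenters' implementation: the function names and parameters are hypothetical, the View and SQL-transform variants are omitted, and a production pipeline would more likely express these on a distributed engine.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def filter_rows(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """a) Filter: keep only rows matching the predicate."""
    return [r for r in rows if predicate(r)]


def one_to_one(rows: List[Row], lookup: Dict[Any, Any], key: str, target: str) -> List[Row]:
    """c) One to One with Lookup: add `target` by looking up row[key]."""
    return [{**r, target: lookup.get(r[key])} for r in rows]


def unpivot(rows: List[Row], id_col: str, value_cols: List[str]) -> List[Row]:
    """d) One row to many rows: one output row per non-null value column."""
    return [{id_col: r[id_col], "attribute": c, "value": r[c]}
            for r in rows for c in value_cols if r.get(c) is not None]


def collect_to_array(rows: List[Row], key: str, col: str) -> List[Row]:
    """e) Many rows to an array in one column, grouped by `key`."""
    grouped = defaultdict(list)
    for r in rows:
        grouped[r[key]].append(r[col])
    return [{key: k, col: values} for k, values in grouped.items()]


def aggregate(rows: List[Row], key: str, col: str,
              fn: Callable[[List[Any]], Any] = sum) -> List[Row]:
    """f) Aggregation: reduce each group's values with `fn`."""
    return [{key: g[key], col: fn(g[col])} for g in collect_to_array(rows, key, col)]
```

Because each block takes rows and returns rows, a configuration entry can chain them in any order, for example unpivoting phone columns and then collecting them into one array per person.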
ML Feedback Loops
Full cycle of model deployment, tying actions taken back into model
▪ Provide full cycle of data from presentation, user interaction, results
▪ Allows monitoring and tuning of ML models
▪ Provides metrics for roadmap prioritization
▪ Provides metrics for A/B testing
Tying It All Together
▪ Data flows from various products
▪ Ingested
▪ Transformed via Configuration Driven Pipelines
▪ One Common Data Model
▪ Data flows out of common data models back into ecosystem
▪ Baked in feedback loops
Democratized Data
▪ Data Scientists can access data directly from the CDM
▪ CDM is a Delta table
▪ Views are created for security access (no access to PII)
▪ Access is controlled at the view level
▪ Data is projected (using Schema on Read) to any destination location (blob, SQL Server, Cosmos, etc.)
▪ Data Scientists and Engineers can request any dataset they need by specifying metadata
▪ Requested data is transformed based on the metadata description
▪ Data is streamed or batched out to destination based on metadata frequency info
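The access pattern above (PII-free views plus metadata-described dataset requests) can be sketched as follows. This is a minimal illustration under stated assumptions: the `PII_COLUMNS` set, the shape of the request dictionary, and the `"continuous"` frequency value are all hypothetical, and the real platform enforces access on views over Delta tables rather than in application code.

```python
from typing import Any, Dict, List, Set

Row = Dict[str, Any]

# Hypothetical list of columns treated as PII.
PII_COLUMNS: Set[str] = {"email", "phone"}


def secure_view(rows: List[Row], pii: Set[str] = PII_COLUMNS) -> List[Row]:
    """Expose rows without PII columns, mimicking a restricted view."""
    return [{k: v for k, v in r.items() if k not in pii} for r in rows]


def fulfil_request(rows: List[Row], request: Dict[str, Any],
                   pii: Set[str] = PII_COLUMNS) -> Dict[str, Any]:
    """Project the requested columns and pick a delivery mode from metadata."""
    blocked = sorted(set(request["columns"]) & pii)
    if blocked:
        raise PermissionError(f"PII columns not available: {blocked}")
    visible = secure_view(rows, pii)
    projected = [{c: r.get(c) for c in request["columns"]} for r in visible]
    # Assumed convention: "continuous" means streaming, anything else is batch.
    mode = "stream" if request["frequency"] == "continuous" else "batch"
    return {"destination": request["destination"], "mode": mode, "rows": projected}
```

A request such as `{"columns": ["common_id", "first_name"], "destination": "blob", "frequency": "daily"}` would come back as a batch delivery containing only those two columns.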