Democratizing Data

Democratizing Data
Architecting Terabyte Common Data Models and
Configuration Driven Pipelines For AI Platforms
Cindy Mottershead, AI Architect, Blackbaud
Shiran Algai, Senior Manager of Software Development, Blackbaud

Agenda
Shiran Algai
▪ Problem Statement
▪ Architecture Journey
Cindy Mottershead
▪ Architecture decisions
▪ Common Data Model
▪ Configuration Driven Pipeline
▪ Transformation building blocks
▪ AI Feedback Loop

We are the world’s leading cloud software company powering social good.
Millions of users in over 100 countries
The world’s 18th largest SaaS applications provider*
Fortune 56 Companies Changing the World*
*2017

Problem
▪ Data is very siloed
▪ Similar entities are described entirely differently by every product
▪ Bringing on new sources continues to compound the issue
▪ Data is frequently entered slightly differently for the same entity
▪ Engineering teams are unable to leverage data to drive insight and help our customers solve their problems
▪ AI ETL cycle far too long
▪ AI ability to explore data extremely limited

First Steps
▪ Had the beginnings of a few data lake projects, but scattered a bit throughout the organization
▪ Built consensus and momentum toward a common delta lake
▪ Started on MS tooling in Azure (Data Factory, U-SQL run by Data Lake Analytics jobs, etc.)
▪ Leverage as many Azure PaaS tools as possible
▪ Batch only
▪ Picked a small "bore hole through the mountain" approach

Pivoting
▪ Painful adding new readers for different sources not natively supported (Avro, Parquet)
▪ Gaps in Azure data tooling for our specific use cases
▪ Desire for batch AND streaming through a similar path
▪ Need the ability to compact records, recreating legacy datasets in the platform
▪ Ability to hire data engineers in the market easily
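
One reading of "compact records" is collapsing a stream of change events into the latest state per key. The sketch below shows that idea in plain Python. The field names `id`, `version`, and `deleted` are hypothetical, and the slides do not describe the platform's actual compaction logic.

```python
from typing import Dict, Iterable, List


def compact(events: Iterable[dict]) -> List[dict]:
    """Keep the newest event per id; drop ids whose newest event is a delete."""
    latest: Dict[str, dict] = {}
    for event in events:
        current = latest.get(event["id"])
        # Replace the stored event only when this one is newer.
        if current is None or event["version"] > current["version"]:
            latest[event["id"]] = event
    return [e for e in latest.values() if not e.get("deleted", False)]


# Hypothetical change events for two records.
events = [
    {"id": "a", "version": 1, "name": "Ann"},
    {"id": "b", "version": 1, "name": "Bob"},
    {"id": "a", "version": 2, "name": "Anne"},
    {"id": "b", "version": 2, "deleted": True},
]
compacted = compact(events)
```

Here record `a` survives at version 2 and record `b` is dropped because its newest event is a delete.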

Data Platform Ecosystem
▪ Delta Lake
▪ Azure Data Lake Store
▪ Data Catalog Service
▪ Lake Authorization Service
▪ Ingestion Service
▪ Output service
▪ Async messaging contract broker service

[Diagram: data platform flow]
▪ Services: Service A, Service B, … 82 more
▪ Service Bus Topic
▪ Async Contract Broker (ACB) Service: stores message schemas; prevents breaking schema changes
▪ Data Catalog: uses ACB as a source for new catalog entries
▪ Ingestion Service: automatically subscribes to new and existing topics
▪ Lake: Staging Zone, Raw Zone (compacted daily), Trusted Zone (CDM tables)
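
The contract broker's role of preventing breaking schema changes can be illustrated with a small compatibility check. This is a sketch only: schemas are modelled as hypothetical `{field: type}` dicts, and the rule used here (no removed fields, no changed types, additions allowed) is an assumption, not the ACB's documented policy.

```python
from typing import Dict, List


def breaking_changes(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """List the ways `new` breaks consumers that rely on `old`."""
    problems = []
    for field, old_type in sorted(old.items()):
        if field not in new:
            problems.append(f"removed field: {field}")
        elif new[field] != old_type:
            problems.append(f"type changed: {field} {old_type} -> {new[field]}")
    return problems


# Hypothetical message schemas for one topic.
v1 = {"constituent_id": "string", "amount": "double"}
v2_additive = {"constituent_id": "string", "amount": "double", "currency": "string"}
v2_breaking = {"constituent_id": "int"}

additive_report = breaking_changes(v1, v2_additive)
breaking_report = breaking_changes(v1, v2_breaking)
```

A broker following this rule would accept `v2_additive` (empty report) and reject `v2_breaking`.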

Common Data Models
▪ Downstream services + Data Scientists all leverage the same common models, accelerating development
▪ Common defined structure
▪ Consistent Naming of tables, structures, fields
▪ Consistent across all applications and application types
▪ Manage multiple data sources
▪ Remove complexities & specifics of source systems
▪ Shows the data “As is” (natural values)
▪ Provides common groupings & coding of data values (derived values)
▪ Integrated with Value-Added Services
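
The split between natural values ("as is") and derived values (common groupings and codes) can be sketched as a record that carries both. The source names, raw codes, and column names below are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional

# Hypothetical per-source code tables mapping raw values to a common code.
GIFT_TYPE_CODES: Dict[str, Dict[str, str]] = {
    "product_a": {"CC": "CARD", "CHK": "CHECK"},
    "product_b": {"credit card": "CARD", "cheque": "CHECK"},
}


def to_cdm(source: str, raw_gift_type: str) -> Dict[str, Optional[str]]:
    """Keep the value as entered and add a common code alongside it."""
    derived = GIFT_TYPE_CODES.get(source, {}).get(raw_gift_type)
    return {
        "source_system": source,
        "gift_type_natural": raw_gift_type,  # value exactly as the source stored it
        "gift_type_derived": derived,        # common code, None when unmapped
    }


row_a = to_cdm("product_a", "CC")
row_b = to_cdm("product_b", "credit card")
row_unmapped = to_cdm("product_b", "cash")
```

Two products that describe the same thing differently end up with the same derived code, while the natural value stays available for inspection.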

Common Data Model Input
▪ Thousands of relational tables
▪ CSV, JSON, Parquet, Avro, etc. formatted input files
▪ Normalized and denormalized input
▪ Nested objects
▪ SQL Server, MariaDB, Oracle, flat files
▪ Change events

Configuration Driven Pipeline
▪ Common Id
▪ Metadata Map
▪ Pipeline
▪ Transformations
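
As a rough illustration of how these four pieces could fit together, the sketch below drives a tiny pipeline entirely from a config dict. The slides do not define the config format, so the keys, the field names, and the `source:key` common id scheme are all assumptions.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]

# Hypothetical configuration: nothing here is the platform's real format.
CONFIG = {
    "common_id": {"source_fields": ["source_system", "source_key"], "target": "common_id"},
    "metadata_map": {"FirstNm": "first_name", "LastNm": "last_name"},
    "pipeline": [{"transform": "rename"}, {"transform": "add_common_id"}],
}


def rename(rows: List[Row], config: dict) -> List[Row]:
    """Rename source columns to common model names using the metadata map."""
    mapping = config["metadata_map"]
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]


def add_common_id(rows: List[Row], config: dict) -> List[Row]:
    """Build one id from the configured source fields."""
    spec = config["common_id"]
    out = []
    for row in rows:
        key = ":".join(str(row[f]) for f in spec["source_fields"])
        out.append({**row, spec["target"]: key})
    return out


# Registry of available transformations, looked up by name from the config.
TRANSFORMS: Dict[str, Callable[[List[Row], dict], List[Row]]] = {
    "rename": rename,
    "add_common_id": add_common_id,
}


def run_pipeline(rows: List[Row], config: dict) -> List[Row]:
    """Apply each configured step in order."""
    for step in config["pipeline"]:
        rows = TRANSFORMS[step["transform"]](rows, config)
    return rows


source_rows = [{"source_system": "crm", "source_key": 42, "FirstNm": "Ada", "LastNm": "Lovelace"}]
result = run_pipeline(source_rows, CONFIG)
```

Onboarding a new source in this style means writing a new config rather than new pipeline code.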

Transformation building blocks
a) Filters
b) View
c) One to One (with SQL transform, with Lookup)
d) One row to Many Rows (unpivot)
e) Many rows to array in one column
f) Aggregations
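
Most of the building blocks above have a simple functional shape. The sketch below shows plain Python stand-ins for a filter, a one-to-one lookup, an unpivot, a many-rows-to-array collect, and an aggregation. The View block is not covered, and the function and column names are hypothetical; a real pipeline would presumably express these in the platform's own engine.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def filter_rows(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """a) Keep only rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]


def one_to_one(rows: List[Row], column: str, lookup: Dict[Any, Any]) -> List[Row]:
    """c) Replace a column's value via a lookup table; unknown values pass through."""
    return [{**r, column: lookup.get(r[column], r[column])} for r in rows]


def unpivot(rows: List[Row], id_col: str, value_cols: List[str]) -> List[Row]:
    """d) Turn one wide row into one row per value column."""
    return [{id_col: r[id_col], "attribute": c, "value": r[c]} for r in rows for c in value_cols]


def collect_to_array(rows: List[Row], key_col: str, value_col: str) -> List[Row]:
    """e) Gather many rows per key into a single array column."""
    grouped: Dict[Any, list] = defaultdict(list)
    for r in rows:
        grouped[r[key_col]].append(r[value_col])
    return [{key_col: k, value_col + "_array": v} for k, v in grouped.items()]


def aggregate_sum(rows: List[Row], key_col: str, value_col: str) -> List[Row]:
    """f) Sum a numeric column per key."""
    totals: Dict[Any, float] = defaultdict(float)
    for r in rows:
        totals[r[key_col]] += r[value_col]
    return [{key_col: k, "total": t} for k, t in totals.items()]


wide = [{"id": 1, "home_phone": "555-0100", "work_phone": "555-0199"}]
long_rows = unpivot(wide, "id", ["home_phone", "work_phone"])
arrays = collect_to_array(long_rows, "id", "value")

gifts = [
    {"donor": "a", "amount": 10.0, "status": "ok"},
    {"donor": "a", "amount": 5.0, "status": "ok"},
    {"donor": "b", "amount": 7.0, "status": "void"},
]
totals = aggregate_sum(filter_rows(gifts, lambda r: r["status"] == "ok"), "donor", "amount")
states = one_to_one([{"state": "SC"}], "state", {"SC": "South Carolina"})
```

Because each block takes rows and returns rows, they can be chained in whatever order a configuration specifies.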

ML Feedback Loops
Full cycle of model deployment, tying actions taken back into the model
▪ Provides the full cycle of data from presentation, user interaction, and results
▪ Allows monitoring and tuning of ML models
▪ Provides metrics for roadmap prioritization
▪ Provides metrics for A/B testing
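
One way to picture the feedback loop is as events tied together by a shared identifier, from which A/B metrics fall out. The sketch below computes a per-variant success rate. The event shape (`type`, `rec_id`, `variant`, `success`) is hypothetical, and interaction events are carried in the data but not used by this particular metric.

```python
from collections import defaultdict
from typing import Dict, List


def conversion_by_variant(events: List[dict]) -> Dict[str, float]:
    """Share of presented recommendations per variant that ended in a successful result."""
    presented = defaultdict(set)
    converted = defaultdict(set)
    variant_of = {}
    # First pass: record which variant each recommendation was presented under.
    for e in events:
        if e["type"] == "presentation":
            variant_of[e["rec_id"]] = e["variant"]
            presented[e["variant"]].add(e["rec_id"])
    # Second pass: attribute successful results back to the presenting variant.
    for e in events:
        if e["type"] == "result" and e.get("success") and e["rec_id"] in variant_of:
            converted[variant_of[e["rec_id"]]].add(e["rec_id"])
    return {v: len(converted[v]) / len(ids) for v, ids in presented.items()}


# Hypothetical events for three recommendations across two variants.
events = [
    {"type": "presentation", "rec_id": "r1", "variant": "A"},
    {"type": "interaction", "rec_id": "r1", "action": "click"},
    {"type": "result", "rec_id": "r1", "success": True},
    {"type": "presentation", "rec_id": "r2", "variant": "A"},
    {"type": "interaction", "rec_id": "r2", "action": "dismiss"},
    {"type": "presentation", "rec_id": "r3", "variant": "B"},
    {"type": "result", "rec_id": "r3", "success": True},
]
rates = conversion_by_variant(events)
```

The same joined events could feed model monitoring, for example by tracking the rate over time per model version.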

Tying It All Together
▪ Data flows from various products
▪ Ingested
▪ Transformed via Configuration Driven Pipelines
▪ One Common Data Model
▪ Data flows out of common data models back into ecosystem
▪ Baked in feedback loops

Democratized Data
▪ Data Scientists can access data directly from the CDM
▪ CDM is a Delta table
▪ Views are created for security access (no access to PII)
▪ Access is controlled at the view level
▪ Data is projected (using Schema on Read) to any destination location (blob, SQL Server, Cosmos, etc.)
▪ Data Scientists and Engineers can request any dataset they need by specifying metadata
▪ Requested data is transformed based on the metadata description
▪ Data is streamed or batched out to destination based on metadata frequency info

