@Nubank
Andre Midea
Rodrigo Ney
Data Democratization
#UnifiedDataAnalytics #SparkAISummit
Andre Midea
Engineer @ Nubank
/in/andremidea/
andremidea
Rodrigo Ney
Engineer @ Nubank
/in/rodrigoney/
rodrigoney
2014: Credit card supported by a fully digital and branchless experience.
2017: Our own version of a bank account, the simplest and most intelligent solution yet.
2019: International expansion.
Growing Quickly
A bank from scratch using Clojure
Roles making decisions
Business Analyst
Financial Analyst
Legal
Data-scientists
Customer Support
We need to build a data platform
The Report Lifecycle
Where good ideas perish
Nooooo!
Have you ever faced/seen this situation?
Frustrations
Business users: engineers are too slow and don't implement their ideas correctly, and proposing new ideas becomes tiring because the process is too painful.
Engineers: business users underestimate the challenges of implementing something with a good level of quality.
Goal
A data platform where the incentives are aligned so that people are empowered to make more informed decisions. We achieve that by reducing the friction for nontechnical people to create solutions using data.
/democracy/
the absence of hereditary or arbitrary class distinctions or privileges
Movements in the software community
1. Agile: Customer collaboration over contract negotiation. Relationships increase the sense of safety when we work together as partners.
2. DevOps: Decentralizing the maintenance of services, so every team is responsible for keeping its services running, creating alerts, and having SLAs and SLOs.
3. DataOps: Reduce heroism. As the pace and breadth of need for analytic insights ever increases, we believe analytic teams should strive to reduce heroism and create sustainable and scalable data analytic teams and processes.
a principled approach to
data-engineering
Our Stack
• Functional
• JVM
• LISP
• Accumulate-only / Immutable
• Git for your data
• Transaction log as high-level API
• Lazy
• Declarative
• FP inspired
In production:
330+ engineers
315 microservices
400 deploys per week
80 TB of data
13+ shards
2100 Datomic transactors
Resources: Architecting a Modern Financial Institution (InfoQ),
Challenges and Benefits of an Immutable Database, Nubank Talk Youtube Playlist
Principle 1
Having data coverage is important. I mean, ALL the data!
Love your logs
Datomic EAVT
Field      Meaning                             Type
e          Entity                              Long
a          Attribute                           String
v          Value                               Any
t          Transaction point in time           Long
tx         Transaction entity id               Long
txInstant  Transaction wall-clock time         java.util.Date
op         Operation (assertion / retraction)  Boolean
 
[ 28 ':name' 'john' ]
entity, attribute, value

Changing the name to 'lennon' retracts the old value and asserts the new one in the same transaction:

[ 28 ':name' 'john'   Tx₁ true ]
[ 28 ':name' 'john'   Tx₂ false ]
[ 28 ':name' 'lennon' Tx₂ true ]
entity, attribute, value, transaction, op
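;; `datomic` is a Datomic connection
(let [log (datomic.api/log datomic)
      t1 10000   ; range start (inclusive)
      t2 50000]  ; range end (exclusive)
  ;; read the transactions recorded in the log between t1 and t2
  (datomic.api/tx-range log t1 t2))
;; => [Datom(:e 1234,
;;           :a :transaction/value
;;           :value 10.00
;;           :t 10000)...]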
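One way to read the datoms above: replaying them in transaction order yields the value that holds at any point in time. The sketch below is plain Scala, not the Datomic API. `Datom`, `factsAsOf`, and the numeric transaction ids are hypothetical names chosen for illustration, and it assumes each attribute holds at most one value per entity.

```scala
object DatomReplay {
  // Simplified datom with the fields shown on the slide; op = true asserts, op = false retracts
  final case class Datom(e: Long, a: String, v: Any, tx: Long, op: Boolean)

  // Replays the log up to transaction `asOf` and returns the value held by each (entity, attribute).
  // Assumption: every attribute holds at most one value per entity.
  def factsAsOf(log: Seq[Datom], asOf: Long): Map[(Long, String), Any] =
    log
      .filter(_.tx <= asOf)
      .sortBy(d => (d.tx, d.op)) // inside one transaction, retractions (false) sort before assertions (true)
      .foldLeft(Map.empty[(Long, String), Any]) { (facts, d) =>
        val key = (d.e, d.a)
        if (d.op) facts + (key -> d.v)
        else if (facts.get(key).contains(d.v)) facts - key
        else facts
      }

  // The three datoms from the slide, with Tx₁ = 1 and Tx₂ = 2
  val log: Seq[Datom] = Seq(
    Datom(28L, ":name", "john", 1L, true),
    Datom(28L, ":name", "john", 2L, false),
    Datom(28L, ":name", "lennon", 2L, true)
  )

  def main(args: Array[String]): Unit = {
    println(factsAsOf(log, asOf = 1L)) // the name is "john"
    println(factsAsOf(log, asOf = 2L)) // the name is "lennon"
  }
}
```

Because nothing is overwritten, both the old and the new name remain reachable by choosing a different `asOf`.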
Infrastructure as code
{:name :diablo
 :canary {:type :shard}
 :datastores
 {:datomic {:databases
            [{:transactor "diablo"
              :name "diablo"}]}
  :kafka {:enabled? true}}
 :environments
 {… :prod #nu/prototypes-for [:prod :sharded]}
 …
 :pipelines
 #{{:type :clojure-service
    :prod {:promotion :automatic}
    :cdc-test-frameworks #{:sachem}}}}
prod-schema analytical-schema
Principle 2
No need for an excess of features. Pave the right roads!
batch view = function(all data)
“The portion of the Lambda Architecture that implements the batch view = function(all data) equation is called the batch layer. The batch layer stores the master copy of the dataset and precomputes batch views on that master dataset. The master dataset can be thought of as a very large list of records.”
- Nathan Marz, “Big Data: Principles and best practices of scalable realtime data systems”
// A report is a function from all the data (dataset name -> DataFrame) to a DataFrame
val report: Map[String, DataFrame] => DataFrame
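A minimal sketch of "batch view = function(all data)", written without Spark so that it runs on its own. `Table` stands in for a DataFrame, and `row`, `callsPerNumber`, and the `"calls"` dataset are hypothetical names used only for this illustration.

```scala
object BatchView {
  // Stand-in for a Spark DataFrame: a table is a list of rows, a row maps column name to value
  type Row = Map[String, Any]
  type Table = Seq[Row]

  // Small helper to build a row with mixed value types
  def row(columns: (String, Any)*): Row = columns.toMap

  // The report only sees "all the data" by name and returns a new table; it keeps no state
  val callsPerNumber: Map[String, Table] => Table = allTheData =>
    allTheData("calls")
      .groupBy(r => r("our_number"))
      .map { case (number, rows) => row("our_number" -> number, "calls" -> rows.size) }
      .toSeq
      .sortBy(r => r("our_number").toString)

  val allTheData: Map[String, Table] = Map(
    "calls" -> Seq(
      row("our_number" -> "999-9999", "call__id" -> 1),
      row("our_number" -> "999-9999", "call__id" -> 2),
      row("our_number" -> "111-1111", "call__id" -> 3)
    )
  )

  def main(args: Array[String]): Unit =
    callsPerNumber(allTheData).foreach(println)
}
```

Since the view is a pure function of its inputs, rerunning it over the same master dataset always gives the same table, which is what makes recomputing a batch view safe.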
Make datasets reusable by default
“Extend the same abstraction”
[Diagram: Dataset 1 to Dataset 5, composed through f(1,2) and f(3,4)]
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

trait SparkOperation {
  val name: String
  val inputs: Set[String]
  // Builds this dataset from its inputs, which arrive keyed by dataset name
  def definition(datasets: Map[String, DataFrame]): DataFrame
}

object OmbudsmanCalls extends SparkOperation {
  def phoneNumber: String = "999-9999"
  def callsName: String = "contract-phone/calls"

  override val name: String = "dataset/ombudsman-calls"
  override val inputs: Set[String] = Set(callsName)

  // Keeps only the calls that have started and were made to the ombudsman number
  override def definition(datasets: Map[String, DataFrame]): DataFrame = {
    val calls = datasets(callsName)
    calls
      .where(col("started_at").isNotNull && col("our_number") === phoneNumber)
      .select(col("started_at"), col("call__id"), col("our_number"))
  }
}
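Because every operation declares its `name` and `inputs`, a runner can work out the execution order on its own. The sketch below shows one possible way to do that in plain Scala. It is not Nubank's scheduler: `Operation` mirrors the shape of the `SparkOperation` trait above with a `Table` stand-in instead of a DataFrame, and `run`, `calls`, and `callCount` are hypothetical names.

```scala
object DatasetGraph {
  type Table = Seq[Map[String, Any]] // stand-in for a Spark DataFrame

  // Same shape as SparkOperation, with Table in place of DataFrame
  trait Operation {
    def name: String
    def inputs: Set[String]
    def definition(datasets: Map[String, Table]): Table
  }

  // Repeatedly computes the operations whose inputs are all available.
  // Fails when nothing can make progress, which means a missing input or a cycle.
  @annotation.tailrec
  def run(ops: Set[Operation], available: Map[String, Table]): Map[String, Table] =
    if (ops.isEmpty) available
    else {
      val (ready, blocked) = ops.partition(_.inputs.subsetOf(available.keySet))
      require(ready.nonEmpty, s"missing inputs or cycle for: ${blocked.map(_.name).mkString(", ")}")
      val computed = ready.map(op => op.name -> op.definition(available)).toMap
      run(blocked, available ++ computed)
    }

  val calls: Table = Seq(
    Map[String, Any]("call__id" -> 1, "our_number" -> "999-9999"),
    Map[String, Any]("call__id" -> 2, "our_number" -> "111-1111")
  )

  val ombudsmanCalls: Operation = new Operation {
    val name = "dataset/ombudsman-calls"
    val inputs = Set("contract-phone/calls")
    def definition(datasets: Map[String, Table]): Table =
      datasets("contract-phone/calls").filter(_.get("our_number").contains("999-9999"))
  }

  // A second dataset that reuses the first one as its input
  val callCount: Operation = new Operation {
    val name = "dataset/ombudsman-call-count"
    val inputs = Set("dataset/ombudsman-calls")
    def definition(datasets: Map[String, Table]): Table =
      Seq(Map[String, Any]("calls" -> datasets("dataset/ombudsman-calls").size))
  }

  def main(args: Array[String]): Unit = {
    val all = run(Set(callCount, ombudsmanCalls), Map("contract-phone/calls" -> calls))
    println(all("dataset/ombudsman-call-count"))
  }
}
```

The order in which operations are passed to `run` does not matter; only the declared `inputs` do, which is what lets one dataset be reused by any other.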
Yeah, but nobody will use this…
It’s hard! Spark and Scala…
We have more than 300 business users working with our abstraction.
More than 15 Pull Requests per day.
~500k LOC (only datasets)
Principle 3
KISS: Keep It Simple, Stupid!
Batch
“… 97% of the time: premature optimisation is the root of all evil. Yet we should not pass up our opportunities in that critical 3%”
- Donald Knuth
TYPE      SUPPORT                                USER CAN USE?
RAW       Platform Maintained
CONTRACT  Platform Maintained & User Generated
DATASET   User Defined
[Diagram: raw/log → contract → user data (User Land). Each raw/log has a cache up to t', and only entries where t > t' are read from the log.]
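The "where t > t'" idea can be sketched as an append-only refresh: keep what was already extracted up to t', and read only the newer part of the log. This is a plain-Scala illustration under that reading of the diagram. `Entry`, `Cache`, and `refresh` are hypothetical names, not part of the platform described in the talk.

```scala
object IncrementalLog {
  // One log record at transaction point t
  final case class Entry(t: Long, payload: String)

  // Everything already extracted from the log, up to and including `upTo` (the t' of the diagram)
  final case class Cache(upTo: Long, entries: Vector[Entry])

  // Reads only entries newer than the cache and appends them; the cached prefix is never rewritten
  def refresh(cache: Cache, log: Seq[Entry]): Cache = {
    val fresh = log.filter(_.t > cache.upTo).sortBy(_.t)
    Cache(fresh.lastOption.map(_.t).getOrElse(cache.upTo), cache.entries ++ fresh)
  }

  def main(args: Array[String]): Unit = {
    val day1 = Seq(Entry(1L, "a"), Entry(2L, "b"))
    val day2 = day1 :+ Entry(3L, "c")
    val afterDay1 = refresh(Cache(0L, Vector.empty), day1)
    val afterDay2 = refresh(afterDay1, day2) // only the entry at t = 3 is new
    println(afterDay2)
  }
}
```

Refreshing twice against the same log changes nothing, which only holds because the log is append-only.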
Daily Effects
[Diagram: raw/log → contract → user datasets. The raw logs are immutable / append only.]
Introduce a change: the dataset breaks.
Next day: introduce a fix.
Safety Nets
Principle 4
Keep your DAG valid at all times!
Complex
Spark DataFrame
We know the schema of all raw datasets! (eavt schema)
We can pipe a test DataFrame from one operation downstream in the lineage.
It's lazy!
All Spark transformations are valid!
[Diagram: Raw layer R’ to R’’’’ (eavt schema, test DataFrames) → f(R’) to f(R’’’’) → User Land u’ to u’’’’ → f(U’,U’’) and f(U’’’,U’’’’) → 2u’, 2u’’ → f(…) → 3u’, 3u’’, 4u’]
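The validation idea relies on schemas alone: if the columns of every raw dataset are known, each operation can be checked against its upstream schema without reading any data. The sketch below models that in plain Scala with schemas as sets of column names. In Spark the same effect would come from passing empty test DataFrames through each `definition`; here `Op`, `validate`, and the column names are hypothetical, and operations are limited to a single input and a column selection to keep the example short.

```scala
object DagValidation {
  type Schema = Set[String] // column names only; a real schema would also carry types

  // A single-input operation that selects some columns from its upstream dataset
  final case class Op(name: String, input: String, selects: Set[String])

  // Walks the operations in the given (dependency) order using schemas only.
  // Returns one message per broken operation; datasets downstream of a broken one
  // are reported as having an unknown input.
  def validate(raw: Map[String, Schema], ops: Seq[Op]): Seq[String] = {
    val (_, errors) = ops.foldLeft((raw, Vector.empty[String])) { case ((known, errs), op) =>
      known.get(op.input) match {
        case None =>
          (known, errs :+ s"${op.name}: unknown input ${op.input}")
        case Some(schema) =>
          val missing = op.selects -- schema
          if (missing.isEmpty) (known + (op.name -> op.selects), errs)
          else (known, errs :+ s"${op.name}: missing columns ${missing.toSeq.sorted.mkString(", ")} in ${op.input}")
      }
    }
    errors
  }

  val raw: Map[String, Schema] =
    Map("contract-phone/calls" -> Set("started_at", "call__id", "our_number"))

  val ok = Op("dataset/ombudsman-calls", "contract-phone/calls", Set("started_at", "call__id"))
  val broken = Op("dataset/call-durations", "dataset/ombudsman-calls", Set("call__id", "ended_at"))

  def main(args: Array[String]): Unit =
    validate(raw, Seq(ok, broken)).foreach(println)
}
```

Run as a test on every change, a check like this rejects a pull request that drops or renames a column still used downstream, before any production job runs.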
• Integration Tests
• Consumer Driven Contracts
• Integrity Checks
• Anomaly Detection based on a dataset's statistics over time
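As an illustration of the last bullet, one simple check compares today's value of a statistic, such as the row count, with its own history. The talk does not say which statistics or thresholds are used, so the sketch below is only an assumption: `isAnomalous` and the threshold of three standard deviations are hypothetical choices.

```scala
object RowCountCheck {
  // Flags `today` when it lies more than `k` sample standard deviations from the mean of `history`
  def isAnomalous(history: Seq[Double], today: Double, k: Double = 3.0): Boolean = {
    require(history.size >= 2, "need at least two past observations")
    val mean = history.sum / history.size
    val variance = history.map(x => math.pow(x - mean, 2)).sum / (history.size - 1)
    math.abs(today - mean) > k * math.sqrt(variance)
  }

  def main(args: Array[String]): Unit = {
    val rowCounts = Seq(1000.0, 1010.0, 990.0, 1005.0, 995.0) // daily row counts of one dataset
    println(isAnomalous(rowCounts, today = 1002.0)) // a normal day
    println(isAnomalous(rowCounts, today = 400.0))  // most of the rows are missing
  }
}
```

The same function works for any numeric statistic tracked per run, for example null ratios or distinct counts.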
Principle 5
Think, Iterate, Deploy… Easily!
REPL environment closer to production
[Diagram: new(service + db) → logs → pivoted → contract → user data / dataset → data warehouse → BI tool]
back to production…
automatically!
Meanwhile….
Results
[Chart: Data Engineer vs Datasets. X axis: quarters from 2016,Q3 to 2019,Q4. Series: datasets; datasets per data engineer. Data labels rise from 3.333 to 11523.]
That discrepancy is even more pronounced when we look at the number of datasets generated per data engineer.
KEY TAKEAWAYS
1. Think carefully about the roads you want to pave, and pave only a handful of them.
2. Constraints are a good thing: if they hide complexity away, users will be glad to have them.
3. Leverage the transaction log.
4. Automate data ingestion to the extreme.
5. Have a dev environment closer to production.
6. Validate the DAG at test time, and create invariants for the datasets.
we are hiring
nubank.com.br/en/careers/
