Data Democratization at Nubank


@Nubank
Andre Midea
Rodrigo Ney
Data Democratization
#UnifiedDataAnalytics #SparkAISummit

Andre Midea
Engineer @ Nubank
/in/andremidea/
andremidea

Rodrigo Ney
Engineer @ Nubank
/in/rodrigoney/
rodrigoney


2014: Credit card supported by a fully digital and branchless experience.
2017: Our own version of a bank account, the simplest and most intelligent solution yet.


Growing Quickly
A bank from scratch using Clojure

Roles making decisions
1. Business Analyst
2. Financial Analyst
3. Legal
4. Data Scientists
5. Customer Support

We need to build a data platform

The Report Lifecycle
Where good ideas perish

Nooooo!
Have you ever faced/seen this situation?

Frustrations
Business users: engineers are too slow and don't implement their ideas correctly, and they grow too fatigued to pursue new ideas because the process is too painful.
Engineers: business users underestimate the challenges of implementing something with a good level of quality.

Goal
A data platform where the incentives are aligned so that people are empowered to make more informed decisions, achieved by reducing the friction for nontechnical people to build creative solutions using data.

/democracy/
the absence of hereditary or arbitrary class distinctions or privileges

Movements in the software community

1. Agile
Customer collaboration over contract negotiation. Relationships increase the sense of safety when we work together as partners.

2. DevOps
Decentralizing the maintenance of services, so every team is responsible for keeping its services running, creating alerts, and having SLAs and SLOs.

3. DataOps
Reduce heroism: As the pace and breadth of need for analytic insights ever increases, we believe analytic teams should strive to reduce heroism and create sustainable and scalable data analytic teams and processes.

a principled approach to data-engineering

• Accumulate-only/Immutable
• Git for your data
• Transaction log as high-level API

• Lazy
• Declarative
• FP inspired

330+ engineers
315 microservices
400 deploys per week
80 TB of data
13+ shards
2100 Datomic transactors in production

Resources: Architecting a Modern Financial Institution (InfoQ), Challenges and Benefits of an Immutable Database, Nubank Talk YouTube Playlist

Principle 1
Having data
coverage is
important.
I mean, ALL
the data!

Datomic EAVT
Field      Meaning                             Type
e          Entity                              Long
a          Attribute                           String
v          Value                               Any
t          Transaction point in time           Long
tx         Transaction entity id               Long
txInstant  Transaction wall-clock time         java.util.Date
op         Operation (assertion / retraction)  Boolean

[ entity attribute value ]
[ 28 ':name' 'john' ]

[ entity attribute value ]
[ 28 ':name' 'john' ] → 'lennon'

[ entity attribute value transaction op ]
[ 28 ':name' 'john' Tx₁ true]
[ 28 ':name' 'john' Tx₂ false]
[ 28 ':name' 'lennon' Tx₂ true]

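;; Read every transaction between two points in time from the log.
(let [log (datomic.api/log datomic) ; `datomic` is presumably the connection
      t1  10000
      t2  50000]
  (datomic.api/tx-range log t1 t2))
;; => [Datom(:e 1234,
;;           :a :transaction/value
;;           :value 10.00
;;           :t 10000)...]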
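
The datom slides above describe an accumulate-only log of assertions and retractions. As a minimal sketch in plain Python (not Datomic's API; the `Datom` tuple and the `as_of` helper are hypothetical names, and single-valued attributes are assumed), replaying that log up to a transaction `t` recovers the state at that point in time:

```python
from typing import Any, NamedTuple


class Datom(NamedTuple):
    """Hypothetical stand-in for a datom: [e a v t op]."""
    e: int    # entity id
    a: str    # attribute
    v: Any    # value
    t: int    # transaction point in time
    op: bool  # True = assertion, False = retraction


def as_of(log: list[Datom], t: int) -> dict[tuple[int, str], Any]:
    """Replay the log up to and including transaction t.

    Returns the value of every (entity, attribute) pair at that point,
    assuming each attribute holds a single value.
    """
    state: dict[tuple[int, str], Any] = {}
    for d in sorted(log, key=lambda d: d.t):
        if d.t > t:
            break
        key = (d.e, d.a)
        if d.op:
            state[key] = d.v
        elif state.get(key) == d.v:
            # A retraction only removes the value it names.
            del state[key]
    return state


# The three datoms from the slide, with Tx1 = 1 and Tx2 = 2.
log = [
    Datom(28, ":name", "john", 1, True),
    Datom(28, ":name", "john", 2, False),
    Datom(28, ":name", "lennon", 2, True),
]

print(as_of(log, 1))  # {(28, ':name'): 'john'}
print(as_of(log, 2))  # {(28, ':name'): 'lennon'}
```

Because nothing is ever overwritten, both the old and the new name stay queryable, which is what makes the log usable as a high-level API for downstream datasets.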

Infrastructure as code
{:name :diablo
 :canary {:type :shard}
 :datastores
 {:datomic {:databases
            [{:transactor "diablo"
              :name "diablo"}]}
  :kafka {:enabled? true}}
 :environments
 {… :prod #nu/prototypes-for [:prod :sharded]}
 …
 :pipelines
 #{{:type :clojure-service
    :prod {:promotion :automatic}
    :cdc-test-frameworks #{:sachem}}}}

Principle 2
No need for
excess of features
Pave the right
roads!

batch view = function(all data)
“The portion of the Lambda Architecture that implements the batch view = function(all data) equation is called the batch layer. The batch layer stores the master copy of the dataset and precomputes batch views on that master dataset. The master dataset can be thought of as a very large list of records.”
- Nathan Marz, “Big Data: Principles and best practices of scalable realtime data systems”

val report: (allTheData: Map[String, DataFrame]) => DataFrame

Make datasets reusable by default
Extend the same abstraction
[Diagram: Dataset 1 through Dataset 5, combined by f(1,2) and f(3,4)]

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

trait SparkOperation {
  val name: String
  val inputs: Set[String]
  // Pure function from the named input datasets to the output dataset.
  def definition(datasets: Map[String, DataFrame]): DataFrame
}

object OmbudsmanCalls extends SparkOperation {
  override val name: String = "dataset/ombudsman-calls"
  override val inputs: Set[String] = Set(callsName)

  override def definition(datasets: Map[String, DataFrame]): DataFrame = {
    val calls = datasets(callsName)
    // Keep only started calls made to the ombudsman number.
    calls
      .where(col("started_at").isNotNull and col("our_number") === phoneNumber)
      .select(col("started_at"), col("call__id"), col("our_number"))
  }

  def phoneNumber: String = "999-9999"
  def callsName: String = "contract-phone/calls"
}
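
Since every dataset declares its name, its inputs, and a pure definition, a runner can work out the execution order on its own. The following is a minimal sketch of that idea in plain Python, with lists of dicts standing in for Spark DataFrames; `Operation`, `run_all`, and the second dataset `dataset/ombudsman-calls-per-day` are hypothetical names introduced for illustration, not Nubank's implementation.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Callable

Rows = list[dict]  # stand-in for a Spark DataFrame


@dataclass(frozen=True)
class Operation:
    """Hypothetical Python analogue of the SparkOperation trait."""
    name: str
    inputs: frozenset[str]
    definition: Callable[[dict[str, Rows]], Rows]


def run_all(ops: list[Operation], sources: dict[str, Rows]) -> dict[str, Rows]:
    """Compute every operation after the datasets it depends on."""
    by_name = {op.name: op for op in ops}
    graph = {op.name: set(op.inputs) for op in ops}
    datasets = dict(sources)
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    for name in TopologicalSorter(graph).static_order():
        if name in datasets:
            continue  # already available as a source
        op = by_name[name]  # KeyError: an input that nobody produces
        datasets[name] = op.definition({i: datasets[i] for i in op.inputs})
    return datasets


PHONE_NUMBER = "999-9999"

ombudsman_calls = Operation(
    name="dataset/ombudsman-calls",
    inputs=frozenset({"contract-phone/calls"}),
    definition=lambda ds: [
        {k: row[k] for k in ("started_at", "call__id", "our_number")}
        for row in ds["contract-phone/calls"]
        if row["started_at"] is not None and row["our_number"] == PHONE_NUMBER
    ],
)

calls_per_day = Operation(
    name="dataset/ombudsman-calls-per-day",
    inputs=frozenset({"dataset/ombudsman-calls"}),
    definition=lambda ds: [
        {"day": day,
         "calls": sum(1 for r in ds["dataset/ombudsman-calls"]
                      if r["started_at"] == day)}
        for day in sorted({r["started_at"] for r in ds["dataset/ombudsman-calls"]})
    ],
)

sources = {
    "contract-phone/calls": [
        {"started_at": "2019-10-01", "call__id": 1, "our_number": "999-9999"},
        {"started_at": None, "call__id": 2, "our_number": "999-9999"},
        {"started_at": "2019-10-01", "call__id": 3, "our_number": "111-1111"},
    ]
}

# The list order does not matter; the declared inputs decide the order.
result = run_all([calls_per_day, ombudsman_calls], sources)
print(result["dataset/ombudsman-calls-per-day"])
```

Because a derived dataset is consumed exactly like a source dataset, anything a user writes becomes reusable by the next user without extra work.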

Yeah, but nobody will use this…
It’s hard! Spark and Scala…
We have more than 300 business users working with our abstraction.
More than 15 pull requests per day.
~500k LOC (datasets only)

Principle 3
KISS
Keep it Simple
Stupid!
B
Batch

Donald Knuth
“… 97% of the time: premature
optimisation is the root of all evil.
Yet we should not pass up our
opportunities in that critical 3%”

TYPE USER CAN USE?
RAW
CONTRACT
DATASET
SUPPORT
Platform Maintained
Platform Maintained & User Generated
User Defined

[Diagram: raw/log → contract → user data (User Land)]

[Diagram: cache up to t’' plus raw/log where t > t’' → contract → user data (User Land)]
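
One reading of the cache diagram above is that each run keeps the state computed up to some point t’' and then reads only the log entries with t > t’'. A minimal sketch under that assumption, in plain Python, where `Cache`, `refresh`, and the running-sum view are hypothetical and chosen only to keep the example small:

```python
from dataclasses import dataclass, field


@dataclass
class Cache:
    """Hypothetical cached view: folded state plus the last t it covers."""
    t: int = 0
    state: dict = field(default_factory=dict)


def refresh(cache: Cache, log: list[tuple[int, str, int]]) -> Cache:
    """Fold only the log entries with t > cache.t into a new cache.

    Each log entry is (t, key, amount); the view is a running sum per key.
    The log is append-only, so entries at or before cache.t never change
    and do not need to be folded again.
    """
    state = dict(cache.state)
    latest = cache.t
    for t, key, amount in log:
        if t > cache.t:  # the "where t > t''" filter from the diagram
            state[key] = state.get(key, 0) + amount
            latest = max(latest, t)
    return Cache(t=latest, state=state)


log = [(1, "a", 10), (2, "b", 5)]
day1 = refresh(Cache(), log)

log += [(3, "a", 7)]          # the log only grows
day2 = refresh(day1, log)     # folds only the entry with t > 2

full = refresh(Cache(), log)  # recomputing from scratch agrees
print(day2 == full)
```

The shortcut is only safe because the raw log is immutable: the cached prefix can never be invalidated by a later write.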

[Diagram: immutable/append-only raw/log feeding user datasets]
Daily Effects
Introduce a Change
The dataset breaks

[Diagram: immutable/append-only raw/log feeding user datasets]
Next Day
Introduce a fix

Principle 4
Safety Nets
Keep your DAG valid at all times!

Spark DataFrame
Raw layer (EAVT schema) → test DataFrame
We know the schema of all raw datasets!
We can pipe the DataFrame from one operation downstream in the lineage. It's lazy!
All Spark transformations are valid!
[Diagram: test DataFrames R’, R’’, R’’’, R’’’’ from the raw layer feed f(R’) … f(R’’’’) in User Land, producing u’ … u’’’’; these feed f(U’,U’’) and f(U’’’,U’’’’), producing 2u’, 2u’’, and so on through 3u’, 3u’’, 4u’]
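
The slides above rely on two facts: the schema of every raw dataset is known, and Spark transformations are lazy, so a test DataFrame can be piped through the whole lineage without reading data. A minimal sketch of the same idea in plain Python, assuming each dataset is reduced to its set of column names; `select`, `validate`, and the dataset names are hypothetical and stand in for real Spark operations:

```python
from graphlib import TopologicalSorter
from typing import Callable

Schema = frozenset[str]  # column names only; stands in for a test DataFrame


def select(schema: Schema, *cols: str) -> Schema:
    """Schema-level 'select': fails fast if a column does not exist."""
    missing = set(cols) - schema
    if missing:
        raise KeyError(f"unknown columns: {sorted(missing)}")
    return frozenset(cols)


def validate(
    raw: dict[str, Schema],
    ops: dict[str, tuple[list[str], Callable[..., Schema]]],
) -> dict[str, Schema]:
    """Pipe the known raw schemas through every operation, in DAG order.

    No data is read; a broken operation surfaces as an exception.
    """
    graph = {name: set(inputs) for name, (inputs, _) in ops.items()}
    schemas = dict(raw)
    for name in TopologicalSorter(graph).static_order():
        if name in schemas:
            continue
        inputs, fn = ops[name]
        schemas[name] = fn(*(schemas[i] for i in inputs))
    return schemas


raw = {"raw/calls": frozenset({"e", "a", "v", "t"})}

ops = {
    "contract/calls": (["raw/calls"], lambda r: select(r, "e", "v", "t")),
    "dataset/calls": (["contract/calls"], lambda c: select(c, "e", "t")),
}
print(sorted(validate(raw, ops)["dataset/calls"]))  # ['e', 't']

# An upstream change that drops column "t" breaks the downstream dataset,
# and the check reports it before anything runs on real data.
broken = dict(ops)
broken["contract/calls"] = (["raw/calls"], lambda r: select(r, "e", "v"))
try:
    validate(raw, broken)
    dag_is_valid = True
except KeyError:
    dag_is_valid = False
print(dag_is_valid)  # False
```

Run as a test on every pull request, a check like this is what lets many contributors change datasets while the DAG as a whole stays valid.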

• Integration Tests
• Consumer Driven Contracts
• Integrity Checks
• Anomaly Detection based on a dataset's statistics over time
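
As an illustration of the last bullet, anomaly detection over a dataset's statistics can be as simple as comparing today's value with the spread of previous runs. The sketch below assumes a z-score rule on daily row counts; the function name, the threshold, and the numbers are made up for the example and are not taken from the talk.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], today: float, threshold: float = 3.0) -> bool:
    """Flag today's statistic if it lies more than `threshold` standard
    deviations away from the mean of the previous runs."""
    if len(history) < 2:
        return False  # not enough runs to estimate the spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold


row_counts = [1000, 1010, 990, 1005, 995]  # made-up daily row counts
print(is_anomalous(row_counts, 1003))  # False
print(is_anomalous(row_counts, 400))   # True
```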

Principle 5
Think, Iterate, Deploy…
Easily!

REPL environment closer to production

new (service + db) → logs → pivoted → contract → user data → dataset → data warehouse / BI tool → back to production… automatically!
Meanwhile…

[Chart: "Datasets" and "dataset per data engineer", by quarter from 2016 Q3 to 2019 Q4]
Data Engineer vs Datasets
That discrepancy is even more pronounced when we look at the number of datasets generated
per data-engineer

KEY TAKEAWAYS
1. Think carefully about the roads you want to pave, and pave only a handful of them.
2. Constraints are a good thing: if they hide complexity away, users will be glad of them.
3. Leverage the Transaction Log.
4. Automate data ingestion to the extreme.
5. Have a dev environment closer to production.
6. Validate the DAG at test time, and create invariants for the datasets.

we are hiring
nubank.com.br/en/careers/
