[Research] deploying predictive models with the actor framework - Brian Gawalt

PAPIs 2015
Akka & Data Science:
Making real-time
predictions
Brian Gawalt
2nd International Conference on Predictive APIs and Apps
August 7, 2015

[A]
Sometimes, data scientists need to worry about throughput.

[B]
One way to increase throughput is with concurrency.

[C]
The Actor Model is an easy way to build a concurrent system.

[D]
Scala+Akka provides an easy-to-use Actor Model context.

[A + B + C + D ⇒ E]
Data scientists should check out Scala+Akka.

Consider:
● building a model,
● vs. using a model

Lots of ways to practice building a model

The Classic Process
1. Load your data set’s raw materials
2. Produce feature vectors:
○ Training,
○ Validation,
○ Testing
3. Build the model with training and validation vectors

The Classic Process: One-time Testing
● Build:
○ Load train/valid./test materials
○ Make train/valid./test feature vectors
○ Train Model
● Use:
○ Make test predictions

The Classic Process: Repeated Testing
● Build:
○ Load train/valid. materials
○ Make train/valid. feature vectors
○ Train Model → (saved model)
● Use (repeat every K minutes):
○ Load test/new materials
○ Make test/new feature vectors
○ Make test/new predictions

Sometimes my tasks work like that, too!

But this talk is about the other kind of tasks.

[A]
Sometimes, data scientists need to worry about throughput.

Example:
Freelancer availability on Upwork

Hiring Freelancers on Upwork
1. Post a job
2. Search for freelancers
3. Find someone you like
4. Ask them to interview
○ Request Accepted!
○ or rejected/ignored...
THE TASK:
Look at recent freelancer behavior, and predict, at the time of Step 2, who’s likely to accept an invite at the time of Step 4

Building this model is business as usual:

Building Availability Model
1. Load raw materials:
○ Examples of accepts/rejects
○ Histories of freelancer site activity
■ Job applications sent or received
■ Hours worked
■ Click logs
■ Profile updates
2. Produce feature vectors:
[Diagram labels: Greenplum, Amazon S3, Internal Service]

Using Availability Model
● Load train/valid. materials
● Make train/valid. feature vectors
● Train Model → (saved model)
● Make test/new feature vectors (repeat every 60 minutes)

Make test/new feature vectors (saved model)
● Load job app data (4 min.)
● Load click log data (30 min.)
● Load work hours data (5 min.)
● Load profile data (20 ms/profile)

● Left with under 21 minutes to
collect profile data
○ Rate limit: 20 ms/profile
○ At most, 63K profiles per
hour
● Six million freelancers who
need availability predictions: expect
~95 hours between re-scoring
any individual
● Still need to spend time
actually building vectors and
exporting scores!
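
The budget in the bullets above can be checked with a few lines of arithmetic. This is only a sketch: the inputs are the figures quoted on the slides, the hourly cycle is taken from the "repeat every 60 minutes" schedule, and the object and value names are made up for illustration.

```scala
// Back-of-the-envelope check of the hourly profile-fetch budget.
// Inputs are the figures quoted on the slides; all names are illustrative.
object ProfileBudget {
  val cycleMinutes     = 60        // predictions are refreshed once per hour
  val jobAppMinutes    = 4         // load job app data
  val clickLogMinutes  = 30        // load click log data
  val workHoursMinutes = 5         // load work hours data
  val msPerProfile     = 20        // rate limit on profile fetches
  val freelancers      = 6000000L  // "six million" on the slide

  // Minutes left for profile fetching after the three bulk loads.
  val profileMinutes: Int =
    cycleMinutes - jobAppMinutes - clickLogMinutes - workHoursMinutes

  // Profiles that fit in that window at the rate limit.
  val profilesPerHour: Long = profileMinutes * 60L * 1000L / msPerProfile

  // Hours needed to re-score every freelancer once.
  val hoursPerFullPass: Double = freelancers.toDouble / profilesPerHour
}
```

With these inputs, `ProfileBudget.profileMinutes` is 21 and `ProfileBudget.profilesPerHour` is 63,000. A full pass over six million freelancers then takes a little over 95 hours, in the same range as the slide's rounded estimate, and that is before any time is spent vectorizing or exporting.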

[B]
One way to increase throughput is with concurrency.

Expensive Option:
Major infrastructure overhaul

… but that takes a lot of time, attention, and cooperation…

Simpler Option:
The Actor Model

[C]
The Actor Model is an easy way to build a concurrent system.

An Actor
● Imagine a mailbox with a brain
● Computation only begins when/if a message arrives
● Keeps its thoughts private:
○ No other actor can actively read this actor’s state
○ Other actors will have to wait to hear a message from this actor

The Actor Model of Concurrency
● Lots of Actors, and each has:
○ Private message queue
○ Private state, shared only by sending messages
● Execution context:
○ Manages threading of each Actor’s computation
○ Handles asynchronous message routing
○ Can send prescheduled messages
● Each received message’s computation is fully completed before the Actor moves on to the next message in its queue
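
To make these properties concrete, here is a toy actor in plain Scala with no Akka involved. It is a sketch under stated assumptions: a single-threaded executor from the Java standard library stands in for the mailbox, and `CounterActor`, `Add`, and `Report` are hypothetical names, not anything from the talk.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical message family for the toy actor below.
final case class Add(n: Int)
final case class Report(replyTo: Long => Unit)

// A toy actor: private state plus a mailbox drained one message at a time.
// The single-threaded executor queues submitted tasks and runs them in
// submission order, so each message is fully processed before the next.
class CounterActor {
  private val mailbox = Executors.newSingleThreadExecutor()
  private var count = 0L // private state, touched only by the mailbox thread

  // Asynchronous send: enqueue the message and return immediately.
  def send(msg: Any): Unit =
    mailbox.execute(new Runnable { def run(): Unit = receive(msg) })

  private def receive(msg: Any): Unit = msg match {
    case Add(n)          => count += n
    case Report(replyTo) => replyTo(count) // state leaves only via a message
    case _               => ()             // ignore unrecognized messages
  }

  // Stop accepting messages, then wait for the queued ones to finish.
  def shutdown(): Unit = {
    mailbox.shutdown()
    mailbox.awaitTermination(5, TimeUnit.SECONDS)
    ()
  }
}
```

Callers never read `count` directly. They send `Add` messages and, when they want the total, send a `Report` carrying a callback. A real execution context such as Akka's also shares threads among many actors and can schedule messages, which this sketch leaves out.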

The Actor Model of Concurrency
[Diagram label: Execution Context]

Parallelizing predictions
[Diagram: fetchers send raw data to the Vectorizer, whose vectors go to the model]
● Refresh work hours (Schedule: Fetch once per hour)
● Refresh job apps (Schedule: Fetch once per hour)
● Refresh click log (Schedule: Fetch once per hour)
● Fetch 10 profiles (Schedule: Fetch every 300ms)
● Vectorizer:
○ Keep copies of raw data
○ Emit vector for each new profile received
● Apply model; export prediction

Serial processing
[Timeline diagram of one hourly cycle]
● Data loading: 55 min
○ Refresh job apps: 4 min
○ Refresh work hours: 5 min
○ Refresh click log: 30 min
○ Fetch ~50K profiles: 55 - 4 - 5 - 30 = 16 min...
● Make feature vectors, export predictions: 5 min
Throughput:
48K users/hr

Parallel Processing with Actors
[Diagram: message timeline across four threads (Thr. 1 to Thr. 4). Legend: msg. sent, msg. rx’d. Messages shown: Refresh job apps, Refresh click log, Refresh work hrs., Fetch pro., Rx data, Store, Vectorize, Export. Rates shown: 1/hr., 3/sec., as rx’ed. Msg. processing time not to scale.]
Throughput:
180K users/hr
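
The two throughput figures can be reproduced with simple arithmetic. One assumption is labeled here because the slides do not spell out the derivation: the 180K figure is treated as what you get when profile fetching runs continuously at the 20 ms/profile rate limit instead of waiting behind the bulk loads. Names are illustrative.

```scala
// Rough comparison of serial vs. concurrent profile throughput.
// Assumption: the concurrent figure is the 20 ms/profile rate limit
// sustained for the whole hour; the slides do not state how it was derived.
object ThroughputComparison {
  val msPerProfile = 20

  // Serial: profiles only get what is left of the 55-minute load window.
  val serialFetchMinutes: Int = 55 - 4 - 5 - 30
  val serialPerHour: Int = serialFetchMinutes * 60 * 1000 / msPerProfile

  // Concurrent: the profile fetcher never waits on the other loads.
  val concurrentPerHour: Int = 60 * 60 * 1000 / msPerProfile

  val speedup: Double = concurrentPerHour.toDouble / serialPerHour
}
```

Under that assumption the serial pipeline handles 48,000 users per hour and the actor version 180,000, a 3.75x improvement without changing the rate limit itself.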

[D]
Scala+Akka provides an easy-to-use Actor Model context.

Message passing, scheduling, & computation behavior defined in 445 lines.

Scala+Akka Actors
● Create Scala class, mix in Actor trait
● Implement the required partial function: receive: PartialFunction[Any, Unit]
● Define family of message objects this actor’s planning to handle
● Define behavior for each message case in receive
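
The shape of `receive` can be tried out without any Akka dependency, since `PartialFunction[Any, Unit]` is plain Scala. The sketch below is not an Akka actor: in Akka the class would mix in `akka.actor.Actor` and define `receive` as a method. The message names (`RawData`, `Profile`) and the `ReceiveSketch` object are hypothetical.

```scala
// Hypothetical message family this "actor" plans to handle.
sealed trait VectorizerMsg
final case class RawData(source: String, rows: Int) extends VectorizerMsg
final case class Profile(userId: Long) extends VectorizerMsg

// Same shape as the receive function an Akka actor implements, written
// against the standard library only so it runs on its own.
object ReceiveSketch {
  var log: List[String] = Nil // stands in for private actor state

  // One case per message type; anything else is simply not handled.
  val receive: PartialFunction[Any, Unit] = {
    case RawData(source, rows) => log ::= s"stored $rows rows from $source"
    case Profile(id)           => log ::= s"vectorized user $id"
  }
}
```

Because `receive` is partial, `ReceiveSketch.receive.isDefinedAt("unknown")` is false while `ReceiveSketch.receive.isDefinedAt(Profile(1L))` is true, so each actor can ignore messages outside its own family.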

Scala+Akka Actors
[Annotated code listing; annotations:]
● Mix in the same code used for export in the non-Actor version
● Private, mutable state: stored scores
● Private, mutable state: time of last export
● If receiving new scores: store them!
● If storing lots of scores, or if it’s been a while: upload what’s stored, then erase them
● If told to shut down, stop accepting new scores
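
The original listing did not survive extraction, so the following is a reconstruction of the described behavior only, not the speaker's code. It is written as a plain Scala class so that it runs without Akka. The names (`ScoreBuffer`, `NewScores`, `Shutdown`), the thresholds, the injected `upload` function and clock, and the decision to flush leftovers on shutdown are all assumptions.

```scala
// Hypothetical messages for the export step.
final case class NewScores(scores: Map[Long, Double])
case object Shutdown

// Buffers scores and uploads them in batches. `upload` stands in for the
// export code shared with the non-actor version; `now` is injectable so the
// time-based rule can be exercised without waiting.
class ScoreBuffer(upload: Map[Long, Double] => Unit,
                  maxStored: Int,
                  maxAgeMs: Long,
                  now: () => Long = () => System.currentTimeMillis()) {
  private var stored = Map.empty[Long, Double] // private state: stored scores
  private var lastExport = now()               // private state: last export time
  private var accepting = true

  // Upload what is stored, then erase it.
  private def flush(): Unit = if (stored.nonEmpty) {
    upload(stored)
    stored = Map.empty
    lastExport = now()
  }

  val receive: PartialFunction[Any, Unit] = {
    case NewScores(s) if accepting =>
      stored ++= s
      // Lots of scores stored, or it has been a while: export.
      if (stored.size >= maxStored || now() - lastExport >= maxAgeMs) flush()
    case NewScores(_) => () // dropped after shutdown
    case Shutdown =>
      accepting = false
      flush() // assumption: leftovers are exported rather than discarded
  }
}
```

Turning this into an Akka actor would mainly mean mixing in the `Actor` trait and letting the framework deliver messages to `receive`. The buffering rules themselves would stay the same.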

Scala+Akka Pros
● Easy to get productive in the Scala language
● SBT dependency management makes it easy to move to any box with a JRE
● No global interpreter lock!

Scala+Akka Cons
● Moderate Scala learning curve
● Object representation on the JVM has pretty lousy memory efficiency
● Not a lot of great options for building models in Scala (compared to R, Python, Julia)

[A]
Sometimes, data scientists need to worry about throughput.

[B]
One way to increase throughput is with concurrency.

[C]
The Actor Model is an easy way to build a concurrent system.

[D]
Scala+Akka provides an easy-to-use Actor Model context.

[A + B + C + D ⇒ E]
Data scientists should check out Scala+Akka.

Thanks!
Questions?
bgawalt@{upwork, gmail}.com
twitter.com/bgawalt
