PAPIs 2015
Akka & Data Science:
Making real-time
predictions
Brian Gawalt
2nd International Conference on Predictive APIs and Apps
August 7, 2015
[A]
Sometimes, data
scientists need to worry
about throughput.
[B]
One way to increase
throughput is with
concurrency.
[C]
The Actor Model is an
easy way to build a
concurrent system.
[D]
Scala+Akka provides an
easy-to-use Actor Model
context.
[A + B + C + D ⇒ E]
Data scientists should
check out Scala+Akka.
Consider:
● building a model,
● vs. using a model
Lots of ways to practice
building a model
The Classic Process
1. Load your data set’s raw materials
2. Produce feature vectors:
   ○ Training,
   ○ Validation,
   ○ Testing
3. Build the model with training and validation vectors
The Classic Process: One-time Testing
Build:
● Load train/valid./test materials
● Make train/valid./test feature vectors
● Train Model
Use:
● Make test predictions
The Classic Process: Repeated Testing
Build:
● Load train/valid. materials
● Make train/valid. feature vectors
● Train Model (produces the saved model)
Use (repeat every K minutes):
● Load test/new materials
● Make test/new feature vectors
● Make test/new predictions (with the saved model)
Sometimes my tasks
work like that, too!
But this talk is about the
other kind of tasks.
[A]
Sometimes, data
scientists need to worry
about throughput.
Example:
Freelancer availability on Upwork
Hiring Freelancers on Upwork
1. Post a job
2. Search for freelancers
3. Find someone you like
4. Ask them to interview
   ○ Request accepted!
   ○ or rejected/ignored...
THE TASK: Look at recent freelancer behavior and predict, at the time of Step 2, who is likely to accept an invite at the time of Step 4.
Building this model is
business as usual:
Building Availability Model
1. Load raw materials:
   ○ Examples of accepts/rejects
   ○ Histories of freelancer site activity
      ■ Job applications sent or received
      ■ Hours worked
      ■ Click logs
      ■ Profile updates
2. Produce feature vectors:
Data sources labeled on the slide: Greenplum, Amazon S3, Internal Service
Using Availability Model
Build:
● Load train/valid. materials
● Make train/valid. feature vectors
● Train Model (produces the saved model)
Use (repeat every 60 minutes):
● Load test/new materials
● Make test/new feature vectors
● Make test/new predictions (with the saved model)
Using Availability Model
Use (repeat every 60 minutes):
● Load test/new materials:
   ○ Load job app data (4 min.)
   ○ Load click log data (30 min.)
   ○ Load work hours data (5 min.)
   ○ Load profile data (20 ms/profile)
● Make test/new feature vectors
● Make test/new predictions (with the saved model)
Using Availability Model
Loads per cycle: job app data (4 min.), click log data (30 min.), work hours data (5 min.), profile data (20 ms/profile)
● Left with under 21 minutes to collect profile data
   ○ Rate limit: 20 ms/profile
   ○ At most, 63K profiles per hour
● Six million freelancers who need avail. predictions: expect ~90 hours between re-scoring any individual
● Still need to spend time actually building vectors and exporting scores!
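
The arithmetic behind these bullets can be checked with a short sketch. The object and constant names are illustrative assumptions; the values come from the slide. Dividing six million freelancers by the per-cycle capacity gives roughly 95 hours, in line with the slide's rounded figure of ~90.

```scala
object SerialBudget {
  val cycleMinutes     = 60        // predictions repeat every 60 minutes
  val jobAppMinutes    = 4         // load job app data
  val clickLogMinutes  = 30        // load click log data
  val workHoursMinutes = 5         // load work hours data
  val msPerProfile     = 20        // rate limit on profile fetches
  val freelancers      = 6000000L  // freelancers who need predictions

  // Minutes left for profile fetches once the three bulk loads finish.
  val profileMinutes: Int =
    cycleMinutes - jobAppMinutes - clickLogMinutes - workHoursMinutes

  // Profiles that fit in that window at one fetch per 20 ms.
  val profilesPerCycle: Long = profileMinutes.toLong * 60 * 1000 / msPerProfile

  // Hours needed to re-score every freelancer once at that pace.
  val hoursPerFullPass: Double = freelancers.toDouble / profilesPerCycle
}
```

This is an upper bound: it leaves no time for building vectors or exporting scores.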
[B]
One way to increase
throughput is with
concurrency.
Expensive Option:
Major infrastructure
overhaul
… but that takes a lot of
time, attention, and
cooperation…
Simpler Option:
The Actor Model
[C]
The Actor Model is an
easy way to build a
concurrent system.
An Actor
● Imagine a mailbox with a brain
● Computation only begins when/if a message arrives
● Keeps its thoughts private:
   ○ No other actor can actively read this actor’s state
   ○ Other actors will have to wait to hear a message from this actor
The Actor Model of Concurrency
● Lots of Actors, and each has:
   ○ Private message queue
   ○ Private state, shared only by sending more messages
● Execution context:
   ○ Manages threading of each Actor’s computation
   ○ Handles asynchronous message routing
   ○ Can send prescheduled messages
● Each received message’s computation is fully completed before the Actor moves on to the next message in its queue
The Actor Model of Concurrency
Execution Context
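
A minimal sketch of the idea in plain Scala, without Akka: a toy actor whose mailbox is a single-threaded executor, so messages are handled one at a time and private state needs no locks. `CounterActor`, `Increment`, `Report`, and `MailboxDemo` are hypothetical names for illustration; a real execution context would share a thread pool across many actors.

```scala
import java.util.concurrent.{CompletableFuture, Executors, TimeUnit}

// Hypothetical message family for a toy counting actor.
sealed trait CounterMessage
case object Increment extends CounterMessage
final case class Report(replyTo: Int => Unit) extends CounterMessage

// A toy actor (not Akka): private state plus a mailbox drained one message at a time.
class CounterActor {
  // A single-threaded executor stands in for the mailbox and execution context:
  // queued messages run strictly one after another, in arrival order.
  private val mailbox = Executors.newSingleThreadExecutor()

  // Private state; only the mailbox thread ever touches it.
  private var count = 0

  // Sending is asynchronous: enqueue the message and return immediately.
  def send(message: CounterMessage): Unit =
    mailbox.execute(() => receive(message))

  private def receive(message: CounterMessage): Unit = message match {
    case Increment       => count += 1
    case Report(replyTo) => replyTo(count)
  }

  // Let queued messages finish, then release the thread.
  def shutdown(): Unit = {
    mailbox.shutdown()
    mailbox.awaitTermination(5, TimeUnit.SECONDS)
    ()
  }
}

object MailboxDemo {
  // Four sender threads fire 1000 increments each, then one Report asks for the total.
  def run(): Int = {
    val actor  = new CounterActor
    val result = new CompletableFuture[Int]()
    val senders = (1 to 4).map { _ =>
      new Thread(() => (1 to 1000).foreach(_ => actor.send(Increment)))
    }
    senders.foreach(_.start())
    senders.foreach(_.join())
    actor.send(Report(n => { result.complete(n); () }))
    val total = result.get(5, TimeUnit.SECONDS)
    actor.shutdown()
    total
  }
}
```

Because every increment is queued before the `Report` message, `MailboxDemo.run()` returns 4000 even though four threads send concurrently.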
Parallelizing predictions
● Refresh work hours (schedule: fetch once per hour)
● Refresh job apps (schedule: fetch once per hour)
● Refresh click log (schedule: fetch once per hour)
● Fetch 10 profiles (schedule: fetch every 300 ms)
● Vectorizer:
   ○ Keep copies of raw data
   ○ Emit vector for each new profile received
● Apply model; export prediction
(Arrows in the diagram are labeled "raw data".)
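
The Vectorizer's two duties could be sketched as follows. This is an illustration, not the talk's code: the local `Actor` trait is a stand-in for `akka.actor.Actor` so the block runs without Akka, and every message type, field, and feature is an assumption.

```scala
// Stand-in for akka.actor.Actor so the sketch runs without Akka on the classpath.
trait Actor { def receive: PartialFunction[Any, Unit] }

// Hypothetical messages; names and fields are assumptions for illustration.
final case class WorkHours(hoursByUser: Map[String, Double])
final case class JobApps(appsByUser: Map[String, Int])
final case class ClickLog(clicksByUser: Map[String, Int])
final case class Profile(userId: String, skillCount: Int)
final case class FeatureVector(userId: String, values: Vector[Double])

// `emit` plays the role of sending a message to the model-applying actor.
class Vectorizer(emit: FeatureVector => Unit) extends Actor {
  // Latest copy of each hourly data set, replaced whenever a refresh arrives.
  private var hours  = Map.empty[String, Double]
  private var apps   = Map.empty[String, Int]
  private var clicks = Map.empty[String, Int]

  def receive: PartialFunction[Any, Unit] = {
    case WorkHours(h) => hours = h
    case JobApps(a)   => apps = a
    case ClickLog(c)  => clicks = c
    // Each new profile is combined with whatever raw data is currently stored.
    case Profile(id, skills) =>
      emit(FeatureVector(id, Vector(
        hours.getOrElse(id, 0.0),
        apps.getOrElse(id, 0).toDouble,
        clicks.getOrElse(id, 0).toDouble,
        skills.toDouble)))
  }
}
```

Profiles can keep flowing while the hourly refreshes are in progress: a vector is built from the most recent copies on hand, and a later refresh simply replaces them.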
Serial processing (repeat every 60 minutes)
● Load data: 55 min
   ○ Refresh job apps: 4 min
   ○ Refresh work hours: 5 min
   ○ Refresh click log: 30 min
   ○ Fetch ~50K profiles: 55 - 4 - 5 - 30 = 16 min...
● Make feature vectors, export predictions: 5 min
Throughput: 48K users/hr
Parallel Processing with Actors
(Timeline diagram: messages sent and received across four threads, Thr. 1 to Thr. 4; message processing time not to scale. Rates shown: refresh job apps, refresh click log, and refresh work hrs. at 1/hr. each; fetch profiles at 3/sec.; store and vectorize as rx’ed; export at 1/hr.)
Parallel Processing with Actors
(Same timeline diagram as the previous slide.)
Throughput: 180K users/hr
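
Both throughput figures are consistent with the 20 ms/profile rate limit quoted earlier. The sketch below is a back-of-the-envelope check, with names chosen for illustration: the serial version only fetches profiles during its 16-minute window, while a fetcher that never pauses for other work would be capped by the rate limit alone, which coincides with the 180K users/hr reported for the actor version.

```scala
object Throughput {
  val msPerProfile = 20 // rate limit on profile fetches

  // Profiles fetched in a window of the given length at one per 20 ms.
  def profilesIn(minutes: Int): Long = minutes.toLong * 60 * 1000 / msPerProfile

  // Serial: fetching is squeezed into 55 - 4 - 5 - 30 = 16 minutes of each hour.
  val serialPerHour: Long = profilesIn(55 - 4 - 5 - 30)

  // Ceiling: fetching runs for the whole hour, limited only by the rate limit.
  val rateLimitCeilingPerHour: Long = profilesIn(60)

  // Speed-up of the ceiling over the serial schedule.
  val speedup: Double = rateLimitCeilingPerHour.toDouble / serialPerHour
}
```

Under these assumptions the gain is 3.75x, and it comes from overlapping the slow bulk refreshes with profile fetching rather than from fetching any faster.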
[D]
Scala+Akka provides an
easy-to-use Actor Model
context.
Message passing,
scheduling, &
computation behavior
defined in 445 lines.
Scala+Akka Actors
● Create Scala class, mix in Actor trait
● Implement the required partial function: receive: PartialFunction[Any, Unit]
● Define family of message objects this actor’s planning to handle
● Define behavior for each message case in receive
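
The four steps above can be sketched as follows. To keep the block runnable without Akka on the classpath, a local `Actor` trait stands in for `akka.actor.Actor`, whose `receive` has this same `PartialFunction[Any, Unit]` type; `Greeter`, `Greet`, and `HowMany` are hypothetical names.

```scala
// Stand-in for akka.actor.Actor; with Akka you would import and mix in the real trait.
trait Actor { def receive: PartialFunction[Any, Unit] }

// The family of messages this actor plans to handle (hypothetical).
sealed trait GreeterMessage
final case class Greet(name: String) extends GreeterMessage
case object HowMany extends GreeterMessage

// A class that mixes in the Actor trait; `reply` stands in for messaging the sender.
class Greeter(reply: String => Unit) extends Actor {
  private var greeted = 0 // private, mutable state

  // The required partial function, with one case per message.
  def receive: PartialFunction[Any, Unit] = {
    case Greet(name) =>
      greeted += 1
      reply(s"Hello, $name")
    case HowMany =>
      reply(greeted.toString)
  }
}
```

Because `receive` is a partial function, messages outside the declared family are simply not matched: `isDefinedAt` returns false for them.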
Scala+Akka Actors
● Mix in the same code used for export in the non-Actor version
● Private, mutable state: stored scores
● Private, mutable state: time of last export
● If receiving new scores: store them!
● If storing lots of scores, or if it’s been a while: upload what’s stored, then erase them
● If told to shut down, stop accepting new scores
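
The code these annotations describe is not reproduced in the text, so the following is only a guess at its shape, not the author's implementation. The local `Actor` trait stands in for `akka.actor.Actor`; `ScoreUploader`, `NewScores`, `ShutDown`, the thresholds, and the timestamp carried in the message are all assumptions made to keep the sketch self-contained and deterministic.

```scala
import scala.collection.mutable.ListBuffer

// Stand-in for akka.actor.Actor so the sketch runs without Akka on the classpath.
trait Actor { def receive: PartialFunction[Any, Unit] }

// Hypothetical export code shared with a non-actor version; here it only records uploads.
trait ScoreUploader {
  val uploaded = ListBuffer.empty[Seq[(String, Double)]]
  def upload(scores: Seq[(String, Double)]): Unit = { uploaded += scores; () }
}

// Hypothetical messages. Carrying the time in the message is an assumption;
// a real actor might read the clock itself.
final case class NewScores(scores: Seq[(String, Double)], nowMs: Long)
case object ShutDown

class ScoreExporter(maxStored: Int, maxWaitMs: Long) extends Actor with ScoreUploader {
  private var stored       = Vector.empty[(String, Double)] // stored scores
  private var lastExportMs = 0L                             // time of last export
  private var accepting    = true

  def receive: PartialFunction[Any, Unit] = {
    case NewScores(scores, nowMs) if accepting =>
      stored ++= scores
      // Storing lots of scores, or it has been a while: upload, then erase.
      if (stored.size >= maxStored || nowMs - lastExportMs >= maxWaitMs) {
        upload(stored)
        stored = Vector.empty
        lastExportMs = nowMs
      }
    case _: NewScores => () // arrived after shutdown: dropped
    case ShutDown     => accepting = false
  }
}
```

All three pieces of state stay private to the actor, so the batching logic needs no synchronization even if many scorers send to it.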
Scala+Akka Pros
● Easy to get productive in the Scala language
● SBT dependency management makes it easy to move to any box with a JRE
● No global interpreter lock!
Scala+Akka Cons
● Moderate Scala learning curve
● Object representation on the JVM has pretty lousy memory efficiency
● Not a lot of great options for building models in Scala (compared to R, Python, Julia)
[A]
Sometimes, data
scientists need to worry
about throughput.
[B]
One way to increase
throughput is with
concurrency.
[C]
The Actor Model is an
easy way to build a
concurrent system.
[D]
Scala+Akka provides an
easy-to-use Actor Model
context.
[A + B + C + D ⇒ E]
Data scientists should
check out Scala+Akka.
Thanks!
Questions?
bgawalt@{upwork, gmail}.com
twitter.com/bgawalt

[Research] deploying predictive models with the actor framework - Brian Gawalt
