Scala and Python
Integrating scikit-learn into a Scala Stack to build
realtime predictive models
Dan Chiao
VP Engineering
Why it was necessary
We pivoted
The original product
• Social data append
– PeopleGraph: match email addresses
to public demographics and social
profiles
– BrandGraph: match company URLs to
public firmographics and social
profiles
• Requirements
– Integrate a large (and expanding)
number of web data sources (REST,
SOAP, flat files)
– Realtime processing of large volumes
of contacts (60 queries/s)
The original technology stack
• Scala
– Best of both worlds
• Concise functional syntax
• Java libraries and deployment architecture
• Scala-specific libraries (Dispatch, Lift Web Framework)
• Twitter (soon to be Apache) Storm
– Streaming intake and normalization of large amounts of data
• MongoDB
– Expanding data sources = constantly updating schema
– Most sophisticated query syntax of NoSQL options
• AWS and Azure
– Well, duh
The new product
• Moving up the application stack
– Focus on the most compelling single-use case for our data
– Fliptop SpendScore
• Predictive analytics for sales and marketing teams
• “Machine learning for Salesforce”
The updated technology stack
• Still need to wrangle large amounts of data, so no changes
there
• New requirement: fast, scalable machine learning
Why not Scala (Java) native?
• The options
– Apache Mahout
• Only skeleton implementations of the most sophisticated machine
learning techniques (e.g. Random Forest, AdaBoost)
• Customer-specific models – don’t need Big Data
– Weka – GPL
– Scala-native libraries – Too early to use in production
Why Python?
• scikit-learn
– Mature – around since 2006
– Actively-developed – Last stable release Aug 2013
– Sophisticated – Random Forest and AdaBoost classifiers show
performance comparable to R
• Why not R? Not really production grade.
Requirements
• APIs to exploit Python’s modeling power
– Train, predict, model info query, etc.
• Scalability
– On demand Python serving nodes
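As a rough illustration of the train / predict / model-info surface listed above, here is a minimal Python sketch. The names `ModelService` and `MajorityClassifier` are hypothetical and not from the talk. The estimator can be any object with scikit-learn-style `fit` and `predict` methods; a trivial stand-in is used so the sketch runs without dependencies, and a scikit-learn classifier could presumably be passed in its place.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical stand-in with scikit-learn-style fit/predict methods."""

    def fit(self, X, y):
        # Remember the most common label seen during training.
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Predict the remembered label for every input row.
        return [self.label_ for _ in X]


class ModelService:
    """Hypothetical API surface: one model per customer."""

    def __init__(self, make_estimator):
        self._make_estimator = make_estimator  # factory for fresh estimators
        self._models = {}

    def train(self, customer_id, X, y):
        self._models[customer_id] = self._make_estimator().fit(X, y)

    def predict(self, customer_id, X):
        return list(self._models[customer_id].predict(X))

    def info(self, customer_id):
        model = self._models[customer_id]
        return {"customer_id": customer_id, "estimator": type(model).__name__}


service = ModelService(MajorityClassifier)
service.train("acme", X=[[0.1, 1.0], [0.4, 0.0], [0.9, 1.0]], y=[1, 0, 1])
predictions = service.predict("acme", [[0.5, 0.5], [0.2, 0.8]])
model_info = service.info("acme")
```

Keeping one model per customer mirrors the "customer-specific models" point made earlier; how the models are actually stored and served is not described in the slides.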
Tools for Scala-Python Integration
• Reimplementation of Python
– Jython (JPython)
• Communication through JNI
– Jepp
• Communication through IPC
– Apache Thrift
• Communication through REST API calls
– Bottle
Jython
• Re-Implementation of Python in Java
• Can import and use any Java class.
• Includes almost all of the modules in the standard Python
distribution
– Except some of the modules implemented originally in C.
• Compiles to Java bytecode
– either on demand or statically.
Jython
[Diagram: Scala code and Python code running inside one JVM, with the Python code on Jython]
• Lacks support for lots of extensions for scientific computing
– Numpy, Scipy, etc.
• JyNI (Jython Native Interface) to the rescue?
– Specifically designed to support CPython extensions like
Numpy, Scipy
– Still in alpha
Communication through JNI
• Jepp (Java Embedded Python)
– Embeds CPython in Java
– Runs Python code in CPython
– Leverages both JNI and Python/C for integration
Jepp
[Diagram: Scala code in the JVM reaches Python code in the Python interpreter through Jepp over JNI]
Jepp
python_util.py:
    def python_add(a, b):
        return a + b
TestJepp.scala:
    import jep.Jep

    object TestJepp extends App {
      // Start an embedded CPython interpreter and load the Python helper.
      val jep = new Jep()
      jep.runScript("python_util.py")
      // invoke takes Java objects, so box the Scala Ints.
      val a = (2).asInstanceOf[AnyRef]
      val b = (3).asInstanceOf[AnyRef]
      val sumByPython = jep.invoke("python_add", a, b)
      println(sumByPython.asInstanceOf[Int])
    }
Communication through IPC
• Apache Thrift
– Developed & open-sourced by Facebook
– More community support than Protobuf, Avro
– IDL-based (Interface Definition Language)
– Generates server/client code in specified languages
– Takes care of protocol and transport layer details
– Comes with generators for Java, Python, C++, etc.
• No Scala generator
• Scrooge (Twitter) to the rescue!
Thrift – IDL
TestThrift.thrift:
    namespace java python_service_test
    namespace py python_service_test

    service PythonAddService
    {
      i32 pythonAdd (1:i32 a, 2:i32 b),
    }
Generate the Java and Python code:
    $ thrift --gen java --gen py TestThrift.thrift
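Thrift – Python Server
PythonAddService.py (generated):
    class Iface:
        def pythonAdd(self, a, b):
            pass
PythonAddServer.py:
    # Assumes the generated gen-py directory is on sys.path.
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from thrift.server import TServer
    from python_service_test import PythonAddService

    class ExampleHandler(PythonAddService.Iface):
        def pythonAdd(self, a, b):
            return a + b

    handler = ExampleHandler()
    # The processor comes from the generated service module.
    processor = PythonAddService.Processor(handler)
    transport = TSocket.TServerSocket(port=9090)
    tfactory = TTransport.TBufferedTransportFactory()
    pfactory = TBinaryProtocol.TBinaryProtocolFactory()
    server = TServer.TThreadedServer(processor, transport, tfactory, pfactory)
    server.serve()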
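Thrift – Scala Client
PythonAddClient.scala:
    import org.apache.thrift.protocol.{TBinaryProtocol, TProtocol}
    import org.apache.thrift.transport.{TSocket, TTransport}
    import python_service_test.PythonAddService

    object PythonAddClient extends App {
      val transport: TTransport = new TSocket("localhost", 9090)
      val protocol: TProtocol = new TBinaryProtocol(transport)
      val client = new PythonAddService.Client(protocol)
      transport.open()
      // Method name matches the IDL definition (pythonAdd).
      val sumByPython = client.pythonAdd(3, 5)
      println("3 + 5 = " + sumByPython)
      transport.close()
    }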
Thrift
[Diagram: Scala code in the JVM talks over Thrift to multiple Python interpreters, each running Python code behind a Thrift server]
Auto Balancing, Built-in Encryption
REST API Architecture
[Diagram: Scala code in the JVM calls multiple Bottle servers, each running Python code]
Auto Balancer? Encoding?
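To make the REST option concrete, here is a minimal Python sketch of a JSON-over-HTTP endpoint and a client call. The talk used Bottle on the Python side and Scala as the client; this sketch swaps in the standard library's `http.server` and `http.client` so it runs without dependencies. The `/add` route and the payload shape are assumptions for illustration.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class AddHandler(BaseHTTPRequestHandler):
    """Hypothetical endpoint: POST /add with {"a": ..., "b": ...}."""

    def do_POST(self):
        if self.path != "/add":
            self.send_error(404)
            return
        # Decode the JSON request body.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        # Encode the JSON response body.
        body = json.dumps({"sum": payload["a"] + payload["b"]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def call_add(port, a, b):
    """Client side: what the Scala code would do with its own HTTP library."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(
            "POST",
            "/add",
            body=json.dumps({"a": a, "b": b}),
            headers={"Content-Type": "application/json"},
        )
        return json.loads(conn.getresponse().read())["sum"]
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), AddHandler)  # port 0: OS picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
result = call_add(server.server_address[1], 3, 5)
server.shutdown()
server.server_close()
```

The encode and decode steps are done by hand on both sides here, which is the work Thrift would otherwise generate.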
Thrift vs. REST
                      Thrift   REST   Does it matter?
Load Balancer         ✔               No (AWS & Azure)
Encode/Decode         ✔               No (We’re already doing it)
Low Learning Curve             ✔      Yes
No Dependency                  ✔      Yes
Fliptop’s Architecture
[Diagram: Scala code in the JVM calls a load balancer, which fans out to multiple Bottle servers, each running Python code]
5 Python servers
~5,000 requests/sec
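The slide does not say which balancing policy is used. As a sketch under the assumption of plain round-robin, the following Python snippet shows how one second of traffic at the quoted rate would spread across five nodes. The backend addresses are hypothetical.

```python
from collections import Counter
from itertools import cycle

# Hypothetical backend addresses; the slide only says there are five servers.
BACKENDS = [f"python-node-{i}:8080" for i in range(1, 6)]


def round_robin(backends):
    """Return a function that hands out backends in rotation."""
    pool = cycle(backends)

    def pick():
        return next(pool)

    return pick


pick_backend = round_robin(BACKENDS)

# Simulate one second of traffic at the quoted ~5,000 requests/sec.
assignments = Counter(pick_backend() for _ in range(5000))
per_node_rate = 5000 / len(BACKENDS)
```

Under this assumption each Python server handles roughly 1,000 requests per second.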
Summary
• Jython
• (✓) Tight integration with Scala/Java
• (✗) Lacks support for C extensions (JyNI might help in the future)
• Jepp
• (✓) Access high quality Python extensions with CPython speed
• (✗) Two runtime environments
• Thrift, REST
• (✓) Language-independent development
• (✗) Bigger communication overhead
Questions?
Ask this guy
Thank You
