Michelle Casbon
January 12, 2016 – Advanced Apache Spark Meetup
Training & Serving NLP Models in a Distributed Cloud-based Infrastructure
What do we do?
• Idibon creates adaptive machine intelligence that can analyze text in any language
[Diagram labels: natural language text, social media, structured insights]
Agenda
• Background
• Platform description
• Why we chose Spark
• How we’re using Spark ML & MLlib
• Challenges of adopting Spark in a distributed NLP system
What are our use cases?
• Intent to purchase
• Global health trends
• Interactive Voice Response
• Multilingual news
• SMS Prioritization
• Supply Chain Risk
• Change reception
How do we do it?
• Fewer annotations
• Lower costs
• Less time spent training
• Higher accuracy
• Improves over time
[Diagram: Adaptive learning. Labels: unlabeled pool, intelligent queuing & machine learning, human annotation, labeled training set]
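The slide does not say how items are chosen from the unlabeled pool for human annotation. One common strategy is uncertainty sampling, sketched here as an illustration and not as Idibon's actual method. The function name `queue_for_annotation`, the `predict_proba` callback, and the example scores are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def queue_for_annotation(
    pool: Sequence[str],
    predict_proba: Callable[[str], float],
    batch_size: int,
) -> List[Tuple[str, float]]:
    """Return the batch_size pool items whose scores are closest to 0.5."""
    scored = [(text, predict_proba(text)) for text in pool]
    # Distance from 0.5 measures confidence; the smallest distances are
    # the items the current model is least sure about.
    scored.sort(key=lambda pair: abs(pair[1] - 0.5))
    return scored[:batch_size]


if __name__ == "__main__":
    # Hypothetical scores standing in for a trained model's output.
    fake_scores = {
        "want to buy this phone": 0.95,
        "is this phone any good?": 0.55,
        "phone arrived broken": 0.10,
        "thinking about upgrading": 0.48,
    }
    batch = queue_for_annotation(list(fake_scores), fake_scores.__getitem__, 2)
    for text, score in batch:
        print(f"{score:.2f}  {text}")
```

Sending only the least certain items to annotators is one way a loop like this can reach a given accuracy with fewer annotations.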
How do we do it?
1. Goal Definition
Dataset
2. Identification
3. Cleansing
4. Training data creation
5. Quality Control
Models
6. Creation
7. Hyperparameter Tuning
8. Intelligent Queueing
9. Rule Creation
10. Unseen Data Prediction
What does our platform look like?
Why are we using Spark?
• Wide variety of algorithms
• Active development
• Open source
• Industry-standard algorithm implementation
• Intended for use in enterprise applications
• Scalability
How are we using Spark?
• Feature Extraction
  • TF-IDF
  • Word2Vec
  • Dimensionality reduction
• Training
  • Logistic Regression
  • SVM
  • Naïve Bayes
  • LDA
• Prediction
• Evaluation metrics
[Diagram labels: Feature Extraction, Training, Prediction; example data: [1.0, [1.0, 0.0, 3.0]]]
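TF-IDF, the first feature-extraction technique listed on this slide, can be worked through in a few lines of plain Python. This sketch does not call Spark. It assumes a smoothed inverse document frequency, log((N + 1) / (df + 1)), chosen here for illustration; consult the MLlib documentation for the exact form Spark applies. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by its count times a smoothed inverse document frequency."""
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [{t: count * idf[t] for t, count in Counter(doc).items()} for doc in docs]


if __name__ == "__main__":
    documents = [["spark", "is", "fast"], ["spark", "ml"]]
    for weights in tf_idf(documents):
        print(weights)
```

A term that occurs in every document, such as "spark" in this sample, gets a weight of zero, so it contributes nothing to telling the documents apart.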
Feature Extraction
• Extract Content
• Tokenize
• Bigrams
• Trigrams
• Feature Lookup
• Vector: [1.0, 0.0, 3.0]
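The feature-extraction steps on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the platform's code: the tokenizer regex, the inclusion of single tokens alongside bigrams and trigrams, and the three-entry vocabulary are all hypothetical.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return every run of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def to_vector(text: str, feature_index: Dict[str, int]) -> List[float]:
    """Count how often each known feature occurs in the text."""
    tokens = tokenize(text)
    features = tokens + ngrams(tokens, 2) + ngrams(tokens, 3)
    vector = [0.0] * len(feature_index)
    for feature in features:
        position = feature_index.get(feature)  # feature lookup
        if position is not None:  # features outside the vocabulary are ignored
            vector[position] += 1.0
    return vector


if __name__ == "__main__":
    # Hypothetical vocabulary mapping each feature to a vector position.
    vocabulary = {"buy": 0, "want to": 1, "phone": 2}
    print(to_vector("Buy phone, phone, phone", vocabulary))  # [1.0, 0.0, 3.0]
```

The sample input was chosen so that the result matches the slide's example vector.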
Training
• Vector: [1.0, 0.0, 3.0]
• Add classification
• LabeledPoint: [1.0, [1.0, 0.0, 3.0]]
• LogisticRegressionWithLBFGS
• LogisticRegressionModel
• Model Storage
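To make the training step concrete without a Spark cluster, the sketch below pairs a label with a vector and fits logistic-regression weights. It is a stand-in, not the slide's method: it uses plain batch gradient descent where the slide names LogisticRegressionWithLBFGS, it omits the intercept, and the local `LabeledPoint` class only mirrors the shape of Spark's type. The step count, learning rate, and second training point are assumptions.

```python
import math
from typing import List, NamedTuple


class LabeledPoint(NamedTuple):
    """A label paired with its feature vector, e.g. [1.0, [1.0, 0.0, 3.0]]."""
    label: float
    features: List[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(points: List[LabeledPoint], steps: int = 500, rate: float = 0.1) -> List[float]:
    """Fit logistic-regression weights (no intercept) by batch gradient descent."""
    weights = [0.0] * len(points[0].features)
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for point in points:
            predicted = sigmoid(sum(w * x for w, x in zip(weights, point.features)))
            error = predicted - point.label
            for i, x in enumerate(point.features):
                gradient[i] += error * x
        # Step against the average gradient of the log-loss.
        weights = [w - rate * g / len(points) for w, g in zip(weights, gradient)]
    return weights


if __name__ == "__main__":
    data = [
        LabeledPoint(1.0, [1.0, 0.0, 3.0]),  # the slide's example
        LabeledPoint(0.0, [0.0, 1.0, 0.0]),  # hypothetical negative example
    ]
    print(train(data))
```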
Prediction
• New tweet
• Extract Content
• Tokenize
• Bigrams
• Trigrams
• Feature Lookup
• Vector: [0.0, 1.0, 4.0]
• Model Lookup
• Predict
• Classification Lookup
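The last three steps on this slide (model lookup, predict, classification lookup) might look like the following sketch. Everything named here is hypothetical: the in-memory `MODELS` and `CLASSIFICATIONS` dictionaries stand in for whatever storage the platform uses, and the weights are made up.

```python
import math
from typing import Dict, List

# Hypothetical stored models: one weight vector per model name.
MODELS: Dict[str, List[float]] = {"intent_to_purchase": [-1.0, 1.2, 0.5]}
# Hypothetical mapping from numeric class to a readable classification.
CLASSIFICATIONS: Dict[float, str] = {0.0: "no intent", 1.0: "intent to purchase"}


def predict(model_name: str, vector: List[float], threshold: float = 0.5) -> str:
    """Score a feature vector with a stored model and name the resulting class."""
    weights = MODELS[model_name]  # model lookup
    score = sum(w * x for w, x in zip(weights, vector))
    probability = 1.0 / (1.0 + math.exp(-score))  # predict
    label = 1.0 if probability >= threshold else 0.0
    return CLASSIFICATIONS[label]  # classification lookup


if __name__ == "__main__":
    # The vector from the slide, as produced by feature extraction.
    print(predict("intent_to_purchase", [0.0, 1.0, 4.0]))
```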
How do we provide online predictions with Spark?
DataFrames are slow … if you have small data
Task                  Time in µs
Vector prediction     300
DataFrame prediction  7800
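The two timings on this slide can be turned into a slowdown factor and a rough throughput figure. The arithmetic below assumes a single caller issuing predictions one after another, with the quoted latency as the only cost; both are simplifying assumptions, not measurements from the talk.

```python
VECTOR_US = 300       # per-prediction latency from the slide, in microseconds
DATAFRAME_US = 7800


def per_second(latency_us: float) -> float:
    """Predictions one sequential caller can complete per second."""
    return 1_000_000 / latency_us


slowdown = DATAFRAME_US / VECTOR_US
print(f"DataFrame path is {slowdown:.0f}x slower")
print(f"Vector: {per_second(VECTOR_US):.0f}/s, DataFrame: {per_second(DATAFRAME_US):.0f}/s")
```

Under those assumptions the DataFrame path is 26 times slower, which is about 128 predictions per second against about 3,333 for the vector path.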
How do we fit Spark into our existing system?
[Diagram labels: REST API, Core functionality, ML persistence layer, Idibon custom ML, …]
How does a persistence layer enable us to use Spark?
• Real-time operationalization of many, many models
• Embed within different platforms
• Single save/load framework
• Rapidly incorporate new NLP features
• Logging/monitoring standardized & abstracted
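A single save/load framework built on streams, which the editor's notes contrast with flat files, could be sketched as follows. This is an illustration only: the class name `LinearModel`, the JSON encoding, and the field names are assumptions, and the talk does not describe the real on-disk format.

```python
import io
import json
from typing import BinaryIO, Dict, List


class LinearModel:
    """Hypothetical model type that saves to and loads from any binary stream."""

    def __init__(self, weights: List[float], labels: Dict[str, str]) -> None:
        self.weights = weights
        self.labels = labels

    def save(self, stream: BinaryIO) -> None:
        """Write the model to the stream as UTF-8 encoded JSON."""
        payload = {"weights": self.weights, "labels": self.labels}
        stream.write(json.dumps(payload).encode("utf-8"))

    @classmethod
    def load(cls, stream: BinaryIO) -> "LinearModel":
        """Rebuild a model from a stream previously written by save()."""
        payload = json.loads(stream.read().decode("utf-8"))
        return cls(payload["weights"], payload["labels"])


if __name__ == "__main__":
    buffer = io.BytesIO()  # an in-memory stream; a file or network stream works the same way
    LinearModel([0.8, 1.2, 0.5], {"1.0": "intent to purchase"}).save(buffer)
    buffer.seek(0)
    restored = LinearModel.load(buffer)
    print(restored.weights, restored.labels)
```

Because `save` and `load` only need something that reads and writes bytes, the same model code can run wherever such a stream is available, without depending on a local filesystem path.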
Summary
• Analyzing human language is hard
• We’re using the most exciting parts of Spark to build performant NLP systems that are faster & better than ever before
Questions?
Michelle Casbon
michelle@idibon.com
@texasmichelle

Editor's Notes

  • #5 (What are our use cases?):
      • Detecting intent-to-buy in Twitter data
      • Extracting insights about global health trends
      • Making the call center IVR smarter
      • Informing investors of company performance based on news sentiment
      • Identifying supply chain risk in news articles
      • Understanding user reception of code pushes in online games
      • Prioritizing urgent SMS messages for UNICEF
  • #9 (What does our platform look like?): API: document uploads, model training, cross-validation, annotation aggregation, queueing, topic modeling, prediction, IAA (inter-annotator agreement)
  • #17 (How does a persistence layer enable us to use Spark?): Persistence layer: Docker, Lambda, offline, on-device; input/output streams vs. flat files; slf4j