Elasticsearch & Lucene for
Apache Spark and MLlib
Costin Leau (@costinl)
Mirror, mirror on the wall,
what’s the happiest team of
us all?
Britta Weber
- Rough translation from German by yours truly -
Purpose of the talk
Improve ML pipelines through IR
Text processing
• Analysis
• Featurize/Vectorize *
* In research / PoC / WIP / experimental phase
Technical Debt
“Machine Learning: The High Interest Credit Card of Technical Debt”, Sculley et al.
http://research.google.com/pubs/pub43146.html
Challenge: Which team at Elastic is the happiest?
Data: Hipchat messages
Training / Test data: http://www.sentiment140.com
Result: Kibana dashboard
ML Pipeline
Chat data → Sentiment Model
Production Data → Apply the rule → Predict the ‘class’ (🙂 / ☹)
Data is King
Example: Word2Vec
Input snippet
http://spark.apache.org/docs/latest/mllib-feature-extraction.html#example
it was introduced into mathematics in the book
disquisitiones arithmeticae by carl friedrich gauss in
one eight zero one ever since however modulo has gained
many meanings some exact and some imprecise
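For reference, the linked MLlib example trains Word2Vec on whitespace-tokenized text roughly like this (a sketch following the Spark docs; “text8” is the corpus used there):

import org.apache.spark.mllib.feature.Word2Vec

// split each line on whitespace; Word2Vec expects an RDD of token sequences
val input = sc.textFile("text8").map(line => line.split(" ").toSeq)

val model = new Word2Vec().fit(input)

// nearest neighbours of a word in the learned vector space
val synonyms = model.findSynonyms("china", 40)
for ((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}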
Real data is messy
The snippet originally looked like this:
https://en.wikipedia.org/wiki/Modulo_(jargon)
It was introduced into <a
href="https://en.wikipedia.org/wiki/Mathematics"
title="Mathematics">mathematics</a> in the book <i><a
href="https://en.wikipedia.org/wiki/Disquisitiones_Arithmeticae"
title="Disquisitiones Arithmeticae">Disquisitiones Arithmeticae</a></i>
by <a href="https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss"
title="Carl Friedrich Gauss">Carl Friedrich Gauss</a> in 1801. Ever
since, however, "modulo" has gained many meanings, some exact and some
imprecise.
Feature extraction: cleaning up data
"huuuuuuunnnnnnngrrryyy",
"aaaaaamaaazinggggg",
"aaaaaamazing",
"aaaaaammm",
"aaaaaammmazzzingggg",
"aaaaaamy",
"aaaaaan",
"aaaaaand",
"aaaaaannnnnnddd",
"aaaaaanyways"
Does it help to clean that up?
See “Twitter Sentiment Classification using Distant Supervision”, Go et al.
http://www-cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf
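One cheap cleanup from the Go et al. paper: collapse any character repeated three or more times down to two occurrences. A minimal sketch in Scala (the helper name is ours):

// "huuuuuuunnnnnnngrrryyy" -> "huunngrryy"
def squeezeRepeats(token: String): String =
  token.replaceAll("(.)\\1{2,}", "$1$1")

squeezeRepeats("aaaaaamaaazinggggg") // "aamaazingg"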
Language matters
读书须用意，一字值千金 (“Read with care: a single word is worth a thousand gold pieces”)
Lucene to the rescue!
High-performance, full-featured text search library
15 years of experience
Widely recognized for its utility
• It’s a primary test bed for new JVM versions
Text processing
Character Filter → Tokenizer → Token Filter → Token Filter → Token Filter
Input:   Do <b>Johnny Depp</b> a favor and forget you…
Tokens:  Do (pos: 1), Johnny (pos: 2), …
Output:  do (pos: 1), johnny (pos: 2), …
Lucene for text analysis
state-of-the-art text processing
many extensions available for different languages, use cases,…
however…
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// analyzer that strips HTML markup, then tokenizes with StandardTokenizer
Analyzer a = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        return new TokenStreamComponents(tokenizer, tokenizer);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new HTMLStripCharFilter(reader);
    }
};

TokenStream stream = a.tokenStream(null, "<a href=...>some text</a>");
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
PositionIncrementAttribute posIncrement = stream.addAttribute(PositionIncrementAttribute.class);

stream.reset();
int pos = 0;
while (stream.incrementToken()) {
    pos += posIncrement.getPositionIncrement();
    System.out.println(term.toString() + " " + pos);
}
stream.end();
stream.close();

> some 1
> text 2
How about a declarative approach?
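In Elasticsearch the same chain is a few lines of configuration. A minimal sketch against the _analyze API (parameter names as in recent Elasticsearch versions):

POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<a href=...>some text</a>"
}

which returns the tokens some and text, with positions, and no Java required.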
Very quick intro to
Elasticsearch
Elasticsearch in 3’
Scalable, real-time search and analytics engine
Data distribution, cluster management
REST APIs
JVM based, uses Apache Lucene internally
Open source (on GitHub, Apache 2 license)
Elasticsearch in 3’
Unstructured search
Sorting / Scoring
Pagination
Enrichment
Structured search
https://www.elastic.co/elasticon/2015/sf/unlocking-interplanetary-datasets-with-real-time-search
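As a taste of the list above, one request can combine full-text matching, structured filtering, sorting and pagination; a minimal sketch (index and field names are made up):

GET /chat/_search
{
  "query": {
    "bool": {
      "must":   { "match": { "text": "happy" } },
      "filter": { "range": { "date": { "gte": "2016-01-01" } } }
    }
  },
  "sort": [ { "date": "desc" } ],
  "from": 0,
  "size": 10
}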
Machine Learning and Elasticsearch
Term Analysis (tf, idf, bm25)
Graph Analysis
Co-occurrence of terms (significant terms; see the sketch below)
• ChiSquare
Pearson correlation (#16817)
Regression (#17154)
What about classification, clustering, etc.?
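For instance, the co-occurring terms mentioned above are exposed through the significant_terms aggregation; a minimal sketch (index and field names are made up; analyzed text fields need fielddata or a keyword sub-field for this):

GET /chat/_search
{
  "size": 0,
  "query": { "match": { "text": "happy" } },
  "aggs": {
    "co_occurring": { "significant_terms": { "field": "text" } }
  }
}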
It’s not the matching data,
but the metadata that leads to it
“How to use Elasticsearch from Spark?”
– Somebody on Stack Overflow
Elasticsearch for Apache Hadoop™
Elasticsearch Spark – Native integration
Scala & Java API
Understands Scala & Java types
– Case classes
– Java Beans
Available as Spark package
Supports Spark Core & SQL
all 1.x versions (1.0–1.6)
Available for Scala 2.10 and 2.11
Elasticsearch as RDD / Dataset*
import org.elasticsearch.spark._

// read: each matching document in buckethead/albums becomes an RDD entry
val sc = new SparkContext(new SparkConf())
val rdd = sc.esRDD("buckethead/albums", "?q=pikes")

// write: case classes and maps are serialized into JSON documents
case class Artist(name: String, albums: Int)
val u2 = Artist("U2", 13)
val bh = Map("name" -> "Buckethead", "albums" -> 255, "age" -> 46)
sc.makeRDD(Seq(u2, bh)).saveToEs("radio/artists")
Elasticsearch as a DataFrame
val df = sql.read.format("es").load("buckethead/albums")
df.filter(df("category").equalTo("pikes").and(df("year").geq(2015)))

// the filter is pushed down to Elasticsearch as:
{ "query" : {
    "bool" : {
      "must" : [
        { "match" : { "category" : "pikes" } }
      ],
      "filter" : [
        { "range" : { "year" : { "gte" : "2015" } } }
      ]
    }
} }
Partition to Partition Architecture
Putting the pieces together
Typical ML pipeline for text
(the actual ML code is only a small part of it)
Pure Spark MLlib
val training = movieReviewsDataTrainingData // DataFrame with "text" and label columns
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.001)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
Pure Spark MLlib
// drop-in replacement: ESAnalyzer, the Lucene/Elasticsearch-backed analysis stage
val analyzer = new ESAnalyzer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(analyzer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.001)
Data movement
Work once – reuse multiple times
// index / analyze the data
training.saveToEs("movies/reviews")
Work once – reuse multiple times
// prepare the spec for vectorize – fast and lightweight
val spec = s"""{ "features" : [{
| "field": "text",
| "type" : "string",
| "tokens" : "all_terms",
| "number" : "occurrence",
| "min_doc_freq" : 2000
| }],
| "sparse" : "true"}""".stripMargin
ML.prepareSpec(spec, "my-spec")
Access the vector directly
// get the features – just another query
val payload = s"""{"script_fields" : { "vector" :
| { "script" : { "id" : "my-spec", "lang" : "doc_to_vector" } }
| }}""".stripMargin
// read the vectorized data
val vectorRDD = sparkCtx.esRDD("ml/data", payload)
// feed the vector to the pipeline
val vectorized = vectorRDD.map ( x =>
// get indices, the vector and length
(if (x._1 == "negative") 0.0d else 1.0d, ML.getVectorFrom(x._2))
).toDF("label", "features")
Revised ML pipeline
val vectorized = vectorRDD.map...
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.001)
val model = lr.fit(vectorized)
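From here, scoring new data is the usual MLlib flow. A hypothetical sketch, assuming a testRDD fetched through the same script_fields query and the talk’s ML.getVectorFrom helper:

// vectorize held-out data exactly like the training set
val testVectorized = testRDD.map(x =>
  (if (x._1 == "negative") 0.0d else 1.0d, ML.getVectorFrom(x._2))
).toDF("label", "features")

// score it with the trained model
val predictions = model.transform(testVectorized)
predictions.select("label", "prediction").show()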
Simplify ML pipeline
Vectorization happens once per dataset, regardless of # of pipelines
Raw data is not required any more
Need to adjust the model? Change the spec
val spec = s"""{ "features" : [{
| "field": "text",
| "type" : "string",
| "tokens" : "given",
| "number" : "tf",
| "terms": ["term1", "term2", ...]
| }],
| "sparse" : "true"}""".stripMargin
ML.prepareSpec(spec)
All this is WIP
Not all features available (currently dictionary, vectors)
Works with data outside or inside Elasticsearch (the latter is much faster)
Bind vectors to queries
Other topics WIP:
Focused on document / text classification – numeric support is next
Model importing / exporting – Spark 2.0 ML persistence
Feedback highly sought - Is this useful?
THANK YOU.
j.mp/spark-summit-west-16
elastic.co/hadoop
github.com/elastic | costin | brwe
discuss.elastic.co
@costinl