Exploring Word2Vec in Scala
Gary Sieling
@garysieling
Wingspan, an IQVIA Company
Jan 11, 2018
PHASE
FindLectures.com: A case study on natural language search
• Demo
• Crawling
• Search Use Cases
• Machine Learning
Goals
• Using machine learning on text
• Practical examples of Word2Vec in Scala
• Show uses of CUDA
Agenda
• Proof of Concept: Email alerts
• Concept Search
• CUDA
Papers
An empirical study of semantic similarity in WordNet and Word2Vec
http://scholarworks.uno.edu/cgi/viewcontent.cgi?article=3003&context=td
A Dual Embedding Space Model for Document Ranking
https://arxiv.org/pdf/1602.01137v1.pdf
Email Alerts
Concept Search
• Writing, NOT Code
• Excludes “writing css”, “writing php”
• Implies “poetry”, “fiction”, “copyediting”
Concept Search
• Recipes, Vegetarian Food
• NOT Dairy
• All three might include "vegan cooking"
• Implies no milk, cheese
Requirements
• Talks “about” the chosen topic
• Incorporate meaning – “Scala” + “Machine Learning” -> Dl4j
• May be a concept hierarchy
• Don’t combine meanings if the terms have nothing in common (hiking, art)
• Don’t send duplicate talks/articles (e.g. the same announcement from different publications)
• Choose a wide variety of talks (not 5 on type systems, etc.)
• Bonus points for “negative” meanings (Scala, but not monads)
This is a “search” problem
• Tokenize text
• Maybe mark known “entities”
• Filter / de-emphasize common terms / meanings
• Find the terms we should have searched for
• Search for those terms
• Re-rank / filter results
Solution: Word2Vec
https://github.com/idio/wiki2vec
Terms in context: Political Coding
http://findlectures.com/?q=liberation
Terms in context: Context definitions
http://findlectures.com/?q=quaker
Training Vectors
Was raised a Quaker
["was", "raised", "a", "religious", "since", "the", "whose", "patience"]
[1, 1, 1, 0, 0, 0, 0, 0]
The Quaker whose patience was
["was", "raised", "a", "religious", "since", "the", "whose", "patience"]
[1, 0, 0, 0, 0, 1, 1, 1]
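A minimal sketch of how the 0/1 context vectors above could be derived, in plain Scala. The object and function names are illustrative and not from the talk, and the whitespace tokenizer is an assumption; actual Word2Vec training slides a window over a much larger vocabulary.

```scala
object ContextVectors {
  // Vocabulary in the same order as on the slide.
  val vocabulary: Vector[String] =
    Vector("was", "raised", "a", "religious", "since", "the", "whose", "patience")

  // Lower-case and split on whitespace (assumption: a real tokenizer does more).
  def tokenize(sentence: String): Vector[String] =
    sentence.toLowerCase.split("\\s+").filter(_.nonEmpty).toVector

  // 1 where a vocabulary word appears alongside the target term, 0 otherwise.
  def contextVector(sentence: String, target: String): Vector[Int] = {
    val context = tokenize(sentence).filterNot(_ == target.toLowerCase).toSet
    vocabulary.map(word => if (context.contains(word)) 1 else 0)
  }

  def main(args: Array[String]): Unit = {
    println(contextVector("Was raised a Quaker", "quaker"))
    println(contextVector("The Quaker whose patience was", "quaker"))
  }
}
```

Each sentence containing the term contributes one such vector; the model learns from many of them.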
Word2Vec Output
P(Term | context)
Or
P(Context | Term)
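For reference (an addition, not from the slides): in the standard Word2Vec formulation, P(Term | context) corresponds to the CBOW variant and P(Context | Term) to skip-gram. Conceptually, skip-gram scores a context word c for a term w with a softmax over the vocabulary V, where v_w is the input vector of w and u_c is the output vector of c. Implementations usually approximate this with hierarchical softmax or negative sampling.

```latex
P(c \mid w) = \frac{\exp\left(u_c^{\top} v_w\right)}{\sum_{c' \in V} \exp\left(u_{c'}^{\top} v_w\right)}
```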
Example: Vector Addition
Gloria Steinem - Person + Ideology ~=
1. Marxist Feminism
2. Radical Feminism
3. Feminist Movement
4. Feminist Theory
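A minimal sketch of this kind of vector arithmetic in plain Scala. The 3-dimensional vectors are made up for illustration (a trained model has 50 to 300 dimensions), and the names VectorArithmetic, toyModel and nearest are hypothetical. It uses the well-known king - man + woman example rather than the Gloria Steinem query, since that needs a trained model.

```scala
object VectorArithmetic {
  type Vec = Vector[Double]

  // Made-up vectors; axes are roughly (royalty, male, female).
  val toyModel: Map[String, Vec] = Map(
    "king"  -> Vector(1.0, 1.0, 0.0),
    "queen" -> Vector(1.0, 0.0, 1.0),
    "man"   -> Vector(0.0, 1.0, 0.0),
    "woman" -> Vector(0.0, 0.0, 1.0),
    "apple" -> Vector(0.0, 0.2, 0.1)
  )

  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }
  def subtract(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x - y }

  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Add the positive terms, subtract the negative ones, then rank every
  // other word by cosine similarity to the resulting vector.
  def nearest(positive: Seq[String], negative: Seq[String], n: Int): Seq[String] = {
    val zero: Vec = Vector.fill(3)(0.0)
    val summed = positive.map(toyModel).foldLeft(zero)(add)
    val target = negative.map(toyModel).foldLeft(summed)(subtract)
    val inputs = (positive ++ negative).toSet
    toyModel.toSeq
      .filterNot { case (word, _) => inputs.contains(word) }
      .sortBy { case (_, vec) => -cosine(target, vec) }
      .take(n)
      .map(_._1)
  }

  def main(args: Array[String]): Unit =
    println(nearest(Seq("king", "woman"), Seq("man"), 1))
}
```

Deeplearning4j exposes the same idea through wordsNearest(positive, negative, n), which is used later in this deck.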
Suggested Search
Example: Data Format
{
  "word": "zulus",
  "count": 30,
  "syn0": [
    -0.064,0.118,0.031,0.163,0.019,0.197,0.097,-0.139,-0.055,0.155,
    -0.033,-0.252,-0.029,0.119,0.007,-0.017,0.187,0.017,0.058,-0.097,
    -0.255,-0.159,-0.053,-0.090,-0.118,0.119,0.068,0.025,0.160,-0.035,
    -0.216,0.065,0.017,0.038,-0.068,0.101,0.090,0.089,-0.023,0.265,
    -0.161,-0.178,-0.362,0.016,0.226,-0.070,-0.079,0.040,0.368,-0.150
  ],
  "syn1": [
    0.312,0.379,0.168,-0.371,-0.094,0.218,-0.022,-0.051,0.003,-0.010,
    0.233,-0.005,-0.037,0.105,0.025,-0.040,-0.127,0.201,0.175,0.277,
    0.185,-0.219,-0.504,-0.187,0.069,0.041,0.237,-0.245,0.067,
    -0.186,0.127,0.235,-0.262,-0.020,-0.152,0.007,-0.346,0.008,-0.173,
    -0.267,-0.049,0.051,0.087,0.046,-0.059,0.147,0.024,0.032,-0.403,
    0.019
  ]
}
Example: Similarity
Cosine similarity: a number in [-1, 1] (1 = same direction, 0 = orthogonal, -1 = opposite)
Image credit: https://engineering.aweber.com/cosine-similarity/
Operation 1: “Similarity”
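import org.nd4j.linalg.api.ndarray.INDArray
import org.nd4j.linalg.ops.transforms.Transforms

// Cosine of the angle between two vectors
def cosineSimilarity(
  a: INDArray,
  b: INDArray
): Double = {
  Transforms.cosineSim(a, b)
}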
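For reference, the standard definition that a cosine similarity function computes (the notation is an addition, not from the talk). For vectors a and b with n components:

```latex
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
           = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}
```

The value depends only on the angle between the two vectors, not on their lengths.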
INDArray
- Similar to a NumPy array
- The implementation (GPU or CPU) depends on which backend dependency is included:
// GPU backend (CUDA 8.0)
libraryDependencies +=
  "org.nd4j" % "nd4j-cuda-8.0-platform" % nd4jVersion
// CPU backend
libraryDependencies +=
  "org.nd4j" % "nd4j-native" % nd4jVersion
CUDA
• NVIDIA's platform and API for general-purpose computing on video cards / GPUs
• Requires NVIDIA SDK and a recent card ($100-$xx,xxx)
• Available on AWS
• Deeplearning4j: JVM libraries for machine learning
• Nd4j/nd4s: matrix algebra on large arrays
CUDA: example C code
__global__ void coalescedMultiply(float *a, float *c, int M)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM],
                     transposedTile[TILE_DIM][TILE_DIM];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];
    transposedTile[threadIdx.x][threadIdx.y] =
        a[(blockIdx.x*blockDim.x + threadIdx.y)*TILE_DIM +
          threadIdx.x];
    __syncthreads();
    for (int i = 0; i < TILE_DIM; i++)
        sum += aTile[threadIdx.y][i] * transposedTile[i][threadIdx.x];
    c[row*M+col] = sum;
}
Training Word2Vec
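import org.deeplearning4j.models.word2vec.Word2Vec

// sentenceIterator and tokenizer are defined elsewhere
val vec =
  new Word2Vec.Builder()
    .minWordFrequency(5)          // drop words seen fewer than 5 times
    .iterations(1)
    .layerSize(100)               // dimensions per word vector
    .seed(42)
    .windowSize(5)                // size of the context window
    .iterate(sentenceIterator)
    .tokenizerFactory(tokenizer)
    .build()
vec.fit()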
How do you tell if your code is running on the GPU?
How does this affect Word2Vec?
• Dl4j Demo project: 72 minutes (CPU)
• Dl4j Demo project: 41 minutes (GPU)
Operation 2: Compute a document mean
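import scala.collection.JavaConverters._

def getWordVectorsMean(tokens: List[String]): INDArray = {
  // Keep only tokens the model has a vector for; sorting fixes the order
  val words = tokens.filter(
    model.getWordVector(_) != null
  ).sorted
  model.getWordVectorsMean(
    words.asJavaCollection
  )
}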
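To make the averaging step concrete, here is a sketch of a document mean in plain Scala, without ND4J. The toy embeddings and the names DocumentMean and mean are assumptions for illustration; a trained model would supply the vectors.

```scala
object DocumentMean {
  type Vec = Vector[Double]

  // Hypothetical 2-dimensional embeddings, for illustration only.
  val toyModel: Map[String, Vec] = Map(
    "python" -> Vector(1.0, 0.0),
    "pandas" -> Vector(0.0, 1.0)
  )

  // Element-wise mean of the vectors of all known tokens.
  // Returns None when the model knows none of the tokens.
  def mean(tokens: Seq[String], model: Map[String, Vec]): Option[Vec] = {
    val vectors = tokens.flatMap(token => model.get(token))
    if (vectors.isEmpty) None
    else {
      val sum = vectors.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      Some(sum.map(_ / vectors.size))
    }
  }

  def main(args: Array[String]): Unit =
    println(mean(Seq("python", "pandas", "unknown"), toyModel))
}
```

Returning an Option avoids dividing by zero for documents with no known words.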
"Synonym" Discovery Example
[Figure: two vectors labeled "Code" and "Coat"]
Image credit: https://engineering.aweber.com/cosine-similarity/
Word2Vec – Build a Full Text Query
import scala.collection.JavaConverters._

List("python", "machine", "learning").map(
  (queryTerm) =>
    "(" +
    model.wordsNearest(
      List(queryTerm).asJava,       // positive terms
      List.empty[String].asJava,    // negative terms
      25
    ).asScala.map(
      (nearWord) =>
        "transcript:" + nearWord +
        "^" + model.similarity(queryTerm, nearWord)
    ).mkString(" OR ") +
    ")"
).mkString(" AND ")
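The same query construction can be tried without a trained model. This sketch swaps the model lookup for a hard-coded neighbour table. The table contents, the QueryBuilder name and the choice to boost the original term by 10 are assumptions for illustration.

```scala
object QueryBuilder {
  // Hypothetical nearest-neighbour lists: term -> (similar term, similarity).
  val neighbours: Map[String, Seq[(String, Double)]] = Map(
    "python"  -> Seq(("java", 0.62), ("programming", 0.61)),
    "machine" -> Seq(("computer", 0.55))
  )

  // One boosted OR-group per query term; the groups are AND-ed together.
  def build(field: String, terms: Seq[String]): String =
    terms.map { term =>
      val clauses = neighbours.getOrElse(term, Seq.empty).map {
        case (near, score) => s"$field:$near^$score"
      }
      // Keep the original term, boosted above its neighbours.
      (s"$field:$term^10" +: clauses).mkString("(", " OR ", ")")
    }.mkString(" AND ")

  def main(args: Array[String]): Unit =
    println(build("transcript", Seq("python", "machine")))
}
```

The output uses the Lucene field:term^boost syntax, so each neighbour contributes in proportion to its similarity.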
Visual – Nearest terms
[Figure: a query term vector and its top N closest term vectors]
Image credit: https://engineering.aweber.com/cosine-similarity/
Example – Query (“Python + Machine Learning”)
title_s:python^10 OR title_s:"machine learning"^10 …
(title_s:software^1.21 OR title_s:database^1.20 OR title_s:format^1.18 OR
title_s:applications^1.14 OR title_s:browser^1.14 OR title_s:setup^1.13 OR
title_s:bootstrap^1.13 OR title_s:in-class^1.13 OR title_s:campesina^1.12 OR
title_s:excel^1.12 OR title_s:hardware^1.11 OR title_s:programming^1.11 OR
title_s:api^1.11 OR title_s:prototype^1.11 OR title_s:middleware^1.11 OR
title_s:openstreetmap^1.10 OR title_s:product^1.10 OR title_s:app^1.09 OR
title_s:hbp^1.09 OR title_s:programmers^1.09 OR title_s:application^1.09 OR
title_s:databases^1.09 OR title_s:idiomatic^1.09 OR title_s:spreadsheet^1.09
OR title_s:java^1.09 …
AND (…)
Results (Python + Machine Learning + BM25)
Python for Data Analysis
How To Get Started With Machine Learning? | Two Minute Papers
The /r/playrust Classifier: Real World Rust Data Science
Andreas Mueller - Commodity Machine Learning
A Gentle Introduction To Machine Learning
A full Machine learning pipeline in Scikit-learn vs in scala-Spark
Hello World - Machine Learning Recipes #1
Visual diagnostics for more informed machine learning
Lab to Factory: Robust Machine Learning Systems
Machine Learning with Scala on Spark by Jose Quesada
Word2Vec – “Writing”
Issues Related to the Teaching of Creative Writing
Is Nonfiction Literature?
"Oh, you liar, you storyteller": On Fibbing, Fact and Fabulation
The Value of the Essay in the 21st Century
Re writing Re reading Re thinking – Web Design in Words
Aspen New York Book Series: The Art of the Memoir
Cheryl Strayed: "Wild"
Siri Hustvedt in Conversation with Paul Auster
Mary Karr: The 2016 Diana and Simon Raab Writer-in-Residence
History, Memory, and the Novel
Aboutness
Re-sorting top 100 documents
val queryMean = model.getWordVectorsMean(List("writing"))
val mean = model.getWordVectorsMean(NLP.getWords(document._1))
val similarity = Transforms.cosineSim(mean, queryMean)  // higher = more "about" the query
5 min 45 seconds @ 16 parallel threads
Visual – Aboutness
[Figure: query average vector and document average vector]
Image credit: https://engineering.aweber.com/cosine-similarity/
Aboutness - Results
Issues Related to the Teaching of Creative Writing: 0.43
Autobiography: 0.41
Contemporary Indian Writers: The Search for Creativity: 0.41
Marjorie Welish: Lecture: 0.40
History and Literature: The State of Play: A Roundtable Discussion: 0.40
Critical Reading of Great Writers: Albert Camus: 0.40
Daniel Schwarz: In Defense of Reading: 0.39
The Journey To The West by Professor Anthony C. Yu: 0.39
Blogs, Twitter, the Kindle: The Future of Reading: 0.39
Word2Vec + Overlapping Search Terms
Python, Programming vs Art, Hiking
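// All unordered pairs of distinct terms, with their Word2Vec similarity
terms.flatMap(
  (term1) => terms.map((term2) => (term1, term2))
).filter(
  (tuple) => tuple._1 < tuple._2  // keep each pair once, skip self-pairs
).map(
  (tuple) =>
    (tuple._1, tuple._2, w2v.model.get.similarity(tuple._1, tuple._2))
)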
Visual – Overlapping Search Terms
[Figure: vectors for query term 1 and query term 2]
Image credit: https://engineering.aweber.com/cosine-similarity/
Word2Vec + Overlapping Search Terms
Python, Programming
programming<-->python: 0.61
(python AND programming)
Hiking, Art
art<-->hiking: 0.10
(hiking OR art)
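The decision above can be sketched as a simple threshold on the pairwise similarity. The 0.5 cut-off and the QueryCombiner name are assumptions; the slides do not state which threshold was used.

```scala
object QueryCombiner {
  // Hypothetical cut-off between "related" and "unrelated" terms.
  val threshold = 0.5

  // AND the terms when they are similar enough to share a meaning,
  // otherwise OR them so that each topic can match on its own.
  def combine(term1: String, term2: String, similarity: Double): String = {
    val operator = if (similarity >= threshold) "AND" else "OR"
    s"($term1 $operator $term2)"
  }

  def main(args: Array[String]): Unit = {
    println(combine("python", "programming", 0.61))
    println(combine("hiking", "art", 0.10))
  }
}
```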
Topic Diversity
Writing
A Conversation with David Gerrold, Writer of Star Trek: The Trouble with Tribbles - Teletalk (58 minutes)
Star Trek: Science Fiction to Science Fact - STEM in 30 (28 minutes)
Python
Pythons Positive Press Pumps Pandas
Why is Python Growing So Quickly? - Stack Overflow Blog
Python explosion blamed on pandas
Visual – Topic Diversity
[Figure: average vectors for document 1 and document 2]
Image credit: https://engineering.aweber.com/cosine-similarity/
Pick one, find the least related (Python + Pandas)
Python explosion blamed on pandas: 1.0
Considering Python's Target Audience: 0.97
Animated routes with QGIS and Python: 0.97
I can't get some SQL to commit reading data from a database: 0.97
Using Python to build an AI Twitter bot people trust: 0.96
Getting a Job as a Self-Taught Python Developer: 0.96
Download and Process DEMs in Python: 0.96
How to mine newsfeed data and extract interactive insights in Python: 0.94
Differential Equation Solver In MATLAB, R, Julia, Python, C, Mathematica, Maple, and Fortran: 0.86
My personal data science toolbox written in Python: 0.75
1 min 30 seconds @ 16 parallel threads
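One way to turn "pick one, find the least related" into a selection loop is a greedy pass like the sketch below. This is an illustration under assumptions, not the talk's implementation: TopicDiversity and pickDiverse are hypothetical names, and the similarity function is supplied by the caller (for example, cosine similarity of document averages).

```scala
object TopicDiversity {
  // Greedy selection: start from the first document, then repeatedly add
  // the candidate whose highest similarity to anything already chosen is
  // the lowest. Stops after k documents or when candidates run out.
  def pickDiverse[D](docs: Seq[D], k: Int, similarity: (D, D) => Double): Seq[D] =
    if (docs.isEmpty || k <= 0) Seq.empty
    else {
      var chosen = Vector(docs.head)
      var remaining = docs.tail
      while (chosen.size < k && remaining.nonEmpty) {
        val next = remaining.minBy(doc => chosen.map(c => similarity(c, doc)).max)
        chosen = chosen :+ next
        remaining = remaining.filterNot(_ == next)
      }
      chosen
    }
}
```

With the scores listed above and k = 2, this would keep "Python explosion blamed on pandas" and then add "My personal data science toolbox written in Python" (0.75), the least similar candidate.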
Technique - Summary
• Get top X results, re-shuffle
• More computing resources + data -> higher relevance
Where Word2Vec Works
• Synonym generation
• Improve recall
• Search suggestions
• Incorporate secondary dataset (e.g. for enterprise search, privacy)
Why Scala?
• Ecosystem: Lucene, Spark
• Dependency Management
Performance
• Models take 1-2 weeks to train
• Some computations take minutes, which would not work in a search engine
• Changes:
  • Pre-compute tokens (e.g. use Lucene)
  • Pre-compute averages (not naturally stored in Lucene)
  • Hazelcast
Other Lessons
- Inventing your own math does not work
- High-dimensional “objects” do not follow your 2D/3D intuition
- Floating-point math is not associative
- Math in papers is untyped
- “Distance” between two vectors: cosine, Euclidean, Manhattan?
- vs. probability curves
- Unlike physics (types naturally compose, e.g. kg⋅m²⋅s⁻²)
- Follow a paper
- Nearly impossible to test on your own
- Almost no one publishes code
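The point about floating-point associativity is easy to reproduce. This small sketch (the names are illustrative) shows two groupings of the same sum giving different Doubles, and one possible mitigation: summing in a fixed order so the result no longer depends on how the inputs arrive.

```scala
object FloatingPoint {
  // Same three numbers, grouped differently.
  val leftFirst: Double = (0.1 + 0.2) + 0.3
  val rightFirst: Double = 0.1 + (0.2 + 0.3)

  // Sorting first gives a reproducible result regardless of input order.
  def stableSum(values: Seq[Double]): Double = values.sorted.sum

  def main(args: Array[String]): Unit = {
    println(leftFirst == rightFirst) // false
    println(stableSum(Seq(0.3, 0.1, 0.2)) == stableSum(Seq(0.1, 0.2, 0.3))) // true
  }
}
```

This matters when averaging many word vectors in parallel, where the summation order can vary between runs.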
Next Idea…
Resources
• "Relevant Search"
• “Deep Learning – A Practitioner’s Approach”
• Deeplearning4j
• Gensim
• https://github.com/DiceTechJobs/ConceptualSearch
• https://www.reddit.com/r/datasets/comments/3mg812/full_reddit_submission_corpus_now_available_2006/
FindLectures.com
Weekly Emails with Lunch and Learn Suggestions
http://findlectures.com/emails
Next installment:
Java Users Group In February 2018
“GPU Programming for Java Developers”
Contact:
@garysieling
@findlectures
gary@garysieling.com
https://www.findlectures.com
https://www.garysieling.com
https://github.com/garysieling/


Editor's Notes

  • #2: The dataset for this talk comes from our corporate lunch and learn. We look for general interest talks that stand alone and fit a lunch break.
  • #6: If we query for a topic instead, we also get appropriate results.
  • #13: The solution is to use a machine learning algorithm that can identify significant relationships in text. Word2Vec has become a famous algorithm, because it can learn implicit concepts, like gender, verb tenses, or the concept of a capital city. In one of the most well known examples, it identifies that king is to queen as man is to woman. You can add and subtract concepts mathematically, such as king - man + woman = queen, and find distances between concepts, to find things which are similar.
  • #19: If we query the model for "MLK" using its Python API, we get appropriate results. If you go further down this list, you get contemporary white supremacists.
  • #24: If you're interested in these topics, here are some useful resources.
  • #26: If you're interested in these topics, here are some useful resources.
  • #33: Google does something similar, using the series of search terms users typically type. Let's say you don't have many customers or users, or that they don't have insight.
  • #34: Wikipedia also has a lot of trivia, like every honorary degree Ron Chernow has earned.
  • #37: If we query for a topic instead, we also get appropriate results.
  • #38: If we query for a topic instead, we also get appropriate results.
  • #39: The solution is to use a machine learning algorithm that can identify significant relationships in text. Word2Vec has become famous because it can learn implicit concepts, such as gender, verb tense, or the notion of a capital city. In one of the best-known examples, it identifies that king is to queen as man is to woman. You can add and subtract concepts mathematically, such as king - man + woman = queen, and measure distances between concepts to find things that are similar.
  • #40: If we query for a topic instead, we also get appropriate results.
  • #41: If we query for a topic instead, we also get appropriate results.
  • #42: If we query for a topic instead, we also get appropriate results.
  • #43: If we query for a topic instead, we also get appropriate results.
  • #44: The solution is to use a machine learning algorithm that can identify significant relationships in text. Word2Vec has become famous because it can learn implicit concepts, such as gender, verb tense, or the notion of a capital city. In one of the best-known examples, it identifies that king is to queen as man is to woman. You can add and subtract concepts mathematically, such as king - man + woman = queen, and measure distances between concepts to find things that are similar.
  • #45: If we query for a topic instead, we also get appropriate results.
  • #46: If we query for a topic instead, we also get appropriate results.
  • #47: The solution is to use a machine learning algorithm that can identify significant relationships in text. Word2Vec has become famous because it can learn implicit concepts, such as gender, verb tense, or the notion of a capital city. In one of the best-known examples, it identifies that king is to queen as man is to woman. You can add and subtract concepts mathematically, such as king - man + woman = queen, and measure distances between concepts to find things that are similar.
  • #48: If we query for a topic instead, we also get appropriate results.
  • #49: If we query for a topic instead, we also get appropriate results.
  • #50: The solution is to use a machine learning algorithm that can identify significant relationships in text. Word2Vec has become famous because it can learn implicit concepts, such as gender, verb tense, or the notion of a capital city. In one of the best-known examples, it identifies that king is to queen as man is to woman. You can add and subtract concepts mathematically, such as king - man + woman = queen, and measure distances between concepts to find things that are similar.
  • #51: If we query for a topic instead, we also get appropriate results.
  • #52: If you're interested in these topics, here are some useful resources.
  • #53: If we query for a topic instead, we also get appropriate results.
  • #54: If you're interested in these topics, here are some useful resources.
  • #55: If you're interested in these topics, here are some useful resources.
  • #57: If you're interested in these topics, here are some useful resources.
  • #58: If you're interested in these topics, here are some useful resources.
  • #64: If you're interested in these topics, here are some useful resources.
  • #65: If you're interested in these topics, here are some useful resources.
  • #66: If you're interested in these topics, here are some useful resources.
  • #67: We're Hiring
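
The Word2Vec notes above describe adding and subtracting concepts (king - man + woman = queen) and measuring distances between them. A minimal Scala sketch of that arithmetic follows. It is not the code from the talk: the three-dimensional vectors are hand-picked toy values rather than output from a trained model, and the names `AnalogySketch`, `embeddings`, `analogy`, and `nearest` are hypothetical. A real model would supply vectors with hundreds of dimensions.

```scala
object AnalogySketch {
  type Vec = Vector[Double]

  // ASSUMPTION: toy embeddings chosen by hand for illustration only.
  // The dimensions loosely stand for (royalty, maleness, femaleness).
  val embeddings: Map[String, Vec] = Map(
    "king"  -> Vector(0.9, 0.8, 0.1),
    "queen" -> Vector(0.9, 0.1, 0.8),
    "man"   -> Vector(0.1, 0.9, 0.1),
    "woman" -> Vector(0.1, 0.1, 0.9),
    "apple" -> Vector(0.0, 0.2, 0.2)
  )

  // Element-wise vector addition and subtraction.
  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }
  def sub(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x - y }

  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (x, y) => x * y }.sum
  def norm(a: Vec): Double = math.sqrt(dot(a, a))

  // Cosine similarity: values near 1.0 mean the vectors point the same way.
  def cosine(a: Vec, b: Vec): Double = dot(a, b) / (norm(a) * norm(b))

  // The word whose vector is most similar to `target`, skipping `exclude`.
  def nearest(target: Vec, exclude: Set[String]): String =
    embeddings
      .filter { case (word, _) => !exclude.contains(word) }
      .maxBy { case (_, vec) => cosine(target, vec) }
      ._1

  // Computes a - b + c and returns the closest word that is not an input.
  def analogy(a: String, b: String, c: String): String = {
    val target = add(sub(embeddings(a), embeddings(b)), embeddings(c))
    nearest(target, Set(a, b, c))
  }

  def main(args: Array[String]): Unit =
    println(analogy("king", "man", "woman"))
}
```

With these toy vectors, `analogy("king", "man", "woman")` returns `queen`. The input words are excluded from the candidates because the result of the arithmetic usually stays close to one of its own inputs.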