Exploring Word2Vec in Scala

Gary Sieling
@garysieling
Wingspan, an IQVIA Company
Jan 11, 2018

FindLectures.com: A case study on natural language search
• Demo
• Crawling
• Search Use Cases
• Machine Learning

Goals
• Using machine learning on text
• Practical examples of Word2Vec in Scala
• Show uses of CUDA

Agenda
• Proof of Concept: Email alerts
• Concept Search
• CUDA

Papers
An empirical study of semantic similarity in WordNet and Word2Vec
http://scholarworks.uno.edu/cgi/viewcontent.cgi?article=3003&context=td
A Dual Embedding Space Model for Document Ranking
https://arxiv.org/pdf/1602.01137v1.pdf

Concept Search
• Writing, NOT Code
• Excludes "writing css", "writing php"
• Implies "poetry", "fiction", "copyediting"

Concept Search
• Recipes, Vegetarian Food
• NOT Dairy
• All three might include "vegan cooking"
• Implies no milk, cheese

Requirements
• Talks "about" the chosen topic
• Incorporate meaning: "Scala" + "Machine Learning" -> Dl4j
• May be a concept hierarchy
• Don't combine meaning if nothing in common (hiking, art)
• Don't send duplicate talks/articles (e.g. the same announcement from different publications)
• Choose a wide variety of talks (not 5 on type systems, etc.)
• Bonus points for "negative" meanings (Scala, but not monads)

This is a "search" problem
• Tokenize text
• Maybe mark known “entities”
• Filter / de-emphasize common terms / meanings
• Find the terms we should have searched for
• Search for those terms
• Re-rank / filter results

Solution: Word2Vec
https://github.com/idio/wiki2vec

Terms in context: Political Coding
http://findlectures.com/?q=liberation

Terms in context: Context definitions
http://findlectures.com/?q=quaker

Training Vectors
Was raised a Quaker
["was", "raised", "a", "religious", "since", "the", "whose", "patience"]
[1, 1, 1, 0, 0, 0, 0, 0]
The Quaker whose patience was
["was", "raised", "a", "religious", "since", "the", "whose", "patience"]
[1, 0, 0, 0, 0, 1, 1, 1]
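The two training vectors above can be reproduced with a few lines of plain Scala. This is a minimal sketch, not the encoding Word2Vec uses internally: the object name `ContextVectors`, the `encode` helper, and the whitespace tokenization are assumptions made for illustration.

```scala
object ContextVectors {
  // Vocabulary in the same order as on the slide.
  val vocabulary: Vector[String] =
    Vector("was", "raised", "a", "religious", "since", "the", "whose", "patience")

  // Mark each vocabulary word with 1 if it occurs alongside the target term, else 0.
  def encode(sentence: String, target: String): Vector[Int] = {
    val context = sentence.toLowerCase.split("\\s+").filter(_ != target).toSet
    vocabulary.map(word => if (context.contains(word)) 1 else 0)
  }
}
```

Calling `ContextVectors.encode("Was raised a Quaker", "quaker")` yields the first indicator row, and the second sentence yields the second row.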

Word2Vec Output
P(Term | Context) (CBOW)
Or
P(Context | Term) (skip-gram)

Example: Vector Addition
Gloria Steinem - Person + Ideology ~=
1. Marxist Feminism
2. Radical Feminism
3. Feminist Movement
4. Feminist Theory
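A rough sketch of how such a "subtract one concept, add another" lookup might work, using plain Scala collections instead of a trained model. The names `VectorArithmetic` and `analogy` and the candidate table passed in are hypothetical, and a real model would search its whole vocabulary.

```scala
object VectorArithmetic {
  type Vec = Vector[Double]

  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }
  def sub(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x - y }

  // Cosine of the angle between two vectors; 0.0 if either has zero length.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Rank candidates by cosine similarity to (start - minus + plus), best first.
  def analogy(start: Vec, minus: Vec, plus: Vec,
              candidates: Map[String, Vec], n: Int): List[String] = {
    val target = add(sub(start, minus), plus)
    candidates.toList
      .sortBy { case (_, v) => -cosine(target, v) }
      .take(n)
      .map(_._1)
  }
}
```

With toy three-dimensional vectors, `start = (1, 1, 0)`, `minus = (1, 0, 0)` and `plus = (0, 0, 1)` give the target `(0, 1, 1)`, so candidates pointing in that direction rank first.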

Example: Data Format
{
  "word":"zulus",
  "count":30,
  "syn0":[
    -0.064,0.118,0.031,0.163,0.019,0.197,0.097,-0.139,-0.055,0.155,
    -0.033,-0.252,-0.029,0.119,0.007,-0.017,0.187,0.017,0.058,-0.097,
    -0.255,-0.159,-0.053,-0.090,-0.118,0.119,0.068,0.025,0.160,-0.035,
    -0.216,0.065,0.017,0.038,-0.068,0.101,0.090,0.089,-0.023,0.265,
    -0.161,-0.178,-0.362,0.016,0.226,-0.070,-0.079,0.040,0.368,-0.150
  ],
  "syn1":[
    0.312,0.379,0.168,-0.371,-0.094,0.218,-0.022,-0.051,0.003,-0.010,
    0.233,-0.005,-0.037,0.105,0.025,-0.040,-0.127,0.201,0.175,0.277,
    0.185,-0.219,-0.504,-0.187,0.069,0.041,0.237,-0.245,0.067,
    -0.186,0.127,0.235,-0.262,-0.020,-0.152,0.007,-0.346,0.008,-0.173,
    -0.267,-0.049,0.051,0.087,0.046,-0.059,0.147,0.024,0.032,-0.403,
    0.019
  ]
}

Example: Similarity
Cosine similarity: a number in [-1, 1] (1 = same direction, 0 = unrelated, -1 = opposite)
Image credit: https://engineering.aweber.com/cosine-similarity/

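Operation 1: "Similarity"
import org.nd4j.linalg.api.ndarray.INDArray
import org.nd4j.linalg.ops.transforms.Transforms

// Cosine similarity of two word vectors, delegated to ND4J
def cosineSimilarity(
  a: INDArray,
  b: INDArray
): Double = {
  Transforms.cosineSim(a, b)
}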
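For readers without ND4J on the classpath, the same quantity can be sketched in plain Scala as the dot product divided by the product of the two vector lengths. The object name `Cosine` is hypothetical, and returning 0.0 for a zero-length vector is a choice made here, not ND4J behavior.

```scala
object Cosine {
  // cos(a, b) = (a . b) / (|a| * |b|)
  def similarity(a: Vector[Double], b: Vector[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }
}
```

Parallel vectors score 1, orthogonal vectors score 0, and vectors pointing in opposite directions score -1, regardless of their lengths.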

INDArray
- Similar to a NumPy array
- Backend (GPU or CPU) is chosen by which dependency is on the classpath:
libraryDependencies +=
  "org.nd4j" % "nd4j-cuda-8.0-platform" % nd4jVersion
libraryDependencies +=
  "org.nd4j" % "nd4j-native" % nd4jVersion

CUDA
• NVIDIA's platform and programming model for general-purpose computing on video cards / GPUs
• Requires NVIDIA SDK and a recent card ($100-$xx,xxx)
• Available on AWS
• Deeplearning4j: JVM libraries for machine learning
• Nd4j/nd4s: matrix algebra on large arrays

CUDA: example C code
// TILE_DIM must be defined elsewhere and match the thread-block dimensions
__global__ void coalescedMultiply(float *a, float *c, int M)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM],
                     transposedTile[TILE_DIM][TILE_DIM];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];
    transposedTile[threadIdx.x][threadIdx.y] =
        a[(blockIdx.x*blockDim.x + threadIdx.y)*TILE_DIM +
          threadIdx.x];
    __syncthreads();
    for (int i = 0; i < TILE_DIM; i++)
        sum += aTile[threadIdx.y][i] * transposedTile[i][threadIdx.x];
    c[row*M+col] = sum;
}

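Training Word2Vec
import org.deeplearning4j.models.word2vec.Word2Vec

val vec =
  new Word2Vec.Builder()
    .minWordFrequency(5)   // drop words seen fewer than 5 times
    .iterations(1)
    .layerSize(100)        // dimensions of each word vector
    .seed(42)
    .windowSize(5)         // size of the context window
    .iterate(sentenceIterator)
    .tokenizerFactory(tokenizer)
    .build()
vec.fit()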

How do you tell if your code is running on the GPU?

How does this affect Word2Vec?
• Dl4j Demo project: 72 minutes (CPU)
• Dl4j Demo project: 41 minutes (GPU)

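Operation 2: Compute a document mean
import scala.collection.JavaConverters._
import org.nd4j.linalg.api.ndarray.INDArray

def getWordVectorsMean(tokens: List[String]): INDArray = {
  // Keep only tokens the model has a vector for; sorting fixes
  // the order in which the vectors are summed
  val words = tokens.filter(
    model.getWordVector(_) != null
  ).sorted
  model.getWordVectorsMean(
    words.asJavaCollection
  )
}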
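What the mean computes can be sketched without a trained model: sum the vectors component by component and divide by the number of words found. The `DocumentMean` object and the `vectors` lookup table below are hypothetical stand-ins for the model, and returning `None` for a document with no known words is a design choice of this sketch.

```scala
object DocumentMean {
  type Vec = Vector[Double]

  // Average the vectors of every token present in the lookup table.
  def mean(tokens: List[String], vectors: Map[String, Vec]): Option[Vec] = {
    // Unknown tokens are skipped; sorting keeps the summation order stable.
    val known = tokens.filter(vectors.contains).sorted.map(vectors)
    if (known.isEmpty) None
    else {
      val sum = known.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      Some(sum.map(_ / known.size))
    }
  }
}
```

Returning an `Option` makes the "no known words" case explicit for the caller instead of dividing by zero.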

"Synonym" Discovery Example
"Code"
"Coat"

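Word2Vec – Build a Full Text Query
import scala.collection.JavaConverters._

// Expand each query term into an OR-group of its 25 nearest words,
// boosting each one by its similarity to the original term
List("python", "machine", "learning").map(
  (queryTerm) =>
    "(" +
    model.wordsNearest(
      List(queryTerm).asJava,     // positive terms
      List.empty[String].asJava,  // negative terms
      25
    ).asScala.map(
      (nearWord) =>
        "transcript:" + nearWord +
        "^" + model.similarity(nearWord, queryTerm)
    ).mkString(" OR ") +
    ")"
).mkString(" AND ")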
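The string assembly can be tried on its own, separate from the model lookup. In this sketch the neighbours and their similarities are passed in as plain data; `QueryBuilder`, `build`, and the sample values in the usage note are hypothetical.

```scala
object QueryBuilder {
  // One OR-group per query term; each neighbour is boosted by its similarity.
  def build(field: String, neighbours: List[(String, List[(String, Double)])]): String =
    neighbours.map { case (_, near) =>
      near
        .map { case (word, sim) => s"$field:$word^$sim" }
        .mkString("(", " OR ", ")")
    }.mkString(" AND ")
}
```

For example, `QueryBuilder.build("transcript", List("python" -> List("java" -> 0.5, "ruby" -> 0.25)))` produces `(transcript:java^0.5 OR transcript:ruby^0.25)`.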

Visual – Nearest terms
Query Term
Top N closest

Example – Query (“Python + Machine Learning”)
title_s:python^10 OR title_s:"machine learning"^10 …
(title_s:software^1.21 OR title_s:database^1.20 OR title_s:format^1.18 OR
title_s:applications^1.14 OR title_s:browser^1.14 OR title_s:setup^1.13 OR
title_s:bootstrap^1.13 OR title_s:in-class^1.13 OR title_s:campesina^1.12 OR
title_s:excel^1.12 OR title_s:hardware^1.11 OR title_s:programming^1.11 OR
title_s:api^1.11 OR title_s:prototype^1.11 OR title_s:middleware^1.11 OR
title_s:openstreetmap^1.10 OR title_s:product^1.10 OR title_s:app^1.09 OR
title_s:hbp^1.09 OR title_s:programmers^1.09 OR title_s:application^1.09 OR
title_s:databases^1.09 OR title_s:idiomatic^1.09 OR title_s:spreadsheet^1.09
OR title_s:java^1.09 …
AND (…)

Results (Python + Machine Learning + BM25)
Python for Data Analysis
How To Get Started With Machine Learning? | Two Minute Papers
The /r/playrust Classifier: Real World Rust Data Science
Andreas Mueller - Commodity Machine Learning
A Gentle Introduction To Machine Learning
A full Machine learning pipeline in Scikit-learn vs in scala-Spark
Hello World - Machine Learning Recipes #1
Visual diagnostics for more informed machine learning
Lab to Factory: Robust Machine Learning Systems
Machine Learning with Scala on Spark by Jose Quesada

Word2Vec – “Writing”
Issues Related to the Teaching of Creative Writing
Is Nonfiction Literature?
"Oh, you liar, you storyteller": On Fibbing, Fact and Fabulation
The Value of the Essay in the 21st Century
Re writing Re reading Re thinking – Web Design in Words
Aspen New York Book Series: The Art of the Memoir
Cheryl Strayed: "Wild"
Siri Hustvedt in Conversation with Paul Auster
Mary Karr: The 2016 Diana and Simon Raab Writer-in-Residence
History, Memory, and the Novel

Aboutness
Re-sorting top 100 documents
val queryMean = model.getWordVectorsMean(List("writing"))
val mean = model.getWordVectorsMean(NLP.getWords(document._1))
// cosineSim returns a similarity: higher means more "about" the query
val distance = Transforms.cosineSim(mean, queryMean)
5 min 45 seconds @ 16 parallel threads
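The re-sorting step itself might look like the following, assuming the document means have already been computed. Plain `Vector[Double]` values stand in for `INDArray`, and the names `Aboutness` and `rerank` are hypothetical.

```scala
object Aboutness {
  type Vec = Vector[Double]

  // Cosine of the angle between two vectors; 0.0 if either has zero length.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Score each (title, documentMean) pair against the query mean, best first.
  def rerank(queryMean: Vec, docs: List[(String, Vec)]): List[(String, Double)] =
    docs
      .map { case (title, mean) => (title, cosine(mean, queryMean)) }
      .sortBy { case (_, score) => -score }
}
```

Because only the candidates already returned by the search engine are scored, the cost grows with the size of that result set rather than with the whole index.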

Visual – Aboutness
Query Average
Document Average

Aboutness - Results
Issues Related to the Teaching of Creative Writing: 0.43
Autobiography: 0.41
Contemporary Indian Writers: The Search for Creativity: 0.41
Marjorie Welish: Lecture: 0.40
History and Literature: The State of Play: A Roundtable Discussion: 0.40
Critical Reading of Great Writers: Albert Camus: 0.40
Daniel Schwarz: In Defense of Reading: 0.39
The Journey To The West by Professor Anthony C. Yu: 0.39
Blogs, Twitter, the Kindle: The Future of Reading: 0.39

Word2Vec + Overlapping Search Terms
Python, Programming vs Art, Hiking
// Score every unordered pair of query terms by similarity
terms.map(
  (term1) =>
    terms.map(
      (term2) => (term1, term2)
    )
).flatten.filter(
  (tuple) => tuple._1 < tuple._2 // keep each pair once, drop self-pairs
).map(
  (tuple) =>
    (tuple._1, tuple._2, w2v.model.get.similarity(tuple._1, tuple._2))
)

Visual – Overlapping Search Terms
Query Term 1
Query Term 2

Word2Vec + Overlapping Search Terms
Python, Programming
programming<-->python: 0.61
(python AND programming)
Hiking, Art
art<-->hiking: 0.10
(hiking OR art)
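One way to turn a pairwise similarity into the AND/OR choice shown above is a simple cutoff. The 0.5 threshold here is an assumption picked to separate the two examples, not a value given in the talk, and `TermCombiner` is a hypothetical name.

```scala
object TermCombiner {
  // Join with AND when the terms are related, otherwise with OR.
  // The default threshold of 0.5 is an illustrative assumption.
  def combine(a: String, b: String, similarity: Double, threshold: Double = 0.5): String = {
    val op = if (similarity >= threshold) "AND" else "OR"
    s"($a $op $b)"
  }
}
```

With the scores from the slide, `combine("python", "programming", 0.61)` gives `(python AND programming)` and `combine("hiking", "art", 0.10)` gives `(hiking OR art)`.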

Topic Diversity
Writing
A Conversation with David Gerrold, Writer of Star Trek: The Trouble with Tribbles - Teletalk (58 minutes)
Star Trek: Science Fiction to Science Fact - STEM in 30 (28 minutes)
Python
Pythons Positive Press Pumps Pandas
Why is Python Growing So Quickly? - Stack Overflow Blog
Python explosion blamed on pandas

Visual – Topic Diversity
Document 1 - Average
Document 2 - Average

Pick one, find the least related (Python + Pandas)
Python explosion blamed on pandas: 1.0
Considering Python's Target Audience: 0.97
Animated routes with QGIS and Python: 0.97
I can't get some SQL to commit reading data from a database: 0.97
Using Python to build an AI Twitter bot people trust: 0.96
Getting a Job as a Self-Taught Python Developer: 0.96
Download and Process DEMs in Python: 0.96
How to mine newsfeed data and extract interactive insights in Python: 0.94
Differential Equation Solver In MATLAB, R, Julia, Python, C, Mathematica, Maple, and Fortran: 0.86
My personal data science toolbox written in Python: 0.75
1 min 30 seconds @ 16 parallel threads
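The "pick one, find the least related" step can be sketched as taking the candidate with the lowest cosine similarity to the chosen document's mean vector. The names `TopicDiversity` and `leastRelated` are hypothetical, and plain vectors stand in for the model's document averages.

```scala
object TopicDiversity {
  type Vec = Vector[Double]

  // Cosine of the angle between two vectors; 0.0 if either has zero length.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Title of the candidate whose mean vector is least similar to the chosen one.
  def leastRelated(chosen: Vec, docs: List[(String, Vec)]): Option[String] =
    if (docs.isEmpty) None
    else Some(docs.minBy { case (_, v) => cosine(chosen, v) }._1)
}
```

Repeating the step with each newly picked document would spread the selection across topics, at the cost of one similarity pass per pick.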

Technique - Summary
• Get top X results, re-shuffle
• More computing resources + data -> higher relevance

Where Word2Vec Works
• Synonym generation
• Improve recall
• Search suggestions
• Incorporate secondary dataset (e.g. for enterprise search, privacy)

Why Scala?
• Ecosystem: Lucene, Spark
• Dependency Management

Performance
• Models take 1-2 weeks to train
• Some computations take minutes, which would not work in a search engine
• Changes:
• Pre-compute tokens (e.g. use Lucene)
• Pre-compute averages (Lucene does not store these naturally)
• Hazelcast

Other Lessons
- Inventing your own math does not work
- High-dimensional "objects" do not follow your intuition like 2D/3D
- Floating-point math is not associative
- Math in papers is untyped
- "Distance" between two vectors: cosine, Euclidean, Manhattan?
- vs. probability curves
- Unlike physics (types naturally compose, kg⋅m²⋅s⁻²)
- Follow a paper
- Nearly impossible to test on your own
- Almost no one publishes code
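The point about floating-point math is easy to check: adding the same three doubles in a different grouping gives different results. The sketch below also shows one common workaround, fixing the order before summing. The names `FloatingPoint` and `stableSum` are hypothetical.

```scala
object FloatingPoint {
  // Same three numbers, different grouping, different result.
  val leftFirst: Double = (0.1 + 0.2) + 0.3
  val rightFirst: Double = 0.1 + (0.2 + 0.3)

  // Sorting first makes the result independent of the input order.
  def stableSum(xs: List[Double]): Double = xs.sorted.sum
}
```

This matters when vectors are averaged in parallel or in varying order, since two runs over the same words can otherwise produce slightly different means.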

Resources
• "Relevant Search"
• "Deep Learning – A Practitioner's Approach"
• Deeplearning4j
• Gensim
• https://github.com/DiceTechJobs/ConceptualSearch
• https://www.reddit.com/r/datasets/comments/3mg812/full_reddit_submission_corpus_now_available_2006/

FindLectures.com
Weekly Emails with Lunch and Learn Suggestions
http://findlectures.com/emails

Next installment:
Java Users Group in February 2018
"GPU Programming for Java Developers"

Contact:
@garysieling
@findlectures
gary@garysieling.com
https://www.findlectures.com
https://www.garysieling.com
https://github.com/garysieling/