Machine intelligence in HR technology: resume analysis at scale - Adrian Mihai

Similarity matching, resume processing and no-frills deep learning models deployment

Matching jobs to people 
— 
 
We apply data science to large numbers of resumes in real time, telling recruiters
who the most qualified candidates are for their job requirements and explaining why.

Resume processing and profile analysis
— 
 
Opening scans through resume files and database candidate profiles to recommend the
perfect candidates for any given raw job description by analyzing patterns in candidate
history, weighing up skills and fetching candidate code & portfolios to support the decision.
A high-level overview of our platform is here:
https://speakerdeck.com/amorroxic/opening-dot-io-system-architecture

Quick overview: resume logic pipeline
[Pipeline diagram. Labels: input (string / byte array), download, doc->pdf, read pdf, resume text, text stream; parallel tasks each emitting json: feature extraction, topics extraction, education parser, elasticsearch percolator, regex (email, etc), salary regression, extra tasks, … (10 other tasks); combine.]
Reactive streams - successive aggregation of state generated by specialized actors
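
The fan-out and combine idea behind this pipeline can be sketched in a few lines. This is a minimal illustration only: it swaps the reactive streams and actors named on the slide for plain `asyncio` coroutines, and the three task functions (`regex_task`, `education_task`, `topics_task`) and their toy logic are hypothetical stand-ins, not the platform's parsers.

```python
import asyncio
import json
import re


# Hypothetical stand-ins for the specialised tasks named on the slide.
async def regex_task(text: str) -> dict:
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
    return {"emails": emails}


async def education_task(text: str) -> dict:
    degrees = [d for d in ("BSc", "MSc", "PhD") if d in text]
    return {"education": degrees}


async def topics_task(text: str) -> dict:
    known = {"scala", "python", "golang"}
    words = set(re.findall(r"[a-z+#]+", text.lower()))
    return {"topics": sorted(known & words)}


async def analyse(text: str) -> dict:
    # Fan out: every task receives the same resume text and runs concurrently.
    parts = await asyncio.gather(
        regex_task(text), education_task(text), topics_task(text)
    )
    # Combine: fold each task's JSON fragment into one aggregated state.
    combined: dict = {}
    for part in parts:
        combined.update(part)
    return combined


if __name__ == "__main__":
    resume = "Jane Doe, MSc. Skills: Scala, Python. jane@example.com"
    print(json.dumps(asyncio.run(analyse(resume)), indent=2))
```

Each task only knows about its own fragment, so adding an eleventh or twelfth task means adding one more coroutine to the `gather` call rather than touching the others.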

Information extraction flow
[Flow diagram. Labels: text, regex, links/emails/phones/etc, links (array[url]), screenshot (several parallel tasks), github link, code extraction, simeria http call, combine, json, experience vector, summary vector, vec, search index.]
Async i/o & search index creation. Indexes (candidate vectors) generated/stored on the fly.

Matching pipeline
[Pipeline diagram. Labels: job title (provided title, search) and job description, each passing through neural parsing -> encoder network -> dense vector.]
Matching jobs/candidates and people similar to each other in high volumes of resumes 
— 
 
All input encoded as dense vectors
Similarity = angular/cosine sim between sets of encodings
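
For reference, the two metrics named above can be written out for a single pair of dense vectors. This is a generic sketch of cosine and angular similarity, not the platform's code; how scores are aggregated across sets of encodings is not described in the slides and is left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def angular_similarity(a, b):
    """Map the angle between a and b to [0, 1]: 1 = same direction, 0 = opposite."""
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return 1.0 - math.acos(cos) / math.pi
```

Angular similarity is a monotonic function of cosine similarity, so both produce the same ranking; the angular form is simply linear in the angle.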

Real time queries
[Diagram: two random projection tree lookups, each returning candidates, merged with A * x + B * (1-x).]
Fast matching - computing similarity over vast vector collections (x2) 
— 
 
Expensive to compute similarity metrics in real time -> k-nn approximations.
[Diagram labels: dense input (job title), dense input (job description); x - search bias.]
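
The blend A * x + B * (1-x) from the diagram can be sketched as below. Two assumptions are made here that the slides do not state: A is taken to be the job-title similarity and B the job-description similarity, and a candidate missing from one of the two result sets is scored 0 on that side. The function name `blend` is hypothetical.

```python
def blend(title_scores, description_scores, x):
    """Rank candidate ids by A * x + B * (1 - x), where x is the search bias.

    title_scores / description_scores map candidate id -> similarity.
    Assumption: A = title similarity, B = description similarity.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be in [0, 1]")
    ids = set(title_scores) | set(description_scores)
    blended = {
        cid: title_scores.get(cid, 0.0) * x
        + description_scores.get(cid, 0.0) * (1.0 - x)
        for cid in ids
    }
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)
```

With x = 1 only the title lookup matters, with x = 0 only the description lookup, and values in between trade one off against the other.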

PARSING: multi-class, seq2seq, character-level output (dates / OOV names / ..)
SIMILARITY/ENCODERS: siamese networks
UP-SKILLING: model ensembles (input -> latent space -> salary regression -> sequences)
SUMMARIES: current area of research

We train multiple models for various contexts (jobs / resumes / ..)

Encoding input and NLP models architecture
General considerations 
— 
 
Mostly seq2seq, siamese, attention architectures
Input is mostly word vectors - however at times we augment input features 
with ngrams / character-level information

Caution on word embedding
A potentially trivial example, but it shows why models should ideally be trained on data specific to the problem domain
— 
 
Terms listed for “scala” and similarity “scala” - “opera”, per embedding model:

fastText (own corpus, 10gb)
“scala’s”, “java/c++/scala”, “java/scala”, “clojure”, ..
similarity “scala” - “opera” = 0.17 (very syntax oriented)

fastText (own corpus, no character n-grams)
“kotlin”, “clojure”, “haskell”, “scala’s”, “f#”, ..
similarity “scala” - “opera” = 0.14 (good)

fastText (facebook pre-trained vectors, en wiki)
“traviata”, “barbiere”, “teatro”, “verdi”, ..
similarity “scala” - “opera” = 0.57 (very broad)

word2vec (own corpus)
“kotlin”, “clojure”, “haskell”, “f#”, ..
similarity “scala” - “opera” = 0.05 (very specific)

[Figure labels: syntactic bias (char n-grams in skipgram/cbow) versus semantic bias (no char n-grams in skipgram/cbow); word: “scala”.]
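
Why character n-grams pull “scala” towards “scala’s” can be seen by listing the n-grams themselves. The sketch below follows the fastText convention of wrapping a word in `<` and `>` and taking n-grams of length 3 to 6 (the library defaults); the Jaccard overlap is only an illustration of shared subword units, not how fastText computes similarity (it sums learned n-gram vectors).

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in boundary markers, fastText style."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


def ngram_overlap(a, b):
    """Jaccard overlap of the two words' character n-gram sets (illustrative only)."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    for other in ("scala's", "java/scala", "kotlin"):
        print(other, round(ngram_overlap("scala", other), 3))
```

Words that share many subword units start out close in vector space regardless of meaning, which is the syntactic bias noted above; without n-grams, only co-occurrence in the corpus decides.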

Similarity network architecture
Sequence encoders and similarity core 
— 
 
Recurrent networks sharing weights (siamese architecture)
[Diagram: two recurrent encoders with shared weights. Sequence (a), “she loves data science”: inputs x1(a)..x4(a), hidden states h1(a)..h4(a). Sequence (b), “machine learning rocks”: inputs x1(b)..x3(b), hidden states h1(b)..h3(b). Both feed an objective score.]
Input encoding derives from the trained sim network:
activations from the last dense layer before output.
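
The weight-sharing idea can be shown with a deliberately tiny, untrained recurrent encoder. Everything here is an assumption made for illustration: the `TinyRNNEncoder` class, the Elman-style update, the random weights, and the use of the final hidden state as the encoding (the slide takes activations from the last dense layer of a trained network instead). Training and the objective are omitted.

```python
import math
import random


class TinyRNNEncoder:
    """Minimal Elman-style RNN; the final hidden state serves as the encoding."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.hidden_dim = hidden_dim
        self.w_in = matrix(hidden_dim, input_dim)     # input -> hidden weights
        self.w_rec = matrix(hidden_dim, hidden_dim)   # hidden -> hidden weights

    def encode(self, sequence):
        h = [0.0] * self.hidden_dim
        for x in sequence:
            # h_t = tanh(W_in x_t + W_rec h_{t-1})
            h = [
                math.tanh(
                    sum(w * xi for w, xi in zip(self.w_in[j], x))
                    + sum(w * hi for w, hi in zip(self.w_rec[j], h))
                )
                for j in range(self.hidden_dim)
            ]
        return h


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


def siamese_score(encoder, seq_a, seq_b):
    # The same encoder object (same weights) processes both sequences.
    return cosine(encoder.encode(seq_a), encoder.encode(seq_b))
```

Because both branches are the same object, any update to the weights during training affects both sides equally, and sequences of different lengths still map to encodings of the same size.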

Models as http micro-services 
— 
 
Components: Simeria (horizontal scale), Yenisei (vertical), model servers 
All native binaries - golang (simeria), c (yenisei & model servers) 
 
Identical provisioning for dev/prod (Ansible) and model hot-swap / roll-back with 0 downtime (TensorFlow Serving), AWS/Azure VMs.

Deployments at scale - opening Baikal VMs
[Deployment diagram. Labels: json, processing / search, simeria (horizontal, http), yenisei instances (vertical, http, grpc), each with several model servers (grpc); vector, candidates, LSH query.]

Search approximation take 1: random projection trees
We were forced to optimize this from day one. The problem is not high traffic in regular usage but large I/O spikes at ingestion: each customer potentially brings
1M+ resumes, which means 60M I/O requests (conversions/screenshots/etc), 100M queries (regressions, vectors, etc) and real-time search.
— 
 
Reduced number of lookups via hyperplanes: 
k random partitions of set elements using a suitable sim metric (e.g. cosine)
[Diagram: a dense input is looked up in several partitions, each returning a set of ids; sim is computed for the returned ids and the results are sorted into candidates.]
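
A random projection tree lookup can be sketched as follows. This is a simplified illustration, not the platform's implementation: hyperplanes are drawn at random through the origin (libraries such as ANNOY choose them differently), and `leaf_size`, `max_depth` and all function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def build_tree(ids, vectors, rng, leaf_size=10, depth=0, max_depth=16):
    """Recursively split ids with random hyperplanes through the origin."""
    if len(ids) <= leaf_size or depth >= max_depth:
        return {"ids": ids}
    dim = len(vectors[ids[0]])
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    left = [i for i in ids if dot(vectors[i], normal) < 0.0]
    right = [i for i in ids if dot(vectors[i], normal) >= 0.0]
    if not left or not right:
        # Degenerate split: retry with a fresh hyperplane, bounded by max_depth.
        return build_tree(ids, vectors, rng, leaf_size, depth + 1, max_depth)
    return {
        "normal": normal,
        "left": build_tree(left, vectors, rng, leaf_size, depth + 1, max_depth),
        "right": build_tree(right, vectors, rng, leaf_size, depth + 1, max_depth),
    }


def leaf_ids(tree, query):
    """Walk one tree: a single dot product per level instead of a full scan."""
    while "ids" not in tree:
        side = "left" if dot(query, tree["normal"]) < 0.0 else "right"
        tree = tree[side]
    return tree["ids"]


def search(trees, vectors, query, top_n=5):
    """Union the leaves reached in every tree, then rank them by exact cosine."""
    candidates = set()
    for tree in trees:
        candidates.update(leaf_ids(tree, query))
    ranked = sorted(candidates, key=lambda i: cosine(vectors[i], query), reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    rng = random.Random(42)
    vectors = {i: [rng.gauss(0.0, 1.0) for _ in range(8)] for i in range(200)}
    trees = [build_tree(list(vectors), vectors, rng) for _ in range(5)]
    print(search(trees, vectors, vectors[7], top_n=3))
```

Exact similarity is only computed for the ids found in the visited leaves, which is where the reduction in lookups comes from; using several trees raises recall at the cost of memory and build time.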

Random projection trees: issues
Good: good recall, fast queries
The bad: slow to generate
The ugly: memory usage

Locality sensitive hashing
Hashing functions generating identical hashes for similar (but not identical) input.
Various implementations for different distances: Hyperplane, Cross polytope (cosine), MinHash LSH (Jaccard), …

Survey:
https://arxiv.org/pdf/1408.2927.pdf

Alternatives.
We use Super-Bit LSH (internal variant, golang) but there’s a wide array of libraries readily available:
FALCONN, ANNOY, FLANN, RPFOREST, ..
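
The simplest member of this family, hyperplane LSH for cosine similarity, fits in a short sketch. Note that this is plain sign-random-projection hashing, not the Super-Bit variant mentioned above (which additionally orthogonalizes the projection vectors in batches), and the `HyperplaneLSH` class is a hypothetical illustration rather than the API of any listed library.

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Sign-random-projection LSH: one bit per random hyperplane."""

    def __init__(self, dim, n_bits, seed=0):
        rng = random.Random(seed)
        self.planes = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)
        ]
        self.buckets = defaultdict(list)

    def signature(self, vector):
        # Each bit records on which side of a hyperplane the vector falls.
        return tuple(
            1 if sum(p * x for p, x in zip(plane, vector)) >= 0.0 else 0
            for plane in self.planes
        )

    def add(self, item_id, vector):
        self.buckets[self.signature(vector)].append(item_id)

    def query(self, vector):
        # Only items sharing the exact signature are returned as candidates.
        return list(self.buckets.get(self.signature(vector), []))
```

Vectors separated by a small angle rarely fall on different sides of a random hyperplane, so they tend to share a signature and land in the same bucket; more bits give smaller buckets and lower recall, which is usually balanced by using several hash tables.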

The bigger picture
[Diagram: client -> resume file (S3 bucket path) -> http request (internal) to http://x.x.x.x/parser -> connemara (resume parsing, i/o, task orchestration) -> future(json).]

Supporting infrastructure (i/o & conversion)
[Diagram. doc->pdf: byte array, http post -> conversion service -> pdf byte array, storage. screenshot: string / byte array, http post -> screenshot service -> http response (zip with images), storage, json {“path”: …”, “url”: “… }. Several conversion service and screenshot service instances run side by side.]
Load balancing containerized services via Fabio
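
From a client's point of view, the "byte array, http post" step could look like the sketch below. The slides do not document the request format, so the raw-bytes body, the `application/octet-stream` content type, and the function names are all assumptions; the service URL is passed in rather than hard-coded.

```python
import urllib.request


def build_conversion_request(service_url: str, document: bytes) -> urllib.request.Request:
    """Build an HTTP POST carrying the raw document bytes (assumed wire format)."""
    return urllib.request.Request(
        service_url,
        data=document,
        headers={"Content-Type": "application/octet-stream"},  # assumption
        method="POST",
    )


def convert_to_pdf(service_url: str, document: bytes, timeout: float = 30.0) -> bytes:
    """Send the document to a conversion service and return the response body."""
    request = build_conversion_request(service_url, document)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Keeping request construction separate from sending makes the first function easy to check without a running service.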

Supporting infrastructure (i/o & conversion)
doc to pdf container: libreoffice, ramdisk, golang web server (iris)
Docker containers (http micro servers, golang)
Deployed via MESOS / Marathon
Mesos - kernel abstraction over a cluster
(exposes several machines as if they were one)
Marathon - Mesos init system
Discovery & http load balancing - Consul / Fabio
Conversion (document->pdf) service: http://convert.opening.io/doc-to-pdf
URL screenshots service: http://convert.opening.io/visitor
Conversion (pdf screenshots) service: http://convert.opening.io/pdf-to-img
Generic demo: http://engineering.opening.io/demo.html

Thank you. 
 
 
https://opening.io 
 
 
@openingdublin 
founders@opening.io 
 
 
25 Oxford Lane, Ranelagh 
Dublin, Ireland, European Union.
Catalyser programme
