Machine intelligence in HR technology: resume analysis at scale.



Similarity matching, resume processing and no-frills deployment of deep learning models
Matching jobs to people

—



We apply data science to large numbers of resumes in real time, telling recruiters who the most qualified candidates are for their job requirements and explaining why.
Resume processing and profile analysis

—



Opening scans through resume files and database candidate profiles to recommend the perfect candidates for any given raw job description by analyzing patterns in candidate history, weighing up skills and fetching candidate code & portfolios to support the decision.
A high-level overview of our platform is here:

https://speakerdeck.com/amorroxic/opening-dot-io-system-architecture
Quick overview: resume logic pipeline
[Diagram: input (doc->pdf; string / byte array) -> download -> read pdf -> resume text (text stream) -> tasks, each emitting json: feature extraction, topics extraction, education parser, elasticsearch percolator, regex (email, etc), salary regression, extra tasks, … (10 other tasks) -> combine]
Reactive streams - successive aggregation of state generated by specialized actors
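
To make the idea of successive state aggregation concrete, here is a minimal sketch using Python's asyncio in place of an actor system. The three extractors, their names and the regexes are hypothetical stand-ins, not the platform's actual tasks; the point is only that each task returns a partial json-like dict and a combine step folds the partial results into one state as they arrive.

```python
import asyncio
import json
import re

# Hypothetical extractors: each one plays the role of a specialized actor
# that reads the resume text and returns a partial JSON-like dict.
async def extract_emails(text: str) -> dict:
    return {"emails": re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)}

async def extract_topics(text: str) -> dict:
    # Placeholder keyword lookup standing in for a real topic model.
    known = {"scala", "python", "golang"}
    words = set(re.findall(r"[a-z+#]+", text.lower()))
    return {"topics": sorted(known & words)}

async def extract_education(text: str) -> dict:
    return {"education": re.findall(r"\b(BSc|MSc|PhD)\b", text)}

async def process_resume(text: str) -> dict:
    """Run all extractors concurrently and fold their outputs into one state."""
    tasks = [extract_emails(text), extract_topics(text), extract_education(text)]
    state: dict = {}
    # as_completed yields results in completion order, so the state grows
    # step by step as each task finishes (the "combine" stage).
    for finished in asyncio.as_completed(tasks):
        state.update(await finished)
    return state

if __name__ == "__main__":
    resume = "Jane Doe, MSc. Scala and Python developer. jane@example.com"
    print(json.dumps(asyncio.run(process_resume(resume)), sort_keys=True))
```

Because every task writes to its own key, the merge order does not matter, which is what allows the tasks to run independently.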
Information extraction flow
[Diagram: text -> regex -> links/emails/phones/etc (json); links -> array[url] -> screenshot tasks -> json; github link -> code extraction -> json; simeria http call -> experience vector, summary vector (json -> vec); results are combined (json) and written to the search index]
Async i/o & search index creation. Indexes (candidate vectors) generated/stored on-the-fly.
Matching pipeline
[Diagram: provided title / search -> job title -> neural parsing -> encoder network -> dense vector; job description -> neural parsing -> encoder network -> dense vector]
Matching jobs/candidates and people similar to each other in high volumes of resumes

—



All input encoded as dense vectors
Similarity = angular/cosine sim between sets of encodings
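
As a minimal sketch of the two metrics named above, the following computes cosine similarity between two dense vectors and maps the angle between them to an angular similarity in [0, 1]. The function names are illustrative, and how the production system aggregates over sets of encodings is not stated in the slides.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def angular_similarity(a, b):
    """Map the angle between a and b to [0, 1]: 1 = same direction, 0 = opposite."""
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))  # guard against rounding
    return 1.0 - math.acos(cos) / math.pi
```

Both functions ignore vector length, so a short and a long text that point in the same direction in the latent space score as identical.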
Real time queries
Fast matching - computing similarity over vast vector collections (x2)

—

Expensive to compute similarity metrics in real time -> k-nn approximations.

[Diagram: job title -> dense input -> random projection trees -> candidates; job description -> dense input -> random projection trees -> candidates; the two results are combined as A * x + B * (1-x), where x is the search bias]
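
The combination A * x + B * (1-x) can be sketched as a weighted blend of two score maps. This sketch assumes that A holds title-based similarities and B description-based ones, and that a candidate missing from one list scores 0 there; neither detail is stated in the slides.

```python
def blend_scores(title_scores, description_scores, x):
    """Combine two {candidate_id: similarity} maps as A * x + B * (1 - x).

    x is the search bias: 1.0 trusts only the first map, 0.0 only the second.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("search bias x must be in [0, 1]")
    ids = set(title_scores) | set(description_scores)
    blended = {
        cid: title_scores.get(cid, 0.0) * x
        + description_scores.get(cid, 0.0) * (1.0 - x)
        for cid in ids
    }
    # Highest blended score first.
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)
```

For example, `blend_scores({"a": 0.9, "b": 0.2}, {"a": 0.1, "b": 0.8}, x=1.0)` ranks candidate "a" first, while `x=0.0` ranks "b" first.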
PARSING: 

Multi-class, seq2seq, character-level output (dates / OOV names / ..)



SIMILARITY/ENCODERS: 

siamese networks



UP-SKILLING

model ensembles (input -> latent space -> salary regression -> sequences)



SUMMARIES

current area of research



We train multiple models for various contexts (jobs / resumes / ..) 

Encoding input and NLP models architecture
General considerations

—



Mostly seq2seq, siamese, attention architectures
Input is mostly word vectors - however at times we augment input features with ngrams / character-level information
Caution on word embedding
A potentially trivial example; ideally, models are trained on data specific to the particular problem domain

—



fastText (own corpus, 10gb)



“scala’s”, “java/c++/scala”, “java/scala”, “clojure”, .. 

similarity “scala” - “opera” = 0.17 (very syntax oriented)

fastText (own corpus, no character n-grams)



“kotlin”, “clojure”, “haskell”, “scala’s”, “f#”, .. 

similarity “scala” - “opera” = 0.14 (good)

fastText (facebook pre-trained vectors, en wiki)



“traviata”, “barbiere”, “teatro”, “verdi”, .. 

similarity “scala” - “opera” = 0.57 (very broad)

word2vec (own corpus)



“kotlin”, “clojure”, “haskell”, “f#”, .. 

similarity “scala” - “opera” = 0.05 (very specific)

[Diagram: neighbours of “scala”: syntactic bias (char n-grams in skipgram/cbow) vs. semantic bias (no char n-grams in skipgram/cbow)]
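
One way to see where the syntactic bias comes from is to look at the character n-grams themselves. The sketch below mimics fastText-style subword extraction (the word wrapped in "<" and ">", n-grams of length 3 to 6) and measures how many n-grams two words share. It is an illustration only; it does not train or load any embedding model, and the overlap measure is a hypothetical helper.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in boundary markers, fastText-style."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

def ngram_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two words."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)

if __name__ == "__main__":
    for other in ("scala's", "java/scala", "kotlin"):
        print(other, round(ngram_overlap("scala", other), 3))
```

Words such as "scala's" and "java/scala" share many n-grams with "scala", so a subword model pulls them together regardless of meaning, while "kotlin" shares none and can only be related through context.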
Similarity network architecture
Sequence encoders and similarity core

—



Recurrent networks sharing weights (siamese architecture)
[Diagram: two weight-sharing recurrent encoders. Sequence (a), “she loves data science”: inputs x1(a)..x4(a) -> hidden states h1(a)..h4(a). Sequence (b), “machine learning rocks”: inputs x1(b)..x3(b) -> hidden states h1(b)..h3(b). Both encoders feed the objective score.]
Input encoding derives from the trained sim network:
activations from the last dense layer before output.
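
A minimal, untrained sketch of the weight-sharing idea follows. It assumes a plain tanh recurrent cell, random stand-in word vectors and the final hidden state as the encoding; the slides only state that the real encoding is taken from the last dense layer before the output, so every dimension and name here is hypothetical.

```python
import math
import random

class SiameseEncoder:
    """One set of recurrent weights shared by both input branches."""

    def __init__(self, embed_dim=8, hidden_dim=6, seed=0):
        self.rng = random.Random(seed)
        self.embed_dim = embed_dim
        self.hidden_dim = hidden_dim
        self.embeddings = {}  # word -> vector, created lazily (stand-in for word vectors)
        self.w_in = [[self.rng.uniform(-0.5, 0.5) for _ in range(embed_dim)]
                     for _ in range(hidden_dim)]
        self.w_rec = [[self.rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
                      for _ in range(hidden_dim)]

    def _embed(self, word):
        if word not in self.embeddings:
            self.embeddings[word] = [self.rng.uniform(-1.0, 1.0)
                                     for _ in range(self.embed_dim)]
        return self.embeddings[word]

    def encode(self, sentence):
        """Return the final hidden state after reading the whole sequence."""
        h = [0.0] * self.hidden_dim
        for word in sentence.lower().split():
            x = self._embed(word)
            h = [
                math.tanh(
                    sum(w * xi for w, xi in zip(self.w_in[j], x))
                    + sum(w * hi for w, hi in zip(self.w_rec[j], h))
                )
                for j in range(self.hidden_dim)
            ]
        return h

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Both sentences pass through the same encoder object, i.e. the same weights.
encoder = SiameseEncoder()
vec_a = encoder.encode("she loves data science")
vec_b = encoder.encode("machine learning rocks")
score = cosine(vec_a, vec_b)  # untrained, so the value itself is meaningless
```

Training would adjust the shared weights so that the score is high for matching pairs and low otherwise; after training, `encode` alone is enough to turn any input into a dense vector.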
Models as http micro-services

—



Components: Simeria (horizontal scale), Yenisei (vertical), model servers

All native binaries - golang (Simeria), C (Yenisei & model servers)



Identical provisioning for dev/prod (Ansible), model hot-swap / roll-back with zero downtime (TensorFlow Serving), AWS/Azure VMs.
Deployments at scale - Opening Baikal VMs
[Diagram: json (processing / search) -> simeria (horizontal, http) -> yenisei instances (vertical, http / grpc), each in front of several model servers (grpc); responses: vector, candidates; LSH query]
Search approximation take 1: random projection trees
We were forced to optimize this from day one. The problem is not high traffic during regular usage but large I/O spikes at ingestion: each customer potentially has 1M+ resumes, which means 60M i/o requests (conversions/screenshots/etc), 100M queries (regressions, vectors, etc) and real-time search.

—



Reduced number of lookups via hyperplanes:

k random partitions of set elements using a suitable sim metric (e.g. cosine)
[Diagram: dense input -> several random partitions, each a bucket of ids -> sim computed against the ids in the matching buckets -> sort -> candidates]
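
The lookup reduction via random hyperplanes can be sketched as a small forest of random projection trees. This is a toy version under stated assumptions (Gaussian hyperplanes through the origin, a fixed leaf size, exact cosine only inside the matching leaves), not the implementation described in the talk; all names and parameters are illustrative.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def build_tree(ids, vectors, rng, leaf_size=4, depth=0, max_depth=12):
    """Recursively split ids by the side of a random hyperplane they fall on."""
    if len(ids) <= leaf_size or depth >= max_depth:
        return {"leaf": ids}
    dim = len(vectors[ids[0]])
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    left = [i for i in ids if dot(vectors[i], normal) < 0.0]
    right = [i for i in ids if dot(vectors[i], normal) >= 0.0]
    if not left or not right:  # degenerate split, stop here
        return {"leaf": ids}
    return {
        "normal": normal,
        "left": build_tree(left, vectors, rng, leaf_size, depth + 1, max_depth),
        "right": build_tree(right, vectors, rng, leaf_size, depth + 1, max_depth),
    }

def leaf_for(tree, query):
    """Walk down one tree, following the same side tests used at build time."""
    while "leaf" not in tree:
        tree = tree["left"] if dot(query, tree["normal"]) < 0.0 else tree["right"]
    return tree["leaf"]

def query_forest(forest, vectors, query, top_k=3):
    """Collect candidates from one leaf per tree, then rank them exactly."""
    candidates = set()
    for tree in forest:
        candidates.update(leaf_for(tree, query))
    ranked = sorted(candidates, key=lambda i: cosine(vectors[i], query), reverse=True)
    return ranked[:top_k]

# Toy data: 200 random 16-dimensional vectors indexed by 8 trees.
rng = random.Random(42)
vectors = {i: [rng.gauss(0.0, 1.0) for _ in range(16)] for i in range(200)}
forest = [build_tree(list(vectors), vectors, rng) for _ in range(8)]
```

A query only computes exact similarity against the ids in a handful of leaves instead of the whole collection. More trees raise recall, at the cost of build time and memory.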
Random projection trees: issues
Good.

—

Good recall, fast queries

The bad.

—

Slow to generate

The ugly.

—

Memory usage
Locality sensitive hashing

Hashing functions generating identical hashes for similar (but not identical) input.

Various implementations for different distances: Hyperplane, Cross polytope (cosine), MinHash LSH (Jaccard), …

Survey: https://arxiv.org/pdf/1408.2927.pdf
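
As an illustration of the hyperplane family for cosine distance, the sketch below hashes a vector to one bit per random hyperplane and groups ids by the resulting signature. It is plain hyperplane LSH with hypothetical function names, not the Super-Bit variant used in production.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplane normals of the given dimension."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of it the vector falls on."""
    bits = 0
    for plane in hyperplanes:
        side = sum(v * p for v, p in zip(vector, plane)) >= 0.0
        bits = (bits << 1) | int(side)
    return bits

def build_index(vectors, hyperplanes):
    """Group vector ids into buckets keyed by their signature."""
    buckets = defaultdict(list)
    for key, vec in vectors.items():
        buckets[signature(vec, hyperplanes)].append(key)
    return buckets
```

Vectors separated by a small angle rarely fall on different sides of a random hyperplane, so they tend to share a signature and land in the same bucket. A query then only needs to be compared against the ids stored under its own signature.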
Alternatives. 

—



We use Super-Bit LSH (internal variant, golang) but there’s a wide array of libraries readily available: 

FALCONN, ANNOY, FLANN, RPFOREST, ..
The bigger picture
[Diagram: client -> resume file (S3 bucket path) -> http request (internal) to http://x.x.x.x/parser -> connemara (resume parsing, i/o, task orchestration) -> future(json)]

Supporting infrastructure (i/o & conversion)
[Diagram: doc->pdf: byte array -> http post -> conversion service -> pdf byte array -> storage; screenshot: string / byte array -> http post -> screenshot service -> http response (zip with images) -> storage -> json {“path”: …”, “url”: “… }]
Load balancing containerized services via Fabio
[Diagram: pdf byte array -> conversion service, conversion service, screenshot service]
Supporting infrastructure (i/o & conversion)
[Diagram: doc to pdf container: libreoffice, ramdisk, golang web server (iris)]
Docker containers (http micro servers, golang)
Deployed via Mesos / Marathon
Mesos - kernel abstraction over a cluster
(exposes several machines as if they were one)
Marathon - Mesos init system

Discovery & http load balancing - Consul / Fabio

Conversion (document->pdf) service: http://convert.opening.io/doc-to-pdf
URL screenshots service: http://convert.opening.io/visitor
Conversion (pdf screenshots) service: http://convert.opening.io/pdf-to-img

Generic demo: http://engineering.opening.io/demo.html
Thank you.





https://opening.io





@openingdublin

founders@opening.io





25 Oxford Lane, Ranelagh

Dublin, Ireland, European Union.
Catalyser programme

Machine intelligence in HR technology: resume analysis at scale - Adrian Mihai
