"Into the Wild" …with
Natural Language Processing
and Text Classification
Data Natives 2015
19.11.2015 - Peter Grosskopf
Hey, I’m Peter.
Developer (mostly Ruby), Founder (of Zweitag)
Chief Development Officer @ HitFox Group
Department "Tech & Development" (TechDev)
Company Builder with 500+
employees
in AdTech, FinTech and Big Data
Company Builder =
💡Ideas + 👥People
How do we select the best people out of more than 1000
applications every month in a consistent way?
? ? ?
Machine Learning?
Yeah! I found a solution…
Not really 💩
Our Goal
Add a sort-by-relevance ranking to lower the screening costs and invite people faster
Let’s Go!
Action Steps
1. Prepare the textual data
2. Build a model to classify the data
3. Run it!
4. Display and interpret the results
1. Prepare
Load data
Remove outliers
Remove stopwords (language detection + stemming with NLTK)
Define classes for workflow states
Link data
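The deck says NLTK was used for the stopword and stemming step. A minimal pure-Python sketch of that idea, with a hypothetical toy stopword list and a crude suffix stripper standing in for a real NLTK stemmer:

```python
# Toy stand-in for the preparation step. The stopword list and the
# suffix stripper are hypothetical simplifications of what NLTK does.
STOPWORDS = {"i", "am", "a", "the", "and", "with", "of"}

def crude_stem(token):
    # naive suffix stripping, a rough stand-in for a real stemmer
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def prepare(text):
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(prepare("I am testing the stemming of words"))
```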
2. Build a model
tf-idf / bag of words
tf: term frequency
idf: inverse document frequency
Transform / quantization:
from textual form to a numerical vector
I am a nice little text
-> v(i, am, a, nice, little, text)
-> v(tf*idf, tf*idf, tf*idf, tf*idf, tf*idf, tf*idf)
term-frequency (tf)
Count occurrences in document
I am a nice little text
-> v(i, am, a, nice, little, text)
-> v(1*idf, 1*idf, 1*idf, 1*idf, 1*idf, 1*idf)
inverse document
frequency (idf)
Count how many documents in the whole set contain the term and invert with the logarithm
d1(I play a fun game)
-> v1(i, play, a, fun, game)
d2(I am a nice little text)
-> v2(i, am, a, nice, little, text)
-> v2(1*log(2/2), 1*log(2/1), 1*log(2/2), …)
-> v2(0, 0.3, 0, 0.3, 0.3, 0.3)
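The arithmetic on this slide can be checked in a few lines of Python (using log base 10, which matches the 0.3 ≈ log10(2) values above):

```python
import math

docs = [
    ["i", "play", "a", "fun", "game"],           # d1
    ["i", "am", "a", "nice", "little", "text"],  # d2
]

def idf(term):
    # inverse document frequency: log10(N / df),
    # df = number of documents containing the term
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df)

def tfidf(doc):
    # tf is the raw count of the term in the document
    return [doc.count(t) * idf(t) for t in doc]

weights = [round(w, 1) for w in tfidf(docs[1])]
print(weights)  # terms shared by both documents get weight 0
```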
bag of words
A simple approach that counts the frequencies of relevant terms
Ignores contextual information 😢
Better: n-grams
n-grams
Generate new tokens by concatenating neighboring tokens
Example (1- and 2-grams): (nice, little, text)
-> (nice, nice_little, little, little_text, text)
-> From three tokens we just generated five.
Example 2 (1- and 2-grams): (new, york, is, a, nice, city)
-> (new, new_york, york, york_is, is, is_a, a, a_nice, nice, nice_city, city)
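The n-gram expansion above is straightforward to implement; this sketch emits all unigrams first and then all bigrams, so the order differs from the interleaved order on the slide but the resulting token set is the same:

```python
def ngrams(tokens, n_max):
    """All 1- to n_max-grams, joined with underscores as on the slide."""
    out = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append("_".join(tokens[i:i + n]))
    return out

print(ngrams(["nice", "little", "text"], 2))
```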
Vectorize the resumes
Build 1- to 4-grams with scikit-learn's TfidfVectorizer
Define runtime
Train/test split by date (80/20)
Approach:
Randomly pick CVs from the test group
Count how many CVs have to be screened to find all the good ones
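A date-based 80/20 split (rather than a random one) just means sorting by application date and cutting at the 80 % mark. A sketch with hypothetical records:

```python
from datetime import date

# hypothetical (application_date, cv_text, label) records
records = [(date(2015, m, 1), f"cv {m}", m % 2) for m in range(1, 11)]

# sort by date, then take the first 80 % for training, the rest for testing
records.sort(key=lambda r: r[0])
cut = int(len(records) * 0.8)
train, test = records[:cut], records[cut:]
```

Splitting by date avoids leaking future applications into the training set, which matches how the model would be used in production.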
3. Run it!
After the resumes are transformed to vector form, the classification is done with a classical statistical machine learning model (e.g. multinomial naive Bayes, a stochastic gradient descent classifier, logistic regression, or a random forest)
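The whole pipeline, vectorizer plus one of the classical models the deck lists, fits in a few lines of scikit-learn. The CVs and labels below are hypothetical toy data; logistic regression is used here for brevity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical screened CVs (label 1 = invited, 0 = rejected)
texts = [
    "senior python developer, machine learning background",
    "data scientist with python and statistics experience",
    "retail assistant, no relevant technical skills",
    "short note without any qualifications listed",
]
labels = [1, 1, 0, 0]

# vectorize, then classify; predict_proba gives a relevance score per CV
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
scores = model.predict_proba(texts)[:, 1]
```

Sorting applications by these scores is exactly the "sort by relevance" the deck sets as its goal.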
4. Results
Generated with a combination of a stochastic gradient descent classifier and logistic regression, using the Python machine learning library scikit-learn
AUC: 73.0615 %
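AUC has a handy interpretation for this use case: the probability that a randomly chosen relevant CV is scored above a randomly chosen irrelevant one. A pure-Python check of that definition:

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # 0.75
```

An AUC of 73 % therefore means that in roughly 73 of 100 such random pairs, the model ranks the good CV higher.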
Wrap Up
1. Prepare: import data, clean data
2. Build Model: vectorize the CVs with 1- to 4-grams, define the train/test split
3. Run: choose a machine learning model, run it!
4. Interpret: visualize results, area under curve (AUC)
Conclusion
After trying many different approaches (doc2vec, Recurrent Neural Networks, Feature Hashing), bag of words is still the best
Explanation: CV documents do not carry much semantic structure
Outlook
Build a better database
Experiment with new approaches
and tune models
Build a continuous learning model
Happy End.
Thanks :-)
