"Into the Wild" …with
Natural Language Processing
and Text Classification
Data Natives 2015
19.11.2015 - Peter Grosskopf
Hey, I’m Peter.
Developer (mostly Ruby), Founder (of Zweitag)
Chief Development Officer @ HitFox Group
Department "Tech & Development" (TechDev)
Company Builder with 500+
employees
in AdTech, FinTech and Big Data
Company Builder =
💡Ideas + 👥People
How do we select the best people out of more than 1000
applications every month in a consistent way?
? ? ?
Machine Learning?
Yeah! I found a solution…
Not really 💩
Our Goal
Add a sort-by-relevance ranking to lower the screening costs and invite people faster
Let’s Go!
Action Steps
1. Prepare the textual data
2. Build a model to classify the data
3. Run it!
4. Display and interpret the results
1. Prepare
Load data
Remove outliers
Remove stopwords (language detection + stemming with NLTK)
Define classes for workflow states
Link data
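The deck says NLTK was used for the stopword and stemming step. A minimal pure-Python sketch of that idea, with a hypothetical toy stopword list and a crude suffix stripper standing in for a real NLTK stemmer:

```python
# Toy stand-in for the preparation step. The stopword list and the
# suffix stripper are hypothetical simplifications of what NLTK does.
STOPWORDS = {"i", "am", "a", "the", "and", "with", "of"}

def crude_stem(token):
    # naive suffix stripping, a rough stand-in for a real stemmer
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def prepare(text):
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(prepare("I am testing the stemming of words"))
```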
2. Build a model
tf-idf / bag of words
tf: term frequency
idf: inverse document frequency
Transform / quantization:
from textual form to a numerical vector
I am a nice little text
-> v(i, am, a, nice, little, text)
-> v(tf*idf, tf*idf, tf*idf, tf*idf, tf*idf, tf*idf)
term-frequency (tf)
Count occurrences in document
I am a nice little text
-> v(i, am, a, nice, little, text)
-> v(1*idf, 1*idf, 1*idf, 1*idf, 1*idf, 1*idf)
inverse document
frequency (idf)
Count how many documents in the whole set contain the term and invert with the logarithm
d1(I play a fun game)
-> v1(i, play, a, fun, game)
d2(I am a nice little text)
-> v2(i, am, a, nice, little, text)
-> v2(1*log(2/2), 1*log(2/1), 1*log(2/2), …)
-> v2(0, 0.3, 0, 0.3, 0.3, 0.3)
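The arithmetic on this slide can be checked in a few lines of Python (using log base 10, which matches the 0.3 ≈ log10(2) values above):

```python
import math

docs = [
    ["i", "play", "a", "fun", "game"],           # d1
    ["i", "am", "a", "nice", "little", "text"],  # d2
]

def idf(term):
    # inverse document frequency: log10(N / df),
    # df = number of documents containing the term
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df)

def tfidf(doc):
    # tf is the raw count of the term in the document
    return [doc.count(t) * idf(t) for t in doc]

weights = [round(w, 1) for w in tfidf(docs[1])]
print(weights)  # terms shared by both documents get weight 0
```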
bag of words
A simple approach that counts the frequencies of relevant terms
Ignores contextual information 😢
Better: n-grams
n-grams
Generate new tokens by concatenating neighboring tokens
Example (1- and 2-grams): (nice, little, text)
-> (nice, nice_little, little, little_text, text)
-> From three tokens we just generated five.
Example 2 (1- and 2-grams): (new, york, is, a, nice, city)
-> (new, new_york, york, york_is, is, is_a, a, a_nice, nice, nice_city, city)
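The n-gram expansion above is straightforward to implement; this sketch emits all unigrams first and then all bigrams, so the order differs from the interleaved order on the slide but the resulting token set is the same:

```python
def ngrams(tokens, n_max):
    """All 1- to n_max-grams, joined with underscores as on the slide."""
    out = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append("_".join(tokens[i:i + n]))
    return out

print(ngrams(["nice", "little", "text"], 2))
```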
Vectorize the resumes
Build 1- to 4-grams with scikit-learn's TfidfVectorizer
Define runtime
Train/test split by date (80/20)
Approach:
Randomly pick CVs from the test group
Count how many CVs have to be screened to find all the good ones
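A date-based 80/20 split (rather than a random one) just means sorting by application date and cutting at the 80 % mark. A sketch with hypothetical records:

```python
from datetime import date

# hypothetical (application_date, cv_text, label) records
records = [(date(2015, m, 1), f"cv {m}", m % 2) for m in range(1, 11)]

# sort by date, then take the first 80 % for training, the rest for testing
records.sort(key=lambda r: r[0])
cut = int(len(records) * 0.8)
train, test = records[:cut], records[cut:]
```

Splitting by date avoids leaking future applications into the training set, which matches how the model would be used in production.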
3. Run it!
After the resumes are transformed to vector form, the classification is done with a classical statistical machine learning model (e.g. multinomial naive Bayes, a stochastic gradient descent classifier, logistic regression, or a random forest)
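The whole pipeline, vectorizer plus one of the classical models the deck lists, fits in a few lines of scikit-learn. The CVs and labels below are hypothetical toy data; logistic regression is used here for brevity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical screened CVs (label 1 = invited, 0 = rejected)
texts = [
    "senior python developer, machine learning background",
    "data scientist with python and statistics experience",
    "retail assistant, no relevant technical skills",
    "short note without any qualifications listed",
]
labels = [1, 1, 0, 0]

# vectorize, then classify; predict_proba gives a relevance score per CV
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
scores = model.predict_proba(texts)[:, 1]
```

Sorting applications by these scores is exactly the "sort by relevance" the deck sets as its goal.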
4. Results
Generated with a combination of a stochastic gradient descent classifier and logistic regression, using the Python machine learning library scikit-learn
AUC: 73.0615 %
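AUC has a handy interpretation for this use case: the probability that a randomly chosen relevant CV is scored above a randomly chosen irrelevant one. A pure-Python check of that definition:

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # 0.75
```

An AUC of 73 % therefore means that in roughly 73 of 100 such random pairs, the model ranks the good CV higher.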
Wrap Up
1. Prepare: import data, clean data
2. Build Model: vectorize the CVs with 1- to 4-grams, define the train/test split
3. Run: choose a machine learning model, run it!
4. Interpret: visualize results, area under curve (AUC)
Conclusion
After trying many different approaches (doc2vec, Recurrent Neural Networks, Feature Hashing), bag of words is still the best
Explanation: CV documents do not carry much semantic structure
Outlook
Build a better database
Experiment with new approaches
and tune models
Build a continuous learning model
Happy End.
Thanks :-)
