Marko Smiljanić,
NIRI Intelligent Computing Ltd,
CEO
Developing and validating
a document classifier:
a real-life story
www.niri-ic.com
About us.
 NIRI: 10 years in Intelligent Computing
 Text Mining
 Knowledge Discovery and Management
 All about Data Science
About me.
My role
The flow
 Business Context
 The Challenge
 The Solution
 Effectiveness
 Laboratory measurements
 Impact estimation
 Reality
 Wrap up
Business context
 Largest clients include
 Public Employment Services in EU, USA, and Asia
 Staffing companies in EU, USA
[Diagram: ELISE Platform, Job seekers, Job Taxonomy, Skill Taxonomy]
[Diagram: Job Taxonomy, Document Classification]
Occupation Taxonomies
 ISCO (International Standard Classification of Occupations)
 ESCO
 O*NET
 and many more
ISCO level 1 (10)
ISCO level 2 (42)
ISCO level 3 (124)
ISCO level 4 (400)
ESCO level 5 (5000)
“Delivery service worker”
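The level structure above can be read directly off a code when the taxonomy uses digit-prefix codes, as ISCO does (one digit per level, four levels). A minimal sketch, assuming 4-digit ISCO-style codes; the function name and the sample code are illustrative and not taken from the client's system.

```python
def isco_ancestors(code: str) -> list[str]:
    """Return the code's prefix at each ISCO level, from level 1 to level 4."""
    if len(code) != 4 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected a 4-digit ISCO-style code, got {code!r}")
    return [code[:level] for level in range(1, 5)]

# Illustrative 4-digit code; prints ['9', '96', '962', '9621']
print(isco_ancestors("9621"))
```

Two vacancies whose codes share a longer prefix sit closer together in the taxonomy, which is one way to judge how far off a wrong code is.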
Challenges (for humans)
 Knowing the taxonomy
 Ambiguous taxonomy
 Hybrid positions
 Vague vacancy
Client’s situation in 2014
[Flowchart: Vacancy Aggregator and Classifier, then "Correct code?"; if OK, Publish; if NO, "Repair code!"]
 Volume: 2000-4000 vacancies per day (into >2000 taxonomy classes)
 No code: 12%
 Correct code? OK: 65%, NO: 23%
 Repair code: OK: 9%, no help: 14%
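The percentages in the flowchart add up to the three outcome shares used later in the talk (Verified, Not verified, No class). A small arithmetic check; the variable names and the mapping of branches to outcomes are one reading of the chart, not something the slides state explicitly.

```python
# Shares of the daily vacancy stream, in percent, as shown in the 2014 flowchart.
ok_first_check = 65   # code confirmed as correct
needs_repair = 23     # code judged wrong ("NO")
no_code = 12          # aggregator assigned no code

repaired_ok = 9       # repair produced an accepted code
repair_no_help = 14   # repair did not help

# The first split covers all vacancies; the repair split covers the "NO" branch.
assert ok_first_check + needs_repair + no_code == 100
assert repaired_ok + repair_no_help == needs_repair

verified = ok_first_check + repaired_ok   # 74
not_verified = repair_no_help             # 14
print(verified, not_verified, no_code)    # 74 14 12
```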
The Solution:
NIRI will build you a better classifier
[Diagram: Vacancy Aggregator and Classifier, NIRI Classifier, Publish; 2000-4000 per day]
Really?
How accurate will it be?
How will it fit our process?
Really. We will (try to):
 Reduce manual effort
 Increase volume
 Improve final accuracy
But you need to give us training data
> 1M vacancies
 No class: 12%
 Not verified: 14%
 Verified: 74%
Long tail effect
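A long tail here means that a few occupation codes account for most vacancies while many codes have very few training examples. One way to quantify it is sketched below with made-up labels; the codes, the counts, and the `min_examples` threshold are all hypothetical.

```python
from collections import Counter

def tail_summary(labels, min_examples):
    """Return (rare classes, all classes, share of vacancies in rare classes)."""
    counts = Counter(labels)
    rare = {code: n for code, n in counts.items() if n < min_examples}
    return len(rare), len(counts), sum(rare.values()) / len(labels)

# Invented training labels: one frequent code and two rare ones.
labels = ["5223"] * 6 + ["9621"] * 3 + ["2512"]
print(tail_summary(labels, min_examples=5))   # (2, 3, 0.4)
```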
Architecture of our solution
[Diagram: Vacancy Classifier, consisting of Feature Extractor, Classifier 1, Classifier 2, …, Classifier N, and Negotiator, with External Services; input: Vacancy; output: [Class, Confidence]+]
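The slide does not say how the Negotiator combines the individual classifiers. As one possible reading, the sketch below averages the confidences that the N classifiers assign to each code and returns a ranked [Class, Confidence]+ list. The averaging rule, the function name, and the sample codes are assumptions made for illustration.

```python
from collections import defaultdict

def negotiate(predictions, top_k=5):
    """Merge per-classifier {code: confidence} dicts into a ranked list.

    A code missing from a classifier's output counts as confidence 0.0
    for that classifier, so each score is a plain average over classifiers.
    """
    if not predictions:
        return []
    totals = defaultdict(float)
    for scores in predictions:
        for code, confidence in scores.items():
            totals[code] += confidence
    n = len(predictions)
    ranked = sorted(((code, total / n) for code, total in totals.items()),
                    key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

outputs = [
    {"9621": 0.8, "8322": 0.2},   # hypothetical classifier 1
    {"9621": 0.6, "4412": 0.4},   # hypothetical classifier 2
]
print(negotiate(outputs))
```

Other combination rules (weighted votes, a trained meta-classifier) would fit the same interface; averaging is used here only because it is the simplest to follow.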
What to do with confidence?
[Diagram: a list of (Vacancy, Code, Confidence) results ordered by CONFIDENCE, from high accuracy to low accuracy; one part goes to Batch Processing, the rest is to check manually]
Using confidence
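The idea on this slide is that accuracy rises with confidence, so vacancies above some confidence cut-off can go to batch processing and the rest to manual checking. A minimal sketch; the 0.9 threshold and the record layout are assumptions, and in practice the cut-off would be chosen from the accuracy measured in each confidence band.

```python
def route_by_confidence(items, threshold=0.9):
    """Split (vacancy_id, code, confidence) tuples into batch and manual queues."""
    batch, manual = [], []
    for item in items:
        _, _, confidence = item
        (batch if confidence >= threshold else manual).append(item)
    return batch, manual

# Invented classifier outputs.
classified = [
    ("v1", "9621", 0.97),
    ("v2", "2512", 0.55),
    ("v3", "5223", 0.91),
]
batch, manual = route_by_confidence(classified)
print(len(batch), "for batch processing,", len(manual), "to check manually")
```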
Measuring accuracy in the laboratory
 Corpus: No class 12%, Not verified 14%, Verified 74%
 Split: Train 80%, Test 20%, x 5
 Vacancy Classifier results on the test data: No class, Incorrect, Correct
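An 80%/20% train/test split repeated five times is consistent with 5-fold cross-validation, where each vacancy is used for testing exactly once. The standard-library sketch below covers the splitting step only; the shuffling seed and the function name are assumptions, and classifier training is left out.

```python
import random

def five_fold_splits(items, folds=5, seed=0):
    """Yield (train, test) lists; each item lands in exactly one test fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    for k in range(folds):
        test = shuffled[k::folds]   # every folds-th item, starting at offset k
        train = [x for i, x in enumerate(shuffled) if i % folds != k]
        yield train, test

vacancies = list(range(100))   # stand-ins for vacancies in the corpus
for train, test in five_fold_splits(vacancies):
    print(len(train), len(test))   # 80 20, five times
```

The accuracy figures from the five test folds would then be averaged to give one laboratory measurement.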
ONE OF MANY LABORATORY MEASUREMENTS
             CORPUS   CLASSIFIER   CLASSIFIER 100   CLASSIFIER 1000
Correct      74%      78%          80%              85%
Incorrect    14%      13%          12%              10%
No class     12%      9%           8%               5%
Measuring accuracy in the laboratory
 Original Classifier: No class 12%, Not verified 14%, Verified 74%
 Vacancy Classifier: No class 9%, Incorrect 13%, Correct 78%
This is not reality
 Biased train/test set
 Accuracy of test set unknown
 Inability to test against 26%
Remember the process?
[Flowchart repeated from "Client’s situation in 2014": OK 65%, NO 23% (repair OK 9%, no help 14%), no code 12%]
This is what it actually looks like.
[Diagram: Check, Repair]
We will
 Reduce manual effort
 Increase volume
 Improve final accuracy
And we proposed this one.
[Diagram: Bulk Accept, Check, Repair]
Best/worst case analysis,
some manual validation,
careful assumptions:
[Diagram: Bulk Accept, Check, Repair]
Impact estimation showed that:
 Step 1 effort reduction 60% (due to bulk acceptance)
 Step 2 effort reduction 11% (due to bulk acceptance and top 5 offers)
 Significant published volume increase (almost to 100%)
 Accuracy slightly higher (+1%, to around 92%)
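A best/worst case analysis of this kind can be sketched as bounds on the accuracy of the published set: checked vacancies are taken at their verified accuracy, while bulk-accepted ones are unverified, so their accuracy is only known to lie in a range. The formula is a plain weighted average, and all input numbers below are hypothetical placeholders, not the client's figures.

```python
def published_accuracy_bounds(bulk_share, bulk_acc_low, bulk_acc_high, checked_acc):
    """Return (worst, best) accuracy of the published set.

    bulk_share: fraction of published vacancies accepted without a manual check.
    bulk_acc_low, bulk_acc_high: assumed accuracy range for bulk-accepted vacancies.
    checked_acc: accuracy of vacancies that went through Check/Repair.
    """
    checked_share = 1.0 - bulk_share
    worst = bulk_share * bulk_acc_low + checked_share * checked_acc
    best = bulk_share * bulk_acc_high + checked_share * checked_acc
    return worst, best

# Hypothetical inputs chosen only to make the arithmetic easy to follow.
worst, best = published_accuracy_bounds(
    bulk_share=0.5, bulk_acc_low=0.9, bulk_acc_high=1.0, checked_acc=0.9)
print(f"{worst:.2f} to {best:.2f}")   # 0.90 to 0.95
```

Some manual validation of a bulk-accepted sample is what narrows the gap between the two bounds.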
[Chart: No class 12%, Not verified 14%, Verified 74%]
How can we measure production accuracy?
We cannot, unless…
Golden Test Set
How was it built?
[Diagram: Vacancy Classifier, Published, Check & Repair with the four-eyes principle; each item shows the Original Code & Top 5 VC codes]
Every single classification was marked as either Correct, Acceptable, or Wrong
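With every classification in the golden test set marked Correct, Acceptable, or Wrong, a classifier's score is a simple tally. The sketch below uses a handful of invented marks. It also reports Correct and Acceptable together, which is an assumption about how one might summarize the marks and not necessarily how the published results were computed.

```python
from collections import Counter

def golden_set_rates(marks):
    """Return the share of Correct marks and of Correct-or-Acceptable marks."""
    counts = Counter(marks)
    total = len(marks)
    correct = counts["Correct"] / total
    correct_or_acceptable = (counts["Correct"] + counts["Acceptable"]) / total
    return correct, correct_or_acceptable

# Invented reviewer marks for five classifications.
marks = ["Correct", "Correct", "Acceptable", "Wrong", "Correct"]
print(golden_set_rates(marks))   # (0.6, 0.8)
```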
Results
GOLDEN TEST SET RESULTS
              CURRENT   NIRI VC   CURRENT (HQ SOURCE)   NIRI VC (HQ SOURCE)
Correct       63.05%    73.91%    72.06%                74.38%
Acceptable    65.98%    77.56%    76.25%                78.69%
HQ SOURCE: Highest Quality Source (Training)
Wrap up
 Clean semantic data, in real life, can only be a myth. We are looking into
data cleansing approaches.
 Measuring usefulness can be hard and expensive, but …
 … it can and must be monitored after the system is deployed.
It changes over time. Continuous learning, where possible, is a great thing.
 1) Implementing a state-of-the-art machine learning algorithm is one thing.
2) Making it useful is another.
3) Explaining that to the end user is the third.
 NIRI is a very cool company to work with!
I hope you liked the story, and I thank you for your attention.

Editor's Notes

  • #11: FIXME Write about challenges: size, ambiguity, semantics, multi-faceted vacancies