Jeff Fried
CTO
BA Insight
@jefffried
#tbc2016
Rules-Based vs. Document-Based Bake-off
AutoClassification - Rules versus Machine Learning
About Jeff Fried
Longtime Search Nerd
Focused on Search and SharePoint since 2004
• CTO, BA Insight
• Senior PM, Microsoft
• VP, FAST
• SVP, LingoMotors
Passionate About
• Search
• SharePoint
• Search-driven applications
• Information Strategy
Blog: BAinsight.com/blog
Technet Column: “A View from the Crawlspace”
jeff.fried@bainsight.com
About BA Insight


– Connectivity
– Applications
– Classification
– Analytics

Metadata Drives Great User Experiences
Documents from many sources
All client or matter-relevant documents are integrated.
Rich Metadata
Content annotated automatically: concepts, categories, citations, matters, clients, etc.
Navigation Controls
Explore, Discover, Drill-down
Manual Tagging is impractical and remarkably inconsistent
Automation
Called: AutoClassification, AutoTagging, Metadata Generation, Text Analytics, etc.
Complicators






Common Techniques across Applications
Rules-based Approach
Enhanced Content: enriched with metadata and content types
Search, Visualization, Workflow
Name          | Blood Type | Give Birth | Can Fly | Live in Water | Class
human         | warm       | yes        | no      | no            | mammals
python        | cold       | no         | no      | no            | reptiles
salmon        | cold       | no         | no      | yes           | fishes
whale         | warm       | yes        | no      | yes           | mammals
frog          | cold       | no         | no      | sometimes     | amphibians
komodo        | cold       | no         | no      | no            | reptiles
bat           | warm       | yes        | yes     | no            | mammals
pigeon        | warm       | no         | yes     | no            | birds
cat           | warm       | yes        | no      | no            | mammals
leopard shark | cold       | yes        | no      | yes           | fishes
turtle        | cold       | no         | no      | sometimes     | reptiles
penguin       | warm       | no         | no      | sometimes     | birds
porcupine     | warm       | yes        | no      | no            | mammals
eel           | cold       | no         | no      | yes           | fishes
salamander    | cold       | no         | no      | sometimes     | amphibians
gila monster  | cold       | no         | no      | no            | reptiles
platypus      | warm       | no         | no      | no            | mammals
owl           | warm       | no         | yes     | no            | birds
dolphin       | warm       | yes        | no      | yes           | mammals
eagle         | warm       | no         | yes     | no            | birds
Rule-based Classifier (Example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
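Rules R1 to R5 can be sketched as an ordered rule list in Python. This is a minimal illustration, not the engine from the talk: the first-match policy, the field names (`give_birth`, `can_fly`, and so on), and the helper names are assumptions made for the example.

```python
from typing import Callable, Optional

Record = dict[str, str]

# Rules R1-R5 from the slide, kept in slide order: (name, condition, class).
RULES: list[tuple[str, Callable[[Record], bool], str]] = [
    ("R1", lambda r: r["give_birth"] == "no" and r["can_fly"] == "yes", "Birds"),
    ("R2", lambda r: r["give_birth"] == "no" and r["live_in_water"] == "yes", "Fishes"),
    ("R3", lambda r: r["give_birth"] == "yes" and r["blood_type"] == "warm", "Mammals"),
    ("R4", lambda r: r["give_birth"] == "no" and r["can_fly"] == "no", "Reptiles"),
    ("R5", lambda r: r["live_in_water"] == "sometimes", "Amphibians"),
]


def matching_rules(record: Record) -> list[str]:
    """Names of every rule whose condition holds for the record."""
    return [name for name, condition, _ in RULES if condition(record)]


def classify(record: Record) -> Optional[str]:
    """Class of the first matching rule (assumed policy), or None if no rule fires."""
    for _, condition, label in RULES:
        if condition(record):
            return label
    return None


def animal(blood_type: str, give_birth: str, can_fly: str, live_in_water: str) -> Record:
    """Hypothetical helper that builds a record from one table row."""
    return {
        "blood_type": blood_type,
        "give_birth": give_birth,
        "can_fly": can_fly,
        "live_in_water": live_in_water,
    }


# Rows taken from the table on the slide.
pigeon = animal("warm", "no", "yes", "no")
frog = animal("cold", "no", "no", "sometimes")
leopard_shark = animal("cold", "yes", "no", "yes")

for name, record in [("pigeon", pigeon), ("frog", frog), ("leopard shark", leopard_shark)]:
    print(name, matching_rules(record), classify(record))
```

The three rows show the usual rule-set problems: the pigeon triggers exactly one rule, the frog triggers both R4 and R5 (so the result depends on how conflicts are resolved), and the leopard shark triggers none (so a default class or a fallback is needed).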
Example Rules Engine UI
Examples of Rules
Boolean
• "IT" OR "Information Technology" OR "MIS"
• ("Expert" OR "Witness") NOT "police"
• "New York" AND "environmental policy"
• *work
• "legal" -briefs
• "Legal" NEAR(5) "issue"
Property-based
• filetype:docx
• title:"2029 L.P" or title:2030
• footer="BA Insight Confidential" or footer:proprietary or footer:BA*
Overriding/changing Linguistics
• NOSTEM("illumination")
• CASE("prerequisites")
• SOUNDLIKE("prerech")
Regular expressions
• title:REGEX([0-4])
• REGEX("\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))")
Controlling scores & thresholds
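To make the Boolean and proximity examples concrete, here is a small Python sketch of how such rules could be evaluated against text. It does not reproduce any product's rule syntax or semantics: the tokenizer, the function names, and the reading of NEAR(5) as "within five token positions" are assumptions for illustration.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def has_phrase(text: str, phrase: str) -> bool:
    """True if the tokens of `phrase` occur consecutively in `text`."""
    words, target = tokenize(text), tokenize(phrase)
    if not target:
        return False
    span = len(target)
    return any(words[i:i + span] == target for i in range(len(words) - span + 1))


def near(text: str, first: str, second: str, distance: int) -> bool:
    """True if single-word terms `first` and `second` occur within `distance` tokens."""
    words = tokenize(text)
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= distance for i in first_positions for j in second_positions)


def expert_witness_rule(text: str) -> bool:
    """("Expert" OR "Witness") NOT "police"."""
    return (has_phrase(text, "Expert") or has_phrase(text, "Witness")) and not has_phrase(text, "police")


def new_york_policy_rule(text: str) -> bool:
    """"New York" AND "environmental policy"."""
    return has_phrase(text, "New York") and has_phrase(text, "environmental policy")


print(expert_witness_rule("The expert witness testified on Monday"))
print(expert_witness_rule("A police expert arrived"))
print(near("The legal team raised a new issue", "legal", "issue", 5))
```

Real engines typically add stemming, scoring, and thresholds on top of this kind of matching, which is where the linguistic overrides and score controls listed above come in.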
Taxonomy Management is often included with Auto-Classification Tools





Where do you get Taxonomies?
Semantics! Machine Learning! AI!
Key Concepts




False positives vs. false negatives
Look at the impact of each in your context
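One common way to quantify this trade-off is precision (how many assigned tags were correct, which false positives hurt) and recall (how many correct tags were found, which false negatives hurt). The sketch below is illustrative only; the function name and the sample labels are made up for the example.

```python
def precision_recall(actual: list[str], predicted: list[str], label: str) -> tuple[float, float]:
    """Precision and recall for one label, computed from paired actual/predicted tags."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == label and p == label)
    false_pos = sum(1 for a, p in pairs if a != label and p == label)   # tagged, but wrongly
    false_neg = sum(1 for a, p in pairs if a == label and p != label)   # should have been tagged
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical tags for five documents.
actual = ["legal", "legal", "hr", "hr", "legal"]
predicted = ["legal", "hr", "legal", "legal", "legal"]

precision, recall = precision_recall(actual, predicted, "legal")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Which number matters more depends on the context: a records-retention tag may need high recall, while a tag that drives navigation may need high precision.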
Machine Learning Approach
Example: identify people as good or bad from their appearance
Decision Tree Classifier
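As a rough illustration of the machine learning approach, the sketch below learns a decision tree from the animal table shown earlier in the deck instead of hand-writing rules. It uses a plain ID3-style learner (split on the attribute with the highest information gain); that algorithm choice and all function names are assumptions for the example, not necessarily what the talk or any product uses.

```python
import math
from collections import Counter

ATTRIBUTES = ["blood_type", "give_birth", "can_fly", "live_in_water"]

# (name, blood type, give birth, can fly, live in water, class), copied from the slide's table.
TABLE = [
    ("human", "warm", "yes", "no", "no", "mammals"),
    ("python", "cold", "no", "no", "no", "reptiles"),
    ("salmon", "cold", "no", "no", "yes", "fishes"),
    ("whale", "warm", "yes", "no", "yes", "mammals"),
    ("frog", "cold", "no", "no", "sometimes", "amphibians"),
    ("komodo", "cold", "no", "no", "no", "reptiles"),
    ("bat", "warm", "yes", "yes", "no", "mammals"),
    ("pigeon", "warm", "no", "yes", "no", "birds"),
    ("cat", "warm", "yes", "no", "no", "mammals"),
    ("leopard shark", "cold", "yes", "no", "yes", "fishes"),
    ("turtle", "cold", "no", "no", "sometimes", "reptiles"),
    ("penguin", "warm", "no", "no", "sometimes", "birds"),
    ("porcupine", "warm", "yes", "no", "no", "mammals"),
    ("eel", "cold", "no", "no", "yes", "fishes"),
    ("salamander", "cold", "no", "no", "sometimes", "amphibians"),
    ("gila monster", "cold", "no", "no", "no", "reptiles"),
    ("platypus", "warm", "no", "no", "no", "mammals"),
    ("owl", "warm", "no", "yes", "no", "birds"),
    ("dolphin", "warm", "yes", "no", "yes", "mammals"),
    ("eagle", "warm", "no", "yes", "no", "birds"),
]
ROWS = [dict(zip(["name"] + ATTRIBUTES + ["class"], row)) for row in TABLE]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    return -sum(n / len(labels) * math.log2(n / len(labels)) for n in counts.values())


def information_gain(rows, attribute):
    """Reduction in class entropy obtained by splitting `rows` on `attribute`."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r["class"] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r["class"] for r in rows]) - remainder


def build_tree(rows, attributes):
    """Return a class label (leaf) or a dict node that splits on one attribute."""
    labels = [r["class"] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: information_gain(rows, a))
    rest = [a for a in attributes if a != best]
    branches = {
        value: build_tree([r for r in rows if r[best] == value], rest)
        for value in {r[best] for r in rows}
    }
    return {"attribute": best, "branches": branches, "default": majority}


def predict(tree, record):
    """Walk the tree; fall back to the node's majority class for unseen values."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(record[tree["attribute"]], tree["default"])
    return tree


tree = build_tree(ROWS, ATTRIBUTES)
training_accuracy = sum(predict(tree, r) == r["class"] for r in ROWS) / len(ROWS)
print(f"training accuracy: {training_accuracy:.2f}")
```

Note that the turtle has the same four attribute values as the frog and the salamander, so no classifier built on these attributes alone can get all three right. Accuracy on the training rows also says little about new data, which is why a held-out test set matters.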
Building an accurate classifier




Training and Test Data
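A minimal sketch of the training/test idea: hold back part of the labeled examples, build the classifier on the rest, and measure accuracy only on the held-back part. The split fraction, the seed, the sample documents, and the keyword stand-in for a trained classifier are all assumptions for illustration.

```python
import random


def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the labeled examples and split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(classifier, examples):
    """Fraction of (text, label) examples the classifier labels correctly."""
    correct = sum(1 for text, label in examples if classifier(text) == label)
    return correct / len(examples)


# Hypothetical labeled documents: (text, label).
LABELED = [
    ("witness statement for the court", "legal"),
    ("court filing deadline", "legal"),
    ("contract dispute summary", "legal"),
    ("settlement agreement draft", "legal"),
    ("employee benefits handbook", "hr"),
    ("new hire onboarding checklist", "hr"),
    ("vacation policy update", "hr"),
    ("performance review template", "hr"),
]


def keyword_classifier(text):
    """Stand-in for a trained model: a single hand-picked keyword test."""
    return "legal" if "court" in text or "contract" in text else "hr"


train, test = train_test_split(LABELED)
print(len(train), "training examples,", len(test), "test examples")
print("test accuracy:", accuracy(keyword_classifier, test))
```

The important property is that no test example is used while building or tuning the classifier; otherwise the measured accuracy is optimistic.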
Choosing the algorithm

Rules-based
+ Easy to get started
+ Transparent and debuggable
+ Easily controlled (when the number of rules is not too large)
- Need taxonomies
- Rule maintenance effort
- Harder to cover a domain fully and to switch domains
Machine learning
+ Don’t need taxonomies
+ Improves without manual maintenance
+ Handles new data types/domains more easily
- Need a training set
- Opaque, usually can’t debug
- Can’t specify or control specific examples
What would you use for







Case Study
Content Identification and Movement
Benchmarks




Large scale example
Combinations of Techniques usually work better
Examples of hybrid configurations




Example: clustering combined with rules
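One possible hybrid configuration, sketched in Python: hand-written rules get the first say, and a similarity-based fallback handles documents no rule covers. This is not the configuration from the talk; the rules, the labeled examples, the Jaccard similarity measure, and the threshold are all hypothetical.

```python
import re
from typing import Optional


def tokens(text: str) -> set[str]:
    """Lowercased set of alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Hypothetical hand-written rules: keyword -> tag.
RULES = {"confidential": "Restricted", "invoice": "Finance"}

# Hypothetical labeled examples standing in for a training set.
EXAMPLES = [
    ("quarterly revenue forecast and budget", "Finance"),
    ("employee onboarding and benefits handbook", "HR"),
    ("court filing and witness statement", "Legal"),
]


def rule_tag(text: str) -> Optional[str]:
    """Tag from the first rule whose keyword appears in the text, else None."""
    words = tokens(text)
    for keyword, tag in RULES.items():
        if keyword in words:
            return tag
    return None


def nearest_example_tag(text: str, min_similarity: float = 0.2) -> Optional[str]:
    """Tag of the most similar labeled example (Jaccard overlap), if similar enough."""
    words = tokens(text)
    best_tag, best_score = None, 0.0
    for example, tag in EXAMPLES:
        example_words = tokens(example)
        union = words | example_words
        score = len(words & example_words) / len(union) if union else 0.0
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag if best_score >= min_similarity else None


def classify(text: str) -> tuple[Optional[str], str]:
    """Return (tag, source): rules first, then similarity, else unclassified."""
    tag = rule_tag(text)
    if tag is not None:
        return tag, "rule"
    tag = nearest_example_tag(text)
    if tag is not None:
        return tag, "similarity"
    return None, "unclassified"


for doc in ["Invoice for consulting services", "witness statement for the court", "lunch menu"]:
    print(doc, "->", classify(doc))
```

Recording which path produced the tag keeps the rule-driven part debuggable while still covering content the rules miss.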
carrot2
Open Source & Platform packages offer an easy way to play
How to get started
• Set up a metadata framework
– keep it simple
• Develop or acquire managed vocabularies for critical elements
• Start with rule-driven automation
• Test out ML-based techniques as you grow
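A simple metadata framework with managed vocabularies might start as small as the sketch below. The field names, vocabulary values, and validation approach are hypothetical; the point is only that a few controlled fields can be checked automatically before any classifier writes to them.

```python
from dataclasses import dataclass, field

# Hypothetical managed vocabularies for the critical elements.
VOCABULARIES = {
    "document_type": {"contract", "memo", "report"},
    "department": {"legal", "finance", "hr"},
}


@dataclass
class Metadata:
    document_type: str
    department: str
    topics: list[str] = field(default_factory=list)  # free-form, not vocabulary-controlled

    def validation_errors(self) -> list[str]:
        """Messages for every controlled field whose value is outside its vocabulary."""
        errors = []
        for name, allowed in VOCABULARIES.items():
            value = getattr(self, name)
            if value not in allowed:
                errors.append(f"{name}: {value!r} is not in the managed vocabulary")
        return errors


good = Metadata(document_type="memo", department="legal", topics=["litigation"])
bad = Metadata(document_type="email", department="legal")
print(good.validation_errors())
print(bad.validation_errors())
```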
www.BAinsight.com
Jeff.Fried@BAinsight.com
@jefffried

Editor's Notes

  • #3: documents, web sites, blog posts, database entries, etc.