Evaluating the Use of Clustering for Automatically Organising Digital Library Collections
Mark M. Hall, Mark Stevenson, Paul D. Clough


TPDL 2012, Cyprus, 24-27 September 2012
Opening Up Digital Cultural Heritage




http://www.flickr.com/photos/brokenthoughts/122096903/
Carl Collins
http://www.flickr.com/photos/carlcollins/199792939/
http://www.flickr.com/photos/usnationalarchives/4069633668/
Exploring Collections
• Exploring / Browsing as an alternative to
  Search (where applicable)
• Requires some kind of structuring of the
  data
• Manual structuring ideal
    – Expensive to generate
    – Integration of collections problematic
• Alternative: Automatic structuring via
  clustering

Test Collection
• 28133 photographs provided
  by the University of St
  Andrews Library
    – 85% pre 1940
    – 89% black and white
    – Majority UK
    – Title and description tend to be
      short
• Example photograph: Ottery St Mary Church


Tested Clustering Strategies
• Latent Dirichlet Allocation (LDA)
    – 300 & 900 topics
    – With and without Pointwise Mutual Information
      (PMI) filtering
• K-Means
    – 900 clusters
    – TFIDF vectors & LDA topic vectors
• OPTICS
    – 900 clusters
    – TFIDF vectors & LDA topic vectors
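The PMI filtering listed above builds on pointwise mutual information, log(p(a, b) / (p(a) p(b))). The slides do not describe how the filter was applied to the LDA topics, so the following is only a minimal sketch of the measure itself, estimated from document-level co-occurrence. The function name `pmi` and the list-of-token-lists input format are assumptions made for illustration.

```python
from math import log


def pmi(word_a, word_b, documents):
    """Pointwise mutual information of two words.

    Probabilities are estimated from document-level co-occurrence
    (an assumption; other window sizes are possible):
    log(p(a, b) / (p(a) * p(b))).
    """
    n = len(documents)
    doc_sets = [set(doc) for doc in documents]
    count_a = sum(word_a in d for d in doc_sets)
    count_b = sum(word_b in d for d in doc_sets)
    count_ab = sum(word_a in d and word_b in d for d in doc_sets)
    if count_ab == 0:
        # The words never co-occur, so the PMI is unbounded below.
        return float("-inf")
    return log((count_ab / n) / ((count_a / n) * (count_b / n)))


# Toy token lists standing in for photograph titles (made-up data).
docs = [
    ["church", "tower"],
    ["church", "tower"],
    ["harbour", "boat"],
    ["harbour", "church"],
]
print(pmi("church", "tower", docs))
```

Word pairs that co-occur more often than independence would predict get a positive score; pairs that never co-occur get negative infinity.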

Processing Time
Model                                     Wall-clock Time
LDA 300                                   00:21:48
LDA 900                                   00:42:42
LDA + PMI 300                             05:05:13
LDA + PMI 900                             17:26:08
K-Means TFIDF                             09:37:40
K-Means LDA                               03:49:04
OPTICS TFIDF                              12:42:13
OPTICS LDA                                05:12:49



Evaluation Metrics
• Cluster cohesion
    – Items in a cluster should be similar to each
      other
    – Items in a cluster should be different from
      items in other clusters
• How to test this?
    – “Intruder” test
    – If you insert an intruder into a cluster, can
      people find it?

Intruder Test
1. Randomly select one topic
2. Randomly select four items from the topic
3. Randomly select a second topic – the
   “intruder” topic
4. Randomly select one item from the
   second topic – the “intruder” item
5. Scramble the five items and let the user
   choose which one is the “intruder”
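The five steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_intruder_unit`, the dict-of-lists input format, and the restriction to topics with at least four items are assumptions, and each item is assumed to belong to exactly one topic.

```python
import random


def build_intruder_unit(clusters, rng=random):
    """Build one intruder-test unit.

    clusters: dict mapping a topic id to a list of item ids
    (hypothetical format; each item is assumed to be in one topic only).
    """
    # Step 1: randomly select one topic. Only topics with at least
    # four items are eligible (assumption, so that step 2 can succeed).
    eligible = [t for t, items in clusters.items() if len(items) >= 4]
    topic = rng.choice(eligible)

    # Step 2: randomly select four items from the topic.
    members = rng.sample(clusters[topic], 4)

    # Step 3: randomly select a second, different topic.
    others = [t for t, items in clusters.items() if t != topic and items]
    intruder_topic = rng.choice(others)

    # Step 4: randomly select one item from the second topic.
    intruder = rng.choice(clusters[intruder_topic])

    # Step 5: scramble the five items before showing them to the user.
    unit = members + [intruder]
    rng.shuffle(unit)
    return {
        "topic": topic,
        "intruder_topic": intruder_topic,
        "items": unit,
        "intruder": intruder,
    }


# Made-up topics and item ids for illustration.
example_clusters = {
    "churches": ["c1", "c2", "c3", "c4", "c5"],
    "harbours": ["h1", "h2", "h3", "h4"],
    "portraits": ["p1", "p2"],
}
unit = build_intruder_unit(example_clusters, random.Random(0))
print(unit["items"], "intruder:", unit["intruder"])
```

Passing a seeded `random.Random` instance makes the generated units reproducible, which helps when the same unit has to be shown to many participants.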

Cluster Cohesion – Cohesive




Cluster Cohesion – Not Cohesive




Evaluation Metrics
• Cohesive
    – “Intruder” is chosen significantly more
      frequently than by chance
    – Choice distribution is significantly different
      from the uniform distribution
• Borderline cohesive
    – Two out of five items make up > 95% of the
      answers
    – “Intruder” is one of those two
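The criteria above could be applied to the answer counts of a single unit roughly as follows. The slides do not name the statistical tests, so this sketch makes several assumptions: a one-sided exact binomial test for "more frequently than by chance", a chi-square goodness-of-fit test against the uniform distribution, a 0.05 significance level, and both conditions being required for a unit to count as cohesive. All function names are hypothetical.

```python
from math import comb, exp


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def chi2_sf_df4(x):
    """Survival function of the chi-square distribution with 4 degrees
    of freedom, which has the closed form exp(-x/2) * (1 + x/2)."""
    return exp(-x / 2) * (1 + x / 2)


def classify_unit(counts, intruder_index, alpha=0.05):
    """Classify one unit from the number of votes each of its five items got."""
    if len(counts) != 5:
        raise ValueError("a unit has exactly five items")
    n = sum(counts)
    if n == 0:
        raise ValueError("no answers recorded for this unit")

    # Is the intruder chosen more often than the 1-in-5 chance level?
    p_intruder = binomial_tail(counts[intruder_index], n, 1 / 5)

    # Does the choice distribution differ from the uniform distribution?
    expected = n / 5
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    p_uniform = chi2_sf_df4(chi2)

    if p_intruder < alpha and p_uniform < alpha:
        return "cohesive"

    # Borderline: two items collect > 95% of answers, one is the intruder.
    top_two = sorted(range(5), key=lambda i: counts[i], reverse=True)[:2]
    if intruder_index in top_two and sum(counts[i] for i in top_two) > 0.95 * n:
        return "borderline"
    return "non-cohesive"


# Made-up answer counts; the intruder is the item at index 2.
print(classify_unit([2, 1, 22, 1, 1], intruder_index=2))
print(classify_unit([20, 0, 7, 0, 0], intruder_index=2))
print(classify_unit([6, 5, 5, 6, 5], intruder_index=2))
```

Under these assumptions the borderline case only applies when participants split between the intruder and one other item without the intruder clearly dominating.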

Evaluation Bounds
• Upper bound
    – Manual annotation
         • 936 topics
• Lower bound
    – 3 cohesive topics
    – <5% likelihood of seeing that number of cohesive
      topics by chance
• Control data
    – 10 “really, totally, completely obvious” intruders
      used to filter participants who randomly select
      answers


Experiment
• Crowd-sourced using staff & students at
  Sheffield University
    – 700 participants
• 9 clustering strategies
    – 30 units per strategy – total of 270 units
• Results
    – 8840 ratings
    – 21 to 30 ratings per unit (median 27 ratings)


Results
Model                        Cohesive     Borderline   Non-Cohesive
Upper Bound                  27           0            3
Lower Bound                  3            0            27
LDA 300                      15           6            9
LDA 900                      20           4            6
LDA + PMI 300                16           4            10
LDA + PMI 900                21           2            7
K-Means TFIDF                24           3            3
K-Means LDA                  20           0            10
OPTICS TFIDF                 14           2            14
OPTICS LDA                   16           0            14
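To compare the strategies in the table above on a common scale, the counts can be turned into shares of the 30 units per strategy. The sketch below only restates the table; the helper name `cohesion_shares` and the "cohesive or borderline" grouping are illustrative choices, not something the slides define.

```python
# Counts per strategy from the results table:
# (cohesive, borderline, non-cohesive), 30 units each.
results = {
    "Upper Bound": (27, 0, 3),
    "Lower Bound": (3, 0, 27),
    "LDA 300": (15, 6, 9),
    "LDA 900": (20, 4, 6),
    "LDA + PMI 300": (16, 4, 10),
    "LDA + PMI 900": (21, 2, 7),
    "K-Means TFIDF": (24, 3, 3),
    "K-Means LDA": (20, 0, 10),
    "OPTICS TFIDF": (14, 2, 14),
    "OPTICS LDA": (16, 0, 14),
}


def cohesion_shares(cohesive, borderline, non_cohesive):
    """Return (share cohesive, share cohesive or borderline)."""
    total = cohesive + borderline + non_cohesive
    return cohesive / total, (cohesive + borderline) / total


for model, counts in results.items():
    strict, lenient = cohesion_shares(*counts)
    print(f"{model:<15} cohesive {strict:.0%}   cohesive or borderline {lenient:.0%}")
```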

Conclusions
• K-Means on TFIDF vectors is almost as good as the
  human classification (24 vs. 27 cohesive units
  out of 30)
• LDA is very fast and approximately two
  thirds of the topics are acceptably
  cohesive

• Future work:
    – Make it hierarchical
    – Create hybrid algorithms

Thank you for listening



Find out more about the project:
http://www.paths-project.eu

m.mhall@sheffield.ac.uk



The research leading to these results has received funding from the European Community's Seventh Framework
Programme (FP7/2007-2013) under grant agreement no 270082. We acknowledge the contribution of all project
partners involved in PATHS (see: http://www.paths-project.eu).

