Badenes-Olmedo, Carlos
Redondo Garcia, Jose Luís
Corcho, Oscar
Ontology Engineering Group (OEG)
Universidad Politécnica de Madrid (UPM)
cbadenes@fi.upm.es
@carbadol
github.com/librairy
oeg-upm.net
Efficient Clustering from Distributions over Topics
K-CAP 2017
Knowledge Capture

December 4th-6th, 2017

Austin, Texas, United States
2
source: emages.eventshigh.com
3
Connected Documents in a Collection
4
From Sets of Textual Documents to Graphs
5
Similarity Matrix
Non-Symmetric: 31*31-31=930
Symmetric: 930/2=465
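The counts on this slide follow from simple combinatorics, sketched below. The function names are illustrative and not part of the original deck. The same formula applied to the 76,301 documents of the digital publisher use case gives the 2,910,883,150 similarities quoted there.

```python
def ordered_pairs(n: int) -> int:
    """Comparisons needed when similarity(a, b) may differ from similarity(b, a)."""
    return n * n - n


def unordered_pairs(n: int) -> int:
    """Comparisons needed when the measure is symmetric: each pair is scored once."""
    return ordered_pairs(n) // 2


if __name__ == "__main__":
    # 31 documents as on the slide, then the 76,301-document collection.
    for n in (31, 76_301):
        print(n, ordered_pairs(n), unordered_pairs(n))
```

The quadratic growth is the reason an exhaustive comparison becomes impractical for large collections.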
6
Solution: Partition of the Search Space
7
Use-Case: Digital Publisher
7,648 Books
68,653 Chapters
4 x 1.6 GHz, 16 GB RAM
4 x 1.6 GHz, 16 GB RAM
2,910,883,150 similarities
approx. 8 hours
76,301 Documents
All pairwise similarities
approx. 3.5 days
8
Hypothesis
Topic 1
Topic 2
Topic 3
Topic 4
LDA
LDA for efficient space partition
Probabilistic Topic Models (PTM), and in particular Latent Dirichlet Allocation (LDA), can efficiently divide the search space and speed up the process of finding relations among documents inside big collections.
9
Probabilistic Topic Models
• Each topic is a distribution over words
• Each document is a mixture of corpus-wide topics
• Each word is drawn from one of those topics
source: David Blei, Probabilistic Topic Models
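The three statements above can be read as a generative recipe. The following is a minimal sketch of that recipe using NumPy, not the training procedure used in the talk. All sizes and the hyperparameters `alpha` and `beta` are arbitrary values chosen for illustration.

```python
import numpy as np


def generate_corpus(n_docs=3, n_topics=4, vocab_size=10, doc_len=20,
                    alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus following the LDA generative story (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Each topic is a distribution over words.
    topics = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    # Each document is a mixture of corpus-wide topics.
    doc_topics = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    docs = []
    for theta in doc_topics:
        # Each word is drawn from one of those topics:
        # first pick a topic per word position, then a word id from that topic.
        assignments = rng.choice(n_topics, size=doc_len, p=theta)
        words = [int(rng.choice(vocab_size, p=topics[k])) for k in assignments]
        docs.append(words)
    return topics, doc_topics, docs


if __name__ == "__main__":
    topics, doc_topics, docs = generate_corpus()
    print(doc_topics.round(3))  # one topic distribution per document
    print(docs[0])              # word ids of the first toy document
```

Topic model inference runs this story in reverse: given only the words, it estimates the per-document topic distributions that the clustering approaches in this deck operate on.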
10
Topic-based Similarity
Topic Model: Topic0, Topic1, Topic2, .., Topic122
DOCUMENT ‘A’: topic0: 0.520, topic1: 0.327, topic2: 0.081, …, topic122: 0.182
DOCUMENT ‘B’: topic1: 0.573, topic2: 0.172, topic3: 0.136, …, topic122: 0.099
Jensen-Shannon Divergence
Similarity Value: 0.595
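A minimal sketch of the Jensen-Shannon divergence between two topic distributions follows. Two choices here are assumptions, since the slide does not state them: the logarithm is taken in base 2 so the divergence lies in [0, 1], and the divergence is turned into a similarity as `1 - JSD`. The example vectors are made up and much shorter than the 123-topic distributions on the slide.

```python
import numpy as np


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p == 0 contribute zero."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


def jensen_shannon_divergence(p, q):
    """Symmetric, bounded divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Normalise defensively in case rounded scores do not sum exactly to one.
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def jsd_similarity(p, q):
    """Assumed conversion from divergence to similarity (1 means identical)."""
    return 1.0 - jensen_shannon_divergence(p, q)


if __name__ == "__main__":
    doc_a = [0.52, 0.33, 0.08, 0.07]  # hypothetical topic distribution
    doc_b = [0.10, 0.57, 0.17, 0.16]  # hypothetical topic distribution
    print(jsd_similarity(doc_a, doc_b))
```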
11
Scenario
[ 0.243, 0.145, 0.600, 0.022]
corpus Prob. Topic Model
Topic 1
Topic 2
Topic 3
Dirichlet Distribution
• Exponential family distribution over the simplex, i.e. positive vectors that sum to one
Approach 1: TDC
12
Trends on Dirichlet-based Clustering (TDC)
• Instead of directly relying on the topic distribution’s scores, it considers their variations across consecutive topics inside a document’s topic distribution
P = 0.23  0.18  0.33  0.13  0.13
       >     <     >     =
T =    2     1     2     0
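The trend encoding in the example can be sketched as below. The numeric codes (2 for a decrease, 1 for an increase, 0 for no change) are inferred from the P and T values on the slide. Grouping documents whose trend codes match exactly is an assumption about how the clusters are formed, and the function names are illustrative.

```python
from collections import defaultdict


def trend_code(p):
    """Encode the variation between each pair of consecutive topic scores."""
    code = []
    for prev, curr in zip(p, p[1:]):
        if prev > curr:
            code.append(2)  # score decreases
        elif prev < curr:
            code.append(1)  # score increases
        else:
            code.append(0)  # score unchanged
    return tuple(code)


def cluster_by_trend(doc_topics):
    """Group document ids that share the same trend code (assumed cluster key)."""
    clusters = defaultdict(list)
    for doc_id, p in doc_topics.items():
        clusters[trend_code(p)].append(doc_id)
    return dict(clusters)


if __name__ == "__main__":
    p = [0.23, 0.18, 0.33, 0.13, 0.13]  # distribution from the slide
    print(trend_code(p))
```

Because the code depends only on the shape of the distribution, two documents can share a cluster even when their absolute scores differ.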
13
Approach 2: RDC
Ranking on Dirichlet-based Clustering (RDC)
• Considers only the top n topics from the ranked list of probability distributions [29]
• Based on the assumption that the highest weighted topics have a high influence on the rest of the topics when calculating distances
topic index:  0     1     2     3     4
n=2, P =      0.23  0.18  0.33  0.13  0.13
R =           2     0
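The ranking step and the resulting partition of the search space can be sketched as follows. Treating the ordered tuple of top-n topic indices as the cluster key is an assumption based on the R value in the example. The helper names and the three sample documents are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def top_topics(p, n):
    """Indices of the n highest-weighted topics, from highest to lowest."""
    ranked = sorted(range(len(p)), key=lambda k: p[k], reverse=True)
    return tuple(ranked[:n])


def cluster_by_ranking(doc_topics, n):
    """Group document ids that share the same top-n topics (assumed cluster key)."""
    clusters = defaultdict(list)
    for doc_id, p in doc_topics.items():
        clusters[top_topics(p, n)].append(doc_id)
    return dict(clusters)


def candidate_pairs(clusters):
    """Only documents inside the same cluster are compared with each other."""
    pairs = []
    for members in clusters.values():
        pairs.extend(combinations(members, 2))
    return pairs


if __name__ == "__main__":
    docs = {
        "a": [0.23, 0.18, 0.33, 0.13, 0.13],  # distribution from the slide
        "b": [0.30, 0.05, 0.40, 0.15, 0.10],  # hypothetical
        "c": [0.05, 0.60, 0.05, 0.25, 0.05],  # hypothetical
    }
    clusters = cluster_by_ranking(docs, n=2)
    print(clusters)
    print(candidate_pairs(clusters))
```

With three documents an exhaustive comparison needs three similarity computations. In this toy partition only the pair sharing a cluster is compared.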
14
Approach 3: CRDC
Cumulative Ranking on Dirichlet-based Clustering (CRDC)
15
Experiments: Dataset
16
Experiments: Baselines
Clustering Algorithms:
1: http://commons.apache.org/proper/commons-math/
Similarity Metrics:
• Jensen-Shannon Divergence
• Hellinger Distance
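For reference, the Hellinger distance named among the similarity metrics on this slide can be sketched with its standard definition for discrete distributions. This is not taken from the authors' code, and how the distance is converted into a similarity score in the experiments is not stated in the deck.

```python
import numpy as np


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Normalise defensively in case rounded scores do not sum exactly to one.
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


if __name__ == "__main__":
    doc_a = [0.52, 0.33, 0.08, 0.07]  # hypothetical topic distribution
    doc_b = [0.10, 0.57, 0.17, 0.16]  # hypothetical topic distribution
    print(hellinger_distance(doc_a, doc_b))
```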
17
Experiments: Measures
18
Experiments: Results
Precision
19
Experiments: Results
Recall
20
Experiments: Results
Number of Clusters
21
Experiments: Results
Effectiveness
JSD
22
Experiments: Results
Effectiveness
Hellinger
23
Experiments: Results
Cost
JSD
24
Experiments: Results
Cost
Hellinger
25
Experiments: Results
Efficiency
JSD
26
Experiments: Results
Efficiency
Hellinger
27
Conclusions
• Unsupervised clustering algorithms: TDC, RDC and CRDC.
• CRDC is a promising approach, which improves the efficiency obtained by other centroid-based and density-based approaches such as K-Means
• A hierarchical approach for the RDC algorithm was also considered, but it did not produce good results
28
Future Work
• Hybrid methods that combine some of these novel approaches with existing techniques will be explored in future work along the same line
• Nearest neighbors as baseline
29
Use-Case: Digital Publisher
7,648 Books
68,653 Chapters
4 x 1.6 GHz, 16 GB RAM
4 x 1.6 GHz, 16 GB RAM
2,910,883,150 similarities
approx. 8 hours
76,301 Documents
by using CRDC
approx. 2 hours
64,635,080 similarities
threshold = 0.9
approx. 3.5 days
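The two similarity counts on this slide can be related with a short calculation. The constants are copied from the slide and the function names are illustrative. By this arithmetic, CRDC computes roughly 45 times fewer similarities than the exhaustive comparison, about 2.2% of all pairs.

```python
ALL_PAIRS = 2_910_883_150  # exhaustive pairwise similarities (from the slide)
CRDC_PAIRS = 64_635_080    # similarities computed when using CRDC (from the slide)


def reduction_factor(baseline: int, reduced: int) -> float:
    """How many times fewer comparisons the reduced strategy performs."""
    return baseline / reduced


def fraction_computed(baseline: int, reduced: int) -> float:
    """Share of the exhaustive comparisons that is still computed."""
    return reduced / baseline


if __name__ == "__main__":
    print(f"{reduction_factor(ALL_PAIRS, CRDC_PAIRS):.1f}x fewer comparisons")
    print(f"{fraction_computed(ALL_PAIRS, CRDC_PAIRS):.2%} of all pairs")
```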
