Clustering Citation Distributions for 
Semantic Categorization and 
Citation Prediction 
Francesco Osborne (a), Silvio Peroni (b, c), Enrico Motta (a)
(a) KMi, The Open University, United Kingdom
(b) Department of Computer Science and Engineering, University of Bologna, Bologna, Italy
(c) Institute of Cognitive Sciences and Technologies, CNR, Rome, Italy
October 2014
Is it possible to say who will have a 
bigger impact?
Can I exploit this information for semantic 
expert search?
Our approach
[Diagram labels: Authors' data; Clustering of Citation Distribution; Clusters of authors with similar citation patterns; Extraction of semantic features; RDF; BiDO Ontology]
Clustering Citation Distributions 
We cluster the citation distributions by 
exploiting a bottom-up hierarchical clustering 
algorithm. 
We thus need to define: 
• A norm 
• A metric to assess the quality of a set of 
clusters
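The bottom-up procedure named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a plain Euclidean distance as a placeholder norm, compares clusters by their centroids, and stops at a fixed number of clusters instead of consulting a quality metric. The names `centroid`, `euclidean`, `agglomerate` and `target_clusters` are all hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def centroid(members: List[Vector]) -> List[float]:
    """Component-wise mean of the citation distributions in a cluster."""
    n = len(members)
    return [sum(values) / n for values in zip(*members)]


def euclidean(a: Vector, b: Vector) -> float:
    """Plain Euclidean distance, used here only as a placeholder norm."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(
    curves: List[Vector],
    target_clusters: int,
    dis: Callable[[Vector, Vector], float] = euclidean,
) -> List[List[Vector]]:
    """Merge the two closest clusters until `target_clusters` remain.

    ASSUMPTION: a fixed cluster count replaces the quality metric
    mentioned on the slide, to keep the sketch short.
    """
    target = max(target_clusters, 1)
    clusters = [[curve] for curve in curves]  # one cluster per author to start
    while len(clusters) > target:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dis(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters
```

Any other distance function with the same signature can be passed as `dis`. The pairwise search is quadratic per merge, which is fine for illustration but not for a large dataset.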
Clustering Citation Distributions 
A B 
C D 
1. dis(A, B) = dis(C, D) 
2. dis(A, C) > 0 , dis(B, D) > 0 
3. Can be computed incrementally
Clustering Citation Distributions 
A simple way to satisfy these three 
requirements is to use a normalized Euclidean 
distance:
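[Equation not recovered: the slide's formula for the normalized Euclidean distance dis(A, B) was lost in extraction.]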
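Since the slide's own formula is not available here, the sketch below shows one possible normalized Euclidean distance, purely as an illustration. It assumes the normalizer is the mean of the two distributions' total citations, and it reads requirement 1 as "pairs that differ only by a common scale factor get the same distance". Both points are assumptions, and the function name `normalized_distance` is hypothetical.

```python
import math
from typing import Sequence


def normalized_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance divided by the mean of the two citation totals.

    ASSUMPTION: the normalizer (sum(a) + sum(b)) / 2 is illustrative only;
    it is not taken from the slides.
    """
    if len(a) != len(b):
        raise ValueError("distributions must cover the same number of years")
    euclid = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    scale = (sum(a) + sum(b)) / 2
    return euclid / scale if scale > 0 else 0.0


# Hypothetical yearly citation counts: C and D are A and B scaled by 10.
A, B = [1, 2, 3], [2, 3, 4]
C, D = [10, 20, 30], [20, 30, 40]
same_relative_gap = math.isclose(normalized_distance(A, B), normalized_distance(C, D))
different_magnitude = normalized_distance(A, C) > 0
```

Under these assumptions both flags are true. The numerator and the normalizer are built from running sums, so each can be updated as another year of data arrives without revisiting earlier years.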
Clustering Citation Distributions 
We want to maximize the homogeneity of the 
cluster populations in the following years.
Standard deviation is not the solution…
Clustering Citation Distributions 
We estimate the homogeneity by computing the
weighted average of the MAD: [equation not recovered]
MAD (Median Absolute Deviation) is a robust
measure of statistical dispersion, used to
compute the variability of a univariate sample
of quantitative data.
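A small sketch of the MAD and of a weighted average over clusters may help here. The slide does not show which weights are used, so weighting each cluster by its number of members is an assumption, and the names `mad` and `weighted_mad` are hypothetical. The input values stand for the citation counts of a cluster's members in a later year.

```python
from statistics import median
from typing import Dict, List


def mad(values: List[float]) -> float:
    """Median absolute deviation: median of |x - median(values)|."""
    m = median(values)
    return median([abs(x - m) for x in values])


def weighted_mad(clusters: Dict[str, List[float]]) -> float:
    """Average of the per-cluster MADs.

    ASSUMPTION: each cluster is weighted by its size; the slide does not
    state the weights.
    """
    total = sum(len(values) for values in clusters.values())
    return sum(len(values) * mad(values) for values in clusters.values()) / total


# Hypothetical data: one outlier (100) barely moves the MAD of cluster "a".
example = {"a": [1, 2, 3, 4, 100], "b": [10, 10, 10]}
score = weighted_mad(example)
```

In cluster "a" the median is 3 and the absolute deviations are 2, 1, 0, 1 and 97, so the MAD is 1 even though the outlier would dominate a standard deviation.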
Clustering Citation Distributions 
We then compute the memberships of all authors 
in our dataset with the centroids of the resulting 
clusters. 
[Equation not recovered: the membership formula was lost in extraction.]
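Because the membership formula cannot be read from the extracted slide, the sketch below substitutes the standard fuzzy c-means membership, in which an author's membership in a cluster falls as the distance to that cluster's centroid grows relative to the distances to the other centroids. This is a stand-in chosen for illustration, not a confirmed reconstruction of the authors' formula. The function name `memberships`, the fuzzifier `m` and the plain Euclidean distance are assumptions.

```python
import math
from typing import List, Sequence


def _distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain Euclidean distance (placeholder for the norm used in clustering)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def memberships(
    author: Sequence[float],
    centroids: List[Sequence[float]],
    m: float = 2.0,
) -> List[float]:
    """Fuzzy c-means style memberships of one author in every cluster.

    ASSUMPTION: u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), the textbook
    fuzzy c-means rule, used here as a stand-in.
    """
    if m <= 1:
        raise ValueError("the fuzzifier m must be greater than 1")
    dists = [_distance(author, c) for c in centroids]
    zero_hits = [d == 0.0 for d in dists]
    if any(zero_hits):
        # The author coincides with one or more centroids: share membership there.
        share = 1.0 / sum(zero_hits)
        return [share if hit else 0.0 for hit in zero_hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]
```

With this rule the memberships of an author always sum to 1, so they can be read as the degree to which the author's citation curve matches each cluster.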
Finally, we calculate a number of statistics to
estimate how the members of each cluster
evolve.
How can we represent this data? 
Bibliometric data are subject to the simultaneous application of 
different variables. In particular, one should take into account at 
least: 
• the temporal association of such data to entities; 
• the particular agent who provided such data (e.g., Google 
Scholar, Scopus, our algorithm); 
• the characterisation of such data in at least two different 
kinds, i.e., numeric bibliometric data (e.g., the standard 
bibliometric measures such as h-index, journal impact factor, 
citation count) and categorial bibliometric data (so as to 
enable the description of entities, e.g., authors, according to 
specific descriptive categories).
BiDO
Extraction of Semantic features
:increasing-with-premature-deceleration-and-low-logarithmic-slope-in-[243,729)-5-years-beginning a :ResearchCareerCategory ;
    :hasCurve [ a :Curve ;
        :hasTrend :increasing ; :hasAccelerationPoint :premature-deceleration ] ;
    :hasSlope [ a :Slope ; :hasStrength :low ; :hasGrowth :logarithmic ] ;
    :hasOrderOfMagnitude :[243,729) ;
    :concernsResearchPeriod :5-years-beginning .
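The bounds of the interval [243,729) are consecutive powers of 3 (3^5 and 3^6). Assuming every order-of-magnitude bucket follows that pattern, which the slides do not state, the bucket for a given citation count could be derived as in this sketch. The names `magnitude_bucket` and `bucket_label` are hypothetical.

```python
from typing import Tuple


def magnitude_bucket(citations: int, base: int = 3) -> Tuple[int, int]:
    """Return (low, high) so that low <= citations < high, with
    low = base**k and high = base**(k + 1).

    ASSUMPTION: buckets are consecutive powers of `base`; base 3 is inferred
    from the single interval [243,729) shown on the slide.
    """
    if citations < 1:
        raise ValueError("bucket is defined here only for positive counts")
    low = 1
    while low * base <= citations:  # integer arithmetic avoids log rounding
        low *= base
    return low, low * base


def bucket_label(citations: int, base: int = 3) -> str:
    """Format the bucket in the half-open notation used in the category name."""
    low, high = magnitude_bucket(citations, base)
    return f"[{low},{high})"
```

For example, an author with 500 citations falls into the bucket labelled `[243,729)`, while 729 citations start the next bucket.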
:john-doe :holdsBibliometricDataInTime [
    a :BibliometricDataInTime ;
    tvc:atTime [ a time:Interval ; time:hasBeginning :2014-07-11 ] ;
    :accordingTo [ a fabio:Algorithm ;
        frbr:realization [ a fabio:ComputerProgram ] ] ;
    :withBibliometricData
        :increasing-with-premature-deceleration-and-low-logarithmic-slope-in-[243,729)-5-years-beginning ] .
Evaluation 
• We evaluated our method on a dataset of 20,000 researchers
working in the field of computer science in the 1990-2010
interval.
• This dataset was derived from the database of Rexplore, a
system for exploring scholarly data that integrates several
data sources (Microsoft Academic Search, DBLP++ and DBpedia).
Evaluation
Y   C18 (1.4%)          C22 (2.5%)          C25 (2.7%)          C28 (2.3%)          C29 (8.8%)
    range      mean     range      mean     range      mean     range      mean     range      mean
6   420-800    567±98   160-280    209±34   100-180    129±25   60-100     72±14    40-60      39±9
7   440-960    610±120  160-320    225±45   100-200    138±30   60-120     79±18    40-80      45±14
8   440-1020   650±137  160-400    246±58   100-260    158±45   60-160     90±26    40-100     50±18
9   440-1260   699±186  160-440    269±74   100-340    187±68   60-200     104±37   40-120     57±25
10  480-2940   751±411  160-500    292±85   100-400    211±82   60-280     125±57   40-160     68±35
11  480-2480   826±336  180-660    331±112  100-520    241±100  60-540     155±103  40-200     82±47
12  480-3520   914±467  180-860    370±151  100-640    270±126  60-440     166±96   40-260     97±60
Future Works
• Augment the clustering process with a variety of other features (e.g., research areas, co-authors);
• apply this technique to groups of researchers rather than single individuals;
• extend BiDO in order to provide a semantically-aware description of such new features;
• make available a triplestore of bibliometric data linked to other datasets such as Semantic Web Dog Food and DBLP.
Questions?
francesco.osborne@open.ac.uk
silvio.peroni@unibo.it
e.motta@open.ac.uk
BiDO Ontology: http://purl.org/spar/bido