HIEDS: A Generic and Efficient Approach to Hierarchical Dataset Summarization
Gong Cheng, Cheng Jin, Yuzhong Qu
National Key Laboratory for Novel Software Technology
Nanjing University, China
Websoft
Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
Scenario: browsing a dataset in an open data portal
https://data.europa.eu/euodp/en/data/dataset/dgt-translation-memory
"I need some insight into the contents, not just metadata."
Meeting the challenge with a dataset summary
i.e., an automatically generated, small-sized, high-level abstraction of the data that summarizes the contents of a dataset for quick inspection.
Expected features of a dataset summary
• To provide multigranular abstraction of data to be incrementally explored
• To preserve the structural nature of a dataset
• To be comprehensible
Constitution of a dataset summary
• An example
  • A hierarchical grouping of entities
  • Relations connecting sibling groups
  • A property-value pair differentiates a group of entities from sibling groups.
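The grouping idea above can be illustrated with a small sketch. It is not the authors' implementation: the toy entities, the `candidate_subgroups` helper, and the choice to represent an entity as a set of (property, value) pairs are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy dataset: entity -> set of (property, value) pairs.
entities = {
    "e1": {("type", "Person"), ("type", "Actor")},
    "e2": {("type", "Person"), ("type", "Actor")},
    "e3": {("type", "Person"), ("type", "Chair")},
    "e4": {("type", "Film")},
}

def candidate_subgroups(group, descriptions):
    """Index the entities of `group` by each property-value pair they carry."""
    whole = set(group)
    index = defaultdict(set)
    for entity in whole:
        for pair in descriptions[entity]:
            index[pair].add(entity)
    # A pair shared by every entity of the group cannot differentiate
    # a subgroup from its siblings, so it is dropped.
    return {pair: members for pair, members in index.items() if members != whole}

# Top level: subgroups of the whole dataset.
top = candidate_subgroups(entities.keys(), entities)
# Second level: subgroups inside the group labelled type:Person.
children = candidate_subgroups(top[("type", "Person")], entities)
```

Applying the helper recursively to each chosen subgroup yields a hierarchy in which every group is labelled by the property-value pair that separates it from its siblings.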
Quality of a dataset summary
• Coverage of data
  • large subgroups, frequent relations
• Height of hierarchy
  • moderate-sized subgroups
• Cohesion within groups
  • informative (i.e., less frequent) property-value pairs
• Overlap between groups
  • controllable overlap
• Homogeneity of groups
  • different values of the same property
Problem formulation: multidimensional knapsack problem (MKP)
• maximizing moderateness of each subgroup
• maximizing cohesion within each subgroup
• disallowing large overlap between subgroups
• selecting ≤k subgroups
• (optionally) disallowing different properties
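For reference, the textbook 0-1 multidimensional knapsack problem has the form below. The symbols $p_j$, $w_{ij}$, $c_i$ and $x_j$ are the standard generic ones, not notation taken from the slides, so any reading of them in terms of HIEDS is an assumption.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{j=1}^{n} p_j x_j \\
\text{s.t.}\quad & \sum_{j=1}^{n} w_{ij} x_j \le c_i, \quad i = 1, \dots, m, \\
& x_j \in \{0, 1\}, \quad j = 1, \dots, n.
\end{aligned}
```

One plausible reading, under that assumption: $x_j = 1$ means candidate subgroup $j$ is selected, $p_j$ combines its moderateness and cohesion, a constraint with $w_{ij} = 1$ and $c_i = k$ caps the number of subgroups, and the remaining constraints bound overlap.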
Problem solution
• A greedy strategy is used (candidates are sorted by a ranking score), but its efficient implementation is non-trivial.
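A naive version of such a greedy selection might look like the sketch below. The slide's actual sort key is not recoverable, so the precomputed `score`, the Jaccard overlap measure, and the `max_overlap` threshold are all assumptions; this is also the straightforward implementation, not the efficient one the slide alludes to.

```python
def jaccard(a, b):
    """Jaccard overlap of two entity sets (assumed overlap measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def greedy_select(candidates, k, max_overlap):
    """Pick at most k subgroups greedily.

    candidates: list of (label, members, score), with members a set of entities.
    max_overlap: largest overlap allowed with any already chosen subgroup.
    """
    chosen = []
    # Consider the highest-scoring candidates first.
    for label, members, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if len(chosen) == k:
            break
        # Keep the candidate only if it does not overlap too much
        # with any subgroup selected so far.
        if all(jaccard(members, other) <= max_overlap for _, other, _ in chosen):
            chosen.append((label, members, score))
    return chosen

# Hypothetical candidates with made-up scores.
candidates = [
    ("Type:Person", {1, 2, 3, 4, 5, 6}, 0.5),
    ("Type:Actor", {1, 2, 3}, 0.9),
    ("Type:Chair", {1, 2, 3, 4}, 0.6),
    ("Type:Film", {7, 8}, 0.7),
]
selected = [label for label, _, _ in greedy_select(candidates, k=3, max_overlap=0.4)]
```

In this toy run, Type:Chair and Type:Person are rejected because they overlap too heavily with the already selected Type:Actor, which mirrors the "controllable overlap" requirement. Each candidate is compared against every chosen subgroup, so the cost grows with both the number of candidates and k.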
Experiments
• Baseline: LODeX (ISWC’14)
  • flat grouping
  • biased towards coverage (e.g., Type:Person)
  • redundant information (e.g., Type:Person and Type:Chair)
• Advantages of HIEDS
  • hierarchical grouping
  • trade-off between coverage and cohesion (e.g., Type:Actor)
  • controllable overlap
Details can be found in our poster!

