Extending DBpedia (LOD) using WikiTables
Emir Muñoz
Unit for Reasoning and Querying
emir.munoz@deri.org
Linked Open Data
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
October 12, 2012 -- E. Muñoz
Linked Open Data
• DBpedia, an export of Wikipedia’s structured data
DBpedia provides an RDF version of all of Wikipedia's structured data (infoboxes).
But it does not yet cover ordinary Wikipedia tables (wikitables).
Tables as a source of LOD
Tables are inherently concise as well as information rich.
• Wikitable example: http://en.wikipedia.org/wiki/Dublin
– The caption appears as another row
– Column headers represent types of information
– The values represent instances of those types
• Infobox example (attribute-value pairs): http://en.wikipedia.org/wiki/Galway
Reasoning over Wikipedia Tables
Recovering Table Semantics …
Example table from http://en.wikipedia.org/wiki/Dublin: "Dublin is twinned with the following places:"
Reasoning over Wikipedia Tables
Entity annotation for cells, mappings to DBpedia resources (table from http://en.wikipedia.org/wiki/Dublin):

dbpedia.org/property/city                  | dbpedia.org/property/nation                      | dbpedia.org/property/since
dbpedia.org/resource/San_Jose,_California  | dbpedia.org/resource/United_States               | (xsd:integer)
dbpedia.org/resource/Liverpool             | dbpedia.org/resource/United_Kingdom              | (xsd:integer)
dbpedia.org/resource/Matsue,_Shimane       | dbpedia.org/resource/Japan                       | (xsd:integer)
dbpedia.org/resource/Barcelona             | dbpedia.org/resource/Spain                       | (xsd:integer)
dbpedia.org/resource/Beijing               | dbpedia.org/resource/People's_Republic_of_China  | (xsd:integer)
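The cell-to-resource mapping can be sketched as follows. This is a minimal sketch, assuming each cell carries an internal English Wikipedia link and that a candidate DBpedia URI is obtained by putting the article title behind the http://dbpedia.org/resource/ namespace. The function name is hypothetical, and redirects and disambiguation pages are not handled here.

```python
from urllib.parse import unquote, urlparse

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"  # DBpedia resource namespace


def wikipedia_link_to_dbpedia(url):
    """Map an English Wikipedia article URL to a candidate DBpedia resource URI.

    Hypothetical helper: returns None for links that are not article links.
    """
    parsed = urlparse(url)
    if parsed.netloc != "en.wikipedia.org" or not parsed.path.startswith("/wiki/"):
        return None
    # Decode percent-escapes (e.g. %27 becomes an apostrophe) and normalise spaces.
    title = unquote(parsed.path[len("/wiki/"):]).replace(" ", "_")
    if not title:
        return None
    return DBPEDIA_RESOURCE + title


print(wikipedia_link_to_dbpedia("http://en.wikipedia.org/wiki/San_Jose,_California"))
```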
Extracting relations (table from http://en.wikipedia.org/wiki/Dublin):
• dbpedia.org/ontology/country
• dbpedia.org/property/subdivisionName
• is dbpedia.org/ontology/country of
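One way to look up relations like those listed above is to ask DBpedia which predicates connect two annotated cells, in either direction. The slides do not say how the lookup is done, so the sketch below assumes a SPARQL query is used. It only builds the query text and contacts no endpoint, and the function name is hypothetical.

```python
def relation_query(subject_uri, object_uri):
    """Build a SPARQL query listing the predicates that link two resources.

    Hypothetical helper. The 'inverse' branch corresponds to relations
    read as "is <predicate> of".
    """
    return (
        "SELECT DISTINCT ?p ?direction WHERE {\n"
        f"  {{ <{subject_uri}> ?p <{object_uri}> . BIND('forward' AS ?direction) }}\n"
        "  UNION\n"
        f"  {{ <{object_uri}> ?p <{subject_uri}> . BIND('inverse' AS ?direction) }}\n"
        "}"
    )


print(relation_query(
    "http://dbpedia.org/resource/Liverpool",
    "http://dbpedia.org/resource/United_Kingdom",
))
```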
Reasoning over Wikipedia Tables
• <http://dbpedia.org/resource/San_Jose,_California> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/United_States> .
• <http://dbpedia.org/resource/San_Jose,_California> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/United_States> .
• <http://dbpedia.org/resource/Liverpool> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/United_Kingdom> .
• <http://dbpedia.org/resource/Liverpool> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/United_Kingdom> .
• <http://dbpedia.org/resource/Matsue,_Shimane> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/Japan> .
• <http://dbpedia.org/resource/Matsue,_Shimane> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/Japan> .
• <http://dbpedia.org/resource/Barcelona> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/Spain> .
• <http://dbpedia.org/resource/Barcelona> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/Spain> .
• <http://dbpedia.org/resource/Beijing> <http://dbpedia.org/property/subdivisionName> <http://dbpedia.org/resource/People's_Republic_of_China> .
• <http://dbpedia.org/resource/Beijing> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/People's_Republic_of_China> .
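Producing such a listing from annotated rows is mechanical. The sketch below assumes that every candidate predicate is applied to every (city, nation) pair, which is what the listing shows for this table. The function name is hypothetical, and the output is one N-Triples statement per line.

```python
def to_ntriples(rows, predicates):
    """Serialise candidate triples as N-Triples lines.

    rows: list of (subject_uri, object_uri) pairs, one per table row.
    predicates: list of predicate URIs to apply to every row (an assumption).
    """
    lines = []
    for subject, obj in rows:
        for predicate in predicates:
            lines.append(f"<{subject}> <{predicate}> <{obj}> .")
    return lines


rows = [
    ("http://dbpedia.org/resource/Liverpool", "http://dbpedia.org/resource/United_Kingdom"),
    ("http://dbpedia.org/resource/Barcelona", "http://dbpedia.org/resource/Spain"),
]
predicates = [
    "http://dbpedia.org/property/subdivisionName",
    "http://dbpedia.org/ontology/country",
]
for line in to_ntriples(rows, predicates):
    print(line)
```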
Reasoning over Wikipedia Tables
• Let’s analyze these cases …
• Liverpool
• Matsue
• Beijing
Not that simple…
• Web tables usually don’t have explicit semantics
by themselves.
• Main issues:
– Complex tables with spans
– Captions inside the table as another row
– Not well-formed tables (i.e., not a matrix)
– We need filters (e.g., at least 2 columns and 2 rows)
• We extract relations at row level, and also between
the page's main entity and the resources in the table
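The filtering step can be sketched as a simple predicate over a parsed table. This is a minimal sketch, assuming a table has already been parsed into a list of rows of cell strings. The thresholds come from the example in the list above, and the function name and the "same width in every row" reading of well-formedness are assumptions.

```python
def passes_filter(table, min_rows=2, min_cols=2):
    """Return True if a parsed table is usable for relation extraction.

    table: list of rows, each row a list of cell strings (assumed format).
    """
    if len(table) < min_rows:
        return False
    width = len(table[0])
    if width < min_cols:
        return False
    # Well-formed here means every row has the same number of cells (a matrix).
    return all(len(row) == width for row in table)


good = [["City", "Nation"], ["Liverpool", "United Kingdom"]]
ragged = [["City", "Nation"], ["Liverpool"]]
print(passes_filter(good), passes_filter(ragged))
```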
Parsing: Extracting Tables
First step: parsing the Wiki format.
Example: http://en.wikipedia.org/wiki/People%27s_Republic_of_China
• Caption as another row
• Table split
• Rowspans with pictures
• Problems with parsing the cell's content (example: http://en.wikipedia.org/wiki/Danny_Kaye)
• Further issues (example: http://en.wikipedia.org/wiki/List_of_animated_television_series_of_the_1990s)
– Same-page links
– Many different formats
– Anchor text vs. content text
Extracting Relations
• A table containing tables (example: http://en.wikipedia.org/wiki/AFC_Ajax)
• Also relations between the main entity and
the entities in the table
– Main entity: dbpedia.org/resource/AFC_Ajax; the table lists 16 players
– Relations found, with their frequency:
    14  dbpedia.org/ontology/team
    14  dbpedia.org/property/clubs
    11  dbpedia.org/property/currentclub
     3  dbpedia.org/property/youthclubs
– One player's DBpedia page makes no mention of AFC Ajax
Example: dbpedia.org/resource/Christian_Eriksen (table from http://en.wikipedia.org/wiki/AFC_Ajax)
Disambiguation page: dbpedia.org/resource/Ajax
Our Dataset
• enwiki dump from 2012-09-03 02:17:37
• 8.6 GB of Wikipedia pages that comprise
– 10,531,986 documents (HTML pages)
– Only 413,256 HTML pages contain tables
– 2,989,098 tables
– 905,929 tables after the filter
• 27.7% of all tables
– 0.46 tables per page (or 2.15 discarding pages
without tables)
Methodology
Ranking of Relationships
• The current ranking function is naïve
Example: http://en.wikipedia.org/wiki/AFC_Ajax (16 players, so n_rows = 16)

freq  relationship                       score
14    dbpedia.org/ontology/team          0.875
14    dbpedia.org/property/clubs         0.875
11    dbpedia.org/property/currentclub   0.6875
3     dbpedia.org/property/youthclubs    0.1875

$$\mathit{score} = \frac{f_{rel}}{n_{rows}}$$
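The naïve score is just the relation's frequency divided by the number of rows. The sketch below reproduces the AFC Ajax figures. The function name and the dictionary input format are assumptions made for illustration.

```python
def rank_relations(frequencies, n_rows):
    """Score each relation as freq / n_rows and sort by descending score.

    frequencies: dict mapping a relation URI to the number of rows in which
    it was found (assumed input format).
    """
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    scored = [(relation, freq / n_rows) for relation, freq in frequencies.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ajax = {
    "dbpedia.org/ontology/team": 14,
    "dbpedia.org/property/clubs": 14,
    "dbpedia.org/property/currentclub": 11,
    "dbpedia.org/property/youthclubs": 3,
}
for relation, score in rank_relations(ajax, n_rows=16):
    print(f"{score:.4f}  {relation}")
```

Nothing in this definition bounds the score: a frequency larger than n_rows gives a value above 1.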
• In cases like this one (http://en.wikipedia.org/wiki/Danny_Kaye) the function does not work well: score ∉ [0, 1]
Ongoing Work and Challenges
• Improve the ranking function for relations.
• Store the 5.5M DBpedia (transitive) redirects
locally (to save lookup time).
• Statistical analysis of Wikipedia tables
– Number of columns, rows
– Headers, Captions
– External and internal links
• The next big challenge is the evaluation.
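Resolving transitive redirects from a local copy can be sketched as a dictionary walk. This is a minimal sketch, assuming the redirects are loaded into an in-memory mapping from source URI to target URI. The function name and the toy data are hypothetical.

```python
def resolve_redirect(uri, redirects):
    """Follow redirect chains until reaching a URI that does not redirect.

    redirects: dict mapping a redirecting URI to its target (assumed format).
    The 'seen' set stops the walk if the data contains a redirect cycle.
    """
    seen = set()
    while uri in redirects and uri not in seen:
        seen.add(uri)
        uri = redirects[uri]
    return uri


# Toy data for illustration only, not taken from DBpedia.
toy_redirects = {"A": "B", "B": "C"}
print(resolve_redirect("A", toy_redirects))
```

Resolving chains once and storing only the final targets would make each later lookup a single dictionary access.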
What’s next?
• Some ideas in mind:
– Use the extracted relations to classify WikiTables
– Define a similarity function for WikiTables
[Figure: two example tables, labelled English and Italian]
• Example: http://en.wikipedia.org/wiki/Electronegativity
– What does this number mean?
– There is no reference to those numbers here!
• Example: http://en.wikipedia.org/wiki/Chlorine
– Chlorous acid is a chlorite (http://dbpedia.org/page/Chlorous_acid)
Open problems
• Handle multiple entities in the same cell
• Improve the ranking function
• Handle redirects before querying DBpedia
• How to evaluate the outcome
Thanks!
Q & A
Emir Muñoz
Unit for Reasoning and Querying
emir.munoz@deri.org
