Information Extraction for Building Knowledge Bases

WeST – Web Science & Technologies
University of Koblenz Landau, Germany

Information Extraction
for
Building Knowledge Bases

Steffen Staab
Saqib Mir – European Bioinformatics Institute
Ermelinda d’Oro, Massimo Ruffolo – Univ. Calabria, Italy

A FEW SLIDES ON WHERE WEST
COMES FROM

WeST – Web Science & Technologies
Steffen Staab, staab@uni-koblenz.de

Institut WeST – Web Science & Technologies

Semantic Web · Web Retrieval · Social Web · Multimedia Web · Software Web · GESIS


We (co-)organize conferences and schools


We build applications and develop methods…

BTC 1st Prize 2011

1st Prize, German Linked Open Gov Data Competition 2012

BTC 1st Prize 2008

German KM 1st Prize 2011


We teach Web Science

Master in Web Science@Koblenz
 Free tuition
 Start Fall 2012
 English

Master in eGov@Koblenz
 Free tuition
 Start Fall 2012
 English

2012 Web Science
Summer School
Lorentz Center, Leiden,
The Netherlands,
9-13 July 2012

We are active in joint projects

 EU Integrated Project ROBUST (10 Partners):
Risk and Opportunity management of huge-scale
BUSiness communiTy cooperation
 EU Live+Gov - Reality Sensing, Mining and Augmentation
for Mobile Citizen–Government Dialogue
 EU WeGov – where eGovernment meets the eSociety
 EU IP SocialSensor - Sensing User Generated Input for
Improved Media Discovery and Experience
 EU Net2 – a network for networked knowledge
 EU MOST – Marrying Ontologies and Software
Technologies


Steffen Staab,
Saqib Mir, European Bioinformatics Institute
Ermelinda d’Oro, Massimo Ruffolo, Univ. Calabria, Italy

INFORMATION EXTRACTION
FOR
BUILDING KNOWLEDGE BASES

GENERAL MOTIVATION


General objective: Extracting to LOD

[Figure: example relations useAsExample and hasLivedIn]


General objective: Analysing LOD

[Figure: example relations useAsExample and hasLivedIn]


http://lisa.west.uni-koblenz.de/lisa-demo/
Family’s analysis of Munich LOD + Open Street Map data


http://lisa.west.uni-koblenz.de/lisa-demo/
Entrepreneur’s analysis of Munich LOD + Open Street Map data


OBSERVATIONS ON
INFORMATION EXTRACTION


Challenges & Opportunities for IE

Not all web pages are created equal



Some challenges are the same, e.g. finding type instances



Some challenges are the same, e.g. finding relation instances



Some contain concepts and their descriptions, some don’t
No types here,
few relation types



Knowing that they are instances and of which type
 Textual indication
 Positional indication



To some extent
positional and layout
indications work across
languages and sites



owl:sameAs
We should not only think about
Web pages, but about Web sites




Comparing related work to our objectives

Related work objectives
 IE on Web pages
 Acquiring instances and relationship instances
 IE based on linear text

Our objectives
 IE on Web sites
 Acquiring items
 Classifying items into
   Instances
   Concepts
   Relation instances
   Relationships
 IE also based on spatial position

There is overlap, and there are a few exceptions in related work

Outline

The Social Media-Case The Bio-Case
 Motivation
 State-of-the-Art
 Core idea of SXPath
 SXPath Language
 Spatial Data Model
 Syntax & Semantics
 Complexity
 Implementation
 Evaluation


Presentation-oriented documents

Acquiring a music band profile:
A music band photo that has its descriptive
information to its east

Music band profile

band photo

band name



• HTML DOM structure is site specific
• Spatial arrangements are rarely explicit
• Spatial layout is hidden in complex nesting of layout elements
• Intricate DOM tree structures are conceptually difficult to
query for the user (or a tool!)


Related Work

Web Query languages
 XPath 1.0 and XQuery 1.0
 Established
 Too difficult to use for scraping from intricate DOM structures

Visual languages
 Spatial Graph Grammars [Kong et al.] are quite complex in
terms of both usability and efficiency
 Algebras for creating and querying multimedia interactive
presentations (e.g. ppt) [Subrahmanian et al.]
Web wrapper induction exploiting visual interface
[Gottlob et al.] [Sahuguet et al.]
 generate XPath location paths of DOM nodes
 can benefit from using Spatial XPath

Outline

 Motivation
 SXPath Language
 Complexity
 Implementation
 Evaluation


Idea: Use Spatial Relations among DOM Nodes



Spatial DOM (SDOM)


Spatial Relations Among Nodes

Rectangular Cardinal Relations (RCR)

r1 E:NE r2

Spatial models allow for expressing
disjunctive relations among regions
Topological Relations
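
A relation such as "r1 E:NE r2" can be computed from bounding boxes alone. The following is a minimal sketch, assuming the usual reading of a rectangular cardinal relation: the sides of the reference rectangle r2 are extended into lines that split the plane into nine tiles (B, N, NE, E, SE, S, SW, W, NW), and the relation lists the tiles that r1 occupies. The `Rect` type, the function names, and the upward-pointing y axis are assumptions made for this illustration and are not taken from the SXPath implementation.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle; y grows upward (an assumption of this sketch)."""
    x1: float  # left edge
    y1: float  # bottom edge
    x2: float  # right edge
    y2: float  # top edge


# Tile names indexed by (horizontal band, vertical band) relative to the
# reference rectangle: -1 = before it, 0 = alongside it, 1 = beyond it.
TILES = {
    (-1, 1): "NW", (0, 1): "N", (1, 1): "NE",
    (-1, 0): "W", (0, 0): "B", (1, 0): "E",
    (-1, -1): "SW", (0, -1): "S", (1, -1): "SE",
}


def _bands(lo, hi, ref_lo, ref_hi):
    """Bands of the reference interval that [lo, hi] overlaps with positive length."""
    bands = []
    if lo < ref_lo:
        bands.append(-1)
    if lo < ref_hi and hi > ref_lo:
        bands.append(0)
    if hi > ref_hi:
        bands.append(1)
    return bands


def cardinal_relation(r1, r2):
    """Set of tiles of r2's nine-tile partition that r1 occupies."""
    xs = _bands(r1.x1, r1.x2, r2.x1, r2.x2)
    ys = _bands(r1.y1, r1.y2, r2.y1, r2.y2)
    return {TILES[(x, y)] for x in xs for y in ys}


reference = Rect(0, 0, 10, 10)   # r2
region = Rect(12, 5, 20, 15)     # r1: right of r2, straddling its top edge
relation = cardinal_relation(region, reference)
print(":".join(sorted(relation)))  # prints E:NE
```

Returning a set rather than a single direction is what makes the relation disjunctive: a region that straddles a tile boundary is reported in every tile it touches.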


XPath Example


SXPath Example


From XPath 1.0 towards Spatial Querying with SXPath

SXPath features
 adopts intuitive path notation:
 axis::nodetest [pred]*
 adds to XPath
 spatial axes
 spatial position functions
 natural semantics for spatial querying
 maintains polynomial time combined complexity


Why SXPath?

resilient wrappers

an XPath for familiarity
Information extraction
Simplicity
human oriented
efficiency
web applications



Spatial DOM (SDOM)


Spatial Navigation Axes


Syntax of SXPath


Complexity Results




SXPath System Architecture


SXPath System


Results of Experiments


Formative User Study


Summative User Study


Existing Extensions to PDF


Page Header

Text Area and Paragraphs

Table

Item List

Page Number

Page Footer

Outline

The Social Media Case
 Motivation
 State-of-the-Art
 Core idea of SXPath
 SXPath Language
 Spatial Data Model
 Complexity
 Implementation
 Evaluation

The Bio-Case
 Motivation
 The (Biochemical) Deep Web
 Contributions
 Page-level wrapper induction
 Site-wide wrapper generation
 Error Correction by Mutual Reinforcement
 Conclusions and Future Directions

>1000 Life Science DBs, number growing quickly


Biochemical Web Sites: Observations - 1

Labeled Data

Full survey:
http://sabio.villa-bosch.de/labelsurvey.html (404)

Total   Labeled   Unlabeled   Unlabeled (Redundant)
754     719       19          16
Table 1: Data fields across 20 biochemical Web sites



Dynamic Web Pages



Rich Site Structure



 Web Services
 Survey: 11 of 100 databases [1] provide APIs
 Incomplete coverage
 Varying granularity
 No semantics in the service description

[1] Databases indexed by the Nucleic Acids Research journal
(http://www3.oup.co.uk/nar/database/). Complete survey available at
http://sabiork.villa-bosch.de/index.html/survey.html


Biochemical Web Sites: Implications

Induce Wrapper

Induce Wrapper

Induce Wrapper


Contributions

 Unsupervised Page-Level Wrapper Induction

 Unsupervised Site-Wide Wrapper Induction
(Site Structure Discovery)

 Automatic Error Detection and Correction by
Mutual Reinforcement


Page-Level Wrapper Induction – 1
D1 = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47,…}
O1 = {Entry, Name,…, Reaction, R00026, Enzyme,…, 3.2.1.21}

//*[text()]

D2 = {C00185, Cellobiose,…, R00306, 1.1.99.18,… }
O2 = {Entry, Name,…, Reaction, R00026, Enzyme,…, 3.2.1.21}
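
The split into D and O can be sketched as a comparison of the text entries of two pages of the same class. This is a minimal sketch under one reading of the example sets: D holds the entries that occur on only one of the two pages, and O holds the entries that both pages share. The toy markup and the function names are assumptions made for illustration; only the values come from the slides. The standard-library parser requires well-formed markup, which real pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified pages; only the values come from the slides.
PAGE_1 = (
    "<table>"
    "<tr><th>Entry</th><td>C00221</td></tr>"
    "<tr><th>Name</th><td>beta-D-Glucose</td></tr>"
    "<tr><th>Enzyme</th><td>1.1.1.47</td><td>3.2.1.21</td></tr>"
    "</table>"
)
PAGE_2 = (
    "<table>"
    "<tr><th>Entry</th><td>C00185</td></tr>"
    "<tr><th>Name</th><td>Cellobiose</td></tr>"
    "<tr><th>Enzyme</th><td>1.1.99.18</td><td>3.2.1.21</td></tr>"
    "</table>"
)


def text_entries(page):
    """Leading text of every element, an approximation of //*[text()]."""
    root = ET.fromstring(page)
    return [el.text.strip() for el in root.iter() if el.text and el.text.strip()]


def split_entries(page_a, page_b):
    """Return (D_a, D_b, O): page-specific entries and shared entries."""
    a, b = text_entries(page_a), text_entries(page_b)
    shared = set(a) & set(b)
    d_a = [t for t in a if t not in shared]
    d_b = [t for t in b if t not in shared]
    other = [t for t in a if t in shared]
    return d_a, d_b, other


d1, d2, o = split_entries(PAGE_1, PAGE_2)
print(d1)  # ['C00221', 'beta-D-Glucose', '1.1.1.47']
print(o)   # ['Entry', 'Name', 'Enzyme', '3.2.1.21']
```

The value 3.2.1.21 lands in O because it happens to occur on both pages, which is the kind of entry a later reclassification step has to move back into the data set.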

Page-Level Wrapper Induction - 2

Reclassify – Growing Data Regions


D1 = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, 3.2.1.21 …}
O1 = {Entry, Name,…, Reaction, R00026, Enzyme,…,}

D2 = {C00185, Cellobiose,…, R00306, 1.1.99.18, 3.2.1.21 … }
O2 = {Entry, Name,…, Reaction, R00026, Enzyme,…,}



Selecting Labels for Data
html/…./table[1]/tr[8]/td[1]/…/code[1]/a[1]
(“1.1.1.47” )

html/…./table[1]/tr[6]/th[1]/…/code[1]/
(“Reaction”)
html/…./table[1]/tr[8]/th[1]/…/code[1]/
(“Enzyme”)



Anchor the Path
Enzyme - html/table[1]/tr[8]/th[1]/code[1]/
html/table[1]/tr[8]/td[1]/code[1]/a[1]
html/table[1]/tr[8]/td[1]/code[1]/a[2]

//*[text()='Enzyme'] ../…./../td[1]/code[1]/a[position()>=2]/text()

Pivot Relative Generalize
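
The three steps (pivot on the label, make the data path relative to it, generalize over sibling data nodes) can be sketched with plain string handling. This is a sketch, not the authors' implementation: the function names are hypothetical, and the generalization used here simply drops the final positional index, which is a simpler rule than the position() predicate shown on the slide.

```python
import re


def relative_path(label_path, data_path):
    """Path from the label node to the data node via their common ancestor."""
    label = label_path.strip("/").split("/")
    data = data_path.strip("/").split("/")
    common = 0
    while common < min(len(label), len(data)) and label[common] == data[common]:
        common += 1
    ups = [".."] * (len(label) - common)  # climb from the label to the ancestor
    return "/".join(ups + data[common:])


def generalize(paths):
    """Drop the trailing index when several paths differ only in it, else None."""
    stripped = {re.sub(r"\[\d+\]$", "", p) for p in paths}
    if len(paths) > 1 and len(stripped) == 1:
        return stripped.pop()
    return None


def anchored_xpath(label_text, label_path, data_paths):
    """Build XPath expressions that start at the label instead of the root."""
    relative = [relative_path(label_path, p) for p in data_paths]
    general = generalize(relative)
    targets = [general] if general else relative
    return [f"//*[text()='{label_text}']/{t}/text()" for t in targets]


expressions = anchored_xpath(
    "Enzyme",
    "html/table[1]/tr[8]/th[1]/code[1]/",
    [
        "html/table[1]/tr[8]/td[1]/code[1]/a[1]",
        "html/table[1]/tr[8]/td[1]/code[1]/a[2]",
    ],
)
print(expressions[0])  # //*[text()='Enzyme']/../../td[1]/code[1]/a/text()
```

Anchoring at the label text rather than at the document root is what keeps the expression valid when rows above the label are added or omitted on other pages.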


Selected Sources

 KEGG, ChEBI, MSDChem
 Basic qualitative data
 Popular
 Overlapping/complementary data


Wrapper Induction - Evaluation

SOURCE         #L   #D    #S   TP    FN    FP   P      R
KEGG Compound  10   762   3    411   351   46   89.9   53.9
                          15   759   3     0    100    99.6
KEGG Reaction  10   205   3    173   32    0    100    84.4
                          15   205   0     0    100    100
ChEBI          22   831   3    595   236   41   93.5   71.6
                          15   829   2     0    100    99.7
MSDChem        30   600   3    600   0     20   96.7   100
                          15   600   0     20   96.7   100
Average (based on final wrappers for each source)     99.1   99.8

Sources:
KEGG Compound: http://www.genome.jp/kegg/compound/
KEGG Reaction: http://www.genome.jp/kegg/reaction/
ChEBI: http://www.ebi.ac.uk/chebi
MSDChem: http://www.ebi.ac.uk/msd-srv/msdchem/

Table 2: Page-level wrapper induction results, 20 test pages
(L=Labels, D=Data entries, S=Training pages)
~9 samples – ~99% P, ~98% R
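
The P and R columns follow from the TP, FN and FP counts by the standard definitions of precision and recall. A small sketch, with a hypothetical function name, that reproduces two rows of Table 2:

```python
def precision_recall(tp, fn, fp):
    """Precision and recall in percent from true/false positive/negative counts."""
    precision = 100.0 * tp / (tp + fp) if tp + fp else float("nan")
    recall = 100.0 * tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# KEGG Compound with 3 training pages: TP=411, FN=351, FP=46
p, r = precision_recall(411, 351, 46)
print(round(p, 1), round(r, 1))  # 89.9 53.9

# KEGG Reaction with 3 training pages: TP=173, FN=32, FP=0
p, r = precision_recall(173, 32, 0)
print(round(p, 1), round(r, 1))  # 100.0 84.4
```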


Site-Wide Wrapper Induction: Observations

Not all pages contain data (e.g. Legal disclaimers,
contact pages, navigational menus)
 An efficient approach should ignore these pages
 We don’t need to learn the entire site structure


Site-Wide Wrapper Induction: Observations - 2

Classified Link-Collections point to data-intensive
pages of the same class.


Site-Wide Wrapper Induction: Observations - 3

 Pages belonging to the same class describe the same
concepts
 Some concepts are sometimes omitted
 Ordering is always the same


Site-Wide Wrapper Induction

1. Start with C0; S = {C0}
2. Follow all classified link-collections
3. Generate wrappers for each set of target pages
4. Determine if new class is formed
5. Add navigation step
6. Repeat 2 – 5 for each new class formed in 4

If C0 != Ci (i>0): S = S + Ci

Navigation Steps
W = {(C0 → L1 → C0),
(C0 → L2 → C2),
(C0 → L3 → C3)}

[Figure: page classes C0, C1, C2, C3 connected by link-collections L1, L2, L3]
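
The loop above can be sketched as a breadth-first traversal over page classes. This is a minimal sketch, not the authors' system: `induce_site_structure`, the `link_collections` dictionary and the `classify` callback are hypothetical stand-ins. In particular, `classify` replaces steps 3 and 4, where wrappers are generated for the target pages and compared with the known classes.

```python
from collections import deque


def induce_site_structure(start_class, link_collections, classify):
    """Return (S, W): discovered classes and navigation steps.

    link_collections maps a class name to {link-collection label: target pages}.
    classify maps a list of target pages to the class they belong to.
    """
    known = [start_class]          # S, step 1
    steps = []                     # W
    queue = deque([start_class])
    while queue:                   # step 6: repeat for each new class
        current = queue.popleft()
        for label, targets in link_collections.get(current, {}).items():  # step 2
            target_class = classify(targets)                # steps 3 and 4
            steps.append((current, label, target_class))   # step 5
            if target_class not in known:
                known.append(target_class)
                queue.append(target_class)
    return known, steps


# Toy site mirroring the slide: L1 leads back to pages of class C0.
PAGE_CLASS = {"p1": "C0", "p2": "C0", "q1": "C2", "r1": "C3"}
LINKS = {"C0": {"L1": ["p1", "p2"], "L2": ["q1"], "L3": ["r1"]}}

S, W = induce_site_structure("C0", LINKS, lambda pages: PAGE_CLASS[pages[0]])
print(S)  # ['C0', 'C2', 'C3']
print(W)  # [('C0', 'L1', 'C0'), ('C0', 'L2', 'C2'), ('C0', 'L3', 'C3')]
```

Only link-collections are followed, so pages without data (disclaimers, contact pages) are never visited unless a classified link-collection points to them.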


Site-Wide Wrapper Induction – Evaluation
SOURCE    #C   #C’  #D     TP     FN     FP    P      R
MSDChem   1    1    N/A    N/A    N/A    N/A   N/A    N/A
ChEBI     3    1    1711   1195   516    0     100    69.8
KEGG      10   7    6223   5044   1179   188   97     81.1
Average                                        98.5   75.5

Table 3: Site-wide wrapper induction results, 20 test pages for each class
(C=Classes, C’=Classes discovered, D=Data entries)


Error Detection and Correction:

Observation: Certain data reappear on more
than one class of pages


Error Detection and Correction:
 Reinforcement if reappearing data is correctly classified as
data
 Otherwise it points to misclassification
 Label-Data Mismatch
• Correction: Introduce more samples
 Label-Label Mismatch
• Cannot be detected
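
The check can be sketched as a comparison of how the same value was classified on different page classes. This is a sketch under stated assumptions: the function name, the page class names "compound" and "enzyme", and the example classification are made up for illustration; only the values come from the slides.

```python
def check_reinforcement(classified):
    """Compare the role of each value across page classes.

    classified maps a page class to {value: "data" or "label"}.
    Returns (reinforced, mismatched) sets of values.
    """
    roles_by_value = {}
    for page_class, roles in classified.items():
        for value, role in roles.items():
            roles_by_value.setdefault(value, {})[page_class] = role

    reinforced, mismatched = set(), set()
    for value, by_class in roles_by_value.items():
        if len(by_class) < 2:
            continue                      # value does not reappear
        roles = set(by_class.values())
        if roles == {"data"}:
            reinforced.add(value)         # consistently data: reinforcement
        elif roles == {"data", "label"}:
            mismatched.add(value)         # label-data mismatch
        # label on every class: a label-label mismatch would go unnoticed
    return reinforced, mismatched


classified = {
    "compound": {"C00221": "data", "3.2.1.21": "label", "Enzyme": "label"},
    "enzyme": {"C00221": "data", "3.2.1.21": "data", "Enzyme": "label"},
}
reinforced, mismatched = check_reinforcement(classified)
print(reinforced)  # {'C00221'}
print(mismatched)  # {'3.2.1.21'}
```

A value reported in `mismatched` signals that the wrapper for one of the classes needs more sample pages.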


Where to go next?

 Reverse engineering production
1. LOD: emitting RDF & RDFS
2. Navigation model: what belongs to what
3. Interaction model (not treated at all by us so far)
4. Layout model: spatial positioning

 Capture this generative model using machine learning
 Relational learning
• Markov logic programmes?
• …?


Bibliography

 Linda d’Oro, Massimo Ruffolo, Steffen Staab. SXPath –
Extending XPath towards Spatial Querying on Web
Documents. In: PVLDB – Proceedings of the VLDB
Endowment, 4(2): 129-140, 2010.
 S. Mir, S. Staab, I. Rojas. Site-Wide Wrapper Induction for
Life Science Deep Web Databases. In: DILS-2009 – Proc.
of the Data Integration in the Life Sciences Workshop,
Manchester, UK, July 20-22, LNCS, Springer, 2009.
 Saqib Mir, Steffen Staab, Isabel Rojas. An Unsupervised
Approach for Acquiring Ontologies and RDF Data from
Online Life Science Databases. In: 7th Extended Semantic
Web Conference (ESWC2010), Heraklion, Greece, May
30-June 3, 2010, pp. 319-333.


Thank you for your attention!
