The Road to the Semantic Web
Michael Genkin
SDBI 2010@HUJI
"The Semantic Web is not a separate
Web but an extension of the current
one, in which information is given well-
defined meaning, better enabling
computers and people to work in
cooperation."
Tim Berners-Lee, James Hendler and Ora Lassila; Scientific
American, May 2001
Over 25 billion RDF triples (October 2010)
More than 24 billion web pages (June 2010)
Probably more than one triple per page; a lot more, in fact
How will we populate the
Semantic Web?
 Humans will enter structured data
 Data-store owners will share their data
 Computers will read unstructured data
Read the Web
http://rtw.ml.cmu.edu/rtw/
(or google it)
Roadmap
 Motivation
 Some definitions
 Natural language processing
 Machine learning
 Macro reading the web
 Coupled training
 NELL
 Demo
 Summary
Some Definitions
 Natural Language Processing
 Machine Learning
Natural Language Processing
 Part of Speech Tagging (e.g. noun, verb)
 Noun phrase: a phrase that normally
consists of a (modified) head noun.
 “pre-modified” (e.g. this, that, the red…)
 “post-modified” (e.g. …with long hair,
…where I live)
 Proper noun: a noun which represents a unique entity (e.g. Jerusalem, Michael)
 Common noun: a noun which represents
a class of entities (e.g. car, university)
Learning: What is it?
 Assume there is some knowledge base KB.
 Let A_KB be an algorithm that uses KB to perform a set of tasks T.
 Let Perf be a performance metric.
 We will say that a computer program learns if:
 KB1 > KB2 ⇒ Perf(A_KB1, T) > Perf(A_KB2, T)
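A minimal sketch of this criterion in Python (not from the slides; algorithm, perf, and the KB representation are hypothetical placeholders):

    # Sketch: a program "learns" if a strictly larger knowledge base
    # yields strictly better measured performance on the same tasks.
    def learns(algorithm, perf, tasks, kb_small, kb_large):
        assert kb_small < kb_large  # KBs as Python sets; proper-subset check
        return perf(algorithm(kb_large), tasks) > perf(algorithm(kb_small), tasks)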
Training Methods
Supervised
 We have a set of labeled examples (KB) and a domain (D)
 Examples might be positive or negative
 e.g. for every example (𝑥, 𝑦) ∈ 𝐾𝐵, 𝑓(𝑥) = 𝑦 for some 𝑓.
 The learning algorithm A tries to find such an 𝑓.
 𝑓 is called a classifier (or a regression function)
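As a toy illustration (not from the slides; the candidate functions are assumed given rather than searched for cleverly), a supervised learner picks the f most consistent with the labeled examples:

    # Sketch: choose the candidate f agreeing with the most labeled examples.
    def train_supervised(kb, candidate_fs):
        # kb: list of (x, y) pairs; candidate_fs: iterable of functions
        return max(candidate_fs, key=lambda f: sum(1 for x, y in kb if f(x) == y))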
Unsupervised
 Distinguished from supervised learning in that there are no labeled examples (KB = D).
 The unsupervised learning algorithm A tries to find a classifier that, given some 𝑑 ∈ 𝐷 as input, returns some arbitrary label.
 i.e. the algorithm A analyses the structure of D
Semi-Supervised
 A middle way between supervised and unsupervised.
 Use a minimal amount of labeled examples and a large amount of unlabeled ones.
 Learn the structure of D in an unsupervised manner, but use the labeled examples to constrain the results. Repeat.
 Known as bootstrapping.
Bootstrapping
 Iterative semi-supervised learning
[Diagram: seed cities (Jerusalem, Tel Aviv, Haifa) yield extraction patterns such as "mayor of arg1" and "life in arg1"; the patterns extract new instances (Ness-Ziona, London, Amsterdam) but also drift to patterns like "arg1 is home of" and "traits such as arg1", which extract non-cities (denial, anxiety, selfishness)]
 Under-constrained!
 Semantic drift
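A minimal sketch of the loop pictured above (find_patterns and extract_instances are hypothetical helpers, passed in so the sketch stays self-contained); note that nothing constrains what a pattern may extract, which is exactly what permits semantic drift:

    # Sketch: unconstrained bootstrapping, prone to semantic drift.
    def bootstrap(seeds, corpus, find_patterns, extract_instances, iterations=10):
        instances, patterns = set(seeds), set()
        for _ in range(iterations):
            # Lexical contexts of known instances, e.g. "mayor of arg1"
            patterns |= find_patterns(corpus, instances)
            # Whatever fills those contexts, e.g. Ness-Ziona, but also "denial"
            instances |= extract_instances(corpus, patterns)
            # One overly general pattern ("life in arg1") admits non-cities,
            # which then promote more bad patterns on the next iteration.
        return instances, patterns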
Macro Reading the Web
Populating the Semantic Web by Macro-Reading
Internet Text.
T.M. Mitchell, J. Betteridge, A. Carlson, E.R. Hruschka Jr.,
and R.C. Wang. Invited Paper, In Proceedings of the
International Semantic Web Conference (ISWC), 2009
Problem Specification (1): Input
 Initial ontology that contains:
 Dozens of categories and relations
 (e.g. Company, CompanyHeadquarteredInCity)
 Relations between categories and relations
 (e.g. mutual exclusion, type constraints)
 A few seed examples of each predicate in
ontology
 The web
 Occasional access to a human trainer
Problem Specification (2): The Task
 Run forever (24x7)
 Each day:
 Run over ~500 million web pages.
 Extract new facts and relations from the web to populate the ontology.
 Perform better than the day before
 Populate the semantic web.
A Solution?
 An automatic, learning, macro-reader.
Micro vs. Macro Reading (1)
 Micro-reading: the traditional NLP task of
annotating a single web page to extract
the full body of information contained in
the document.
 NLP is hard!
 Macro-reading: the task of “reading” a large corpus of web pages (e.g. the web) and returning a large collection of facts expressed in the corpus.
 But not necessarily all the facts.
Micro vs. Macro Reading (2)
 Macro-reading is easier than micro-
reading. Why?
 Macro-reading doesn’t require extracting
every bit of information available.
 In text corpora as large as the web, many important facts are stated redundantly, thousands of times, using different wordings.
 Benefit by ignoring complex sentences.
 Benefit by statistically combining evidence
from many fragments to determine a belief in
a hypothesis.
Why an Input Ontology?
 The problem with understanding free text
is that it can mean virtually anything.
 By formulating the problem of macro-
reading as populating an ontology we
allow the system to focus only on relevant
documents.
 The ontology can define meta-properties of its categories and relations.
 Allows populating parts of the Semantic Web for which an ontology is available.
Machine Learning Methods
 Semi-supervised (use an ontology to learn).
 Learn textual patterns for extraction.
 Employ methods such as Coupled Training
to improve accuracy.
 Expand the ontology to improve
performance.
Coupled Training
Bootstrapping – Revised
 Iterative semi-supervised learning
[Diagram repeated: seed cities (Jerusalem, Tel Aviv, Haifa), the patterns "mayor of arg1" and "life in arg1", new extractions (Ness-Ziona, London, Amsterdam), and drift terms (denial, anxiety, selfishness) via "arg1 is home of" and "traits such as arg1"]
Coupled Training
 Couple the training of multiple functions to
make unlabeled data more informative
 Makes the learning task easier by adding constraints
Coupling (1):
Output Constraints
 We wish to train a function 𝑓: 𝑋 → 𝑌
 e.g. 𝑐𝑖𝑡𝑦: 𝑁𝑜𝑢𝑛𝑃ℎ𝑟𝑎𝑠𝑒 → {0,1}
 Assume we have 𝑓1: 𝑋1 → 𝑌, 𝑓2: 𝑋2 → 𝑌, two different functions that assign the label city but receive different inputs.
 Coupling constraint: 𝑓1, 𝑓2 must agree over
unlabeled data.
Coupling (1):
Output Constraints
[Diagram: in "Nir Barkat is the mayor of Jerusalem", two classifiers that receive different inputs (X1, X2) for the same noun phrase must agree on Y=city? (=), while the mutually exclusive Y=country? must be rejected (≠)]
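A sketch of the agreement constraint over unlabeled data (the classifier stubs are hypothetical): predictions are kept only where both classifiers, each seeing its own input, assign the same label:

    # Sketch: enforce that two "city" classifiers agree on unlabeled data.
    def agreed_labels(f1, f2, unlabeled):
        # unlabeled: iterable of (x1, x2), two different inputs describing
        # the same noun phrase occurrence
        return [(x1, x2, f1(x1)) for x1, x2 in unlabeled if f1(x1) == f2(x2)]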
Coupling (2):
Compositional Constraints
 Assume we have 𝑓1: 𝑋1 → 𝑌1, 𝑓2: 𝑋1 × 𝑋2 → 𝑌2
 Assume we have a constraint on valid
𝑦1, 𝑦2 pairs given 𝑥1, 𝑥2.
 Coupling constraint: 𝑓1, 𝑓2 must satisfy the
constraint on 𝑦1, 𝑦2.
 e.g. 𝑓1 “type checks” the first argument of 𝑓2
Coupling (2):
Compositional Constraints
[Diagram: "Nir Barkat is the mayor of Jerusalem" as MayorOf(X1, X2); each argument is checked against candidate categories (city? location? politician?)]
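A sketch of the compositional check, assuming a hypothetical category classifier f1 is available:

    # Sketch: f1 "type checks" the arguments of a MayorOf(x1, x2) candidate.
    def mayor_of_type_checks(f1, x1, x2):
        # f1 maps a noun phrase to a category label such as "politician" or "city"
        return f1(x1) == "politician" and f1(x2) == "city"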
Coupling (3):
Multi-view Agreement
 We have a function 𝑓: 𝑋 → 𝑌
 Suppose X can be partitioned into two “views”: 𝑋 = < 𝑋1, 𝑋2 >.
 Assume 𝑋1 and 𝑋2 can predict Y.
 We wish to learn 𝑓1: 𝑋1 → 𝑌, 𝑓2: 𝑋2 → 𝑌
 Coupling constraint: 𝑓1, 𝑓2 must agree.
Coupling (3):
Multi-view Agreement
 Let Y be a set of possible web page categories
 Let X be a set of web pages
 Assume 𝑋1 represents the words in a page
 Assume 𝑋2 represents the words in
hyperlinks pointing to the page
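A sketch of multi-view agreement on this example (page_words and inbound_anchor_words are hypothetical helpers supplying the two views):

    # Sketch: trust a page's label only when both views agree on it.
    def multi_view_labels(f1, f2, pages, page_words, inbound_anchor_words):
        # f1 classifies a page by its own words (X1); f2 by the words of
        # hyperlinks pointing at it (X2)
        return {p: f1(page_words(p)) for p in pages
                if f1(page_words(p)) == f2(inbound_anchor_words(p))}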
NELL – Never-Ending
Language Learning
Coupled Semi-Supervised Learning for Information Extraction.
A. Carlson, J. Betteridge, R.C. Wang, E.R. Hruschka Jr. and T.M.
Mitchell. In Proceedings of the ACM International Conference on
Web Search and Data Mining (WSDM), 2010.
Never Ending Language Learning
Tom Mitchell's invited talk in the Univ. of Washington CSE
Distinguished Lecture Series, October 21, 2010.
Motivation
 Humans learn many things, for years, and
become better learners over time
 Why not machines?
Coupled Constraints (1)
 Mutual Exclusion:
 Two mutually exclusive predicates can’t both be satisfied by the same input 𝑥.
 Relation argument type checking:
 Ensures that the noun phrases satisfying each relation correspond to the categories defined for that relation (both checks are sketched below).
 e.g. the CompanyIsInEconomicSector relation has arguments of the Company and EconomicSector categories.
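A sketch of both checks (beliefs, mutex, argument_types, and candidates are hypothetical dictionaries standing in for the ontology and the current knowledge base):

    # Sketch: NELL-style coupled constraints on candidate beliefs.
    def passes_mutual_exclusion(x, predicate, beliefs, mutex):
        # x may not already be believed for any predicate that is mutually
        # exclusive with `predicate` (e.g. a city cannot also be a person)
        return all(x not in beliefs[q] for q in mutex[predicate])

    def passes_type_check(x1, x2, relation, argument_types, candidates):
        # both arguments must be candidates for the categories the ontology
        # declares for the relation, e.g. (Company, EconomicSector)
        t1, t2 = argument_types[relation]
        return x1 in candidates[t1] and x2 in candidates[t2]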
Coupled Constraints (2)
 Unstructured and Semi-structured text
features:
 Noun phrases appear on the web in free-text contexts or semi-structured contexts.
 The free-text and semi-structured classifiers will make independent mistakes
 But each is sufficient for classification
Both classifiers must agree.
Coupled Pattern Learner (CPL):
Overview
 Learns to extract category and relation instances.
 Learns high-precision
textual patterns.
 e.g. arg1 scored a
goal for arg2
Coupled Pattern Learner (CPL):
Extracting
 Runs forever; each iteration bootstraps the patterns promoted in the previous iteration to extract new candidate instances.
 Selects the 1000 instances that co-occur with the most patterns.
 A similar procedure extracts patterns, using recently promoted instances.
 Uses PoS heuristics to accomplish extraction
 e.g. a per-category proper/common noun specification; a pattern is a sequence of verbs followed by adjectives, prepositions, or determiners (and optionally preceded by nouns).
Coupled Pattern Learner (CPL):
Filtering and Ranking
 Candidates are filtered to enforce mutual exclusion and type constraints
 A candidate is rejected unless it co-occurs with a promoted pattern at least three times more often than it co-occurs with mutually exclusive predicates.
 Candidates are ranked as follows (both steps are sketched below):
 Instances: by the number of promoted patterns they co-occur with.
 Patterns: by a precision estimate
 Precision(p) = Σ_{i ∈ ℐ} count(i, p) / count(p)
 where ℐ is the set of promoted instances, count(i, p) is the number of times i and p co-occur in the corpus, and count(p) is the number of times p occurs.
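A sketch of the filter and both ranking rules (count_pair, count, cooccur_promoted, and cooccur_mutex are hypothetical corpus statistics):

    # Sketch: CPL-style candidate filter and pattern precision estimate.
    def keep_candidate(c, cooccur_promoted, cooccur_mutex):
        # reject unless c co-occurs with promoted patterns of its predicate at
        # least three times more than with mutually exclusive predicates
        return cooccur_promoted(c) >= 3 * cooccur_mutex(c)

    def pattern_precision(p, promoted_instances, count_pair, count):
        # Precision(p) = sum_i count(i, p) / count(p) over promoted instances i
        return sum(count_pair(i, p) for i in promoted_instances) / count(p)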
Coupled Pattern Learner (CPL):
Promoting Candidates
 For each predicate, promotes at most 100 instances and 5 patterns.
 Highest ranked first.
 Instances and patterns are promoted only if they co-occur with at least two promoted patterns or instances, respectively (see the sketch below).
 Relation instances are promoted only if their arguments are candidates for the specified categories.
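A sketch of the promotion step under these rules (rank and support are hypothetical scoring helpers):

    # Sketch: promote at most 100 instances and 5 patterns per predicate,
    # highest ranked first, each supported by >= 2 promoted items.
    def promote(instances, patterns, rank, support):
        ok_i = sorted((i for i in instances if support(i) >= 2), key=rank, reverse=True)
        ok_p = sorted((p for p in patterns if support(p) >= 2), key=rank, reverse=True)
        return ok_i[:100], ok_p[:5]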
Coupled SEAL (1)
 SEAL is an established wrapper induction
algorithm.
 Creates page-specific extractors
 Independent of language
 Category wrappers defined by prefix and
postfix, relation wrappers defined by infix.
 Wrappers for each predicate are learned independently (a simplified wrapper-application sketch follows).
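For illustration only, a simplified application of a learned category wrapper (this is not SEAL's actual code): because a wrapper is just a prefix/postfix character pair, applying it is plain string matching, which is why the approach is language-independent:

    # Sketch: apply one prefix/postfix category wrapper to one page.
    def apply_wrapper(page_text, prefix, postfix):
        found, i = [], 0
        while True:
            start = page_text.find(prefix, i)
            if start == -1:
                return found
            start += len(prefix)
            end = page_text.find(postfix, start)
            if end == -1:
                return found
            found.append(page_text[start:end])  # candidate instance
            i = end + len(postfix)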
Coupled SEAL (2)
 Coupled SEAL adds mutual exclusion and type-checking constraints to SEAL.
 Bootstraps recently promoted wrappers.
 Filters candidates that violate mutual exclusion or are not of the right type for the relation.
 Uses a single page per domain for ranking.
 Promotes the top 100 instances extracted by at
least two wrappers.
Meta-Bootstrap Learner
 Couples the training of
multiple extraction
techniques.
 Intuition: different
extractors will make
independent errors.
 Replaces the PROMOTE
step of subordinate
extractor algorithms.
 Promotes any instance recommended by all the extractors, as long as mutual exclusion and type checks hold (sketched below).
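A sketch of the shared PROMOTE step (passes_checks is a hypothetical predicate bundling the mutual-exclusion and type checks):

    # Sketch: promote only what every subordinate extractor recommends
    # and the coupling checks allow.
    def meta_promote(candidate_sets, passes_checks):
        agreed = set.intersection(*candidate_sets)  # one set per extractor
        return {x for x in agreed if passes_checks(x)}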
Learning New Constraints
 Data-mine the KB to infer new beliefs.
 Generates probabilistic, first-order Horn clauses.
 Connects previously uncoupled predicates.
 Rules are filtered manually.
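For illustration, a rule of the kind such mining can produce (the predicate names and the probability shown are illustrative, not quoted from NELL's knowledge base):

    0.93  athletePlaysSport(x, s) :- athletePlaysForTeam(x, t), teamPlaysSport(t, s)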
Demo Time
 http://rtw.ml.cmu.edu/rtw/kbbrowser/
Summary
Populating the Semantic Web by using NELL for macro-reading
Populating the Semantic Web
 Many ways to accomplish this.
 Use an initial ontology to focus and constrain the learning task.
 Couple the learning of many, many
extractors.
 Macro Reading: instead of annotating a
single page each time, read many pages
simultaneously.
 A never ending task.
Macro-Reading
 Helps to improve accuracy.
 Still doesn’t help to annotate a single
page, but…
 Many things that are true for a single
page are also true for many pages
 Helps to populate databases with
frequently mentioned knowledge
Future Directions
 Coupling with external sources
 DBpedia, Freebase
 Ontology extension
 New relations through reading, subcategories
 Use a macro-reader to train a micro-reader
 Self-reflection, self-correction
 Distinguishing tokens from entities
 Active learning – crowdsourcing
Questions?
mishagenkin@cs.huji.ac.il
Editor's Notes
  • #2: I'm happy to open the season of graded lectures. After we've heard a brief overview of the field, the technologies, and the research directions, and seen a demonstration of how synthetic data stores can be created for testing semantic systems, I want to take a step back and discuss how we get from the web we know today to realizing the Semantic Web vision.
  • #3: I like to open with a quote. In this case the content of the quote matters less; what matters is its age. The idea of the Semantic Web is not very new (it is almost a decade old).
  • #4: There is still a lot of work ahead of us... We will return to the question that Shuki raised several times: how?
  • #15: We want to find cities: start with a list of cities, find lexical contexts, and repeat.
  • #39: ℐ – promoted instances; count(i,p) – the number of times i and p co-occur in the corpus; count(p) – the number of times p occurs in the corpus
  • #45: New instances per iteration – heat map