Exploring Content
with Semantic Transformations
using Collaborative Knowledge Bases
Yegin Genc
Prof. Jeffrey V. Nickerson
OBJECTIVE

Understanding text automatically to support
search-driven exploratory activities.
EXPLORATORY SEARCH

LOOKUP
• Fact retrieval
• Known item search
• Navigation

LEARN
• Knowledge acquisition
• Comprehension/interpretation
• Comparison

INVESTIGATE
• Accretion
• Analysis
• Exclusion/Negation

Source: Marchionini, G. (2006)
EXPLORATORY SEARCH
ILL-STRUCTURED PROBLEM
• No single right approach
• Problem definitions change as new
information is gathered
Foreign minorities, Germany
Text: “Foreign Minorities Germany”
Exploratory Search Task

Given a journal abstract, rank other abstracts
based on their relevancy to the seed abstract.

Evaluation is based on relevancy and diversity.
Seed Document → n-grams (1 to 3) → Candidates → Concepts
(Concepts are the candidates that match a Wikipedia page title
and are connected through the ontology.)

DOCUMENT – CONCEPT Θ (D x K) = DOCUMENT – WORD D (D x W) * CONCEPT – WORD K (W x K)

• DOCUMENT – WORD D: Tf-idf(D)
• CONCEPT – WORD K: Tf-idf(K)
• Ranking: Argsort(row.sum(Θ))

D: Documents
K: Concepts
W: Words
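
The candidate step (n-grams of length 1 to 3 matched against Wikipedia page titles) could be sketched roughly as follows. This is only an illustration: the tokenizer, the function names `ngrams` and `match_concepts`, and the small title list are assumptions, and the ontology-connection filter mentioned on the slide is not modeled here.

```python
import re

def ngrams(text, n_max=3):
    """Return all word n-grams of length 1..n_max, lowercased."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }

def match_concepts(text, page_titles, n_max=3):
    """Keep only the n-grams that equal a (lowercased) page title."""
    titles = {t.lower() for t in page_titles}
    return sorted(ngrams(text, n_max) & titles)

# Hypothetical title list; a real run would need a full Wikipedia title index.
titles = ["Data abstraction", "Static analysis", "Aliasing", "Solar System"]
seed = "A static analysis is given for confinement, a restriction on aliasing."
concepts = match_concepts(seed, titles)
```

Exact title matching is the simplest possible choice; redirects, stemming, or disambiguation pages would need extra handling that the slides do not describe.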
EXTRACTING CONCEPT NETWORK
“Representation independence formally characterizes the
encapsulation provided by language constructs for data
abstraction and justifies reasoning by simulation.
Representation independence has been shown for a
variety of languages and constructs but not for shared
references to mutable state; indeed it fails in general for
such languages. This article formulates representation
independence for classes, in an imperative, object-oriented
language with pointers, subclassing and dynamic
dispatch, class oriented visibility control, recursive types
and methods, and a simple form of module. An instance
of a class is considered to implement an abstraction using
private fields and so-called representation objects.
Encapsulation of representation objects is expressed by a
restriction, called confinement, on aliasing.
Representation independence is proved for programs
satisfying the confinement condition. A static analysis is
given for confinement that accepts common designs such
as the observer and factory patterns. The formalization
takes into account not only the usual interface between a
client and a class that provides an abstraction but also the
interface (often called “protected”) between the class
and its subclasses.”
WIKIPEDIA PAGES AS CONCEPTS
Solar System
“The Solar System[a] consists of the Sun and the
astronomical objects gravitationally bound in orbit
around it, all of which formed from the collapse of a giant
molecular cloud approximately 4.6 billion years ago…”
(http://en.wikipedia.org/wiki/Solar_System)

Word Stem   Occ.   Freq.
abstract    53     0.056
program     44     0.046
langu       33     0.035
spec        16     0.017
comput      12     0.013
conceiv     12     0.013
dat         12     0.013

\begin{displaymath}
\beta_k = p(W_i \mid k) = \frac{\{W_i \in k\}}{\sum_{i}^{N} \{W_i \in k\}}
\end{displaymath}

Here {W_i ∈ k} is the number of occurrences of word W_i on the page for concept k, and N is the number of distinct words.

βk : Per-concept word distribution
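
As a rough illustration of the per-concept word distribution, the sketch below turns a list of word stems from one page into relative frequencies (occurrences divided by the page total). The function name `word_distribution` and the toy stem list are assumptions; stemming and stop-word removal are taken to have happened upstream.

```python
from collections import Counter

def word_distribution(stems):
    """Relative frequency of each stem on one concept page: count / total."""
    counts = Counter(stems)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Toy stemmed page (hypothetical), not the page behind the table above.
page = ["abstract", "program", "abstract", "langu",
        "abstract", "program", "dat", "spec"]
beta_k = word_distribution(page)
```

The resulting values sum to 1 for each page, which is what lets each concept be treated as a distribution over words.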
RANKING DOCUMENTS

DOCUMENT – CONCEPT Θ (D x K) = DOCUMENT – WORD D (D x W) * CONCEPT – WORD K (W x K)

D: Documents
K: Concepts
W: Words
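
The matrix product behind the document-concept scores can be sketched with NumPy. All numbers below are made up, and the weights are simply assumed to be precomputed (for example as tf-idf values); only the shapes follow the slide.

```python
import numpy as np

# Toy sizes: 3 documents, 4 words, 2 concepts (all values are invented).
D = np.array([[1.0, 0.0, 2.0, 0.0],   # document-word weights, shape (D x W)
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0]])
K = np.array([[0.5, 0.0],             # concept-word weights, shape (W x K)
              [0.0, 0.5],
              [0.5, 0.0],
              [0.0, 0.5]])

theta = D @ K                          # document-concept scores, shape (D x K)
```

Each entry of `theta` is the dot product of one document's word weights with one concept's word weights, so a document scores highly on a concept when they share heavily weighted words.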
SORT DOCUMENTS

DOCUMENT – CONCEPT Θ (D x K) = DOCUMENT – WORD D (D x W) * CONCEPT – WORD K (W x K)

D: Documents
K: Concepts
W: Words
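
Sorting documents by the row sums of Θ might look like the sketch below. The scores are hypothetical, and descending order (highest total first) is an assumption, since the slides only name the argsort step.

```python
import numpy as np

theta = np.array([[1.5, 0.0],   # hypothetical document-concept scores (D x K)
                  [0.0, 1.0],
                  [0.5, 0.5],
                  [2.0, 1.0]])

scores = theta.sum(axis=1)                    # one total per document
ranking = np.argsort(-scores, kind="stable")  # document indices, best first
top_k = ranking[:2]                           # e.g. keep the top 2
```

A stable sort keeps tied documents in their original order, which makes the ranking reproducible across runs.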
EXPERIMENT
Given a journal abstract, rank other abstracts based on
their relevancy to the seed abstract.

• Data: 619 abstracts of the Journal of the ACM
(JACM) and their references.
• Task: Select Top-k (5, 10, 15, and 20) relevant
abstracts.
• Observe: Relevancy (measured by LSA vector
similarity) and Diversity (measured through the
coverage of the references).
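
The two measurements could be sketched as below. Cosine similarity stands in for "LSA vector similarity", and the coverage definition (the share of the seed's references cited by at least one selected abstract) is an assumption, since the slides do not spell out how coverage was computed. The names `cosine` and `reference_coverage` are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors (e.g. LSA vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def reference_coverage(seed_refs, selected_refs):
    """Assumed diversity measure: fraction of the seed's references that are
    cited by at least one of the selected abstracts."""
    seed_refs = set(seed_refs)
    if not seed_refs:
        return 0.0
    cited = set().union(*selected_refs)  # everything the selection cites
    return len(seed_refs & cited) / len(seed_refs)

# Hypothetical reference identifiers.
seed = {"r1", "r2", "r3", "r4"}
selected = [{"r1", "r9"}, {"r2", "r1"}]
coverage = reference_coverage(seed, selected)
```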
MAXIMAL MARGINAL RELEVANCE
• A measure to increase the diversity of documents
retrieved by an IR system

- Similarity to query: BM25 (Xapian [1])
- Similarity to results: LSA similarity (Gensim [2])

[1] http://xapian.org
[2] http://radimrehurek.com/gensim/
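
A greedy MMR selection can be sketched as follows. It takes precomputed similarities as plain lists, so no Xapian or Gensim calls are shown; the function name `mmr_select`, the default λ of 0.5, and the toy values are assumptions.

```python
def mmr_select(query_sim, doc_sim, k, lam=0.5):
    """Greedy MMR: pick k documents, trading query relevance against redundancy.

    query_sim[i]  : similarity of document i to the query (Sim1)
    doc_sim[i][j] : similarity between documents i and j (Sim2)
    """
    selected = []
    remaining = list(range(len(query_sim)))
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy values (invented): documents 0 and 1 are near-duplicates.
query_sim = [0.9, 0.85, 0.5]
doc_sim = [[1.0, 0.95, 0.1],
           [0.95, 1.0, 0.1],
           [0.1, 0.1, 1.0]]
picked = mmr_select(query_sim, doc_sim, k=2, lam=0.5)
```

With λ = 0.5 the near-duplicate document 1 is skipped in favor of the less relevant but novel document 2; with λ = 1.0 the selection reduces to plain relevance ranking. Because BM25 scores and LSA similarities live on different scales, some normalization of `query_sim` would likely be needed in practice.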
MMR RESULTS
WIKI-BASED MODEL VS MMR
CONCLUDING REMARKS
• Our Wiki-based technique provides high
diversity with low relevancy loss.
• Semantics embedded in concept networks
extracted from Wikipedia can improve
exploratory search tasks.

Editor's Notes

  • #2: However, the majority of inquiries go beyond simple fact checks.
  • #4: Searches involving the cognitive processing and interpretation of new knowledge; searches requiring critical assessment before being integrated into knowledge bases; search-driven exploration activities. Exploratory search relies on other information/cognitive behaviors: sense-making, organizing and analyzing search results, and decision making.
  • #5: p.24: This kind of ill-structured problems 1) begin with a lack of information necessary to develop a solution or even precisely define the problem, 2) have no single right approach for solution, 3) have problem definitions that change as new information is gathered, and 4) have no identifiable ‘correct’ solution [3].
  • #6: It’s hard for search systems to identify concepts and their relationships.
  • #9: Concepts are characterized as distributions over observed words in Wikipedia pages. Use posterior expectations / approximate posterior inference: Gibbs sampling, variational inference.
  • #10: Ontology deals with questions concerning what entities exist or can be said to exist, and how such entities can be grouped, related within a hierarchy, and subdivided according to similarities and differences. Ontologies can be used to model concepts and their interrelationships (Lanzenberger et al., 2010). In this sense, ontologies represent the relevant aspects of context. To effectively comprehend cross-lingual corpora, tools that can explore the dependencies between language and context are needed.
  • #12: Concepts are characterized as distributions over observed words in Wikipedia pages. Each topic is a distribution over words.
  • #15: Today, most user searches are of an exploratory nature, in the sense that users are interested in retrieving pieces of information that cover many aspects of their information needs.
  • #16: The principle is similar to TF-IDF, where query terms are weighted based on their frequency in a document (tf) and across the corpus (idf). In addition, the ratio of the document length to the average document length is taken into account in K, and BM25 is parameterized for further optimization. We used Xapian’s implementation of BM25 with default parameters.
    \subsection{Maximal Marginal Relevance (MMR)}
    One approach to diversifying search results is to optimize them on two criteria: similarity to the query (relevance) and dissimilarity to the other relevant documents (novelty). Maximal Marginal Relevance (MMR) \cite{Carbonell:1998ja}, for example, works on this principle: the similarity of a document to a query is adjusted based on its similarity to the other documents that are more similar to the query.
    \small
    \begin{displaymath}
    MMR = \underset{D_i \in R \setminus S}{\arg\max} \left[ \lambda \, Sim_1\left(D_i, Q\right) - \left(1-\lambda\right) \max_{D_j \in S} Sim_2\left(D_i, D_j\right) \right]
    \end{displaymath}
  • #19: can arguably be lessened, because the semantics strips away extraneous context while at the same time providing better diversity within the universe of relevant documents
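
The BM25 weighting mentioned in the notes (term frequency, inverse document frequency, and a document-length ratio folded into K) can be illustrated with a textbook Okapi BM25 sketch. This is not Xapian's implementation: the parameter values `k1=1.2` and `b=0.75`, the smoothed idf variant, and the function name `bm25_scores` are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25)."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        # K grows with the ratio of document length to average length.
        K = k1 * (1 - b + b * len(d) / avgdl)
        s = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + K)
        scores.append(s)
    return scores

# Toy tokenized corpus (invented).
docs = [["static", "analysis", "of", "aliasing"],
        ["solar", "system", "formation"],
        ["aliasing", "aliasing", "and", "confinement"]]
scores = bm25_scores(["aliasing"], docs)
```

In this toy corpus the third document outranks the first because it is the same length but mentions the query term twice, while the second document scores zero.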