Unsupervised Graph-based Topic Labelling using DBpedia

Authors: Ioana Hulpus, Conor Hayes, Derek Greene
SEXI/WSDM2013 reading group
@Quasi-quant2010

June 30, 2013

Outline

1. Abstract
   - Motivation
   - Main results
2. Analysis flow
   - Framework
   - Worked example
   - Formulation
3. Graph construction from DBpedia
   - Sense Graph Connectivity within a Topic Graph
   - Labelling
4. Experiments
   - Data
   - Evaluation method
   - Results


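Abstract

Motivation

1. Models such as LDA that extract labels from documents rest on unrealistic assumptions:
   - The correct label does not necessarily appear in the documents.
   - The corpus is not necessarily large enough to determine the correct label.
2. We want to solve these problems by adding external information.
3. The authors ran comparison experiments between a model that combines Eigen-WSD, which they published in 2012, with DBpedia (the external information) and a probabilistic model.

Q. Mei, X. Shen, and C. Zhai. Automatic labeling of multinomial topic models. In SIGKDD '07, pages 490-499, 2007.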


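Abstract

Main results

1. The range of meaning covered by the labels (Coverage) improves over the baseline model.
2. The accuracy of the labels (Precision) improves over the baseline model.

Figure 1: vertical axis: Precision and Coverage; horizontal axis: top-k. Precision is the relevance for a topic at top-k. Coverage is the fraction of topics with at least one Hit at rank ≤ k.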


Analysis flow

The Canopy Framework: four main components

1. Topic extraction
   Apply LDA to the corpus to extract topics.
2. Word-sense disambiguation (WSD)
   The WSD determines a set $C^\theta$ of DBpedia concepts, where each $C \in C^\theta$ represents the identified sense of one of the top-k words of a topic.
3. Graph extraction
   Obtain a good candidate set by extracting a topic graph $G$ from DBpedia consisting of the close neighbours of the concepts $C_i$ and the links between them. We investigate how to define the relation $r(C^\theta, C^*)$.
4. Labelling the extracted graph
   We adopt principles from social network analysis to identify in $G$ the most prominent concepts for labelling a topic $\theta$.
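The graph-extraction step (component 3) can be sketched as a bounded breadth-first search. This is a minimal sketch, not the authors' implementation: the toy `adjacency` dictionary stands in for DBpedia links, and the function name `extract_topic_graph` and the `max_hops=2` radius are assumptions chosen for illustration.

```python
from collections import deque


def extract_topic_graph(adjacency, seeds, max_hops=2):
    """Return (nodes, edges) of the subgraph induced by every node that lies
    within `max_hops` of at least one seed concept.

    adjacency: dict mapping a concept to the set of its neighbours (undirected).
    seeds:     the disambiguated concepts C_i of one topic.
    """
    # Multi-source BFS: dist[node] is the hop distance to the nearest seed.
    dist = {s: 0 for s in seeds if s in adjacency}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the radius
        for nb in adjacency.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    nodes = set(dist)
    # Keep only links whose two endpoints were both retained.
    edges = {
        frozenset((u, v))
        for u in nodes
        for v in adjacency.get(u, ())
        if v in nodes
    }
    return nodes, edges


# Hypothetical toy graph standing in for DBpedia.
toy = {
    "Atom": {"Physics", "Chemistry"},
    "Energy": {"Physics"},
    "Physics": {"Atom", "Energy", "Science"},
    "Chemistry": {"Atom", "Science"},
    "Science": {"Physics", "Chemistry", "Knowledge"},
    "Knowledge": {"Science", "Philosophy"},
    "Philosophy": {"Knowledge"},
}
nodes, edges = extract_topic_graph(toy, ["Atom", "Energy"], max_hops=2)
```

With the seeds "Atom" and "Energy", the retained nodes are the two seeds plus "Physics", "Chemistry" and "Science"; "Knowledge" and "Philosophy" fall outside the two-hop radius.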


Analysis flow

Worked example

[Figure: worked example from "Unsupervised Graph-based Topic Labelling using DBpedia"; image not preserved]


Analysis flow

Formulation

Let $C^\theta$ be a set of $n$ DBpedia concepts $C_i$, $i = 1, \dots, n$, that correspond to a subset of the top-k words representing one topic. The problem is to identify the concept $C^*$ from all available concepts in DBpedia such that the relation $r(C^\theta, C^*)$ is strongest, where $r$ is measured by a centrality score.
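Read together with the labelling component of the framework, the formulation can be written as a single optimisation. This is only a sketch under two assumptions that the slide does not state: candidates are restricted to the node set $V(G)$ of the topic graph $G$, and a larger value of $r$ means a stronger relation.

```latex
C^{*} \;=\; \operatorname*{arg\,max}_{C \,\in\, V(G)} \; r\!\left(C^{\theta},\, C\right),
\qquad
C^{\theta} = \{C_1, \dots, C_n\}
```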






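Graph construction from DBpedia

Sense Graph Connectivity within a Topic Graph

Measure:

$$\mathrm{PairConnectivity}^{C^\theta} = \frac{\sum_{C_i \in C^\theta,\ C_j \in C^\theta,\ i \neq j} \mathbb{1}\{V_i \cap V_j \neq \emptyset\}}{|C^\theta|\,(|C^\theta| - 1)}$$

where $V_i$ is the node set of the sense graph of concept $C_i$.

In a test on 111 topics, the summary statistics of PairConnectivity were:
1. NonRandom: mean 0.46, standard deviation 0.07
2. RandomShuffle: mean 0.07, standard deviation 0.02

Hence the sense graphs inside a topic graph obtained with Eigen-WSD on DBpedia share elements with one another, and this overlap is not due to chance.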
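The measure above can be computed directly from the node sets of the sense graphs. The following is a minimal sketch, assuming each sense graph is given as a plain Python set of node names and that the sum runs over ordered pairs with i ≠ j; the function name `pair_connectivity` and the example sets are hypothetical.

```python
from itertools import permutations


def pair_connectivity(sense_graph_nodes):
    """Fraction of ordered pairs (V_i, V_j), i != j, whose node sets overlap.

    sense_graph_nodes: list of sets, one node set V_i per concept C_i.
    """
    n = len(sense_graph_nodes)
    if n < 2:
        raise ValueError("need at least two sense graphs")
    # permutations(..., 2) yields every ordered pair of distinct positions.
    connected = sum(
        1 for v_i, v_j in permutations(sense_graph_nodes, 2) if v_i & v_j
    )
    return connected / (n * (n - 1))


# Hypothetical node sets: the first two overlap on "b", the third is isolated.
example = [{"a", "b"}, {"b", "c"}, {"d"}]
score = pair_connectivity(example)  # 2 overlapping ordered pairs out of 6
```

A value near 1 means almost every pair of sense graphs shares at least one node; a value near 0 means the sense graphs are mostly disjoint.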




Labelling

Centrality

1. Common measures, which consider shortest paths only:
   - Closeness centrality
   - Betweenness centrality
2. Measures that consider all possible connections in the network, not only shortest paths:
   - Information centrality
   - Random walk betweenness centrality
3. Measures adopted by the authors:
   - Focused Closeness Centrality (fCC)
   - Focused Information Centrality (fIC)
   - Focused Betweenness Centrality (fBC)
   - Focused Random Walk Betweenness Centrality (fRWB)

The above measures fCC, fIC, fBC and fRWB are the ones that we experimented with for defining the target function $r$, which quantifies the strength of the relation between each candidate concept and all other concepts in the topic graph $G$.
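The slides name the focused measures without defining them. The sketch below is an illustrative stand-in, not the authors' exact fCC: it assumes that "focused" means a candidate is scored only by its shortest-path distances to the seed concepts in $C^\theta$, and it uses the reciprocal of the mean distance. The function names and the toy graph are hypothetical.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Hop distances from `source` to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adjacency.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def seed_closeness(adjacency, seeds, candidate):
    """Reciprocal of the mean distance from `candidate` to the other seeds."""
    dist = bfs_distances(adjacency, candidate)
    others = [s for s in seeds if s != candidate]
    if not others or any(s not in dist for s in others):
        return 0.0  # no seeds to compare with, or some seed is unreachable
    return len(others) / sum(dist[s] for s in others)


def best_label(adjacency, seeds):
    """Candidate with the highest score; ties go to the alphabetically first."""
    return max(sorted(adjacency), key=lambda c: seed_closeness(adjacency, seeds, c))


# Hypothetical toy topic graph.
toy = {
    "Atom": {"Physics", "Chemistry"},
    "Energy": {"Physics"},
    "Physics": {"Atom", "Energy", "Science"},
    "Chemistry": {"Atom", "Science"},
    "Science": {"Physics", "Chemistry", "Knowledge"},
    "Knowledge": {"Science", "Philosophy"},
    "Philosophy": {"Knowledge"},
}
seeds = ["Atom", "Energy"]
label = best_label(toy, seeds)
```

In this toy graph "Physics" is one hop from both seeds, so it receives the highest score and is selected as the label.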


Experiments

Data

- British Academic Written English Corpus
- BBC corpus
- StackExchange dataset

Note: the data were reduced by filtering with stop URLs.


Experiments

Evaluation method

Survey participants rated each label as "Good Fit", "Too Broad", "Related but not a good label", or "Unrelated". The evaluation used these ratings grouped into the following two classes:

1. Good Fit
   - Good Fit
2. Good-Fit-or-Broader
   - Good Fit
   - Too Broad

$$\mathrm{Precision}(k) = \frac{\#\,\text{Hits with rank} \le k}{k}$$

$$\mathrm{Coverage}(k) = \frac{\#\,\text{topics with at least one Hit at rank} \le k}{\#\,\text{topics}}$$
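The two metrics can be sketched as follows. This is a minimal sketch under stated assumptions: each topic's ranked labels are represented as a list of booleans (True marks a Hit), and Precision(k) is averaged over topics, which the slide does not specify. The function names and the example judgements are hypothetical.

```python
def precision_at_k(hits_by_topic, k):
    """Mean over topics of (number of Hits with rank <= k) / k."""
    per_topic = [sum(ranked[:k]) / k for ranked in hits_by_topic]
    return sum(per_topic) / len(per_topic)


def coverage_at_k(hits_by_topic, k):
    """Fraction of topics with at least one Hit at rank <= k."""
    covered = sum(1 for ranked in hits_by_topic if any(ranked[:k]))
    return covered / len(hits_by_topic)


# Hypothetical judgements for three topics, three ranked labels each.
judgements = [
    [True, False, True],    # Hits at ranks 1 and 3
    [False, False, True],   # Hit at rank 3 only
    [False, False, False],  # no Hit
]
p2 = precision_at_k(judgements, 2)
c2 = coverage_at_k(judgements, 2)
c3 = coverage_at_k(judgements, 3)
```

Which ratings count as a Hit depends on the class being evaluated: only "Good Fit" for the first class, and "Good Fit" or "Too Broad" for the second.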


Experiments

Results
