Topic/S – A Topic and Trend Recognition Approach in News-Media, I-Semantics13

Michael Aleythe, Martin Voigt, Peter Wehner

Sächsische AufbauBank
Research and Development - Project Funding
Project number - 99457/2677

Structure
Motivation, Problems, and Goals

Topic/S Workflow
Demo
Conclusion

Friday, 06.09.2013

Topic/S

Slide 1

Motivation
Newsroom

Source: ringier.com


Problem
In-house production: Archive, Online
News agencies: DPA, Reuters, KNA, …
Web, social media: Twitter, Facebook, Blogs, …

Overwhelming amount of data
e.g., WAZ: 5000 articles/day from agencies and in-house production

Vision
Automatic topic discovery using Named Entities and other keywords (Semantic Items, SemItem)
Investigation of trending topics
Push them to the editor

[Diagram: Media Assets (MA1, MA2), Named Entities (E1 to E7), Topics (T1 to T3); stages labeled Pre-Processing and Post-Processing]

Structure
Topic/S Workflow
– Overview
– Information Extraction
– Storage
– Topic Detection
Demo
Conclusion


Workflow


Workflow: Preprocessor
Language Recognition (Ger/Eng): rule based
Named Entity Extraction: word list + statistics
Keyword Extraction: lemmatization, word list
Categorisation: source based

Source: onelanguageoneposter.com
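
The rule-based language recognition step can be illustrated with a minimal sketch. The slides do not state which rules Topic/S uses, so the stop-word lists and the function name `detect_language` below are assumptions chosen only for illustration.

```python
# Hypothetical hint lists: the slides only say "rule based", not which rules.
GERMAN_HINTS = {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit", "von"}
ENGLISH_HINTS = {"the", "and", "is", "not", "a", "an", "with", "of", "to", "for"}


def detect_language(text: str) -> str:
    """Return 'de', 'en', or 'unknown' by counting language-specific stop words."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    german_hits = sum(t in GERMAN_HINTS for t in tokens)
    english_hits = sum(t in ENGLISH_HINTS for t in tokens)
    if german_hits == english_hits:
        return "unknown"  # no evidence, or evidence is balanced
    return "de" if german_hits > english_hits else "en"


print(detect_language("Die Kanzlerin ist nicht in Berlin und der Minister auch nicht"))
print(detect_language("The minister is not in Berlin with the chancellor"))
```

A rule of this kind is cheap and transparent, which fits a two-language (Ger/Eng) setting; it would need more care for short texts such as tweets.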

Semantic Model


Semantic Facts
Named Entities required but no lists available

SemItem | Number (with alt. names)
Person | 1,504,341 (2,499,962)
Organization | 63,332 (98,127)
Place | 89,702 (95,178)
Keyword | 1,351

Stored preferred and alternative names
ID: http://www.topic-s.de/topics-facts/id/person/Rene_Muller
Names: Rene Muller, Rene Müller, René Muller, René Müller

Triples without SemItems: 27.6 million
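
How preferred and alternative names might resolve to one SemItem can be sketched as a small lookup index. The class `SemItemIndex`, the helper `fold`, and the short ID `person/Rene_Muller` are hypothetical; the slides only show that several spellings are stored per ID, not how matching is done. Folding diacritics is an added assumption that makes the four spellings from the slide collapse to one key.

```python
import unicodedata


def fold(name: str) -> str:
    """Lower-case a name and strip diacritics (assumed matching rule)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


class SemItemIndex:
    """Hypothetical in-memory index from name variants to SemItem IDs."""

    def __init__(self):
        self._by_name = {}

    def add(self, item_id, names):
        for name in names:
            self._by_name.setdefault(fold(name), set()).add(item_id)

    def lookup(self, surface_form):
        return self._by_name.get(fold(surface_form), set())


index = SemItemIndex()
index.add(
    "person/Rene_Muller",  # shortened, hypothetical ID
    ["Rene Muller", "Rene Müller", "René Muller", "René Müller"],
)
print(index.lookup("RENÉ MÜLLER"))
```

Returning a set rather than a single ID leaves room for ambiguous names, which is where disambiguation becomes necessary.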

Storage of Semantic Data
Using Oracle 11gR2
Pros
Already available, existing knowledge
Integrated querying of relational and semantic data
Cons
Inference
Incomplete SPARQL 1.1 support
Limited custom rule support
Benchmark of triple stores [Voigt2012]

Workflow: Topic Detection
Clustering


Clustering


Clustering
Obama, Merkel: Politics
Audi, Highway: Traffic
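
The slides do not name the clustering algorithm, so the following is only a sketch of one plausible approach: a greedy single-pass clustering of articles by Jaccard similarity of their SemItem sets. The function names, the threshold of 0.3, and the sample articles are all assumptions.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_articles(articles: dict, threshold: float = 0.3) -> list:
    """Greedy single-pass clustering (assumed technique, not from the slides).

    articles maps an article ID to its set of SemItems. Each article joins the
    most similar existing cluster if the similarity reaches the threshold,
    otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"items": set, "articles": list}
    for article_id, items in articles.items():
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(items, cluster["items"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"items": set(items), "articles": [article_id]})
        else:
            best["items"] |= items
            best["articles"].append(article_id)
    return clusters


# Hypothetical articles built from the SemItems shown on the slide.
sample = {
    "a1": {"Obama", "Merkel", "Politics"},
    "a2": {"Merkel", "Politics", "Berlin"},
    "a3": {"Audi", "Highway", "Traffic"},
    "a4": {"Highway", "Traffic"},
}
result = cluster_articles(sample)
for cluster in result:
    print(cluster["articles"], sorted(cluster["items"]))
```

The result of a single-pass scheme depends on article order; a production system would more likely use a batch or incremental algorithm with a tuned threshold.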


Clustering (Top Cluster 25.08.2013)

Articles | Name | HotTopic
43 | Bundesliga, Fußball, Spieltag, 1. FC Union Berlin, SC Paderborn 07 eV, FC Augsburg, FSV Frankfurt | Yes
25 | Euro, SPD, Berlin, Griechenland, FDP, CDU, Deutschland | Yes
19 | Bericht, Diplomat, Google Inc, Anbieter, Berlin, Deutschland, Auto | Yes
18 | Veranstaltung, Bernd Lucke, Angreifer, Berlin, Polizei, Angriff, Deutschland | Yes
15 | Gericht, Prozess, Bo Xilai, Christian Wulff, Anklage, Verfahren, Mord | Yes


Live Demo


Sum it up!
Result
Identifying topics and pushing them to the editor
Lessons learned
NER: weak for non-English text; a combination of word list and statistical NER is required
The model needs to be optimized for queries
A dedicated user interface is required
Outlook
Prediction of topics with causal/temporal relations

Source: ooltapulta.com
Source: business-strategy-innovation.com

Thanks! Questions?


Named Entity Recognition
Source: churchthought.com

Word list
Tool: LingPipe + Extension
Sources: LOD (DBpedia, GeoNames, YAGO2, GND)
Advantages: controlled vocabulary, guaranteed recognition of entities

Statistics
Tool: Stanford NLP
Source: pre-trained model
Advantage: recognition of unknown entities
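
The combination of the two NER strategies can be sketched as follows. This does not call LingPipe or Stanford NLP; the statistical tagger is passed in as a plain callable, and `gazetteer_entities`, `combined_entities`, and `fake_tagger` are hypothetical names. Letting the word list override the statistical result on conflicts is also an assumption, motivated by the "controlled vocabulary" advantage listed above.

```python
def gazetteer_entities(text: str, gazetteer: dict) -> dict:
    """Find word-list entries in the text.

    gazetteer maps a surface form to an entity type. Plain substring matching
    is used here for brevity; it ignores token boundaries.
    """
    lowered = text.lower()
    return {name: etype for name, etype in gazetteer.items() if name.lower() in lowered}


def combined_entities(text: str, gazetteer: dict, statistical_tagger) -> dict:
    """Merge both sources; word-list hits override statistical ones (assumption)."""
    entities = dict(statistical_tagger(text))  # may include unknown entities
    entities.update(gazetteer_entities(text, gazetteer))  # controlled vocabulary wins
    return entities


def fake_tagger(text):
    """Stand-in for a pre-trained statistical model (hypothetical output)."""
    return [("Bernd Lucke", "Person"), ("Berlin", "Organization")]


word_list = {"Angela Merkel": "Person", "Berlin": "Place"}
found = combined_entities("Angela Merkel met Bernd Lucke in Berlin.", word_list, fake_tagger)
print(found)
```

The statistical tagger contributes "Bernd Lucke", which is missing from the word list, while the word list corrects the type of "Berlin".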

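Categorization

[Diagram: one categoriser per source (Categoriser Reuters, Categoriser DPA, Categoriser OTS); an article from DPA goes to the DPA categoriser and is assigned an IPTC Media Topic, e.g. Politics]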


Categorization - Quality

News-Agency | Accuracy
KNA | 80.3 %
DPA | 94.4 %
EPD | 80.3 %
Reuters | 90.8 %
OTS | 93.5 %
AFP | 86 %

Method | Accuracy
One cat. for all agencies | 85 %
One cat. per agency | 87.5 %
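
A quick arithmetic check on these figures: the unweighted mean of the six per-agency accuracies is about 87.55 %, which is consistent with the 87.5 % reported for one categoriser per agency. The slides do not say how that figure was computed, so treating it as an unweighted mean is an assumption; an average weighted by article counts could differ.

```python
# Per-agency accuracies in percent, copied from the slide.
per_agency_accuracy = {
    "KNA": 80.3,
    "DPA": 94.4,
    "EPD": 80.3,
    "Reuters": 90.8,
    "OTS": 93.5,
    "AFP": 86.0,
}

# Assumption: every agency counts equally (unweighted mean).
mean_accuracy = sum(per_agency_accuracy.values()) / len(per_agency_accuracy)

# Accuracy of one categoriser for all agencies, from the slide.
single_categoriser_accuracy = 85.0
gain = mean_accuracy - single_categoriser_accuracy

print(f"mean per-agency accuracy: {mean_accuracy:.2f} %")
print(f"gain over a single categoriser: {gain:.2f} percentage points")
```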


Keywords
Lemmatization

Source: hugdaily.org

Developing a word list

Extraction using the word list
Bonus: frequent terms of an article
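
The keyword steps above can be sketched as: lemmatize tokens, keep those found in the word list, and additionally report the most frequent terms. The lemma table, the word list, and the function names are hypothetical; a real system would use a proper lemmatizer and would filter stop words before counting.

```python
from collections import Counter

# Hypothetical lemma table and keyword list, for illustration only.
LEMMAS = {"prozesse": "prozess", "prozesses": "prozess", "gerichte": "gericht"}
WORD_LIST = {"gericht", "prozess", "anklage"}


def lemmatize(token: str) -> str:
    """Map an inflected form to its lemma; unknown tokens are returned unchanged."""
    return LEMMAS.get(token, token)


def extract_keywords(text: str, top_n: int = 3):
    """Return (keywords found in the word list, most frequent lemmas)."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    lemmas = [lemmatize(t) for t in tokens if t]
    keywords = sorted({lemma for lemma in lemmas if lemma in WORD_LIST})
    frequent = [term for term, _ in Counter(lemmas).most_common(top_n)]
    return keywords, frequent


keywords, frequent = extract_keywords("Prozess Gericht Prozesse Anklage Prozesses", top_n=1)
print(keywords, frequent)
```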


Disambiguation

Source: de.wikipedia.org

Source: fansshare.com

Source: lounge.espdisk.com


Disambiguation
Identification of Entity Cluster

[Diagram: the entity cluster "Michael Jackson, Beer" is compared with internal facts (Michael Jackson: Beer, Whiskey) and external facts from DBpedia etc. (Michael Jackson: Music, King of Pop)]

Problem: not all SemItems are available in the LOD
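
One way to read this slide is that a candidate is chosen by comparing the SemItems that co-occur in an article with the facts known about each candidate. The sketch below implements that reading as a simple overlap count; the candidate IDs, the fact sets, and the function `disambiguate` are hypothetical, and the slides do not confirm that Topic/S scores candidates this way.

```python
# Hypothetical candidates for the name "Michael Jackson" with related facts,
# using the terms shown on the slide.
CANDIDATES = {
    "Michael_Jackson_(writer)": {"beer", "whiskey"},
    "Michael_Jackson_(singer)": {"music", "king of pop"},
}


def disambiguate(context_items, candidates):
    """Pick the candidate whose facts overlap most with the article context.

    Returns None when no candidate shares any fact with the context, which
    mirrors the problem that not all SemItems are available in the LOD.
    On a tie, the first candidate in insertion order is returned.
    """
    context = {item.lower() for item in context_items}
    scores = {cid: len(context & facts) for cid, facts in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate({"Beer", "Brewery"}, CANDIDATES))
print(disambiguate({"Music", "Album"}, CANDIDATES))
print(disambiguate({"Football"}, CANDIDATES))
```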