Michael Aleythe, Martin Voigt, Peter Wehner

Sächsische AufbauBank
Research and Development - Project Funding
Project number - 99457/2677
Structure
Motivation, Problems, and Goals

Topic/S Workflow
Demo
Conclusion

Friday, 06.09.2013

Topic/S

Slide 1
Motivation
Newsroom

Source: ringier.com
Problem
In-house production: Archive, Online
News agencies: DPA, Reuters, KNA, …
Web, social media: Twitter, Facebook, Blogs, …

Overwhelming amount of data
e.g., WAZ: 5000 articles/day from agencies and in-house production

Vision
Automatic topic discovery using Named Entities and other keywords (Semantic Items, SemItem)
Investigation of trending topics
Push them to the editor

[Figure: Media Assets (MA1, MA2), Named Entities (E1 to E7), Topics (T1 to T3); stages: Pre-Processing, Post-Processing]

Structure
Motivation, Problems, and Goals
Topic/S Workflow
– Overview
– Information Extraction
– Storage
– Topic Detection
Demo
Conclusion

Workflow

Workflow: Preprocessor
Language Recognition (Ger/Eng)
– Rule based
Named Entity Extraction
– word list + statistics
Keyword Extraction
– Lemmatization, word list
Categorisation
– Source based

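The slides only state that language recognition between German and English is rule based. One simple rule of that kind is counting stop-word hits per language. The sketch below illustrates that idea under assumptions: the two word lists and the function name `detect_language` are hypothetical and not taken from Topic/S.

```python
import re

# Hypothetical stop-word lists; a real rule set would be much larger.
GERMAN_HINTS = {"der", "die", "das", "und", "ist", "nicht", "mit", "ein", "eine", "für"}
ENGLISH_HINTS = {"the", "and", "is", "not", "with", "a", "an", "for", "of", "to"}


def detect_language(text: str) -> str:
    """Return 'de', 'en', or 'unknown' by counting stop-word hits."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    german = sum(token in GERMAN_HINTS for token in tokens)
    english = sum(token in ENGLISH_HINTS for token in tokens)
    if german == english:
        # No evidence, or a tie: do not guess.
        return "unknown"
    return "de" if german > english else "en"


if __name__ == "__main__":
    print(detect_language("Die Kanzlerin ist mit dem Ergebnis nicht zufrieden"))
    print(detect_language("The president is not happy with the result"))
```

Returning `unknown` on a tie keeps ambiguous or very short texts out of the language-specific steps that follow.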
Semantic Model

Semantic Facts
Named Entities required but no lists available

SemItem | Number (with alt. names)
Person | 1,504,341 (2,499,962)
Organization | 63,332 (98,127)
Place | 89,702 (95,178)
Keyword | 1,351

Stored preferred and alternative names
ID: http://www.topic-s.de/topics-facts/id/person/Rene_Muller
Names: Rene Muller, Rene Müller, René Muller, René Müller

Triples without SemItems: 27.6 million

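The four names stored for the example person differ only in their diacritics. A minimal sketch of how such alternative names could be derived from one preferred name follows. It assumes that variants come from folding accents per token; the slides do not say how Topic/S builds its name lists, and the helpers `fold`, `name_variants`, and `local_id` are hypothetical.

```python
import itertools
import unicodedata


def fold(token: str) -> str:
    """Strip combining marks, e.g. 'Müller' -> 'Muller'."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_variants(preferred: str) -> list[str]:
    """All combinations of original and accent-folded tokens, sorted."""
    options = [sorted({token, fold(token)}) for token in preferred.split()]
    return sorted(" ".join(combo) for combo in itertools.product(*options))


def local_id(preferred: str) -> str:
    """Assumed ID scheme: folded name with underscores, as in 'Rene_Muller'."""
    return fold(preferred).replace(" ", "_")


if __name__ == "__main__":
    print(local_id("René Müller"))
    for variant in name_variants("René Müller"):
        print(variant)
```

Indexing every variant under the same ID lets a lookup succeed regardless of which spelling an agency uses.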
Storage of Semantic Data
Using Oracle 11gR2
Pros
– Already available, existing knowledge
– Integrated querying of relational and semantic data
Cons
– Inference
– Incomplete SPARQL 1.1 support
– Limited custom rule support
Benchmark of triple stores [Voigt2012]

Workflow: Topic Detection
Clustering

[Figure: example clusters: Obama, Merkel (Politics); Audi, Highway (Traffic)]

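The slides do not name the clustering algorithm. As an illustration only, the sketch below groups articles whose sets of SemItems overlap strongly, using Jaccard similarity and connected components. The threshold, the article IDs, and the function names are assumptions, not the Topic/S implementation.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Share of SemItems two articles have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_articles(articles: dict, threshold: float = 0.5) -> list:
    """Group article IDs that are linked by similarity >= threshold."""
    parent = {article_id: article_id for article_id in articles}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(articles, 2):
        if jaccard(articles[a], articles[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for article_id in articles:
        clusters.setdefault(find(article_id), []).append(article_id)
    # Largest clusters first, as in a "top cluster" listing.
    return sorted((sorted(m) for m in clusters.values()), key=lambda m: (-len(m), m))


if __name__ == "__main__":
    sample = {
        "a1": {"Obama", "Merkel", "Berlin"},
        "a2": {"Obama", "Merkel"},
        "a3": {"Audi", "Highway"},
        "a4": {"Audi", "Highway", "Stau"},
        "a5": {"Bundesliga"},
    }
    print(cluster_articles(sample))
```

Sorting by cluster size mirrors a ranking by article count; any criterion for flagging a cluster as a hot topic would be a separate step.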
Workflow: Topic Detection
Clustering (Top Cluster 25.08.2013)

Articles | Name | HotTopic
43 | Bundesliga, Fußball, Spieltag, 1. FC Union Berlin, SC Paderborn 07 eV, FC Augsburg, FSV Frankfurt | Yes
25 | Euro, SPD, Berlin, Griechenland, FDP, CDU, Deutschland | Yes
19 | Bericht, Diplomat, Google Inc, Anbieter, Berlin, Deutschland, Auto | Yes
18 | Veranstaltung, Bernd Lucke, Angreifer, Berlin, Polizei, Angriff, Deutschland | Yes
15 | Gericht, Prozess, Bo Xilai, Christian Wulff, Anklage, Verfahren, Mord | Yes

Live Demo

Sum it up!
Result
– Identifying topics and pushing them to the editor
Lessons learned
– NER: weak for non-English text, combination of word list and statistics required
– model needs to be optimized for queries
– dedicated user interface required
Outlook
– prediction of topics with causal/temporal relations

Thanks! Questions?

Workflow: Preprocessor
Named Entity Recognition
word list
– Tool: LingPipe + Extension
– Sources: LOD (DBpedia, GeoNames, YAGO2, GND)
– Advantages: controlled vocabulary, guaranteed recognition of entities

statistics
– Tool: Stanford NLP
– Source: pre-trained model
– Advantage: Recognition of unknown entities

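The two approaches complement each other: the word list recognizes known entities reliably, while the statistical model also finds unknown ones. The sketch below shows one way the two result sets could be merged. It does not call LingPipe or Stanford NLP; the statistical spans are passed in as plain tuples, and the rule "word-list spans win on overlap" is an assumption.

```python
def gazetteer_matches(text: str, gazetteer: dict) -> list:
    """Return (start, end, label, surface) for every word-list name in text.

    Plain substring search: a real matcher would respect token boundaries.
    """
    matches = []
    for name, label in gazetteer.items():
        start = text.find(name)
        while start != -1:
            matches.append((start, start + len(name), label, name))
            start = text.find(name, start + 1)
    return sorted(matches)


def merge(gazetteer_spans: list, statistical_spans: list) -> list:
    """Keep all word-list spans; add statistical spans that do not overlap them."""
    merged = list(gazetteer_spans)
    for span in statistical_spans:
        if all(span[1] <= g[0] or span[0] >= g[1] for g in gazetteer_spans):
            merged.append(span)
    return sorted(merged)


if __name__ == "__main__":
    text = "Angela Merkel met Bernd Lucke in Berlin."
    word_list = {"Angela Merkel": "Person", "Berlin": "Place"}  # toy entries
    # Hypothetical output of a statistical tagger for the same text.
    unknown = "Bernd Lucke"
    statistical = [
        (0, 13, "Person", "Angela Merkel"),
        (text.index(unknown), text.index(unknown) + len(unknown), "Person", unknown),
    ]
    for span in merge(gazetteer_matches(text, word_list), statistical):
        print(span)
```

Preferring the word list on overlap keeps the controlled vocabulary authoritative, and the statistical tagger only contributes entities the list does not cover.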
Workflow: Preprocessor
Categorization
[Figure: an article (e.g., from DPA) is passed to the categoriser for its agency (Categoriser Reuters, Categoriser DPA, Categoriser OTS) and assigned an IPTC Media Topic, e.g., Politics]

Workflow: Preprocessor
Categorization - Quality

News-Agency | accuracy
KNA | 80.3%
DPA | 94.4%
EPD | 80.3%
Reuters | 90.8%
OTS | 93.5%
AFP | 86%

Method | accuracy
One categoriser for all agencies | 85%
One categoriser per agency | 87.5%

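The slide does not say how the overall figure for the per-agency method was computed. A quick check suggests it is close to the unweighted mean of the six per-agency accuracies. The calculation below is a sanity check under that assumption, not a statement about the authors' evaluation.

```python
# Per-agency accuracies in percent, as listed on the slide.
per_agency_accuracy = {
    "KNA": 80.3,
    "DPA": 94.4,
    "EPD": 80.3,
    "Reuters": 90.8,
    "OTS": 93.5,
    "AFP": 86.0,
}

# Unweighted mean over agencies (assumption: every agency counts equally).
mean_accuracy = sum(per_agency_accuracy.values()) / len(per_agency_accuracy)

# Reported figures for the two methods.
single_categoriser = 85.0
per_agency_reported = 87.5

print(f"unweighted mean: {mean_accuracy:.2f} %")
print(f"reported gain over a single categoriser: {per_agency_reported - single_categoriser:.1f} points")
```

The mean comes out at about 87.55, which agrees with the reported value to within rounding. A mean weighted by article volume per agency could differ.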
Workflow: Preprocessor
Keywords
Lemmatization
Developing a word list
Extraction using the word list
Bonus: frequent terms of an article

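The keyword steps above can be sketched as a small pipeline: lemmatize the tokens, keep those found in the word list, and report the most frequent remaining terms as a bonus. The lemma table, word list, and stop words below are toy data invented for illustration; the slides do not name the lemmatizer that Topic/S uses.

```python
import re
from collections import Counter

# Hypothetical toy resources; real ones would come from a lemmatizer and a curated list.
LEMMAS = {"wahlen": "wahl", "parteien": "partei", "gerichte": "gericht"}
WORD_LIST = {"wahl", "partei", "gericht", "prozess"}
STOP_WORDS = {"die", "der", "das", "den", "vor", "und"}


def lemmatize(token: str) -> str:
    """Map an inflected form to its lemma; unknown tokens stay unchanged."""
    return LEMMAS.get(token, token)


def extract_keywords(text: str, top_n: int = 3) -> tuple:
    """Return (keywords from the word list, most frequent terms)."""
    tokens = [lemmatize(t) for t in re.findall(r"[a-zäöüß]+", text.lower())]
    keywords = sorted({t for t in tokens if t in WORD_LIST})
    content = [t for t in tokens if t not in STOP_WORDS]
    frequent = [term for term, _ in Counter(content).most_common(top_n)]
    return keywords, frequent


if __name__ == "__main__":
    print(extract_keywords("Die Parteien streiten vor den Wahlen. Die Wahl entscheidet."))
```

Lemmatizing before the lookup means the word list only needs base forms, which matters for an inflecting language such as German.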
Disambiguation
Identification of Entity Cluster
[Figure: the name "Michael Jackson" appears in several entity clusters; Internal Facts link it to Beer and Whiskey, External Facts (DBpedia, etc.) link it to Music and King of Pop]
Problem: not all SemItems available in the LOD

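One way to read the entity-cluster idea is that each candidate for an ambiguous name carries a set of related facts, and the candidate whose facts best match the article wins. The sketch below scores candidates by simple set overlap. The candidate IDs, the scoring rule, and the function name are assumptions for illustration.

```python
# Hypothetical candidates for the ambiguous name "Michael Jackson",
# each with the related items shown on the slide.
CANDIDATES = {
    "michael_jackson_a": {"Beer", "Whiskey"},
    "michael_jackson_b": {"Music", "King of Pop"},
}


def disambiguate(context_items: set, candidates: dict):
    """Pick the candidate sharing the most items with the article context.

    Returns None when no candidate shares anything with the context.
    On a tie, the first candidate in dictionary order is returned.
    """
    scores = {cid: len(facts & context_items) for cid, facts in candidates.items()}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(disambiguate({"Beer", "Brewery"}, CANDIDATES))
    print(disambiguate({"Music", "Album"}, CANDIDATES))
    print(disambiguate({"Football"}, CANDIDATES))
```

Returning `None` for a context without overlap reflects the problem noted on the slide: when a SemItem has no facts in the LOD, there is nothing to match against.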

Topic/S – A Topic and Trend Recognition Approach in News-Media, I-Semantics13