Linked Data Experiences at Springer Nature
Michele	Pasin	
Lead	Data	Architect	
Knowledge	Graph	Team
Linked	Data	Experiences	at	Springer	Nature	
Leipzig,	09/2016
Outline
• Who we are
• Why semantic technologies
• Our work so far
• The Scigraph project
• Looking ahead
Who	We	Are
Formed in May 2015 through the merger of Nature Publishing
Group, Palgrave Macmillan, Macmillan Education and Springer
Science+Business Media
13k employees in over 50 countries, EUR 1.5 billion turnover
[Pre-Merger]		Springer	Science	+	Business	Media	brands
[Pre-Merger]		Macmillan	Science	&	Education	brands
Holtzbrinck Publishing Group
We	publish	a	lot	of	science!	(since	1815)
13M documents
7M articles, 4M chapters
4k journals, 700k books
..and	generate	a	lot	of	traffic
11.5M monthly visitors
(nature.com)
260M visits per year
600M downloads per year
(link.springer.com)
Linked Data Experiences at Springer Nature
> Collaborative effort between Springer Nature and Digital Science
> Supporting internal use cases, but also contributing to an emerging web of linked science data
> Not just publications data but a wealth of other related information
Why	Semantic	Technologies
Why	is	Semantics	Important	To	Us?
Challenges: Data Silos
● Data is fragmented
● Data gets duplicated
● Data is hardcoded into applications
Change Drivers
● Digital first workflow
● User-centric design
● Unified Springer Nature domain
For example: our sites are currently organised around articles, journals and issues…
However, scientists are interested in answering questions about real world things…
Search engines do not know we have content about these things…
1st hit from nature.com…
Not linked to/from..
Today: Content base (PDF, XML, ePub, HTML, TIFF). We publish science
Tomorrow: Knowledge Graph. We manage knowledge
Vision
The Knowledge Graph is about collecting information about objects in the real world
…so that we can do a better job of providing users with what they're looking for
reads / writes
is about
interested in
Three areas of knowledge we care about
Relations: Reads / Writes; Works for; Funds; Lead researcher in; Produces; Studies; Located at; In proceedings; Contains; Cites; Has learning resource; Attends; Has topic; Produces
Opportunities: Tools & Services Along the Publishing Life Cycle
Stages: Research / Manuscript Creation; Manuscript Submission; Peer Review / Proposal Stage; Planning; Production; Publication; Distribution / Sales; Discovery
Roles: Researcher / Author; Editorial / Publisher; Reviewer
Our	Work	So	Far
Our	Work	So	Far	
Timeline 2012–2016: NPG Linked Data Platform; Nature Ontologies Portal; Springer Materials; Springer Conferences; Scigraph; Content Hub; Scigraph prototype; Nero Project; Linnaeus Project; Springer Protocols; CURI Semantic Annotation Project
NPG Linked Data Platform (2012)
Deliverables (2012–2014)
● Prototype for external use
● SPARQL query service
● Two RDF dataset releases in 2012
– April 2012 (22m triples)
– July 2012 (270m triples)
● Live updates to query endpoint
Led to (2014–)
● Focus on internal use-cases
● Publish ontology pages
● Periodic data snapshots
NPG Content Hub (2014): Hybrid Architecture
Features
● Hybrid RDF + XML architecture
– MarkLogic for XML, RDF/XML
– Triplestore (TDB) for RDF validation
● Repositories for binary assets
Layout
● Semantic RDF/XML includes in XML
● RDF objects serialized in list order
● Application XML for subject hierarchy
Indexes
● Indexes over all elements
● Range indexes for datatypes (e.g. dates)
Subject Pages (2014)
NPG Ontologies Portal (2015): Data Publishing
Springer Materials (2014)
Springer Conferences Portal (2015)
Scigraph	Project	(2016):	main	objectives
Data Integration
> Consolidation of existing LD efforts via a single domain model
> Ingestion and normalisation of third party datasets
Discoverability
> Better end user applications [B2C]
> Metadata delivery & validation [B2B]
> Data publishing [B2developers]
Scigraph	
what’s	in	it	
>	data	architecture,	taxonomies,	ontologies	
how	it	works	
>	ETL,	naming,	validation,	identity
Data	Landscape
Citations / References: 160M
Articles: 7M
Chapters: 3.6M
Journals: 4K
Books: 700k
Subjects: 4K
Article Types
Grants: 2M
Organizations: 60K
Conferences: 10K
Funders
Publishers
Universities
Persons: 1M
Relations
Publish states
Vocabularies
Scigraph Core
Ontologies and Taxonomies: overview
Complexity (ontological depth):
a glossary: a controlled vocabulary with NL definitions (e.g. lexicon)
- Publishers
- Relations
- Publish-states
a taxonomy: a c.v. that captures broaderThan / narrowerThan relationships
- Subjects
- Article Types
a thesaurus: taxonomy plus related terms; captures synonymy, homonymy etc.
a DB/OO scheme: relational model, unconstrained use of arbitrary relations
an axiomatized theory: arbitrary relations plus axioms, constraints and rules expressed in a logical language
Scigraph Core ontology
The	Core	Ontology
- Language: OWL 2, Profile: ALCHI(D)
- Entities: ~73 classes, ~250 properties
- Principles: Incremental Formalization / Enterprise Integration / Model Coherence
http://www.nature.com/ontologies/core/
The	Core	Ontology:	mappings
Core classes: :Thing, :Asset, :Publication, :Concept, :Event, :Subject, :Type, :Agent, :ArticleType, :PublishingEvent, :AggregationEvent, :Component, :Document, :Serial
Mapped classes (= owl:equivalentClass): cidoc-crm:Information_Carrier, cidoc-crm:Conceptual_Object, dbpedia:Agent, dc:Agent, dcterms:Agent, cidoc-crm:Agent, vcard:Agent, foaf:Agent, event:Event, bibo:Event, schema:Event, cidoc-crm:TemporalEntity, cidoc-crm:Type, vcard:Type, fabio:SubjectTerm, bibo:Document, cidoc-crm:Document, foaf:Document, bibo:Periodical, fabio:Periodical, schema:Periodical, bibo:DocumentPart, fabio:Expression, cidoc-crm:InformationObject
SKOS	taxonomies:	Poolparty	integration
SKOS	taxonomies:	Subjects
- Structure: SKOS, ~2500 concepts, multi-hierarchical tree, 6 branches, 7 levels of depth
- Mappings: 100% of terms, using skos:broadMatch or skos:closeMatch (DBpedia and MeSH)
- Document tagging: mostly manual, different workflows, often costly and inconsistent
Semi-Automatic	tagging	with	Dimensions	(from	UberResearch)
Scigraph	
what’s	in	it	
>	data	architecture,	taxonomies,	ontologies	
how	it	works	
>	ETL,	naming,	validation,	identity
Naming	Architecture:	federated	model
> Dereference and 303 redirects:
- http://name.scigraph.com/{things}/
- http://data.scigraph.com/{things}/
> Two patterns: schemas and instances
- http://name.scigraph.com/ontologies/{domain}/
- http://name.scigraph.com/{domain}/{things}/
> Prefixes for schemas and instances
- @prefix sg: <http://name.scigraph.com/ontologies/core/> .
> Entity names follow a robust convention
- camel case for naming terms, with an initial uppercase for classes and an initial lowercase for properties
> Named graphs used to track provenance
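The naming rules on this slide can be sketched in a few lines of Python. This is an illustration only: the helper names (`mint_name_uri`, `redirect_target`, `is_class_name`, `is_property_name`) are hypothetical, and the assumption that a URI on the name.scigraph.com host 303-redirects to the same path on the data.scigraph.com host is inferred from the two URL patterns listed, not stated in the slides.

```python
import re
from typing import Tuple

NAME_HOST = "http://name.scigraph.com"  # host used for entity names
DATA_HOST = "http://data.scigraph.com"  # host used for data documents


def mint_name_uri(domain: str, things: str) -> str:
    """Build an instance URI following the {domain}/{things}/ pattern."""
    return f"{NAME_HOST}/{domain}/{things}/"


def redirect_target(name_uri: str) -> Tuple[int, str]:
    """Return the (status, Location) pair a server could answer with.

    Assumption: the data document lives at the same path on the data host.
    """
    if not name_uri.startswith(NAME_HOST + "/"):
        raise ValueError(f"not a name URI: {name_uri}")
    return 303, DATA_HOST + name_uri[len(NAME_HOST):]


def is_class_name(term: str) -> bool:
    """Classes: camel case with an initial uppercase letter."""
    return re.fullmatch(r"[A-Z][A-Za-z0-9]*", term) is not None


def is_property_name(term: str) -> bool:
    """Properties: camel case with an initial lowercase letter."""
    return re.fullmatch(r"[a-z][A-Za-z0-9]*", term) is not None
```

For instance, `redirect_target(mint_name_uri("core", "journals"))` yields a 303 status together with the matching data-host URL, and `is_class_name("PublishingEvent")` holds while `is_class_name("hasKeyProperty")` does not. The `"core"` and `"journals"` values are made up for the example.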
Scigraph	-	Data	Flow
data sources: Peer Review, DDS, Core Media, UNSILO, TARGET, UberResearch, DBPedia etc..
integration layer: JSON-LD API, DDS Adapter, TTL Loader, RDF Loader ..
KNOWLEDGE GRAPH
real time services: Peer Review Service, Search Service (Content Hub)
applications: Peer Review, Oscar, Search
> data is normalised and mapped to SN ontologies
> data is extracted and denormalised so as to support applications
> data is delivered to applications via fast APIs
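The split between a normalised graph and denormalised documents served to applications can be illustrated with a small Python sketch. Everything in it is assumed for the example: the triples, the `ex:` and `sg:` identifiers, the field names and the helper functions are invented and do not come from the Scigraph data model.

```python
from collections import defaultdict

# Normalised view: one (subject, predicate, object) triple per fact.
# All identifiers below are invented for illustration.
triples = [
    ("ex:article1", "sg:title", "A study of graphs"),
    ("ex:article1", "sg:hasJournal", "ex:journal1"),
    ("ex:article1", "sg:hasSubject", "ex:subject1"),
    ("ex:article1", "sg:hasSubject", "ex:subject2"),
    ("ex:journal1", "sg:title", "Journal of Examples"),
    ("ex:subject1", "sg:label", "Physics"),
    ("ex:subject2", "sg:label", "Chemistry"),
]


def index_by_subject(triples):
    """Group triples as subject -> predicate -> list of objects."""
    index = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        index[s][p].append(o)
    return index


def denormalise_article(article_id, index):
    """Flatten an article and its linked entities into one document."""
    node = index[article_id]
    return {
        "id": article_id,
        "title": node["sg:title"][0],
        "journal": [index[j]["sg:title"][0] for j in node.get("sg:hasJournal", [])],
        "subjects": [index[s]["sg:label"][0] for s in node.get("sg:hasSubject", [])],
    }
```

Calling `denormalise_article("ex:article1", index_by_subject(triples))` returns a single dictionary with the journal title and subject labels inlined, which is the kind of shape a search index or a fast API would typically consume.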
ETL	Architecture:	main	features	[in	evolution]
Tech stack
> Airflow framework (Airbnb)
> Amazon S3 to make backups
> GraphDB triplestore (staging and presentation)
> Elasticsearch and APIs
Components & Principles
> Graph must be ‘ephemeral’
> Data sources versioning algorithm
> Identity Persistence service
> Validation via SHACL (TopBraid API)
ETL	Architecture
Sources: Persons, Articles, Publishers, Books (zip, XML, RDF, JSON, CSV, DB, Dataset, API)
Data Store: Amazon S3
Data Staging: Triplestore
Data Presentation: Triplestore
Linked Data Browser, Analytics, Reporting, APIs
✴ Extraction
✴ Validation
✴ Identity Persistence
✴ Updating / Replacing named graphs
✴ Versioning service (md5 checksum, timestamps, origin version, etc...)
✴ Integration (union graph)
✴ Inference
Named Graphs
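One way to read the versioning step is: compute a checksum for each incoming source file and replace the corresponding named graph only when the checksum changes. The Python sketch below follows that reading. The `VersionRegistry` class, the `load` function and the dictionary standing in for the triplestore are all hypothetical; the slides do not describe the actual algorithm beyond "md5 checksum, timestamps, origin version".

```python
import hashlib
from datetime import datetime, timezone


def md5_checksum(payload: bytes) -> str:
    """Hex digest used to detect whether a source file changed."""
    return hashlib.md5(payload).hexdigest()


class VersionRegistry:
    """Remembers the last checksum seen for each data source."""

    def __init__(self):
        self._versions = {}  # source name -> metadata dict

    def needs_reload(self, source: str, payload: bytes) -> bool:
        known = self._versions.get(source)
        return known is None or known["md5"] != md5_checksum(payload)

    def record(self, source: str, payload: bytes) -> dict:
        previous = self._versions.get(source)
        entry = {
            "md5": md5_checksum(payload),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "version": previous["version"] + 1 if previous else 1,
        }
        self._versions[source] = entry
        return entry


def load(source, payload, registry, graphs):
    """Replace the source's named graph only when its content changed."""
    if not registry.needs_reload(source, payload):
        return False
    graphs[source] = payload  # stand-in for dropping and reloading a named graph
    registry.record(source, payload)
    return True
```

Loading the same payload twice is a no-op the second time, while a changed payload replaces the graph and bumps the recorded version. Keeping one named graph per source is what makes this replace-on-change step cheap.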
Identity	Persistence
Identity Persistence Module
RDF Extractor
ingest #1: J1 (xml), id: 1, DOI: 123, issn: ABC
ingest #2: J2 (xml), id: 2, issn: ABC
ingest #3: J1 (xml), id: 1, DOI: 123, issn: ABC
Identity Registry: journals: 76as67fda76sd67a
sgo:core Ontology
sg:Journal
    a owl:Class ;
    sg:hasKeyProperty sg:doi ;
    sg:hasKeyProperty sg:issn ;
    sg:hasKeyProperty sg:eissn ;
    ....
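A minimal Python sketch of how key properties could drive identity persistence: a record is matched against a registry by any of its key property values, and it reuses the stored identifier when one matches. The `IdentityRegistry` class, the `KEY_PROPERTIES` table and the use of random UUIDs as persistent identifiers are assumptions made for illustration; the slides do not say how identifiers are minted or how conflicts are resolved.

```python
import uuid

# Key properties per class, mirroring sg:hasKeyProperty on the slide.
KEY_PROPERTIES = {"Journal": ("doi", "issn", "eissn")}


class IdentityRegistry:
    """Maps key property values to a persistent identifier."""

    def __init__(self):
        self._by_key = {}  # (class, property, value) -> persistent id

    def resolve(self, cls: str, record: dict) -> str:
        keys = [
            (cls, prop, record[prop])
            for prop in KEY_PROPERTIES[cls]
            if record.get(prop)
        ]
        if not keys:
            raise ValueError("record has no key property to identify it")
        # Reuse the id of the first key already known, else mint a new one.
        known = next((self._by_key[k] for k in keys if k in self._by_key), None)
        persistent_id = known or uuid.uuid4().hex
        for k in keys:
            self._by_key.setdefault(k, persistent_id)
        return persistent_id
```

With this sketch, a first ingest of a journal with DOI 123 and ISSN ABC mints an identifier, a later record carrying only ISSN ABC resolves to the same identifier, and re-ingesting the first record is stable. The case where two keys of one record already point to different identifiers is deliberately left out.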
Data	Validation:	from	SPIN	to	SHACL
> SPIN SPARQL syntax (2011, TopQuadrant)
> Example: “if a Journal instance has no short title, raise an Exception”
> Main drawback: hard for non-specialists to read and maintain
Data	Validation:	from	SPIN	to	SHACL
> SHACL - Shapes Constraint Language (2016, TopQuadrant)
> Example: “all article instances should have a valid DOI”
> Example: “all grant instances should have at most 1 start year and 1 end year”
> Approach: polish data before it enters the triplestore; use triplestore inference primarily for integration
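The "validate before loading" approach and the two example constraints can be mimicked in plain Python. This is not SHACL and does not use the TopBraid API: the record layout (property name mapped to a list of values), the helper names and the DOI regular expression (a loose `10.<registrant>/<suffix>` shape check) are all assumptions for the sake of the example.

```python
import re

# Loose shape check for a DOI: "10." + registrant code + "/" + suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def validate_article(article: dict) -> list:
    """Return violation messages; an empty list means the record is valid."""
    dois = article.get("doi", [])
    if not dois:
        return ["missing DOI"]
    return [f"malformed DOI: {d!r}" for d in dois if not DOI_PATTERN.match(d)]


def validate_grant(grant: dict) -> list:
    """Check that start year and end year each occur at most once."""
    violations = []
    for prop in ("startYear", "endYear"):
        count = len(grant.get(prop, []))
        if count > 1:
            violations.append(f"{prop}: expected at most 1 value, found {count}")
    return violations


def load_if_valid(records, validator, store):
    """Append only valid records to the store; return rejected ones with reasons."""
    rejected = []
    for record in records:
        problems = validator(record)
        if problems:
            rejected.append((record, problems))
        else:
            store.append(record)
    return rejected
```

In a real pipeline the two validators would be replaced by declarative shapes, which is the readability gain the slide attributes to SHACL over SPIN; the control flow of `load_if_valid` (reject before the data reaches the store) is the part this sketch is meant to show.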
Next	Steps
Looking	Ahead	
Summary
● Scigraph is our latest LD platform; public version live in late 2016
● SW tech allows for scalable enterprise-level metadata management
● It is crucial to distinguish between data integration and (real-time) data delivery
● Still a work in progress… suggestions or feedback very welcome!
Ongoing Work
● Ontology: federated model, more advanced inferencing capabilities
● Build internal/external APIs (JSON-LD), also integrating NoSQL
● Tools for analytics, reporting, visualisation, interactive exploration of the graph
● Entity extraction: scientific entities, places, people, events etc.
● We’re looking to collaborate… Crossref, W3C, building a Linked Science Web
Future:	a	scientific	article	X-ray?
The	Knowledge	Graph	team
CORE TEAM
* Markus Kaindl: Product Owner
* Ben Kirkley: Project Manager
* Michele Pasin: Lead Data Architect
* Tony Hammond: Data Architect
* Matias Piipari: Lead Engineer
* Hilverd Reker: Software Engineer
* Artur Konczak: Software Engineer
* <blankNode>: Data Scientist
* <blankNode>: Data Engineer
DIGITAL SCIENCE
* Martin Szomszor: Data Scientist
* Richard Koks: Data Scientist
* Mario Diwersy: CTO, Uber Research
PROGRAM SPONSOR
* Henning Schoenenberger: Director Data & Metadata
Thanks	
michele.pasin@nature.com
