REUTERS / Danish Ismail
BUILDING A KNOWLEDGE GRAPH
DAN BENNETT - GRAPH DAY 2018
@nonodename
SEPTEMBER, 2018
AGENDA
• A little on TR
• What’s a knowledge graph?
• Quick reset on RDF - if needed
• Data engineering for our knowledge graph
• Lessons learned
• Q&A
A LITTLE ON THOMSON REUTERS
THOMSON REUTERS - THE ANSWER COMPANY
• Information, technology and expertise for professionals
• Focus on finance, risk, media, legal, tax and accounting markets
• 87% recurring revenue, 93% electronic, global footprint
• My role: big data & NLP within central technology group supporting market aligned business units
REUTERS/Amit Dave
CONTENT, NOT DATA
WHAT’S A KNOWLEDGE GRAPH?
WHAT IS A KNOWLEDGE GRAPH?
• Open world representation of information
• Every entry point is equal cost
• Underpin Cortana, Google Assistant, Siri, Alexa
• Typically (but doesn’t have to be) expressed in RDF
[Figure: example knowledge graph of a football match. Node types: Game, Team, Venue, Score. Values: England, Panama, 6-1, Nizhny Novgorod Stadium, Stones, 8. Edge labels: hasName, hasLogo, hasFinalScore, played, playedAt, hasQuarter, atTime, byPlayer.]
UH OH
QUICK RESET ON RDF
SCHEMA ON WRITE
• Fixed data model
• Slow to change
• Strong enforcement
SCHEMA ON READ
• Capture everything
• Apply logic (schema) on read
• No standards
RDF: SCHEMA ON READ, OPTIONAL ON WRITE
[Figure comparing Schema on Read, RDF and Schema on Write. Labels: Accuracy; Difficult & slow to change; Anything goes; Federated; Standards; (potentially) verbose; Triggers/Stored Procs/IDs; Referential integrity on write; Referential integrity on read; Super flexible; Capture everything; Flexible.]
HOW CAN THAT BE? (SIMPLIFIED!)
Relational

Orders
ID  Date         Amount  Customer
1   30-Aug-2016  56.84   1
2   31-Aug-2016  42.36   2
3   1-Sep-2016   98.45   1
4   1-Sep-2016   23.54   3

Customers
ID  Name
1   Barack Obama
2   Richard Nixon
3   Ronald Reagan
4   Bill Clinton

RDF

Subject                    Predicate                                        Object
http://tr.com/orders/1     http://ont.tr.com/orders/order_date              20160830
http://tr.com/orders/1     http://ont.tr.com/orders/order_amount            56.84
http://tr.com/orders/1     http://ont.tr.com/orders/order_customer          http://tr.com/customers/1
http://tr.com/orders/1     http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://ont.tr.com/order
http://tr.com/orders/2     http://ont.tr.com/orders/order_date              20160831
http://tr.com/orders/2     http://ont.tr.com/orders/order_amount            42.36
http://tr.com/orders/2     http://ont.tr.com/orders/order_customer          http://tr.com/customers/2
http://tr.com/orders/2     http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://ont.tr.com/order
…                          …                                                …
http://tr.com/customers/1  http://ont.tr.com/customers/name                 Barack Obama
http://tr.com/customers/1  http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://ont.tr.com/customer
http://tr.com/customers/2  http://ont.tr.com/customers/name                 Richard Nixon
http://tr.com/customers/2  http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://ont.tr.com/customer
…                          …                                                …
• URI = primary key
• New column = new rows
• Sparse if row missing
• Object a relation or literal
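The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not the conversion Thomson Reuters runs: the function name `rows_to_triples`, its parameters and the in-memory row format are assumptions made for the example, while the URIs are the ones from the slide.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def rows_to_triples(
    rows: List[Dict[str, object]],
    id_column: str,
    instance_ns: str,
    ontology_ns: str,
    class_uri: str,
    foreign_keys: Optional[Dict[str, str]] = None,
) -> List[Triple]:
    """Hypothetical helper: turn relational rows into (subject, predicate, object) triples."""
    foreign_keys = foreign_keys or {}
    triples: List[Triple] = []
    for row in rows:
        # URI = primary key
        subject = f"{instance_ns}{row[id_column]}"
        triples.append((subject, RDF_TYPE, class_uri))
        for column, value in row.items():
            if column == id_column or value is None:
                continue  # a missing cell simply produces no triple (sparse)
            predicate = f"{ontology_ns}{column}"
            if column in foreign_keys:
                # Object is a relation: point at the other entity's URI
                obj = f"{foreign_keys[column]}{value}"
            else:
                # Object is a literal
                obj = str(value)
            triples.append((subject, predicate, obj))
    return triples


orders = [
    {"id": 1, "order_date": "20160830", "order_amount": "56.84", "order_customer": 1},
    {"id": 2, "order_date": "20160831", "order_amount": None, "order_customer": 2},
]

order_triples = rows_to_triples(
    orders,
    id_column="id",
    instance_ns="http://tr.com/orders/",
    ontology_ns="http://ont.tr.com/orders/",
    class_uri="http://ont.tr.com/order",
    foreign_keys={"order_customer": "http://tr.com/customers/"},
)
```

The second order deliberately has no amount, so it yields three triples rather than four. Adding a column to the source table adds rows to the output and leaves its three-column shape unchanged.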
SCHEMA, QUERY & FEDERATION
Subject                                     Predicate                                        Object
http://tr.com/orders/1                      http://ont.tr.com/orders/order_date              20160830
http://tr.com/orders/1                      http://ont.tr.com/orders/order_amount            56.84
http://tr.com/orders/1                      http://ont.tr.com/orders/order_customer          http://tr.com/customers/1
http://tr.com/orders/1                      http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://ont.tr.com/order
http://tr.com/orders/1                      http://ont.salesforce.com/crm/customer_spend     9856.45
http://tr.com/customers/1                   http://www.w3.org/2002/07/owl#sameAs             http://en.wikipedia.org/wiki/Richard_Nixon
http://en.wikipedia.org/wiki/Richard_Nixon  http://owl.wikipedia.org/born                    19130109

Callouts:
• Federated data (spend from CRM): the customer_spend row
• Relation to external data: the owl#sameAs row
• Schema (Ontology): more than one can apply to a subject
• SPARQL - like SQL. Sum all orders:
SELECT (SUM(?amount) AS ?total)
WHERE {
  ?order <http://ont.tr.com/orders/order_amount> ?amount .
  ?order <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://ont.tr.com/order> .
}
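To make the query's logic concrete without a SPARQL engine, the same two triple patterns can be evaluated over an in-memory list of triples. This is only a sketch: `sum_order_amounts` is a hypothetical helper, the `quotes/9` row is invented to show the type filter at work, and a real store would do the matching for you.

```python
from decimal import Decimal
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ORDER_CLASS = "http://ont.tr.com/order"
ORDER_AMOUNT = "http://ont.tr.com/orders/order_amount"

triples = [
    ("http://tr.com/orders/1", ORDER_AMOUNT, "56.84"),
    ("http://tr.com/orders/1", RDF_TYPE, ORDER_CLASS),
    ("http://tr.com/orders/2", ORDER_AMOUNT, "42.36"),
    ("http://tr.com/orders/2", RDF_TYPE, ORDER_CLASS),
    # Invented row: has an amount but is not typed as an order, so it is ignored.
    ("http://tr.com/quotes/9", ORDER_AMOUNT, "10.00"),
]


def sum_order_amounts(data: Iterable[Triple]) -> Decimal:
    """Join the two patterns on ?order, then aggregate ?amount."""
    data = list(data)
    # Pattern 2: ?order rdf:type <http://ont.tr.com/order>
    orders = {s for s, p, o in data if p == RDF_TYPE and o == ORDER_CLASS}
    # Pattern 1: ?order order_amount ?amount, restricted to the subjects above
    amounts = (Decimal(o) for s, p, o in data if p == ORDER_AMOUNT and s in orders)
    return sum(amounts, Decimal("0"))


total = sum_order_amounts(triples)
```

Each triple pattern is a filter over the three columns, and the shared variable `?order` is the join key.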
WHY RDF FOR A KNOWLEDGE GRAPH?
                                   RDF DB                     Property Graph DB
Open                               Yes                        Maybe
Incremental Load                   Via named graph or SPARQL  Maybe
Federated Data                     Yes                        No
Modelling tools                    Yes                        Unlikely
Types/Classes/higher abstractions  Yes                        No
OUR ARCHITECTURE
PHYSICAL
[Diagram: physical architecture. Components: Sources (Relational, Proprietary; RDBMS, Web services, Full text, Filesystem); ETL (Snaplogic or Hadoop; Full text mining); RDF; Warehouse (CM-Well: RDF Store; Remote Read Replicas); Mart/Products (Neptune, Elastic, RDBMS, Filesystem). Connection labels: Pull, Push, Batch, REST Based Publishing, HTTPS, FTP/HTTP, JDBC.]
LOGICAL
As captured
• Mechanistic conversion
• Minimal validation
• Named graph for W3C provenance & update
Target Model
• “Canonical Graph”
• Curated ontologies
• Normalized representation
Selective
Product Models
• Slice & dice
• Store/retrieve using whatever works
• Not necessarily graph
[Diagram labels between the layers: SPARQL; SPARQL Triggers]
OUR GRAPH WAREHOUSE: CM-WELL
• NOT a triple store! Focus is on data movement
• No master node
• Linear scaling
• Stateless
• JVM isolation
• Query based subscription
• Logical replication
• Available on GitHub
[Diagram: HA Proxy in front of several identical CM-Well nodes, reached over REST/HTTP. Each CM-Well node contains Cassandra, Elastic, Kafka, a web server, background processing and user workload, under a health control layer. Additional label: Roaming Grid.]
POPULATING THE GRAPH
RELATIONAL
• Map primary keys into own namespace (or assign surrogate keys)
• Map dimensions to existing entities if possible
• Concentrate on the relations and attributes that matter
• Can always return to the source for details
<https://permid.org/1-4297089638>
    a tr-org:Organization ;
    tr-common:hasPermId "4297089638"^^xsd:string ;
    tr-org:isIncorporatedIn <http://sws.geonames.org/6252001/> ;
    fibo-be-le-cb:isDomiciledIn <http://sws.geonames.org/6252001/> ;
    vcard:hasURL <https://www.tesla.com/> ;
    vcard:organization-name "Tesla Inc"^^xsd:string .
<https://permid.org/1-34421840245>
    a tr-person:Person ;
    vcard:family-name "Musk"^^xsd:string ;
    vcard:given-name "Elon"^^xsd:string .
<https://permid.org/2-497b8953cd00ec12589126c0f1116e2ca8fb484b80722
    person:hasPositionType o:1-10010134 ;
    person:hasReportedTitle "Chairman of the Board" ;
    person:isPositionIn o:1-4297089638 .
Callouts: Surrogate Key; Relationship with properties; Existing ontologies; Existing dimension
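A relationship that carries its own properties (a person holding a position in an organization) has no natural primary key, which is where a surrogate key comes in. One way to assign one deterministically is to hash the parts that identify the relationship. The sketch below is an assumption for illustration only: the hashing scheme, the `surrogate_uri` helper and the `example.org` namespace are hypothetical and are not how permid.org identifiers are generated.

```python
import hashlib

# Hypothetical namespace, used only for this example.
POSITION_NS = "http://example.org/position/"


def surrogate_uri(namespace: str, *key_parts: str) -> str:
    """Derive a stable URI from the values that identify a relationship."""
    # The unit separator keeps ("ab", "c") and ("a", "bc") from colliding.
    canonical = "\x1f".join(key_parts)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{namespace}{digest}"


# Person, organization and reported title taken from the slide's example.
position_uri = surrogate_uri(
    POSITION_NS, "1-34421840245", "1-4297089638", "Chairman of the Board"
)
same_again = surrogate_uri(
    POSITION_NS, "1-34421840245", "1-4297089638", "Chairman of the Board"
)
other_title = surrogate_uri(
    POSITION_NS, "1-34421840245", "1-4297089638", "Chief Executive Officer"
)
```

Because the key is a pure function of its inputs, re-running the conversion produces the same subject URI instead of minting a new node each time.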
FULL TEXT
• Link to source (Retain confidence)
• Provenance in quad for updates
<https://data.tr.com/sc/4297089638_4295869694>
    a tr-sc:SupplyChainAgreement ;
    tr-sc:aggregateConfidence "0.9999976445274502"^^xs
    tr-sc:supplier <https://permid.org/1-4297089638> ;
    tr-sc:customer <https://permid.org/1-4295869694> .
<https://data.tr.com/sc/snippet/4297089638_429586969
    a tr-sc:Snippet ;
    tr-sc:snippetText "~~~Tesla~~~ is supplying electr
    tr-sc:confidence "0.999"^^xsd:float ;
    tr-sc:source "nL1N0IL11N-2013-10-31"^^xsd:string .
Callout: Article primary key
PROVENANCE IS INVALUABLE
• W3C Provenance applied by named graph:
• Can also be used to model bi-temporality if needed
• Example
Source A states: <S>, <P>, "O"
Source B states: <S>, <P>, <O>
Append unique named graph on load:
<S>, <P>, "O", <G1>
<S>, <P>, <O>, <G2>
<G1>, <prov:wasGeneratedBy>, "Snaplogic"
<G1>, <prov:wasDerivedFrom>, "Database source"
etc.
Graph URI could be hash of S,P,O or GUID, etc.
Consider idempotence and determinism
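The load step above can be sketched with a set of quads standing in for the store. Everything named here is an assumption for illustration: the `graph_uri` and `load` helpers, the `example.org` graph namespace, and the choice to include the source name in the hash (the slide suggests a hash of S,P,O or a GUID; adding the source keeps two sources that assert the same statement in separate graphs).

```python
import hashlib
from typing import Iterable, Optional, Set, Tuple

# (subject, predicate, object, graph); graph None stands for the default graph.
Quad = Tuple[str, str, str, Optional[str]]

PROV = "http://www.w3.org/ns/prov#"
GRAPH_BASE = "http://example.org/graph/"  # hypothetical namespace


def graph_uri(source: str, s: str, p: str, o: str) -> str:
    """Deterministic graph name: same source and statement, same URI."""
    payload = "\x1f".join((source, s, p, o)).encode("utf-8")
    return GRAPH_BASE + hashlib.sha256(payload).hexdigest()


def load(store: Set[Quad], source: str, tool: str,
         statements: Iterable[Tuple[str, str, str]]) -> None:
    """Add each statement in its own named graph, plus provenance about that graph."""
    for s, p, o in statements:
        g = graph_uri(source, s, p, o)
        store.add((s, p, o, g))
        store.add((g, PROV + "wasGeneratedBy", tool, None))
        store.add((g, PROV + "wasDerivedFrom", source, None))


SUBJECT = "http://tr.com/orders/1"
PREDICATE = "http://ont.tr.com/orders/order_amount"

store: Set[Quad] = set()
load(store, "Database source A", "Snaplogic", [(SUBJECT, PREDICATE, "56.84")])
load(store, "Database source B", "Snaplogic", [(SUBJECT, PREDICATE, "56.48")])
size_after_first_run = len(store)

# Re-running the same load must not grow the store.
load(store, "Database source A", "Snaplogic", [(SUBJECT, PREDICATE, "56.84")])
size_after_rerun = len(store)
```

A GUID per load would also give unique graphs, but every re-run would then add a fresh graph for statements already present, which is the situation the idempotence note warns about.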
MODELLING BI-TEMPORALITY
• Not inherently supported in RDF
• Possible solutions
• Ignore!
• Model for particular values (potentially using blank nodes)
• Model on named graph
• Reification
• Use RDF* & SPARQL* (Reification Done Right - only in BlazeGraph…)
Specific model approach
Organization: Apple
  Has Name → Name: "Apple Computer" (From: 1977-03-01, To: 2007-09-01)
  Has Name → Name: "Apple Inc" (From: 2007-09-01)
Named Graph Approach
org:2-xyz {
    org:1-4295905573 org:hasName "Apple Computer" .
}
org:2-xyz
    bt:effectiveFrom "1977-03-01"^^xsd:date ;
    bt:effectiveTo "2007-09-01"^^xsd:date .
Callout: Temporality on Named Graph
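Once the effective dates sit on the named graph, an "as of" lookup is a filter on those dates. The sketch below makes several assumptions: the `DatedGraph` type and `value_as_of` helper are hypothetical, the interval is treated as half-open (from inclusive, to exclusive), and only one time axis is covered. A fully bi-temporal model would carry a second pair of dates for when the statement was recorded.

```python
from datetime import date
from typing import List, NamedTuple, Optional, Tuple


class DatedGraph(NamedTuple):
    """One named graph holding a single triple, with its effective dates."""
    triple: Tuple[str, str, str]
    effective_from: date
    effective_to: Optional[date]  # None means still in effect


graphs = [
    DatedGraph(("org:1-4295905573", "org:hasName", "Apple Computer"),
               date(1977, 3, 1), date(2007, 9, 1)),
    DatedGraph(("org:1-4295905573", "org:hasName", "Apple Inc"),
               date(2007, 9, 1), None),
]


def value_as_of(data: List[DatedGraph], subject: str, predicate: str,
                when: date) -> Optional[str]:
    """Return the object in effect on `when`, or None if nothing applies."""
    for graph in data:
        s, p, o = graph.triple
        if (s, p) != (subject, predicate):
            continue
        started = graph.effective_from <= when
        not_ended = graph.effective_to is None or when < graph.effective_to
        if started and not_ended:
            return o
    return None


name_1990 = value_as_of(graphs, "org:1-4295905573", "org:hasName", date(1990, 1, 1))
name_2010 = value_as_of(graphs, "org:1-4295905573", "org:hasName", date(2010, 1, 1))
```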
LESSONS LEARNED
RDF IS DIFFERENT, IA IS KING
• Early education is key
• Strong information architecture really helps
• Modeling tools
• OWL invaluable, consider SHACL
Closed world on
top of open world
Open world
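"Closed world on top of open world" can be illustrated with a hand-rolled check: the triples stay open (anything may be missing), and a separate validation pass reports instances that lack a property the application requires. This is a plain Python stand-in for the kind of constraint SHACL expresses with a minimum count, not SHACL itself; `missing_required` and the `customers/3` row are assumptions for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
CUSTOMER = "http://ont.tr.com/customer"
NAME = "http://ont.tr.com/customers/name"


def missing_required(triples: Iterable[Triple], class_uri: str,
                     required: Iterable[str]) -> Dict[str, List[str]]:
    """Map each instance of class_uri to the required predicates it lacks."""
    triples = list(triples)
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == class_uri}
    seen: Dict[str, Set[str]] = {}
    for s, p, _ in triples:
        seen.setdefault(s, set()).add(p)
    report: Dict[str, List[str]] = {}
    for s in sorted(instances):
        missing = sorted(set(required) - seen[s])
        if missing:
            report[s] = missing
    return report


data = [
    ("http://tr.com/customers/1", RDF_TYPE, CUSTOMER),
    ("http://tr.com/customers/1", NAME, "Barack Obama"),
    # Typed as a customer but no name was captured: valid RDF, fails the check.
    ("http://tr.com/customers/3", RDF_TYPE, CUSTOMER),
]

violations = missing_required(data, CUSTOMER, [NAME])
```

The graph itself accepts the nameless customer; only the validation layer treats the absence as an error.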
MAPPING TO AUTHORITIES
Mapping approaches:
• Simple match
• Fuzzy match (Soundex, Levenshtein)
• Full text search
• Normalize then search/match
• Concordance (TAMR etc)
• Ensemble of the above
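A small ensemble of the first approaches might look like the following: normalize, try a simple match, then fall back to a Levenshtein fuzzy match. It is a sketch under stated assumptions: the suffix list, the distance threshold and the `best_match` helper are hypothetical, and the two authority entries reuse identifiers that appear elsewhere in this deck.

```python
import re
import unicodedata
from typing import Dict, Optional, Tuple

# Hypothetical list of legal-form suffixes to ignore when comparing names.
SUFFIXES = {"inc", "corp", "ltd", "co", "llc"}


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, drop legal-form suffixes."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(t for t in text.split() if t not in SUFFIXES)


def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming over two rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]


def best_match(name: str, authority: Dict[str, str],
               max_distance: int = 1) -> Optional[Tuple[str, str]]:
    """Return (identifier, method) for the closest authority entry, if any."""
    key = normalize(name)
    index = {normalize(label): ident for label, ident in authority.items()}
    if key in index:
        return index[key], "exact"
    scored = sorted((levenshtein(key, label), label) for label in index)
    if scored and scored[0][0] <= max_distance:
        return index[scored[0][1]], "fuzzy"
    return None


authority = {"Tesla Inc": "1-4297089638", "Apple Inc": "1-4295905573"}

exact = best_match("Tesla, Inc.", authority)
fuzzy = best_match("Aple Inc", authority)
unmatched = best_match("Panama", authority)
```

Returning the method alongside the identifier lets downstream steps treat a fuzzy hit with less confidence than an exact one.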
STILL BLEEDING EDGE
• …but now being used in real world solutions
• Have clear goals
• Be prepared to change direction & solutions
• Getting easier as vendor solutions increase and mature
DON’T OVERTHINK ETL
• Doesn’t have to be within Hadoop
• Does have to be repeatable
• Bend existing ETL tools to the task: treat RDF as a 3 column table
• An RDF REST API can be sufficient
• But
• Has to fit with overarching IA
• Need to accommodate idempotence & determinism (can’t be different named graph on each run)
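Treating RDF as a three-column table means an ordinary ETL job can emit subject, predicate and object columns, with a thin final step that serializes them. The sketch below writes N-Triples from CSV using only the standard library. It rests on assumptions: `to_ntriples` is a hypothetical helper, any object value starting with `http://` or `https://` is guessed to be a URI, and every other value becomes a plain string literal.

```python
import csv
import io
from typing import Iterable, Tuple


def to_ntriples(rows: Iterable[Tuple[str, str, str]]) -> str:
    """Serialize (subject, predicate, object) rows as sorted, de-duplicated N-Triples."""
    lines = set()
    for subject, predicate, obj in rows:
        if obj.startswith(("http://", "https://")):
            rendered = f"<{obj}>"  # heuristic: looks like a URI
        else:
            escaped = (
                obj.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r")
            )
            rendered = f'"{escaped}"'
        lines.add(f"<{subject}> <{predicate}> {rendered} .")
    # Sorting makes the output independent of input order.
    return "\n".join(sorted(lines)) + "\n"


table = """subject,predicate,object
http://tr.com/orders/1,http://ont.tr.com/orders/order_amount,56.84
http://tr.com/orders/1,http://ont.tr.com/orders/order_customer,http://tr.com/customers/1
"""

rows = [
    (r["subject"], r["predicate"], r["object"])
    for r in csv.DictReader(io.StringIO(table))
]
ntriples = to_ntriples(rows)
```

Because the output is sorted and de-duplicated, the same input in any order yields byte-identical text, which makes repeat runs easy to compare.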
Dan Bennett
@nonodename
dan.bennett <at> tr.com
https://github.com/thomsonreuters/cm-well
permid.org
QUESTIONS?
