REUTERS / Danish Ismail
BUILDING A KNOWLEDGE GRAPH
DAN BENNETT
@nonodename
JANUARY, 2018
AGENDA
• A little on TR
• What’s a knowledge graph?
• Quick reset on RDF
• Logical & Physical architecture for our knowledge graph
• Lessons learned
• Q&A
A LITTLE ON THOMSON REUTERS
THOMSON REUTERS - THE ANSWER COMPANY
• Information, technology and expertise for professionals
• Focus on finance, risk, media, legal, tax and accounting markets
• 87% recurring revenue, 93% electronic, global footprint
• My role: big data & NLP within central technology group supporting market-aligned business units
REUTERS/Amit Dave
CONTENT, NOT DATA
WHAT’S A KNOWLEDGE GRAPH?
WHAT IS A KNOWLEDGE GRAPH?
• Open world representation of information
• Every entry point is equal cost
• Underpin Cortana, Google Assistant, Siri, Alexa
• Typically (but doesn’t have to be) expressed in RDF
• No longer a solution in search of a problem!
[Figure: example knowledge graph of a football game. Node types: Team (twice), Game, Venue, Quarter. Values shown: Eagles, Vikings, Lincoln Financial Field, 7-38, 7-71. Edge labels: hasName, hasLogo, hasFinalScore, played, playedAt, hasQuarter, hasPeriod, hasScore.]
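The "every entry point is equal cost" idea can be sketched with a small in-memory triple list. This is an illustration only: the node identifiers (`team:eagles`, `game:1`, `venue:linc`) and the exact edges are assumptions loosely based on the figure's labels, not the speaker's data.

```python
from collections import defaultdict

# Hypothetical triples; the identifiers and the edges are assumptions.
TRIPLES = [
    ("team:eagles", "hasName", "Eagles"),
    ("team:vikings", "hasName", "Vikings"),
    ("team:eagles", "played", "game:1"),
    ("team:vikings", "played", "game:1"),
    ("game:1", "hasFinalScore", "7-38"),
    ("game:1", "playedAt", "venue:linc"),
    ("venue:linc", "hasName", "Lincoln Financial Field"),
]


def build_indexes(triples):
    """Index every triple by subject and by object, so lookups work from either end."""
    by_subject = defaultdict(list)
    by_object = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((p, o))
        by_object[o].append((s, p))
    return by_subject, by_object


def neighbours(node, by_subject, by_object):
    """Return every node one hop from `node`, following edges in either direction."""
    outgoing = {o for _, o in by_subject.get(node, [])}
    incoming = {s for s, _ in by_object.get(node, [])}
    return outgoing | incoming


by_s, by_o = build_indexes(TRIPLES)
# Starting from the venue costs the same single lookup as starting from a team.
from_venue = neighbours("venue:linc", by_s, by_o)
from_team = neighbours("team:eagles", by_s, by_o)
```

Because both indexes are built from the same triples, no node is a privileged "root": a query can start at the venue, the game or a team.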
WHY WOULD YOU BUILD A KNOWLEDGE GRAPH?
• Underpin new product features
• Organize, manage and discover internal data
• To sell!
QUICK RESET ON RDF
SCHEMA ON WRITE
• Fixed data model
• Slow to change
• Strong enforcement
SCHEMA ON READ
• Capture everything
• Apply logic (schema) on read
• No standards
RDF: SCHEMA ON READ, OPTIONAL ON WRITE
[Diagram: RDF positioned between Schema on Read and Schema on Write. Labels shown: Accuracy; Difficult & slow to change; Anything goes; Federated; Standards; (potentially) verbose; Triggers/Stored Procs/IDs; Referential integrity on write; Referential integrity on read; Super flexible; Capture everything; Flexible.]
HOW CAN THAT BE? (SIMPLIFIED!)
Relational

Orders
ID  Date         Amount  Customer
1   30-Aug-2016  56.84   1
2   31-Aug-2016  42.36   2
3   1-Sep-2016   98.45   1
4   1-Sep-2016   23.54   3

Customers
ID  Name
1   Barack Obama
2   Richard Nixon
3   Ronald Reagan
4   Bill Clinton

RDF

Subject                    Predicate                                        Object
http://tr.com/orders/1     http://ont.tr.com/orders/order_date              20160830
http://tr.com/orders/1     http://ont.tr.com/orders/order_amount            56.84
http://tr.com/orders/1     http://ont.tr.com/orders/order_customer          http://tr.com/customers/1
http://tr.com/orders/1     http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://ont.tr.com/order
http://tr.com/orders/2     http://ont.tr.com/orders/order_date              20160831
http://tr.com/orders/2     http://ont.tr.com/orders/order_amount            42.36
http://tr.com/orders/2     http://ont.tr.com/orders/order_customer          http://tr.com/customers/2
http://tr.com/orders/2     http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://ont.tr.com/order
…                          …                                                …
http://tr.com/customers/1  http://ont.tr.com/customers/name                 Barack Obama
http://tr.com/customers/1  http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://ont.tr.com/customer
http://tr.com/customers/2  http://ont.tr.com/customers/name                 Richard Nixon
http://tr.com/customers/2  http://www.w3.org/1999/02/22-rdf-syntax-ns#type  http://ont.tr.com/customer
…                          …                                                …

• URI = primary key
• New column = new rows
• Sparse if row missing
• Object a relation or literal
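The relational-to-RDF mapping on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used at TR: the function `row_to_triples`, its parameters, and the extra `order_currency` column are hypothetical, and the URIs simply follow the pattern shown in the slide.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def row_to_triples(row, base, ont, type_uri, key="ID", links=None):
    """Turn one relational row (a dict) into (subject, predicate, object) triples.

    `links` maps a foreign-key column to the base URI of the table it references.
    Columns whose value is None produce no triple at all (the "sparse" case).
    """
    links = links or {}
    subject = f"{base}/{row[key]}"  # URI = primary key
    triples = [(subject, RDF_TYPE, type_uri)]
    for column, value in row.items():
        if column == key or value is None:
            continue
        predicate = f"{ont}/{column}"
        # The object is either a relation (another URI) or a literal.
        obj = f"{links[column]}/{value}" if column in links else value
        triples.append((subject, predicate, obj))
    return triples


ORDER_ARGS = dict(
    base="http://tr.com/orders",
    ont="http://ont.tr.com/orders",
    type_uri="http://ont.tr.com/order",
    links={"order_customer": "http://tr.com/customers"},
)

order = {"ID": 1, "order_date": "20160830", "order_amount": 56.84, "order_customer": 1}
triples = row_to_triples(order, **ORDER_ARGS)

# A new "column" needs no schema change: it only adds one more row (triple).
wider = dict(order, order_currency="USD")  # order_currency is a made-up column
wider_triples = row_to_triples(wider, **ORDER_ARGS)
```

The first call yields the four triples listed for order 1; the second yields the same four plus one more, which is the "new column = new rows" point.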
OUR ARCHITECTURE
PHYSICAL
[Diagram: physical architecture. Components shown: Sources (Relational, Proprietary); Snaplogic; Web Services; Hadoop/Spark; CM-Well: RDF Store; Warehouse; Mart/Products; Remote Read Replicas; Elastic; Neptune; Relational (twice); File System (twice). Flow labels: Sync Pull; Async Feed; Batch Compute; REST Based Publishing; REST; FTP/HTTP; JDBC/Sqoop; ETL or Direct Publish.]
LOGICAL
As captured
• Mechanistic conversion
• Minimal validation
• Named graph for W3C provenance
Target Model
• “Canonical Graph”
• Curated ontologies
• Normalized representation
Product Models
• Slice & dice
• Store/retrieve using whatever works
• Not necessarily graph
[Diagram labels between the stages: SPARQL Triggers (twice), Selective replication]
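The step from "as captured" to the target model can be pictured as a predicate mapping plus light normalization. The sketch below is plain Python rather than a SPARQL trigger, and everything in it is an assumption made for illustration: the `PREDICATE_MAP` table, the `http://example.org/canonical/` namespace and the date rewrite are not taken from the talk.

```python
# Hypothetical mapping from source-specific predicates to a curated ontology.
PREDICATE_MAP = {
    "http://ont.tr.com/orders/order_amount": "http://example.org/canonical/amount",
    "http://ont.tr.com/orders/order_date": "http://example.org/canonical/date",
}


def normalize_date(value):
    """Rewrite a YYYYMMDD literal as ISO 8601 (YYYY-MM-DD)."""
    return f"{value[:4]}-{value[4:6]}-{value[6:]}"


def to_target_model(captured):
    """Map as-captured triples onto the canonical predicates.

    Triples with unmapped predicates are skipped here; they remain available
    in the as-captured layer.
    """
    canonical = []
    for s, p, o in captured:
        if p not in PREDICATE_MAP:
            continue
        new_p = PREDICATE_MAP[p]
        if new_p.endswith("/date"):
            o = normalize_date(o)
        canonical.append((s, new_p, o))
    return canonical


captured = [
    ("http://tr.com/orders/1", "http://ont.tr.com/orders/order_date", "20160830"),
    ("http://tr.com/orders/1", "http://ont.tr.com/orders/internal_flag", "x"),
]
canonical = to_target_model(captured)
```

Keeping the mechanistic conversion separate from this curated step means the mapping can change later without re-extracting from the sources.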
CM-WELL - OUR OPEN SOURCE GRAPH WAREHOUSE
[Diagram: an HA Proxy and a row of CM-Well Nodes (three drawn, then "…"), connected over REST/HTTP. Each node is labelled with Cassandra, Elastic, Jena and Web Server, and with two areas: "Background Processing & Health" and "User workload". A Roaming Grid Manager is also shown.]
• NOT a triple store!
• No master node
• Linear scaling
• Stateless
• JVM isolation
• Query based subscription
• Logical replication
• Available on GitHub
LESSONS LEARNED
RDF IS DIFFERENT, IA IS KING
• Early education is key
• Strong information architecture really helps
• Modeling tools
• OWL invaluable, consider SHACL
AUTHORITIES REALLY HELP
PROVENANCE IS INVALUABLE
• Provenance applied by named graph:
Source A states: <S>, <P>, “O”
Source B states: <S>, <P>, <O>
Append unique named graph on load:
<S>, <P>, “O”, <G1>
<S>, <P>, <O>, <G2>
<G1>, <prov:wasGeneratedBy>, “ETL Tool”
<G1>, <prov:wasDerivedFrom>, “Database source”
etc.
Graph URI could be hash of S,P,O or GUID, etc.
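A rough sketch of the load step described here: each incoming triple gets a named graph, and the provenance statements are attached to that graph. The helper names, the `http://example.org/graph/` base and the use of SHA-256 are assumptions; the slide only says the graph URI could be a hash of S, P, O or a GUID. The provenance objects are plain strings, as on the slide.

```python
import hashlib

PROV = "http://www.w3.org/ns/prov#"  # W3C PROV-O namespace


def graph_uri(s, p, o, base="http://example.org/graph/"):
    """Derive a graph URI from a hash of subject, predicate and object."""
    digest = hashlib.sha256("\x1f".join((s, p, o)).encode("utf-8")).hexdigest()
    return base + digest


def load_with_provenance(triples, generated_by, derived_from):
    """Return (quads, provenance triples) for a batch of incoming triples."""
    quads, prov = [], []
    for s, p, o in triples:
        g = graph_uri(s, p, o)
        quads.append((s, p, o, g))  # <S>, <P>, <O>, <G>
        prov.append((g, PROV + "wasGeneratedBy", generated_by))
        prov.append((g, PROV + "wasDerivedFrom", derived_from))
    return quads, prov


quads, prov = load_with_provenance(
    [("http://tr.com/orders/1", "http://ont.tr.com/orders/order_amount", "56.84")],
    generated_by="ETL Tool",
    derived_from="Database source",
)
```

Because the provenance hangs off the graph rather than the triple, two sources asserting different objects for the same subject and predicate can coexist and still be told apart.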
STILL BLEEDING EDGE
• Have clear goals
• Be prepared to change direction & solutions
• Getting easier as vendor solutions increase and mature
DON’T OVERTHINK ETL
• Doesn’t have to be within Hadoop
• Does have to be repeatable
• An RDF REST API can be sufficient
• But
  • Has to fit with overarching IA
  • Need to accommodate idempotence & determinism (can’t be different named graph on each run)
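The idempotence point can be illustrated by comparing a deterministic graph name with a random one across repeated runs. This is a toy model: the store is a plain dict, and `deterministic_graph`, `random_graph`, `load`, the `orders_db` source name and the `http://example.org/graph/` base are all hypothetical.

```python
import hashlib
import uuid


def deterministic_graph(source, record_key, base="http://example.org/graph/"):
    """Same source and record key always give the same graph URI."""
    name = f"{source}|{record_key}"
    return base + hashlib.sha256(name.encode("utf-8")).hexdigest()[:16]


def random_graph(base="http://example.org/graph/"):
    """A fresh GUID per call: a different graph URI on every run."""
    return base + uuid.uuid4().hex


def load(store, graph, triples):
    """Replace the contents of `graph` in `store` (dict: graph URI -> set of triples)."""
    store[graph] = set(triples)


triples = [("http://tr.com/orders/1", "http://ont.tr.com/orders/order_amount", "56.84")]

# Re-running the same ETL three times with deterministic names overwrites in place.
idempotent_store = {}
for _ in range(3):
    load(idempotent_store, deterministic_graph("orders_db", "orders/1"), triples)

# With a random name per run, every re-run leaves another copy behind.
drifting_store = {}
for _ in range(3):
    load(drifting_store, random_graph(), triples)
```

After three runs the first store holds one graph and the second holds three copies of the same data, which is the failure mode the slide warns about.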
Dan Bennett
@nonodename
dan.bennett <at> tr.com
https://github.com/thomsonreuters/cm-well
permid.org
QUESTIONS?
