Data Science meets Linked Data
Alasdair J G Gray
http://www.alasdairjggray.co.uk
@gray_alasdair
A.J.G.Gray@hw.ac.uk
SICSA Data Science Theme Launch
3 July 2014
BBC World Cup
BBC Linked Data Platform
Olympics 2012
Linking Data
Linked Data Principles
1. Global ID – URI
2. Resolvable ID
3. Useful content
   HTML for humans
   RDF for machines
4. Link to other resources
Like the Web, but for data!
“RDF and OWL do not
solve the interoperability
problem, they just lay it
bare on the table!”
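As a rough illustration of the four principles, the sketch below models a resolvable identifier that returns HTML for humans or RDF for machines depending on the requested media type. The `example.org` URIs, the `RESOURCES` table and the `resolve` function are hypothetical stand-ins for a real HTTP server doing content negotiation; none of them come from the talk.

```python
# Hypothetical in-memory "server": each global ID (URI) maps to the
# representations it can be resolved to. All URIs are placeholders.
RESOURCES = {
    "http://example.org/id/player/1": {
        # Useful content for humans
        "text/html": "<h1>Example Player</h1>",
        # Useful content for machines: one triple linking to another resource
        "text/turtle": (
            "<http://example.org/id/player/1> "
            "<http://example.org/vocab/playsFor> "
            "<http://example.org/id/team/1> ."
        ),
    },
}


def resolve(uri, accept):
    """Return the representation of `uri` for the requested media type."""
    representations = RESOURCES.get(uri)
    if representations is None:
        raise KeyError(f"unknown URI: {uri}")
    if accept not in representations:
        raise ValueError(f"no {accept} representation for {uri}")
    return representations[accept]


print(resolve("http://example.org/id/player/1", "text/html"))
print(resolve("http://example.org/id/player/1", "text/turtle"))
```

The same identifier serves both audiences, and the RDF representation carries a link to a further resource that can be resolved in turn.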
Challenge 1: Matching
Administrative Data Research Centre - Scotland | Alasdair J G Gray | 3 July 2014
John Grant
Fisherman
Fiona Sinclair
Ian Grant
Smithy
Born: 1861
Stuart Adam
Wheelwright
Morag Scott
Flora Adam
Seamstress
Born: 1866
Married: 1884
John Grant
Farmer
Fiona Grant
Iain Grant
Born: 1860
Messy data
Probabilistic
matches
Schema matching
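A probabilistic match between two messy records could be scored along the following lines. This is a minimal sketch, not the method used in the project: the record fields, the 0.7/0.3 weighting and the five-year tolerance are all assumptions chosen for illustration, and only the names and birth years are taken from the slide.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def year_similarity(y1, y2, tolerance=5):
    """1.0 for identical years, falling linearly to 0.0 at `tolerance` years apart."""
    return max(0.0, 1.0 - abs(y1 - y2) / tolerance)


def match_score(rec_a, rec_b, name_weight=0.7):
    """Weighted combination of name and birth-year similarity (weights are arbitrary)."""
    name = name_similarity(rec_a["name"], rec_b["name"])
    year = year_similarity(rec_a["born"], rec_b["born"])
    return name_weight * name + (1.0 - name_weight) * year


ian = {"name": "Ian Grant", "born": 1861}
iain = {"name": "Iain Grant", "born": 1860}
flora = {"name": "Flora Adam", "born": 1866}

print(match_score(ian, iain))   # the higher score: plausibly the same person
print(match_score(ian, flora))  # the lower score: plausibly different people
```

A real linkage pipeline would compare more fields (parents, occupation, place) and would choose thresholds from labelled data rather than by hand.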
Gleevec® = Imatinib Mesylate
Drugbank ChemSpider PubChem
Imatinib
Mesylate Imatinib Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
Challenge 2: Reusing mappings
Link: skos:closeMatch
Reason: non-salt form
Link: skos:exactMatch
Reason: drug name
Link: owl:sameAs
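One way to make mappings reusable is to store each link together with its predicate and the reason it was asserted, so that a consumer can choose which kinds of equivalence to accept. The sketch below assumes a hypothetical `Link` record and `select_links` helper; the URIs are placeholders, the predicate and reason pairs follow the slide, and the reason for the `owl:sameAs` link is not given there, so it is marked as unspecified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """A single mapping between two data instances (hypothetical structure)."""
    source: str
    target: str
    predicate: str
    reason: str


# Placeholder URIs; predicate/reason pairs as listed on the slide.
LINKS = [
    Link("http://example.org/a/imatinib", "http://example.org/b/imatinib-mesylate",
         "skos:closeMatch", "non-salt form"),
    Link("http://example.org/a/gleevec", "http://example.org/b/imatinib-mesylate",
         "skos:exactMatch", "drug name"),
    Link("http://example.org/a/imatinib", "http://example.org/c/imatinib",
         "owl:sameAs", "unspecified"),
]


def select_links(links, accepted_predicates):
    """Keep only the links whose predicate the consumer is willing to trust."""
    return [link for link in links if link.predicate in accepted_predicates]


strict = select_links(LINKS, {"owl:sameAs", "skos:exactMatch"})
relaxed = select_links(LINKS, {"owl:sameAs", "skos:exactMatch", "skos:closeMatch"})

print(len(strict), len(relaxed))
```

Keeping the reason alongside the predicate lets two users apply different views over the same set of links without re-deriving them.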
Challenge: Multiple Identities
Andy Law's Third Law
“The number of unique identifiers
assigned to an individual is never
less than the number of Institutions
involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
P12047
X31045
GB:29384
http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1642
https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL1642
Challenge: Open Data Licenses
5★ of linked data
• Licenses: who can reuse the data
• Interoperability of licenses
• Non-commercial: academic use, teaching, industry
Challenges: Privacy
Challenge: Query Performance
Response time
Data freshness
Reliability
Volume of requests
Hosting resources
In Data we Trust
How can we trust the data we’ve got back?
How can we ensure that it hasn’t been tampered with on the way?
Trusty URIs
http://www.intelsat.com/wp-content/uploads/2014/03/Red-padlock.jpg
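The idea behind Trusty URIs is that the identifier itself carries a cryptographic hash of the content it names, so a consumer can check that what came back is what was referenced. The sketch below shows only that general idea under simplifying assumptions: the `make_hashed_uri` and `verify` helpers, the `.`-separated suffix and the `example.org` URI are hypothetical and do not reproduce the exact encoding defined by the Trusty URI specification.

```python
import base64
import hashlib


def content_hash(content: bytes) -> str:
    """URL-safe base64 of the SHA-256 digest, without padding."""
    digest = hashlib.sha256(content).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def make_hashed_uri(base_uri: str, content: bytes) -> str:
    """Append the content hash to a base URI (simplified, illustrative format)."""
    return f"{base_uri}.{content_hash(content)}"


def verify(uri: str, content: bytes) -> bool:
    """Check that the hash embedded in `uri` matches the retrieved content."""
    expected = uri.rsplit(".", 1)[-1]
    return expected == content_hash(content)


data = b"<http://example.org/s> <http://example.org/p> <http://example.org/o> ."
uri = make_hashed_uri("http://example.org/data/graph1", data)

print(verify(uri, data))                # content is unchanged
print(verify(uri, data + b" tampered"))  # content was modified in transit
```

Because the hash is part of the identifier, any link to the data also pins down its exact content, independent of the server that delivers it.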
Contact Details
www.alasdairjggray.co.uk
A.J.G.Gray@hw.ac.uk
@gray_alasdair
“There is lots of data we all use every day, and it’s not part of the web. I
can see my bank statements on the web, and my photographs, and I can
see my appointments in a calendar. But can I see my photos in a calendar
to see what I was doing when I took them? Can I see bank statement lines
in a calendar?
No. Why not? Because we don’t have a web of data. Because data is
controlled by applications and each application keeps it to itself.”
Tim Berners-Lee

Editor's Notes

  • #3: Many of you will have visited this site recently. Lots of sport coverage: how do the BBC cope within their resources?
  • #4: 700+ pages on teams, groups and players. Minimal journalist involvement. Automatic aggregation and links to relevant stories. An article is tagged with Frank Lampard; inference is used to link it to his team, group, etc.
  • #5: Coverage of 10,000+ athletes, 200+ countries, 400-500 disciplines and 30 venues. A page for every athlete and country, drawing on open data.
  • #6: Internally DBpedia and GeoNames
  • #7: Linked Data has been hugely successful since its inception in 2009. About 300 datasets published. Wide range of topics.
  • #8: Familiar with birth, marriage and death records. Aligning individuals is hard. This also applies to schema matching.
  • #9: The data’s been aligned, now what? Example drug: Gleevec, a cancer drug for leukemia. Look it up in three popular public chemical databases: different results. Data is messy!
  • #10: sameAs != sameAs: it depends on your point of view. Links relate individual data instances: source, target, predicate, reason. Links are grouped into Linksets, which have a VoID header providing provenance and justification for the links. Links need provenance to enable reuse – James’s talk
  • #11: Each captures a subtly different view of the world. Are they the same? … It depends on your point of view. Different URIs for different representations (content negotiation).
  • #13: Not all data should be open. Consider your interaction with the health service: it’s unique to you. Need statistical aggregation to anonymise data. As much about educating the public – public relations.