The FAIR data
movement and
data protection
Alan Morrison
BrightTALK Data Protection for
the Digital Enterprise Summit
Presented on Feb. 22, 2023
1
Topics covered in this talk:
● Problem: Data sprawl and opacity
● Solution: FAIR data architecture
● The FAIR data movement
● Semantics and data-centric architecture
● Decentralization and identity data
● Final thoughts
2
Problem: Data sprawl and opacity
3
AI’s data/knowledge problem: Provincial IT legacy infrastructure
4
● Thousands of databases per enterprise (siloing)
● Thousands of applications (code sprawl)
● Data models buried in the app code
● Every app a special snowflake with its own data model
Related problem: Data duplication at unprecedented scale
“Machine learning is on track to consume all the energy being supplied, a model that is costly, inefficient, and unsustainable,”
– Brian Bailey in Semiconductor Engineering, November 2022
Conclusions from recent research:
Machine learning inefficiency worsening
● Lack of generalization and context in machine learning training sets leads to huge amounts of near-duplicate data.
Legacy application-centric architectures strand and duplicate large swaths of both code and data
● If we didn’t have to duplicate any data whatsoever, the world would only have to generate 11 percent of the data it
currently does.
● IDC predicts that the world will generate 178 zettabytes of data annually by 2025. At that pace, “The Yottabyte Era” would
succeed The Zettabyte Era by 2030, if not earlier.
● CMSwire reported in 2020 that storing two yottabytes would cost $58 trillion. If the cost per byte stored stayed constant,
40 percent of the world’s economic output would be consumed in 2035 by just storing data.
5
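The storage-cost bullets above can be checked with simple arithmetic. The sketch below takes the slide's own figures (two yottabytes for $58 trillion, 178 zettabytes per year, 11 percent non-duplicated data) and assumes decimal units (1 YB = 1,000 ZB); the function and constant names are illustrative and not from the talk.

```python
# Back-of-the-envelope check of the storage-cost bullets above.
# Assumptions: decimal units (1 yottabyte = 1,000 zettabytes) and a constant
# cost per byte, as the slide states. Names are illustrative only.

ZB_PER_YB = 1_000

COST_FOR_TWO_YB_USD = 58e12   # CMSwire figure quoted on the slide
ANNUAL_DATA_2025_ZB = 178     # IDC forecast quoted on the slide


def cost_per_zettabyte_usd() -> float:
    """Implied cost of storing one zettabyte at the quoted rate."""
    return COST_FOR_TWO_YB_USD / (2 * ZB_PER_YB)


def storage_cost_usd(zettabytes: float) -> float:
    """Cost of storing a given volume if the cost per byte stays constant."""
    return zettabytes * cost_per_zettabyte_usd()


def volume_without_duplication_zb(zettabytes: float, unique_share: float = 0.11) -> float:
    """Volume left if only the non-duplicated share had to be generated."""
    return zettabytes * unique_share


if __name__ == "__main__":
    print(f"Cost per ZB: ${cost_per_zettabyte_usd():,.0f}")
    print(f"Cost to store 178 ZB: ${storage_cost_usd(ANNUAL_DATA_2025_ZB):,.0f}")
    print(f"178 ZB without duplication: "
          f"{volume_without_duplication_zb(ANNUAL_DATA_2025_ZB):.2f} ZB")
```

At the quoted rate one zettabyte costs $29 billion to store, so a single year's 178 ZB would cost roughly $5.2 trillion, while the non-duplicated 11 percent would amount to under 20 ZB.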
Solution: Focus on FAIR data architecture and sovereignty for protection
and compliance
● Garbage in, garbage out still applies. Boost the quality of your data input by adopting FAIR
principles to streamline and scale operations
● Shrink your data risk footprint by supporting self-sovereign identity
○ Support decentralized identity (W3C DID standard)
○ Keep correlatable PII as data at rest on the device, and do matching on-device
● Embrace data-centric architecture and FAIR to avoid creating orphan data that’s
○ Siloed
○ Not self-describing
○ Not connected
○ Not generated for reusability
● Consider ontology-driven and semantic digital twin development
○ Applications are written to use the description or relationship logic the graph describes, so 10x less code is needed
○ Ontologies (semantic metadata) provide logical connections and context that allow reuse and thereby reduce
the need for duplication
○ FAIR twins and agents can be a means of managing at scale
6
Rationalization of data-related departments and
economies of scale
7
“Data management” (structured data,
mostly)
Knowledge management (internally
shared)
Content management (externally
shared)
Learning management (internal
coursework)
FAIR data and
associated
description
logic
The FAIR data movement
8
What is FAIR data?
9
When it comes to data, FAIR stands for findable, accessible, interoperable, and
reusable.
FAIR data is data users can rely on more than once, for more than one purpose.
For humans to develop, find and use FAIR data, machines have to help them.
Machine-readable context helps to disambiguate data, enable data abstractions
and automate processes. Ontologies (semantic graph data models) articulate
those contexts.
Adoption of FAIR data principles would be a major step towards what DARPA calls
the Third Wave of AI: Contextual computing.
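As a rough illustration of what "machine-readable context" can mean in practice, the sketch below checks whether a dataset description carries a few fields per FAIR principle. The field names and the four checks are assumptions made for this example; they are not an official FAIR metric or schema.

```python
# Illustrative only: a toy check that a dataset description carries the kind of
# machine-readable context the FAIR principles call for. The field names and
# the checks are assumptions for this sketch, not an official FAIR metric.

REQUIRED_FIELDS = {
    "findable": ("identifier", "title"),
    "accessible": ("access_url",),
    "interoperable": ("vocabulary",),
    "reusable": ("license", "provenance"),
}


def fair_gaps(record: dict) -> dict:
    """Return, per FAIR principle, the fields that are missing or empty."""
    gaps = {}
    for principle, fields in REQUIRED_FIELDS.items():
        missing = [name for name in fields if not record.get(name)]
        if missing:
            gaps[principle] = missing
    return gaps


example_record = {
    "identifier": "https://example.org/dataset/42",         # hypothetical persistent ID
    "title": "Sensor readings, plant A",
    "access_url": "https://example.org/dataset/42/download",
    "vocabulary": "https://example.org/ontology/sensors",   # hypothetical ontology
    "license": "CC-BY-4.0",
    # "provenance" is deliberately left out to show a gap
}

if __name__ == "__main__":
    print(fair_gaps(example_record))
```

A machine can run a check like this across thousands of descriptions, which is the sense in which machines have to help humans find and reuse data.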
The three waves of AI
10
Semantics is the path to FAIR, smart, siloless data sharing
11
James Kobielus, 2016
Association of European Libraries, 2017
Compare FAIR and TRUST principles
12
Lin, D., Crabtree, J., Dillo, I. et al. The TRUST Principles for digital repositories. Sci Data 7, 144 (2020).
https://doi.org/10.1038/s41597-020-0486-7
FAIR data leads to TRUSTed data
repositories.
Who’s behind the FAIR data movement? Big pharma, for
one.
“From 2023, drug submissions to the European Medicines Agency (EMA) must
comply with select Identification of Medicinal Products (IDMP) standards. By
developing an IDMP-compliant ontology with machine-ready data, the Alliance will
support the move to automate this process, improving efficiency and patient
safety, reducing costs and time burden, and driving innovation in the drug
development pipeline.
“The project is managed by the Pistoia Alliance, with a project team of
experts from Bayer, Novartis, Roche, Merck KGaA, and GSK.”
13
–Erik Schultes, et al., “FAIR Digital Twins for Data-Intensive Research,”
Perspective article, Front. Big Data, Sec. Data Science, Volume 5, 11 May 2022,
https://doi.org/10.3389/fdata.2022.883341
Semantics and data-centric
architecture
14
Humans-in-the-loop = second-order cybernetics:
Involving users and SMEs to create context with the help
of machines
15
First order
(Engineer
outside box)
Second order
(Users and
domain
experts inside
box)
Stewart Brand, et al., CoEvolution Quarterly, 1976
Semantics includes the
meaning humans and
machines create together.
Feedback loops provide a
means of iteration and
incremental refinement.
Terpsichore: Human-in-the-loop semantic data lifecycle for
urban heritage/smart cities
16
An iterative, bottom-up,
user-driven process:
● User engagement
● Collection
● Digestion
● Semantic
classification
● Automated
suggestion loops
Results:
● Enrichment of
useful data
collections
● Improved dialogue
between user
communities
Artopoulos, Giorgos & Smaniotto
Costa, Carlos. (2019). Data-Driven
Processes in Participatory Urbanism:
The “Smartness” of Historical Cities.
Architecture and Culture. 7. 1-19.
10.1080/20507828.2019.1631061.
An effective data model describes and unifies the contexts necessary for true data
integration. It gives machines enough clues to detect and discover layered context.
“What is data integration?
Let's start with a short list of what data integration is not:
● It's not shoveling data around between systems.
● It's not calling an API.
● It's not creating a data connection to a source system.
It can include one or more of the jobs in the list here above, but what is the ingredient
that cannot be missing?
It's connecting data from different source systems together in a consistent and
coherent data model.”
–Wouter Trappers, CDAO
What’s a data model? What is data integration?
17
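The distinction in the quote above (moving data versus connecting it in one coherent model) can be sketched in a few lines. The two source systems, their field names and the shared `Customer` model below are all hypothetical, chosen only to illustrate the idea.

```python
# Sketch: two hypothetical source systems describe the same customer with
# different field names. Integration here means mapping both into one shared
# model keyed on a common identifier, not just copying rows between systems.
from dataclasses import dataclass, field


@dataclass
class Customer:
    """The shared, consistent data model both sources are mapped into."""
    customer_id: str
    name: str = ""
    emails: set = field(default_factory=set)


def from_crm(row: dict) -> Customer:
    """Map a row from the (hypothetical) CRM system into the shared model."""
    return Customer(row["cust_no"], row["full_name"], {row["email"].lower()})


def from_billing(row: dict) -> Customer:
    """Map a row from the (hypothetical) billing system into the shared model."""
    return Customer(row["customerId"], row["name"], {row["contact_email"].lower()})


def integrate(records: list) -> dict:
    """Merge records that refer to the same customer into one entity."""
    merged = {}
    for rec in records:
        target = merged.setdefault(rec.customer_id, Customer(rec.customer_id))
        target.name = target.name or rec.name
        target.emails |= rec.emails
    return merged


crm_row = {"cust_no": "C-17", "full_name": "Ada Example", "email": "Ada@Example.org"}
billing_row = {"customerId": "C-17", "name": "Ada Example",
               "contact_email": "billing@example.org"}
customers = integrate([from_crm(crm_row), from_billing(billing_row)])
```

The mapping functions are where the data model does its work: once both sources agree on what a customer is, the merge itself is trivial.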
Big pharma has been adopting web semantics to help
them achieve their FAIR data objectives
18
John Sowa, AWS, 2020
Semantics is the
science of shared
meaning in the form of
contextualized data
Web semantics
harnesses the power of
machine-readable
knowledge models to
create quality data
shared at scale
First step: Build knowledge graphs and link them
19
Linked Open Data Cloud, 2022
Starter triple for a knowledge graph
A standard knowledge graph consists of triplified, relationship-rich
data. The data model, or ontology, is also described in triples and
lives with the rest of the data. Ontologies can also be managed as
data. Linking triples merely requires a verb (or predicate, or
described edge) to link them.
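The description above can be sketched with plain Python tuples of (subject, predicate, object). All identifiers below (the `ex:` and `ex2:` names, and the RDF-style predicates used as plain strings) are illustrative; a real project would typically use an RDF library and full IRIs.

```python
# A minimal sketch of "triplified" data using (subject, predicate, object)
# tuples. All identifiers are hypothetical and used as plain strings.

# Ontology (data model) triples live in the same graph as the instance data.
ontology = {
    ("ex:Drug", "rdfs:subClassOf", "ex:Product"),
    ("ex:treats", "rdfs:domain", "ex:Drug"),
}

instance_data = {
    ("ex:aspirin", "rdf:type", "ex:Drug"),
    ("ex:aspirin", "ex:treats", "ex:headache"),
}

# A second, separately maintained graph.
other_graph = {
    ("ex2:headache_study_7", "ex2:investigates", "ex:headache"),
}

graph = ontology | instance_data | other_graph

# Linking the two graphs takes one more triple: a predicate between existing nodes.
graph.add(("ex:aspirin", "ex2:evaluatedIn", "ex2:headache_study_7"))


def objects(g: set, subject: str, predicate: str) -> set:
    """All objects reachable from `subject` via `predicate`."""
    return {o for s, p, o in g if s == subject and p == predicate}


def is_instance_of(g: set, node: str, cls: str) -> bool:
    """True if `node` has type `cls` directly or via rdfs:subClassOf."""
    frontier = list(objects(g, node, "rdf:type"))
    seen = set()
    while frontier:
        current = frontier.pop()
        if current == cls:
            return True
        if current in seen:
            continue
        seen.add(current)
        frontier.extend(objects(g, current, "rdfs:subClassOf"))
    return False
```

Because the ontology triples sit in the same set as the instance data, `is_instance_of(graph, "ex:aspirin", "ex:Product")` can answer a question that no single instance triple states directly.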
Semantic standards allow a desiloed data landscape
20
How shared graph semantics helps
● Boosts meaningful results and relevancy (poor results stem from a lack of data and logic transparency and
cohesiveness)
● Contextualizes data for better management and reuse with relationship logic
● Scales meaningful connections between contexts (relevant relationships living
with entities)
● Enables Metcalfe’s network of networks effect (network_effect^N)
● Enables model-driven development (code once, reuse anywhere)
● Scales efficiencies and economies so that energy consumption is reduced
21
Organic data
22
Organic data grows from a seed into a tree
23
Zero-copy integration
Case study examples
24
Blue Brain Nexus: graph-based bioinformatics collaboration
25
Blue Brain Nexus knowledge graph uses
26
Serves to unify most
data handling,
management and
transformation functions
Starting point: Find out what we can about the neocortical microcircuits of rats,
given ten years’ worth of heterogeneous data on these circuits.
Montefiore Health’s Patient-centered Analytical Learning Machine (“PALM”):
Personalized medicine at scale
27
Human-machine interaction from a
FAIR data lifecycle perspective
28
Policy engine-based access control: Open Policy Agent
29
“OPA generates policy decisions by evaluating the query input against policies and data. OPA and Rego
are domain-agnostic so you can describe almost any kind of invariant in your policies. For example:
● Which users can access which resources.
● Which subnets egress traffic is allowed to.
● Which clusters a workload must be deployed to.
● Which registries binaries can be downloaded from.
● Which OS capabilities a container can execute with.
● Which times of day the system can be accessed at.”
“…OPA policies are expressed in a high-level declarative language called Rego. Rego (pronounced
“ray-go”) is purpose-built for expressing policies over complex hierarchical data structures.”
– Open Policy Agent site at https://www.openpolicyagent.org/
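To make "evaluating the query input against policies and data" concrete, here is a conceptual analogue in plain Python. It is not Rego and does not use the OPA API; the rule, the data layout and every name in it are hypothetical, chosen to mirror two of the examples in the quote (which users can access which resources, and at which times of day).

```python
# Conceptual analogue only: plain Python, not Rego and not the OPA API.
# A decision is made by evaluating a query (input) against a policy rule
# plus supporting data. All names and the rule itself are hypothetical.
from datetime import time

# "Data": facts the policy can consult.
DATA = {
    "resource_owners": {"report-q1": "alice", "report-q2": "bob"},
    "admins": {"carol"},
    "access_window": (time(8, 0), time(18, 0)),
}


def allow(query: dict, data: dict = DATA) -> bool:
    """Allow access if the user owns the resource or is an admin,
    and the request falls inside the permitted time window."""
    start, end = data["access_window"]
    in_window = start <= query["at"] <= end
    is_owner = data["resource_owners"].get(query["resource"]) == query["user"]
    is_admin = query["user"] in data["admins"]
    return in_window and (is_owner or is_admin)


if __name__ == "__main__":
    print(allow({"user": "alice", "resource": "report-q1", "at": time(9, 30)}))
```

Keeping the rule separate from the application code that asks the question is the design point: the policy can change without redeploying the applications that depend on it.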
Decentralization and data ownership
30
The Five Commingled Phases of Compute, Networking and Storage
31
● 1st: Mainframe and Green Screens (centralized storage and compute, with minimal networking)
● 2nd: Client-Server and Desktops (application distribution via proprietary and IP networking)
● 3rd: Early Web on Client-Server (simple web hosting + legacy client-server storage)
● 4th: Distributed Cloud and Mobile Devices (commodity servers + storage + some virtualization)
● 5th: “Decoupled” and “Decentralized” Cloud (compute and storage more loosely coupled, virtualized, controlled and data-centric)
Chart axes: time; more centralized to less centralized; application-centric to data-centric
All phases are still active and evolving
File:Decentralization.jpg, by Adam Aladdin, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=35018016
Data centralization versus decentralization
32
Ethereum’s contribution:
Each peer node can play a
role in confirming blocks of
transactions.
This method also enables
tamperproof smart
contracts, or legal
agreements expressed in
self-executing code.
P2P data networks such as
IPFS + blockchains =
decentralized infrastructure
that enables dApps
Has a host, but one
that’s less of a
bottleneck
Shared transactions require tamperproof ledgers
33
Blockchains are
shared tamperproof
ledgers of concise,
deterministic
transaction
messages.
The graph
provides the
iterative
collaboration
and refined
data and logic
sharing loop.
Without the
data quality of a
knowledge
graph,
blockchains are
garbage
in/garbage out.
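Why a hash-linked ledger makes tampering detectable can be shown with a toy example. The sketch below leaves out everything that makes a real blockchain decentralized (peers, consensus, signatures); the block structure and function names are assumptions for illustration only.

```python
# A toy hash-chained ledger, to illustrate why tampering is detectable.
# Not a real blockchain: no peers, consensus or signatures.
import hashlib
import json

GENESIS_HASH = "0" * 64


def block_hash(prev_hash: str, message: dict) -> str:
    """Deterministic SHA-256 over the previous hash and the message."""
    payload = json.dumps({"prev": prev_hash, "msg": message}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(ledger: list, message: dict) -> None:
    """Add a block whose hash commits to the message and to the previous block."""
    prev = ledger[-1]["hash"] if ledger else GENESIS_HASH
    ledger.append({"prev": prev, "msg": message, "hash": block_hash(prev, message)})


def is_valid(ledger: list) -> bool:
    """Recompute every hash and check each block points at its predecessor."""
    prev = GENESIS_HASH
    for block in ledger:
        if block["prev"] != prev or block["hash"] != block_hash(prev, block["msg"]):
            return False
        prev = block["hash"]
    return True


ledger = []
append(ledger, {"from": "a", "to": "b", "amount": 5})
append(ledger, {"from": "b", "to": "c", "amount": 2})
```

Note what `is_valid` does and does not check: it confirms that recorded messages were not altered afterwards, but it says nothing about whether a message was accurate when it was written, which is the garbage in/garbage out point above.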
Decentralized identity: Custody and
control of your own personal data
34
Data ownership and control is becoming a major bone of
contention
35
“Every time you drive (a post-2017 Tesla), it records the whole track of
where you drive, the GPS coordinates and certain other metrics for
every mile driven.
“They say that they are anonymizing the trigger results, but you could
probably match everything to a single person if you wanted to.”
–Anonymous reverse engineer of Tesla data, as quoted by Mark Harris in IEEE Spectrum, Aug 2022
Self-sovereign identity = personal or B2B data ownership/control
36
Markus Sabadello, “Decentralized IDentifiers (DIDs),” W3C Workshop on Privacy and Linked Data, Vienna, 2018
Amazon controls
the user
agreements, data
and how it’s stored
User controls PII
and grants
permission and
access; PII stays in
place
PII = Personally
Identifiable
Information
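For readers who have not seen a decentralized identifier, the sketch below parses one and builds the smallest possible DID document. The regular expression is a simplified approximation of the W3C DID syntax (`did:` + method name + `:` + method-specific ID), not the full grammar, and the helper names are hypothetical.

```python
# Sketch of the DID shape: did:<method>:<method-specific-id>.
# The pattern is a simplified approximation of the W3C DID syntax and the
# helper names are hypothetical; this is not a conformant implementation.
import re

DID_PATTERN = re.compile(r"^did:([a-z0-9]+):([A-Za-z0-9._:%-]*[A-Za-z0-9._%-])$")


def parse_did(did: str) -> tuple:
    """Split a DID into (method, method_specific_id) or raise ValueError."""
    match = DID_PATTERN.match(did)
    if match is None:
        raise ValueError(f"not a well-formed DID: {did!r}")
    return match.group(1), match.group(2)


def minimal_did_document(did: str) -> dict:
    """Smallest document for a DID: a JSON-LD context plus the required `id`.
    Keys and service endpoints would be added by the controller; omitted here."""
    parse_did(did)  # raises ValueError if the DID is malformed
    return {"@context": "https://www.w3.org/ns/did/v1", "id": did}


if __name__ == "__main__":
    print(parse_did("did:example:123456789abcdefghi"))
    print(minimal_did_document("did:example:123456789abcdefghi"))
```

The identifier itself carries no PII; whatever personal data exists stays with its holder, and the DID only gives others a way to verify who they are dealing with.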
Content addressing = rich, end-to-end encrypted identities
for represented entities
37
https://commons.wikimedia.org/wiki/File:Identity-concept.svg
Representation,
linking and
encryption are all
automated and
built into P2P data
networks.
You choose
whether or not to
share your content
addressed graph
with others, and if
so, how.
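Content addressing means the address of a piece of data is derived from the data itself. The sketch below shows only that idea, using a plain SHA-256 digest and an in-memory store; real P2P networks such as IPFS use richer content identifiers and chunking, and the encryption mentioned above is not shown. The `sha256:` prefix and the class name are assumptions for illustration.

```python
# Simplified content addressing: the address is derived from the bytes.
# Not how IPFS builds its identifiers; encryption is out of scope here.
import hashlib


def content_address(data: bytes) -> str:
    """Derive an address from the content itself."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


class ContentStore:
    """In-memory stand-in for a content-addressed store."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, data: bytes) -> str:
        address = content_address(data)
        self._blobs[address] = data   # identical content is stored only once
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]
        # Anyone can verify retrieved content against its address.
        if content_address(data) != address:
            raise ValueError("content does not match its address")
        return data

    def __len__(self) -> int:
        return len(self._blobs)
```

Two properties follow directly: identical content always gets the same address (so duplicates collapse), and any change to the content changes the address (so a link cannot silently point at altered data).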
Decentralized knowledge graphs
38
Example dCloud services base infrastructure today:
IPFS
39
“In IPFS, content* is delivered from the closest peers
that possess a copy of the content removing the
single-node pressure and improving the user
experience.”
–zK Capital Research, “IPFS: The Interplanetary File
system,” 2018
*Content infrastructure and management = data infrastructure and
management.
IPFS = InterPlanetary File System
P2P
The InterPlanetary File System versus HTTP
40
Rachael Zisk, “Lockheed and Filecoin Foundation Partner to Deploy IPFS,” Payload, May 2022
Enterprise decentralized app environment: OriginTrail.io
41
https://origintrail.io/
Web3/knowledge graph dSaaS stack: OriginTrail.io
42
https://origintrail.io/
OriginTrail + BSI’s supply chain tracking and tracing
43
OriginTrail and the British Standards Institution (BSI), https://twitter.com/origin_trail/status/1339606640887152642?s=20, Dec. 2020
The Monasteriven
whiskey produced in
Ireland is tracked and
traced from “grain to
glass” with the
OriginTrail.io
approach.
OT uses a decentralized knowledge graph that connects to one of
several different blockchains.
This method enables
shared data reuse
and other synergies
across the supply
chain.
SOLID: Federated storage and decentralized apps
44
Ruben Verborgh, “Decentralizing personal data management with Solid: a hands-on workshop,” SEMIC Workshop, October 2020
SOLID shared, federated XaaS: Construction industry
45
“TrinPod™: World's first conceptually indexed space-time
digital twin using Solid,” Graphmetrix, 2022,
https://graphmetrix.com/trinpod
Company-specific SOLID storage pods and access
control can be managed by each supply chain partner.
Graphmetrix as digital twin provider manages the
system and system-level apps.
Digital twins and agents: Better data sharing than APIs?
46
Autonomous agents
Digital twins
Locale: Portsmouth, UK
Sensor nets
Iotics, 2019 and 2023
Final thoughts
47
How FAIR benefits governance, risk and compliance efforts
● Transparency at the data layer
● Safe collection as a part of the data lifecycle
● Prioritization of best open data assets
“FAIR data principles align with data compliance and privacy standards, helping
businesses prioritize the safe collection, use, and sharing of data as an up-front,
transparent, and communal responsibility.”
– Sharat Endapally, TDWI, January 17, 2023,
https://tdwi.org/articles/2023/01/17/diq-all-who-is-responsible-for-fair-data.aspx
48
Seven obstacles to adoption of semantic connected data
sharing environments
49
Q&A
50
Feel free to ping me anytime with questions, etc.
Alan Morrison
Data Science Central
LinkedIn | Twitter | Quora | Slideshare
+1 408 205 5109
a.s.morrison@gmail.com
