The FAIR data movement and 22 Feb 2023.pdf

The FAIR data movement and data protection
Alan Morrison
BrightTALK Data Protection for the Digital Enterprise Summit
Presented on Feb. 22, 2023
1

Topics covered in this talk:
● Problem: Data sprawl and opacity
● Solution: FAIR data architecture
● The FAIR data movement
● Semantics and data-centric architecture
● Decentralization and identity data
● Final thoughts
2

Problem: Data sprawl and opacity
3

AI’s data/knowledge problem: Provincial IT legacy infrastructure
4
● Thousands of databases per enterprise (siloing)
● Thousands of applications (code sprawl)
● Data models buried in the app code
● Every app a special snowflake with its own data model

Related problem: Data duplication at unprecedented scale
“Machine learning is on track to consume all the energy being supplied, a model that is costly, inefficient, and unsustainable,”
– Brian Bailey in Semiconductor Engineering, November 2022
Conclusions from recent research:
Machine learning inefficiency is worsening
● Lack of generalization and context in machine learning training sets leads to huge amounts of near-duplicate data.
Legacy application-centric architectures strand and duplicate large swaths of both code and data
● If we didn’t have to duplicate any data whatsoever, the world would only have to generate 11 percent of the data it currently does.
● IDC predicts that the world will generate 178 zettabytes of data annually by 2025. At that pace, “The Yottabyte Era” would succeed The Zettabyte Era by 2030, if not earlier.
● CMSWire reported in 2020 that storing two yottabytes would cost $58 trillion. If the cost per byte stored stayed constant, 40 percent of the world’s economic output would be consumed in 2035 by just storing data.
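The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted on this slide; it assumes decimal units (1 YB = 1,000 ZB), a constant cost per byte, and that a full year's data would be stored. It does not try to reproduce the 2035 projection, which depends on growth and economic-output assumptions not given here.

```python
# Back-of-envelope check of the storage figures quoted above.
# Assumptions (not from the talk): decimal units, constant cost per byte,
# and that everything generated in a year is actually stored.

ZB_PER_YB = 1_000                 # 1 yottabyte = 1,000 zettabytes (decimal)

cost_two_yb_usd = 58e12           # CMSWire (2020): $58 trillion for 2 YB
annual_data_zb = 178              # IDC: 178 ZB generated annually by 2025
non_duplicate_share = 0.11        # share needed if no data were duplicated

cost_per_zb_usd = cost_two_yb_usd / (2 * ZB_PER_YB)
cost_one_year_usd = annual_data_zb * cost_per_zb_usd
deduplicated_zb = annual_data_zb * non_duplicate_share

print(f"Cost per zettabyte: ${cost_per_zb_usd / 1e9:.0f} billion")
print(f"Cost to store one year of data: ${cost_one_year_usd / 1e12:.2f} trillion")
print(f"Data needed without duplication: {deduplicated_zb:.1f} ZB per year")
```

Under these assumptions the quoted price works out to about $29 billion per zettabyte, so storing one year of 2025-level output would cost a little over $5 trillion.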
5

Solution: Focus on FAIR data architecture and sovereignty for protection and compliance
● Garbage in, garbage out still applies. Boost your quality data input by adopting FAIR principles to streamline and scale operations
● Shrink your data risk footprint by supporting self-sovereign identity
○ Support decentralized identity (W3C DID standard)
○ Push correlatable PII out to the device: keep it at rest there and do the matching on-device
● Embrace data-centric architecture and FAIR to avoid creating orphan data that’s
○ Siloed
○ Not self-describing
○ Not connected
○ Not generated for reusability
● Consider ontology-driven and semantic digital twin development
○ Applications are written to use the description or relationship logic the graph describes (10x less code needed)
○ Ontologies (semantic metadata) provide logical connections and context that allow reuse and thereby reduce the need for duplication
○ FAIR twins and agents can be a means of managing at scale
6

Rationalization of data-related departments and economies of scale
7
● “Data management” (structured data, mostly)
● Knowledge management (internally shared)
● Content management (externally shared)
● Learning management (internal coursework)
FAIR data and associated description logic

What is FAIR data?
9
When it comes to data, FAIR stands for findable, accessible, interoperable, and reusable.
FAIR data is data users can have confidence in more than once, for more than one purpose.
For humans to develop, find and use FAIR data, machines have to help them.
Machine-readable context helps to disambiguate data, enable data abstractions and automate processes. Ontologies (semantic graph data models) articulate those contexts.
Adoption of FAIR data principles would be a major step towards what DARPA calls the Third Wave of AI: Contextual computing.

Semantics is the path to FAIR, smart, siloless data sharing
11
James Kobielus, 2016
Association of European Libraries, 2017

Compare FAIR and TRUST principles
12
Lin, D., Crabtree, J., Dillo, I. et al. The TRUST Principles for digital repositories. Sci Data 7, 144 (2020). https://doi.org/10.1038/s41597-020-0486-7
FAIR data leads to TRUSTed data repositories.

Who’s behind the FAIR data movement? Big pharma, for one.
“From 2023, drug submissions to the European Medicines Agency (EMA) must comply with select Identification of Medicinal Products (IDMP) standards. By developing an IDMP-compliant ontology with machine-ready data, the Alliance will support the move to automate this process, improving efficiency and patient safety, reducing costs and time burden, and driving innovation in the drug development pipeline.
“The project is managed by the Pistoia Alliance, with a project team of experts from Bayer, Novartis, Roche, Merck KGaA, and GSK.”
13
–Erik Schultes, et al., “FAIR Digital Twins for Data-Intensive Research,” PERSPECTIVE article, Front. Big Data, 11 May 2022, Sec. Data Science, Volume 5 - 2022 | https://doi.org/10.3389/fdata.2022.883341

Semantics and data-centric architecture
14

Humans-in-the-loop = second-order cybernetics: Involving users and SMEs to create context with the help of machines
15
First order (Engineer outside box)
Second order (Users and domain experts inside box)
Stewart Brand, et al., Co-Evolution Quarterly, 1976
Semantics includes the meaning humans and machines create together.
Feedback loops provide a means of iteration and incremental refinement.

Terpsichore: Human-in-the-loop semantic data lifecycle for urban heritage/smart cities
16
An iterative, bottom-up, user-driven process:
● User engagement
● Collection
● Digestion
● Semantic classification
● Automated suggestion loops
Results:
● Enrichment of useful data collections
● Improved dialogue between user communities
Artopoulos, Giorgos & Smaniotto Costa, Carlos. (2019). Data-Driven Processes in Participatory Urbanism: The “Smartness” of Historical Cities. Architecture and Culture. 7. 1-19. 10.1080/20507828.2019.1631061.

What’s a data model? What is data integration?
17
An effective data model describes and unifies the contexts necessary for true data integration. It gives machines enough clues to detect and discover layered context.
“What is data integration?
Let's start with a short list of what data integration is not:
● It's not shoveling data around between systems.
● It's not calling an API.
● It's not creating a data connection to a source system.
It can include one or more of the jobs in the list here above, but what is the ingredient that cannot be missing?
It's connecting data from different source systems together in a consistent and coherent data model.”
–Wouter Trappers, CDAO
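As one way to picture the distinction Trappers draws, the sketch below takes records from two hypothetical source systems (a CRM and a billing system, with invented field names and data) and connects them in one shared model keyed on a common identifier. It is an illustration under those assumptions, not a method described in the talk.

```python
from dataclasses import dataclass, field

@dataclass
class Customer:
    """Shared model that both source systems are mapped into."""
    customer_id: str
    name: str = ""
    invoices: list = field(default_factory=list)

# Hypothetical source records with their own, incompatible field names.
crm_rows = [{"cust_no": "C1", "full_name": "Ada Lovelace"}]
billing_rows = [
    {"customerRef": "C1", "invoice": "INV-7", "amount": 120.0},
    {"customerRef": "C1", "invoice": "INV-9", "amount": 80.0},
]

def integrate(crm, billing):
    """Connect both sources in one consistent model, keyed on customer id."""
    customers = {}
    for row in crm:
        cid = row["cust_no"]
        customers.setdefault(cid, Customer(cid)).name = row["full_name"]
    for row in billing:
        cid = row["customerRef"]
        customer = customers.setdefault(cid, Customer(cid))
        customer.invoices.append((row["invoice"], row["amount"]))
    return customers

model = integrate(crm_rows, billing_rows)
total = sum(amount for _, amount in model["C1"].invoices)
print(model["C1"].name, total)
```

Moving the rows or calling either system's API would not by itself answer "what has this customer been invoiced?"; the shared `Customer` model is what makes that question answerable.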

Big pharma has been adopting web semantics to help them achieve their FAIR data objectives
18
John Sowa, AWS, 2020
Semantics is the science of shared meaning in the form of contextualized data
Web semantics harnesses the power of machine-readable knowledge models to create quality data shared at scale

First step: Build knowledge graphs and link them
19
Linked Open Data Cloud, 2022
Starter triple for a knowledge graph
A standard knowledge graph consists of triplified, relationship-rich data. The data model, or ontology, is also described in triples and lives with the rest of the data. Ontologies can also be managed as data. Linking triples merely requires a verb (or predicate, or described edge) to link them.
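A minimal sketch of that idea in plain Python, with triples as tuples. The `ex:` and `ex2:` identifiers are made up for illustration; `rdf:type`, `rdfs:subClassOf` and `owl:sameAs` are standard vocabulary terms, used here only as strings rather than through an RDF library.

```python
# Each triple is (subject, predicate, object). Instance data and the
# ontology that describes it live in the same collection.
graph_a = {
    ("ex:Aspirin", "rdf:type", "ex:Drug"),
    ("ex:Aspirin", "ex:treats", "ex:Headache"),
    # Ontology (data model) triple, stored alongside the data:
    ("ex:Drug", "rdfs:subClassOf", "ex:Substance"),
}
graph_b = {
    ("ex2:Headache", "rdf:type", "ex2:Symptom"),
}

# Linking the two graphs takes one more triple whose predicate
# states how the two nodes relate.
link = ("ex:Headache", "owl:sameAs", "ex2:Headache")
linked = graph_a | graph_b | {link}

def objects(graph, subject, predicate):
    """Return every object matching (subject, predicate, ?)."""
    return {o for s, p, o in graph if s == subject and p == predicate}

# Follow the link from the first graph into the second.
for condition in objects(linked, "ex:Aspirin", "ex:treats"):
    for same in objects(linked, condition, "owl:sameAs"):
        print(condition, "->", same, objects(linked, same, "rdf:type"))
```

Because the ontology triple sits in the same set as the instance triples, the same `objects` lookup works on both, which is the sense in which an ontology can be managed as data.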

Semantic standards allow a desiloed data landscape
20

How shared graph semantics helps
● Boosts meaningful results and relevancy (poor results stem from a lack of data and logic transparency and cohesiveness)
● Contextualizes data for better management and reuse with relationship logic
● Scales meaningful connections between contexts (relevant relationships living with entities)
● Enables Metcalfe’s network of networks effect (network_effect^N)
● Enables model-driven development (code once, reuse anywhere)
● Scales efficiencies and economies so that energy consumption is reduced
21

Organic data grows from a seed into a tree
23
Zero-copy integration

Blue Brain Nexus: graph-based bioinformatics collaboration
25

Blue Brain Nexus knowledge graph uses
26
Serves to unify most data handling, management and transformation functions
Starting point: Find out what we can about the neocortical microcircuits of rats, given ten years’ worth of heterogeneous data on these circuits.

Montefiore Health’s Patient-centered Analytical Learning Machine (“PALM”): Personalized medicine at scale
27

Human-machine interaction from a FAIR data lifecycle perspective
28

Policy engine-based access control: Open Policy Agent
29
“OPA generates policy decisions by evaluating the query input against policies and data. OPA and Rego are domain-agnostic so you can describe almost any kind of invariant in your policies. For example:
● Which users can access which resources.
● Which subnets egress traffic is allowed to.
● Which clusters a workload must be deployed to.
● Which registries binaries can be downloaded from.
● Which OS capabilities a container can execute with.
● Which times of day the system can be accessed at.”
“…OPA policies are expressed in a high-level declarative language called Rego. Rego (pronounced “ray-go”) is purpose-built for expressing policies over complex hierarchical data structures.”
– Open Policy Agent site at https://www.openpolicyagent.org/
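The decision model in the quote (a query input evaluated against policies and data) can be pictured with a plain-Python analogy. This is not Rego and not OPA's API; the users, roles, rule names and the default-deny choice are all hypothetical, chosen to mirror two of the examples listed above (who can access which resources, and at which times of day).

```python
# A plain-Python analogy for "decision = evaluate(input, policies, data)".
# Not Rego, not OPA's API; every name below is hypothetical.

data = {
    "roles": {"alice": "admin", "bob": "analyst"},
    "allowed_hours": range(8, 18),   # 08:00 up to, not including, 18:00
}

def admins_can_access_anything(inp, data):
    return data["roles"].get(inp["user"]) == "admin"

def analysts_read_reports_in_hours(inp, data):
    return (
        data["roles"].get(inp["user"]) == "analyst"
        and inp["action"] == "read"
        and inp["resource"].startswith("reports/")
        and inp["hour"] in data["allowed_hours"]
    )

POLICIES = [admins_can_access_anything, analysts_read_reports_in_hours]

def decide(inp, data, policies=POLICIES):
    """Default deny: allow only if at least one policy rule matches."""
    return any(rule(inp, data) for rule in policies)

daytime = {"user": "bob", "action": "read", "resource": "reports/q1", "hour": 10}
night = {"user": "bob", "action": "read", "resource": "reports/q1", "hour": 22}
print(decide(daytime, data), decide(night, data))
```

The point of the separation is that the rules and the data can change without touching the application that asks for the decision.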

Decentralization and data ownership
30

The Five Commingled Phases of Compute, Networking and Storage
31
● 1st: Mainframe and Green Screens. Centralized storage and compute, with minimal networking
● 2nd: Client-Server and Desktops. Application Distribution via Proprietary and IP Networking
● 3rd: Early Web (on Client-Server). Simple web hosting + legacy Client-Server storage
● 4th: Distributed Cloud and Mobile Devices. Commodity servers + storage + some virtualization
● 5th: “Decoupled” and “Decentralized” Cloud. Compute and storage more loosely coupled, virtualized, controlled and data-centric
Diagram labels: Time; Less centralized, More centralized; Application Centric, Data Centric
All phases are still active and evolving

Data centralization versus decentralization
32
File:Decentralization.jpg, by Adam Aladdin, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=35018016
Ethereum’s contribution: Each peer node can play a role in confirming blocks of transactions.
This method also enables tamperproof smart contracts, or legal agreements expressed in self-executing code.
P2P data networks such as IPFS + blockchains = decentralized infrastructure that enables dApps
Has a host, but one that’s less of a bottleneck

Shared transactions require tamperproof ledgers
33
Blockchains are shared tamperproof ledgers of concise, deterministic transaction messages.
The graph provides the iterative collaboration and refined data and logic sharing loop.
Without the data quality of a knowledge graph, blockchains are garbage in/garbage out.
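The "tamperproof" property rests on hash chaining: each block's hash covers its transactions and the previous block's hash, so changing an old entry invalidates the chain. The sketch below shows only that mechanism, using Python's standard `hashlib`; it assumes a single local list and leaves out peer confirmation, consensus and signatures, which a real blockchain adds. Field names are illustrative.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64   # placeholder "previous hash" for the first block

def block_hash(index, prev_hash, transactions):
    """Hash a block's contents deterministically (sorted keys, fixed layout)."""
    payload = json.dumps(
        {"index": index, "prev": prev_hash, "tx": transactions},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_block(chain, transactions):
    """Add a block whose hash covers its data and its predecessor's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV
    index = len(chain)
    chain.append({
        "index": index,
        "prev": prev_hash,
        "tx": transactions,
        "hash": block_hash(index, prev_hash, transactions),
    })

def is_valid(chain):
    """Recompute every hash and check each block points at its predecessor."""
    for i, block in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i else GENESIS_PREV
        if block["prev"] != expected_prev:
            return False
        if block["hash"] != block_hash(block["index"], block["prev"], block["tx"]):
            return False
    return True

ledger = []
append_block(ledger, [{"from": "a", "to": "b", "amount": 5}])
append_block(ledger, [{"from": "b", "to": "c", "amount": 2}])
print(is_valid(ledger))

ledger[0]["tx"][0]["amount"] = 500   # tamper with an earlier transaction
print(is_valid(ledger))
```

Note what the check does not do: it confirms the record was not altered, not that the recorded values were correct in the first place, which is the garbage in/garbage out point made above.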

Decentralized identity: Custody and control of your own personal data
34

Data ownership and control is becoming a major bone of contention
35
“Every time you drive (a post-2017 Tesla), it records the whole track of where you drive, the GPS coordinates and certain other metrics for every mile driven.
“They say that they are anonymizing the trigger results, but you could probably match everything to a single person if you wanted to.”
–Anonymous reverse engineer of Tesla data, as quoted by Mark Harris in IEEE Spectrum, Aug 2022

Self-sovereign identity = personal or B2B data ownership/control
36
Markus Sabadello, “Decentralized IDentifiers (DIDs),” W3C Workshop on Privacy and Linked Data, Vienna, 2018
Amazon controls the user agreements, data and how it’s stored
User controls PII and grants permission and access; PII stays in place
PII = Personally Identifiable Information
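For readers who have not seen one, the sketch below shows roughly what a DID and its DID document look like, written as a Python dict. The property names (`@context`, `id`, `verificationMethod`, `authentication`) follow the W3C DID Core data model; the `did:example` identifier is the spec's placeholder method, and the key value is a dummy string, not a real key. The `parse_did` helper is a hypothetical convenience, not part of any library.

```python
import json

def parse_did(did):
    """Split a DID into (method, method-specific id); raise if malformed."""
    scheme, method, specific_id = did.split(":", 2)
    if scheme != "did" or not method or not specific_id:
        raise ValueError(f"not a DID: {did!r}")
    return method, specific_id

DID = "did:example:123456789abcdefghi"

# The document carries public key material and references only.
# No personal data appears in it; PII stays with its owner.
did_document = {
    "@context": "https://www.w3.org/ns/did/v1",
    "id": DID,
    "verificationMethod": [{
        "id": f"{DID}#keys-1",
        "type": "Ed25519VerificationKey2020",
        "controller": DID,
        "publicKeyMultibase": "zPLACEHOLDER-not-a-real-key",
    }],
    # Which key(s) the DID subject uses to authenticate.
    "authentication": [f"{DID}#keys-1"],
}

print(parse_did(DID))
print(json.dumps(did_document, indent=2))
```

The identifier resolves to keys the user controls rather than to an account held by a platform, which is the contrast the slide draws with the Amazon case.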

Content addressing = rich, end-to-end encrypted identities for represented entities
37
https://commons.wikimedia.org/wiki/File:Identity-concept.svg
Representation, linking and encryption are all automated and built into P2P data networks.
You choose whether or not to share your content-addressed graph with others, and if so, how.
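Content addressing means the identifier of a piece of data is derived from the data itself, usually by hashing it. The sketch below is a deliberately simplified illustration using a SHA-256 hex digest and an in-memory dict; real P2P networks such as IPFS use richer identifiers and chunked storage, and the encryption mentioned above is not shown. The `store`, `put` and `get` names are hypothetical.

```python
import hashlib

store = {}   # stands in for a peer's local block store

def put(content: bytes) -> str:
    """Store content under the SHA-256 hex digest of its own bytes."""
    address = hashlib.sha256(content).hexdigest()
    store[address] = content
    return address

def get(address: str) -> bytes:
    """Fetch content and verify it still matches the address requested."""
    content = store[address]
    if hashlib.sha256(content).hexdigest() != address:
        raise ValueError("content does not match its address")
    return content

a1 = put(b"grain-to-glass batch record")
a2 = put(b"grain-to-glass batch record")      # same bytes, same address
a3 = put(b"grain-to-glass batch record v2")   # different bytes, new address
print(a1 == a2, a1 == a3)
```

Two consequences follow directly: identical content is stored once however many times it is added, and any peer can verify what it received without trusting the peer that served it.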

Decentralized knowledge graphs
38

Example dCloud services base infrastructure today: IPFS
39
“In IPFS, content* is delivered from the closest peers that possess a copy of the content removing the single-node pressure and improving the user experience.”
–zK Capital Research, “IPFS: The Interplanetary File system,” 2018
*Content infrastructure and management = data infrastructure and management.
IPFS = Interplanetary File System
P2P

The InterPlanetary File System versus HTTP
40
Rachael Zisk, “Lockheed and Filecoin Foundation Partner to Deploy IPFS,” Payload, May 2022

Enterprise decentralized app environment: OriginTrail.io
41
https://origintrail.io/

Web3/knowledge graph dSaaS stack: OriginTrail.io
42
https://origintrail.io/

OriginTrail + BSI’s supply chain tracking and tracing
43
OriginTrail and the British Standards Institute (BSI), https://twitter.com/origin_trail/status/1339606640887152642?s=20, Dec. 2020
The Monasteriven whiskey produced in Ireland is tracked and traced from “grain to glass” with the OriginTrail.io approach.
OT uses a decentralized knowledge graph that connects to one of several different blockchains.
This method enables shared data reuse and other synergies across the supply chain.

SOLID: Federated storage and decentralized apps
44
Ruben Verborgh, “Decentralizing personal data management with Solid: a hands-on workshop,” SEMIC Workshop, October 2020

SOLID shared, federated XaaS: Construction industry
45
“TrinPod™: World's first conceptually indexed space-time digital twin using Solid,” Graphmetrix, 2022, https://graphmetrix.com/trinpod
Company-specific SOLID storage pods and access control can be managed by each supply chain partner.
Graphmetrix as digital twin provider manages the system and system-level apps.

Digital twins and agents: Better data sharing than APIs?
46
Autonomous agents
Digital twins
Locale: Portsmouth, UK
Sensor nets
Iotics, 2019 and 2023

How FAIR benefits governance, risk and compliance efforts
● Transparency at the data layer
● Safe collection as a part of the data lifecycle
● Prioritization of best open data assets
“FAIR data principles align with data compliance and privacy standards, helping businesses prioritize the safe collection, use, and sharing of data as an up-front, transparent, and communal responsibility.”
– Sharat Endapally, TDWI, January 17, 2023, https://tdwi.org/articles/2023/01/17/diq-all-who-is-responsible-for-fair-data.aspx
48

Seven obstacles to adoption of semantic connected data sharing environments
49

Q&A
50
Feel free to ping me anytime with questions, etc.
Alan Morrison
Data Science Central
LinkedIn | Twitter | Quora | Slideshare
+1 408 205 5109
a.s.morrison@gmail.com
