Principles for proper Data Management and
Re-Use – an RDA view
Peter Wittenburg
Max Planck Society
2
Clarification
▪ Does RDA have one view? Yes and no.
▪ RDA is basically a bottom-up organization driven by the many "creative" minds who want to change data practices.
▪ RDA now has about 2000 members, so do we have 2000 opinions?
▪ An intensive discussion process has been running since 2012 (ICRI Conference, Copenhagen), and we can see a number of trends and principles that all or most members seem to agree with.
▪ Still, RDA is a very young initiative and needs much attention and grease.
3
Why is this all relevant?
▪ Naoyuki Tsunematsu (JST):
• Data exchange (and thus the need for proper data management) is difficult to convey in Japanese science
• parallel trends observed for Japanese science
• not so often included in collaborations anymore
• not so often represented in the top papers
• enormous decrease in international ranking
• serious worries about counterproductive encapsulation
• this concern seems to be relevant for all of us
4
Trends I – Volume, Complexity
from simple structures ... towards complex relationships
5
Trends II - Anonymity
direct exchange between known colleagues
Domain of Repositories
6
Trends III – Re-Usage
Domain of trusted Repositories
• Data will be re-used in different contexts
• Data needs to be findable, accessible, combinable and
interpretable for others
7
Data Practices I – Survey
▪ ~120 interviews/interactions
▪ 2 workshops with leading scientists (EU, US)
▪ too much is done manually or via ad hoc scripts
▪ too much data is in legacy formats (no PIDs, no metadata)
▪ there are lighthouse projects etc., but ...
▪ DM and DP are not efficient and too expensive (one biologist spends 75% of his time as a data manager)
▪ federating data, including logical information, is much too expensive
▪ hardly any use of automated workflows, and a lack of reproducibility
9
Data Practices III – Metadata
[Bar chart: metadata standards. Category labels: DIF, DwC, DC, EML, FGDC, Open GIS, ISO, My Lab, none. Bar values: 12, 21, 26, 95, 95, 96, 97, 266, 676.]
Slide by Bill Michener, DataONE
10
Data Practices II – Data Entropy
▪ lack of proper documentation, schemas, semantics, relations, etc.
▪ directory structures, spreadsheets, etc. are ad hoc creations, and the knowledge about them fades away
▪ etc.
11
Changes needed – EUDAT and others
[Diagram labels: Community Center; Common Data Center]
Many excellent projects are working on changes: ESFRI projects, DataNet projects, e-Infrastructures, national projects.
RDA needs to build on their experiences and expertise.
12
RDA widely agreed I – time to change
➢ management of data objects is largely independent of data type and discipline
➢ still, every project defines its own strategies, leading to a huge stack of software that will not be maintainable
13
RDA widely agreed II – time to change what
[Architecture diagram. Labels: Value Added Services; Data Sources; Persistent Identifiers; Persistent Reference; Analysis; Citation; Apps; Custom Clients; Plug-Ins; Resolution System; Typing; PID; Local Storage; Cloud; Computed; Data Sets; RDBMS; Files; Digital Objects.]
[Digital object diagram. Labels: PID record attributes (points to instances, describes properties); bit sequence (instance); metadata attributes (describes properties & context); point to each other.]
14
RDA Results I: common data model
• PIDs are at the beginning of the trust chain
• have a worldwide, independent and robust PID system (DONA Handles; DOIs are Handles)!
• metadata are essential in an anonymous data world
Taken from the RDA WG Data Foundation & Terminology
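The common data model on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names (`PIDRecord`, `Metadata`, `DigitalObject`, `locations`, `metadata_pid`, `object_pid`) and the example identifiers are hypothetical and are not taken from the RDA working group output. The sketch shows the three parts named in the diagram (PID record, bit sequence, metadata) and the fact that PID record and metadata point to each other.

```python
from dataclasses import dataclass, field


@dataclass
class PIDRecord:
    """Attributes stored with the persistent identifier (hypothetical layout)."""
    pid: str                      # the identifier itself, e.g. a Handle
    locations: list[str]          # points to the stored instances (bit sequences)
    metadata_pid: str             # points to the metadata description
    properties: dict[str, str] = field(default_factory=dict)


@dataclass
class Metadata:
    """Describes properties and context; points back to the object's PID."""
    pid: str
    object_pid: str
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass
class DigitalObject:
    """A bit sequence together with its PID record and its metadata."""
    record: PIDRecord
    metadata: Metadata
    bits: bytes

    def is_consistent(self) -> bool:
        # PID record and metadata must reference each other.
        return (self.record.metadata_pid == self.metadata.pid
                and self.metadata.object_pid == self.record.pid)


# Made-up identifiers, for illustration only.
record = PIDRecord(pid="example/obj-1",
                   locations=["https://repository.example.org/obj-1"],
                   metadata_pid="example/md-1")
metadata = Metadata(pid="example/md-1", object_pid="example/obj-1",
                    attributes={"creator": "unknown"})
obj = DigitalObject(record=record, metadata=metadata, bits=b"\x10\x11\x01")
```

A consumer who only knows the PID can start from the record, reach the metadata, and check that the two agree before trusting the object.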
15
RDA Results II: Data Type Registry
▪ result: a registry for data types
▪ you get an unknown file, pull it onto the DTR, and its content is visualized
▪ extended MIME type concept
▪ no free lunch: someone needs to register and define the type
▪ code available at the beginning of 2015
▪ PIT demo already working with the DTR
• NIST is already working with communities on far-reaching ideas
[Diagram labels: Federated Set of Type Registries; Data Set; Dissemination; Terms; Rights; Agree; Visualization; Data Processing; Interpretation; Domain of Services; Human or Machine Consumers; steps 1 to 4.]
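The registry idea can be illustrated with a small Python sketch. It is not the DTR code or its API: `TypeRegistry`, `resolve_federated`, the type identifier `example/csv-table` and the handler `visualize_csv` are all hypothetical names chosen for this example. It shows the two points from the slide: an unknown file can be handled once its type is looked up, and someone has to register the type first.

```python
from typing import Callable, Dict, Iterable

Handler = Callable[[bytes], str]


class TypeRegistry:
    """Maps a registered type identifier to a function that can handle the data."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, type_id: str, handler: Handler) -> None:
        # "No free lunch": a type is only usable after someone registers it.
        self._handlers[type_id] = handler

    def resolve(self, type_id: str) -> Handler:
        try:
            return self._handlers[type_id]
        except KeyError:
            raise LookupError(f"type {type_id!r} is not registered") from None


def resolve_federated(registries: Iterable[TypeRegistry], type_id: str) -> Handler:
    """Ask each registry of a federation in turn until one knows the type."""
    for registry in registries:
        try:
            return registry.resolve(type_id)
        except LookupError:
            continue
    raise LookupError(f"type {type_id!r} is not registered in any registry")


def visualize_csv(data: bytes) -> str:
    """Toy 'visualization': report how many rows the table has."""
    rows = data.decode("utf-8").splitlines()
    return f"table with {len(rows)} rows"


local = TypeRegistry()
remote = TypeRegistry()
remote.register("example/csv-table", visualize_csv)

handler = resolve_federated([local, remote], "example/csv-table")
summary = handler(b"a,b\n1,2\n")
```

The federation is modelled here as a simple ordered list of registries; a real federated set of registries would need its own resolution protocol.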
16
RDA Results III: PID Information Types
▪ result: a generic API and a set of basic attributes
▪ a PID record is like a passport (number, photo, expiry date, etc.)
▪ if all PID service providers agree on one API and talk the same language (registered terms), software development will become easy
▪ test installation in operation together with the DTR
LOC      location, path
CKSM     checksum
CKSM_T   checksum type
RoR      owning repository
MD       path to MD
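As a sketch of why agreed attribute names make software development easier, the following Python example uses the five attribute names from the table above (LOC, CKSM, CKSM_T, RoR, MD) to verify a downloaded bit sequence. The record is modelled as a plain dictionary, the URLs are made up, and the convention that CKSM_T holds a `hashlib` algorithm name such as `sha256` is an assumption of this example, not part of the PID Information Types result.

```python
import hashlib


def verify_checksum(record: dict, data: bytes) -> bool:
    """Recompute the checksum named in CKSM_T and compare it with CKSM."""
    # Assumption: CKSM_T is an algorithm name that hashlib understands.
    algorithm = record["CKSM_T"].lower()
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == record["CKSM"].lower()


payload = b"example bit sequence"

# Hypothetical PID record; only the attribute names come from the slide.
pid_record = {
    "LOC": "https://repository.example.org/objects/42",
    "CKSM": hashlib.sha256(payload).hexdigest(),
    "CKSM_T": "sha256",
    "RoR": "repository.example.org",
    "MD": "https://repository.example.org/metadata/42",
}

intact = verify_checksum(pid_record, payload)
tampered = verify_checksum(pid_record, payload + b"!")
```

Because the client only relies on the registered attribute names, the same function would work against any provider that fills them in the same way.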
17
RDA Results IV: Practical Policies
▪ due to unforeseen circumstances, the group needs until P5
▪ Practical Policies = executable workflow statements
▪ result at P5: a set of best-practice PPs for a number of typical DM/DP tasks (integrity check, replication, etc.)
▪ a large collection of PPs is currently being evaluated
▪ you could add your policies
[Diagram labels: Policy Inventory (replication policy X, replication policy Y, integrity policy A, integrity policy B, integrity policy C, md extraction policy l, md extraction policy k, etc.); Repository; selection; implementation; execution; data manager.]
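The idea of policies as executable statements that a data manager selects from an inventory and runs against a repository can be sketched as follows. Everything here is an assumption made for illustration: the repository is modelled as a dictionary from object name to bytes, and the function names `integrity_policy_sha256` and `replication_policy` are hypothetical, not policies from the RDA collection.

```python
import hashlib
from typing import Callable, Dict

Repository = Dict[str, bytes]
Policy = Callable[[Repository], Dict[str, str]]


def integrity_policy_sha256(expected: Dict[str, str]) -> Policy:
    """Build a policy that compares each object's SHA-256 with a stored value."""
    def run(repo: Repository) -> Dict[str, str]:
        report = {}
        for name, data in repo.items():
            ok = hashlib.sha256(data).hexdigest() == expected.get(name)
            report[name] = "ok" if ok else "corrupt"
        return report
    return run


def replication_policy(target: Repository) -> Policy:
    """Build a policy that copies every object into a second store."""
    def run(repo: Repository) -> Dict[str, str]:
        report = {}
        for name, data in repo.items():
            target[name] = data
            report[name] = "replicated"
        return report
    return run


repo: Repository = {"a.dat": b"alpha", "b.dat": b"beta"}
expected = {name: hashlib.sha256(data).hexdigest() for name, data in repo.items()}
mirror: Repository = {}

# The inventory maps policy names to executable policies.
inventory: Dict[str, Policy] = {
    "integrity policy A": integrity_policy_sha256(expected),
    "replication policy X": replication_policy(mirror),
}

# The data manager selects policies by name and executes them.
selected = ["integrity policy A", "replication policy X"]
reports = {name: inventory[name](repo) for name in selected}
```

Keeping each policy as a self-contained function is one possible design; it makes it easy to add a new policy to the inventory without touching the code that executes them.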
18
RDA ongoing: Data Fabric
▪ need to place the many RDA WGs & IGs on a common landscape, since in the end everything needs to fit together -> Data Fabric
19
Changes take long ...
[Timeline labels: 1973 TCP/IP Specification; 1977 TCP/IP Stress-test; 1990; 1993; WWW-Mosaic available; worldwide adoption. 20 years!]
▪ many different suggestions & protocols
▪ at first no advantage for TCP/IP
▪ in the beginning, discussion about different email systems
▪ in the beginning, no interest from researchers or from industry (a toy of some freaks)
▪ required some top-down decisions to enforce unification
20
RDA is about global bridge building
RDA is about building the social and technical bridges that
enable global open sharing of data.
Researchers, scientists and data practitioners from around the world are invited to work together to achieve the vision.
Funders: NSF, EC, AU Gov, Japan, Brazil, DE?, UK?, ZA?, FI?, etc.
21
Thanks for your attention.
http://www.rd-alliance.org
http://europe.rd-alliance.org
22
Trends IV: research is changing
▪ see the Science 2.0 initiative of the EC
▪ the number of researchers is increasing enormously
▪ there is pressure in the direction of Grand Challenges and topics relevant for societies
▪ research is increasingly data-intensive
▪ border-crossing research is a fact (countries, disciplines)
▪ faster cycles (hypothesis – analysis – publications – reviews)
23
RDA is about global bridge building
[Diagram labels: bottom-up process; top-down process; uptake to come]
24
EUDAT Services
▪ EUDAT Box (B2DROP): Dropbox-like service, easy sharing, local syncing
▪ Semantic Anno (B2NOTE): checking, referencing and annotating
▪ Dynamic Data: immediate handling
▪ Generic Workflow: automating data processing


Editor's Notes

  • #10: Suzie Scientists want to be able to use other scientists’ datasets, they are willing to share their own data, and they feel it is appropriate to create new datasets from shared data.