COMSODE tools
Pushing data to the open ecosystem
Jindřich Mynarz
EEA.sk
ELAG 2015 Stockholm
June 9, 2015
The gist of the talk
To save legacy library data and satisfy
internal and external requirements on your
data you need ETL.
“Libraries have to focus on making their data
infrastructure more efficient if they want to
keep up with the ever changing needs of their
audience and invest in sustainable service
development.” — Lukas Koster (source)
Building tools to publish & reuse open data
EU FP7 project (2013➝2015)
Project partners:
● University of Milano-Bicocca,
Italy
● Charles University in Prague,
Czech Republic
● EEA, Czech Republic and
Slovakia
● ADDSEN, Slovakia
● Spinque, the Netherlands
● Ministry of Interior of the
Slovak Republic
Legacy library data
Save the data?
● …or let it go?
● What’s the cost of recovering the legacy?
● To save legacy data you need automation
⇒ ETL
● Unfortunately, paraphrasing Tolstoy, “tidy
datasets are all alike but every messy dataset
is messy in its own way.” (source)
Confusion of tongues
● MARC used to be (or
still is?) the lingua
franca. What's next?
● Many data formats
need to be
supported
● MARC→Web
impedance
mismatch
● Export & import
in systems
integration
Open Data Node
“(Linked) open data plumbing”
● Open Data Node (ODN) is a platform for
publishing (open) data & automating
internal data flows that enables
progressive enhancement of data.
● Main product of the COMSODE project
● Free, open source, modular, integrated (e.g.,
single sign-on)
Open Data Node networks
● Data replication (e.g.,
local copy of name
authority file)
● Data synchronization (e.g.,
periodic harvesting of
incremental updates via
OAI-PMH)
● Data distribution (e.g.,
shared cataloguing)
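As an illustration of the synchronization case, here is a minimal Python sketch of incremental harvesting over OAI-PMH. It only builds request URLs and parses one response page; it makes no network calls. The function names, the base URL, and the sample response are assumptions made for this example and are not part of ODN. The `oai_dc` prefix is used because every OAI-PMH repository must support it; a library would more likely ask for a MARC-based prefix if the repository offers one.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, since=None, metadata_prefix="oai_dc", token=None):
    """Build a ListRecords request URL for an incremental harvest."""
    if token:
        # resumptionToken is an exclusive argument: it replaces all the others.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if since:
            params["from"] = since  # only records changed since this date
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.iter(OAI_NS + "identifier")]
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Hypothetical response page, shortened to the parts the parser reads.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_page(SAMPLE_PAGE)
next_url = list_records_url("http://example.org/oai", token=token)
```

A scheduled job would remember the date of its last successful run, pass it as `since`, and keep requesting pages until no resumption token comes back.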
Open Data Node workflow
1. Catalogue your internal data
2. Create a data processing pipeline for the
datasets to be published
3. Schedule the pipeline to be run to publish
the data
Internal catalogue
● Map out the data you have or external
data you use; both open and closed.
● If data cannot be found, it is as if it did
not exist, so make data discoverable and
provide it with descriptive metadata
(DCAT-AP).
● Based on CKAN.
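To make "descriptive metadata" concrete, the following Python sketch builds a small dataset description as JSON-LD using DCAT and Dublin Core terms. It covers only a handful of properties (title, description, one distribution with an access URL and an optional licence) and should not be read as a complete or validated DCAT-AP record. The function name and all example.org identifiers are hypothetical.

```python
import json

CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def describe_dataset(uri, title, description, access_url, license_uri=None):
    """Return a JSON-LD dict describing one dataset with one distribution."""
    distribution = {
        "@type": "dcat:Distribution",
        "dcat:accessURL": {"@id": access_url},
    }
    if license_uri:
        distribution["dct:license"] = {"@id": license_uri}
    return {
        "@context": CONTEXT,
        "@id": uri,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        "dcat:distribution": [distribution],
    }


# Hypothetical identifiers, for illustration only.
record = describe_dataset(
    uri="http://example.org/dataset/name-authority",
    title="Name authority file",
    description="Local copy of the name authority file.",
    access_url="http://example.org/dumps/authority.mrc",
    license_uri="http://creativecommons.org/publicdomain/zero/1.0/",
)
print(json.dumps(record, indent=2))
```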
UnifiedViews
● An extensible ETL tool with native RDF
support for automating repetitive data
exchange and transformation tasks.
● Allows you to define, execute, monitor,
debug (examine intermediate data),
schedule, and share (import/export) data
transformations.
● Open source, dual-licensed to enable
commercial extensions
Extract-Transform-Load pipeline
Data flow of an ETL process in UnifiedViews
is defined as a pipeline composed of data
processing units.
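The idea of a pipeline composed of processing units can be sketched in a few lines of Python. This is a toy model under simplifying assumptions: the pipeline is strictly linear and each unit is a plain function that takes the previous unit's output. It is not the UnifiedViews API, and all names here are made up for the example.

```python
from functools import reduce


def run_pipeline(units, data=None):
    """Pass data through each processing unit in order and return the result."""
    return reduce(lambda current, unit: unit(current), units, data)


# Three toy units standing in for an extractor, a transformer, and a loader.
def extract(_):
    return ["  Tolstoy, Leo ", "Austen, Jane"]


def transform(rows):
    return [row.strip() for row in rows]


def make_loader(target):
    def load(rows):
        target.extend(rows)  # "load" into an in-memory list
        return len(rows)
    return load


store = []
loaded = run_pipeline([extract, transform, make_loader(store)])
```

Because every unit has the same shape, units can be reordered, swapped, or inspected between steps, which is what makes debugging intermediate data practical.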
Data processing units
Extractors
● Download
file
● Load from
SQL
database
● SPARQL
endpoint
extractor
Transformers
● Zip/unzip
● Find/replace
● Parse and
serialize RDF
● SPARQL
Update
● XSLT
● ISO 2709 to
MARCXML
● SPARQL
SELECT to
CSV
Loaders
● Files upload
● Load to
Virtuoso
● Load to SQL
database
+ Quality
Assessment
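As an example of what one such transformer does, here is a minimal Python sketch of the "SPARQL SELECT to CSV" step. It assumes the query results arrive in the W3C SPARQL JSON results format and converts them with the standard library only. It illustrates the transformation, not the code of the actual data processing unit; the function name and sample data are hypothetical.

```python
import csv
import io
import json


def sparql_json_to_csv(results_json):
    """Convert a SPARQL SELECT result in the W3C JSON format to CSV text."""
    results = json.loads(results_json)
    variables = results["head"]["vars"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(variables)
    for binding in results["results"]["bindings"]:
        # Unbound variables are absent from a binding; emit an empty cell.
        writer.writerow([binding.get(var, {}).get("value", "") for var in variables])
    return out.getvalue()


# Hypothetical query result with one unbound title.
SAMPLE_RESULT = json.dumps({
    "head": {"vars": ["book", "title"]},
    "results": {"bindings": [
        {"book": {"type": "uri", "value": "http://example.org/book/1"},
         "title": {"type": "literal", "value": "War and Peace"}},
        {"book": {"type": "uri", "value": "http://example.org/book/2"}},
    ]},
})

csv_text = sparql_json_to_csv(SAMPLE_RESULT)
```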
Public catalogue
● Public interface that enables users to
discover & access your data.
● Links to data dumps, APIs (REST API,
SPARQL endpoint), and applications
based on the data.
● Provides metadata, such as licence,
dataset maintainer’s contact, or last
update date.
● Based on CKAN.
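Since the public catalogue is based on CKAN, a client could plausibly discover datasets through the CKAN Action API. The Python sketch below builds a `package_search` request URL and summarizes a response. It assumes the catalogue exposes the standard CKAN API at its root; the catalogue URL, function names, and sample response are hypothetical, and no network request is made.

```python
import json
from urllib.parse import urlencode


def package_search_url(catalogue_url, query, rows=10):
    """Build a CKAN Action API URL for a full-text dataset search."""
    params = urlencode({"q": query, "rows": rows})
    return catalogue_url.rstrip("/") + "/api/3/action/package_search?" + params


def summarize(response_text):
    """Extract name, licence, and download links from a package_search response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [
        {
            "name": dataset["name"],
            "licence": dataset.get("license_id"),
            "links": [res["url"] for res in dataset.get("resources", [])],
        }
        for dataset in payload["result"]["results"]
    ]


# Hypothetical response, shortened to the fields used above.
SAMPLE_RESPONSE = json.dumps({
    "success": True,
    "result": {"count": 1, "results": [{
        "name": "name-authority",
        "license_id": "cc-zero",
        "resources": [{"url": "http://example.org/dumps/authority.mrc"}],
    }]},
})

url = package_search_url("http://example.org/", "authority")
datasets = summarize(SAMPLE_RESPONSE)
```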
COMSODE methodology
● Guidelines on how to use ODN for those
with little open data experience
● Defines phases, practices, roles, and
artifacts.
● Phases:
a. Development of open data publication plan
b. Preparation of publication
c. Realization of publication
d. Archiving
http://opendatanode.org/product/methodology-for-od-publishing
Open Data Node in use
● Reality check
○ Eating our own dog food
○ Testing the ODN’s versatility
● 150 datasets transformed
by COMSODE partners
● Supporting 10 pilot projects, including:
○ eDemokracia: Slovak nation-wide e-government
project
○ Czech Trade Inspection Authority
○ Slovak Environment Agency
○ Slovak National Library
Slovak National Library
COMSODE pilot
Demo time!
Impact
● Improve your internal & external data
flows.
● Libraries are required to publish data by
the EU directive on the re-use of public
sector information.
○ If you release MARC, is the cost of access to the
data marginal?
● Insiders have access, yet outsiders often
have more experience to build value
upon the data.
In conclusion
♫ The pipelines, the pipelines are calling... ♫
To save legacy library data and satisfy internal and
external requirements on your data you need ETL.
http://opendatanode.org
Image credits from the Noun Project:
Database by Dmitry Baranovskiy, Counter by Sergey
Demushkin, Ventil by Sergey Demushkin, Spider
Web by Denis, Scroll by EliRatus, Chest by Victor
Escorsin, Pipes by Christopher T. Howlett, Adoption
by Luis Prado, Plumber by Luis Prado, Filter by
Muneer A.Safiah, Lock by Alex Auda Samora, Lego
by Jon Trillana, Atom by Mister Pixel
