Technical Challenges and
Approaches to build an
Open Ecosystem of
Heterogeneous Heritage
Collections
Ricard de la Vega
Natalia Torres
Albert Martínez
Agenda
1.Introduction
2.Technical Architecture
3.Challenges and Approaches
4.Lessons learned
5.Results and future developments
6.References
1. Introduction
Who are we?
All the code, specifications and documentation are available
under an open source MIT license on the GitHub Echoes
page: https://github.com/CSUC/ECHOES-Tools
Technological partner
1. Introduction
What is Echoes?
Echoes provides open, easy and innovative
access to digital cultural assets from different
institutions and is available in several languages.
Within a single and integrated platform, users have
access to a wide range of information on
archaeology, architecture, books, monuments,
people, photography, etc. This can be explored
using different criteria: concepts, digital objects,
people, places and time. The platform can be
installed for a region or a theme.
1. Introduction
What is Echoes?
Echoes has developed tools to analyze, clean and
transform data collections to the Europeana Data
Model (EDM), as well as tools to validate, enrich
and publish heterogeneous data to a normalized
data lake that can be exploited as linked open
data and used with different data visualizations.
1. Introduction
What is Echoes?
1. Introduction
An example of 1+1=3
Pilot with 3 different collections
‒ Archaeological Heritage
‒ Architectural Heritage
‒ Institutional repository
Roses
Port de la Selva
Vall de Boí
1. Introduction
An example of 1+1=3
1. Introduction
How does Echoes work?
1. Data collections
2. Data homogenization
3. Data storage
4. Data access
Agenda
1.Introduction
2.Technical Architecture
3.Challenges and Approaches
4.Lessons learned
5.Results and future developments
6.References
2. Technical architecture
Modular approach
1. Input (data collections)
2. Mapping and transformation tools (data homogenization)
3. Data lake (data storage)
4. Output
– SPARQL endpoint (RDF)
– Portal (WordPress)
– REST API, OAI-PMH
5. Enrichments
2. Technical architecture | Inputs
2. Technical architecture | Inputs | Examples
– ELO: 4K, 144K, 280K items (A2A)
– Tresoar: 21K, 36K, 2M items (A2A)
– Gencat: 1K (Custom)
[Bar chart: items per collection as of November 2018, ranging from 560 to 1,351,416]
2. Technical architecture | Data homogenization
2. Technical architecture | Data homogenization
Echoes is an interoperability project between
different data collections.
Integrating data is not just about putting it
together in a repository, but also about facilitating
access so the data can be properly exploited by
the public.
2. Technical architecture | Data homogenization
If garbage comes in, then garbage comes out
To simplify the reuse and visualization of the data,
all the records inserted into the system should
have the same structure and format.
There are two ways to ensure data coherence
and consistency (i.e. to clean & transform the data):
‒ A priori, before insertion into the system. This is
the approach taken here, given the complexity and
the high volume of the data
‒ A posteriori, in real time when the data is used
2. Technical architecture | Data homogenization
The homogenization pipeline has five steps; a sketch of the full chain follows below:
1. Analyze (optional): analyze content from a source to “know about” your data; items are downloaded from the source into local files
2. Transform: transform to EDM
3. Quality Assurance: review each item and, based on defined rules, decide whether it can be loaded into the Data Lake; delivers a quality report
4. Enrich (optional): enrich metadata from different sources
5. Publish: publish items into the Data Lake; only valid items can be loaded
Demo on https://youtu.be/LQSheaKJOiY
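To make the flow concrete, below is a minimal, self-contained Python sketch of how the five steps chain together. Every function body is a simplified stand-in invented for illustration, not the actual ECHOES-Tools implementation:

```python
# Minimal sketch of the five-step pipeline; all functions are
# simplified stand-ins for the real modules in CSUC/ECHOES-Tools.

def analyze(records):
    """Step 1 (optional): profile the data, e.g. count empty fields."""
    empty = sum(1 for r in records for v in r.values() if not v)
    return {"items": len(records), "empty_fields": empty}

def transform_to_edm(record):
    """Step 2: map source fields onto (a caricature of) EDM."""
    return {"dc:title": record.get("title", ""),
            "dcterms:spatial": record.get("place", "")}

def quality_assurance(records):
    """Step 3: apply the defined rules; here, dc:title is mandatory."""
    valid = [r for r in records if r["dc:title"]]
    rejected = [r for r in records if not r["dc:title"]]
    return valid, rejected

def enrich(record):
    """Step 4 (optional): add metadata, e.g. coordinates for the place."""
    record["wgs84_pos:lat"] = None  # looked up in GeoNames in a real run
    return record

def publish(records):
    """Step 5: only valid items are loaded into the Data Lake."""
    print(f"published {len(records)} items")

source = [{"title": "Church of Sant Climent", "place": "Vall de Boí"},
          {"title": "", "place": "Roses"}]
print(analyze(source))                                   # quality report
valid, rejected = quality_assurance([transform_to_edm(r) for r in source])
publish([enrich(r) for r in valid])
```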
2. Technical architecture | Data homogenization
‒ Gives feedback on the data properties
‒ Useful for getting to know the contents of the data,
especially if you didn’t create the dataset
‒ Makes it possible to determine the usefulness of
the data when you want to enrich it.
Ex. If there are no places in the dataset, enrichment
with coordinates is impossible
2. Technical architecture | Data homogenization
The ECHOES Analyze module:
‒ Accepts data as: a URL (harvested with the Open Archives Initiative Protocol for Metadata Harvesting, OAI-PMH) or an uploaded file
‒ Supports: A2A, Dublin Core, TopX, EAD, CARARE and custom schemas
‒ Delivers: an analysis report in XML*
* An XML file can be easily imported into your favorite reporting tool.
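As an illustration of the analyze step, here is a small Python sketch that harvests a source over OAI-PMH and profiles its metadata fields. It uses the third-party sickle library; the endpoint URL is a placeholder, and the real module delivers its report as XML rather than printing it:

```python
from collections import Counter
from sickle import Sickle  # pip install sickle

# Placeholder endpoint: replace with a real OAI-PMH base URL.
sickle = Sickle("https://example.org/oai")
counts, blanks = Counter(), Counter()

for record in sickle.ListRecords(metadataPrefix="oai_dc"):
    for field, values in record.metadata.items():
        counts[field] += len(values)
        blanks[field] += sum(1 for v in values if not v or not v.strip())

# A crude version of the Analyze report: instances of each metadata
# field, and how many of those instances are blank.
for field in sorted(counts):
    print(f"{field}: {counts[field]} instances, {blanks[field]} blank")
```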
2. Technical architecture | Data homogenization
The ECHOES Transform module:
‒ Accepts data as: a URL (harvested with OAI-PMH) or an uploaded file
‒ Supports: A2A, Dublin Core, TopX, EAD, CARARE and custom schemas
‒ Delivers: your EDM dataset
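A minimal Python sketch of the idea behind the transformation, assuming a simplified Dublin Core input; this is not the project’s actual mapping, which covers all the schemas listed above:

```python
from lxml import etree  # pip install lxml

DC = "http://purl.org/dc/elements/1.1/"
EDM = "http://www.europeana.eu/schemas/edm/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def dc_to_edm(dc_xml, item_uri):
    """Map a (simplified) Dublin Core record onto a skeletal edm:ProvidedCHO."""
    src = etree.fromstring(dc_xml)
    cho = etree.Element(f"{{{EDM}}}ProvidedCHO",
                        nsmap={"edm": EDM, "dc": DC, "rdf": RDF})
    cho.set(f"{{{RDF}}}about", item_uri)
    # EDM reuses Dublin Core elements, so they can be copied over as-is.
    for el in src.findall(f"{{{DC}}}title") + src.findall(f"{{{DC}}}subject"):
        cho.append(el)
    return etree.tostring(cho, pretty_print=True).decode()

sample = (f'<record xmlns:dc="{DC}">'
          '<dc:title>Sant Pere de Rodes</dc:title>'
          '<dc:subject>monastery</dc:subject></record>')
print(dc_to_edm(sample, "http://example.org/item/1"))
```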
Quality Assurance Module
2. Technical architecture | Data homogenization
The review runs in three stages:
1. Schema. Review: tags, mandatory fields. Results: OK → next step; Error → stop
2. Semantics. Review: Schematron rules. Results: OK → next step; Error → stop
3. Content. Review: metadata fields based on configurable specs. Results: OK → valid item; Error → stop; Warning → partially valid; Info → valid item
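A toy Python illustration of the three-stage review, under invented rules (the real module reads configurable specs and Schematron files):

```python
# Invented rules for illustration only.
MANDATORY = ("dc:title", "edm:type")                       # stage 1: schema
ALLOWED_TYPES = {"IMAGE", "TEXT", "SOUND", "VIDEO", "3D"}  # stage 2: semantics

def validate(item):
    # 1. Schema: tags and mandatory fields; an error stops the item.
    for field in MANDATORY:
        if field not in item:
            return "Error", f"missing mandatory field {field}"
    # 2. Semantics: stand-in for a Schematron rule.
    if item["edm:type"] not in ALLOWED_TYPES:
        return "Error", f"invalid edm:type {item['edm:type']!r}"
    # 3. Content: configurable checks can downgrade to Warning or Info.
    if not item.get("dcterms:spatial"):
        return "Warning", "no place given, item is only partially valid"
    return "OK", "valid item"

items = [{"dc:title": "Map of Roses", "edm:type": "IMAGE"},
         {"dc:title": "Census 1880", "edm:type": "TEXT",
          "dcterms:spatial": "Leiden"},
         {"edm:type": "IMAGE"}]
for item in items:
    print(validate(item))
```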
2. Technical architecture | Data lake
‒ Blazegraph™ DB is an ultra-high-performance
graph database supporting the Blueprints and
RDF/SPARQL APIs
‒ Ex. it powers the Wikimedia Foundation’s Wikidata Query Service
‒ https://github.com/blazegraph/database
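Once the data lake is populated, it can be queried over SPARQL. A sketch using the Python SPARQLWrapper library against Blazegraph’s default local endpoint; the URL and the exact graph shape are assumptions:

```python
from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

# Blazegraph's default local endpoint; adjust host and namespace as needed.
endpoint = SPARQLWrapper("http://localhost:9999/blazegraph/namespace/kb/sparql")
endpoint.setQuery("""
    PREFIX dc:  <http://purl.org/dc/elements/1.1/>
    PREFIX edm: <http://www.europeana.eu/schemas/edm/>
    SELECT ?item ?title WHERE {
        ?item a edm:ProvidedCHO ;
              dc:title ?title .
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["item"]["value"], "|", row["title"]["value"])
```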
2. Technical architecture | Outputs
https://echoes.community
[Screenshot slides showing the different outputs]
The theory is sound, but many challenges
remain to tackle…
Agenda
1.Introduction
2.Technical Architecture
3.Challenges and Approaches
4.Lessons learned
5.Results and future developments
6.References
3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
3.1. Challenges and Approaches | Different metadata schemas
‒ Different collections can have different metadata
schemas…
‒ Dublin Core (DC), A2A, EAD, Custom…
3.1. Challenges and Approaches | Different metadata schemas
‒ We needed a single metadata standard to map
all datasets to
‒ We chose the Europeana Data Model (EDM)
‒ Transformation module: mapping to EDM from DC,
A2A, EAD, TopX, custom metadata and CARARE
‒ The transformation tool is easily extensible to other
formats: if someone needs a format that is not on the
list, they can create their own EDM mapping (and
contribute it to the community)
3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
3.2. Challenges and Approaches | Poor data quality
‒ Sometimes the data quality is not as good as we
would like it to be…
‒ This poor quality limits the exploitation of the data
‒ For example:
‒ A single field mixing different geolocation levels:
Bussum (municipality), Chicago (city), China (country)
‒ The same with dates (day and time, year, centuries…)
‒ Misspellings (Lide4n, Leideb, Lidedn, Leiden…; see
the matching sketch after the module list below)
3.2. Challenges and Approaches | Poor data quality
3 modules have been developed:
‒ Analyze focuses on data profiling
Ex. blank cells, number of instances of each metadata field…
‒ Quality assurance validates the input data
Ex. empty mandatory field, place without coordinates…
‒ Enrich completes some metadata
Ex. coordinates (to show on a map) derived from a textual location
All the modules can be easily extended with new rules,
statistics, checks and enrichments.
Quality reports can be used to improve the original data sets.
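A minimal sketch of the misspelling case from the previous slide, matching raw values against a known list with Python’s standard difflib. The gazetteer here is invented; a real run would use an authority list such as GeoNames:

```python
import difflib

# Invented toy gazetteer; in practice an authority list (e.g. GeoNames).
GAZETTEER = ["Leiden", "Leeuwarden", "Roses", "Vall de Boí"]

def suggest_place(value, cutoff=0.75):
    """Suggest a canonical place name for a possibly misspelled value."""
    matches = difflib.get_close_matches(value, GAZETTEER, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for raw in ["Lide4n", "Leideb", "Lidedn", "Leiden"]:
    print(raw, "->", suggest_place(raw))   # all map to "Leiden"
```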
3.2. Challenges and Approaches | Poor data quality
[Diagram: three example collections after quality assurance: one with 6 items OK; one with 4 OK, 1 warning and 1 error (some metadata fields not included); one with 6 errors (items not included)]
3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
3.3 Challenges and Approaches | Data deduplication
3.3 Challenges and Approaches | Data deduplication
‒ Deduplication is easy if the items have
identifying metadata.
‒ If not, different similarity and distance metrics
(Levenshtein, Jaro-Winkler…) can be used to find
duplicates with the Duke tool.
‒ Useful for getting a single value for places, dates…
3.3 Challenges and Approaches | Data deduplication
‒ Ex. items from different Gencat and DIBA
collections (with an id in the metadata); a matching
sketch follows below
‒ Match done using a custom identifier, BCIN or BCIL
(local register identifiers for cultural assets)
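A simplified Python stand-in for the matching idea: merge items that share an identifier and, failing that, items whose names are nearly identical. The project itself uses Duke with Levenshtein/Jaro-Winkler metrics; the records below are invented:

```python
import difflib

def dedupe(items, threshold=0.9):
    """Keep one merged record per real-world entity."""
    merged = []
    for item in items:
        for kept in merged:
            same_id = bool(item.get("id")) and item["id"] == kept.get("id")
            similar = difflib.SequenceMatcher(
                None, item["name"].lower(), kept["name"].lower()
            ).ratio() >= threshold
            if same_id or similar:
                # Merge: keep existing values, add only the new fields.
                kept.update({k: v for k, v in item.items() if k not in kept})
                break
        else:
            merged.append(dict(item))
    return merged

records = [{"id": "BCIN-1234", "name": "Castell de la Trinitat"},
           {"id": "BCIN-1234", "name": "Castell de la Trinitat", "town": "Roses"},
           {"name": "castell de la trinitat"}]
print(dedupe(records))  # a single merged record
```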
3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
3.4 Challenges and Approaches | Automatic enrichments
‒ Which fields are candidates for enrichment?
We started with geolocations: A2A collections have
a location but no coordinates, which are necessary
to visualize the data on a map.
If the enrichment is mandatory, e.g. for proper
presentation on a map, it is done automatically in the
last step of the quality assurance module;
if the enrichment is ‘nice to have’, it can be configured
in the enrich module.
‒ Use existing or new metadata?
Extend the metadata schema to insert the enrichment
(without modifying the original metadata), as the table below shows.
3.4 Challenges and Approaches | Automatic enrichments
EDM metadata, automatic enrichment (source and metadata) and manual enrichment metadata:
‒ wgs84_pos:lat. Automatic: Geonames (geonames:lat). Manual: user:lat
‒ wgs84_pos:long. Automatic: Geonames (geonames:long). Manual: user:long
‒ skos:prefLabel. Automatic: Geonames (geonames:alternateName, geonames:coloquialName, geonames:historicalName, geonames:officialName, geonames:name); DBPedia (foaf:name, rdfs:label, owl:sameAs); Getty (TGN) (rdfs:label, skos:prefLabel, skos:altLabel). Manual: user:prefLabel
‒ skos:altLabel. Automatic: Getty (TGN) (rdfs:label, skos:prefLabel, skos:altLabel). Manual: user:altLabel
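A sketch of the GeoNames lookup behind the first two rows, using the public search API. A free GeoNames account is needed; the 'demo' username is heavily rate-limited, and the record shape is simplified:

```python
import requests

def geonames_coordinates(place, username="demo"):
    """Fetch lat/long for a place name from the GeoNames search API."""
    resp = requests.get("http://api.geonames.org/searchJSON",
                        params={"q": place, "maxRows": 1, "username": username},
                        timeout=10)
    hits = resp.json().get("geonames", [])
    if not hits:
        return None
    return {"wgs84_pos:lat": hits[0]["lat"], "wgs84_pos:long": hits[0]["lng"]}

# The enrichment extends the record; original fields stay untouched.
record = {"dcterms:spatial": "Vall de Boí"}
coords = geonames_coordinates(record["dcterms:spatial"])
if coords:
    record.update(coords)
print(record)
```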
3.4 Challenges and Approaches | Automatic enrichments
‒ Data visualization of an A2A collection (without
original coordinates) on a map
3.4 Challenges and Approaches | Automatic enrichments
‒ Another challenge; some third-party API's have
usage limitations like:
‒ Limit number of connections
‒ Premium options (€)
‒ One approach is to download a part or the total API
(cache), if is it possible…
 MaxResults:10.000
 MaxQueryExecutionTime = 120’
 MaxQueryCostEstimationTime = 1500’
 Connection limit = 50
 maximum request rate = 100
 Daily create a downloaded large
worldwide text file and offers a REST
API.
 Limitation 20’000 results, the hourly
limit is 1000 credits.
 Premium subscription
 Enpoint refreshed monthly.
 Webservice
 No information about limitations
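A minimal sketch of the caching approach mentioned above: store each remote answer in a local file so repeated harvests never re-hit the rate-limited API. The file name and service are illustrative:

```python
import json
import os

import requests

CACHE_FILE = "geonames_cache.json"  # illustrative local cache file

def cached_lookup(place, username="demo"):
    """Return the GeoNames answer for a place, hitting the API only once."""
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as fh:
            cache = json.load(fh)
    if place not in cache:  # remote call only on a cache miss
        resp = requests.get("http://api.geonames.org/searchJSON",
                            params={"q": place, "maxRows": 1,
                                    "username": username},
                            timeout=10)
        cache[place] = resp.json().get("geonames", [])
        with open(CACHE_FILE, "w") as fh:
            json.dump(cache, fh)
    return cache[place]

print(cached_lookup("Leiden"))   # second call reads from disk
```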
3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
3.5. Challenges and Approaches | Too much data
What's this?
a) A flower?
b) A black hole?
c) A (not user-friendly)
data visualization of a
450K-node graph?
3.5. Challenges and Approaches | Too much data
‒ Divide-and-conquer strategy. Pick a focus point
and let the system compute the “optimal” relevant
context given the user’s current interests; a sketch
follows below.
‒ Don’t aim to explore the whole database; focus
on specific domains. Ex. different visualization
tools are developed based on the type of
information to be displayed (maps, timespans,
graphs, etc.)
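A sketch of the focus-point idea with networkx: instead of drawing the whole graph, extract only the neighbourhood of the node the user cares about. The toy graph below is invented:

```python
import networkx as nx  # pip install networkx

# Invented toy graph standing in for the 450K-node data lake graph.
G = nx.Graph()
G.add_edges_from([("Leiden", "person:A"), ("person:A", "record:1"),
                  ("record:1", "person:B"), ("person:B", "record:2"),
                  ("Roses", "record:3")])

# Compute the "relevant context": everything within two hops of the focus.
focus = "Leiden"
context = nx.ego_graph(G, focus, radius=2)
print(sorted(context.nodes()))  # only this subgraph gets visualized
```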
3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
3.6. Challenges and Approaches | Easy SPARQL queries
‒ All the data is accessible in RDF format via a
linked open data endpoint.
‒ A user-friendly interface (YASGUI) is integrated
to access this data.
‒ User-friendly? Only if you know the SPARQL
language and the database structure. So we found:
3.6. Challenges and Approaches | Easy SPARQL queries
A visual SPARQL query system that lets you drag
database elements onto a canvas and ‘build’ your
query (Visual SPARQL Builder)
3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
3.7. Challenges and Approaches | Different scope
‒ From small institutions to regional thematic installations: one
size fits all?
‒ The technology is scalable, so it covers many different
scenarios. Performance tests have been designed to test
behaviour with large collections.
‒ The modular approach enables (smaller) institutions to ‘mix and
match’ modules; for example, using only the ECHOES transformation
module to transform one collection to EDM or linked open data
3. Challenges and Approaches
1. Different metadata schemas
2. Poor data quality
3. Data deduplication
4. Automatic enrichments
5. Too much data
6. Easy SPARQL queries
7. Different scope
8. User enrichments
3.8 Challenges and Approaches | User enrichments
One of the objectives of the project is giving
users the possibility to enrich the content.
Not initiated yet…
Agenda
1.Introduction
2.Technical Architecture
3.Challenges and Approaches
4.Lessons learned
5.Results and future developments
6.References
4. Lessons Learned
Some decisions that we would take again:
‒ Use of an agile methodology. Flexible to changes (e.g.
the focus on data quality). Team collaboration across
iterations aligns everyone in the same direction.
‒ A multidisciplinary team brings different points of view
to solve the challenges; so do teams from different countries.
‒ Start from the beginning: focus on the input data
before the enrichments.
‒ Learning by doing: the best way to know if it works is
to test it.
Agenda
1.Introduction
2.Technical Architecture
3.Challenges and Approaches
4.Lessons learned
5.Results and future developments
6.References
5. Results and future development
After 2.5 years we have done…
‒ 27 one-month iteration sprints (‘cookies meetings’)
‒ 7 releases
‒ 1 modular product, version 1.5
‒ 1 open source community
(benevolent-dictator-for-life model)
All the code, specifications and documentation are available under an
open source MIT license on the GitHub Echoes page:
https://github.com/CSUC/ECHOES-Tools
5. Results and future development
The developed tools allow you to analyze, clean and
transform data collections to the EDM standard, and
to validate, enrich and publish heterogeneous data to a
normalized data lake that can be exploited as linked
open data and with different data visualizations.
5. Results and future development
Demo corner
- https://youtu.be/LQSheaKJOiY (Echoes Tool)
- https://youtu.be/LddOAUc9tig (End Point)
- https://youtu.be/bb3Sxyyx8aA (End Point)
- https://youtu.be/oa7aY6p4o5Y (Echoes Portal)
5. Results and future development
The current development priorities are:
‒ Improving the data source mapping and
transformation tools
‒ Focusing on the enrichments
‒ Growing the community with more users of
the platform
Join us!
Agenda
1.Introduction
2.Technical Architecture
3.Challenges and Approaches
4.Lessons learned
5.Results and future developments
6.References
6. References
• Ariela Netiv & Walther Hasselo, “ECHOES: cooperation across heritage disciplines,
institutes and borders” (IS&T, Washington, 2018), pp. 70-74
• Lluís M. Anglada, Sandra Reoyo, Ramon Ros & Ricard de la Vega, “Doing it
together: spreading ORCID among Catalan universities and researchers”
(ORCID-CASRAI Joint Conference, Barcelona, 2015)
• Anisa Rula, Andrea Maurino & Carlo Batini, “Data Quality Issues in Linked Open
Data” (in the Data-Centric Systems and Applications book series, DCSA, 2016)
• Europeana Data Model (EDM): https://pro.europeana.eu/resources/standardization-tools/edm-documentation
• Duke, a tool to find duplicates: https://github.com/larsga/Duke
• Frank van Ham & Adam Perer, “Search, Show Context, Expand on Demand:
Supporting Large Graph Exploration with Degree-of-Interest” (InfoVis 2009):
http://perer.org/papers/adamPerer-DOIGraphs-InfoVis2009.pdf
Contact
Walther Hasselo
w.hasselo@erfgoedleiden.nl
Anna Busom
abusom@gencat.cat
Olav Kwakman
Olav.kwakman@tresoar.nl
Thanks for your attention
Development
team
Ricard de la Vega | ricard.delavega@csuc.cat
Natalia Torres | natalia.torres@csuc.cat
Albert Martínez | albert.martinez@csuc.cat