Franco Niccolucci & Achille Felicetti
(PIN, University of Florence, Italy)
EOSC-hub Week 2018
Malaga, 16/4/2018
EOSCpilot is a project funded by the EC H2020 programme
 Domain: Archaeology
 Goal: semantic enrichment of texts
 Archaeological documentation largely based on texts
◦ Excavation diaries, reports, surveys, grey literature
◦ Literary/historical sources, research articles, monographs …
◦ Huge number of small (<100 KB) files in different languages
 Registry of 2,000,000 archaeological datasets (70% texts) in ARIADNE
 ARIADNE’s data infrastructure popular among archaeologists
◦ ARIADNE users in 2016: 25-30% of the European research community
◦ Strong support by
 Professional associations (EAA, EAC) & national archaeological/cultural heritage authorities
 National research institutions (CNR, CNRS, CAS, ÖAW, KNAW, BAS, ATHENA RC, FORTH)
 International recognition (USA, Mexico, Japan, Argentina)
 Needed for cloud-based data infrastructure to be developed in ARIADNEplus
◦ Deeper integration between texts, databases, GIS etc.
◦ Advanced services & VREs for data-centric archaeological research
 Open-source NLP & NER engine
 Syntactic rules (tailored to specific writing style)
 Texts stating facts, not stories
◦ Data fuzziness, provenance, reliability, reasoning
 Domain ontology: CIDOC CRM (ISO 21127:2006)
◦ ... and not TEI
 Terminology
◦ Specialized vocabularies
 Terra sigillata is not just “sealed earth”
◦ Gazetteers for modern (Geonames) and ancient (Pleiades) place names
 Málaga (modern) vs Màlaka (Phoenician) vs Màlaca (Roman)
◦ Named time period management
 Bronze Age (∼ 3200-600 BC), Recent Orientalizing Period (∼ 630-570 BC)
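
As an illustration of the gazetteer and named-period requirements above, here is a minimal Python sketch; it is not the TextCrowd implementation. It folds diacritics so that spelling variants of a place resolve to one record, and it stores each named period with its approximate date range as given on the slide. The identifier `place:malaga`, the `Period` class and the function names are hypothetical; a real system would hold Geonames and Pleiades identifiers instead.

```python
from dataclasses import dataclass
from typing import Optional
import unicodedata


@dataclass(frozen=True)
class Period:
    label: str
    start_year: int  # negative values are years BC
    end_year: int


# Date ranges copied from the slide; both are approximate.
PERIODS = {
    "bronze age": Period("Bronze Age", -3200, -600),
    "recent orientalizing period": Period("Recent Orientalizing Period", -630, -570),
}

# Hypothetical local identifier shared by all name variants of one place.
PLACE_VARIANTS = {
    "malaga": "place:malaga",  # modern
    "malaka": "place:malaga",  # Phoenician
    "malaca": "place:malaga",  # Roman
}


def fold(text: str) -> str:
    """Strip diacritics and case so 'Málaga' and 'malaga' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold().strip()


def resolve_place(name: str) -> Optional[str]:
    """Return the shared place identifier, or None if the name is unknown."""
    return PLACE_VARIANTS.get(fold(name))


def resolve_period(name: str) -> Optional[Period]:
    """Return the named period record, or None if the name is unknown."""
    return PERIODS.get(fold(name))
```

With this table, `resolve_place("Màlaka")` and `resolve_place("Málaga")` return the same identifier, which is the behaviour a combined modern/ancient gazetteer is meant to provide.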
 Modular framework based on GATE toolchain: https://gate.ac.uk
◦ Advanced stemming/lemmatization components
 OpenNLP (https://opennlp.apache.org): sentence segmentation and part-of-speech (POS) tagging
 OpeNER (http://www.opener-project.eu): neural network for advanced named entity recognition (NER), developed in the OpeNER FP7 project
◦ Machine learning framework for automated training
 Annotated corpus required
 Ontology: CRMarcheo (CRM extension for archaeology)
 Vocabularies, gazetteers and terminological tools
◦ ICCD vocabularies for Italian archaeology, augmented with purpose-built term lists
◦ Geonames (modern places), Pleiades (historical places)
◦ Timespan and named period component based on PeriodO
 TextCrowd detects:
◦ Artefacts
◦ Colours
◦ Materials
◦ Time periods
◦ Persons
◦ Places
◦ Sites
◦ Time spans
◦ Techniques
 Target output formats:
◦ Textual documents automatically annotated and enriched
◦ CIDOC CRM semantic triples (RDF)
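
To make the RDF output format concrete, the following Python sketch turns a list of detected entities into N-Triples typed with CIDOC CRM classes. It is an assumption-laden illustration, not TextCrowd's actual serializer: the `Annotation` class, the `http://example.org/textcrowd/` base IRI and the mapping from entity kind to CRM class are all hypothetical choices made here, and kinds without an obvious class (such as colours and techniques) are simply skipped.

```python
from dataclasses import dataclass
from typing import Iterable, List

CRM = "http://www.cidoc-crm.org/cidoc-crm/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BASE = "http://example.org/textcrowd/"  # hypothetical base IRI

# Assumed mapping from detected entity kind to a CIDOC CRM class name.
CRM_CLASS = {
    "Artefact": "E22_Man-Made_Object",
    "Material": "E57_Material",
    "Person": "E21_Person",
    "Place": "E53_Place",
    "Site": "E27_Site",
    "Time span": "E52_Time-Span",
    "Time period": "E4_Period",
}


@dataclass(frozen=True)
class Annotation:
    kind: str   # e.g. "Material"
    text: str   # surface form found in the document
    start: int  # character offsets in the source text
    end: int


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_ntriples(doc_id: str, annotations: Iterable[Annotation]) -> List[str]:
    """Emit a type triple and a label triple for each mappable annotation.

    doc_id is assumed to contain only characters that are safe in an IRI.
    """
    lines = []
    for index, ann in enumerate(annotations):
        crm_class = CRM_CLASS.get(ann.kind)
        if crm_class is None:
            continue  # no assumed CRM class for this kind
        subject = f"<{BASE}{doc_id}/entity/{index}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{CRM}{crm_class}> .")
        lines.append(f'{subject} <{RDFS_LABEL}> "{escape_literal(ann.text)}" .')
    return lines
```

Writing plain N-Triples strings keeps the sketch dependency-free; a production pipeline would more likely use an RDF library and link entities to each other with CRM properties rather than emit isolated typed nodes.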
 No annotated text corpora available in Italian to be used as training data for
machine learning algorithms
◦ Manual annotation of 400 pages of Italian archaeology reports (< 1 Person-Month)
 Preparation and adaptation of vocabularies
 Availability of user-friendly cloud-based environments and of necessary tools, to
migrate standalone prototype to cloud
◦ Several cloud solutions tested in early development, limited support provided except in
D4Science
◦ Implementation in D4Science infrastructure, but portable to other cloud services if support and
required modules available
 Authentication and Authorization
◦ No access control to metadata/data implemented so far
◦ Demonstrator focused on freely accessible textual documents
◦ Source: Fasti Online (http://www.fastionline.org), an Open Access collection of archaeological reports
 Operated and maintained by CNR-ISTI on the D4Science platform
https://www.d4science.org
 Modular engine based on GATE toolchain + OpenNLP-OpeNER
modules, natively provided by D4Science
 Web-based user interface for
◦ User and access management
◦ Cloud storage (private and shared files)
◦ Results available for other Virtual Research Environments (VRE) within D4Science
 Released for open use, for tests & comments
 Deliberately plain interface, so that it can adapt to any look and feel
 Machine-readable results: RDF encoding produced
 Human-readable results: colour-coded text (for testing)
 Interoperability of extracted knowledge
◦ Semantic information in CRM format: full integration and interoperability with
other archaeological semantic data (to be fully implemented in ARIADNEplus)
 Supporting FAIR Principles implementation
◦ Metadata to be stored in various registries for easy findability and accessibility
◦ Results ready to be reused within the same environment or consumed by other
services and/or in different scenarios
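
The human-readable, colour-coded view mentioned above can be sketched as follows. This is only an illustration of the idea under stated assumptions: the `highlight` function, the `(start, end, kind)` tuple layout and the colour values are hypothetical and are not taken from the TextCrowd interface.

```python
import html
from typing import List, Tuple

# Arbitrary background colours per entity kind (assumed, not TextCrowd's palette).
COLOURS = {
    "Place": "#cfe8ff",
    "Material": "#ffe3b3",
    "Time period": "#d9f2d0",
}
DEFAULT_COLOUR = "#eeeeee"


def highlight(text: str, annotations: List[Tuple[int, int, str]]) -> str:
    """Wrap each annotated character span in a coloured HTML <span>.

    annotations holds (start, end, kind) tuples with character offsets.
    Spans that overlap an earlier span are skipped to keep the markup valid.
    """
    parts = []
    cursor = 0
    for start, end, kind in sorted(annotations):
        if start < cursor:
            continue  # overlaps the previous span
        parts.append(html.escape(text[cursor:start]))
        colour = COLOURS.get(kind, DEFAULT_COLOUR)
        parts.append(
            f'<span style="background:{colour}" title="{html.escape(kind)}">'
            f"{html.escape(text[start:end])}</span>"
        )
        cursor = end
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)
```

For example, `highlight("Bronze fibula from Malaga", [(0, 6, "Material"), (19, 25, "Place")])` marks "Bronze" and "Malaga" with different backgrounds and leaves the rest of the sentence untouched.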
 TEXTCROWD has proved useful for its main purpose: demonstrating
the importance and usefulness of EOSC for scientific research in the cultural
heritage domain
 Adoption by other research teams in the EOSCpilot framework
◦ Integration of TEXTCROWD with new VisualMedia Demonstrator: a service for
sharing and visualizing visual media files on the web - automatic metadata extraction
from controlled lists or textual documents for 2D and 3D models
 Testing on real use cases in progress
◦ Open Access papers of the Italian journal Archeologia e Calcolatori (ongoing)
 Clean visualization
 Language extension
◦ English, Dutch: from standalone to cloud-based (annotated corpora available)
◦ French, Spanish, German: new from scratch (annotated corpora to be prepared)
◦ Other EU languages: OpeNER extension required
 Additional work required to suit it to everyday use – but not too much
 TEXTCROWD Official Pages:
https://eoscpilot.eu/science-demos/textcrowd
https://textcrowd.d4science.org
 TEXTCROWD Pilot:
https://services.d4science.org/group/textcrowd/data-miner
(registration required)
1. Upload the file(s) to analyze
2. Launch TextCrowd
3. Select the file(s) to process
4. Collect the results
Franco Niccolucci: franco.niccolucci@gmail.com – Achille Felicetti: achille.felicetti@pin.unifi.it

Franco Niccolucci: Example of an EOSCpilot Science Demonstrator - TextCrowd
