How can repositories support the
text-mining of their content and
why?
@openminted_eu
Dr. Petr Knoth and Dr. Nancy Pontika
Knowledge Media Institute, The Open University
United Kingdom
Twitter: @oacore
Why should repositories
support TDM?
In the UK
Repositories and TDM
Institutional Repositories
Subject Repositories
Publishers/OA journals
Other sources: Research Networking Services, Primary Research Data...
Text Mining Services
How can repositories support the text mining of their content and why?
TDM & Repository
Managers
• Establish and maintain close collaboration with
researchers
• Extensive experience in advocacy, e.g. for open access
• Knowledgeable about the repository’s collection
• Participate in the academic institution’s research
committees
• Familiar with copyright issues and Creative Commons
Licenses
How can repositories
support TDM?
TDM is all about processing text and data at scale.
The role of repositories is to facilitate the aggregation
of research papers at a full-text level (and beyond)
effectively enabling TDM services to operate
seamlessly on all available research content.
What is the problem?
• A small study (Knoth, 2013)
• 83 repositories, mainly EPrints, with PDF research
outputs
• 1,461,016 metadata records
                     Metadata linked   Content        Content
                     to content        downloadable   machine readable
Mean                 54.1%             34.4%          27.6%
Median               39.5%             16.7%          13.0%
Standard deviation   39.2%             34.2%          31.0%
How is content aggregated
today?
• Dublin Core (DC) over OAI-PMH: used by the vast majority of
repositories, but never intended to support content harvesting.
The main problem: linking metadata with content.
“The nature of a resource identifier is outside the scope of the OAI-PMH.
To facilitate access to the resource associated with harvested
metadata, repositories should use an element in metadata records
to establish a linkage between the record (and the identifier of its
item) and the identifier (URL, URN, DOI, etc.) of the associated
resource. The mandatory Dublin Core format provides the identifier
element that should be used for this purpose.”
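
As a rough illustration of the linkage described in the quote above, the sketch below reads the dc:identifier values from an OAI-PMH response using only the Python standard library. The sample response, the repository host and the function name identifiers_by_record are invented for this example; only the three XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical ListRecords fragment; the repository and URLs are made up.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1234</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example paper</dc:title>
          <dc:identifier>https://repository.example.org/1234/1/paper.pdf</dc:identifier>
          <dc:identifier>https://doi.org/10.9999/example</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def identifiers_by_record(xml_text):
    """Map each record's OAI identifier to its list of dc:identifier values."""
    root = ET.fromstring(xml_text)
    result = {}
    for record in root.iterfind(".//oai:record", NS):
        oai_id = record.findtext("oai:header/oai:identifier", namespaces=NS)
        values = [
            element.text.strip()
            for element in record.iterfind(".//dc:identifier", NS)
            if element.text
        ]
        result[oai_id] = values
    return result


for oai_id, values in identifiers_by_record(SAMPLE_RESPONSE).items():
    print(oai_id, values)
```

Nothing in the metadata says which of the two identifiers leads to the full text, which is the linking problem named on this slide.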
How is content aggregated
today?
• RIOXX: allows just one identifier and recommends that it
point to the actual resource being described.
• OpenAIRE Guidelines: the identifier links to either the
resource or a jump-off page. Multiple identifiers are
allowed.
• ResourceSync
• CrossRef: commercial publishers/journals
The content referencing
problem
Principle 1: Content
referencing
Repositories should always establish a link from the
metadata record to the item the metadata record
describes, using a dereferenceable identifier pointing to
the version held locally in the repository. The
dereferenceable identifier should be provided in the
appropriate element of the metadata format in use
(e.g. dc:identifier in the case of Dublin Core).
If multiple identifiers are used, it is recommended to
list the local dereferenceable identifier first.
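
The "local identifier first" recommendation can be sketched as a small ordering step on the repository side. This is a minimal illustration, not part of any repository platform: the function names, the example host and the sample identifiers are all assumptions, and "local" is approximated here as "an http(s) URL on the repository's own host".

```python
from urllib.parse import urlparse


def is_local(identifier, repository_host):
    """True if the identifier is an http(s) URL on the repository's own host."""
    parsed = urlparse(identifier)
    return parsed.scheme in ("http", "https") and parsed.hostname == repository_host


def local_first(identifiers, repository_host):
    """Return the identifiers with locally held, dereferenceable ones first.

    sorted() is stable, so the original order is kept within each group.
    """
    return sorted(identifiers, key=lambda i: not is_local(i, repository_host))


# Hypothetical identifiers for one record.
record_identifiers = [
    "https://doi.org/10.9999/example",
    "urn:nbn:example-1234",
    "https://repository.example.org/1234/1/paper.pdf",
]
print(local_first(record_identifiers, "repository.example.org"))
```

A harvester that only reads the first dc:identifier would then reach the locally held copy directly.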
The accessibility of
repositories to harvesting
systems
Principle 2: Content
accessibility to machines
Repositories must provide machines with universal access,
at the same level of access as humans have. It is the role
of repositories to allow aggregators to harvest the entire
content of the repository in a reasonable time, so that
they can acquire and maintain up-to-date information
about the repository content.
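
To give a feel for what "a reasonable time" means, the arithmetic below estimates how long a sequential harvest takes under a fixed crawl delay. The 15-second delay is the figure mentioned for arXiv in the editor's notes; the collection size of one million documents is a purely hypothetical round number, and the function name is invented for this sketch.

```python
def harvest_days(n_documents, delay_seconds):
    """Days needed to fetch n_documents one by one with a fixed delay per request."""
    seconds_per_day = 86_400
    return n_documents * delay_seconds / seconds_per_day


# Hypothetical collection of one million documents, 15 s between requests.
print(round(harvest_days(1_000_000, 15), 1))  # 173.6
```

Under these assumptions a single full pass takes close to six months, before any re-harvesting to stay up to date.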
What can repositories
do?
• Ensure correct referencing of content from
metadata:
  • Dereferenceable link which resolves to content
  • Locally held (content under the repository's control)
• Using a standard repository platform can help
• Check robots.txt
• Register your repository
• Advocate for good PDF (media) quality of deposited content
• Use monitoring tools:
  • CORE Repository Dashboard
  • OpenAIRE Repository Manager Dashboard
• Machine-readable licensing
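
The "Check robots.txt" item can be tried out with Python's standard urllib.robotparser module. The robots.txt content, the harvester name and the URLs below are hypothetical; the sketch only shows how a repository manager might test whether a harvester is allowed to reach full-text files and what crawl delay it is asked to respect.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps every crawler out of the full-text directory.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 15
Disallow: /files/
"""


def harvest_policy(robots_txt, user_agent, url):
    """Return (allowed, crawl_delay) for one user agent and one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


allowed, delay = harvest_policy(
    SAMPLE_ROBOTS_TXT,
    "ExampleHarvester",  # hypothetical user agent name
    "https://repository.example.org/files/1234/paper.pdf",
)
print(allowed, delay)
```

In this made-up case the metadata pages are reachable but the PDFs are not, so a human reader would have more access than a machine.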
Beyond Open Access
Making sense of large volumes of scientific content
Interested in how to
TDM research papers?
We have 3 more talks tomorrow!
• Developer track 1, 11:00: Mining Open Access publications with CORE
• Developer track 1, 11:20: Oxford vs Cambridge Contest: Collecting Open Research Evaluation Metrics for University Ranking
• Papers 4, 4:00: Exploring Semantometrics: full text-based research evaluation for open repositories
Thank you
Dr. Petr Knoth, Research Fellow
petr.knoth@open.ac.uk
Dr. Nancy Pontika, Open Access
Aggregation Officer
nancy.pontika@open.ac.uk

Editor's Notes

  • #8: Mining individual repositories is not interesting. TDM is about processing at scale. The role of repositories is: …
  • #9: So why am I talking about what the role of the repositories is? Well I think we have a slight problem here … We have done a study to …
  • #10: The main problem: linking metadata with content.
  • #11: OpenAIRE guidelines: https://guidelines.openaire.eu/en/latest/literature/field_resourceidentifier.html The ideal use of this element is to use a direct link or a link to a jump-off page (persistent URL) from dc:identifier in the metadata record to the digital resource or a jump-off page.
  • #12: <dc:identifier> field: The aim of the Dublin Core Metadata tags is to ensure online interoperability of metadata standards. The importance of the <dc:identifier> tag is that it describes the resource of the harvested output. CORE expects to find the direct URL of the PDF in this field. When the information in this field is not presented properly, the CORE crawler needs to crawl for the PDF, and the success of finding it cannot be guaranteed. This also causes additional server processing time and bandwidth for both the harvester and the hosting institution. There are also three additional points that need to be considered with regard to the <dc:identifier>: a) this field should describe an absolute path to the file, b) it should contain an appropriate file name extension, for example “.pdf”, and c) the full-text items should be stored under the same repository domain.
  • #13: The problem is not multiple metadata formats, but the fact that none of them is good enough! Thinking that by supporting the guidelines you allow content aggregation is an issue. Locally means within the repository’s control.
  • #14: arXiv now has a slightly nicer robots.txt where anyone is allowed access with a 15s delay. Still not doable …
  • #16: Platform: For those who haven’t deployed a repository yet, it is highly advised not to build the repository platform in house, but to choose one of the industry-standard platforms. The benefits of choosing one of the existing platforms are that they provide frequent content updates and constant support, and extend repository functionality through plug-ins.
  • #17: Our ultimate goal is to put in place infrastructure that will enable anyone to make sense of large volumes of scientific data. The infrastructure is open and transparent.
  • #18, #19, #20: If you are interested in how we make sense of the large volumes of scientific content.
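
The three points on <dc:identifier> listed in note #12 (absolute path, file name extension, same repository domain) can be turned into a simple self-check. This is a minimal sketch under stated assumptions, not CORE's actual validation: the function name, the default extension list and the example host are invented, and "same repository domain" is simplified to an exact host match.

```python
import posixpath
from urllib.parse import urlparse


def check_identifier(identifier, repository_host, extensions=(".pdf",)):
    """Report which of the three <dc:identifier> points an identifier satisfies."""
    parsed = urlparse(identifier)
    extension = posixpath.splitext(parsed.path)[1].lower()
    return {
        # a) an absolute URL rather than a relative path
        "absolute_url": parsed.scheme in ("http", "https") and bool(parsed.netloc),
        # b) an appropriate file name extension, for example ".pdf"
        "has_file_extension": extension in extensions,
        # c) stored under the repository's own domain (exact host match here)
        "same_domain": parsed.hostname == repository_host,
    }


# Hypothetical identifiers checked against a hypothetical repository host.
for candidate in (
    "https://repository.example.org/1234/1/paper.pdf",
    "/1234/1/paper.pdf",
    "https://doi.org/10.9999/example",
):
    print(candidate, check_identifier(candidate, "repository.example.org"))
```

Only the first candidate passes all three checks; the relative path fails a) and c), and the DOI link fails b) and c).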