“Challenges in Extracting and
Managing References”
EXCITE Workshop 2017
Philipp Mayr
GESIS – Leibniz Institute for the Social Sciences
2017-03-30
http://west.uni-koblenz.de/en/research/excite/workshop-2017
#excitews2017
Who we are
• PIs: Steffen Staab (WeST), Philipp Mayr (GESIS)
• Researchers: Behnam Ghavimi, Martin Körner
• Collaborator: Heinrich Hartmann (Independent)
EXCITE: Background
• We run production search systems and conduct research in
information retrieval, recommendation systems and
knowledge discovery
  - Sowiport http://sowiport.gesis.org/
  - Related-work http://dev.related-work.net/
• Shortage of citation data for the international and
German social sciences
• Open availability of citation data is still very limited
  - Open Citation Corpus http://opencitations.net/
  - CitEc: Citations in Economics http://citec.repec.org/
EXCITE: Main objectives
• Narrow the supply gap of citation data in the social
sciences
• Improve the breadth and accuracy of current reference
extraction systems
• Develop a web service API to allow third parties to extract
citation data from arbitrary publications
• Integrate and publish the extracted citation data
General objectives
• Develop a toolchain of citation extraction and
matching software
• Tools and data will be made available to researchers
EXCITE: five steps
(1) extraction of text from source documents,
(2) identification of reference sections and other forms of embedded reference information within the text,
(3) segmentation of individual references into their constituent fields such as author, title, etc.,
(4) matching of reference strings against databases of bibliographic information,
(5) export of matched references to reusable formats
1st workshop
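The five steps can be illustrated with a minimal Python sketch. This is not the EXCITE toolchain: the sample text, the toy database, the regular expression and all function names are hypothetical, step (1) is assumed to have already produced plain text, and simple heuristics stand in for the real extraction and matching methods.

```python
import json
import re
from difflib import SequenceMatcher

# Step 1 (assumed done): plain text extracted from a source document. Invented sample.
DOCUMENT_TEXT = """Introduction
Citation data for the social sciences is scarce.

References
Doe, J. (2016). Sample title on citation data. Journal of Examples.
Roe, A. (2015). Another sample title. Example Proceedings.
"""

# Hypothetical bibliographic database used for matching in step 4.
DATABASE = [
    {"id": "rec-1", "year": "2016", "title": "Sample title on citation data"},
    {"id": "rec-2", "year": "2015", "title": "Another sample title"},
]

# Assumed citation style: "Author (Year). Title. Venue."
REFERENCE_PATTERN = re.compile(
    r"^(?P<author>.+?)\s+\((?P<year>\d{4})\)\.\s+(?P<title>.+?)\.\s+(?P<venue>.+?)\.?$"
)


def find_reference_strings(text):
    """Step 2: return the non-empty lines that follow a 'References' heading."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().lower() == "references":
            return [l.strip() for l in lines[i + 1:] if l.strip()]
    return []


def segment_reference(ref):
    """Step 3: split one reference string into fields (None if the pattern fails)."""
    m = REFERENCE_PATTERN.match(ref)
    return m.groupdict() if m else None


def match_reference(fields, database, threshold=0.9):
    """Step 4: return the id of the most similar record with the same year, or None."""
    best_id, best_score = None, 0.0
    for record in database:
        if record["year"] != fields["year"]:
            continue
        score = SequenceMatcher(
            None, fields["title"].lower(), record["title"].lower()
        ).ratio()
        if score > best_score:
            best_id, best_score = record["id"], score
    return best_id if best_score >= threshold else None


def run_pipeline(text, database):
    """Steps 2 to 4 chained; the returned list is ready for export in step 5."""
    results = []
    for ref in find_reference_strings(text):
        fields = segment_reference(ref)
        if fields is None:
            continue  # unparseable reference strings are skipped in this sketch
        fields["matched_id"] = match_reference(fields, database)
        results.append(fields)
    return results


if __name__ == "__main__":
    # Step 5: export matched references to a reusable format (JSON here).
    print(json.dumps(run_pipeline(DOCUMENT_TEXT, DATABASE), indent=2))
```

A single regular expression only covers one citation style; real reference strings vary widely, which is why segmentation and matching are treated as separate steps.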
EXCITE: Outcomes
• Open reference extraction tools for PDF
documents
• Open datasets of segmented references
• Web service API for citation extraction
• Assessment of the overall quality of the
extraction and matching pipeline
• Open gold standard testbed
• Improved infrastructures and services
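One common way to assess extraction and matching quality against a gold standard is precision, recall and F1 over the set of citation links. The Python sketch below assumes links are represented as (citing document, cited record) pairs; the identifiers and the `evaluate` function are hypothetical and not taken from the EXCITE project.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, F1) for predicted vs. gold citation links."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Invented example: four gold links, three predicted links, two of them correct.
gold_links = {
    ("doc-1", "rec-1"), ("doc-1", "rec-2"),
    ("doc-2", "rec-3"), ("doc-2", "rec-4"),
}
predicted_links = {("doc-1", "rec-1"), ("doc-1", "rec-2"), ("doc-2", "rec-9")}

if __name__ == "__main__":
    p, r, f = evaluate(predicted_links, gold_links)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```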
Workshop Agenda: Day 1
Time | Talk | Speaker(s)
12:20 | Information Extraction out of Born-Digital Scientific Articles | Roman Kern, TU Graz
12:40 | Advanced citation matching and large-scale full-text analysis | Nees Jan van Eck, Leiden U
13:00 | Lunch Break (Cafeteria) |
14:20 | APIs for third parties to extract and deposit output executions of automated extraction pipelines (via videoconferencing) | Min-Yen Kan, NU Singapore
14:40 | Extracting references from scientific articles in CERMINE system | Dominika Tkaczyk, U Warsaw
15:00 | Coffee Break (Cafeteria) |
15:30 | CitEc to CitEcCyr. A stab at distributed citation systems. (via videoconferencing) | Thomas Krichel, Open Library Society, NYC
15:50 | EXCITE project: Status report | Behnam Ghavimi, GESIS; Martin Körner, WeST; Heinrich Hartmann, Circonus
16:10 | Processing of in-text References: Towards a Semantic Analysis | Marc Bertin, U Toulouse
16:30 | Citations in Utopia Documents | David Thorne, U Manchester
16:50 | Coffee Break (Cafeteria) |
17:20 | Research around the Tagging System BibSonomy | Andreas Hotho, U Würzburg
17:50 | LOC-DB: A Linked Open Citation Database provided by Libraries. Motivation and Challenges. | Kai Eckert, HDM Stuttgart; Anne Lauscher, HDM Stuttgart; Akansha Bhardwaj, DFKI
18:20 | tbd (via videoconferencing) | Lee Giles, Penn State U
Workshop Agenda: Day 2
• Hands-on sessions
Time | Session
9:00 | Second Day Kickoff (Room: West II)
9:15 | Extraction Result Discussion Group; Gold Standard Discussion Group; Collaboration Discussion Group
11:15 | Coffee Break (Cafeteria)
11:30 | Extraction Result Discussion Group; Gold Standard Discussion Group; Collaboration Discussion Group
12:30 | Closing Talks (Room: West II)
13:00 | End
Thank you
Contact:
Dr Philipp Mayr
GESIS - Leibniz Institute for the Social Sciences, Germany
Email: philipp.mayr@gesis.org
Twitter: @philipp_mayr
• Workshop website: http://west.uni-koblenz.de/en/research/excite/workshop-2017
• GitHub: https://github.com/exciteproject/
