Literature-Data Integration in the Life Sciences
Lisbon, Oct 2nd 2012
Publications and Data Sources
Europe PubMed Central


26 million abstracts
2.3 million full text articles
 • Citation networks
 • Database links
 • Text-mining
[Timeline: 2006, 2011, 2012, 2016?]
How many open access articles in UKPMC?
[Chart: article counts by publication date; the x-axis year labels were truncated in extraction]
 • PubMed (995K)
 • UKPMC (18%, 182K)
 • OA (9.6%, 96K)
Total: 489,000 OA articles
 • Big data
 • Thematic data
 • Public data
 • Archived data
 • Two petabytes of data
 • Scales to 7 PB of raw disk
 • Majority is DNA

Figure 2. Growth of key resources, 2001 to 2010. Panels (y-axis in parentheses): European Nucleotide Archive (nucleotides, millions); Ensembl and Ensembl Genomes (genomes); UniProt (entries); InterPro (entries); ArrayExpress (hybridisations); PDBe (structures).
Literature citation from data
              vs
Data referral from literature
PMC336623   Extended to several other biological data types
Literature citation from data
 • Proteins
 • Nucleotides
 • OMIM
 • Chemicals
 • Structure
 • Clinical reviews
 • Protein families
 • Protein-protein interactions
 • Gene expression experiments
[Chart labels: 800 K, 370 K, 110 K]
Data referral from literature: text mining

Semantic Type   Unique Terms             Articles   Annotations
Accession No.         233,017             66,356        387,787
Chemical                76,712      1,694,385        83,923,066
Disease               171,692       1,768,214        57,821,871
Gene/Protein          227,318       1,310,382        77,189,022
GO Terms                32,664      1,832,294        65,061,579
Organism              180,637       1,713,280        70,832,222


                  2.3 million articles
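
As a reading aid for the table above, the number of annotations per annotated article can be worked out for each semantic type. The sketch below is an illustration added to the transcript, not part of the original slides: the figures are copied from the table, and the names `TABLE` and `annotations_per_article` are hypothetical.

```python
# Figures copied from the text-mining table: (unique terms, articles, annotations).
# The dictionary and function names are illustrative, not from the slides.
TABLE = {
    "Accession No.": (233_017, 66_356, 387_787),
    "Chemical": (76_712, 1_694_385, 83_923_066),
    "Disease": (171_692, 1_768_214, 57_821_871),
    "Gene/Protein": (227_318, 1_310_382, 77_189_022),
    "GO Terms": (32_664, 1_832_294, 65_061_579),
    "Organism": (180_637, 1_713_280, 70_832_222),
}


def annotations_per_article(table):
    """Return the mean number of annotations per annotated article, by semantic type."""
    return {
        name: annotations / articles
        for name, (_terms, articles, annotations) in table.items()
    }


if __name__ == "__main__":
    ratios = annotations_per_article(TABLE)
    # Print the types from sparsest to densest.
    for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
        print(f"{name:<14} {ratio:6.1f}")
```

On these figures, accession numbers are by far the sparsest type: roughly six annotations per annotated article, against tens for every other type.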
Annotation of accession numbers (OA)
[Two bar charts, y-axis 0 to 100: publisher-annotated (~10,000 articles) and text-mined (>25,000 articles)]

      BMC Genomics:   1,484 TM tagged,   4,337 articles (1,135 tagged)
      PLoS One:       4,226 TM tagged,   42,888 articles

                                                             Senay Kafkas and Jee-Hyub Kim
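
The slides do not say how the text-mined accession numbers were found. As a minimal sketch only, the idea can be illustrated with a regular expression. The assumption here is that only the classic nucleotide accession shapes are matched (one letter plus five digits, or two letters plus six digits, with an optional version suffix), as in AY387398 from a later slide. The pattern, the function name `find_accessions`, and the sample sentence are all hypothetical. A production pipeline would need database-specific patterns and surrounding context to limit false positives.

```python
import re

# Assumption: match only the classic nucleotide accession shapes
# (1 letter + 5 digits, or 2 letters + 6 digits), with an optional ".version".
ACCESSION_RE = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.\d+)?\b")


def find_accessions(text):
    """Return unique accession-like strings in order of first appearance.

    Any ".version" suffix is dropped so that AY387398 and AY387398.1 count once.
    """
    found = []
    for match in ACCESSION_RE.finditer(text):
        accession = match.group(0).split(".")[0]
        if accession not in found:
            found.append(accession)
    return found


if __name__ == "__main__":
    # Made-up sentence; U49845 is included only as an example of the one-letter shape.
    sample = (
        "Sequences were deposited under accession numbers AY387398 "
        "and AY387398.1; see also U49845."
    )
    print(find_accessions(sample))
```

Identifiers with a different shape, such as the article ID PMC336623, are not matched by this pattern.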
Why is this important? Implications
Scientific:
    Linking articles that cite the same data
Citation:
    Data Citation as measure of impact (Thomson: Data citation index)
    Context of data citation: submission, reuse, analysis
Operational:
    Services for publishers to improve Accession number tagging
    Editorial policies and adherence
    Extension of NLM DTD
    Lessons learned for considering unstructured data

 That we can perform this analysis at all highlights a benefit of Open Access
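
The first scientific implication, linking articles that cite the same data, can be sketched as an inverted index from accession number to articles. This is an illustration under stated assumptions rather than the method used by Europe PubMed Central. The function name `link_articles_by_accession`, the article IDs, and the placeholder accession `ACC-X` are hypothetical. Only AY387398 appears in the slides.

```python
from collections import defaultdict
from itertools import combinations


def link_articles_by_accession(article_accessions):
    """Map each pair of articles to the set of accession numbers they both cite."""
    # Inverted index: accession number -> articles that mention it.
    index = defaultdict(set)
    for article_id, accessions in article_accessions.items():
        for accession in accessions:
            index[accession].add(article_id)

    # Every pair of articles sharing an accession becomes a link.
    links = defaultdict(set)
    for accession, articles in index.items():
        for first, second in combinations(sorted(articles), 2):
            links[(first, second)].add(accession)
    return dict(links)


if __name__ == "__main__":
    # Hypothetical input: article IDs and the placeholder "ACC-X" are made up.
    example = {
        "article-1": {"AY387398"},
        "article-2": {"AY387398", "ACC-X"},
        "article-3": {"ACC-X"},
    }
    for pair, shared in link_articles_by_accession(example).items():
        print(pair, sorted(shared))
```

Building the index first keeps the work proportional to the number of articles sharing each accession, instead of comparing every article with every other.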
Case Study of an FP7-funded article (1)
Case Study of an FP7-funded article (2)
Europe PubMed Central content map


[Diagram nodes: Abstract, Full text, Citing articles, Unstructured datasets, Databases, Extracted terms]
AY387398: needle in a haystack
Europe PubMed Central and Institutional Repositories: content matching
[Chart: number of article IDs; OpenAIRE Plus]
Coming soon: RESTful interface for data linked to articles
People
•   Paula Buttery
•   Andrew Caines
•   Norman Cobley
•   Yuci Gou
•   Senay Kafkas
•   Jyothi Katuri
•   Oliver Kilian
•   Jee-Hyub Kim
•   Nikos Marinos
•   Jo McEntyre
•   Xingjun Pi
•   Philip Rossiter

•   Rebholz Group
•   Peter Stoehr
•   University of Manchester
•   British Library
•   OpenAIRE/OpenAIRE Plus
•   NCBI, NLM

Access to open data through open access articles in the life sciences
