Defrosting the Digital Library: A survey of bibliographic tools for the next generation Web. Duncan Hull. Faculty of Life Sciences (1992-6): BSc. Computer Science (2002-2007): MSc, PhD. Chemistry (2008-date): Postdoc.
It’s all Casey’s fault! Dr. Casey Bergman, Lecturer, Faculty of Life Sciences. I  s  Citeulike.org! http://ukpmc.ac.uk/
http://pubmed.gov/19060304
Defrosting the Digital Library (in one slide). There are lots of digital libraries out there for scientists: ACM, IEEE, PubMed, DBLP, Scopus, ISI-WoK, Google Scholar, arXiv. But they have some fundamental problems with their data: identity crisis (identifying people accurately); identity crisis (identifying publications accurately); keeping data and metadata coupled together; impersonal, unsociable, difficult to use (“cold”). Some new tools exist to make things better (“warmer”): Citeulike, Mendeley, Zotero, Papyro, Papers etc. BUT fundamental problems with identity and data need to be fixed before the tools will get any better.
Metawhat? getMetadata / getData. From the Greek μετά (meta), meaning after: metadata is not just data about data, metadata is data after data. Data first, metadata second. Reversible reaction (“round-tripping”). Title: Defrosting the digital library. Authors: Duncan Hull, Steve Pettifer and Douglas Kell. Published: 2008. Journal: PLoS Computational Biology. Tell me more? What is it about? Where did it come from?
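The getMetadata / getData round trip on this slide can be sketched in a few lines of Python. This is an illustrative sketch only: the in-memory dictionaries, the function names get_metadata and get_data, and the placeholder PDF bytes are assumptions made for the example, not part of any real library API. The DOI is the one quoted later in the talk for the same paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metadata:
    identifier: str   # link back to the data (here, a DOI)
    title: str
    authors: tuple
    published: int
    journal: str


DOI = "10.1371/journal.pcbi.1000204"

# Hypothetical toy "library": data and metadata are stored under the same key.
DATA = {DOI: b"%PDF-1.4 placeholder bytes standing in for the paper"}
METADATA = {
    DOI: Metadata(
        identifier=DOI,
        title="Defrosting the digital library",
        authors=("Duncan Hull", "Steve Pettifer", "Douglas Kell"),
        published=2008,
        journal="PLoS Computational Biology",
    )
}


def get_metadata(identifier: str) -> Metadata:
    """Data -> metadata: what is it about?"""
    return METADATA[identifier]


def get_data(metadata: Metadata) -> bytes:
    """Metadata -> data: where did it come from?"""
    return DATA[metadata.identifier]


# Round trip: start from the data's key, fetch the metadata, get back to the data.
record = get_metadata(DOI)
round_trip_ok = get_data(record) == DATA[DOI]
```

The round trip only works because the metadata carries an identifier that leads back to the data; once that link is lost, neither direction is reliable.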
Metadata in: Chemistry (Science of Matter) Biology (Science of Life) Informatics (Science of  Information) Cheminformatics Biochemistry Bioinformatics Science! www.mib.ac.uk nactem.ac.uk/refine www.citeulike.org
Representing Evidence For Interacting Network Elements (REFINE). www.sbml.org from www.biomodels.net database at the EBI.ac.uk
Example from Glycolysis in Yeast reactant reactant product product modifier This is just one reaction, there are at least another 1700+ in Yeast
Synonyms from Pedro Mendes B-Net Database http://www.comp-sys-bio.org/yeastnet/
Name: Synonyms
Glucose-6-phosphate: Robison ester, D-Glucose 6-phosphate
ADP: 5'-adenylphosphoric acid; Adenosine 5'-diphosphate; H3adp
Hexokinase: Hexokinase-1; Hexokinase-A; Hexokinase PI; YFR053C
ATP: Adenosine 5'-triphosphate; Adenosine triphosphate; H4atp
D-Glucose: dextrose; D-Glucose; D-(+)-glucose; D(+)-glucose; grape sugar; Traubenzucker
Chemistry Biology Informatics Cheminformatics Biochemistry Bioinformatics
For more info. www.nactem.ac.uk/refine   One of the biggest challenges is getting hold of accurate metadata from libraries and databases
But first… Before getting into the paper, some lessons I learnt while working in industrial informatics for a small startup company called CSW Informatics Ltd: Ford and BBC. How business and governments manage metadata.
Ford Focus (launched 1998). getMetadata / getData. 6 million+ “units” sold worldwide to date: America, Europe, Middle East, Africa, Australasia. Lots of data, metadata and money! Owner’s handbook. Tell me more? What is it about?
Final solution: Web XSLT Print
Summary: Lessons from Ford. Data is often the tip of the iceberg. If the data doesn’t sink you, the metadata will. Businesses like Ford spent $ £ € keeping data and metadata together. Data is often worthless without it. Can’t sell data (cars) without metadata (manuals). Don’t just “make cars”. DATA / METADATA
 
BBC Spooks? Open Source Intelligence (OSINT). Overt not covert espionage: 370 journalists, 24-7, ~100 languages. Caversham, Reading. Keeping an eye on people around the world since 1939. Winston Churchill. “Big British Castle” (BBC)
I  hate powerpoint Radio MS Word TV
How do they stay in business? Broadcasting House, London Foreign governments, e.g. U.S.A. etc
Word:  Not  the best way to manage data and metadata
Getting Rid of Word database XML schema Web &  Intranet Printed documents XSLT
A solution that worked! getMetadata getData Who is Thabo Mbeki? These documents are all about  Thabo Mbeki Thabo Mbeki
Summary: Lessons from the BBC. Important decisions are made on the basis of metadata. Crucial that metadata is accurate, high quality and trustworthy. Identifying people properly is crucial (100%). You know what data is about (getMetadata). You know where it came from (getData). Looked after properly (this can be expensive). Businesses built on buying/selling metadata:
How have libraries managed metadata? On paper since 300 B.C. (Library of Alexandria). Organised in physical space. In buildings made from bricks and mortar. Expensive and slow to distribute. Only ever read by humans. Filled with content bought from publishers, locked up with copyright. Image via http://en.wikipedia.org/wiki/Library_of_Alexandria
From ~1824 until ~1989. Photos via dpicker http://www.flickr.com/photos/dpicker/3107856991/ and pit yacker http://www.flickr.com/photos/78825653@N00/131611136 JRULM (Main Library), Joule Library. Mostly “private”: only available to an elite (e.g. University of Manchester students and staff)
Metadata (after) Data Tightly bound (literally) Rarely separated First published 1687, over 300 years old
Data and metadata were like this for centuries! Until…
+ Tim Berners-Lee 1989
Timeline: Unchanged for centuries but… 20 years  ÷   2309 years  = <1%
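The arithmetic behind the timeline, as a quick check. The end year of 2009 is an assumption (the slide only gives the totals of 20 and 2309 years); 300 B.C. and 1989 come from the surrounding slides.

```python
# Assumed year of the talk; chosen so the totals match the slide.
TALK_YEAR = 2009
WEB_INVENTED = 1989       # Tim Berners-Lee
LIBRARY_START_BC = 300    # Library of Alexandria

web_years = TALK_YEAR - WEB_INVENTED          # 20
library_years = LIBRARY_START_BC + TALK_YEAR  # 2309, counted as on the slide
share = web_years / library_years

print(f"{web_years} / {library_years} = {share:.2%}")  # prints "20 / 2309 = 0.87%"
```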
Everything’s Gone Digital! www.scopus.com www.pubmed.gov http://ukpmc.ac.uk www.isiknowledge.com scholar.google.com
Digital Utopia? Bits and bytes  1010100101000001101010  (not paper) In pervasive cyberspace  (not physical space) Databases and/or Web identified by URIs:  (not buildings) Cost of distribution  fallen by orders of magnitude Read and indexed by  machines  like Googlebot  et al  (not just humans) Increasingly public, available to everyone via Open-Access publishing  (less private, less restrictive copyright) Everything is great? Alexander Griekspoor www.mekentosj.com
Welcome to Digital Dystopia. Isolation: each discipline has its own data silo. Impersonal and unsociable: “who the hell are you?” Where are “my” papers (authored by me, or of interest to me)? What are my friends and colleagues reading? What are the experts reading? What is popular this week / month / year? “Cold”: identity of publications and authors is inadequate. Data divorced from its metadata: getMetadata / getData unreliable, therefore it can be difficult to tell what data is about, or where metadata came from. Obsolete models of publication: not everything fits publication-sized holes. Micro-attribution, mega-attribution, digital contributions (databases, software, wikis/blogs?)
Isolated publication silos Chemistry Informatics Biology impersonal, isolated, unsociable, Generally rubbish
Identity Crisis part 1: Which publication? http://pubmed.gov/18974831 http://www.ncbi.nlm.nih.gov/pubmed/18974831 http://ukpmc.ac.uk/articlerender.cgi?accid=pmcA2568856 http://ukpmc.ac.uk/picrender.cgi?artid=1687256&blobtype=pdf http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000204 http://www.dbkgroup.org/Papers/hull_defrost_ploscb08.pdf http://dx.doi.org/10.1371/journal.pcbi.1000204 One paper, many URIs. Disambiguation algorithms rely on getting metadata for each. A big problem for libraries is these redundant duplicates. Matching can be done by Digital Object Identifier (DOI) and PubMed ID (PMID); these are frequently absent < 5% (Kevin Emamy, citeulike)
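Matching by DOI and PMID can be illustrated with a rough Python sketch that pulls an identifier out of each URI on the slide and groups the URIs by it. The regular expressions and the function names identify and group_by_identifier are assumptions made for this example; real disambiguation services are considerably more involved.

```python
import re
from collections import defaultdict
from urllib.parse import unquote

# Simplified patterns (assumptions for this sketch, not a full DOI/PMID grammar).
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s?#&]+")
PMID_RE = re.compile(r"(?:pubmed\.gov|ncbi\.nlm\.nih\.gov/pubmed)/(\d+)")


def identify(uri):
    """Return ("doi", value), ("pmid", value), or None if no identifier is found."""
    decoded = unquote(uri)  # turns %3A and %2F back into ":" and "/"
    doi = DOI_RE.search(decoded)
    if doi:
        return ("doi", doi.group(0).lower())
    pmid = PMID_RE.search(decoded)
    if pmid:
        return ("pmid", pmid.group(1))
    return None


def group_by_identifier(uris):
    """Group URIs by extracted identifier; unresolved URIs land under None."""
    groups = defaultdict(list)
    for uri in uris:
        groups[identify(uri)].append(uri)
    return dict(groups)


URIS = [
    "http://pubmed.gov/18974831",
    "http://www.ncbi.nlm.nih.gov/pubmed/18974831",
    "http://ukpmc.ac.uk/articlerender.cgi?accid=pmcA2568856",
    "http://ukpmc.ac.uk/picrender.cgi?artid=1687256&blobtype=pdf",
    "http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000204",
    "http://www.dbkgroup.org/Papers/hull_defrost_ploscb08.pdf",
    "http://dx.doi.org/10.1371/journal.pcbi.1000204",
]

groups = group_by_identifier(URIS)
```

In this sketch two URIs resolve to the PMID and two to the DOI, but the PMID and DOI groups stay separate because linking them needs an external lookup, and the remaining three URIs carry no recognisable identifier at all.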
Identity crisis part 2: Who are you? Who, who … who, who? Douglas Kell; Doug Kell; Douglas B Kell; Kell, D; Kell, D.B.; Douglas Bruce Kell; Druglas Kell (typo). “Attribution would seem to be a simple process and yet it represents a major, unsolved problem for information science.” Neil Smalheiser and Vetle Torvik, http://tinyurl.com/authorid
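A naive way to reconcile the name variants on this slide is to reduce each one to a (surname, first initial) key. The sketch below does exactly that; the function name name_key and the rule itself are assumptions for illustration, and the last line shows why such a rule is not a real solution.

```python
def name_key(name):
    """Crude author key: (lower-cased surname, lower-cased first initial)."""
    cleaned = name.replace(".", " ")
    if "," in cleaned:
        # "Surname, Given" form, e.g. "Kell, D.B."
        surname, given = (part.strip() for part in cleaned.split(",", 1))
    else:
        # "Given Surname" form, e.g. "Douglas B Kell"
        parts = cleaned.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    return (surname.lower(), given[:1].lower())


VARIANTS = [
    "Douglas Kell",
    "Doug Kell",
    "Douglas B Kell",
    "Kell, D",
    "Kell, D.B.",
    "Douglas Bruce Kell",
    "Druglas Kell",  # the typo from the slide
]

keys = {name_key(v) for v in VARIANTS}

# The same key also swallows a different (hypothetical) author.
false_merge = name_key("David Kell") in keys
```

All seven variants collapse to one key, but so would any other author whose surname is Kell and whose first name starts with D, which is one reason attribution remains hard without a proper digital identity.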
Identity crisis part 3: Mistaken Identity. Google Scholar thinks I’m Maurice Wilkins. Dr. Duncan Hull, Humble Postdoc. Article about Authored-by Authored-by Wrong! “DNA mania” title http://tinyurl.com/mistakenid
Can’t get metadata (decoupled from data): PDF. getMetadata / getData. Title: Defrosting the digital library. Authors: Duncan Hull, Steve Pettifer and Douglas Kell. Published: 2008. Tell me more? Don’t know, try Google. Don’t know, title might be “defrosting…”. Where did this come from?
Can’t get metadata (decoupled from data): PDF. MP3 music file in iTunes. Why can't I manage academic papers like MP3s? http://tinyurl.com/mp3vpdf James Howison, Carnegie Mellon University. Data is tightly coupled to its metadata. getMetadata / getData. Artist: The Who. Title: Who Are You? Recorded: 1978. Album: Who Are You
Can’t get metadata (decoupled from data): PDF. Peter Murray-Rust: “PDF is a hamburger, and we're trying to turn it back into a cow.” http://tinyurl.com/pdfhamburger Hamburger (unstructured data), cow (structured data); publishing, text-mining
Can’t get metadata (decoupled from data): HTTP. Arbitrary URI (not just PubMed, but any scientific paper) http://www.ncbi.nlm.nih.gov/pubmed/18974831
Can’t get metadata (decoupled from data): HTTP. Fundamental problem with the way the web is built using HTTP, can’t change it now… Tim Bray, Sun Microsystems: “One of the Web's distinguishing features is that there's a big gaping hole where the metadata ought to be.” http://tinyurl.com/nometadata
I’ll stop moaning now Isolation Can’t identify people Can’t identify publications Metadata gets divorced from its data But what are the solutions?
www.citeulike.org Richard Cameron, Kevin Emamy. Picture from http://network.nature.com/people/mfenner/blog/2009/01/30/interview-with-kevin-emamy and http://www.citeulike.org/faq/faq.adp “The reason I wrote the site [citeulike.org] was, after recently coming back to academia, I was slightly shocked by the quality of some of the tools available to help academics do their job. I found it preferable to start writing proper tools for my own use than to use existing software.”
Why should you care about citeulike? Could save you time But also like Green Fluorescent Protein…
All references in one place
Click Post to Citeulike
Tag it (optional)
Citeulike: Recoupling data and metadata Wouldn’t be a problem if the publishers hadn’t decoupled it in the first place!
Citegeist = Citeulike + Zeitgeist
allegedly 2,243,177 ~2,000 /day variable 674,076 2,880 /day 2 papers / min Linear growth ~500,000
Where will citeulike break? The more people that use “social software”, the better they get. Citeulike is one of the leading ones, but there is plenty of competition. Parsers are fragile, easily (and deliberately) broken by publishers: ISI WOK and Scopus. Each publisher has its own parser (euuuggh!). Privacy and competition: “I don’t want to share any of my data before publication”, “It’s nobody’s business but mine” (basic human right to privacy). Closer integration with Word (and LaTeX tools). Might go bust? Why put all my precious data in the hands of a commercial company?
Why should you bother with citeulike? Organisation and time saving. Searching. Browsing. Managing references while writing papers. Quick and efficient sharing of data before publication, e.g. tag “defrost” when writing this paper http://www.citeulike.org/tag/defrost Serendipity: Casey Bergman story
Casey Bergman story: I was importing papers on Solexa and 454 genome assembly and came across the following paper: http://www.citeulike.org/user/cisevol/article/1465689 which was a real find in terms of convincing me that light shotgun sequence data is worth analysing. I nicked this from a PhD student's library in Brazil http://www.citeulike.org/profile/GustavoLacerda Wouldn’t have found this any other way (e.g. keyword searching or following citation trails)
Many different solutions, e.g. Papyro: Steve Pettifer http://utopia.cs.manchester.ac.uk/
And the rest… www.mendeley.com www.zotero.org www.connotea.org www.mekentosj.com www.hubmed.org Re-couple metadata that has been de-coupled from data www.2collab.com www.refworks.com “iTunes for PDF files”
There is still lots more metadata. How many times has http://pubmed.gov/19060304 been cited? Who has cited http://pubmed.gov/19060304 ? Give me all the references that cite this one. Give me all the references cited by http://pubmed.gov/19060304 Who the hell is Doug Kell? Steve Pettifer? Duncan Hull? What is Doug Kell’s h-index? Remember: machines ask these questions, not just humans. Notify me whenever Steve Pettifer publishes a paper. Notify me whenever someone cites http://pubmed.gov/19060304 Impact factor?
Digital Identity would solve some of these problems. Give yourself a URI, you deserve it! Tim Berners-Lee http://www.w3.org/People/Berners-Lee/card#i see http://dig.csail.mit.edu/breadcrumbs/node/71
URIs for Douglas Kell: http://blogs.bbsrc.ac.uk http://www.chemistry.manchester.ac.uk/aboutus/staff/showprofile.php?id=194 http://dbkgroup.org/kell.htm http://douglaskell.myopenid.com http://dx.doi.org/10.1371/journal.pcbi.1000204 “Contributor identifier” from www.myopenid.com www.openid.net (also note researcher-id from Thomson)
http://pubmed.gov/19112480 Phil Bourne
John Ziman, Physicist: Science is public knowledge http://tinyurl.com/publicknowledge
Conclusions: What hasn’t changed The Web has revolutionised libraries in just 20 short years but… Still takes time for humans to read and digest: We can get more papers but there are still only 24 hours in a day, 7 days in a week, 52 weeks in a year We need help from machines (and the people that build them) Need to make metadata more machine-friendly
Conclusions: Publication metadata matters Managed to convince you metadata matters (and why) People make important decisions based on metadata Funding Hiring (and Firing) Publishing Who to collaborate with Yet our current libraries can’t even accurately identify crucial metadata Individual people - digital identity needed Publications - disambiguation Everything else…
Conclusions: Scientists are too blasé about metadata! Leave it to stamp collectors, dusty librarians, informaticians, database administrators (yawn!), “biocurators” http://biocurator.org/ Boring, unscientific, not cutting-edge innovation? Everyone wants to use good metadata but few people want to spend time curating and cleaning metadata. Like a clean toilet. We ignore metadata at our peril: “not my job”. We leave it to publishers, who then mess it up and charge us for their services; we should be getting better value for money. We waste precious time organising metadata. We waste precious time searching for metadata. Data is more valuable with better metadata. Have a look at citeulike (and other tools) metadata
Conclusions: Do us a favour!
Acknowledgements. Refine project: Sophia Ananiadou, Jun'ichi Tsujii, Pedro Mendes, Steve Pettifer, Yoshimasa Tsuruoka, Douglas Kell www.nactem.ac.uk/refine BBSRC grant code BB/E004431/1. CSW Informatics Ltd.: John Chelsom, Mavis Cournane, Niki Dinsey www.csw.co.uk BBC Monitoring, Ford Motor Company. School of Chemistry, MIB (now) www.mib.ac.uk Faculty of Life Sciences (a long long time ago) and Casey Bergman, Jean-Marc Schwartz (now). School of Computer Science (not so long ago), Information Management Group http://img.cs.man.ac.uk/ Any Questions?

Defrosting the Digital Library: A survey of bibliographic tools for the next generation web

  • 1. Defrosting the Digital Library A survey of bibliographic tools for the next generation Web Duncan Hull Faculty of Life Sciences (1992-6) BSc. Computer Science (2002-2007) MSc, PhD. Chemistry (2008-date) Postdoc
  • 2. It’s all Casey’s fault! Dr. Casey Bergman, Lecturer Faculty of Life Sciences I s Citeulike.org! http://ukpmc.ac.uk/
  • 4. Defrosting the Digital Library (in one slide) There are lots of digital libraries out there for scientists! ACM, IEEE, PubMed, DBLP, Scopus, ISI-WoK, Google Scholar, arXiv But they have some fundamental problems with their data Identity crisis: identifying people accurately Identity crisis: identifying publications accurately Keeping data and metadata coupled together Impersonal, unsociable, difficult to use: “Cold” Some new tools exist to make things better: “warmer” Citeulike, Mendeley, Zotero, Papyro, Papers etc BUT Fundamental problems with identity and data need to be fixed before the tools will get any better
  • 5. Metawhat? getMetadata getData From the Greek μετ ά (meta) meaning after metadata not just data about data metadata is data after data data first metadata second Reversible reaction (“round-tripping”) Title: defrosting the digital library Authors: Duncan Hull, Steve Pettifer and Douglas Kell Published: 2008 Journal: PLoS Computational Biology Tell me more? What is it about? Where did it come from?
  • 6. Metadata in: Chemistry (Science of Matter) Biology (Science of Life) Informatics (Science of Information) Cheminformatics Biochemistry Bioinformatics Science! www.mib.ac.uk nactem.ac.uk/refine www.citeulike.org
  • 7. R epresenting E vidence F or I nteracting N etwork E lements www.sbml.org from www.biomodels.net database at the EBI.ac.uk
  • 8. Example from Glycolysis in Yeast reactant reactant product product modifier This is just one reaction, there are at least another 1700+ in Yeast
  • 9. Synonyms from Pedro Mendes B-Net Database http://www.comp-sys-bio.org/yeastnet/ Robison ester, D-Glucose 6-phosphate Glucose-6-phosphate 5'-adenylphosphoric acid; Adenosine 5'-diphosphate; H3adp ADP Hexokinase-1; Hexokinase-A; Hexokinase PI; YFR053C Hexokinase Adenosine 5'-triphosphate; Adenosine triphosphate; H4atp ATP dextrose; D-Glucose; D-(+)-glucose; D(+)-glucose; grape sugar; Traubenzucker D-Glucose Synonyms Name
  • 10. Chemistry Biology Informatics Cheminformatics Biochemistry Bioinformatics
  • 11. For more info. www.nactem.ac.uk/refine One of the biggest challenges is getting hold of accurate metadata from libraries and databases
  • 12. But first… Before getting into the paper… Some lessons I learnt while working in industrial informatics for a small startup company called CSW Informatics Ltd Ford and BBC How business and governments manage metadata
  • 13. Ford Focus (launched 1998) getMetadata getData 6 million+ “units” sold worldwide to date: america, europe, middle east, africa, australasia Lots of data, metadata and money! Owner’s handbook Tell me more? What is it about?
  • 14. Final solution: Web XSLT Print
  • 15. Summary: Lessons from Ford Data often the tip of the iceberg If the data doesn’t sink you, the metadata will Businesses like Ford spent $ £ € keeping data and metadata stay together Data is often worthless without it Can’t sell data (cars) without metadata (manuals) Don’t just “make cars” DATA METADATA
  • 16.  
  • 17. BBC Spooks? Open Source Intelligence (OSINT) Overt not Covert espionage: 370 journalists, 24-7, ~100 languages Caversham, Reading. Keeping an eye on people around the world since 1939 Winston Churchill “ B ig B ritish C astle” (BBC)
  • 18. I hate powerpoint Radio MS Word TV
  • 19. How do they stay in business? Broadcasting House, London Foreign governments, e.g. U.S.A. etc
  • 20. Word: Not the best way to manage data and metadata
  • 21. Getting Rid of Word database XML schema Web & Intranet Printed documents XSLT
  • 22. A solution that worked! getMetadata getData Who is Thabo Mbeki? These documents are all about Thabo Mbeki Thabo Mbeki
  • 23. Summary: Lessons from the BBC Important decisions made on the basis metadata Crucial that metadata is accurate, high quality and trustworthy Identify people properly is crucial (100%) You know what data is about (getMetadata) You know where it came from (getData) Looked after properly (this can be expensive) Businesses built on buying/selling metadata:
  • 24. How have libraries managed metadata? On paper since 300 B.C. (Library of Alexandria) Organised in physical space In buildings made from bricks and mortar Expensive and slow distribute Only ever read by humans Filled with content bought from publishers, locked up with copyright  Image via http://en.wikipedia.org/wiki/Library_of_Alexandria
  • 25. From ~1824 until ~1989 Photos via dpicker http://www.flickr.com/photos/dpicker/3107856991/ and pit yacker http://www.flickr.com/photos/78825653@N00/131611136 JRULM (Main Library) Joule Library Mostly “private” only available to an elite (e.g. University of Manchester Students and Staff)
  • 26. Metadata (after) Data Tightly bound (literally) Rarely separated First published 1687, over 300 years old
  • 27. Data and metadata was like this for centuries! Until…
  • 29. Timeline: Unchanged for centuries but… 20 years ÷ 2309 years = <1%
  • 30. Everything’s Gone Digital! www.scopus.com www.pubmed.gov http://ukpmc.ac.uk www. isiknowledge .com scholar.google.com
  • 31. Digital Utopia? Bits and bytes 1010100101000001101010 (not paper) In pervasive cyberspace (not physical space) Databases and/or Web identified by URIs: (not buildings) Cost of distribution fallen by orders of magnitude Read and indexed by machines like Googlebot et al (not just humans) Increasingly public, available to everyone via Open-Access publishing (less private, less restrictive copyright) Everything is great? Alexander Griekspoor www.mekentosj.com
  • 32. Welcome to Digital Dystopia Isolation each discipline has its own data silo Impersonal and unsociable “ who the hell are you”? Where are “my” papers? (authored by me, or of interest to me) What are my friends and colleagues reading? What are the experts reading? What is popular this week / month / year ? “ Cold”: Identity of publications and authors is inadequate Data divorced from its metadata GetMetadata / GetData unreliable Therefore can be difficult to tell what data is about, or where metadata came from Obsolete models of publication, not everything fits publication-sized holes Micro-attribution Mega-attribution Digital contributions (databases, software, wikis/blogs?)
  • 33. Isolated publication silos Chemistry Informatics Biology impersonal, isolated, unsociable, Generally rubbish
  • 34. Identity Crisis part 1: Which publication? http://pubmed.gov/18974831 http://www.ncbi.nlm.nih.gov/pubmed/18974831 http://ukpmc.ac.uk/articlerender.cgi?accid=pmcA2568856 http://ukpmc.ac.uk/picrender.cgi?artid=1687256&blobtype=pdf http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000204 http://www.dbkgroup.org/Papers/hull_defrost_ploscb08.pdf http://dx.doi.org/10.1371/journal.pcbi.1000204 One paper, many URIs. Disambiguation algorithms rely on getting metadata for each Big problem for libraries is these redundant duplicates Matching can be done by Digital Object Identifier (DOI) and PubMed ID (PMID); these are frequently absent < 5% (Kevin Emamy, citeulike)
  • 35. Identity crisis part 2: Who are you? Who, who … who, who? Douglas Kell Doug Kell Douglas B Kell Kell, D Kell, D.B. Douglas Bruce Kell Druglas Kell Neil Smalheiser and Vetle Torvik Typo Attribution would seem to be a simple process and yet it represents a major, unsolved problem for information science. http://tinyurl.com/authorid
  • 36. Identity crisis part 3: Mistaken Identity Google Scholar thinks I’m Maurice Wilkins Dr. Duncan Hull Humble Postdoc Article about Authored-by Authored-by Wrong! “DNA mania” title http://tinyurl.com/mistakenid
  • 37. Can’t get metadata (decoupled from data): PDF getMetadata getData Title: defrosting the digital library Authors: Duncan Hull, Steve Pettifer and Douglas Kell Published: 2008 Tell me more Don’t know, try Google Don’t know, title might be “defrosting…” Where did this come from?
  • 38. Can’t get metadata (decoupled from data): PDF MP3 music file in iTunes Why can't I manage academic papers like MP3s? http://tinyurl.com/mp3vpdf James Howison, Carnegie Mellon University Data is tightly coupled to its metadata getMetadata getData Artist: The Who Title: Who Are You? Recorded: 1978 Album: Who Are You
  • 39. Can’t get metadata (decoupled from data): PDF Peter Murray-Rust Hamburger (unstructured data) PDF is a hamburger, and we're trying to turn it back into a cow. http://tinyurl.com/pdfhamburger Cow (structured data) publishing text-mining
  • 40. Can’t get metadata (decoupled from data): HTTP Arbitrary URI (not just pubmed, but any scientific paper) http://www.ncbi.nlm.nih.gov/pubmed/18974831
  • 41. Can’t get metadata (decoupled from data): HTTP Fundamental problem with the way the web is built using HTTP, can’t change it now… Tim Bray, Sun Microsystems One of the Web's distinguishing features is that there's a big gaping hole where the metadata ought to be. http://tinyurl.com/nometadata
  • 42. I’ll stop moaning now Isolation Can’t identify people Can’t identify publications Metadata gets divorced from its data But what are the solutions?
  • 43. www.citeulike.org Richard Cameron Kevin Emamy Picture from http://network.nature.com/people/mfenner/blog/2009/01/30/interview-with-kevin-emamy and http://www.citeulike.org/faq/faq.adp “The reason I wrote the site [citeulike.org] was, after recently coming back to academia, I was slightly shocked by the quality of some of the tools available to help academics do their job. I found it preferable to start writing proper tools for my own use than to use existing software.”
  • 44. Why should you care about citeulike? Could save you time But also like Green Fluorescent Protein…
  • 45. All references in one place
  • 46. Click Post to Citeulike
  • 48. Citeulike: Recoupling data and metadata Wouldn’t be a problem if the publishers hadn’t decoupled it in the first place!
  • 49. Citegeist = Citeulike + Zeitgeist
  • 50. allegedly 2,243,177 ~2,000 /day variable 674,076 2,880 /day 2 papers / min Linear growth ~500,000
  • 51. Where will citeulike break? The more people that use “social software”, the better they get Citeulike is one of the leading ones, but there is plenty of competition Parsers are fragile, easily (and deliberately) broken by publishers ISI WOK and Scopus Each publisher has its own parser (euuuggh!) Privacy and competition “I don’t want to share any of my data before publication” “It’s nobody’s business but mine” (basic human right to privacy) Closer integration with Word (and LaTeX tools) Might go bust? Why put all my precious data in the hands of a commercial company?
  • 52. Why should you bother with citeulike? Organisation and time saving Searching Browsing Managing references while writing papers Quick and efficient sharing of data before publication e.g. tag “defrost” when writing this paper http://www.citeulike.org/tag/defrost Serendipity Casey Bergman story
  • 53. Casey Bergman story I was importing papers on Solexa and 454 genome assembly and came across the following paper: http://www.citeulike.org/user/cisevol/article/1465689 which was a real find in terms of convincing me that light shotgun sequence data is worth analysing. I nicked this from a PhD student's library in Brazil http://www.citeulike.org/profile/GustavoLacerda Wouldn’t have found this any other way (e.g. keyword searching or following citation trails)
  • 54. Many different solutions e.g. Papyro: Steve Pettifer http://utopia.cs.manchester.ac.uk/
  • 55. And the rest… www.mendeley.com www.zotero.org www.connotea.org www.mekentosj.com www.hubmed.org Re-couple metadata that has been de-coupled from data www.2collab.com www.refworks.com “iTunes for PDF files”
  • 56. There is still lots more metadata How many times has http://pubmed.gov/19060304 been cited? Who has cited http://pubmed.gov/19060304 ? Give me all the references that cite this one Give me all the references cited by http://pubmed.gov/19060304 Who the hell is Doug Kell? Steve Pettifer? Duncan Hull? What is Doug Kell’s h-index? Remember: Machines ask these questions, not just humans Notify me whenever Steve Pettifer publishes a paper Notify me whenever someone cites http://pubmed.gov/19060304 Impact factor?
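The h-index question shows how much depends on clean metadata: the calculation itself is trivial once a machine can reliably list an author's papers and their citation counts. A minimal Python sketch follows, using the standard definition (the largest h such that h papers have at least h citations each). The function name `h_index` and the citation counts are hypothetical and are not anyone's real record.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break  # counts only fall from here on, so h cannot grow further
    return h


# Hypothetical citation counts, one number per paper.
example_counts = [25, 8, 5, 3, 3]
example_h = h_index(example_counts)
```

For the made-up counts above the result is 3: three papers have at least three citations each, but there are not four papers with at least four. The hard part is producing the input list, which needs both author and publication identity to be right.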
  • 57. Digital Identity would solve some of these problems Give yourself a URI, you deserve it! Tim Berners-Lee http://www.w3.org/People/Berners-Lee/card#i see http://dig.csail.mit.edu/breadcrumbs/node/71
  • 58. URIs for Douglas Kell http://blogs.bbsrc.ac.uk http://www.chemistry.manchester.ac.uk/aboutus/staff/showprofile.php?id=194 http://dbkgroup.org/kell.htm http://douglaskell.myopenid.com http://dx.doi.org/10.1371/journal.pcbi.1000204 “Contributor identifier” from www.myopenid.com www.openid.net (also note ResearcherID from Thomson)
  • 60. John Ziman, Physicist Science is public knowledge http://tinyurl.com/publicknowledge
  • 61. Conclusions: What hasn’t changed The Web has revolutionised libraries in just 20 short years but… Still takes time for humans to read and digest: We can get more papers but there are still only 24 hours in a day, 7 days in a week, 52 weeks in a year We need help from machines (and the people that build them) Need to make metadata more machine-friendly
  • 62. Conclusions: Publication metadata matters Managed to convince you metadata matters (and why) People make important decisions based on metadata Funding Hiring (and Firing) Publishing Who to collaborate with Yet our current libraries can’t even accurately identify crucial metadata Individual people - digital identity needed Publications - disambiguation Everything else…
  • 63. Conclusions: Scientists are too blasé about metadata! Leave it to stamp collectors, dusty-librarians, informaticians, database administrators (yawn!), “biocurators” http://biocurator.org/ Boring, unscientific, not cutting-edge innovation? Everyone wants to use good metadata but few people want to spend time curating and cleaning metadata Like a clean toilet We ignore metadata at our peril “not my job” We leave it to publishers, who then mess it up, and charge us for their services, we should be getting better value for money We waste precious time organising metadata We waste precious time searching for metadata Data is more valuable with better metadata Have a look at citeulike (and other tools) metadata
  • 64. Conclusions: Do us a favour!
  • 65. Acknowledgements Refine project: Sophia Ananiadou, Jun'ichi Tsujii, Pedro Mendes, Steve Pettifer, Yoshimasa Tsuruoka, Douglas Kell www.nactem.ac.uk/refine BBSRC grant code BB/E004431/1 CSW Informatics Ltd.: John Chelsom, Mavis Cournane, Niki Dinsey www.csw.co.uk BBC Monitoring, Ford Motor Company School of Chemistry, MIB (now) www.mib.ac.uk Faculty of Life Sciences (a long long time ago) and Casey Bergman, Jean-Marc Schwartz (now) School of Computer Science (not so long ago) Information Management Group http://img.cs.man.ac.uk/ Any Questions?