Making eTheses USEFUL
Peter Murray-Rust*,
University of Cambridge and OKF
ETD2014, Leicester, UK 2014-07-24
*Shuttleworth Fellow 2014-5
Overview
• We waste > 10,000,000,000 USD of eThesis value*
• Everyone else is becoming OPEN; not Universities
• What we CAN DO NOW: ContentMining
• What we SHOULD do: Open Notebook Science
• We don’t need commercial organisations to manage
theses.
• The time has come; we can do it now
*My numbers are DEBATABLE! Please add your thoughts to
http://pads.cottagelabs.com/p/etd2014 or tweet #etd2014
Jean-Claude Bradley
Jean-Claude Bradley was one of the
most influential open scientists of our
time. He was an innovator in all that
he did, from Open Education to
bleeding edge Open Science; in 2006,
he coined the phrase Open Notebook
Science. His loss is felt deeply by
friends and colleagues around the
world.
On Monday July 14, 2014 we gathered
at Cambridge University to honour his
memory and the legacy he leaves
behind with a highly distinguished set
of invited speakers to revisit and build
upon the ideas which inspired and
defined his life’s work.
Wikipedia CC BY-SA
The cost and value
The economic value of data
• I believe that we spend globally ca 400 billion
USD / yr on public research.
• The outputs include:
– Knowledge / papers / patents
– Organizations
– People
– Materials
– Data – many billions/year and much is lost
US Taxpayers spend 139 Billion USD / yr
on Scientific Research
4 Billion USD on human genome
yielded 800 Billion USD and 4 M job-years
Scholarly publication
• Citizens pay $400,000,000,000…
• … for research in 1,500,000 articles …
• … cost $300,000 each to create …
• … $7000 each to “publish” … ($7 USD arXiv)
• … costs $10,000,000,000 …
• … “publishers” forbid access to 99.9% of citizens of the
world …
• … Value???
• Please challenge these numbers… #etd2014 or
http://pads.cottagelabs.com/p/etd2014
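As a rough consistency check on the figures above, the per-article and total costs can be recomputed from the quoted inputs. This is only a sketch: the variable names are illustrative, and the inputs are the slide's own numbers, which the speaker flags as debatable.

```python
# Figures as quoted on the slide (explicitly open to challenge).
public_research_spend_usd = 400_000_000_000
articles_per_year = 1_500_000
publish_cost_per_article_usd = 7_000
arxiv_cost_per_article_usd = 7

# Cost of the research behind each article, and total "publishing" costs.
cost_to_create_each = public_research_spend_usd / articles_per_year
total_publishing_cost = articles_per_year * publish_cost_per_article_usd
total_at_arxiv_rate = articles_per_year * arxiv_cost_per_article_usd

print(f"Cost to create one article: ${cost_to_create_each:,.0f}")
print(f"Total 'publishing' cost:    ${total_publishing_cost:,.0f}")
print(f"Same volume at arXiv rate:  ${total_at_arxiv_rate:,.0f}")
```

On these inputs the cost per article comes out near $267,000, which the slide rounds to $300,000, and the publishing total comes out at $10.5 billion, which the slide rounds to $10 billion.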
…three problems—flawed design, non-
publication, and poor reporting—together
meant >85% of research funds were wasted, a
global total loss >100 billion USD per year.
[Lancet 2009]
[Even more] waste clearly occurs after
publication: from poor access, poor
dissemination, and poor uptake of the findings
of research. [PLOS Medicine 2014-05-27]
Bad publication wastes science
Authors don’t deposit data (Ross Mounce)
Where is the Digital Enlightenment?
• Science is done in C20th ways …
• …communicated in C19th ways …
• … losing the power of C21st
Linked Open Data – the world’s knowledge
very little physical science and THESES??
http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png
DBPedia
BIO
Comp
Lib
PDB
Ontologies
GOV
GOV.uk
Music,
Art
Literature
Social
Knowledge
bases
RDF
triples
eTheses
• Citizens pay $20,000,000,000*…
• … for research in 200,000 science theses*…
• … cost $100,000 each to create* …
• … re-use ??? (near zero)
• … Value???
• *Please challenge these numbers…
• NOTE: we pay publishers $15,000,000,000 for
journals and APCs
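The thesis figures above can be checked the same way. This is a minimal sketch using only the slide's own starred (and openly debatable) numbers; the variable names are illustrative.

```python
# Starred figures from the slide; the speaker invites challenges to them.
theses_per_year = 200_000
cost_per_thesis_usd = 100_000
journal_and_apc_spend_usd = 15_000_000_000

# Total public spend embodied in one year's science theses.
total_thesis_spend = theses_per_year * cost_per_thesis_usd

# How that compares with what is paid to publishers.
ratio = total_thesis_spend / journal_and_apc_spend_usd

print(f"Spend embodied in theses: ${total_thesis_spend:,}")
print(f"Ratio to journal/APC spend: {ratio:.2f}")
```

On these inputs the thesis total matches the $20,000,000,000 quoted, about a third more than the amount paid to publishers.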
“Free” and “Open”
• “Free software is a matter of liberty, not price.
‘free speech’, not ‘free beer’.” (R M Stallman)
• “A piece of data or content is open if anyone is
free to use, reuse, and redistribute it”
(OKFN) http://opendefinition.org/
• “open” (access) has multiple incompatible “definitions”. Major split
is “human eyeballs” vs copying and machine “reusability”
• “Open” is a marketing term for publishers, who frequently (often
deliberately) do not grant full Openness.
“Gratis” vs “Libre”
Critical Historical Open Events
• Free Software Foundation (RMS,
1985) and Linux (Torvalds, 1991)
• The World Wide Web (TBL, 1991)
• The human genome (1990-2001)
The life of Aaron Swartz (1986-2013)
https://en.wikipedia.org/wiki/Bermuda_Principles
• Automatic release of sequence assemblies larger than 1
kb (preferably within 24 hours).
• Immediate publication of finished annotated
sequences.
• Aim to make the entire sequence freely available in the
public domain for both research and development in
order to maximise benefits to society.
http://www.budapestopenaccessinitiative.org/read
… an unprecedented public good. …
… completely free and unrestricted access to [peer-
reviewed literature] by all scientists, scholars, teachers,
students, and other curious minds. …
…Removing access barriers to this literature will
accelerate research, enrich education, share the
learning of the rich with the poor and the poor with
the rich, make this literature as useful as it can be, and
lay the foundation for uniting humanity in a common
intellectual conversation and quest for knowledge.
(Budapest Open Access Initiative, 2002)
Panton Principles for Open Data in
Science (2010)
• PUBLISH YOUR DATA OPENLY
• …make an explicit and robust statement of your wishes.
• Use a recognized waiver or license that is appropriate for
data.
• open as defined by the Open Knowledge/Data Definition
(… NOT non-commercial)
• Explicit dedication of data … into the public domain via
PDDL or CCZero
Peter Murray-Rust, Cameron Neylon, Rufus Pollock, John
Wilbanks
Panton Authors and Fellows
Problems of Commercial
Elsevier wants to control Open Data
[asked by Michelle Brook]
Mendeley
From Wikipedia, the free encyclopedia
• … a social media site used by many scientists
to store metadata …
• … purchased by Elsevier in 2013
• David Dobbs, in The New Yorker, described
Elsevier’s motive as:
– to acquire its user data,
– to destroy or coöpt an open-science icon that
threatens its business model.
• PM-R: Mendeley can also Snoop and Control
New ways for Theses
• Content Mining
• Open Notebook Theses
Traditional Research and Publication
“Lab” work
paper/thesis
Write
rewrite
Re-experiment
publish
???
Validation??
DATA
output often
seriously restricted
Content-Mining (TDM)
• Now COMPLETELY LEGAL IN UK since 2014-06-01 …
• … Whatever the publishers tell you. Do NOT sign their
APIs
• Contentmine.org …
• … sponsored by Shuttleworth Foundation …
• … to extract 100,000,000 facts from scientific literature
• And STM publishers are throwing millions to stop us
But we can now
turn PDFs into
Science
We can’t turn a hamburger into a cow
How a machine reads a chemical thesis
nodes are compounds; arrows are reactions
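The "nodes are compounds; arrows are reactions" picture is a directed graph. The sketch below is not the speaker's software; it only illustrates the data structure, using hypothetical compound identifiers and a plain breadth-first search to find a chain of reactions from one compound to another.

```python
from collections import deque

# Hypothetical reactions: each pair is (reactant, product).
reactions = [
    ("compound_1", "compound_2"),
    ("compound_2", "compound_3"),
    ("compound_2", "compound_4"),
    ("compound_4", "compound_5"),
]


def build_graph(edges):
    """Map every compound to the list of compounds made directly from it."""
    graph = {}
    for reactant, product in edges:
        graph.setdefault(reactant, []).append(product)
        graph.setdefault(product, [])
    return graph


def find_route(graph, start, target):
    """Breadth-first search for the shortest chain of reactions."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of reactions connects the two compounds


graph = build_graph(reactions)
route = find_route(graph, "compound_1", "compound_5")
print(" -> ".join(route))
```

Once a thesis is held in this form, questions such as "how was this compound made?" become graph queries rather than a manual read of the text.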
PROPERTIES (Name-Value-Units-Error)
[Figure: thesis text annotated with Name (N), Value (V), Units (U) and Error (E) labels]
“nuggets” in a scientific paper
quantity
units
Value ranges
Humans aren’t designed to mine this …
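A toy illustration of pulling Name-Value-Units-Error "nuggets" out of running text follows. It is a sketch under stated assumptions, not the method used by AMI or ContentMine: the regular expression, the short unit list, the unit-to-name mapping and the example sentence are all invented for illustration, and value ranges are not handled.

```python
import re

# Value, optional "± error", then one unit from a small illustrative list.
PATTERN = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)"
    r"(?:\s*(?:±|\+/-)\s*(?P<error>\d+(?:\.\d+)?))?"
    r"\s*(?P<units>°C|mg|mL|mmol|g|h|K)\b"
)

# Assumed mapping from unit to property name (hypothetical).
UNIT_TO_NAME = {
    "°C": "temperature", "K": "temperature", "h": "time",
    "mg": "mass", "g": "mass", "mL": "volume", "mmol": "amount",
}


def extract_properties(text):
    """Return one name/value/units/error record per quantity found."""
    results = []
    for m in PATTERN.finditer(text):
        error = m.group("error")
        results.append({
            "name": UNIT_TO_NAME[m.group("units")],
            "value": float(m.group("value")),
            "units": m.group("units"),
            "error": float(error) if error is not None else None,
        })
    return results


# Made-up sentence in the style of an experimental section.
sentence = "The mixture was heated at 78 °C for 2 h, giving 12.5 ± 0.3 mg of product."
props = extract_properties(sentence)
for prop in props:
    print(prop)
```

A real extractor needs far larger unit vocabularies and must cope with ranges, superscripts and line breaks, which is why the slide argues this is a job for machines.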
chemical
project places
Natural Language Processing
Part of speech tagging (Wordnet, Brown Corpus, etc.)
Parsing chemical sentences
http://wwmm.ch.cam.ac.uk/chemicaltagger
Typical chemical synthesis
Automatic semantic markup of chemistry
Could be used for analytical, crystallization, etc.
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are >
3,000,000 reactions/year. Added value > 1B Eur.
Evolution of ultraviolet vision in the largest avian radiation - the passerines
Anders Ödeen, Olle Håstad and Per Alström
PDF
HTML
Styles, superscripts and diåcritics preserved!
AMI
PDF
Turdus iliacus
Taeniopygia guttata
Serinus canaria
Lanius excubitor
Melopsittacus undulatus
Pavo cristatus
Sturnus vulgaris
Dolichonyx oryzivorus
Ficedula hypoleuca
Vaccinium myrtillus
Falco tinnunculus
Turdus
Pomatostomus
Leothrix
Amytornis
Acanthisitta
Orthonyx x 2
Malurus
Cnemophilus x 4
Philesturnus x 2
Motacilla x 2
Toxorhampus x 2
Typical phylo tree: 60 nodes, complex and minuscule annotation,
vertical text, hyphenation and valuable branch lengths. AMI extracts ALL
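The species labels listed above are Latin binomials, which have a regular enough shape to find mechanically. The sketch below is an assumption-laden toy, not how AMI works: it proposes "Capitalised lowercase" word pairs with a regular expression and keeps only those whose first word is in a known-genus list (here a few genera taken from the slide). The example sentence is made up.

```python
import re

# Candidate binomial: capitalised genus followed by a lowercase epithet.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")

# Small genus list taken from the names on the slide; a real system
# would use a full taxonomic dictionary.
KNOWN_GENERA = {"Turdus", "Taeniopygia", "Serinus", "Lanius", "Pavo", "Sturnus"}


def find_species(text):
    """Return binomial names whose genus appears in KNOWN_GENERA."""
    return [
        f"{genus} {epithet}"
        for genus, epithet in CANDIDATE.findall(text)
        if genus in KNOWN_GENERA
    ]


# Made-up sentence for illustration only.
sentence = "We compared Turdus iliacus and Taeniopygia guttata with Pavo cristatus."
species = find_species(sentence)
print(species)
```

The dictionary filter matters: the pattern alone also matches ordinary phrases such as "We compared", so the regular expression only proposes candidates.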
Acanthisittidae
Acanthizidae
Acrocephalidae
Callaeidae
Campephagidae
Cnemophilidae
Corvidae
0.84
0.91
0.93
0.95
Acanthisitta
Acrocephalus
Ailuroedus
Ailuroedus
Amytornis
Camptostoma
AMI
23.12
34.54
37.21
38.55
Posterior
probability
AMI can MEASURE
Branch lengths!
NexML
Genus Family
HTML
Open Notebook Science
• Graduate students understand it: do you?
Free/Open Software Development
Engineered
repository
World
community
CODE
rewrite
validate
CODE
fork
CODE
Re-use
CODE
Re-use
Github, BitBucket
StackOverflow,
Apache
inspires
OSI
Example: ContentMine at
http://github.com/ContentMine/quickscrape
Sophie Kershaw, Panton Fellow, Training PhD Students
“Do you think you would be
more confident in the future
about trying to apply Open
techniques to your work..?”
• 50% Yes, by myself
• 41% Yes, with help/guidance
• 9% No opinion/neutral
• 0% No
Rotation-Based Learning (RBL)
Phase 1: Initiator
• No communication
permitted between groups
• Attempt to reproduce
existing literature
• Deliver a coherent research
story by the end of Phase 1
Phase 2: Successor
• Communication between
groups still prohibited
• Validate and develop the
inherited research story
• Critique your predecessors
• Role of research producer vs. research user
• Can this approach help to foster awareness of reproducibility issues?
Throughout Phases 1 & 2:
• Daily lectures on open
science culture & techniques
• First-hand application to own
research work
• Version control using GitHub
• Daily group supervision
Making Theses USEFUL
Open Source software inspires Open Science
Jean-Claude Bradley 2006
Open Notebook Science, ONS
http://michaelnielsen.org/blog/reinventing-discovery/
http://en.wikipedia.org/wiki/Reinventing_Discovery
http://gowers.wordpress.com/2013/11/03/dbd1-initial-post/
http://polymathprojects.org/2013/11/04/polymath9-pnp/#comments
The Polymath project
Tim Gowers and the world
And spectra were included as well
TOOLS
Open Notebook Science
Open
engineered
repository
World
community
INSTRUMENT
validate
merge
MODEL
CODE
DATA
DATA
knowledge
calibrate
Problems are solved communally;
Nothing is needlessly duplicated; “publication“ is
continuous
Machines
and humans
Working
together
CC-BY

Editor's Notes

  • #2: Hi, I’m here to talk about AMI, a data extraction framework and tool. First, I just want to highlight some of the key contributors to the project: Andy for his work on the ChemistryVisitor and Peter for the overall architecture. In this talk, I’m going to stress the importance of data in a specific format and its utility to automated machine processing. Then I’m going to demonstrate AMI’s architecture and the transformation of data as it flows through the process. I’m going to dwell a little on a core format used, Scalable Vector Graphics (SVG), before introducing the concept of visitors, which are pluggable, context-specific data extractors. Next, I’m going to introduce Andy’s ChemVisitor, for extracting semantic chemistry data, along with a few other visitors that can process non-chemistry-specific data. Finally, I will demonstrate some uses of the ChemVisitor within the realm of validation and metabolism.