Architecture of TheContentMine
These slides are for enlightenment and presentations. Use http://discuss.contentmine.org/t/overall-architecture/142 for up-to-date info. Questions, comments and critiques welcome! All s/w is Open (BSD/Apache2).
Some diagrams are autogenerated from *.dot files, which are located in the projects (mainly Norma and AMI).
[Diagram: architecture overview. Latest 20150908. Labels in reading order:]
catalogue; getpapers; query; Daily Crawl: EuPMC, arXiv, CORE, HAL, (UNIV repos); ToC services
PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS; URLs; DOIs
crawl; quickscrape
norma: Normalizer, Structurer, Semantic Tagger; Text, Data, Figures
ami; UNIV Repos; search; Lookup
CONTENT MINING: Chem, Phylo, Trials, Crystal, Plants; COMMUNITY plugins; Visualization and Analysis
Publisher Sites: PLoS ONE, BMC, PeerJ… Nature, IEEE, Elsevier…
scrapers; queries; taggers
abstract; methods; references; Captioned Figures (Fig. 1); HTML tables
30,000 pages/day
Semantic ScholarlyHTML; Facts
[Diagram: component pipeline. Latest 20150908; limited in scope. Labels in reading order:]
quickscrape; Norma; Index & Transform
PDF, XML, URL, DOI, DOC, CSV; sHTML
Plugins: Sequences, Species
Bespoke Scrapers; XPath; Taggers; Per-Journal
Chemistry; Phylogenetics; Plants
AMI; BadHTML; OCR; Diagrams
CAT-alogue index; getpapers; query; Titles + links
Daily Crawl/feed; EuPMC; JToCs
Starting points for ingestion (getpapers/quickscrape/Norma)
• Search/Crawl/Feed -> PMCID, DOI, URL -> quickscrape -> CTree(PDF, HTML, XML, images/, meta) -> Norma -> CMDir(sHTML|TXT|SVG|image) (good)
• PDF, XML, TXT, HTML -> Norma -> CTree(PDF, rawHTML, TXT, images/, meta?) -> NormaOCR|TXT2HTML -> CTree(sHTML, TXT, SVG) (variable)
20150908
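The two starting points above can be sketched as plain data plus a small lookup. This is only an illustration of the routing described on the slide: the `Route` class, `choose_route` and `describe` are hypothetical names, not part of getpapers, quickscrape or Norma, and no real tool is invoked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """One ingestion route: the stages it passes through and the slide's rating."""
    stages: tuple
    quality: str


# Route for identifiers found by search/crawl/feed (rated "good" on the slide).
IDENTIFIER_ROUTE = Route(
    stages=(
        "quickscrape",
        "CTree(PDF,HTML,XML,images/,meta)",
        "Norma",
        "CMDir(sHTML|TXT|SVG|image)",
    ),
    quality="good",
)

# Route for files that are already on disk (rated "variable" on the slide).
FILE_ROUTE = Route(
    stages=(
        "Norma",
        "CTree(PDF,rawHTML,TXT,images/,meta?)",
        "NormaOCR|TXT2HTML",
        "CTree(sHTML,TXT,SVG)",
    ),
    quality="variable",
)

IDENTIFIER_KINDS = {"PMCID", "DOI", "URL"}
FILE_KINDS = {"PDF", "XML", "TXT", "HTML"}


def choose_route(kind: str) -> Route:
    """Pick the route for a starting point such as 'DOI' or 'PDF'."""
    key = kind.upper()
    if key in IDENTIFIER_KINDS:
        return IDENTIFIER_ROUTE
    if key in FILE_KINDS:
        return FILE_ROUTE
    raise ValueError(f"no ingestion route listed for {kind!r}")


def describe(kind: str) -> str:
    """Render the route as an arrow chain in the style used on the slide."""
    route = choose_route(kind)
    return " -> ".join((kind,) + route.stages) + f" [{route.quality}]"


if __name__ == "__main__":
    print(describe("DOI"))
    print(describe("PDF"))
```

Keeping the routes as data rather than code makes it easy to compare them side by side; the stage strings are copied from the slide and are labels, not commands.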
Norma Conversions
• Paper -> Scanned -> TIFF (avoid)
• PDF, TIFF, PNG -> Tesseract-N -> HTML, SVG (fast, variable)
• PDF -> PDF2SVG-N -> sHTML, SVG, images/ (slow, accurate-ish)
• PDF -> PDF2TXT-N -> TXT (fast, variable)
• PDF -> PDF2Image-N -> PNG (fast, accurate)
20150908
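The conversion list above reads naturally as a lookup table. The sketch below encodes it that way, assuming the two words after each bullet are speed and accuracy ratings; `Conversion` and `converters_for` are hypothetical helper names, and the converter names are just the labels from the slide, not executable commands. The "Paper -> Scanned -> TIFF (avoid)" line is left out because it names no converter.

```python
from typing import NamedTuple, Optional


class Conversion(NamedTuple):
    """One row of the slide: accepted inputs, converter label, outputs, ratings."""
    inputs: frozenset
    converter: str
    outputs: tuple
    speed: str
    accuracy: str


CONVERSIONS = [
    Conversion(frozenset({"PDF", "TIFF", "PNG"}), "Tesseract-N",
               ("HTML", "SVG"), "fast", "variable"),
    Conversion(frozenset({"PDF"}), "PDF2SVG-N",
               ("sHTML", "SVG", "images/"), "slow", "accurate-ish"),
    Conversion(frozenset({"PDF"}), "PDF2TXT-N",
               ("TXT",), "fast", "variable"),
    Conversion(frozenset({"PDF"}), "PDF2Image-N",
               ("PNG",), "fast", "accurate"),
]


def converters_for(input_format: str, wanted_output: Optional[str] = None) -> list:
    """Return the rows that accept input_format, optionally filtered by output."""
    fmt = input_format.upper()
    return [
        row for row in CONVERSIONS
        if fmt in row.inputs
        and (wanted_output is None or wanted_output in row.outputs)
    ]


if __name__ == "__main__":
    for row in converters_for("PDF", wanted_output="SVG"):
        print(row.converter, row.speed, row.accuracy)
```

A query such as `converters_for("PDF", "SVG")` makes the trade-off on the slide explicit: two converters can produce SVG from PDF, one rated fast but variable and the other slow but accurate-ish.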
Norma End points
• Norma -> CTree(OpenSHTML-SVG) -> everything?
• Norma -> CTree(sHTML sections) -> AMI -> all text (+ species, chemText, sequences)
• Norma -> CTree(TXT (unsectioned)) -> AMI -> bagOfWords, regex, IDs, species?
• Norma -> CTree(PNG) -> AMI -> phylo, bar/xy-plots
• Norma -> CTree(SVG) -> AMI -> phylo, bar/xy-plots, chemistry
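To see which normalized content a given kind of result depends on, the end points above can be inverted. The following is a minimal sketch: `END_POINTS`, `UNCERTAIN` and `content_types_for` are hypothetical names, the entries are copied from the slide, and the item the slide marks with a question mark is tracked separately. The "OpenSHTML-SVG -> everything?" line is omitted because it lists no specific outputs.

```python
# CTree content type -> AMI outputs listed on the slide.
END_POINTS = {
    "sHTML": ["all text", "species", "chemText", "sequences"],
    "TXT": ["bagOfWords", "regex", "IDs", "species"],
    "PNG": ["phylo", "bar/xy-plots"],
    "SVG": ["phylo", "bar/xy-plots", "chemistry"],
}

# (content type, output) pairs that the slide marks with a question mark.
UNCERTAIN = {("TXT", "species")}


def content_types_for(output: str, include_uncertain: bool = True) -> list:
    """List the CTree content types from which AMI is shown producing output."""
    found = []
    for content_type, outputs in END_POINTS.items():
        if output not in outputs:
            continue
        if not include_uncertain and (content_type, output) in UNCERTAIN:
            continue
        found.append(content_type)
    return sorted(found)


if __name__ == "__main__":
    print("phylo from:", content_types_for("phylo"))
    print("species from:", content_types_for("species", include_uncertain=False))
```

Inverting the table this way shows, for example, that chemistry is listed only for SVG, while phylo and bar/xy-plots are listed for both PNG and SVG.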
Pre/early Norma toolchain: transforming PDF and PNG into higher value components (20150908; diagram autogenerated from *.dot graph)
getpapers/quickscrape/Norma workflow (20150908; diagram autogenerated from *.dot graph)
getpapers/quickscrape/Norma: commonest uses (20150908; diagram autogenerated from *.dot graph)
AMI: inputs and outputs for common plugins (20150908; diagram autogenerated from *.dot graph)
Earlier diagrams
Probably significantly out of date, but may contain useful info.
[Earlier diagram: NORMALIZE. Labels in reading order:]
Norma: Convert PDF, XML to sHTML; Tag sections
Normalized Scientific Literature
AMI: Index, Transform, Extract, Search
PDF2SVG; XSL stylesheets; Taggers; normalization Parameters
"Permanent" Filestore; Temporary Filestore
Extracted facts; indexes
Plugins; Regex
[Earlier diagram: PDF to ScholarlyHTML. Labels in reading order:]
PDF: Non-Unicode, Pixel glyphs, No words, No structures
ScholarlyHTML; SVG; High-level graphics
PDF2SVG; characters; Sentences; Paras; tables
PNG; OCR; Tagged Sections; SVGBuilder; Captioned Figures
NORMA; XSLT1/2
[Earlier diagram: Raw HTML to ScholarlyHTML. Labels in reading order:]
Raw HTML: Not well-formed, Bad character semantics
ScholarlyHTML: Well-formed XHTML
PNG; Tagged Sections; Captioned Figures; Tables; Captioned Tables
XML; HtmlTidy; Jsoup; HtmlUnit; XSLT1/2; XSLT1/2
NORMA; Per-journal Stylesheets
RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
[Diagram labels:] Queues; Repos; Scientific literature; Science Plugins; Science Volunteers
Collaboration with Open Access Button
[Earlier diagram: CANARY pipeline. Labels in reading order:]
quickscrape; Crawl; Feed; Norma; Index & Transform
TXT, XML, URL, DOI; Scientific literature Repositories; DOC, CSV; sHTML
Plugins: Regex, Sequences, Species
Bespoke Scrapers; XPath; Per-Journal; Taggers; Per-Journal; Metadata
Chemistry; Phylogenetics; Farming
AMI; BadHTML; OCR; Diagrams
Open NORMA-lized Scientific Literature + Facts
CANARY pipeline; CAT-alogue index; PDF
Architecture of ContentMine Components contentmine.org
