MER: a Minimal Named‐Entity Recognition Tagger and Annotation Server
Francisco M. Couto, Luis F. Campos, and Andre Lamurias
LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal
BioCreative V.5 Workshop, April 26‐27, 2017
Why Minimal?
• TIPS (Technical interoperability and performance of annotation servers)
– it’s cool, we have to participate somehow 
• But we have limited computational resources
• Idea: Go Minimal
– Minimize the number of tools and steps to perform Named‐Entity Recognition (NER)
What is Minimal?
• Flexibility
– Simple input
• Autonomy 
– Minimal set of components and software dependencies
• Efficiency
– Low execution time
How Minimal?
• Only requires a lexicon as input 
– a text file
• Only two components: 
1. process the lexicon (offline)
2. produce the annotations (on‐the‐fly)
• GNU Bash shell script
– Using high performance grep and awk tools
– Portability:  any Unix‐like operating system
Input
• lexicon text file
α‐maltose
nicotinic acid
nicotinic acid D‐ribonucleotide
nicotinic acid‐adenine dinucleotide phosphate
Pre‐Processing
== one‐word (...word1.txt)
α.maltose
== two‐word (...word2.txt)
nicotinic acid
== more‐words (...words.txt)
nicotinic acid d.ribonucleotide
nicotinic acid.adenine dinucleotide phosphate
== first‐two‐words (...words2.txt)
nicotinic acid
nicotinic acid.adenine
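The split shown above can be sketched in a few lines. This is a Python illustration, not the authors' Bash script: the function names `normalize` and `preprocess` and the bucket keys are hypothetical, and the normalization rule (lowercase, then replace every character that is neither alphanumeric nor whitespace with `.`) is inferred from the example lists rather than taken from the MER source.

```python
def normalize(term):
    """Lowercase a term and replace each character that is neither
    alphanumeric nor whitespace with '.', so that it can act as an
    'any character' wildcard in a regular expression."""
    return "".join(
        c if c.isalnum() or c.isspace() else "." for c in term.lower()
    )


def preprocess(terms):
    """Split lexicon terms into four lists by word count (assumed layout)."""
    buckets = {"word1": [], "word2": [], "words": [], "words2": []}
    for term in terms:
        words = normalize(term).split()
        if not words:
            continue  # skip blank lexicon lines
        norm = " ".join(words)
        if len(words) == 1:
            buckets["word1"].append(norm)
        elif len(words) == 2:
            buckets["word2"].append(norm)
        else:
            buckets["words"].append(norm)
            # Keep the first two words as a cheap pre-filter for long terms.
            first_two = " ".join(words[:2])
            if first_two not in buckets["words2"]:
                buckets["words2"].append(first_two)
    return buckets


lexicon = [
    "α-maltose",
    "nicotinic acid",
    "nicotinic acid D-ribonucleotide",
    "nicotinic acid-adenine dinucleotide phosphate",
]
for name, entries in preprocess(lexicon).items():
    print(name, entries)
```

Run on the four-term lexicon from the Input slide, the sketch yields the same four lists as the slide; the first-two-words list lets a matcher discard most long terms before attempting a full comparison.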
Recognition
• Common Solution
– Apply grep directly to the input text
– execution time is proportional to the size of the lexicon
• Inverted Solution
– input text as patterns matched against the lexicon
– more than 100 times faster
• TIPS chemical lexicon
Input text as patterns
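The inverted idea can be illustrated as follows. This is a Python sketch under stated assumptions, not MER's grep-based implementation: a set lookup stands in for matching patterns against the lexicon files, tokenization is a naive whitespace split, and `candidate_phrases`, `recognize`, and the `max_words` limit are hypothetical names and parameters. The point it shows is that the number of candidates depends on the length of the input text, not on the size of the lexicon.

```python
def candidate_phrases(text, max_words=4):
    """Yield every run of 1..max_words consecutive words of the text."""
    words = text.lower().split()
    for n in range(1, max_words + 1):
        for start in range(len(words) - n + 1):
            yield " ".join(words[start:start + n])


def recognize(text, lexicon_terms, max_words=4):
    """Return the lexicon terms that occur in the text.

    The text supplies the patterns; the lexicon is only queried,
    never scanned term by term."""
    lexicon = {term.lower() for term in lexicon_terms}
    return {p for p in candidate_phrases(text, max_words) if p in lexicon}


lexicon_terms = [
    "α-maltose",
    "nicotinic acid",
    "nicotinic acid D-ribonucleotide",
    "nicotinic acid-adenine dinucleotide phosphate",
]
text = ("α-maltose and nicotinic acid D-ribonucleotide was found, "
        "but not nicotinic acid")
print(recognize(text, lexicon_terms))
```

An 11-word text produces at most 38 candidates with `max_words=4`, regardless of whether the lexicon holds four terms or a million.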
Output
./get_entities.sh 'α‐maltose and nicotinic acid D‐ribonucleotide was found, but not nicotinic acid' lexicon
0       9       α‐maltose
14      28      nicotinic acid
65      79      nicotinic acid
14      45      nicotinic acid D‐ribonucleotide
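Each output row appears to hold the start offset, the end offset, and the matched text. A small Python sketch (again not the authors' script; `annotate` is a hypothetical name, and the word-boundary rule is an assumption) reproduces those offsets for the example sentence:

```python
import re


def annotate(text, terms):
    """Return (start, end, matched_text) for every occurrence of each term.

    Offsets are character positions; end is exclusive."""
    rows = []
    for term in terms:
        # Require a non-word character (or the text boundary) on both sides.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            rows.append((m.start(), m.end(), m.group(0)))
    return rows


sentence = ("α-maltose and nicotinic acid D-ribonucleotide was found, "
            "but not nicotinic acid")
matched_terms = [
    "α-maltose",
    "nicotinic acid",
    "nicotinic acid D-ribonucleotide",
]
for start, end, matched in annotate(sentence, matched_terms):
    print(start, end, matched, sep="\t")
```

Note that "nicotinic acid" is reported at 14-28 even though the longer "nicotinic acid D-ribonucleotide" covers the same span at 14-45: overlapping matches are all kept, as in the output above.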
ANNOTATION SERVER
Input: Lexicons
• Cell line and cell type
– Cellosaurus
• Chemical
– HMDB, ChEBI and ChEMBL
• Disease: 
– Human Disease Ontology
• miRNA: 
– miRBase
• Protein: 
– Protein Ontology
• Subcellular structure: 
– cellular component aspect of Gene Ontology
• Tissue and organ: 
– tissue and organ subsets of UBERON
https://github.com/lasigeBioTM/MER/raw/biocreative2017/data/TIPS_MER_lexicons_Jan2017.zip
Lexicon Size
• More than 1M terms, comprising more than 2M words and more than 25M characters
Input: text
• jq
– a command‐line JSON processor 
– to parse the requests
• cURL
– to download each document
• Parsers
– PubMed, Patents, PMC
https://github.com/lasigeBioTM/MER/tree/biocreative2017/external_services
• NO CACHE
Output
• Added some more columns to MER output
– BeCalm TSV format
• The score 
– 1 - 1/ln(nc),
– nc = # characters of the recognized term
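As a worked example of the score, the following Python sketch evaluates the formula for a given term. The function name `score` and the guard for very short terms are assumptions; the slides do not say how terms of one or two characters are handled, and the formula is undefined for nc = 1 and negative for nc = 2.

```python
import math


def score(term):
    """Confidence score 1 - 1/ln(nc), where nc is the term length in characters."""
    nc = len(term)
    if nc < 2:
        # ln(1) = 0, so the formula is undefined for single characters.
        raise ValueError("score is undefined for terms shorter than 2 characters")
    return 1 - 1 / math.log(nc)


for term in ["α-maltose", "nicotinic acid", "nicotinic acid D-ribonucleotide"]:
    print(f"{term}\t{score(term):.2f}")
```

For the 14-character term "nicotinic acid" this gives about 0.62, and the score rises slowly with length, so longer (usually more specific) terms rank higher.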
Infrastructure
• Three virtual machines (VMs)
– Each had 8 GB of RAM and 4 CPUs @ 1.7 GHz
– CentOS Linux release 7.3.1611 (Core)
• One VM (primary) processes the requests, distributes the jobs, and executes MER.
• The other two VMs (secondary) only execute MER.
• NGINX as HTTP server running CGI scripts 
– high performance
• Task Spooler to manage and distribute jobs
Results
• April 21, 2017
• less than 3 seconds on average
Web Tool
http://labs.fc.ul.pt/mer/
RESTful Web service
Conclusions
• MER: a minimal NER tagger
– Flexible: extensible to any lexicon
– Autonomous: only requires a GNU Bash shell
– Efficient: high‐performance capacity of grep
• Annotation Server 
– developed in‐house 
– minimal software dependencies 
– open‐source
• Future: entity linking functionality in MER
Acknowledgments
• Portuguese National Distributed Computing Infrastructure (http://www.incd.pt)
• Links
– https://github.com/lasigeBioTM/MER
– http://labs.fc.ul.pt/mer/
