Rajarshi	Guha,	NIH-NCATS
LINCS	Data	Science	Research	Webinar	Series
75% of protein research is still focused on the 10% of genes known before the human genome was mapped
AM	Edwards	et	al,	Nature,	2011
This	prompted	NIH	to	start	
the	Illuminating	the	Druggable
Genome	Initiative
The	Need	for	the	IDG
IDG	Knowledge	Management	Center
Target	Development	Level
• Protein	classification	
schemes	are	based	on	
structural	and	functional	
criteria.	
• For	therapeutic	
development,	it	is	useful	
to	understand	how	much	
and	what	types	of	data	
are	available	for	a	given	
protein,	thereby	
highlighting	well-studied	
and	understudied	targets.	
T.	Oprea et	al.,	Nature	Rev.	Drug	Discov.	poster,		Jan	2017
Target	Development	Level
• Proteins	annotated	as	
drug	targets	are	Tclin
• Proteins	for	which	potent	
small	molecules	are	
known	are	Tchem
• Proteins	for	which	biology	
is	better	understood	are	
Tbio
• Proteins	that	lack	
antibodies,	publications	
or	Gene	RIFs	are	Tdark
T.	Oprea et	al.,	Nature	Rev.	Drug	Discov.	poster,		Jan	2017
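The four levels above can be read as a precedence ordering, checked from most to least developed. A minimal sketch under stated assumptions: the `TargetEvidence` fields and the boolean simplification are hypothetical, and the actual TDL criteria use quantitative cutoffs that these slides do not list.

```python
from dataclasses import dataclass


@dataclass
class TargetEvidence:
    """Hypothetical evidence flags for one protein (not the TCRD schema)."""
    is_annotated_drug_target: bool   # annotated as a drug target
    has_potent_small_molecule: bool  # a potent small molecule is known
    has_antibody: bool
    has_publications: bool
    has_gene_rif: bool


def target_development_level(ev: TargetEvidence) -> str:
    """Assign a TDL by checking the most-developed category first."""
    if ev.is_annotated_drug_target:
        return "Tclin"
    if ev.has_potent_small_molecule:
        return "Tchem"
    # Assumption: any one of these evidence types counts as known biology.
    if ev.has_antibody or ev.has_publications or ev.has_gene_rif:
        return "Tbio"
    # Nothing known: no antibodies, publications or Gene RIFs.
    return "Tdark"


example = TargetEvidence(False, False, False, True, False)
level = target_development_level(example)
```

Checking Tclin first means a protein that is both a drug target and has potent small molecules is reported once, at its highest level.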
TDL:	External	Validation
T.	Oprea et	al.,	Nature	Rev.	Drug	Discov.	poster,		Jan	2017
Why	Should	ANYONE	Fund	Tdark?
Data	from	Tudor	Oprea &	Christian	Bologa
[Figure: timeline, 1995 to 2015, of example targets: Leptin, SMO, S1PR1, Orexin, PCSK9, Ghrelin]
Median	time	to	go	from	Tdark	
to	bearing	fruit	is	17	years
The	Causality	Dilemma
• Are	Tdark	proteins	underfunded	because	there	is	
no	scientific	interest	in	this	category,	or	is	the	lack	
of	knowledge	perpetuated	by	lack	of	funding?
• It	is	possible	that	the	absence	of	high	quality,	well	
characterized	molecular	probes	may	be	a	root	
cause	for	this	situation.	
• However,	lack	of	tools	leads	to	lack	of	interest,	and	
lack	of	interest	diminishes	the	probability	of	such	
tools	being	developed
Tudor	Oprea
There	is	a	Knowledge	Deficit
• >	37%	of	proteins	are	poorly	described	(Tdark)
• ~10%	of	the	Proteome	(Tclin	&	Tchem)	can	be	
targeted	by	small	molecules
• 10% of NIH R01s (2011-2015) were awarded to study 11 targets (out of 7,934 targets funded in total)
• Dark	genes	need	funding	and	patience
https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkw1072
Entity browsing (filterable & linked)
Search (full text, auto-suggest)
Detailed view of entities
Built on top of a robust REST API
Pharos	- An	Interface	to	the	KMC
Current Status
191 facets
17.8 GB database
30 GB Lucene indexes
36K LoC (Java)
14K LoC (Scala)
Image available
Source code available
20,120	targets
15,094	diseases
2.3M	publications
4,500	drugs
What’s	Included?
• Pharos	presents	data	from	a	variety	of	sources,	
integrated	by	U.	New	Mexico	
• Primary	focus	is	the	protein	target
• Wherever	possible,	targets	are	linked	to	other	
entities	(which	are	also	interlinked)
• Small	molecules,	Diseases,	Publications	
• Target	related	data	include
• Identifiers,	ontology	terms,	sequence,	expression	data,	
publications	(curated	&	text	mined),	phenotypes,	PPI
Data	Sources
Full data source list at http://targetcentral.ws/Pharos
Drug Target Ontology
TCRD
DISEASE
TIN-X
Interactions	inside	&	
outside	the	IDG
Target	Audience
Biologists & Clinical Researchers
• Characterize	&	
validate	novel	
targets
• Identify	key	small	
molecules	or	
biologics
Informatics	
Scientists
• Data	mining
• Support	target	
validation	
projects
Program	Staff
• Explore	the	
research	
landscape
• New	directions	
for	research &	
funding
Different	Ways	to	Use	Pharos
Random
Access
Direct
Access
Manual Interaction Programmatic Interaction
Search Entity Info
Precomputation converts analysis into browsing
Supporting	Both	Types	of	Users
• Efficient	full	text	search,	coupled	to	relevant	auto-
suggestion
• Primary	entry	point	when	exploring	
and	for	hypothesis	generation
• Extensive	list	of	facets
• Supports	easy	construction	of	
complex	filtering	rules
• Extensive	details	for	each	
target
• Linked	to	external	and	
internal resources
Batch	Search
• Easily pull up data on multiple targets in one go
Sequence	Search
• Query is ABL, ARG (from the LINCS KiNativ dataset), similarity > 0.7
Structure	Search
• Search by	substructure	or	similarity
• Identify	targets	enriched	in	a	scaffold
Visualization
• Key	requirement	for	efficient	exploration,	summary
• Increase	information	density	in	limited	screen	real	
estate,	take	context	into	account
• Interactivity	is	desirable,	high	quality	for	easy	
inclusion	in	documents
• Simple	is	better	than	fancy	but	pretty	pictures	have	
value,	make	for	a	better	experience
• Integrate	and	link	to	external	visualization
• TinX,	Harmonizome
Visualization	Highlights
Visualization	dashboard	– filters	appropriately
represented,	plots	act	as	filters
Inline	visualization	to	increase	information	density
Summary	visualizations	
overlay	multiple	dimensions	
and	can	be	context	aware
Integrating	External	Tools
Tclin,	Kinase
Tdark,	GPCR
Pharos
TinX
Enhanced	Documentation
Entity	Dossier
• As you explore the knowledge base it’s useful to keep track of data
• Pharos	implements	a	dossier	function
• Analogous	to	e-commerce	shopping	carts
• Support	for	task-specific	dossiers
• Download	a	dossier	as	a		ZIP	file
Entity	Dossier
Multiple	dossiers
Set operations
Visualization tools
Download
Longer	term,	dossiers	will	be	automatically	enriched	with	
linked	items	and	recommendations
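Set operations on dossiers can be pictured with ordinary sets. This is only an illustration: the dossier names and target identifiers are made up, and nothing here reflects how Pharos stores dossiers internally.

```python
# Each dossier is modelled as a set of target identifiers (made-up names).
dossiers = {
    "kinase_project": {"TARGET_A", "TARGET_B", "TARGET_C"},
    "understudied": {"TARGET_B", "TARGET_D"},
}

# Targets present in both dossiers
in_both = dossiers["kinase_project"] & dossiers["understudied"]
# Targets present in at least one dossier
in_either = dossiers["kinase_project"] | dossiers["understudied"]
# Targets in the kinase dossier that are not flagged as understudied
only_kinase = dossiers["kinase_project"] - dossiers["understudied"]
```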
Dossiers	as	Context
Overlay	data	from	targets	in	a	dossier
Quantifying	Knowledge	About	Targets
• The	Harmonizome represents	the	data	available	
around	a	given	target
• Compute	the	number	of	associations	for	each	gene	in	
a	data	source	and	convert	to	ECDF
• Precomputed in	TCRD
• Used	by	Harmonogram and	radar	chart	viz
• Define	the	Knowledge	Availability	Score	(KAS)
$$\mathrm{KAS}_T = \sum_{i=1}^{n} C_i$$
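The KAS definition can be sketched in a few lines, assuming that C_i is the ECDF value of target T's association count in data source i and that n is the number of data sources. The function names and the toy count matrix are hypothetical. In this sketch a gene with zero associations still receives a small nonzero ECDF value, which may differ from how TCRD precomputes it.

```python
import numpy as np


def ecdf_values(counts: np.ndarray) -> np.ndarray:
    """Empirical CDF at each count: fraction of genes with count <= x."""
    sorted_counts = np.sort(counts)
    # side="right" gives, for each entry, how many values are <= it
    ranks = np.searchsorted(sorted_counts, counts, side="right")
    return ranks / counts.size


def knowledge_availability_scores(assoc_counts: np.ndarray) -> np.ndarray:
    """assoc_counts: genes x sources matrix of association counts.
    Returns one KAS per gene: the sum over sources of the ECDF value C_i."""
    c = np.column_stack([ecdf_values(assoc_counts[:, j])
                         for j in range(assoc_counts.shape[1])])
    return c.sum(axis=1)


# Toy data: 4 genes (rows) x 3 data sources (columns), invented for illustration.
counts = np.array([
    [0, 2, 10],
    [5, 2, 0],
    [1, 0, 3],
    [9, 7, 3],
])
kas = knowledge_availability_scores(counts)
```

Converting counts to ECDF values puts every source on a 0 to 1 scale, so a source with very large raw counts does not dominate the sum.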
Knowledge	Availability	Score
[Figure: histogram of the Knowledge Availability Score (x-axis, 0 to 60) against Frequency (y-axis, 0 to 150)]
Knowledge	Availability	in	Pharos
KAS	vs.	Other	measures
• Best	correlation	with	Pubmed count
• As	expected,	data	for		Tdark is	noisier
• Of	interest	are	those	targets	with	higher	values	of	
knowledge	availability	but	small	values	of	another	
metric
• In	particular	the	Jensen	
Pubmed Score	seems	to	
lead	to	such	targets
[Figure: Jensen PubMed Score (y-axis, ticks at 1, 100, 10000) against Knowledge Availability Score (x-axis, 0 to 60); legend: Tbio, Tchem, Tclin, Tdark]
(Dis)similarity	in	Knowledge	Space
• There	are	114	unique	data	sources	via	the	
Harmonizome
• We	represent	each	target	as	a	114-element	vector
• Where a source has no data for the target, we set its value to 0
• Not	necessarily	the	best	choice,	since	it's	really	missing	
data
• Uniform	weighting	may	not	be	appropriate
• Compute	a	pairwise	Euclidean	distance	matrix	or	
cosine	similarity	matrix	for	1757	targets
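The vector construction and the two matrices described above can be sketched as follows. The source names, target names and values are toy data, and the helper names are hypothetical. Only 3 sources are used here instead of 114.

```python
import numpy as np


def build_knowledge_matrix(profiles, sources):
    """Turn {target: {source: value}} into a dense targets x sources matrix.
    Sources with no data for a target are filled with 0, as on the slide."""
    targets = sorted(profiles)
    matrix = np.zeros((len(targets), len(sources)))
    for i, t in enumerate(targets):
        for j, s in enumerate(sources):
            matrix[i, j] = profiles[t].get(s, 0.0)
    return targets, matrix


def euclidean_distance_matrix(x):
    """Pairwise Euclidean distances between rows, via broadcasting."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def cosine_similarity_matrix(x):
    """Pairwise cosine similarities between rows."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows get similarity 0 instead of NaN
    unit = x / norms
    return unit @ unit.T


sources = ["src_a", "src_b", "src_c"]
profiles = {
    "T1": {"src_a": 1.0},
    "T2": {"src_a": 1.0, "src_b": 1.0},
    "T3": {"src_c": 2.0},
}
targets, x = build_knowledge_matrix(profiles, sources)
dist = euclidean_distance_matrix(x)
cos = cosine_similarity_matrix(x)
```

The broadcasting approach builds a targets x targets x sources intermediate array, which is fine for a toy example; for the full 1757 x 114 matrix a routine such as `scipy.spatial.distance.pdist` would likely be more memory-friendly.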
Cosine	similarity	in	Knowledge	Space
Similarity	in	Knowledge	Space
• Consider	Euclidean	distance	matrix
• Of	particular	interest	is	to	identify	Tdark targets	
that	have	a	knowledge	profile	that	is	most	similar	
to	targets	that	are	not	Tdark
• 44	such	targets
• Within	this	set,	10	targets	have	a	knowledge	profile	
that	is	most	similar	to	a	Tchem or	Tclin target
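One way to find such targets, given any pairwise distance matrix and a TDL label per target, is a nearest-neighbor lookup. A sketch only: the function name, the target names and the distances are invented for illustration.

```python
import numpy as np


def dark_targets_with_lit_neighbors(dist, names, tdl):
    """Return (Tdark target, nearest target, nearest TDL) for every Tdark
    target whose closest neighbor in the distance matrix is not Tdark."""
    d = np.array(dist, dtype=float)   # copy so the caller's matrix is untouched
    np.fill_diagonal(d, np.inf)       # a target is never its own neighbor
    nearest = d.argmin(axis=1)
    hits = []
    for i, j in enumerate(nearest):
        if tdl[i] == "Tdark" and tdl[j] != "Tdark":
            hits.append((names[i], names[j], tdl[j]))
    return hits


# Toy symmetric distance matrix for four targets.
names = ["A", "B", "C", "D"]
tdl = ["Tdark", "Tclin", "Tdark", "Tdark"]
dist = [
    [0, 1, 5, 4],
    [1, 0, 6, 7],
    [5, 6, 0, 2],
    [4, 7, 2, 0],
]
hits = dark_targets_with_lit_neighbors(dist, names, tdl)
```

Filtering `hits` on the third element then separates neighbors that are Tchem or Tclin from those that are Tbio.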
Target	Similarity
• Compute	target	
similarity	in	
“Harmonizome space”
• Supports	
recommendations,	
prioritization
• Currently	extending	to	
a	generalized	Target	
Knowledge	Vector	
approach
Tdark targets	whose	most	
similar	target	is	not	Tdark
What	might	this	mean?
• Publications	are	one	way	to	prioritize	targets
• But	we	should	also	consider	the	extent	of	data	
around	targets
• Alternatively,	all	(or	multiple	types	of)		the	data	
about	a	target	is	subsumed	into	a	small	set	of	
publications
• One	paper	might	include	RNAseq,	CNV,	pharmacology	
• Publications	lag	data
• Could Tdark targets with a relatively high knowledge availability value but a low publication-based score be rising stars?
Next	Steps	- Target	Knowledge	Vectors
• Based	on	sparse	vector	representation	of	data	
availability,	applied	to	20K	targets
• A	target	is	a	document	mixture	of	discrete	and	
continuous	variable	descriptors
• Set	of	facet	values/terms	and	frequencies
• Amino	acid	sequence	length	and	individual	AA	residue	
profiles
• Counts	of	related	publications,	ligands,	Xtals,	diseases,	
protein-protein	interactions,	etc.
• Similar to TF-IDF, facet value frequencies are inversely weighted by popularity
• The	similarity	is	calculated	as	generalized	Tanimoto
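The weighting and similarity steps might look like the sketch below. The slides do not say which generalization of Tanimoto is used; this sketch assumes the min/max form for non-negative vectors. The facet terms, frequencies and the exact IDF formula are likewise assumptions for illustration.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Inverse document frequency per term: rarer facet values weigh more."""
    n = len(documents)
    # Iterating a dict yields its keys, so each term counts once per target.
    df = Counter(term for doc in documents.values() for term in doc)
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def weighted_vector(doc, idf):
    """Sparse vector of term frequency times inverse document frequency."""
    return {term: freq * idf[term] for term, freq in doc.items()}


def generalized_tanimoto(a, b):
    """Tanimoto for non-negative sparse vectors: sum(min) / sum(max)."""
    terms = set(a) | set(b)
    num = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    den = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    return num / den if den else 0.0


# Toy targets described by facet terms and their frequencies.
targets = {
    "T1": {"kinase": 1, "membrane": 2},
    "T2": {"kinase": 1, "nucleus": 1},
    "T3": {"kinase": 1, "membrane": 2},
}
idf = idf_weights(targets)
vectors = {name: weighted_vector(doc, idf) for name, doc in targets.items()}
sim_12 = generalized_tanimoto(vectors["T1"], vectors["T2"])
sim_13 = generalized_tanimoto(vectors["T1"], vectors["T3"])
```

Because the vectors are stored as dictionaries, targets with few populated descriptors stay cheap to compare, which matters when the representation is sparse across 20K targets.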
Outreach	&	Dissemination	Activities
User Feedback Deployment
Webinars Documentation
NER API for
targets & diseases
@idg_pharos
Recent	papers	to	
Pharos	links	via	
Tweets
Pharos	Usage
• Usage statistics over the last year are generally increasing
• 89K	pageviews
• 14K	sessions
• 7.5K	users
Pharos	Indexing
Now	includes	hits	in	
partner	databases	such	
as	KEGG	and	ChEMBL
The	Long	Term	Vision
• Incorporate	dependencies
between	data	types	to	support
inference	and	sophisticated	filters
• From	presentation	to	summarization
• Use	explicit	links	&	computational	
inference	to	generate	(semi-)	natural	language
summary	using	all	known	data
• Influenced	by	the	query
• The	result	is	a	biological	dashboard,	
customized	for	the	user	and	the	query
Target X has been implicated in 3
diseases related to skeletal, urological
and nervous systems. It has been
investigated in 5 in vitro assays and 2 in
vivo assays. There are 4 compounds
active against this target, 3 of which
are in clinical trials.
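A summary like the one above could, in its simplest form, be assembled from counts with templates. This is a rough sketch rather than the planned Pharos approach: the `facts` keys and helper names are hypothetical, and the pluralization is deliberately naive.

```python
def count_phrase(n, noun):
    """'1 assay' / '5 assays' (naive English pluralization by adding 's')."""
    return f"{n} {noun}" if n == 1 else f"{n} {noun}s"


def join_list(items):
    """'a', 'a and b', 'a, b and c'."""
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + " and " + items[-1]


def summarize_target(name, facts):
    """Build a short summary from counts; facts with no data are skipped."""
    sentences = []
    n_dis = facts.get("diseases", 0)
    if n_dis:
        s = f"{name} has been implicated in {count_phrase(n_dis, 'disease')}"
        systems = facts.get("systems", [])
        if systems:
            s += f" related to {join_list(systems)} systems"
        sentences.append(s + ".")
    assay_kinds = (("in_vitro", "in vitro assay"), ("in_vivo", "in vivo assay"))
    assays = [count_phrase(facts[k], label)
              for k, label in assay_kinds if facts.get(k)]
    if assays:
        sentences.append(f"It has been investigated in {join_list(assays)}.")
    n_cpd = facts.get("active_compounds", 0)
    if n_cpd:
        verb = "is" if n_cpd == 1 else "are"
        s = f"There {verb} {count_phrase(n_cpd, 'compound')} active against this target"
        n_trials = facts.get("in_trials", 0)
        if n_trials:
            s += f", {n_trials} of which {'is' if n_trials == 1 else 'are'} in clinical trials"
        sentences.append(s + ".")
    return " ".join(sentences)


facts = {
    "diseases": 3,
    "systems": ["skeletal", "urological", "nervous"],
    "in_vitro": 5,
    "in_vivo": 2,
    "active_compounds": 4,
    "in_trials": 3,
}
summary = summarize_target("Target X", facts)
```

Making the summary depend on the query, as the slide proposes, would mean choosing which sentences to emit based on what the user searched for; that selection step is not attempted here.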
Feedback
• Explore	the	UI,	try	it,	break	it,	and	let	us	know	what	
works	and	what	doesn’t
• Are	there	data	types	and	relations	that	would	help	
you	but	are	not	available?
• Nguyen	&	Mathias	et	al,	Nucl.	Acids	Res.,	2017
https://pharos.nih.gov
https://spotlite.nih.gov/pharos
https://hub.docker.com/r/ncats/pharos/
pharos@nih.gov
@idg_pharos
Acknowledgements
• Dac-Trung Nguyen, Kyle Brimacombe, Timothy Sheils, Geetha Mandava, Noel Southall, Ajit Jadhav
• Steve Mathias, Oleg Ursu, Jeremy Yang, Christian Bologa, Daniel Cannon, Tudor Oprea
• Nicholas Fernandez, Andrew Rouillard, Avi Ma'ayan
• Finkbeiner lab,	Tomita	Lab
• Ajay	Pillai,	Aaron	Pawlyk,	Christine	Colvis

Pharos – A Torch to Use in Your Journey In the Dark Genome