2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
Amarnath Gupta
NIF is an initiative of the NIH Blueprint consortium of institutes
What types of resources (data, tools, materials, services) are available to the
neuroscience community?
How many are there?
What domains do they cover? What domains do they not cover?
Where are they?
Web sites
Databases
Literature
Supplementary material
Who uses them?
Who creates them?
How can we find them?
How can we make them better in the future?
http://neuinfo.org
• PDF files
• Desk drawers
http://neuinfo.org
A portal for finding and using neuroscience resources
• A consistent framework for describing resources
• Provides simultaneous search of multiple types of information, organized by category
• Supported by an expansive ontology for neuroscience
• Utilizes advanced technologies to search the “hidden web”
UCSD, Yale, Cal Tech, George Mason, Washington Univ
Literature
Database
Federation
Registry
How do we create an information infrastructure that is
able to connect a person or a community with the
resources they need to accomplish their task at hand?
Resource
Anything that is tangible and accessible
a product, a person, an institution, a piece of data, a connection …
Information Infrastructure
Enables the entire life cycle of information, from acquisition to (potential) archival
Allows people to find, access, understand and work with
information
A domain-specific example:
Credit: Bhaskar Ghosh, LinkedIn
Oracle, MySQL, Voldemort, Espresso
Zoie, Bobo, Sensei, GraphDB
Kafka, Databus
Hadoop, Teradata, Azkaban
[Figure: LinkedIn data flow diagram. Labels shown: Oracle or Espresso, Updates, Databus, Data Change Events, Standardization, Search Index, Graph Index, Read Replica, Web Servers, Teradata. Credit: Hien Luu, Sid Anand, LinkedIn]
• Where is data about X?
• How does Y relate to Z?
• Accumulate and Analyze
• Compare X and Y
• Subscribe to topic T
• Recommend Resource
• Funding reports
• Search and Explore
• News
• Resource Promotion
• Utilization Search
• Cross-Utilization
• Experiential Services
Researcher Activity
Resource Activity
Data
Data Infrastructure
Analysis & Science
The problem starts here
Given an infinite number of web accessible
resources, which are relevant for Neuroscience?
Easy Case
A resource assigned by a trusted source
Reasonably Easy Case
A resource recommended by a potentially trustable
source
Not-so-easy Case
“Mine” for resources from literature/crawler
Auto-filter by semantic classification
Fully validate by curatorial staff/community
Is this a potential neuroscience resource?
Two-pronged classification problem
If it belongs to the class, a reasonable portion of the
document term vector will align with a neuroscience
vocabulary
Necessary but not sufficient
The “spread” of the document term vector with respect
to a reasonable domain ontology will be limited
Pragmatic problems
What additional (recognizable) descriptors does the
resource have?
Is the resource “current”?
Are there “other ways” of getting to the content of the
site?
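The two-pronged test can be sketched in a few lines. This is only an illustration under stated assumptions: the toy vocabulary, the branch names, and the thresholds are hypothetical, and "spread" is approximated here as the number of distinct top-level ontology branches touched by the matched terms, which may not be how NIF measures it.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of the two-pronged test: vocabulary alignment plus limited ontology "spread". */
final class ResourceClassifierSketch {
    // Hypothetical toy vocabulary: term -> top-level ontology branch it belongs to.
    static final Map<String, String> VOCAB = Map.of(
        "neuron", "cell",
        "astrocyte", "cell",
        "hippocampus", "anatomy",
        "cortex", "anatomy",
        "synapse", "subcellular");

    /** Prong 1: fraction of document tokens that appear in the vocabulary. */
    static double alignment(List<String> tokens) {
        if (tokens.isEmpty()) return 0.0;
        long hits = tokens.stream().filter(VOCAB::containsKey).count();
        return (double) hits / tokens.size();
    }

    /** Prong 2: number of distinct ontology branches touched by the matched terms. */
    static int spread(List<String> tokens) {
        Set<String> branches = new HashSet<>();
        for (String token : tokens) {
            String branch = VOCAB.get(token);
            if (branch != null) branches.add(branch);
        }
        return branches.size();
    }

    /** A candidate needs enough aligned terms AND a spread within the given bound. */
    static boolean isCandidate(List<String> tokens, double minAlignment, int maxSpread) {
        return alignment(tokens) >= minAlignment && spread(tokens) <= maxSpread;
    }

    public static void main(String[] args) {
        List<String> doc = List.of("neuron", "astrocyte", "imaging", "hippocampus");
        System.out.println(alignment(doc) + " " + spread(doc) + " " + isCandidate(doc, 0.5, 2));
    }
}
```

Alignment alone is the "necessary but not sufficient" check; the spread bound is what rejects pages that mention many unrelated parts of the ontology in passing.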
DISCO – NIF’s Ingestion Manager and Data
Tracker
“Relationalizes” incoming data when needed
Feeds the curators’ dashboard for ingestion, update
and index management
Executes automatic updates per schedule
Keeps track of chains of derived views defined by
curators
Maintains annotations on data and its views
Propagates data updates to all derived views
through curator notification
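Tracking chains of derived views and propagating updates along them can be pictured as a walk over a dependency graph. The sketch below is not DISCO's implementation; the class, method, and view names are hypothetical, and "notification" is reduced to returning the list of views a curator would be asked to refresh.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: a source update is propagated to every view derived from it, directly or indirectly. */
final class DerivedViewTrackerSketch {
    // source or view name -> views derived directly from it (hypothetical names)
    private final Map<String, List<String>> derived = new HashMap<>();

    void addDerivation(String source, String view) {
        derived.computeIfAbsent(source, k -> new ArrayList<>()).add(view);
    }

    /** Returns every view downstream of the updated source, in breadth-first order. */
    List<String> viewsToRefresh(String updatedSource) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(updatedSource);
        seen.add(updatedSource);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String view : derived.getOrDefault(current, List.of())) {
                if (seen.add(view)) { // visit each view once, even if reachable twice
                    order.add(view);
                    queue.add(view);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        DerivedViewTrackerSketch tracker = new DerivedViewTrackerSketch();
        tracker.addDerivation("raw_table", "cleaned_view");
        tracker.addDerivation("cleaned_view", "summary_view");
        for (String view : tracker.viewsToRefresh("raw_table")) {
            System.out.println("notify curator: refresh " + view);
        }
    }
}
```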
The data come
From too many disparate sources
6000+ neuroscience resources
In too many different formats and models
Relational, XML, RDF, Text, domain-specific, …
Having all too diverse semantics
“GRM1”: a string, a gene, a chromosomal region, a list of interesting
SNPs in mice?
There is a massive data integration problem because only
integration of data will lead to insight
What possible drugs might be repurposed for human inclusion body
myopathy (HIBM)?
Data about/from the following to be integrated
Organisms, diseases, cross-organism anatomy, phenotypes, genes,
proteins, interactions, pathways, genomic variations, pharmaceutical
compounds, assays and publications
Data integration using schema mappings for
similar resources
Semantics-based integration
Using ontologies as the unifying structure
Using vocabularies as the connecting substrate
Using linked-data graph where applicable
Link inference
Link prediction
Using machine learning
Term association using active learning with
conditional random fields
When possible
Commercial solutions exist
They do not scale well as the number of schemas increases
Many groups working on it
Data Tamer – an MIT-Intel partnership project on
Big Data
Matches schema where possible
Crowd-sources ambiguous cases
Performs entity-consolidation by data clustering
Our approach
Curators build small-scale schema mappings using the ontology and Google Refine
Gather the language of the domain
Terms, their variations and term properties
“Verbs”, relationships, their linguistic variations and
their properties
Constraints that hold in the domain
Sources
Ontologies
Folksonomies
image tags, text annotations, …
Literature
figure and table captions
Data (structured or semi-structured)
What is an ontology?
• represents a domain terminology
• a node- and edge-labeled graph
• edge labels are associated with rules specified in Description Logic
• additional constraints can be specified with some subset of FOL
• embeds a spanning DAG under the is-a relationship over non-literal nodes
• represented using OWL (Web Ontology Language) or RDF
A graph query engine for ontological graphs
Accepts OWL, RDFS, RDF
Ingestors for special cases can be developed
MESH XML DTD
Has a service API
1. Get all human phenotypes
http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/concepts/search/HP_
2. Find superclasses of "exocrine pancreas"
First: http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/concepts/term/exocrine+pancreas
Then, from the result of the above: http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/superclasses/UBERON_0000017
3. Find all direct properties of UBERON_0000017
http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/children/UBERON_0000017
COMBINED WITH
http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/parents/UBERON_0000017
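The three calls above follow a regular path pattern, so a client can assemble them with small helpers. The sketch below only builds the request strings and makes no network call. The helper names are hypothetical, and the base URL is an assumption: it is the 2013 staging host as read from the slide and may no longer resolve.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Sketch: builds OntoQuest-style request URLs following the path patterns on the slide. */
final class OntoQuestUrlSketch {
    // Assumption: staging host as read from the slide; treat it as a placeholder.
    static final String BASE = "http://nif-services-stage.neuinfo.org/ontoquest-lamhdi";

    /** Concepts whose identifier starts with the given prefix, e.g. "HP_". */
    static String searchByPrefix(String prefix) {
        return BASE + "/concepts/search/" + prefix;
    }

    /** Concept lookup by label; form encoding turns spaces into '+', as on the slide. */
    static String conceptByTerm(String term) {
        return BASE + "/concepts/term/" + URLEncoder.encode(term, StandardCharsets.UTF_8);
    }

    /** Relationship lookup; kind is one of the path segments shown on the slide. */
    static String relation(String kind, String conceptId) {
        return BASE + "/rel/" + kind + "/" + conceptId;
    }

    public static void main(String[] args) {
        System.out.println(searchByPrefix("HP_"));
        System.out.println(conceptByTerm("exocrine pancreas"));
        System.out.println(relation("superclasses", "UBERON_0000017"));
        // Step 3 on the slide combines the results of these two calls.
        System.out.println(relation("children", "UBERON_0000017"));
        System.out.println(relation("parents", "UBERON_0000017"));
    }
}
```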
Which skeletal structures in the zebrafish
develop from the mesenchyme?
Return $x where
($x (subclassOf)* 'skeletal element') and
($x develops_from* 'ZFA:mesenchyme')
Return $y where
($y subclassOf* $x)
Query Rewriting
Return $x where
($x develops_from* 'mesenchyme') and
($x has_ontology $o) and ($x equivalent_to $z) and ($z has_ontology 'ZFA') and
($x (subclassOf)* 'skeletal element')
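The starred edge labels denote zero or more hops along that label, so each clause is a reachability test on the ontology graph. A minimal in-memory sketch of the first query follows. It is not OntoQuest's engine, the class and method names are hypothetical, and the three-node graph is toy data rather than an excerpt of ZFA.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

/** Sketch: evaluates (x subClassOf* A) and (x develops_from* B) on a toy edge-labeled graph. */
final class PathQuerySketch {
    // edge label -> (subject -> objects)
    private final Map<String, Map<String, Set<String>>> edges = new HashMap<>();

    void add(String subject, String label, String object) {
        edges.computeIfAbsent(label, k -> new HashMap<>())
             .computeIfAbsent(subject, k -> new HashSet<>())
             .add(object);
    }

    /** True if target is reachable from start over zero or more edges with the given label. */
    boolean reaches(String start, String label, String target) {
        Map<String, Set<String>> byLabel = edges.getOrDefault(label, Map.of());
        Deque<String> pending = new ArrayDeque<>(List.of(start));
        Set<String> seen = new HashSet<>();
        while (!pending.isEmpty()) {
            String node = pending.pop();
            if (node.equals(target)) return true;
            if (seen.add(node)) pending.addAll(byLabel.getOrDefault(node, Set.of()));
        }
        return false;
    }

    /** All candidate nodes satisfying both starred clauses. */
    SortedSet<String> query(Collection<String> nodes, String superClass, String origin) {
        SortedSet<String> result = new TreeSet<>();
        for (String x : nodes) {
            if (reaches(x, "subClassOf", superClass) && reaches(x, "develops_from", origin)) {
                result.add(x);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PathQuerySketch graph = new PathQuerySketch();
        graph.add("bone element", "subClassOf", "skeletal element");
        graph.add("intramembranous bone", "subClassOf", "bone element");
        graph.add("cartilage element", "subClassOf", "skeletal element");
        graph.add("intramembranous bone", "develops_from", "mesenchyme"); // toy assertion
        List<String> nodes = List.of("bone element", "intramembranous bone", "cartilage element", "mesenchyme");
        System.out.println(graph.query(nodes, "skeletal element", "mesenchyme"));
    }
}
```

The rewritten query on the slide adds the equivalence and has_ontology clauses; in this sketch those would be two more labels checked the same way.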
“develops_from”
http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/edge-relation/id/RO_0002202
Subclasses of “skeletal element” (incl. its equivalenceClasses) in ZFA
http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/subclasses/term/skeletal%20element?level=3
<relationship>
  <subject InternalId="778440-1" id="ZFA_0001635">intramembranous bone</subject>
  <property InternalId="5389-15" id="subClassOf">subClassOf</property>
  <object InternalId="777874-1" id="ZFA_0001514">bone element</object>
</relationship>
“equivalenceClass”
http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/children/term/skeletal%20element?level=1
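A client consuming the service has to turn each <relationship> element into a subject, property, object triple. The sketch below does this with the JDK's built-in DOM parser. It assumes the response contains elements shaped exactly like the fragment on this slide; the real response envelope is not shown in the deck, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Sketch: extracts the ids of subject, property and object from one relationship element. */
final class RelationshipXmlSketch {
    // The fragment shown on the slide, used here as sample input.
    static final String SAMPLE =
        "<relationship>"
        + "<subject InternalId=\"778440-1\" id=\"ZFA_0001635\">intramembranous bone</subject>"
        + "<property InternalId=\"5389-15\" id=\"subClassOf\">subClassOf</property>"
        + "<object InternalId=\"777874-1\" id=\"ZFA_0001514\">bone element</object>"
        + "</relationship>";

    /** Returns {subjectId, propertyId, objectId}. */
    static String[] parseTriple(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();
            return new String[] {idOf(root, "subject"), idOf(root, "property"), idOf(root, "object")};
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalStateException("could not parse relationship XML", e);
        }
    }

    private static String idOf(Element parent, String tag) {
        Element element = (Element) parent.getElementsByTagName(tag).item(0);
        return element.getAttribute("id");
    }

    public static void main(String[] args) {
        String[] triple = parseTriple(SAMPLE);
        System.out.println(triple[0] + " " + triple[1] + " " + triple[2]);
    }
}
```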
Tagging of Antibody Records using a machine
learning technique with Conditional Random
Fields
A high-level statement
The ontology O is a graph of concepts and inter-concept relationships
The data is a semistructured object S with groupings
A mapping structure is a graph of mapping edges from S to O + intra-source map edges + intra-ontology map edges
Table: protein(ID, symbol, name, pathway)
Mapping:
pathway mapsTo ontoID(Pathway).
(exhibitedIn(ontoID(Pathway), ontoID(mammal))
AND occursIn(value(ID), value(pathway)))
What are the top 10 genes you have most information about?
Which sources have information about “genes”?
Find the ontology term for gene → CHEBI_23367
Call http://cm.neuinfo.org:8080/cm_services/column/mapping/ontoterm?ontologyTermId=CHEBI_23367
For each table and column thus found, call
http://cm.neuinfo.org:8080/cm_services/column/valuefreqs?srcNifId=nlx_146253&tableNifId=nlx_146253-1&columnName=transgenic_line
Now group and rank merge, in parallel if needed
Get top 10
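The "group and rank merge" step can be sketched as summing the per-column value frequencies and keeping the k largest totals. This is a sequential illustration only: the input maps stand in for the results of the value-frequency calls, and the gene symbols and counts in the demo are made up.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Sketch: merges value frequencies from several columns and returns the k most frequent values. */
final class RankMergeSketch {
    static List<Map.Entry<String, Long>> topK(List<Map<String, Long>> perColumnFreqs, int k) {
        // Group: add up the counts of each value across all columns.
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> freqs : perColumnFreqs) {
            freqs.forEach((value, count) -> merged.merge(value, count, Long::sum));
        }
        // Rank: highest total first, ties broken alphabetically for a stable result.
        return merged.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(k)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, Long>> columns = List.of(
            Map.of("GRM1", 3L, "BDNF", 5L),   // made-up counts from one column
            Map.of("GRM1", 4L, "APP", 1L));   // made-up counts from another column
        for (Map.Entry<String, Long> entry : topK(columns, 10)) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Because the merge is a per-key sum, the per-column maps could be combined in parallel and merged afterwards without changing the result.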
Lucene Query Language
FIND:(image video) hippocampus
anatomy:hippocampus component:"plasma membrane"
anatomy::organism:human
anatomy:hippocampus[::organism:human]
RELATED:(Tenascin rabbit) RELATED::measuredBy:("cell signaling" cytometry)
+
FIND:(image video) hippocampus
Find the ontological class of the query term “hippocampus”
Ans: anatomical_entity
Does “hippocampus” have a non-empty has_part tree underneath?
Every node in the ontology keeps approximate statistics of descendant counts across the various edge labels
Rewrite query to:
FIND: (image video) has_part*(synonyms(hippocampus))
Issue:
A cell is a part of any brain region.
Should the expansion include the cells of hippocampus?
No, because “cell” is a different module of the ontology whose top-level is
“cell”.
Partonomic expansion stays within module of the ontology
Final rewrite:
FIND: (image video) anatomy::(has_part*(synonyms(hippocampus)))
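The rule that partonomic expansion stays within one ontology module can be sketched as a breadth-first walk over has_part edges that skips any term whose module differs from that of the query term. Synonym lookup is left out, the class and method names are hypothetical, and the small partonomy in the demo is toy data.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Sketch: has_part* expansion of a query term, restricted to the term's own ontology module. */
final class PartonomyExpansionSketch {
    private final Map<String, List<String>> hasPart = new HashMap<>();
    private final Map<String, String> moduleOf = new HashMap<>();

    void addTerm(String term, String module) {
        moduleOf.put(term, module);
    }

    void addPart(String whole, String part) {
        hasPart.computeIfAbsent(whole, k -> new ArrayList<>()).add(part);
    }

    /** Returns the term and all of its parts, transitively, that share its module. */
    List<String> expand(String term) {
        String module = moduleOf.get(term);
        List<String> expanded = new ArrayList<>();
        Deque<String> queue = new ArrayDeque<>(List.of(term));
        Set<String> seen = new HashSet<>();
        while (!queue.isEmpty()) {
            String current = queue.poll();
            if (!seen.add(current)) continue;
            // Do not cross into another module (e.g. from "anatomy" into "cell").
            if (!Objects.equals(moduleOf.get(current), module)) continue;
            expanded.add(current);
            queue.addAll(hasPart.getOrDefault(current, List.of()));
        }
        return expanded;
    }

    public static void main(String[] args) {
        PartonomyExpansionSketch onto = new PartonomyExpansionSketch();
        onto.addTerm("hippocampus", "anatomy");
        onto.addTerm("CA1", "anatomy");
        onto.addTerm("dentate gyrus", "anatomy");
        onto.addTerm("CA1 stratum oriens", "anatomy");
        onto.addTerm("pyramidal cell", "cell");
        onto.addPart("hippocampus", "CA1");
        onto.addPart("hippocampus", "dentate gyrus");
        onto.addPart("hippocampus", "pyramidal cell");
        onto.addPart("CA1", "CA1 stratum oriens");
        System.out.println(onto.expand("hippocampus"));
    }
}
```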
The PDSP Ki database is a unique resource in the public domain which provides information on the abilities of
drugs to interact with an expanding number of molecular targets. The Ki database serves as a data warehouse for
published and internally-derived Ki, or affinity, values for a large number of drugs and drug candidates at an
expanding number of G-protein coupled receptors, ion channels, transporters and enzymes.
Which marijuana related genes
are of interest to NCI? Let’s
just use PDSP as an example.
// Free-text query used to find cannabis-related records
public static final String THC_QUERY = "cannabis thc marijuana";

public void demonstrateFederation() throws Exception {
    // Source ID of the PDSP Ki database
    final String kiDatabaseId = "nif-0000-01866-1";
    FederationQuery query = FederationQuery.builder(kiDatabaseId, THC_QUERY).get();
    for (Facets facets : searcher.getFacets(query, 10, 0, 1)) {
        // Find all receptor (gene) facets
        if (!facets.getCategory().equals("Receptor")) {
            continue;
        }
        for (Facet facet : facets.getFacets()) {
            // Get grants related to these genes from NCI.
            // The facet value is wrapped in escaped quotes so it is searched as a phrase.
            FederationQuery grantQuery = FederationQuery.builder("nif-0000-10319-1", "\"" + facet.getFacet() + "\"")
                .facet("Funding Institute", "national cancer institute")
                .exportType(ExportType.data)
                .rows(1000).get();
            TableData data = searcher.getTableData(grantQuery);
            for (FederationModelData model : data.getResult()) {
                // One CSV line per grant: gene, project number, project title
                System.out.println(facet.getFacet() + "," + model.get("project_number") + "," + model.get("project_title"));
            }
        }
    }
}
Running on Gordon (with no tweaking)
Configuration
• 1 server
• 98 GB RAM
• 24 core Intel Xeon CPU @2.8GHz
• RAID 5 SSD
• 2 Solr instances serving the Federation and literature cores
The Neuroscience Information Framework is not
really dependent on Neuroscience
Applying it to
Diabetes and kidney diseases
Model organisms
Earthcube for geo-science data
(Hopefully) social science data for economists
We need
More scalability
Improved complex query handling
A distribution framework

The Neuroscience Information Framework: A Scalable Platform for Information Exploration and Semantic Search Computing

  • 1. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California Amarnath Gupta
  • 2. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California NIF is an initiative of the NIH Blueprint consortium of institutes What types of resources (data, tools, materials, services) are available to the neuroscience community? How many are there? What domains do they cover? What domains do they not cover? Where are they? Web sites Databases Literature Supplementary material Who uses them? Who creates them? How can we find them? How can we make them better in the future? http://neuinfo.org • PDF files • Desk drawers
  • 3. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California http://neuinfo.org June10, 2013dkCOIN Investigator's Retreat3 A portal for finding and using neuroscience resources  A consistent framework for describing resources  Provides simultaneous search of multiple types of information, organized by category  Supported by an expansive ontology for neuroscience  Utilizes advanced technologies to search the “hidden web” UCSD, Yale, Cal Tech, George Mason, Washington Univ Literature Database Federation Registry
  • 4. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California How do we create an information infrastructure that is able to connect a person or a community with the resources they need to accomplish their task at hand? Resource Anything that is tangible and accessible a product, a person, an institution, a piece of data, a connection … Information Infrastructure Enables the entire life cycle of information from acquisition to (potential)archival Allows people to find, access, understand and work with information A domain-specific example:
  • 5. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California Credit: Bhaskar Ghosh, LinkedIn
  • 6. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California Credit: Bhaskar Ghosh, LinkedIn Oracle, MySQL, Voldemort, Espresso Zoei, Bobo, Sensei, GraphDB Kafka, Databus Hadoop, Teradata, Azkaban
  • 7. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California Oracle or Espresso Search Index Graph Index Read Replica Updates Standar dization Data Change Events Credit: Hien Luu, Sid Anand, LinkedIn Oracle or Espresso Updates Databus Web Servers Teradata
  • 8. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California  Where is data about X?  How does Y relate to Z?  Accumulate and Analyze  Compare X and Y  Subscribe to topic T  Recommend Resource  Funding reports  Search and Explore  News  Resource Promotion  Utilization Search  Cross-Utilization  Experiential Services Researcher Activity Resource Activity Data Data Infrastructure Analysis & Science
  • 9. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
  • 10. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California  Where is data about X?  How does Y relate to Z?  Accumulate and Analyze  Compare X and Y  Subscribe to topic T  Recommend Resource  Funding reports  Search and Explore  News  Resource Promotion  Utilization Search  Cross-Utilization  Experiential Services Researcher Activity Resource Activity Data Data Infrastructure Analysis & Science The problem starts here
  • 11. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
  • 12. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California Given an infinite number of web accessible resources, which are relevant for Neuroscience? Easy Case A resource assigned by a trusted source Reasonably Easy Case A resource recommended by a potentially trustable source Not-so-easy Case “Mine” for resources from literature/crawler Auto-filter by semantic classification Fully validate by curatorial staff/community
  • 13. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California Is this a potential neuroscience resource? Two-pronged classification problem If it belongs to the class, a reasonable portion of the document term vector will align with a neuroscience vocabulary Necessary but not sufficient The “spread” of the document term vector with respect to a reasonable domain ontology will be limited Pragmatic problems What additional (recognizable) descriptors does the resource have? Is the resource “current”? Are there “other ways” of getting to the content of the site?
  • 14. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California DISCO – NIF’s Ingestion Manager and Data Tracker “Relationalizes” incoming data when needed Feeds the curators’ dashboard for ingestion, update and index management Executes automatic updates per schedule Keeps track of chains of derived views defined by curators Maintains annotations on data and its views Propagates data updates to all derived views through curator notification
  • 15. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California The data come From too many disparate sources 6000+ neuroscience resources In too many different formats and models Relational, XML, RDF, Text, domain-specific, … Having all too diverse semantics “GRM1”: a string, a gene, a chromosomal region, a list of interesting SNPs in mice? There is a massive data integration problem because only integration of data will lead to insight What possible drugs might be repurposed for human inclusion body myopathy (HIBM)? Data about/from the following to be integrated Organisms, diseases, cross-organism anatomy, phenotypes, genes, proteins, interactions, pathways, genomic variations, pharmaceutical compounds, assays and publications
  • 16. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California Data integration using schema mappings for similar resources Semantics-based integration Using ontologies as the unifying structure Using vocabularies as the connecting substrate Using linked-data graph where applicable Link inference Link prediction Using machine learning Term association using active learning with conditional random fields When possible
  • 17. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California Commercial solutions exist They are not very scalable as number of schemas increase Many groups working on it Data Tamer – an MIT-Intel partnership project on Big Data Matches schema where possible Crowd-sources ambiguous cases Performs entity-consolidation by data clustering Our approach Curators use small scale schema mappings using ontology and Google Refine
  • 18. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California Gather the language of the domain Terms, their variations and term properties “Verbs”, relationships, their linguistic variations and their properties Constraints that hold in the domain Sources Ontologies Folksonomies image tags, text annotations, … Literature figure and table captions Data (structured or semi-structured)
  • 19. 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California What is an ontology? • represents a domain terminology • a node and edge labeled graph • edge labels are associated with rules specified in Description Logic •Additional constraints can be specified with some subset of FOL • embeds a spanning DAG under the is- a relationship over non-literal nodes •represented using OWL (Web Ontology Language) or RDF
  • 20. A graph query engine for ontological graphs. Accepts OWL, RDFS, RDF; ingestors for special cases can be developed (e.g., MeSH XML DTD). Has a service API:
      1. Get all human phenotypes
         http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/concepts/search/HP_
      2. Find superclasses of "exocrine pancreas"
         First: http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/concepts/term/exocrine+pancreas
         Then, from the result of the above: http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/superclasses/UBERON_0000017
      3. Find all direct properties of UBERON_0000017
         http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/children/UBERON_0000017
         COMBINED WITH
         http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/parents/UBERON_0000017
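The three example calls follow a regular path layout, so a client can assemble them with a small helper. The sketch below only builds the URL strings and does not contact the service. The class name `OntoQuestUrls`, its method names, and the base address are assumptions for illustration; only the path segments (`concepts/search`, `concepts/term`, `rel/<relation>`) are taken from the slide.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that assembles request URLs in the layout shown on the slide. */
final class OntoQuestUrls {
    // Assumed base address; replace with the deployment you actually use.
    static final String BASE = "http://nif-services-stage.neuinfo.org/ontoquest-lamhdi";

    /** Concept search, e.g. "HP_" for the human phenotype example. */
    static String searchConcepts(String text) {
        return BASE + "/concepts/search/" + encode(text);
    }

    /** Concept lookup by term, e.g. "exocrine pancreas". */
    static String conceptByTerm(String term) {
        return BASE + "/concepts/term/" + encode(term);
    }

    /** Relation lookup, e.g. relation "superclasses", "children" or "parents". */
    static String related(String relation, String conceptId) {
        return BASE + "/rel/" + relation + "/" + encode(conceptId);
    }

    // Form-style encoding: a space becomes '+', as in the slide's exocrine+pancreas example.
    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(searchConcepts("HP_"));
        System.out.println(conceptByTerm("exocrine pancreas"));
        System.out.println(related("superclasses", "UBERON_0000017"));
        System.out.println(related("children", "UBERON_0000017"));
        System.out.println(related("parents", "UBERON_0000017"));
    }
}
```

The `URLEncoder.encode(String, Charset)` overload used here requires Java 10 or later.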
  • 21. Which skeletal structures in the zebrafish develop from the mesenchyme?
      Return $x where ($x (subclassOf)* ‘skeletal element’) and ($x develops_from* ‘ZFA:mesenchyme’)
      Return $y where ($y subclassOf* $x)
      Query Rewriting
      Return $x where ($x develops_from* ‘mesenchyme’) and ($x has_ontology $o) and ($x equivalent_to $z) and ($z has_ontology ‘ZFA’) and ($x (subclassOf)* ‘skeletal element’)
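The starred edge labels in these queries denote zero or more hops along that edge. One simple way to evaluate the first query is to walk each relation backwards from its target and intersect the two result sets. The sketch below does that over an in-memory edge list. It is an illustration under stated assumptions, not the engine's algorithm: the class `PathQuery` and its methods are hypothetical, the toy edges are illustrative (only "intramembranous bone subClassOf bone element" appears in the deck), and the ontology and equivalence clauses of the rewritten query are left out.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical evaluator for "$x label* target" patterns over a small edge list. */
final class PathQuery {
    // label -> (object -> subjects): edges are stored reversed so we can walk back from the target.
    private final Map<String, Map<String, Set<String>>> reverse = new HashMap<>();

    void addEdge(String subject, String label, String object) {
        reverse.computeIfAbsent(label, k -> new HashMap<>())
               .computeIfAbsent(object, k -> new HashSet<>())
               .add(subject);
    }

    /** All x such that x reaches target in zero or more hops of the given label. */
    Set<String> closure(String label, String target) {
        Map<String, Set<String>> byObject =
                reverse.getOrDefault(label, Collections.<String, Set<String>>emptyMap());
        Set<String> seen = new TreeSet<>();
        Deque<String> todo = new ArrayDeque<>();
        seen.add(target);
        todo.add(target);
        while (!todo.isEmpty()) {
            String node = todo.poll();
            for (String subject : byObject.getOrDefault(node, Collections.<String>emptySet())) {
                if (seen.add(subject)) todo.add(subject); // enqueue each node once
            }
        }
        return seen;
    }

    /** ($x subClassOf* 'skeletal element') and ($x develops_from* 'mesenchyme'). */
    Set<String> skeletalFromMesenchyme() {
        Set<String> answer = closure("subClassOf", "skeletal element");
        answer.retainAll(closure("develops_from", "mesenchyme"));
        return answer;
    }

    /** Toy data for illustration only; not an extract of ZFA. */
    static PathQuery toyZebrafish() {
        PathQuery q = new PathQuery();
        q.addEdge("intramembranous bone", "subClassOf", "bone element");
        q.addEdge("bone element", "subClassOf", "skeletal element");
        q.addEdge("cartilage element", "subClassOf", "skeletal element");
        q.addEdge("intramembranous bone", "develops_from", "mesenchyme");
        return q;
    }

    public static void main(String[] args) {
        System.out.println(toyZebrafish().skeletalFromMesenchyme());
    }
}
```

Because the star allows zero hops, each closure contains its own target; the intersection removes "skeletal element" and "mesenchyme" unless they satisfy both patterns.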
  • 22. “develops_from”
      http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/edge-relation/id/RO_0002202
      Subclasses of “skeletal element” (incl. its equivalenceClasses) in ZFA
      http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/subclasses/term/skeletal%20element?level=3
      <relationship>
        <subject InternalId="778440-1" id="ZFA_0001635">intramembranous bone</subject>
        <property InternalId="5389-15" id="subClassOf">subClassOf</property>
        <object InternalId="777874-1" id="ZFA_0001514">bone element</object>
      </relationship>
      “equivalenceClass”
      http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/children/term/skeletal%20element?level=1
  • 23. Tagging of Antibody Records using a machine learning technique with Conditional Random Fields
  • 25. A high-level statement
      • The ontology O is a graph of concepts and inter-concept relationships
      • The data is a semistructured object S with groupings
      • A mapping structure is a graph of mapping edges from S to O + intra-source map edges + intra-ontology map edges
      Table: protein(ID, symbol, name, pathway)
      Mapping: pathway mapsTo ontoID(Pathway). (exhibitedIn(ontoID(Pathway), ontoID(mammal)) AND occursIn(value(ID), value(pathway)))
  • 27. What are the top 10 genes you have most information about?
      • Which sources have information about “genes”? Find the ontology term for gene → CHEBI_23367, then call
        http://cm.neuinfo.org:8080/cm_services/column/mapping/ontoterm?ontologyTermId=CHEBI_23367
      • For each table and column thus found, call
        http://cm.neuinfo.org:8080/cm_services/column/valuefreqs?srcNifId=nlx_146253&tableNifId=nlx_146253-1&columnName=transgenic_line
      • Now group and rank-merge, in parallel if needed
      • Get top 10
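The group-and-rank-merge step can be sketched as follows, assuming the per-column value frequencies have already been fetched and parsed into maps from value to count. The class `TopValues`, the method `topK`, and the sample counts are made up for illustration; the response format of the `valuefreqs` service is not shown in the deck, so no parsing is attempted here. The merge is sequential for clarity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical merge of per-column value frequencies into one global ranking. */
final class TopValues {
    /** Sums the counts of equal values across columns and returns the k most frequent values. */
    static List<String> topK(List<Map<String, Long>> perColumnFreqs, int k) {
        // Group: add up the counts of the same value across all columns.
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> freqs : perColumnFreqs) {
            for (Map.Entry<String, Long> e : freqs.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        // Rank: highest count first, ties broken alphabetically so the order is deterministic.
        List<Map.Entry<String, Long>> ranked = new ArrayList<>(merged.entrySet());
        ranked.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (Map.Entry<String, Long> e : ranked.subList(0, Math.min(k, ranked.size()))) {
            top.add(e.getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up counts for two columns; gene_a and gene_b are placeholder values.
        Map<String, Long> columnA = new HashMap<>();
        columnA.put("GRM1", 5L);
        columnA.put("gene_a", 2L);
        Map<String, Long> columnB = new HashMap<>();
        columnB.put("GRM1", 1L);
        columnB.put("gene_b", 4L);
        System.out.println(topK(Arrays.asList(columnA, columnB), 10));
    }
}
```

If the per-column calls are issued in parallel, the same merge can run on partial results as they arrive, since summing counts is associative.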
  • 29. Lucene Query Language
      FIND:(image video) hippocampus
      anatomy:hippocampus
      component:"plasma membrane"
      anatomy::organism:human
      anatomy:hippocampus[::organism:human]
      RELATED:(Tenascin rabbit)
      RELATED::measuredBy:("cell signaling" cytometry)
      +
  • 30. FIND:(image video) hippocampus
      • Find the ontological class of the query term “hippocampus”. Ans: anatomical_entity
      • Does “hippocampus” have a non-empty has_part tree underneath? Every node in the ontology keeps approximate statistics of descendant counts across various edge labels
      • Rewrite query to: FIND: (image video) has_part*(synonyms(hippocampus))
      • Issue: A cell is a part of any brain region. Should the expansion include the cells of hippocampus? No, because “cell” is a different module of the ontology whose top-level is “cell”. Partonomic expansion stays within the module of the ontology
      • Final rewrite: FIND: (image video) anatomy::(has_part*(synonyms(hippocampus)))
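The expansion rule on this slide (follow has_part transitively, add synonyms, but never leave the module of the starting term) can be sketched as below. Everything in the code is an assumption for illustration: the class `PartonomicExpansion`, its methods, and the toy terms are not taken from the NIF ontology. The sketch treats synonyms as extra labels of a concept, so it expands concepts along has_part and then emits each concept's labels.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Hypothetical module-aware has_part* expansion of a query term. */
final class PartonomicExpansion {
    private final Map<String, String> moduleOf = new HashMap<>();
    private final Map<String, List<String>> synonymsOf = new HashMap<>();
    private final Map<String, List<String>> partsOf = new HashMap<>();

    void addTerm(String term, String module, String... synonyms) {
        moduleOf.put(term, module);
        synonymsOf.put(term, Arrays.asList(synonyms));
    }

    void addPart(String whole, String part) {
        partsOf.computeIfAbsent(whole, k -> new ArrayList<>()).add(part);
    }

    /** The term, its transitive parts within the same module, and the synonyms of all of them. */
    Set<String> expand(String term) {
        String module = moduleOf.get(term);
        Set<String> expansion = new LinkedHashSet<>();
        Set<String> seen = new HashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        seen.add(term);
        todo.add(term);
        while (!todo.isEmpty()) {
            String current = todo.poll();
            expansion.add(current);
            expansion.addAll(synonymsOf.getOrDefault(current, Collections.<String>emptyList()));
            for (String part : partsOf.getOrDefault(current, Collections.<String>emptyList())) {
                // Stay inside the starting term's module: parts from other modules are skipped.
                if (Objects.equals(module, moduleOf.get(part)) && seen.add(part)) {
                    todo.add(part);
                }
            }
        }
        return expansion;
    }

    /** Toy ontology fragment for illustration only. */
    static PartonomicExpansion toy() {
        PartonomicExpansion o = new PartonomicExpansion();
        o.addTerm("hippocampus", "anatomy", "hippocampus proper");
        o.addTerm("CA1", "anatomy");
        o.addTerm("CA3", "anatomy");
        o.addTerm("pyramidal cell", "cell");
        o.addPart("hippocampus", "CA1");
        o.addPart("hippocampus", "CA3");
        o.addPart("CA1", "pyramidal cell");
        return o;
    }

    public static void main(String[] args) {
        System.out.println(toy().expand("hippocampus"));
    }
}
```

In the toy data "pyramidal cell" is a part of "CA1" but belongs to the "cell" module, so it is left out of the expansion, which mirrors the slide's answer to the "Issue" question.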
  • 31. The PDSP Ki database is a unique resource in the public domain which provides information on the abilities of drugs to interact with an expanding number of molecular targets. The Ki database serves as a data warehouse for published and internally-derived Ki, or affinity, values for a large number of drugs and drug candidates at an expanding number of G-protein coupled receptors, ion channels, transporters and enzymes.
      Which marijuana related genes are of interest to NCI? Let’s just use PDSP as an example.

      public static final String THC_QUERY = "cannabis thc marijuana";

      public void demonstrateFederation() throws Exception {
          final String kiDatabaseId = "nif-0000-01866-1";
          FederationQuery query = FederationQuery.builder(kiDatabaseId, THC_QUERY).get();
          for (Facets facets : searcher.getFacets(query, 10, 0, 1)) {
              // Find all receptor (gene) facets
              if (!facets.getCategory().equals("Receptor")) {
                  continue;
              }
              for (Facet facet : facets.getFacets()) {
                  // Get grants related to these genes from NCI.
                  // The facet value is wrapped in double quotes so that it is sent as a quoted phrase.
                  query = FederationQuery.builder("nif-0000-10319-1", "\"" + facet.getFacet() + "\"")
                          .facet("Funding Institute", "national cancer institute")
                          .exportType(ExportType.data)
                          .rows(1000).get();
                  TableData data = searcher.getTableData(query);
                  for (FederationModelData model : data.getResult()) {
                      // One CSV line per grant: gene, project number, project title
                      System.out.println(facet.getFacet() + "," + model.get("project_number") + "," + model.get("project_title"));
                  }
              }
          }
      }
  • 32. Which marijuana related genes are of interest to NCI?
  • 34. Running on Gordon (with no tweaking)
      Configuration
      • 1 server
      • 98 GB RAM
      • 24-core Intel Xeon CPU @ 2.8 GHz
      • RAID 5 SSD
      • 2 Solr instances serving the Federation and literature cores
  • 35. The Neuroscience Information Framework is not really dependent on neuroscience. We are applying it to: diabetes and kidney diseases; model organisms; EarthCube for geoscience data; (hopefully) social science data for economists. We need: more scalability; improved complex query handling; a distribution framework.