Exploring Large Chemical
        Data Sets
 Interactive Analysis and Visualization



          Kyle Lutz and Marcus D. Hanwell

                 August 21, 2012
                Skolnik Symposium
Overview
● An open-source, cross-platform
  cheminformatics tool
● A general-purpose tool for chemical data
  exploration and analysis
● Interactive, editable and queryable
  database of chemical data on the desktop
● Part of the Open Chemistry application
  suite (Avogadro and MoleQueue)
● Leverages several open-source projects:
  Qt, VTK, Chemkit, Open Babel, MongoDB
Architecture
● Native, cross-platform C++ application built with Qt
● Stores chemical data in a NoSQL MongoDB database
● Uses VTK for 2D and 3D data set visualization
Main Window
Molecule Details
Queries

Supports different
queries:
● Name
● Formula
● InChI
● InChIKey
● Structure and
   Substructure
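
A minimal sketch of how the text-based query types could map onto MongoDB-style filter documents. Only the "name" field appears in the slides; the other field names (formula, inchi, inchikey) are assumptions. Structure and substructure queries are left out because they need a cheminformatics toolkit rather than a simple field match.

```python
# Hypothetical mapping from query type to the document field it matches.
# Only "name" is confirmed by the slides; the rest are assumed names.
QUERY_FIELDS = {
    "name": "name",
    "formula": "formula",
    "inchi": "inchi",
    "inchikey": "inchikey",
}


def build_filter(query_type, value):
    """Return a MongoDB-style filter document for an exact-match query."""
    key = query_type.lower()
    if key not in QUERY_FIELDS:
        raise ValueError(f"unsupported query type: {query_type}")
    return {QUERY_FIELDS[key]: value}


name_filter = build_filter("Name", "ethanol")
print(name_filter)
```

The resulting dictionary has the same shape as the first argument of a `find` call, so it could be passed to a MongoDB driver unchanged.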
Similarity Searching
Charts and Plots




Left: scatter plot of polar surface area (TPSA) against volume (VABC)
Right: histogram of logP
Multidimensional Analysis
● Provide tools for viewing and analyzing large
  amounts of data with multiple dimensions
   ○ Scatter Plot Matrix
   ○ Parallel Coordinates
   ○ K-Means Clustering
● Interactive charts supporting selection
● Easy to add new chemical descriptors
Scatter Plot Matrix




      Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume
Parallel Coordinates




     Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume
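
A parallel coordinates plot draws one polyline per molecule across a set of vertical axes, one axis per descriptor. Because descriptors such as mass and logP have very different ranges, each axis is usually rescaled first. The sketch below shows one common way to do that (min-max scaling); it is an illustration with made-up values, not the rescaling ChemData or VTK actually applies.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant descriptor carries no spread; place it mid-axis.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def to_polylines(table):
    """Turn {descriptor: [value per molecule]} into one polyline per molecule."""
    scaled = {name: min_max_scale(col) for name, col in table.items()}
    n = len(next(iter(scaled.values())))
    return [[scaled[name][i] for name in scaled] for i in range(n)]


# Made-up descriptor values for three hypothetical molecules.
table = {
    "tpsa": [20.0, 40.0, 60.0],
    "logp": [1.0, 3.0, 2.0],
    "mass": [100.0, 300.0, 200.0],
}
lines = to_polylines(table)
```

Each entry of `lines` holds the axis heights for one molecule, in the order the descriptors were given.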
K-Means Clustering
● ~30 numeric molecular descriptors
● 1D, 2D, and 3D visualization
● Selection and extraction of molecules from clusters
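
The slides do not say which k-means implementation is used. As a rough illustration of the technique, here is Lloyd's algorithm in plain Python on two made-up descriptor columns. Starting the centroids at the first k points is an assumption made to keep the example deterministic; practical implementations usually pick starting points at random.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20):
    """Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Made-up (logP, TPSA) pairs forming two obvious groups.
points = [(1.0, 20.0), (5.0, 90.0), (1.2, 22.0), (4.8, 88.0)]
labels, centroids = k_means(points, k=2)
```

Extracting the molecules of one cluster then amounts to keeping the entries whose label matches the selected cluster index.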
Similarity Visualization
● Similarity Clustering
● Calculated from fingerprint similarity or structural
  similarity
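
The slides do not name the similarity measure. The Tanimoto coefficient is a common choice for binary fingerprints, so it is used here as an assumption: the number of bits set in both fingerprints divided by the number of bits set in either. Fingerprints are held as Python integers purely for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints stored as Python ints."""
    common = bin(fp_a & fp_b).count("1")
    total = bin(fp_a | fp_b).count("1")
    # Two empty fingerprints are treated as identical by convention here.
    return 1.0 if total == 0 else common / total


# Toy 8-bit fingerprints; real fingerprints are far longer.
query = 0b10110010
candidate = 0b10100110
score = tanimoto(query, candidate)
```

Here three bits are shared out of five set in total, so `score` comes out at 0.6, which would be reported as 60% similarity.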
Similarity Visualization




[Figure: similarity diagram with labels 60%, 30%, and 45%]
ChemicalJSON
                                                           Example: ethane.cjson

●   JSON (JavaScript Object Notation) is
    a "lightweight data-interchange
    format"
●   Store molecular structure, geometry,
    identifiers and descriptors all as a
    single JSON object
●   Benefits:
    ○ More compact than XML/CML
    ○ Native language of MongoDB and
      JSON-RPC
    ○ Easily converted to a binary
      representation (BSON)




                  Specification available at: http://wiki.openchemistry.org/Chemical_JSON
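
The example image from the slide is not recoverable, so the following is a sketch of what a single-object representation of ethane might look like. The key names are assumed from the Chemical JSON draft and should be checked against the specification; 3D coordinates and descriptors are omitted rather than invented.

```python
import json

# Field names below are assumed from the Chemical JSON draft; check the spec.
ethane = {
    "chemical json": 0,
    "name": "ethane",
    "inchi": "InChI=1S/C2H6/c1-2/h1-2H3",
    "formula": "C2H6",
    "atoms": {
        # Atomic numbers: two carbons followed by six hydrogens.
        "elements": {"number": [6, 6, 1, 1, 1, 1, 1, 1]},
    },
    "bonds": {
        # Flat list of atom-index pairs: C-C, then three C-H per carbon.
        "connections": {"index": [0, 1, 0, 2, 0, 3, 0, 4, 1, 5, 1, 6, 1, 7]},
        "order": [1, 1, 1, 1, 1, 1, 1],
    },
}

text = json.dumps(ethane, indent=2)   # what would go into ethane.cjson
restored = json.loads(text)           # round-trips without loss
```

Keeping structure, identifiers and descriptors in one object means a file, a database record and an RPC payload can all share the same layout.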
ChemicalJSON in MongoDB
● Nearly identical to what is stored in a file
   ○ A few extra fields stored
     ■ 2D diagram (as PNG)
     ■ Heavy atom count (for substructure searching)
     ■ Binary fingerprints (for similarity searching)
     ■ InChIKey for indexing and as a unique key
     ■ Mongo's OID ("_id") field
● Trivial to write out to a .cjson file:
     db.molecules.find({"name" : "ethanol"},
                       {"diagram" : 0,
                        "heavyAtomCount" : 0,
                        "fp2_fingerprint" : 0,
                        "_id" : 0})
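
The projection in the query above excludes the database-only fields so that what remains matches the file format. The same idea, sketched without a running MongoDB server: drop those four fields from a record and serialize the rest. The record below is a stand-in with placeholder values, and `to_cjson_text` is a hypothetical helper name.

```python
import json

# Database-only fields named on the slide.
DB_ONLY_FIELDS = ("diagram", "heavyAtomCount", "fp2_fingerprint", "_id")


def to_cjson_text(document):
    """Drop database-only fields and serialize the rest as JSON text."""
    exported = {k: v for k, v in document.items() if k not in DB_ONLY_FIELDS}
    return json.dumps(exported, indent=2)


# Stand-in for a stored record; values are placeholders, not real data.
stored = {
    "_id": "placeholder-oid",
    "name": "ethanol",
    "formula": "C2H6O",
    "heavyAtomCount": 3,
    "fp2_fingerprint": "placeholder",
    "diagram": "placeholder",
}
cjson_text = to_cjson_text(stored)
```

Writing `cjson_text` to a file with a .cjson extension would complete the export.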
Open Chemistry with ParaViewWeb
● Uses ParaView's client-server architecture
● Interactive 3D rendering
● Runs in any modern web browser




        URL: http://paraviewweb.kitware.com/OpenChemistry/
Open Chemistry with ParaViewWeb
    ChemData
RPC / Avogadro Integration
● Uses JSON-RPC to communicate with other
  applications (most notably Avogadro)
● Visualize data directly from the database
● Uses ChemicalJSON to represent molecular
  structures and transfer molecular information
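
A sketch of what a request carrying a ChemicalJSON payload could look like. The slides do not give the method names or the protocol version, so the JSON-RPC 2.0 envelope, the method name "loadMolecule" and the "content" parameter are all assumptions made for illustration.

```python
import json
from itertools import count

_request_ids = count(1)


def make_request(method, params):
    """Build a JSON-RPC 2.0 request object as JSON text."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
        "params": params,
    })


# "loadMolecule" is a hypothetical method name, not taken from the slides.
molecule = {"chemical json": 0, "name": "ethane"}
message = make_request("loadMolecule", {"content": molecule})
decoded = json.loads(message)
```

Because the molecule is already a JSON object, it can be embedded in the request without any conversion step.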
Future Directions
● Direct integration with 3rd party databases
  (PubChem, PDB, ...)
● Broader support for storing and analyzing
  computational job results
   ○ Linked with molecular structures
   ○ Direct from CML or converted/parsed
● Plugins to facilitate extension
   ○ Descriptors
   ○ Visualization
   ○ Chemical file input/output
● Scaling studies, working with multiple data
  servers and terabytes of data
Comments/Questions?
Home Page: http://wiki.openchemistry.org/ChemData

Source Code: https://github.com/OpenChemistry/chemdata

ParaViewWeb Demo: http://paraviewweb.kitware.com/OpenChemistry
