Open Molecules 2025: A Look Inside

Last week, we released the Open Molecules 2025 (OMol25) dataset. This was a massive effort, resulting in over 100 million high-level DFT calculations covering biomolecules, metal complexes, electrolytes, reactivity, and more. We wanted to cover as much exciting chemical space as possible. I can’t wait to see what awesome science this data will help enable across the chemistry community.

Many people have posted about this release (e.g. here, here, here, and here) so I wanted to dig a bit into the data itself. One might think when making a dataset this large that the hard part is running all the DFT, but really the hard part is generating enough high-quality, interesting inputs (the DFT-at-scale problem was already solved for past Meta datasets by, among others, the incomparable Kyle Michel and Muhammed Shuaibi). You want to see a little something about everything, including things that are higher energy, but not waste time with too many of the same thing or things that are just totally unreasonable. Manually inspecting all the inputs is not possible, so you have to set up really robust pipelines.

For biomolecules, we sampled heavily from the RCSB Protein Data Bank and used Schrödinger tools to ensure proper protonation and tautomeric states of both proteins and ligands. Protonation states are often neglected in surveys of the PDB, and trying multiple protonation states for the same pocket is vital in structure-based drug discovery. We sampled nearly every unique protein-ligand environment in the PDB, augmented with molecular dynamics (MD) and docking simulations. We also sampled nucleic acid-protein and protein-protein interactions, which are vital to both macromolecular structure and biologics drug design.

For electrolytes, I have to thank my partner in MD simulation, Muhammad Risyad Hasyim. Together, we generated a combinatorial explosion of MD boxes with different concentrations, temperatures, and force fields and sampled clusters of various sizes that could also be shrunk, reduced, oxidized, and expanded. We even included some nuclear quantum effects with ring polymer MD.
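To give a feel for how quickly such a grid grows, here is a minimal Python sketch that enumerates every combination of concentration, temperature, and force field. The specific values, the force-field labels, and the `enumerate_md_boxes` helper are all hypothetical placeholders for illustration, not the settings or code used for OMol25.

```python
from itertools import product

# All values below are hypothetical placeholders, not the OMol25 settings.
concentrations_molar = [0.1, 0.5, 1.0]
temperatures_kelvin = [300, 350, 400]
force_fields = ["ff_a", "ff_b"]  # made-up labels


def enumerate_md_boxes(concentrations, temperatures, force_fields):
    """Return one config dict per (concentration, temperature, force field) combination."""
    return [
        {"concentration_M": c, "temperature_K": t, "force_field": ff}
        for c, t, ff in product(concentrations, temperatures, force_fields)
    ]


boxes = enumerate_md_boxes(concentrations_molar, temperatures_kelvin, force_fields)
print(len(boxes))  # 3 * 3 * 2 = 18 boxes
```

Even this toy grid gives 18 boxes before any cluster extraction; each additional axis (salt, solvent, cluster size, charge state) multiplies the count again.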

For metal complexes, we used the stellar Architector package from Michael Taylor et al. With Michael’s help, we were able to generate millions of complexes in different conformations. We deliberately aimed for breadth here: in addition to more "standard" complexes, we allowed complex combinations of ligands and metals that I would never have tried to make when I was getting my PhD in organometallic chemistry. This diversity helps us push the boundaries of the space (and our metal complex evaluation sets are all real crystal structures taken from the literature).

This is just a taste of the data in OMol25. Check it out for yourself!

Blog: https://lnkd.in/giJPKyhG

Paper: https://lnkd.in/gA9bemrQ

Dataset: https://lnkd.in/gqZWzcvn (License: CC-BY-4.0)

Code: https://lnkd.in/gfTqAXtS (License: MIT) and more coming soon

Thanks to the whole team: Muhammed Shuaibi, Evan Spotte-Smith, Michael Taylor, Muhammad Risyad Hasyim, Kyle Michel, Ilyes Batatia, Gábor Csányi, Misko Dzamba, Peter Eastman, Nathan Frey, Xiang Fu, Vahe Gharakhanyan, Aditi Krishnapriyan, Joshua Rackers, Sanjeev Raja, Ammar Rizvi, Andrew Rosen, Zachary Ulissi, Santiago V., Larry Zitnick, Samuel Blau, Brandon Wood

Additional thanks to the ORCA team at FACCTs and to former lab-mate Narbe Mardirossian for making a great DFT functional back in our Berkeley days.

Jérôme Gonthier

Senior Quantum Research Scientist at NVIDIA

3mo

That is outstanding work! I was just reading another blog post about it. Congratulations Daniel!

Dr. Mushtaq Ali

Computational Scientist | GenAI | PhD | MBA | M.Sc. | MCA |

4mo

I am curious to explore the application of this dataset. Thanks for sharing

Muhammad Uzair Khan

PhD Candidate | ML/DL for Materials Research | Computational Modelling | Applied Agentic AI

4mo

Amazing Contribution 👏🏻. Looking forward to exploring the dataset and its applications.

Corin Wagen

founder/CEO, Rowan Scientific

4mo

What sorts of reactions are covered here? Radical reactions? Pericyclics? Organometallic reactivity?
