Open Molecules 2025: A Look Inside
Last week, we released the Open Molecules 2025 (OMol25) dataset. This was a massive effort, resulting in over 100 million high-level DFT calculations, covering biomolecules, metal complexes, electrolytes, reactivity, and more. We wanted to try and cover as much exciting chemical space as possible. I can’t wait to see what awesome science this data will help enable across the chemistry community.
Many people have posted about this release (e.g. here, here, here, and here), so I wanted to dig a bit into the data itself. One might think that when making a dataset this large the hard part is running all the DFT, but really the hard part is generating enough high-quality, interesting inputs (the DFT-at-scale problem was already solved for past Meta datasets by, among others, the incomparable Kyle Michel and Muhammed Shuaibi). You want a little of everything, including higher-energy structures, without wasting time on too many near-identical structures or on ones that are just totally unreasonable. Manually inspecting all the inputs is not possible, so you have to set up really robust pipelines.
For biomolecules, we sampled heavily from the RCSB Protein Data Bank and used Schrödinger tools to ensure proper protonation and tautomeric states of both proteins and ligands. Protonation states are often neglected in surveys of the PDB, and trying multiple protonation states for the same pocket is vital in structure-based drug discovery. We sampled nearly every unique protein-ligand environment in the PDB, augmented with molecular dynamics (MD) and docking simulations. We also sampled nucleic acid-protein and protein-protein interactions, which are vital to both macromolecular structure and biologics drug design.
For electrolytes, I have to thank my partner in MD simulation, Muhammad Risyad Hasyim. Together, we generated a combinatorial explosion of MD boxes with different concentrations, temperatures, and force fields and sampled clusters of various sizes that could also be shrunk, reduced, oxidized, and expanded. We even included some nuclear quantum effects with ring polymer MD.
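To make the "combinatorial explosion" concrete, here is a minimal sketch of enumerating MD box settings as a Cartesian product. The specific concentrations, temperatures, and force-field labels are hypothetical placeholders, and the function name is invented for illustration; none of this is the actual OMol25 pipeline.

```python
from itertools import product

# Hypothetical values for illustration only; not the settings used for OMol25.
concentrations_molar = [0.1, 1.0, 3.0]
temperatures_kelvin = [300, 400]
force_fields = ["ff_a", "ff_b"]  # placeholder labels, not real force-field names


def enumerate_md_boxes(concentrations, temperatures, force_fields):
    """Return one settings dict per (concentration, temperature, force field) combination."""
    return [
        {"concentration_M": c, "temperature_K": t, "force_field": ff}
        for c, t, ff in product(concentrations, temperatures, force_fields)
    ]


boxes = enumerate_md_boxes(concentrations_molar, temperatures_kelvin, force_fields)
print(len(boxes))  # 3 concentrations * 2 temperatures * 2 force fields = 12
```

Even with only a handful of options per axis, the number of boxes grows multiplicatively, which is why automated generation and sampling matter at this scale.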
For metal complexes, we used the stellar Architector package from Michael Taylor et al. With Michael’s help, we were able to generate millions of complexes in different conformations. We deliberately aimed for breadth here: in addition to more "standard" complexes, we allowed complex combinations of ligands and metals that I would never have tried to make when I was getting my PhD in organometallic chemistry. This diversity helps us push the boundaries of the space (and our metal complex evaluation sets are all real crystal structures taken from the literature).
This is just a taste of the data in OMol25. Check it out for yourself!
Blog: https://lnkd.in/giJPKyhG
Paper: https://lnkd.in/gA9bemrQ
Dataset: https://lnkd.in/gqZWzcvn (License: CC-BY-4.0)
Code: https://lnkd.in/gfTqAXtS (License: MIT) and more coming soon
Thanks to the whole team: Muhammed Shuaibi, Evan Spotte-Smith, Michael Taylor, Muhammad Risyad Hasyim, Kyle Michel, Ilyes Batatia, Gábor Csányi, Misko Dzamba, Peter Eastman, Nathan Frey, Xiang Fu, Vahe Gharakhanyan, Aditi Krishnapriyan, Joshua Rackers, Sanjeev Raja, Ammar Rizvi, Andrew Rosen, Zachary Ulissi, Santiago V., Larry Zitnick, Samuel Blau, Brandon Wood
Additional thanks to the ORCA team at FACCTs and to former lab-mate Narbe Mardirossian for making a great DFT functional back in our Berkeley days.
Senior Quantum Research Scientist at NVIDIA
3mo: That is outstanding work! I was just reading another blog post about it. Congratulations Daniel!
Computational Scientist | GenAI | PhD | MBA | M.Sc. | MCA |
4mo: I am curious to explore the applications of this dataset. Thanks for sharing.
PhD Candidate | ML/DL for Materials Research | Computational Modelling | Applied Agentic AI
4mo: Amazing contribution 👏🏻. Looking forward to exploring the dataset and its applications.
founder/CEO, Rowan Scientific
4mo: What sorts of reactions are covered here? Radical reactions? Pericyclics? Organometallic reactivity?