This document discusses the challenges of scaling machine learning for drug discovery as data volumes grow. The author describes their work developing automated workflows and pipelines for building predictive models on large datasets using technologies such as Hadoop, Spark, and cloud computing. The goal is to enable non-experts to build accurate models and to obtain predictions in real time as structures are modified. The document outlines several projects that apply these technologies to problems such as site-of-metabolism prediction, target prediction, and next-generation sequencing analysis. It also evaluates the challenges of scaling modeling to many datasets and targets on high-performance computing clusters and private clouds.