COMPARING APPROACHES

After the performance of both the neural network and the least squares-based method was optimized on truncated test datasets, the whole database was processed. PLS proved to be much faster than neural networks while maintaining similar performance. Both methods were also compared with the database-based method.

FIGURE 5. ¹³C prediction results. The best results for ¹³C chemical shift prediction are shown above. The HOSE code approach refers to the database method. The 'Hybrid' method first tries to find a structure in the database; if no close analogs are found, it employs increment-based neural networks and/or PLS. The neural network used in the calculation had 100 neurons in the input layer and 25 and 5 neurons in two hidden layers; the network was provided with cross-increments up to the 3rd sphere.

FIGURE 6. ¹H prediction results. The best results for proton chemical shift prediction are shown. The best neural network had 30 input neurons and no hidden layer; six spheres were used for structure description, and cross-increments were used up to the third sphere for atoms separated by no more than one bond.

CONCLUSIONS

In the work presented here, two different approaches to chemical shift prediction, least squares-based regression and neural networks, have been systematically compared. From this work, we conclude that:
- Using neural networks does not reduce the number of atomic increments and cross-increments needed for accurate chemical shift prediction. Neural networks also do NOT achieve accurate results without cross-increments.
- The quality of the best results obtained with an optimized least squares scheme and with a neural network is approximately the same. The mean error can be as low as 1.5 ppm for ¹³C and 0.2 ppm for ¹H chemical shift prediction.
- Both PLS and neural networks can be coupled with a database search, resulting in an even more effective hybrid method.
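The 'Hybrid' strategy described above can be sketched as a simple fallback chain: query the database first, and only resort to the increment-based model when no close analogs exist. This is an illustrative sketch, not the ACD/Labs implementation; the names `predict_shift_hybrid`, `similarity`, and `SIMILARITY_CUTOFF` are assumptions for illustration.

```python
SIMILARITY_CUTOFF = 0.9  # assumed threshold for a "close analog"

def predict_shift_hybrid(atom_env, database, increment_model):
    """Predict a chemical shift (ppm) for one atom environment."""
    analogs = [rec for rec in database
               if rec.similarity(atom_env) >= SIMILARITY_CUTOFF]
    if analogs:
        # Database (HOSE-code style) path: average the shifts of close analogs.
        return sum(rec.shift for rec in analogs) / len(analogs)
    # Fallback path: increment-based regression model (PLS or neural network).
    return increment_model.predict(atom_env)
```

The design point is that the database path is preferred whenever it applies, because measured shifts of close analogs are usually more accurate than any regression estimate.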
METHODS

Programming was performed in a Borland Delphi 5 environment using MTX libraries for the linear algebra calculations. The neural network parameters were adjusted in MATLAB with its neural networks package.

Training Database

The three most important factors in choosing the training database are size, diversity, and quality; these ensure that the derived algorithms are broadly applicable. These requirements are met by using the same database used in ACD/Labs' commercially available products, which contains approximately 2,160,000 ¹³C and 1,440,000 ¹H chemical shifts. Equally important is the selection of the test dataset, which should be as independent as possible from the training dataset. To avoid overlap with the training data, 11,000 new compounds (150,000 chemical shifts) described in the literature in 2005-2006 were chosen as the test dataset; the training database included only compounds described in the literature before 2005.

Structure Description

Traditionally, chemical structures are described in terms of separate atoms.

FIGURE 1. The structure description consists of a central atom (for which the prediction is made) and substituents located at different distances from the central atom. All atoms separated from the center by n covalent bonds form the n-th sphere.

In total, 66 atom types were used. Generally, atoms are classified based on element number, number of attached hydrogens, hybridization, and valency. Additional descriptors were used to take conjugation, stereochemistry, and solvent effects into account. For pairs of atoms separated by a few covalent bonds, separate inputs were provided (known as "cross-increments" or "correction factors"). The number and nature of atom types, the number of spheres, and the number of cross-increments were all subject to optimization.
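The sphere numbering above is simply the bond distance from the central atom, which a breadth-first search computes directly. A minimal sketch, assuming a hypothetical adjacency-list representation of the molecular graph (atom index to list of bonded neighbors):

```python
from collections import deque

def spheres(bonds, center, max_sphere=6):
    """Map each atom index to its sphere number (bond distance from `center`)."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        if dist[atom] == max_sphere:
            continue  # atoms beyond max_sphere are ignored by the description
        for nbr in bonds.get(atom, ()):
            if nbr not in dist:
                dist[nbr] = dist[atom] + 1
                queue.append(nbr)
    return dist

# Example: heavy atoms of ethanol, C1-C2-O3, predicting for C1.
# spheres({1: [2], 2: [1, 3], 3: [2]}, center=1) -> {1: 0, 2: 1, 3: 2}
```

Each atom in sphere n then contributes its atom-type increment to the description of the central atom.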
PARTIAL LEAST SQUARES

Calculation of the regression coefficients for PLS is generally faster than neural net training. Moreover, regression is a "deterministic" procedure that leads to the same result every time, unlike neural net learning, which relies on stochastically chosen initial weights. Regression was therefore chosen as the main method for developing a structure description model.

NMR Chemical Shift Prediction by Atomic Increment-Based Algorithms
Yegor D. Smurnyy¹, Kirill A. Blinov¹, Mikhail E. Elyashberg¹, Brent A. Lefebvre², Antony J. Williams²
¹Advanced Chemistry Development, Ltd., Moscow, Russia; ²Advanced Chemistry Development, Inc., Toronto, ON, Canada

INTRODUCTION

In silico prediction of small-molecule properties is widely used today in industry and academia. NMR spectra, in particular, are predicted by a variety of software packages. Among these software options, two main approaches are used:

Database-based. Compounds are compared against a database, and the result is calculated using data for close structural relatives found in the dataset.

Regression-based. An experimental database is used to calculate the parameters of a non-linear regression. The chemical shift is calculated as a non-linear function of variables that describe characteristic features of the molecule of interest.

These two approaches require different strategies for implementation and optimization. Database-based results are improved by acquiring larger databases and/or including user-specific data in the calculation. Non-linear regression algorithms can be improved through the regression itself or by improving the structural descriptors.

Regression

The regression itself can be improved. The goal in this case is to ensure that the minimum found by the algorithm is a global minimum, that the solution is stable, and that the available computer resources are sufficient to process databases of up to one million chemical shifts.
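The determinism argument for least squares can be made concrete: the coefficients are the exact solution of the normal equations, so repeated fits on the same data give identical increments, with no random initialization involved. A pure-Python sketch for a tiny two-increment model; the data are made up for illustration.

```python
def fit_increments(X, y):
    """Solve the 2x2 normal equations (X^T X) b = X^T y exactly (no intercept)."""
    a = sum(x[0] * x[0] for x in X)
    b = sum(x[0] * x[1] for x in X)
    d = sum(x[1] * x[1] for x in X)
    u = sum(x[0] * t for x, t in zip(X, y))
    v = sum(x[1] * t for x, t in zip(X, y))
    det = a * d - b * b
    return ((d * u - b * v) / det, (a * v - b * u) / det)

# Counts of two substituent types around each atom, and observed shifts (ppm).
X = [(1, 0), (0, 1), (1, 1), (2, 1)]
y = [10.0, 5.0, 15.0, 25.0]
run1 = fit_increments(X, y)
run2 = fit_increments(X, y)
assert run1 == run2  # deterministic: identical increments on every run
```

A neural network trained twice on the same data would generally end in different local minima, which is why comparing it fairly against PLS requires repeated runs.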
Given these goals, two major classes of algorithms are the most popular:

Neural networks: artificial neurons form a network that can be "taught" (the regression parameters are adjusted).

Least squares algorithms: starting with partial least squares (PLS).

Neural networks are much more popular these days, sometimes because of their effectiveness and sometimes simply because they are an exciting area of research. We believe that these two methods can produce results of similar quality, and that the ultimate choice should be made not on popularity but after running benchmarks on actual data.

Structural Descriptors

The independent variables used for regression can be extended to describe a chemical structure precisely, including not only the chemical topology itself but also 3D information and experimental conditions (most commonly, the solvent). However, care should be taken not to "over-describe" a structure: a description that is too detailed tends to include structural features that have only a minor impact on the observed chemical shift, which increases the prediction error.

GOALS

In the current work we focused on two areas:

Validation and improvement of the chemical descriptor schemes, with special emphasis on the level of detail necessary and sufficient for inclusion in the description.

Comparison of the partial least squares and neural network methods. We did our best to ensure that both methods were used optimally. Unlike PLS, neural networks have a number of adjustable parameters, so we ran a separate series of calculations to ensure that the set of parameters used for comparison with the PLS method was optimal.

[Figure legend: tested topologies included 20 input neurons with 10 hidden, 30 input neurons with no hidden layer, and 50 input neurons with no hidden layer.]

The first attempt with linear regression gave a significant prediction error, even for the large number of spheres used. We concluded that linear regression was not appropriate and that non-linear regression should be used.
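The move from a purely additive model to one with cross-increments can be sketched as a feature-construction step: besides one input per atom type, an extra input is added for selected pairs of atom types, so the fitted shift is no longer a simple sum of single-atom contributions. Atom-type names here are illustrative assumptions.

```python
def build_features(atom_counts, cross_pairs):
    """Atomic increments plus selected pairwise cross-increment terms."""
    feats = dict(atom_counts)  # one increment input per atom type
    for t1, t2 in cross_pairs:
        # Cross-increment input: non-zero only when both atom types are present,
        # which is what makes the model non-linear in the atomic counts.
        feats[f"{t1}*{t2}"] = atom_counts.get(t1, 0) * atom_counts.get(t2, 0)
    return feats

env = {"C_sp3": 2, "O_sp3": 1}
print(build_features(env, [("C_sp3", "O_sp3")]))
# -> {'C_sp3': 2, 'O_sp3': 1, 'C_sp3*O_sp3': 2}
```

The regression itself stays linear in the features, which keeps the fast, deterministic least-squares machinery applicable; the non-linearity lives entirely in the constructed cross-increment inputs.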
This was accomplished by adding "cross-increments" as a type of correction factor to make the regression model non-linear.

FIGURE 2. The dependence of prediction quality on the number of spheres used. The best result is achieved with a total of six spheres taken into account, with cross-increments added for atoms located up to the third sphere.

NEURAL NETWORKS

Typically, neural networks are considered superior to older least squares-based methods. It is commonly suggested that a trained neuron, given input variables x1, ..., xn, automatically takes into account all possible non-linear combinations of variables, such as x1x2, x1x2x3, etc. By this argument, a neural network should not require cross-increments. We tested this hypothesis using our test dataset of approximately 400,000 ¹³C chemical shifts.

FIGURE 3. Mean error for the neural net test with cross-increments. Three spheres were taken into account; cross-increments were either absent or added for atoms one or two bonds apart.

The data clearly show that the neural network still requires cross-increments to be explicitly included in the structure description.

FIGURE 4. Mean error vs. standard deviation for different neural network topologies. The way the inputs were normalized and the transfer functions used were also varied (not reflected on the graph).

[Figure axis labels and legends: max number of bonds; max number of spheres; mean error on test set, ppm; error, ppm; standard deviation; methods compared: HOSE codes, Hybrid, PLS, Neural Network; spheres 1st, 2nd, 3rd; one-atom increment, one-bond and two-bond cross-increments.]
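The figures compare methods by the mean absolute error and its standard deviation over the test set. A minimal sketch of those two metrics, computed here on made-up predicted and observed shifts:

```python
import math

def error_stats(predicted, observed):
    """Mean absolute error and its standard deviation, in ppm."""
    errors = [abs(p - o) for p, o in zip(predicted, observed)]
    mean = sum(errors) / len(errors)
    var = sum((e - mean) ** 2 for e in errors) / len(errors)
    return mean, math.sqrt(var)

# Illustrative 13C shifts (ppm): predicted vs. observed for three atoms.
mean_err, std_dev = error_stats([128.1, 77.4, 30.0], [127.0, 77.0, 29.5])
```

Reporting both numbers matters: two methods with the same mean error can differ greatly in how often they produce large outliers, which the standard deviation captures.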
