Thursday, July 14, 2011

Practical Tips on using Google Apps Scripts for Chemistry Applications

A few weeks ago I described our use of Google Apps Scripts developed by Rich Apodaca and Andrew Lang as an intuitive interface to information related to a chemistry laboratory notebook. Since then we have been using these tools to actively plan and record experiments (e.g. UC-EXP269) and have learned their strengths and weaknesses.

The most problematic aspect of Google Apps Scripts running within Google Spreadsheets turns out to be the way caching and refreshing operate. There does not appear to be an obvious way to refresh a single cell. If a script times out or fails, Google stores that failed output on its servers and will not run the script again until some time has elapsed (apparently on the order of an hour). Typing a new input into that cell will cause the script to run again, but re-entering a previous input only retrieves the cached output, even if it was a failure. For example, if one cell calculates the MW from "benzene" entered in another cell and the script fails for any reason, typing in "ethanol" will get it to run again for the new input, but going back to "benzene" will just pull up the cached output of "Failed".
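
To make the caching behavior concrete, here is a minimal sketch of what a custom function like getMW might look like (the resolver URL and error handling are illustrative, not the actual gChem/gONS code):

    function getMW(name) {
      // Illustrative resolver; the real gChem/gONS services use their own endpoints.
      var url = 'http://cactus.nci.nih.gov/chemical/structure/' +
          encodeURIComponent(name) + '/mw';
      try {
        return parseFloat(UrlFetchApp.fetch(url).getContentText());
      } catch (e) {
        // Google caches whatever the function returns, including this string,
        // which is why re-entering "benzene" can keep showing "Failed" for a while.
        return 'Failed';
      }
    }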

Nevertheless, I did come across some tricks to force a refresh indirectly. If you insert a row or column and re-enter the desired scripts in the new cells, they will run again; you then simply delete the old row or column with the failed outputs. This is fine for simple sheets but can be a headache for sheets with several calculation dependencies between cells.

Alternatively, you can refresh the entire sheet by duplicating it, deleting the old sheet and renaming the new one to the original name. The problem now is that this refreshes all the cells, not just those with failed outputs. And if there are a large number of scripts on the sheet, the odds are good that at least one will fail on that particular attempt, especially if several are hitting the same web server.

As a result of all these problems, I would not recommend using these services in the way I had initially hoped: a researcher entering data into a template sheet preloaded with scripts that automatically generate a series of calculated outputs. There is a way to achieve this end but it requires thinking about the scripts in a slightly different way.

As I mentioned above, there are tricks for refreshing an entire sheet or a column or row. To avoid re-running scripts that have already returned the desired outputs, we need to lock them in. This can be done by highlighting the completed cells, copying them (either Ctrl-C or Edit -> Copy), then pasting them as values (from the Edit menu). Refreshing will then only affect the cells with failed outputs, and these can be locked in as well as soon as they complete.
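
For those who prefer a menu command to the copy-and-paste dance, the same lock-in can be scripted; a minimal sketch, assuming you have highlighted the cells you want to freeze:

    function lockInSelection() {
      // Overwrite the selected cells with their current values, discarding
      // the formulas that produced them -- the script equivalent of
      // copying and then using "paste as values".
      var range = SpreadsheetApp.getActiveRange();
      range.setValues(range.getValues());
    }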

The downside of this approach is that you lose the information about which script was run to generate the output values, and changing an input requires re-selecting the desired script. But in practice it is so convenient to open a dropdown menu and select getMW (for example) that this downside is quite minimal, especially when contrasted with the upside of knowing that others will see your information reliably, independent of how the services are running at a particular time.

Over the past few weeks we have found that some services fail more often than others, and it would be advantageous to have some redundancy. This has been particularly problematic recently for the cactus services, which we often use for resolving common names. By using ChemSpiderIDs (CSIDs), the cactus services can be bypassed for several of the gONS services. So a good practice for any application is to generate and lock in SMILES and CSIDs right away from the common name. CAS numbers can be used too, but the gChem service that Rich created sometimes yields multiple CAS numbers, and these will fail as input for a subsequent script.
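
In practice that means a layout along these lines (the function names here are illustrative - the services spreadsheet mentioned below tracks the exact names and allowed inputs):

    A2: benzene
    B2: =getSMILES(A2)    lock in as a value once it returns
    C2: =getCSID(A2)      lock in as well
    D2: =getMW(C2)        later calls can take the CSID and bypass cactus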

We now have a chemistry Google Apps Scripts spreadsheet to keep track of which inputs are allowed for all the available services, along with information about the output, creator and description. We also keep track of requests and plans for new scripts, marked as "pending" under the status field.


Surprisingly, pasting images "as values" within a Google Spreadsheet cell does not ensure that they will appear consistently - often the cells are simply blank upon loading. This makes it impractical to use an embedded sheet to display reaction schemes within a wiki lab notebook page. However, using the scripts and a template to generate the scheme by just typing the name, SMILES or CSID for the reactants and product is a very efficient way to produce a consistent look for schemes within a notebook. It only requires a final step of taking a screenshot and cropping it in Paint. For example, here is a scheme thus generated for UC-EXP269.
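
For reference, a single 2D depiction can be pulled into a cell with the built-in image() function pointed at a structure resolver; the cactus URL below is purely one illustrative option (and, as noted above, cactus is sometimes unavailable):

    =image("http://cactus.nci.nih.gov/chemical/structure/" & A2 & "/image")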


Taking all of these factors into account, the reaction template sheet we provide does not, by default, have any scripts running within cells (except for the images). However, it is set up to adapt quickly to other reactions: planning amounts of reactants (by weight or volume), calculating concentrations, yields, melting points (experimental and predicted), solubilities, links to ChemSpider, 2D renderings of structures (including full schemes) and links to interactive NMR spectra using ChemDoodle. It simply requires users to hit one of the 3 drop-down menus (gChem, gCDK or gONS) and select the appropriate script for a particular cell.

Even if you do not want to use this particular reaction template, it still makes sense to make a copy of the template sheet because it is an easy way to copy all of the necessary Google Apps Script code without opening the script editor.


Friday, April 30, 2010

NMR integration web service expanded

The ONS Challenge has extensively used a web service created by Andrew Lang to automatically calculate solubility from NMR spectra. One of the constraints of the service was that the JCAMP-DX file had to be deposited in a special folder on a server at Drexel.

Andy has now modified the script so that the JCAMP-DX file can be located anywhere on the internet. I have prepared a modified Google Spreadsheet to serve as a template for SAMS calculations (Semi-Automated Measurement of Solubility). Simply enter the URL of the JCAMP-DX file in the appropriate column and fill in the ppm ranges and corresponding hydrogen numbers for the solvent and solute, as well as the molecular weight and density data. (The predicted density of solids can be found on ChemSpider.) The concentration of the solute will then be automatically calculated based on an assumption of volume additivity.
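
The arithmetic behind the sheet is straightforward; here is a sketch of the calculation as I understand it (the actual web service also performs the integration and baseline correction on the raw spectrum):

    // Solute concentration from NMR integrals, assuming volume additivity.
    // Integrals (i*) and hydrogen counts (h*) give the mole ratio; molecular
    // weights (g/mol) and densities (g/mL) convert a 1 mol solvent basis to volume.
    function soluteMolarity(iSolute, hSolute, iSolvent, hSolvent,
                            mwSolute, dSolute, mwSolvent, dSolvent) {
      var ratio = (iSolute / hSolute) / (iSolvent / hSolvent); // mol solute per mol solvent
      var volumeMl = mwSolvent / dSolvent + ratio * mwSolute / dSolute;
      return ratio / (volumeMl / 1000); // mol/L
    }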

The web service (which handles baseline correction) could be used for any other purpose involving the integration of spectra. Just make a copy of the Google Spreadsheet and modify it.

Note that the JCAMP-DX files must be in XY format. If your instrument saves spectra in a compressed format, they must be converted to XY. The desktop version of Robert Lancashire's JSpecView can be used to carry out the conversion.

This template spreadsheet also features a service in a cell to display the NMR spectrum by simply clicking on the link inside the cell. This is very handy because it obviates the need to create an HTML file which must normally accompany the JCAMP-DX file for viewing. Being able to quickly view a spectrum from a particular row within the Google Spreadsheet makes tracking data provenance very intuitive and errors easy to spot.


Wednesday, March 10, 2010

Updated Chemistry Web Services - now with Density

I mentioned a while back the web services that Rajarshi Guha had set up for us. We are often in need of molecular weight and density data for both solutes and solvents since we rely on an assumption of volume additivity when calculating concentration.

Since Rajarshi moved to the NIH, the location of the services has changed. We now have the CDK installed on a Drexel server so some of the simple services like MW and SMILES generation are still available there.

However, density has been challenging to provide as a service. Experimental density values for solvents are commonly available, but calculated densities of solids are hard to find. ChemSpider is one of the few sources where calculated densities of solids and liquids are freely available. Unfortunately there are currently no ChemSpider density web services.

As an interim solution for the UsefulChem and ONSChallenge projects we have set up a look-up table as a Google Spreadsheet (SolventLookUp) covering most solvents of potential interest. Solutes added to our SolubilitiesSum sheet are automatically added to a SoluteLookUp SQL database running at Oral Roberts University, and the ChemSpider densities are added there via an automated but slow process.

Andrew Lang has used these resources to provide web services returning densities and other properties or descriptors. These data sources are especially important for the nearly automated production of new editions of the ONS Challenge Solubility Book. This is not a general solution since it only includes compounds of interest to our group and would not scale (at least for licensing reasons) to millions of compounds.

But it does come in handy for us because we can quickly call these services within a Google Spreadsheet to do a variety of useful calculations, minimizing the possibility of errors from manual copying and pasting.

As an example see the following ChemServices sheet. Enter the common name for a solvent or solute and the number of millimoles and the sheet will automatically calculate the corresponding number of milligrams or microliters. [Note that Google Spreadsheets can only handle a maximum of 50 web service calls at a time - a useful trick is to highlight cells after the calculations then copy and "paste as values". Make sure to keep some cells with the web service calls in case you need to do more calculations in the future]
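
The conversions themselves are simple; a sketch of what the sheet computes, assuming MW in g/mol and density in g/mL:

    function mmolToMg(mmol, mw) {
      return mmol * mw;             // mmol x g/mol = mg
    }
    function mmolToMicroliters(mmol, mw, density) {
      return (mmol * mw) / density; // mg / (mg/uL) = uL, since g/mL = mg/uL
    }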


Sunday, April 12, 2009

Automatic Back-Up of ONS files: Google Spreadsheets, JCAMP-DX, Flickr

As many of you know, we have been heavily dependent upon publicly editable Google Spreadsheets for storing results and calculations relating to our Open Notebook Science projects. We have recently integrated automated processing of NMR files in JCAMP-DX format to calculate solubility data by using web services called directly from within the Spreadsheets.

That represents a lot of distributed technology that is susceptible to network or server problems. Andy Lang, who wrote the web services that currently calculate the solubility, has enabled the recall of previously calculated values via a quick database look-up. While this substantially reduces server load by avoiding lengthy calculations, it does mean that the final numbers do not exist in the Spreadsheets themselves.

In addition to these concerns, every time I give a talk to a group of librarians the issues of archiving and curation of new forms of scholarship are raised. These are valid concerns and I've been trying to work with several groups to deal with the problem in as automatic a way as possible.

We had initially considered a spidering service that would automatically follow every file linked to the ONS wikis and download the documents on a daily basis. This has turned out to be problematic because many of the links don't terminate directly at files, but rather at user interfaces. For example, a typical link to a Google Spreadsheet does not lead to a simple HTML page that can be copied but rather to an interface for adding data and setting up calculations.

It turns out we can adopt a semi-automated solution that gets us where we want to be but requires a bit more manual work. Google Spreadsheets can be exported as Excel spreadsheets, which store the results of web service calculations as simple values and include the link to the web service as a cell comment. All calculations within the Spreadsheet are also retained in this way. The trick is to "publish" the spreadsheet using the advanced option of exporting as an Excel file. The export then becomes available at a simple URL.
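
At the time of writing, the published export URL takes roughly the following form (the key is a placeholder, and I have not verified that this holds for every account):

    http://spreadsheets.google.com/pub?key=YOUR_SPREADSHEET_KEY&output=xls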


Now, the only manual step left in the process is to copy these URLs to another BackUp Google Spreadsheet. Andy has created a little executable that steps through a list of these URLs and creates a backup on any Windows computer under a C:\ONS directory. It is then simply a question of setting up a Windows Task Scheduler job to run once a day and call the executable. All the files are named with the date as the first part of the name for easy sorting.
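
Andy's executable is the tool of record here, but for readers who want the gist, the loop amounts to something like this Apps Script sketch (the sheet name and the Drive destination are my assumptions, not features of Andy's tool):

    function backupFiles() {
      // Assumed layout: the export URLs sit in column A of a sheet named "BackUp".
      var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('BackUp');
      var urls = sheet.getRange(1, 1, sheet.getLastRow(), 1).getValues();
      var stamp = Utilities.formatDate(new Date(), 'GMT', 'yyyy-MM-dd');
      var folder = DriveApp.createFolder('ONS-backup-' + stamp);
      for (var i = 0; i < urls.length; i++) {
        var url = urls[i][0];
        if (!url) continue;
        // A bad URL makes fetch() throw and aborts the rest of the run --
        // the same limitation as caveat 1 below.
        var blob = UrlFetchApp.fetch(url).getBlob();
        // Date-first names sort chronologically, as in Andy's tool.
        folder.createFile(blob).setName(stamp + '_backup_' + (i + 1));
      }
    }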



Besides Google Spreadsheets backed up as Excel files, spectral JCAMP-DX files and Flickr images can be processed in the same way. In both cases the user must specify the JDX, DX or JPG file directly. In Flickr you have to click through to the download page for a given image, but once you have that URL it works fine.

Andy has versioned this as V0.1 for good reason. It does do exactly what we want but there are a few caveats:

1) Any errors in specifying a file will abort the rest of the back-up. Future versions should tolerate errors, with appropriate reporting of problems, perhaps by email.

2) Files don't necessarily have the correct extensions. For example, backed-up Wikispaces pages have to be renamed with an HTML extension to be viewed in a browser. Note that Wikispaces has its own sophisticated back-up system that will put the entire wiki - including all files uploaded directly to the wiki - into a single ZIP file, in either HTML or WIKITEXT format. Of course this will not include files residing outside, like Google Spreadsheets. Still, I think there is no harm in including the wiki pages in the list of files to be backed up by Andy's system.

Going forward there are two types of collaborations that could help a lot:

1) Librarians who would be willing to archive UsefulChem and ONSChallenge files. Right now these amount to just a few megabytes a day, but this will increase as we continue to add to the list. To be reasonable about space, I could see a protocol of keeping only one back-up per week or month for dates more than 30 days in the past - about what the Internet Archive does, I think. With multiple libraries maintaining archives, it would be possible to establish unambiguously what was known at what time.

2) Someone who knows how Google creates URLs for downloadable XLS exports would be mightily helpful; similarly for Flickr and JPG exports. Even just a script to spider all HTML pages linked to the wikis and blogs would save a lot of manual labor. The nice thing is that the output of the spidering code would just have to be dumped into the Back-Up Google Spreadsheet - which already backs itself up conveniently.


Creative Commons Attribution Share-Alike 2.5 License