Thursday, June 16, 2011

My talk at SLA on Trust in Science and Open Melting Point Collections

On June 14 and 15, 2011 I attended the Special Libraries Association conference and made presentations on two panels on the role of trust in science with a case-study of the Open Melting Point collections that Andrew Lang, Antony Williams and I have been assembling and curating.

The first panel was on the "International Year of Chemistry: Perils and Promises of Modern Communication in the Sciences". My colleague Lawrence Souder from the Department of Culture and Communications at Drexel presented on "Trust in Science and Science by Blogging", using as an example the NASA press release on arsenic replacing phosphorus in bacteria and the subsequent controversy taking place in the blogosphere. (See the post in the Scientific American blog today.)

Watch Lawrence Souder's presentation screencast and slides.

The second panel was on "New Forms of Scholarly Communications in the Sciences". Don Hagen from the National Technical Information Service presented on "NTIS Focus on Science and Data: Open and Sustainable Models for Science Information Discovery" and Dorothea Salo discussed the evolving role of libraries and institutional repositories on scholarly communication and archiving.

Watch Don Hagen's presentation screencast and slides.

My own slides and screencast from the second panel are available below:




Monday, February 21, 2011

Alfa Aesar melting point data now openly available

A few weeks ago, John Shirley - Global Marketing Manager at Alfa Aesar - contacted me to discuss the Chemical Information Validation results I posted from my 2010 Chemical Information Retrieval class. Our research showed that Alfa Aesar was the second most common source of chemical property information from the class assignment.
We explored some possible ways that we could collaborate. With our recent report of the use of melting point measurements to predict temperature solubility curves, the Alfa Aesar melting point data collection could prove immensely useful for our Open Notebook Science solubility project.

However, since we are committed to working transparently, the only way we could accept the dataset is if it were shared as Open Data. I am extremely pleased to report that Alfa Aesar has agreed to this requirement and we hope that this gesture will encourage other chemical companies to follow suit.

The initial file provided by Alfa Aesar did not store melting points in a database-ready format: it included ranges, non-numeric characters and entries reporting decomposition or sublimation. One of the benefits we could provide back to the company was cleaning up the melting point field to pure numerical values ready for sorting and other database processing. This processed collection contains 12986 entries. Note that these entries are not necessarily different chemical compositions since they refer to specific catalog entries with different purities or packaging.
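
To give a feel for this kind of cleanup, here is a minimal Python sketch. It is not the procedure we actually used: the input strings are hypothetical, and the choices to take the midpoint of a range and to keep the number from a decomposition entry are assumptions made only for illustration.

```python
import re
from typing import Optional

# A signed integer or decimal, e.g. "-12", "98.5", "101".
NUM = r"-?\d+(?:\.\d+)?"
# Two numbers joined by a hyphen or the word "to", e.g. "120-122".
RANGE_RE = re.compile(rf"({NUM})\s*(?:-|to)\s*({NUM})")
SINGLE_RE = re.compile(NUM)


def parse_melting_point(raw: str) -> Optional[float]:
    """Reduce a free-text melting point field to one number in deg C.

    Assumptions (illustrative only): a range becomes its midpoint,
    qualifiers such as "ca.", ">" or "dec." are ignored, and a field
    without any number (e.g. "sublimes") yields None.
    """
    text = raw.strip()
    match = RANGE_RE.search(text)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        return (low + high) / 2
    match = SINGLE_RE.search(text)
    return float(match.group(0)) if match else None


if __name__ == "__main__":
    for field in ["120-122", "ca. 98.5 (dec.)", "-10 to -8", "sublimes"]:
        print(repr(field), "->", parse_melting_point(field))
```

A real cleanup would probably keep a separate flag for decomposition or sublimation rather than silently discarding it.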

For our purposes of prioritizing organic chemicals for solubility modeling and applications, we curated this initial dataset by collapsing redundant chemical compositions and excluding inorganics (including organometallics) and salts. We did retain organosilicon, organophosphorus and organoboron compounds. Because the primary key for all of our projects depends on ChemSpiderIDs, all compounds were assigned CSIDs by deposition in the ChemSpider database if necessary. SMILES were also provided for each entry, as well as a corresponding link to the Alfa Aesar catalog page. This curated collection contains 8739 entries.

For completeness, we thought it would be useful to merge the Alfa Aesar curated dataset with other collections for convenient federated searches. We thus added the Karthikeyan melting point dataset, which has been used in several cases to model melting point predictions. This dataset was downloaded from Cheminformatics.org. Although we were able to use most of the structures in that collection, a few hundred were left out because of some difficulty in resolving some of the SMILES, perhaps related to the differences in algorithms used by OpenBabel and OpenEye. Hopefully this issue will be resolved in a simple way and the whole dataset can be incorporated in the near future. This final curated collection contains 4084 entries.

Similarly the smaller Bergstrom dataset was included after processing the original file to a curated collection of 277 drug molecules.

Finally, the melting point entries from the ChemInfo Validation sheet itself, generated by student contributions, were added, bringing the collection to a current total of 13,436 Open Data melting point values. We believe that this is currently the largest such collection and that it should facilitate the development of completely transparent and free models for the prediction of melting points. As we have argued recently, improved access to measured or predicted melting points is critical to the prediction of the temperature dependence of solubility.

In addition to providing the melting point data in tabular format, Andrew Lang has created a convenient web based tool to explore the combined dataset. A drop down menu at the top allows quick access to a specific compound and reports the average melting point as well as a link to the information source. In the case of an Alfa Aesar source, a link to the catalog is provided, where the compound can be conveniently ordered if desired.
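
For readers who prefer to work from the tabular data directly, the per-compound average can be sketched in a few lines of Python. This is only an illustration and not the code behind Andrew's tool: the tuple layout, the placeholder CSIDs and the melting point values are all made up.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (CSID, melting point in deg C, source collection).
# The CSIDs and values are placeholders, not real measurements.
ROWS = [
    (1, 120.0, "Alfa Aesar"),
    (1, 122.0, "Karthikeyan"),
    (2, 50.0, "Bergstrom"),
]


def summarize(rows):
    """Group rows by CSID and report the average melting point and sources."""
    grouped = defaultdict(list)
    for csid, mp_c, source in rows:
        grouped[csid].append((mp_c, source))
    return {
        csid: {
            "average_mp_c": mean(mp for mp, _ in values),
            "sources": sorted({src for _, src in values}),
        }
        for csid, values in grouped.items()
    }


if __name__ == "__main__":
    for csid, info in summarize(ROWS).items():
        print(csid, info)
```

Keeping the source list next to the average makes it easy to trace each value back to where it came from.
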
In another type of search, a SMARTS string can be entered with an optional range limit for the melting points. In the following example 14 hits are obtained for benzoic acid derivatives with melting points between 0 °C and 25 °C. Clicking on an image will reveal its source. (BTW even if you don't know how to perform sophisticated SMARTS queries, simply looking up the SMILES for a substructure on ChemSpider or ChemSketch will likely be sufficient for most types of queries.)

Preliminary tests on a Droid smartphone indicate that these search capabilities work quite well.

Finally, I would like to thank Antony Williams, Andrew Lang and the people at Alfa Aesar (now added as an official sponsor) who contributed many hours to collecting, curating and coding for the final product we are presenting here. We hope that this will be of value to the researchers in the cheminformatics community for a variety of open projects where melting points play a role.


Sunday, December 05, 2010

Dana Vanderwall on Cheminformatics at Drexel

Dana Vanderwall, Associate Director of Cheminformatics at Bristol-Myers Squibb, presented for my last Chemical Information Retrieval class on December 2, 2010.

The first part covered "Cheminformatics & The evolving relationship between data in the public domain & pharma" and included a general discussion of modern drug discovery and the details of a malaria dataset recently released from the pharmaceutical industry to the public.
The second part described a project based on "Molecular Clinical Safety Intelligence", where tracking side effects from approved drugs can help in the design of new drugs.
It was a very nice way to close out the course, showing very practical applications of the concepts we covered over the term. The recording is available below.


Saturday, March 27, 2010

Education 2.0: Leveraging Collaborative Tools for Teaching

On March 25, 2010 I presented at the Drexel E-Learning 2.0 Conference on "Education 2.0: Leveraging Collaborative Tools for Teaching". It was an opportunity to update my slides with what I did and learned from the Chemical Information Retrieval course I taught over the Fall 2009 term.

I described using a wiki to organize course content and to allow students to contribute useful resources. Their assignments were also designed to be useful to other students in the class as well as to the general library and chemistry community.

I covered using wikis and other collaborative tools to mentor students doing laboratory research with Open Notebook Science. At the end I provided a quick overview of using games and Second Life for educational purposes.


Thursday, March 04, 2010

Nature Precedings as an Archiving Tool for ONS Solubility Book

The issue of archiving and citation is a topic that is usually raised whenever I give a talk about Open Notebook Science. We have recently tried to address this using several complementary strategies.

The publication of a book containing a snapshot of all the values obtained from the Open Notebook Science Solubility Challenge has turned out to be a convenient mechanism. By using Lulu, the book can be either downloaded for free as a PDF or ordered as a physical copy for just the printing and shipping charges.

However, Lulu does not have a convenient method of keeping track of different editions of the book and it is unclear how to best cite them.

Nature Precedings solves both of these problems quite nicely. I have uploaded the PDF of each book edition to NP and the versions are automatically linked to each other. In fact if you try to access an older edition, NP pops up a warning that a more recent version is available with the corresponding link (see image below).

Precedings also provides information about how to cite the document, including a DOI for each version. Unfortunately it appears that it can take some time for the DOIs to resolve. Links to different versions can also be formatted like this:
http://precedings.nature.com/documents/4243/version/1
http://precedings.nature.com/documents/4243/version/2
http://precedings.nature.com/documents/4243/version/3
Links to the Lulu version of each book are also provided, which is convenient for anyone who might want to order a physical copy.

At this time Precedings does not accept zip files containing the full archive of the source files for each book version - although a link to the archive is provided in the preface of the book. We have found that our library's DSpace repository is a convenient location for these.


Monday, February 22, 2010

Science Commons Symposium Thoughts

UPDATE: the recording of my talk is here, following Cameron Neylon. Also see other sessions.

The Science Commons Symposium held at the Microsoft Campus in Redmond on Feb 20, 2010 turned out to be the best conference I have attended in the past year. Hope Leman and Lisa Green did a fantastic job of lining up an electric group of speakers and making sure that everything ran smoothly. Chris Pirillo provided streaming video of the talks and the liveblogging on FriendFeed and Twitter was pretty active. The recordings will be made available shortly.

It was utterly captivating from start to finish. Cameron Neylon started us off with "Science in the Open: Why do we need it? How do we do it?" by outlining the tremendous opportunities of doing science more openly while remaining aware of the obstacles. I followed up with a specific Open Science implementation "Using Free Hosted Web2.0 Tools for Open Notebook Science", including the recent work I did with Andrew Lang on creating snapshot archives of a notebook with source files.

Antony Williams followed with "ChemSpider: Collecting and Curating the World’s Chemistry with the Community", convincingly demonstrating the power of crowdsourcing to curate Open Data. Peter Murray-Rust then covered "Open Data and how to achieve it", pointing out the role of an embargo period in getting people to start to participate in exposing data. All of these presentations made the symposium fairly chemistry centric but I don't think the audience minded - and there were a few chemists in the audience.

After lunch Heather Joseph from SPARC talked about "Is Open Access the “New Normal”?". Her views were about the role of policy change to support OA, for example how NIH funded work is required to be OA within 12 months of publication. Stephen Friend blew a lot of minds with his talk on "Setting Expectations: Need for Distributed Tasks and Evolving Disease Models". I'm not quite sure I completely get his network approach compared to our current disease models of targeting a specific receptor but I am sure I'll come across it again since it depends on the processing of (vast amounts of) Open Data.

Peter Binfield proudly recounted the achievements of PLoS ONE, of course including the article-level metrics: "PLoS ONE and article-level metrics – A case study in the Open Access publication of scholarly journals". I didn't agree with his call for converting all the metrics to a single number for academic performance reporting - but that did lead to a vigorous discussion on FriendFeed.

Finally John Wilbanks from Science Commons delivered the keynote. It was a mesmerizing overview of what is needed to make Open Science more productive and the importance of working at the bottleneck. He described the elegant way in which the CC0 license allows for a very simple way of making data available as if it were public domain, regardless of the laws in various countries. He also showed his current work on trying to make automatic licenses for processes under patent protection and material transfer agreements.

Brian Glanz has provided a detailed summary of all the sessions, including a wealth of links to slides and additional information.

My slides:

#scspn


Tuesday, August 18, 2009

Spectral Game talk at ACS Fall 09

Yesterday (August 17, 2009) I gave my talk on the Spectral Game at the Using Technology to Enhance Learning in Organic Chemistry symposium at the American Chemical Society meeting. I was not able to attend the entire symposium but luckily I did catch David Soulby's talk on using Google Groups to distribute NMRs for labs that require many students to submit samples. I am a fan of using free and hosted services to simplify workflows of all types.

Also in attendance at the symposium were Liz Dorland and Bob Hanson. It was good to catch up with them. Bob shared a story of how he has been assigning his organic chemistry students tasks that lead to updating Wikipedia. There is so much potential for using the educational infrastructure to create better scientific content for everyone.

My talk on the Spectral Game highlighted the role of openness in teaching and research to create new educational tools, especially for learning NMR. Tony Williams said a few words at the end about ChemSpider, RSC and some upcoming opportunities to publish synthesis articles on ChemSpider.


Saturday, February 14, 2009

Web based Spectra Game

Yesterday I used the NMR game in Second Life during our 2-hour Friday workshop in CHEM242. (We used a new location on Drexel island SLURL) The students who attended had looked at little or no material prior to the workshop. By the end I ended up explaining chemical shifts, complex coupling patterns and diastereotopic hydrogens differentiated by the presence of a chiral center. The only concept we didn't cover is integration, although we used peak size to take a guess about groups with lots of hydrogens (like trimethyl).

I think it was a very efficient way to teach NMR and the students can now go off and continue to practice till our next workshop Monday. Second Life has some advantages - such as the ability to mediate group study sessions where students from remote locations can come together to play and discuss spectral assignments using either voice or chat. It is also nice to see the molecules in 3D, especially for bridged cyclic systems.

However, there is a bit of a learning curve to get into Second Life and not all computers have a suitable video card. So it is nice to now have the ability to play the game on a web browser. Andy set up the game play so that the score reflects the number of correct answers obtained in a row. There are also only 3 molecules to choose from instead of 5 in Second Life.
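
The scoring idea can be sketched in Python as follows. This is not Andy's implementation: the function name is hypothetical and the reset to zero after a wrong answer is my assumption about what "in a row" means here.

```python
from typing import Iterable, List


def streak_scores(answers: Iterable[bool]) -> List[int]:
    """Return the running score after each answer.

    Assumption: each correct answer adds one to the current streak
    and a wrong answer resets the streak to zero.
    """
    score = 0
    history = []
    for correct in answers:
        score = score + 1 if correct else 0
        history.append(score)
    return history


if __name__ == "__main__":
    # Three right, one wrong, then two right.
    print(streak_scores([True, True, True, False, True, True]))
```

Under this assumption the high score is simply the longest run of consecutive correct answers in a session.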

We're using JSpecView to render the spectra so expanding peaks simply requires dragging the mouse across the area of interest. It is also possible to integrate and view the metadata by right clicking.

Currently we mainly have H NMR spectra but we'll be adding lots more C NMR, IR, UV, MS, etc. It all depends on how many Open Data contributions we can find. If anyone has spectra to donate please upload them to ChemSpider and don't forget to check the box for Open Data.

This has been a wonderful example of rapid collaboration by Andrew Lang, Rajarshi Guha, Antony Williams, Robert Lancashire and myself.

Give the web Spectra Game a spin and see if you can beat the high score....



Thursday, October 02, 2008

NISO meeting on Open Research Data Standards

I spent the day in Baltimore yesterday at the National Information Standards Organization. We discussed the role of standards in Research Data, with a large focus on Open Data (see meeting blog). The FriendFeed discussion is here. A publication will result from this and I'll link to it when available.

A lot of the discussion revolved around the citation of datasets. My own view (and something that Cameron Neylon champions as well) is that a good way to encourage sharing of data is to make saving datasets convenient and part of the researcher's workflow. I recommended 4 simple options:
  1. use the open JCAMP-DX format for XY datasets (e.g. spectra) and Robert Lancashire's JSpecView for easy manipulation in a browser
  2. use GoogleDocs
  3. use Google DataSets
  4. do Open Notebook Science using your favorite tool (we use Wikispaces)
Then, following the Southampton Resolution on Open Science, after your paper comes out share these datasets.


Creative Commons Attribution Share-Alike 2.5 License