ChemSpotlight - Indexing Chemistry on a Mac
Apple's Spotlight is a system-wide indexing and searching technology built into OS X 10.4 (Tiger). It offers full-text searching of e-mail messages, word processing documents, spreadsheets, presentations, PDF files, and so on. It also supports built-in document metadata, so users can restrict a search to, say, pictures taken in the last year with a particular digital camera or at a particular resolution. This metadata also shows up in the Finder, for example in the "More Info" section of the Get Info window, and can include numeric data, text, dates, boolean true/false values, and other standard data types.
Most importantly, Spotlight allows developers to extend the architecture to new types of file formats and new metadata. This makes it ideal for indexing and searching scientific data. Essentially, large parts of the indexing, database, and searching mechanisms are handled by Spotlight. All you need to do is focus on the key parts for your file formats or data types.
Even better, Spotlight is not limited to the system-wide Spotlight menu. It also provides command-line utilities (mdfind) and a full API for third-party programs (e.g., MoRU, HoudahSpot). If your Spotlight plugin defines new metadata types, these programs automatically gain the ability to search on them -- for example, searching by chemical formula.
ChemSpotlight overview
ChemSpotlight is an open-source Spotlight plugin which indexes common chemical file formats (e.g., MDL .mol, .mdl, .sd, .sdf, Tripos .mol2, Protein Data Bank .pdb, Chemical Markup Language .cml, and XYZ) using the Open Babel chemistry library. Any format supported by Open Babel (57 at last count) can be handled with ChemSpotlight. It is provided as a Universal Binary for PowerPC and Intel, for optimized performance on both. It adds molecular formulas, molecular weights, and a variety of other information for searching and Finder windows.
Because Spotlight automatically indexes a file any time it is created or modified, it does not matter whether you are downloading files from a website, creating them in a chemical program, or copying them from Windows friends: ChemSpotlight will handle them in fractions of a second.
Let me emphasize that again. It's fast. Unless you suddenly grab 100MB+ of new files, chances are that you won't notice any hit from indexing. And that's on my slow three-year-old PowerBook G4.
In many ways, it's better to show ChemSpotlight in action (below) or let you download it and try it for yourself. Seeing the Finder suddenly learn how to handle a PDB file is fun.
Notice the computed chemical formula and molecular weight information for this file:
Developing a Spotlight plugin
Apple's developer tools offer example code for Spotlight plugins as well as documentation.
In the case of ChemSpotlight, reading in the file and calculating much of the chemical metadata are handled by the Open Babel library. This means that ChemSpotlight itself is only ~300 lines of code.
Ideally, an importer should read as little of the file as needed to index important metadata. This provides the best possible performance, since the code is called any time the file is copied, moved, or altered.
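To illustrate the "read as little as needed" idea, here is a minimal Python sketch for the XYZ format, where the first line holds the atom count and the second line is a free-form comment. This is not how ChemSpotlight works (it relies on Open Babel), and the function name `read_xyz_header` is made up for this example.

```python
def read_xyz_header(lines):
    """Return (atom_count, comment) from the first two lines of XYZ data.

    `lines` is any iterable of text lines, such as an open file. Only two
    lines are consumed; the coordinate block is never parsed.
    """
    lines = iter(lines)
    atom_count = int(next(lines).split()[0])  # line 1: number of atoms
    comment = next(lines, "").strip()         # line 2: comment or title
    return atom_count, comment


if __name__ == "__main__":
    sample = [
        "3\n",
        "water\n",
        "O  0.000  0.000  0.000\n",
        "H  0.757  0.586  0.000\n",
        "H -0.757  0.586  0.000\n",
    ]
    print(read_xyz_header(sample))
```

Passing an open file object works the same way, so the atom count of a very large file can be indexed without walking through its coordinates.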
Importantly, when developing an mdimport plugin, you should pay attention to three files besides the code itself. The first is the Info.plist file, which defines which file types are supported by the plugin. These are defined by the Uniform Type Identifier (UTI) system. The second is schema.xml, which defines any custom metadata keys in your plugin, as well as which metadata keys are to be indexed for particular file types. Yes, this means that you must make sure that your file types are included in both the Info.plist and schema.xml files. Finally, schema.strings defines localized translations for the metadata keys, which show up in the Finder and third-party utilities.
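Since the file types have to agree in both files, a small consistency check can save some head-scratching. The Python sketch below assumes the usual importer layout: UTIs listed under `CFBundleDocumentTypes` / `LSItemContentTypes` in Info.plist, and `<type name="...">` elements in schema.xml. The function names and the `org.example.*` UTIs are hypothetical and are not taken from ChemSpotlight.

```python
import plistlib
import xml.etree.ElementTree as ET


def importer_utis(info_plist_bytes):
    """UTIs listed under CFBundleDocumentTypes / LSItemContentTypes."""
    info = plistlib.loads(info_plist_bytes)
    utis = set()
    for doc_type in info.get("CFBundleDocumentTypes", []):
        utis.update(doc_type.get("LSItemContentTypes", []))
    return utis


def schema_utis(schema_xml_text):
    """UTIs named by <type name="..."> elements, ignoring XML namespaces."""
    root = ET.fromstring(schema_xml_text)
    return {
        element.get("name")
        for element in root.iter()
        if element.tag.rsplit("}", 1)[-1] == "type" and element.get("name")
    }


def missing_types(info_plist_bytes, schema_xml_text):
    """Report UTIs that appear in one file but not in the other."""
    plist_types = importer_utis(info_plist_bytes)
    schema_types = schema_utis(schema_xml_text)
    return {
        "only_in_plist": plist_types - schema_types,
        "only_in_schema": schema_types - plist_types,
    }


if __name__ == "__main__":
    # Hypothetical UTIs, for illustration only.
    plist = plistlib.dumps({
        "CFBundleDocumentTypes": [{
            "CFBundleTypeRole": "MDImporter",
            "LSItemContentTypes": ["org.example.pdb", "org.example.mol"],
        }]
    })
    schema = (
        '<schema xmlns="http://www.apple.com/metadata"><types>'
        '<type name="org.example.pdb"/></types></schema>'
    )
    print(missing_types(plist, schema))
```

Two empty sets mean the two files agree on which file types the plugin handles.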
Current metadata
The ChemSpotlight plugin tries to index as much common chemical/biomolecular metadata as possible. This includes residue sequence, molecular weight and exact isotopic mass, molecular formula (e.g., C42H56S15), the number of atoms, bonds, residues, or molecules (for multi-molecular files), dimensionality (since some files are 2D or 3D), and common chemical identifiers such as Daylight SMILES and IUPAC/NIST InChI.
In addition, many chemical file formats (e.g., CML, MDL SDF, PDB) store arbitrary metadata internally as key/value pairs. Although Spotlight requires importers to declare metadata keys beforehand (in the schema.xml file, mentioned above), ChemSpotlight still allows full-text searching on these keys and values. However, there is currently no way to restrict a search to only these values, as you can with the metadata fields below.
| Metadata Field | Notes |
|---|---|
| net_sourceforge_openbabel_Chirality | True/False (1/0) |
| net_sourceforge_openbabel_Dimension | 0D/2D/3D depending on the coordinates found |
| net_sourceforge_openbabel_DisplayFormula | Formula with subscripts for Finder “Get Info” windows |
| net_sourceforge_openbabel_Formula | Chemical formula in standard “Hill Order” |
| net_sourceforge_openbabel_Mass | Standard molecular weight in a.m.u. (g/mol) |
| net_sourceforge_openbabel_ExactMass | Molecular mass of most common isotopes for mass spectra |
| net_sourceforge_openbabel_NumAtoms | Number of atoms in the molecule |
| net_sourceforge_openbabel_NumBonds | Number of bonds in the molecule |
| net_sourceforge_openbabel_NumMols | Number of molecules in the file |
| net_sourceforge_openbabel_NumResidues | Number of biomolecule residues |
| net_sourceforge_openbabel_SMILES | Daylight SMILES string for this molecule |
| net_sourceforge_openbabel_InChI | IUPAC/NIST canonical identifier |
| net_sourceforge_openbabel_Sequence | Biomolecule residue sequence (e.g., LYS-ARG-GLY). |
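For readers unfamiliar with the "Hill Order" used for the Formula field: carbon comes first, hydrogen second, and all other elements follow alphabetically; if there is no carbon, every element (hydrogen included) is simply listed alphabetically. The Python sketch below shows the convention only. It is not Open Babel's implementation, and `hill_formula` is a name invented for this example.

```python
def hill_formula(counts):
    """Format a {element symbol: count} mapping as a Hill-order formula."""
    symbols = [symbol for symbol, n in counts.items() if n > 0]
    if "C" in symbols:
        # Carbon first, then hydrogen, then everything else alphabetically.
        rest = sorted(s for s in symbols if s not in ("C", "H"))
        ordered = ["C"] + (["H"] if "H" in symbols else []) + rest
    else:
        # No carbon: strictly alphabetical, hydrogen included.
        ordered = sorted(symbols)
    # A count of 1 is left implicit, as in "CH4".
    return "".join(s if counts[s] == 1 else f"{s}{counts[s]}" for s in ordered)


if __name__ == "__main__":
    print(hill_formula({"S": 15, "H": 56, "C": 42}))
    print(hill_formula({"H": 2, "O": 1}))
```

Having one canonical ordering matters for searching: a formula query is a plain string comparison, so it only works if the same molecule always produces the same string.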
Current limitations & bugs
There are some hits and misses with ChemSpotlight right now, some due to my programming and some due to the current architecture of Spotlight.
- Although the code (and the command-line mdls tool) always rounds the molecular weight to 4 decimal places, the Finder sometimes shows a value that is off by +/- 10^-12 or so.
- Spotlight currently seems unable to handle "compound extensions." For example, it seems impossible to index a file like 1abc.pdb.gz by declaring that ChemSpotlight can handle files of ".pdb.gz" extension.
- Spotlight currently does not allow any sort of "pass-through" indexing. So Daylight SMILES files, which typically end in ".smi", don't work -- this extension is reserved by Apple for self-mounting disk images. Similarly, generic XML files don't work -- ChemSpotlight would have to handle every type of XML file, not just chemistry-based XML namespaces. The same holds for chemical structures in Microsoft Word or PDF files. (In principle, ChemSpotlight would be ready to handle these sorts of chemical data as well.)
- Spotlight is designed to handle individual files, not databases. This is why Mail.app splits mail messages out into separate files, and the address book database has "hints" for Spotlight, separating each entry into a separate file. For ChemSpotlight, this means that formats like the MDL SD file, which can contain hundreds or thousands of individual "molecular records," are not always handled cleanly. While the file is indexed correctly and searching identifies the SD file which contains the desired structure, there's no way to jump directly to the correct entry in the file. It also means that metadata for such files includes arrays of molecular weights, formulas, etc. -- but there's no way of grouping the metadata to indicate that a given formula, molecular weight, etc. go together as an individual record.
For the last item, there are some workarounds for particular programs. As mentioned, Address Book and Microsoft Entourage split out "hints" to Spotlight as individual entries. But since these are stand-alone programs, they can take the search result from Spotlight and associate it back with the internal database. So for a stand-alone scientific application, the same approach can be performed, although it does waste disk space. For ChemSpotlight as a Spotlight plugin by itself, there is no useful workaround.
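The grouping problem is easier to see with a toy example. The Python sketch below models a two-record file whose per-record values are flattened into one list per attribute, and assumes that a file-level search accepts a condition when any one value in the list satisfies it. All names here are illustrative; none of this is Spotlight's actual API.

```python
# Two records as they might appear in a single multi-molecule file.
records = [
    {"formula": "C6H6", "mass": 78.11},   # benzene
    {"formula": "C2H6O", "mass": 46.07},  # ethanol
]


def flatten(records):
    """Collapse per-record values into one list per attribute."""
    return {
        "formula": [r["formula"] for r in records],
        "mass": [r["mass"] for r in records],
    }


def file_matches(attrs, formula, max_mass):
    """File-level match: each condition may be met by a different value."""
    return formula in attrs["formula"] and any(m < max_mass for m in attrs["mass"])


def record_matches(records, formula, max_mass):
    """Record-level match: a single record must meet both conditions."""
    return any(r["formula"] == formula and r["mass"] < max_mass for r in records)


if __name__ == "__main__":
    attrs = flatten(records)
    # "C6H6 with mass below 50": no such record exists...
    print(record_matches(records, "C6H6", 50))
    # ...but the flattened file still matches, via benzene's formula
    # and ethanol's mass.
    print(file_matches(attrs, "C6H6", 50))
```

A stand-alone application that keeps its own record-level index can run the stricter check itself after Spotlight has narrowed the search down to a file.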
More to come?
ChemSpotlight is open source, so it will continue to evolve and add features. Bugs, suggestions, comments, and of course coding contributions are more than welcome. Of course there are also likely to be improvements to Spotlight when Apple releases Leopard.
More importantly, ChemSpotlight is designed to work for you. It's easy to customize to handle your metadata, resulting in a solution which dynamically indexes your files and your filetypes on your drives with no restriction on organization. No external database is needed.



Comments
Podcast and Demo
I should mention that I'm currently working on a podcast based on a recent conference talk I gave, as well as a QuickTime movie demonstration of ChemSpotlight in action.
It really is much more impressive live.
Between CoreData and Spotlight, I really think Apple has provided a large abstraction for scientific database applications. It's great!
-Geoff
Searching the chemical metadata
By now I hope you have installed ChemSpotlight. The next step is to look at searching the data.
There are several options:
You could use the Spotlight menu, but it is not ideal for searching only the chemical metadata.
You can use mdfind from the Terminal; there is an excellent summary here (http://developer.apple.com/documentation/Darwin/Reference/ManPages/man1/mdfind.1.html).
Fredrik Wallner has written a Perl script (http://wallner.nu/fredrik/?postid=4) that adds sub-structure searching.
iBabel (http://www.macinchem.org) also has the ability to search the metadata.
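As a rough idea of what an mdfind search on the chemical metadata could look like when driven from a script, here is a Python sketch. The attribute name comes from the metadata table in the article and `-onlyin` is a standard mdfind option. The helper names are made up, the formula is inserted without any escaping, and it is an assumption that the stored formula string matches the query text exactly.

```python
import shutil
import subprocess


def formula_query(formula):
    """Build a raw Spotlight query matching an exact formula string."""
    return f'net_sourceforge_openbabel_Formula == "{formula}"'


def mdfind_command(query, only_in=None):
    """Return the argument list for mdfind, optionally limited to one folder."""
    argv = ["mdfind"]
    if only_in:
        argv += ["-onlyin", only_in]
    argv.append(query)
    return argv


def search(query, only_in=None):
    """Run mdfind and return the matching paths (Mac OS X only)."""
    if shutil.which("mdfind") is None:
        raise RuntimeError("mdfind not found; this needs Mac OS X with Spotlight")
    result = subprocess.run(
        mdfind_command(query, only_in),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Print the command rather than running it, so this works anywhere.
    print(mdfind_command(formula_query("C6H6"), only_in="/Users/Shared"))
```

On a Mac with ChemSpotlight installed, calling `search(formula_query("C6H6"))` should then return the paths of the indexed files whose formula matches.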