Reading SMILES with MX
The latest release of MX, the Java toolkit for cheminformatics, now supports reading a subset of SMILES strings. Although the reader is still incomplete, full SMILES support is planned within a few releases.
To get an idea of how to use the new SMILES reader, we can use interactive JRuby. Assuming we've downloaded the mx-0.105.0 jarfile to our working directory, we can use:
$ jirb
irb(main):001:0> require 'mx-0.105.0.jar'
=> true
irb(main):002:0> import com.metamolecular.mx.io.daylight.SMILESReader
=> Java::ComMetamolecularMxIoDaylight::SMILESReader
irb(main):003:0> bromobenzene = SMILESReader.read 'C1=CC=CC=C1Br'
=> #<Java::ComMetamolecularMxModel::DefaultMolecule:0x8a2023 @java_object=com.metamolecular.mx.model.DefaultMolecule@182a70>
irb(main):004:0> bromobenzene.count_atoms
=> 7
irb(main):005:0> bromobenzene.get_atom(6).get_symbol
=> "Br"
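To give a feel for what reading a SMILES subset involves, here's a minimal, self-contained sketch in plain Python. It is not MX's parser and doesn't touch the MX API; the read_smiles function and its tuple-based return value are made up for illustration. It handles only unbracketed organic-subset atoms, explicit bond symbols, branches, and single-digit ring closures, and it rejects everything else (aromatic lowercase atoms included).

```python
import re

# Organic-subset atoms only; two-letter symbols must be tried first.
TOKEN = re.compile(r"Br|Cl|[BCNOPSFI]|[-=#]|\d|[()]")


def read_smiles(smiles):
    """Return (atoms, bonds) for a small SMILES subset (hypothetical helper).

    atoms is a list of element symbols; bonds is a list of
    (begin_index, end_index, order) tuples.
    """
    atoms, bonds = [], []
    previous = None     # index of the atom the next atom bonds to
    order = 1           # order of the pending bond
    open_rings = {}     # ring-closure digit -> (atom index, bond order)
    branch_stack = []   # atoms to return to when a branch closes

    position = 0
    while position < len(smiles):
        match = TOKEN.match(smiles, position)
        if match is None:
            raise ValueError("unsupported SMILES token at %d" % position)
        token = match.group()
        position = match.end()

        if token in ("-", "=", "#"):
            order = {"-": 1, "=": 2, "#": 3}[token]
        elif token == "(":
            branch_stack.append(previous)
        elif token == ")":
            previous = branch_stack.pop()
        elif token.isdigit():
            if token in open_rings:
                # Second occurrence of the digit closes the ring.
                start, ring_order = open_rings.pop(token)
                bonds.append((start, previous, max(order, ring_order)))
            else:
                open_rings[token] = (previous, order)
            order = 1
        else:
            atoms.append(token)
            index = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, index, order))
            previous = index
            order = 1
    return atoms, bonds


atoms, bonds = read_smiles("C1=CC=CC=C1Br")
print(len(atoms))   # 7
print(atoms[6])     # Br
```

Run against the bromobenzene string used above, the sketch finds the same seven atoms, with Br at index 6.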
Five Questions About the InChI Resolver
Yesterday the Royal Society of Chemistry (RSC) and ChemZoo (of ChemSpider fame) announced a plan to collaborate on the creation of an InChI Resolver service. From the announcement:
Using the InChI - an IUPAC standard identifier for compounds - scientists can share and contribute their own molecular data and search millions of others from many web sources. The RSC/ChemSpider InChI Resolver will give researchers the tools to create standard InChI data for their own compounds, create and use search engine-friendly InChIKeys to search for compounds, and deposit their data for others to use in the future.
...
The InChI Resolver will be based on ChemSpider's existing database of over 21 million chemical compounds and will provide the first stable environment to promote the use and sharing of compound data. 'ChemSpider hosts the largest and most diverse online database of chemical structures sourced from over 150 different data sources' adds Antony Williams of ChemSpider, 'We have embraced the InChI identifier as a key component of our platform and the basis of our structure searches and integration path to a number of other resources. We have delivered a number of InChI-based web services and, with the introduction of the InChI Resolver, we hope to continue to expand the utility and value of both InChI and the ChemSpider service.'
It's encouraging to see a major scientific publisher lend its support to InChI in further evidence of the broad adoption of the identifier. And an InChI key resolver is something I've previously said might be a good idea.
Still, InChI and InChI Key represent a significant change in platform for the field of chemistry, in which CAS Registry Numbers are the gold standard for chemical identification.
If we've learned anything from the last 30 years of information technology, it's that once a platform (no matter how dysfunctional) becomes entrenched, nothing short of a game-changing strategy and herculean effort can replace it. The failure of Windows Vista offers a stark reminder of the power of an entrenched platform. Closer to home, the failure of V3000 molfiles to gain significant traction against V2000 offers another.
With these thoughts in mind, here are some questions about the new InChI Resolver service:
1. What problem is the service really trying to solve? Although it might be obvious to those close to the situation, it's not quite clear to me. Many, if not most, of the desktop cheminformatics packages sold today now have support for generating InChIs. It's also possible to embed InChI in text documents without using a Web service. Convenient it's not, which may be the point. But if that's the case then the focus of the service should be convenience, simplicity, and ease of use.
2. How hard would it be to crack an InChI hash? Before dismissing this as impossible, consider that an InChI key is a form of encryption, and a weak one at that. Breaking encryption schemes has a long history in computer science. Given the regularity of InChI syntax, how hard would it be to create software that can computationally provide the InChI that was used to generate an InChI key? What alternative hashing method might make it easier to do so? If there is one, it would become the standard, not the one currently being used.
3. How will the authenticity of a hashed InChI from an untrusted source be verified? An InChI key might take the form of 'AAAAAAAAAAA-BBBBBBB-XYZ'. Given an arbitrary InChI key provided by an untrusted third party, how would you independently verify that it actually represents a valid key? In the absence of software like that described in Question 2, it would be impossible.
4. What about BINOLs and Ferrocenes? InChI can't distinguish between stereoisomers arising from axial chirality such as that found in widely used molecules such as BINOL. There are multiple ways to represent organometallics such as ferrocene using InChI, and each will give rise to a unique InChI key. This is a Bad Thing.
5. Why bother with an InChI key at all? Consider a hypothetical InChI key: 'AAAAAAAAAAA-BBBBBBB-XYZ'. To an end user uninterested in information technology, why does it matter how the key was generated? One selling point might be that given an arbitrary key, the chemical structure it represents can be decoded independently of any service. But that service is the core of the RSC/ChemSpider proposal - and it will apparently only be able to resolve previously deposited InChI keys. Sound familiar? This is essentially how the CAS Registry system works, except the CAS system can differentiate BINOL stereoisomers, uniquely identify organometallics, and even handle polymers and complex mixtures.
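The second, third, and fifth questions turn on the same point: a hashed key can be checked against an InChI you already have, but getting from a bare key back to a structure takes a lookup. The Python sketch below illustrates that asymmetry. It is only a toy under loud assumptions: toy_key simply truncates a SHA-256 digest and is NOT the real InChI key algorithm, and ToyResolver is a hypothetical stand-in, not the RSC/ChemSpider service.

```python
import hashlib


def toy_key(inchi):
    """Hash an InChI string to a short fixed-length key.

    NOTE: illustration only. This is NOT the real InChI key
    algorithm; it just truncates a SHA-256 digest.
    """
    digest = hashlib.sha256(inchi.encode("ascii")).hexdigest()
    return digest[:14].upper()


def verify(inchi, key):
    """A key can be checked only when the InChI itself is in hand."""
    return toy_key(inchi) == key


class ToyResolver(object):
    """Hypothetical resolver: maps keys back to deposited InChIs only."""

    def __init__(self):
        self._deposited = {}

    def deposit(self, inchi):
        key = toy_key(inchi)
        self._deposited[key] = inchi
        return key

    def resolve(self, key):
        # Returns None for any key that was never deposited.
        return self._deposited.get(key)


benzene = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"
methane = "InChI=1S/CH4/h1H4"

resolver = ToyResolver()
key = resolver.deposit(benzene)

print(verify(benzene, key))                # True
print(resolver.resolve(key) == benzene)    # True
print(resolver.resolve(toy_key(methane)))  # None
```

The methane key is perfectly well formed, yet the toy resolver can say nothing about it because nobody deposited methane first.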
Within the RSC/ChemZoo proposal is a gem of an idea. The CAS Registry system is closed and in all likelihood will remain forever so. Verifying the authenticity of CAS number/chemical structure assignments is a big problem made worse by the closed nature of the CAS Registry system. Chemists must have a reliable method to reference chemical structures. There are no doubt many solutions to this problem with big payoffs to the field of chemistry for the one that actually works.
Open Source Cheminformatics in Python with MX
MX is an open source cheminformatics toolkit written in Java. One of the reasons Java was selected as MX's development platform is the excellent support now available to interface the Java Virtual Machine to a variety of scripting languages. Of the scripting languages used in cheminformatics, Python stands out for its widespread adoption. This article will outline the steps needed to use MX from Python.
About Jython
Jython is a Java implementation of the Python interpreter. Although specific benchmarking numbers are surprisingly difficult to find around the Web, anecdotal evidence suggests the Jython interpreter is only slightly slower than the CPython interpreter in most areas, but may actually be faster than CPython in others, such as threading.
Another approach to Java-Python integration is JPype, which uses the Java Native Interface (JNI). The advantage is that it's not even necessary to switch your Python interpreter to begin using a Java library.
Creating a Jython Environment
Jython comes complete with a GUI installer that worked flawlessly on my Ubuntu Linux system.
My only gripe about Jython is its lack of readline support out of the box. The symptom consists of getting the following in an interactive jython session after hitting the up-arrow to retrace your command history:
>>> ^[[A
Although there is some documentation on enabling readline support, in my hands it failed.
However, I was successful in installing Jythonconsole, which I configured to be run from the command line with:
$ jipy
Jythonconsole offers some nice touches, including dropdown code-completion and, of course, command line history - although the latter isn't persistent across sessions.
Scripting MX With Python
Before we use MX from Jython, we'll need to specify a location for the MX jarfile. Assuming mx-0.104.0.jar is in our working directory, this can be accomplished with:
$ export CLASSPATH='mx-0.104.0.jar'
Invoking Jython now gives us access to the complete set of MX functionality.
Hello, Benzene
We can create a benzene molecule in Python using the following commands:
Jython Completion Shell
Jython 2.5b0 (trunk:5540, Oct 31 2008, 13:55:41)
[Java HotSpot(TM) Client VM (Sun Microsystems Inc.)] on java1.5.0_16
>>> from com.metamolecular.mx.io import Molecules
>>> benzene = Molecules.createBenzene()
>>> benzene.countAtoms()
6
That's it. We can now access any new feature of MX through Python without writing or debugging a single line of bridge code.
Other Uses of Jython in Cheminformatics
A few sources discuss the use of Jython to interface Java-based cheminformatics libraries. One of the most prolific is Noel O'Boyle, who has written a series of articles on the subject, including this introduction and this performance comparison. Noel's software project Cinfony uses Python to bridge cheminformatics toolkits written in different languages.
Conclusions
Scripting languages like Python offer a rapid and immediate way to test and develop software. This article has shown how simple it is to use the Java cheminformatics library MX from Python using Jython.
Update
It's also possible to import the MX jarfile into Jython using sys.path.append:
>>> import sys
>>> sys.path.append("mx-0.104.0.jar")
>>> from com.metamolecular.mx.io import Molecules
Flexible Depth-First Search With MX
Graph theory is an essential component of cheminformatics, if you dig deeply enough. MX is a lightweight cheminformatics toolkit written in Java with a major goal of exposing the most important cheminformatics graph manipulations in a flexible, Java-centric way. Previous releases have focused on implementing subgraph monomorphism functionality for use in substructure search. The new MX release, 0.104.0, introduces support for depth-first traversal. This article will give a simple example using this feature.
Downloading MX
MX can be downloaded in source or binary form:
mx-0.104.0.jar Platform-independent bytecode.
mx-0.104.0-src.tar.gz Source code and regression tests.
Scripting MX with JRuby
A previous article outlined the simple steps needed to install JRuby on unix-based systems for scripting MX.
Finding All Paths From a Given Atom
A fundamental graph operation in cheminformatics is finding all paths through a molecule from a starting atom. MX makes this easy with the com.metamolecular.mx.path.PathFinder class. Depth-first traversal is used in creating molecular fingerprints. Another use is in creating SMILES strings, although a limited form of depth-first traversal is used in which each atom in a molecule is traversed only once.
We can create a short library to print out all of the paths through a molecule in JRuby:
require 'mx-0.104.0.jar'
import 'com.metamolecular.mx.path.PathFinder'

class PathPrinter
  def initialize
    @finder = PathFinder.new
  end

  def print_paths atom
    paths = @finder.find_all_paths atom
    puts "printing all paths through the molecule"
    paths.each do |path|
      print_path path
    end
  end

  private

  def print_path path
    path.each do |atom|
      print atom.get_index
      print '-' unless path.get(path.length - 1).equals(atom)
    end
    puts
  end
end
Saving the above code in a file called pathprinter.rb, we can test it from interactive JRuby:
$ jirb
irb(main):001:0> require 'pathprinter'
=> true
irb(main):002:0> import com.metamolecular.mx.io.Molecules
=> Java::ComMetamolecularMxIo::Molecules
irb(main):003:0> benzene=Molecules.create_benzene
=> #<Java::ComMetamolecularMxModel::DefaultMolecule:0x43da1b @java_object=com.metamolecular.mx.model.DefaultMolecule@8a2023>
irb(main):004:0> p=PathPrinter.new
=> #<PathPrinter:0x19ed7e @finder=#<Java::ComMetamolecularMxPath::PathFinder:0x3727c5 @java_object=com.metamolecular.mx.path.PathFinder@1140709>>
irb(main):005:0> p.print_paths benzene.get_atom(0)
printing all paths through the molecule
0-5-4-3-2-1
0-1-2-3-4-5
=> nil
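For readers who want to see the traversal itself without MX on hand, here's a rough sketch in plain Python. It is not MX's implementation: find_all_paths and the adjacency-dictionary representation are assumptions made for illustration, and it assumes (as the benzene output suggests) that a path is reported once it can't be extended without revisiting an atom.

```python
def find_all_paths(neighbors, start):
    """Return every maximal simple path that begins at start.

    neighbors maps an atom index to the indices of its bonded atoms.
    A path is recorded when it cannot be extended without revisiting
    an atom already on the path.
    """
    paths = []

    def walk(path):
        extended = False
        for atom in neighbors[path[-1]]:
            if atom not in path:    # feasible: not yet on this path
                path.append(atom)
                walk(path)
                path.pop()          # backtrack
                extended = True
        if not extended:
            paths.append(list(path))

    walk([start])
    return paths


# Benzene as a six-membered ring: atom i is bonded to i-1 and i+1.
benzene = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

for path in find_all_paths(benzene, 0):
    print("-".join(str(atom) for atom in path))
# prints:
# 0-5-4-3-2-1
# 0-1-2-3-4-5
```

The two printed paths are the two ways around the ring from atom 0. The order in which they appear depends only on how each atom's neighbors happen to be listed.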
How It Works
Two classes collaborate in this traversal: com.metamolecular.mx.path.PathFinder and com.metamolecular.mx.path.DefaultStep.
Creating a depth-first traversal of your own is as simple as creating a DefaultStep from an Atom and implementing a walk method similar to the one shown below:
public void walk(Step step)
{
  if (!step.hasNextBranch())
  {
    // do something with the completed branch
    return;
  }

  while (step.hasNextBranch())
  {
    Atom next = step.nextBranch();

    if (step.isBranchFeasible(next))
    {
      walk(step.nextStep(next));
      step.backTrack();
    }
  }
}
Conclusions
Depth-first traversal is an important tool in any cheminformatics library. MX offers an implementation of this traversal strategy that can be easily customized.
Goodbye Subversion, Hello Git and GitHub
A source code version control system is an essential ingredient of software development. But it's just low-level technology that doesn't change the way you think about creating software and so can safely be ignored. At least that's what I thought a couple of months ago before I started looking into Git, the version control system co-developed by Linus Torvalds.
About Git
If you've only used CVS or Subversion, Git may seem like a radical departure from sanity. For starters, Git does away with the idea of a central code repository. Instead, each contributor maintains their own repository. Authority is determined the way humans have always done it - through a chain of trust.
Forking, long thought of as a symptom of bad project management, becomes part of the normal workflow in Git. The key to making this system work was to make merging actually work.
About GitHub
GitHub uses Git as a core technology while extending its basic ideas into the direction of social networking and publication. Everything you do with your source code on GitHub becomes a Web resource, complete with its own URL. For public projects, these URLs are world-readable. Imagine the possibilities...
Specific Example: GitHub and MX
Let's say you're curious to see how difficult it would be to use MX to implement a cheminformatics feature you must have - a SMILES parser. If MX were using Subversion, you'd probably consider contacting me to negotiate write access to the 'central repository.' The other committers and I may be busy or on vacation, or otherwise not able to get back to you for some time. When we do, you may have moved on to other things. Considering this possible sequence of events, you may decide not to take the first step.
Something that should be spontaneous, simple and fast isn't that way at all.
With GitHub, you'd simply go to my MX repository homepage and fork it, creating your own independent repository. I'd get a notification that you had forked and your fork would show up in your public profile. We could then both automatically keep up-to-date on what the other was doing through a variety of tools such as RSS feeds.
If at some point you felt your SMILES parser had become something that should be included in MX, you'd send me a pull request. If the code passed standards I may have set such as copyright notice, compilability, and test coverage, I'd then be in a position to merge it back into the MX code base.
If significant changes were needed to get our respective repositories in sync, I'd have the choice of either making them myself or asking if you'd make them on your own repository. The latter would be the smart choice for both of us since you know your SMILES parser a lot better than I do.
The same kind of process can also be applied by me with my own repository. For example, I currently have an open branch aimed at building a depth-first traverser for MX to be used in SMILES writers, substructure search tools and the like.
Not Just a New Tool?
Any technology that can change the way people interact is more than just a tool. Although it's still early days, Git and GitHub might fall into that category. Only time will tell.

