Software review: the KNIME workflow environment and its applications in genetic programming and machine learning

O’Hagan, Steve; Kell, Douglas B.

doi:10.1007/s10710-015-9247-3

Published: 25 July 2015

Volume 16, pages 387–391, (2015)

Genetic Programming and Evolvable Machines

Steve O’Hagan^1,2 &
Douglas B. Kell^1,2

1 Introduction

Software comes in various forms, from the hair-shirt style of the command line to full-blown, GUI-based commercial offerings. The former tends to give its users more control, but disenfranchises many potential users who cannot themselves program yet might otherwise benefit from it. A kind of halfway house is represented by software environments that provide both flexibility (power) and ease of use. A particular subset is represented by workflow environments, in which loosely coupled, individual processing nodes can be ‘bolted together’ to permit complex computational operations. Taverna [7] (http://www.taverna.org.uk/) is a very well-known scientific workflow system, especially in bioinformatics. It is a fully open environment, freely available, and workflows can be shared via its sister site myExperiment (http://www.myexperiment.org/). It has some extensions for cheminformatics [2]. A particular strength is the means by which it can use Web services to link federated Web-based resources, which are a particular feature of bioinformatics.

For cheminformatics (see also [3]), we have been using the KNIME environment [4, 5]. KNIME stands for the KonstaNz Information MinEr [1] and is pronounced ‘NIME’ (with a silent ‘K’, as in ‘knife’). It is freely available via www.knime.org for unrestricted use on the desktop, with versions that operate under MS-Windows, Linux and Mac OS X. As datasets may be large, a reasonably powerful machine is advised. The download itself is just over 1 GB, and installation is both automated and simple. (There is an otherwise identical commercial offering available at www.knime.com; its chief differences are that the environment may be extended to servers, and to clusters that run the Sun Grid Engine.) Our main experience is with the Windows desktop version. Under the hood, KNIME is built on the Eclipse environment, with Java as its main internal language. Many other languages can be used with it, however, as detailed below.

The interface is configurable, but the more-or-less default version is shown in Fig. 1. This shows a workflow (written by SO’H) that takes a ChEMBL dataset (https://www.ebi.ac.uk/chembl/target/inspect/CHEMBL4333) of drugs binding to a particular receptor, and compares three data-analytical methods (GP, random forests and partial least squares regression). To create the workflow, nodes are dragged from the node repository, dropped into the main workflow window, and linked using the mouse to join their output and input ports (shown as small triangles). The node repository can be searched: typing letters into its search box shows only those nodes whose names contain them. A vast number of nodes exist, including general ones for data and text mining, statistics and machine learning (at least one of KNIME’s originators has a background in fuzzy logic), with other more specialised ones for cheminformatics, mass spectrometry, image processing, time series analysis, and so on. Trusted users can contribute new nodes or entire collections; these are optionally downloadable and/or updated nightly. A right click on a node allows one to configure it. Thus a node for reading in an MS-Excel file would require information on the filename, whether the first row defines column headers, and so on. A second right click allows a successfully configured node to be executed. If it executes successfully, the ‘traffic light’ shown on the node goes green, as in Fig. 1. Each node can be annotated with a simple description of its function, again as in Fig. 1. Large and complex workflows do not necessarily fit legibly into the main window, so a navigation window appears below it under ‘outline’. Left clicking on a node provides a description of what it does (and, if the description is well written, how to configure it).
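The configure-then-execute cycle described above can be pictured with a small model in plain Python. This is only an illustrative sketch, not KNIME code: the `Node` class, the `Status` names and the three toy nodes are all hypothetical, and the red and yellow states are an assumption about the traffic-light convention (the text above only states that an executed node goes green).

```python
from enum import Enum


class Status(Enum):
    RED = "not configured"
    YELLOW = "configured, not yet executed"
    GREEN = "executed"


class Node:
    """A toy processing node with one output port and any number of inputs."""

    def __init__(self, name, func):
        self.name = name
        self.func = func      # the work done when the node executes
        self.inputs = []      # upstream nodes wired to the input ports
        self.settings = None
        self.status = Status.RED
        self.output = None

    def connect(self, upstream):
        """Link an upstream node's output port to one of this node's inputs."""
        self.inputs.append(upstream)
        return self

    def configure(self, **settings):
        self.settings = settings
        self.status = Status.YELLOW

    def execute(self):
        if self.status is Status.RED:
            raise RuntimeError(f"{self.name} must be configured first")
        # Execute upstream nodes on demand so that data flows downstream.
        for node in self.inputs:
            if node.status is not Status.GREEN:
                node.execute()
        data = [node.output for node in self.inputs]
        self.output = self.func(*data, **self.settings)
        self.status = Status.GREEN
        return self.output


# Hypothetical three-node workflow: read rows, filter them, count them.
reader = Node("Reader", lambda rows: list(rows))
row_filter = Node("Row Filter", lambda table, minimum: [r for r in table if r >= minimum])
counter = Node("Counter", lambda table: len(table))

row_filter.connect(reader)
counter.connect(row_filter)

reader.configure(rows=[5.1, 7.4, 6.0, 8.2])
row_filter.configure(minimum=6.0)
counter.configure()

result = counter.execute()
print(result, [n.status.name for n in (reader, row_filter, counter)])
```

Executing the last node pulls data through the whole chain, which mirrors the loose coupling of a workflow: each node knows only its own settings and the ports it is wired to.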

The particular beauty of KNIME for cheminformaticians is that a great many tools have been produced that allow standard procedures to be implemented without additional programming, e.g. converting chemical structures to computer-readable encodings. We regularly use the RDKit (e.g. [6]) nodes. Most nodes shown in the figure come with the vanilla-flavoured version of KNIME and/or the many free add-ons. One such is the ‘Tree Ensemble Learner’, which is from KNIME Labs. However, programmers can create nodes of arbitrarily complex function by ‘wrapping’ code in nodes that ‘understand’ (parse) one of a number of languages, such as Matlab, R, Perl and Python (native nodes use a freely available SDK and are written in Java). Thus the PLS regression node simply wraps a call to a standard R library, while the GP metanode wraps a fairly standard but detailed GP written (by SO’H) in Python. This metanode can easily be configured by its user. The chief disadvantage of this implementation is that one cannot see the GP running, but its progress can be recorded post hoc and exported (here to show fitness vs. time for training and validation sets). The final two windows show a list of available workflows (top left) and a list of frequently or recently used nodes.
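The idea of wrapping a script inside a node can be illustrated with a function that takes a table and returns a table. The sketch below is an assumption-laden illustration rather than KNIME's scripting interface: the function name `script_node`, the column names and the choice of computing pIC50 from an IC50 in nM are all hypothetical, and the table is modelled as a plain list of dictionaries.

```python
import math


def script_node(input_table):
    """Body of a hypothetical scripting node: table in, table out."""
    output_table = []
    for row in input_table:
        new_row = dict(row)  # leave the incoming table untouched
        # pIC50 = -log10(IC50 in mol/L); the input column is in nmol/L.
        new_row["pIC50"] = 9.0 - math.log10(row["ic50_nM"])
        output_table.append(new_row)
    return output_table


# Hypothetical input rows, standing in for data arriving at the input port.
table = [
    {"compound": "A", "ic50_nM": 10.0},
    {"compound": "B", "ic50_nM": 1000.0},
]
result_table = script_node(table)
for row in result_table:
    print(row["compound"], round(row["pIC50"], 2))
```

Because the script only sees a table at its input and hands a table to its output, it can be dropped between any two nodes whose columns match what it expects.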

To give an idea of speed, writing the GP metanode took a few days. Given this, however, assembling the workflow of Fig. 1 took just a couple of hours, and running it for 1000 GP generations with a population size of 200, including niching (the slow step), took only 20 min on a standard desktop PC.
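For readers unfamiliar with what such a GP run involves, the following is a deliberately tiny tree-based GP for symbolic regression in pure Python. It is not the GP wrapped in the metanode: the target function, the operator set, the parameter values and every function name are hypothetical, the run is scaled down to 30 generations of 60 individuals, and niching is omitted. It does record training and validation error per generation, in the spirit of the post hoc progress log mentioned above.

```python
import operator
import random

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def random_tree(rng, depth):
    """Grow a random expression over x and small constants."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.6 else rng.uniform(-2.0, 2.0)
    op = rng.choice(sorted(OPS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def evaluate(tree, x):
    if tree == "x":
        return x
    if isinstance(tree, float):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))


def depth(tree):
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(tree[1]), depth(tree[2]))


def subtrees(tree, path=()):
    """Yield the path to every node in the tree."""
    yield path
    if isinstance(tree, tuple):
        yield from subtrees(tree[1], path + (1,))
        yield from subtrees(tree[2], path + (2,))


def get(tree, path):
    for index in path:
        tree = tree[index]
    return tree


def replace(tree, path, new):
    """Return a copy of tree with the node at path swapped for new."""
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(parts)


def error(tree, data):
    """Mean squared error; lower is fitter."""
    return sum((evaluate(tree, x) - y) ** 2 for x, y in data) / len(data)


def evolve(train, valid, generations=30, pop_size=60, max_depth=5, seed=0):
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(pop_size)]
    history = []  # (generation, best training error, its validation error)
    for generation in range(generations):
        scored = sorted(((error(t, train), t) for t in population), key=lambda p: p[0])
        best_error, best = scored[0]
        history.append((generation, best_error, error(best, valid)))

        def pick():
            # Tournament selection: the fittest of three random individuals.
            return min(rng.sample(scored, 3), key=lambda p: p[0])[1]

        children = [best]  # elitism keeps the best individual unchanged
        while len(children) < pop_size:
            mother, father = pick(), pick()
            # Subtree crossover: graft a random piece of father into mother.
            site = rng.choice(list(subtrees(mother)))
            donor = get(father, rng.choice(list(subtrees(father))))
            child = replace(mother, site, donor)
            if rng.random() < 0.2:
                # Subtree mutation: swap a random piece for a fresh subtree.
                site = rng.choice(list(subtrees(child)))
                child = replace(child, site, random_tree(rng, 2))
            if depth(child) <= max_depth:  # discard oversized offspring
                children.append(child)
        population = children
    return best, history


def target(x):
    return x * x + x  # hypothetical function the GP should approximate


train = [(x / 10.0, target(x / 10.0)) for x in range(-10, 11, 2)]
valid = [(x / 10.0, target(x / 10.0)) for x in range(-9, 10, 2)]
best, history = evolve(train, valid)
for generation, train_err, valid_err in history[::10]:
    print(generation, round(train_err, 4), round(valid_err, 4))
```

The `history` list is the part that corresponds to a recorded progress log: it can be written out after the run and plotted as fitness against generation for both data sets.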

Where KNIME and related workflow systems come to the fore is in their ability to let ‘naïve’ users (re)create complex analyses just by reusing existing nodes or whole workflows, sometimes merely by changing file names. Thus some rather sophisticated workflows that compared the structures of ‘natural’ human metabolites with those of marketed drugs and other chemicals, outputting the analysis in the form of a 2D-biclustered heatmap [4, 5], were actually just a single workflow with simple filename changes. Given the base workflow, a novice could learn to make these changes in less than an hour, though of course the time spent learning to create new workflows can be almost limitless. There are also API links to commercial software such as the Spotfire visualisation system.
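The point about reuse through filename changes can be sketched as a fixed pipeline whose only parameter is the input path. The example below is a stand-in under stated assumptions, not one of the workflows cited above: `run_workflow`, the file names and the `activity` column are hypothetical, and the "analysis" is just a summary of one numeric column.

```python
import csv
import os
import statistics
import tempfile


def run_workflow(input_path, value_column="activity"):
    """Fixed pipeline: read a CSV file, drop blank values, summarise the rest."""
    with open(input_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    values = [float(row[value_column]) for row in rows if row[value_column].strip()]
    return {"rows": len(values), "mean": statistics.mean(values), "max": max(values)}


def write_csv(path, records):
    """Helper that creates small example input files."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["name", "activity"])
        writer.writerows(records)


with tempfile.TemporaryDirectory() as folder:
    drugs = os.path.join(folder, "drugs.csv")
    metabolites = os.path.join(folder, "metabolites.csv")
    write_csv(drugs, [("d1", 6.5), ("d2", 7.0), ("d3", 8.5)])
    write_csv(metabolites, [("m1", 4.0), ("m2", ""), ("m3", 5.0)])
    # The same workflow is reused; only the file name changes.
    drug_summary = run_workflow(drugs)
    metabolite_summary = run_workflow(metabolites)

print(drug_summary)
print(metabolite_summary)
```

Everything that requires expertise lives inside the pipeline, so a new user only has to point it at a different file.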

2 Conclusion

Overall, this is a very sophisticated and professional piece of software. Because of its flexibility, it is nowadays our chief cheminformatics workhorse, and voting with one’s feet is surely the best possible endorsement. The KNIME philosophy and business model of mixed commercial and free (but Open) software allows its continued improvement while making it freely available to desktop users. One minor gripe is that it seems only to read, not write, .xlsx files; we are confident that someone will soon write a node to let it do so. There is a substantial and steadily growing community of users, with many training schools and the like. Because of this, we think it will continue to grow in popularity. It is well worth a look for the GP community.

References

1. M.R. Berthold et al., KNIME: the Konstanz information miner, in Data analysis, machine learning and applications, ed. by C. Preisach, H. Burkhardt, L. Schmidt-Thieme, R. Decker (Springer, Berlin, 2008), pp. 319–326. doi:10.1007/978-3-540-78246-9_38
2. T. Kuhn, E.L. Willighagen, A. Zielesny, C. Steinbeck, CDK-Taverna: an open workflow environment for cheminformatics. BMC Bioinform. 11, 159 (2010). doi:10.1186/1471-2105-11-159
3. M.P. Mazanetz, R.J. Marmon, C.B.T. Reisser, I. Morao, Drug discovery applications for KNIME: an open source data mining platform. Curr. Top. Med. Chem. 12, 1965–1979 (2012). doi:10.2174/1568026611212180004
4. S. O’Hagan, D.B. Kell, Understanding the foundations of the structural similarities between marketed drugs and endogenous human metabolites. Front. Pharmacol. 6, 105 (2015). doi:10.3389/fphar.2015.00105
5. S. O’Hagan, N. Swainston, J. Handl, D.B. Kell, A ‘rule of 0.5’ for the metabolite-likeness of approved pharmaceutical drugs. Metabolomics 11, 323–339 (2015). doi:10.1007/s11306-014-0733-z
6. S. Riniker, G.A. Landrum, Open-source platform to benchmark fingerprints for ligand-based virtual screening. J. Cheminform. 5, 26 (2013). doi:10.1186/1758-2946-5-26
7. K. Wolstencroft et al., The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucl. Acids Res. 41, W557–W561 (2013). doi:10.1093/nar/gkt328

Acknowledgments

We thank the Biotechnology and Biological Sciences Research Council (BBSRC) for financial support under Grant BB/M017702/1. This is a contribution from the Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM).

Author information

Authors and Affiliations

School of Chemistry, The University of Manchester, 131 Princess St, Manchester, M1 7DN, UK
Steve O’Hagan & Douglas B. Kell
The Manchester Institute of Biotechnology, The University of Manchester, 131 Princess St, Manchester, M1 7DN, UK
Steve O’Hagan & Douglas B. Kell

Corresponding author

Correspondence to Douglas B. Kell.

About this article

Cite this article

O’Hagan, S., Kell, D.B. Software review: the KNIME workflow environment and its applications in genetic programming and machine learning. Genet Program Evolvable Mach 16, 387–391 (2015). https://doi.org/10.1007/s10710-015-9247-3

Issue Date: September 2015
DOI: https://doi.org/10.1007/s10710-015-9247-3
