Saturday, December 21, 2013

John May released CDK 1.5.4

Short post, but with big implications. John released CDK 1.5.4 which mostly consists of great work on his side on various important corners of the CDK. Make sure to read his blog, and some highlights:

Everyone who is still using a version from the CDK 1.4 series should really consider switching over (just read the 1.5.2, 1.5.3, and 1.5.4 changelogs for the reasons why). It will make your product faster and more functional.

My contribution to this release mostly consists of the new Isotopes work. As John nicely summarized:

Monday, December 09, 2013

Programming in the Life Sciences #16: Open PHACTS LDA usage

Hot on the heels of the announcement that the Open PHACTS LDA was hit over 8M times, here are some usage statistics of the Programming in the Life Sciences course. Basically consisting of six practical days (with expected reading at home), we see a spiked pattern:

You can always hope that students continue their work at home, and some do:

And some students wait until the very end with finishing their work:

Thanks for Paul Groth for suggesting It's not Open, I think, but functionality like that is very useful indeed.

Sunday, December 08, 2013

Programming in the Life Sciences #15: Sixth project screenshot

After five earlier screenshots, here follows a sixth one. That sums up all the projects I have received at this moment for the Programming in the Life Sciences course. The team of Sam and Oskar decided to use a tree layout to show the pathways in which a selected compound is found, and a subset of targets and compounds for that pathway. To deal with all the asynchronous aspects, they first aggregate all data (from four different API calls), and show the progress and the amount of data found:

And once that data has been collected, the user can create the tree, which will then show something like this for citric acid:

Saturday, December 07, 2013

SPARQL endpoint uptime

Ever since I upgraded to Virtuoso OS 7.0 I have had trouble keeping the SPARQL endpoint at Uppsala University for the ChEMBL-RDF v13 data online (see doi:10.1186/1758-2946-5-23; ChEMBL is CC-BY-SA by the Overington team). It seems I am not getting all the settings right, or not as right as for VOS6 which ran for more than two years without the same downtime issues.

Pierre-Yves Vandenbussche operated a cool uptime service that monitored a SPARQL endpoint at set times and reported on this via a webpage and an RSS feed. This project has now found a new home with the Open Knowledge Foundation, and renewed development by Pierre-Yves:

It looks great, and I am looking forward to a reinstated RSS feature in a later version. Besides uptime, the new service also reports on SPARQL 1.0 and 1.1 support, though I am not sure how the testing works, because I cannot imagine VOS7 does not support the full SPARQL 1.1. This is the report for my SPARQL endpoint for ChEMBL-RDF:

Programming in the Life Sciences #14: Two more projects

Two more projects were handed in for Programming in the Life Sciences late last night (see also the first three). Roberto developed a web page where you can enter a search term after which it will search targets based on that term, count the number of pharmacological data, and when selecting a target, it will summarize the IC50 values, pCHEMBL values, and molecular weights, like in this screenshot:

Anniek and Darja had a really interesting idea: start with Alzheimer and find possible drug targets, and with the Open PHACTS 1.3 API that should be possible with, e.g., the Alzheimer pathway in WikiPathways. However, while they got Ensembl identifiers for the targets in the pathway, after struggling for half a day, they could not find any pharmacology data. It turned out that the mappings between the Ensembl and UniProt IDs were not to be found in the 1.3 cache (which still is the case). So, in the end, I created a JSON file with identifier mappings for them to look up the mappings in. They ended up visualizing the results in an HTML table, where the target names were dynamically added to the table's "Coding Protein" column, addressing the asynchronous calling of the Open PHACTS APIs:

Friday, December 06, 2013

Programming in the Life Sciences #13: Another screenshot

I got one more source code zip file from the Maastricht Science Programme students (see also the first two screenshots). Vincent and Błażej extended the d3.js tree view, showing classification information from ChEBI (they also submitted three patches to the Open PHACTS ops.js):

Programming in the Life Sciences #12: First screenshots

Yesterday was the last Programming in the Life Sciences practical day, and the 2nd and 3rd year B.Sc. MSC students presented their results yesterday afternoon. I am impressed with the results that they reached in only six practical days. I have suggested that they upload the presentations to SlideShare or FigShare (with the advantage that you get a DOI), and asked them to send me their tools. Below are some screenshots.

The first app is by Tim and Taís; it looks up activities from the Open PHACTS platform and filters them for activities related to a set of five anti-oxidants (see also their FigShare):

The next app is by Janneke and Lukas and uses the Open PHACTS API to report on single protein targets for the compound the user enters (see also their SlideShare):

More apps will follow soon.

Friday, November 08, 2013

Looking for a PhD and a Postdoc to work on Open Science Nanosafety

I am happy that I got my first research grant awarded (EU FP7), which should start after all the contracts are signed etc, somewhere early 2014. The project is about setting up data needs for the analysis of nanosafety studies. And for this, I have the below two position vacancies available now. If you are keen on doing Open Science (CDK, Bioclipse, OpenTox, WikiPathways, ..., ...), working within the European NanoSafety Cluster, and have an affinity with understanding the systems biology of nanomaterials, then you may be interested in applying. Click the screenshots for full details.

PhD position

Postdoc position

Wednesday, October 30, 2013

Programming in the Life Sciences #11: HTML

HTML (HyperText Markup Language), the language of the web, is no longer the only language of the web. But it still is the primary language in which the source code of webpages is shared. Originally, HTML pages were always static: the only HTML source of a web page was what was downloaded from a website. Nowadays, much of the HTML that is visualized in your web browser is generated on the fly with JavaScript. In fact, that is exactly what you will learn to do in this course.

HTML has many dialects, and HTML5 is the upcoming next version. The features have become so extensive that we will not cover half of them; instead, we will stick to the bare minimum needed. But even at a minimum, writing a web page with HTML code is basically writing source code. The compiled version is the view of the webpage your web browser shows you. One important difference is that HTML is much more like a data model representation than it is like computational instructions. That is, rather than saying things like put("String", xCoord, yCoord), we define what is to be shown and in what order with general instructions. Well, in pure HTML that is. Cascading Style Sheets (CSS) are quite outside the scope of this course.

A minimal HTML page looks like:

  <html><body>Hello world!</body></html>

When we think about this structure, we notice that it is not unlike the key-value maps we covered earlier. For example, compare it to this JSON:

  { "html": {
    "body": { "value": "Hello world!" }
  } }

Even if we introduce HTML attributes:

  <h1><a name="hello">Hello world!</a></h1>

The JSON equivalent would be:

  { "html": {
    "h1": { "a": { "name": "hello", "value": "Hello world!" } }
  } }

So, while these are quite different languages from programming languages, we can clearly see they have been made up by the same (computer science) people. In my opinion, this is an advantage: we only need to learn the underlying patterns and can then much more easily switch between different languages.

Now, returning to the HTML example, we introduce a bit of terminology. Let's start with the last example:

<h1><a name="hello">Hello world!</a></h1>

This HTML code example shows the <h1> element which has one child element <a>. This child element has an attribute @name. Elements can contain string content, such as the <a> element has, and one or more child elements (and any combination of that). Attributes can only have string content. The HTML specification defines in detail which elements can be child elements of other elements. For example, the <head> element can only be a child element of <html>. Similarly, each HTML element can only have specific attributes, though some attributes can be attached to any element.
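To make the terminology concrete, here is a minimal sketch that stores an element as a JavaScript map and turns it back into an HTML string. The field names tag, attributes, and children, and the function renderElement, are made up for this illustration; they are not part of any HTML library, and the sketch does not escape special characters.

```javascript
// Hypothetical data model: an element has a tag name, a map of
// attributes (string values only), and a list of children, where a
// child is either a string or another element.
var heading = {
  tag: "h1",
  attributes: {},
  children: [
    { tag: "a", attributes: { name: "hello" }, children: ["Hello world!"] }
  ]
};

// Recursively serialize an element into an HTML string.
function renderElement(element) {
  if (typeof element === "string") return element; // string content
  var html = "<" + element.tag;
  for (var key in element.attributes) {
    html += " " + key + "=\"" + element.attributes[key] + "\"";
  }
  html += ">";
  for (var i = 0; i < element.children.length; i++) {
    html += renderElement(element.children[i]);
  }
  return html + "</" + element.tag + ">";
}

var rendered = renderElement(heading);
console.log(rendered); // <h1><a name="hello">Hello world!</a></h1>
```

The recursion mirrors the definition: an element contains strings and child elements, and each child is rendered the same way as its parent.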

There is plenty of documentation on the web, but there are also tools that can help us write HTML. For example, the This website detects errors in your HTML code, and is quite helpful if you are new to editing HTML, as well as useful if you have a lot of HTML experience.

HTML elements you may find useful include the following:
  • <h1>, <h2>, ..., <h5>: these are headers and can be used to make sections
  • <p>: indicates a paragraph
  • <div id="someID">: indicates a section of text. The content of any element with an id attribute can be replaced by any appropriate HTML content with JavaScript
  • <a href="http://...">some link</a>: this is used to make hyperlinks, href means hyperlink reference
  • <a name="mark1">some text</a>: this is used to create bookmarks, which you can link to with <a href="#mark1">jump to section Mark 1</a>
  • <script>: used to include JavaScript code in your HTML page
  • <head>: this HTML blob contains metadata, a list of libraries to be loaded, but also JavaScript which is executed before the HTML <body> is processed
  • <body>: this contains the HTML that is depicted in your browser window
Keep the HTML simple; the programming is more important.

Exercise: below is part of the HTML/JavaScript source code behind this app. Please indicate which lines are HTML source code, and what is JavaScript.

<!DOCTYPE HTML PUBLIC
  "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<!--
Copyright (c) 2013  Egon Willighagen <>

 Permission is hereby granted, free of charge, to any person
 obtaining a copy of this software and associated documentation
 files (the "Software"), to deal in the Software without
 restriction, including without limitation the rights to use,
 copy, modify, merge, publish, distribute, sublicense, and/or sell
 copies of the Software, and to permit persons to whom the
 Software is furnished to do so, subject to the following
 conditions:

 The above copyright notice and this permission notice shall be
 included in all copies or substantial portions of the Software.
-->

  <title>OpenPHACTS Jasmine Spec Runner</title>
  <script src="lib/jquery-1.9.1.min.js"></script>
  <script type="text/javascript" src="lib/purl.js"></script>

  <!-- include source files here... -->
  <script type="text/javascript" src="src/OPS.js"></script>
  <script type="text/javascript" src="src/ConceptWikiSearch.js"></script>

  <!-- setup -->
  <script type="text/javascript">
  // get the app_key and app_id from the webpage call
  // (assumed here: they are read from the query string of the page URL)
var prmstr = window.location.search.substr(1);
var prmarr = prmstr.split ("&");
var params = {};
for ( var i = 0; i < prmarr.length; i++) {
    var tmparr = prmarr[i].split("=");
    params[tmparr[0]] = tmparr[1];
}
  </script>
</head>
<body>
  <h3>Search Results</h3>
  <p><div id="table"></div></p>
  <h3>Compound Details</h3>
  <p><div id="details"></div></p>
  <h3>JSON reply</h3>
  <p><div id="json">Nothing yet</div></p>
  <script type="text/javascript">
var searcher = new Openphacts.ConceptWikiSearch(
  params["app_id"], params["app_key"]
);
var callback = function(success, status, response){
  document.getElementById("json").innerHTML = JSON.stringify(response);
  var html = "<table>";
  for (var i=0; i<response.length; i++) {
    html += "<tr>";
    html += "<td>";
    html += "Name: <span>" +
      response[i].prefLabel +
      "</span>";
    html += "</td>";
    html += "</tr>";
  }
  html += "</table>";
  document.getElementById("table").innerHTML = html;
};
searcher.findCompounds(
  'Aspirin', '5', '4',
  callback
);
  </script>
</body>
</html>

Programming in the Life Sciences #10: JavaScript Object Notation (JSON)

As said, JSON is the format we will use as serialization format for answers given by the Open PHACTS LDA. The API actually supports XML, RDF, HTML, and TSV too, but I think JSON is a good balance between expressiveness and compactness. Moreover, and perhaps a much better argument, JSON works very well in a JavaScript environment: it is very easy to convert the serialization into a data model:

var jsonData = JSON.parse(jsonString);

Now, we previously covered maps. Maps have keys and values: the keys unlock a particular value. For example, take this JavaScript:

var map = { "key": "value", "key2": "value2" };

We define here a key-value object, and we can access the two values with the two keys:

map["key2"]; // == value2

These examples are JavaScript source code, not strings. The content of the map variable is a data structure. But when we communicate with a web service, we need a (string) serialization of the data model: we cannot send around memory pointers (which a variable is), because they are only valid on a single machine.

This is where the JSON format comes in. We can convert the content of the above map variable into a string representation with this code:

var mapStringified = JSON.stringify(map);

which gives us the following output:

{"key":"value","key2":"value2"}

This string looks an awful lot like the JavaScript code we wrote earlier.

And, likewise we can convert the JSON string back into a JavaScript data model again, with:

var mapAgain = JSON.parse(mapStringified);
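Putting the two calls together gives a full round trip. This small sketch only uses the standard JSON.stringify and JSON.parse functions; the variable names are the ones used above.

```javascript
var map = { "key": "value", "key2": "value2" };

// Serialize to a string, as it would be sent over the network ...
var mapStringified = JSON.stringify(map);
// ... and parse it back into a data structure on the receiving side.
var mapAgain = JSON.parse(mapStringified);

console.log(typeof mapStringified); // "string"
console.log(mapAgain["key2"]);      // "value2"
// The copy has the same content, but it is a different object in memory.
console.log(mapAgain === map);      // false
```

The last line illustrates the point about memory pointers: what survives the trip is the content, not the variable.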

Now, I did warn you earlier that values can be lists and maps themselves again, so consider this JSON example from Wikipedia:

{
    "id": 1,
    "name": "Foo",
    "price": 123,
    "tags": [ "Bar", "Eek" ],
    "stock": {
        "warehouse": 300,
        "retail": 20
    }
}

Here we see that the value behind the stock key is another map, and the value behind the tags key is a list. This creates a quite flexible serialization format, which is happily used by Open PHACTS. (And for the semantic web readers, yes, we can make JSON more semantic. The Open PHACTS LDA supports a "rdfjson" format.)
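Once parsed, such nested values are reached by chaining keys and indices. The sketch below parses the same example data; the variable names are chosen for this illustration only.

```javascript
var jsonString = '{ "id": 1, "name": "Foo", "price": 123,' +
  ' "tags": [ "Bar", "Eek" ],' +
  ' "stock": { "warehouse": 300, "retail": 20 } }';
var product = JSON.parse(jsonString);

// A list is accessed by index, a map by key.
var firstTag = product["tags"][0];               // "Bar"
var inWarehouse = product["stock"]["warehouse"]; // 300

// Sum the stock over all places in the nested map.
var totalStock = 0;
for (var place in product["stock"]) {
  totalStock = totalStock + product["stock"][place];
}
console.log(firstTag, inWarehouse, totalStock); // Bar 300 320
```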

Programming in the Life Sciences #9: APIs and Web Services

Continuing on the theory covered in this course, this part will talk about application programming interfaces (APIs) and web services.

Application Programming Interfaces
APIs define how programs can be used by other programs. An API defines how methods are called and what feedback you can expect. It basically is the combination of documentation and the program itself. But, unlike other parts of the software, an API is aimed at outside users, rather than at use within the same program. The API is how you communicate between programs.

Now, in this course we will see two key types of APIs. The first are the APIs provided by the libraries that we use. For example, we already indicated that we will be using at least the following two libraries, ops.js and d3.js. These libraries are collections of functional bits (e.g. classes and methods). For example, ops.js defines an API which closely wraps the Open PHACTS Linked Data API (LDA) itself. The API requires us to do a few things: 1. create a wrapper for the LDA; 2. define a callback function; 3. invoke the actual call.

var searcher = new Openphacts.ConceptWikiSearch(
  "", appID, appKey
);
var callback=function(success, status, response){
  // handle the response here
};
searcher.findCompounds('Aspirin', '20', '4', callback);
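The three steps can be tried out without network access by using a stand-in for the real wrapper. Everything named here (FakeSearch, its fixed reply, the 200 status) is hypothetical and only mimics the shape of the call above; the real ops.js performs an HTTP request and calls back later, when the reply arrives. The meaning of the second and third arguments is not spelled out in the text, so the stand-in just accepts them.

```javascript
// Hypothetical stand-in for a search wrapper; it does not contact any server.
function FakeSearch(baseURL, appID, appKey) {
  this.baseURL = baseURL;
  this.appID = appID;
  this.appKey = appKey;
}
// Mimics the four-argument call: instead of returning a result,
// it hands the result to the callback function.
FakeSearch.prototype.findCompounds = function(query, limit, other, callback) {
  var response = [ { prefLabel: query } ]; // made-up reply
  callback(true, 200, response);
};

// 1. create the wrapper
var searcher = new FakeSearch("", "myAppID", "myAppKey");
// 2. define the callback function
var labels = [];
var callback = function(success, status, response) {
  for (var i = 0; i < response.length; i++) {
    labels.push(response[i].prefLabel);
  }
};
// 3. invoke the actual call
searcher.findCompounds('Aspirin', '20', '4', callback);
console.log(labels); // [ 'Aspirin' ]
```

The point of the callback is that the caller does not wait for a return value; it says in advance what should happen with the answer.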

Web Services
Web services are a special kind of API: they expose an API over the web. That imposes some features on these APIs: first, they are based on a web transport layer, commonly HTTP, but XMPP is possible too. HTTP is used by your web browser too. Secondly, the web server needs a common communication language to serialize the method call. Here, two key standards are used, XML and JSON. We will cover these in more detail later. For now, it suffices to think of these as envelopes in which our message is sent. Another standardized aspect is how to call the web services. For that, SOAP and REST are the most important standards for the life sciences (though I think Wagener's XMPP approach is still worth checking out!). SOAP and REST use XML and JSON as underlying serialization formats.
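As a rough sketch of what a client library does under the hood for a REST-style call: the method call is serialized into a URL, and the reply comes back as a string that is parsed. The service address, the parameter names, and the reply below are all invented for this illustration, and no actual HTTP request is made.

```javascript
// Hypothetical service address and parameter names, for illustration only.
var baseURL = "https://api.example.org/search";
var parameters = { q: "citric acid", limit: "5", _format: "json" };

// Serialize the method call into a URL: key=value pairs joined by "&".
function buildURL(base, params) {
  var pairs = [];
  for (var key in params) {
    pairs.push(encodeURIComponent(key) + "=" + encodeURIComponent(params[key]));
  }
  return base + "?" + pairs.join("&");
}

var url = buildURL(baseURL, parameters);
console.log(url);
// https://api.example.org/search?q=citric%20acid&limit=5&_format=json

// The reply travels back as a string, here a made-up JSON "envelope".
var reply = '{ "result": [ { "label": "citric acid" } ] }';
var data = JSON.parse(reply);
console.log(data.result[0].label); // citric acid
```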

So, web services are theoretically complex. For this course, most of it is hidden by the client library that will take care of the HTTP and SOAP/REST layers. The students who wish to use Java instead of JavaScript, will face the problem that you first need to find a Java client library for the LDA. There is this library, but that needs exploring for use with the latest Open PHACTS LDA. Higher stakes, higher rewards.

Take home message
Practically, you do not need to know much of the technologies behind web services, just like you do not need to know machine instructions CPUs follow to run your program. But, it is important to have seen these terms. You will run into them, and need enough context to know where and how to find answers to the questions that you will have.

There is one exception: JavaScript Object Notation, JSON. That is the format in which the data is returned by the service, and you will have to handle that. JSON will be the topic of the next post.

Tuesday, October 29, 2013

Programming in the Life Sciences #8: coding standards

Never underestimate the power of lack of coding standards in code obfuscation. Just try randomly to read code you wrote a year ago or four years ago. You'll be surprised with what you find. Coding standards are like the grammar in writing: they ensure that our message gets understood. Of course, the primary goal is that the CPU understands what you mean, but because programming languages are not your native language, you may not always say what you think you are saying.

Copyright and Licensing
The first standard is attribution: if you use the solution of someone else, you write in your source code who wrote the solution. Secondly, you must allow others to do the same. Therefore, you always add your name (and normally email address) to your source code, and under what conditions people may use your code. This is commonly done by assigning a license. Open Source licenses promote (scientific) collaboration, and give others the rights to use your solution, redistribute modifications, etc. They may explicitly require attribution, but often do not. In a scholarly setting, you always give attribution, even if not required by the license. Remember that software falls under copyright but algorithms typically do not. Copyright/author and license information is typically added to source code using a header.

The second thing is to document what your code is supposed to do, what assumptions are made, how people should use it, and preferably under what conditions it will fail. Comments in your source are just as much documentation as a tutorial in Word format. They are complementary, and documentation must not only be targeted at users, but also at yourself so that you understand why you added that weird check. You will not (have to) remember in two years.
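A minimal sketch of what a header and a documented function could look like. The author name, email address, and the function gcFraction are placeholders invented for this example, not something prescribed by the course.

```javascript
/* Copyright (c) 2013  A. Student <a.student@example.org>
 *
 * Hypothetical header: name, email address, and license are placeholders.
 * Released under the MIT license.
 */

/**
 * Calculates the fraction of G and C nucleotides in a DNA sequence.
 *
 * Assumes an upper case string with only the letters A, C, G, and T.
 * Fails (throws an Error) for an empty sequence, because the fraction
 * is undefined when there are no nucleotides.
 */
var gcFraction = function(dnaSequence) {
  if (dnaSequence.length == 0) throw new Error("Empty DNA sequence");
  var gcCount = 0;
  for (var i = 0; i < dnaSequence.length; i++) {
    var nucleotide = dnaSequence[i];
    if (nucleotide == "G" || nucleotide == "C") gcCount = gcCount + 1;
  }
  return gcCount / dnaSequence.length;
};

console.log(gcFraction("AGCT")); // 0.5
```

The comment states what the function does, what it assumes, and when it fails, which is exactly what you will want to know in two years.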

Coding standards
Just like English has coding standards, programming languages have them too. Both also have styles, and the selection of a style is up to the author, but consistency is important. Coding standards you should be thinking about include consistent use of variable and method names, keeping code blocks small, etc. For example, compare the following two code examples which do the same thing:

var method = function(string) {
  var number = 0
  for (var i=0; i<string.length; i++) {
    if (string[i] == "A") number = number +1 
  }
  return number
}

And this version:

var countTheANucleotides = function(dnaSequence) {
  var count = 0
  // iterate over all nucleotides in the DNA string
  for (var i=0; i<dnaSequence.length; i++) {
    if (dnaSequence[i] == "A") count = count +1 
  }
  return count
}

Which one do you find easier to understand the function of?

The exact coding standards differ from one project to another, very much like British journals tend to prefer to criticise while American journals prefer to criticize your manuscript. But there are some ground rules:
  1. use clear, descriptive variable and method names
  2. use source code comments to describe the intention of source code
  3. keep source code lines short enough that you can read the full line without (horizontal) scrolling
  4. keep code blocks short enough that they fit a single screen (say, 25 lines max)

Unit testing
It is important to realize that what you intend the computer to calculate can be something different than what your source code actually tells the computer to do. Even more important is to realize that it is not always your fault if the calculation goes wrong; in particular, the input you pass to some program can always be crafted such that it will fool your code into doing unintended things.

But a common cause of misbehaving code is the author. At first (and many, many times after that) it's just getting the code to compile: missing semi-colons, typos in variable names, etc. After a bit, and hunting you down to your grave, come bugs caused by unintuitive features of programming languages, libraries you're using, etc. Common (and often expensive) mistakes include for-loops missing the first or the last element, incorrect conversion of units (125 M$ expensive!), etc.

Fortunately, we can call in the help of computers for this too. We have code checking tools and, importantly, libraries to help us define (unit) tests. These tests call running code, and check if the calculated results match our expectations. For JavaScript we could use the MIT-licensed QUnit. For example, we could write the following tests (in QUnit):

test( "counting tests", function() {
  equal(1, countTheANucleotides("AGCT"));
  equal(4, countTheANucleotides("AAAA"));
  equal(0, countTheANucleotides("GCGC"));
});
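To see such tests catch one of the loop mistakes mentioned above, here is a self-contained sketch that does not need a test library. The function countBuggy is wrong on purpose, and check is a made-up, minimal stand-in for what a library like QUnit does for you.

```javascript
// Buggy on purpose: the loop starts at 1 and so skips the first nucleotide.
var countBuggy = function(dnaSequence) {
  var count = 0;
  for (var i = 1; i < dnaSequence.length; i++) {
    if (dnaSequence[i] == "A") count = count + 1;
  }
  return count;
};

// Minimal stand-in for a test library: compares a result with the
// expectation and records a readable message.
var report = [];
var check = function(description, actual, expected) {
  var verdict = (actual === expected) ? "ok" : "FAILED";
  report.push(verdict + ": " + description +
    " (expected " + expected + ", got " + actual + ")");
};

check("A at the start", countBuggy("AGCT"), 1);
check("only A", countBuggy("AAAA"), 4);
check("no A", countBuggy("GCGC"), 0);
console.log(report.join("\n"));
```

The first two checks fail and the third passes, which shows why a single test is rarely enough: the sequence without any A cannot reveal that the first position is skipped.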

OK, you get the idea. That other scientists really start to care about these things, is shown by these two recent papers:

Prlić, A., Procter, J. B., Dec. 2012. Ten simple rules for the open development of scientific software. PLoS Comput Biol 8 (12), e1002802+.
Sandve, G. K., Nekrutenko, A., Taylor, J., Hovig, E., Oct. 2013. Ten simple rules for reproducible computational research. PLoS Comput Biol 9 (10), e1003285+.

Friday, October 25, 2013

Isotopes: my very first Android App hits the app store

Two weeks ago I hacked up my very first Android app. It basically exposes the Blue Obelisk Data Repository's isotope data on your device, using the Chemistry Development Kit. Nothing more, nothing less. But as the saying goes, every journey starts with a small step. Now, today a second step was made, and my Isotopes app is now available from the F-Droid app store!

Wednesday, October 23, 2013

Programming in the Life Sciences #7: theory

No course without some good theory. In this six-day course, I plan to cover this computing theory. It's very practice oriented:

That should give them enough head start to work on something like this. The material will be more extensive, but I'll give myself a head start, with some initial content.

Programming in the life sciences is done to solve problems in the life sciences, but only problems that can be solved with pen and paper too. Programming cannot measure metabolites in a cell. For that, you need equipment that passes what it measured as input data to the computer.

Instead, the program defines some computation that is done on the computer. For example, noise reduction, DNA/RNA/protein sequence alignment, metabolite identification, etc. But all computation starts with input data.

The program tells the computer what it should do, step by step. Get the data from the LC/MS; find peaks; group peaks at the same retention time; match that against a metabolite spectral database; determine the best match; report the best three matches to the user via the screen. Step by step.

The computer consists of input/output devices (to get data; to present results), various kinds of memory (to remember things), and a central processing unit (CPU) that performs the computation steps.

Considering all this, programming is to define what the computer should do, in a (programming) language that the computer understands. Note that I say "the computer understands" rather than "the CPU understands". The CPU only speaks one language (machine instructions). But we use a higher level programming language, which is much more compact and easier to read and understand. A compiler translates this higher level language into machine instructions (sometimes several compilers are involved).

Data Types
The programming language says do this, do that. It does not know about data. Fortunately, it knows about bits, and bits we can use to store data. That way, we can instruct the CPU to do things like: OK, take the measured LC/MS data, take the MS at retention time 5, then start with the first m/z value, and if that is larger than 10, then... etc. We do not want to hard code the data in our program, so we instruct the CPU to remember it. The computer has various levels of memory that are relevant (ignoring those at a CPU level!): variables stored in the working memory, and data stored on external memory (hard disk, USB disk, LC/MS machine).

Exercise: write a program that counts the sum of all numbers starting with 1 up to 50 without using variables.

Some programming languages have variable types: this variable is a non-integer number, that variable is a text string. This ensures that you cannot sum up "cat" with 5.3. This is called variable typing. Some programming languages have hard typing (types are defined in the source code), while others have dynamic typing (the program figures it out when it is compiling), and some even have no typing at all (the computer will complain when it runs).

Example basic variable types include: strings, integers, floats, and booleans. Strings can be used to remember names; integers are needed for counts and iterations (how many m/z values did I already look at again??), and floats are needed for pretty much all scientific data. A boolean is a yes/no type, or true/false.
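In JavaScript you can ask for the type of a value with the typeof operator. A small sketch with made-up values; note that JavaScript does not distinguish integers from floats and reports both as "number".

```javascript
var compoundName = "citric acid"; // string
var peakCount = 12;               // integer
var retentionTime = 5.37;         // float
var isMetabolite = true;          // boolean

console.log(typeof compoundName);  // "string"
console.log(typeof peakCount);     // "number"
console.log(typeof retentionTime); // "number"
console.log(typeof isMetabolite);  // "boolean"

// JavaScript has one number type; whole numbers can still be told apart:
console.log(peakCount % 1 === 0);     // true
console.log(retentionTime % 1 === 0); // false
```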

Also, variables do not have units. Remember those high school days? "John, six *WHAT*??", "Umm, six mole, sir." Variables do not have units. Thus while you cannot calculate the sum of "cat" and 5.3, a computer has no problem calculating the sum of six mole and three days.
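The sketch below shows the problem, and one possible way around it: keeping the unit next to the value. The map layout with value and unit keys and the function addWithUnits are assumptions made for this illustration, not a standard JavaScript feature.

```javascript
var amount = 6;   // six mole
var duration = 3; // three days
console.log(amount + duration); // 9, but nine of what?

// One possible way out: keep the unit next to the value ...
var amountWithUnit = { value: 6, unit: "mol" };
var durationWithUnit = { value: 3, unit: "day" };

// ... and refuse to add values with different units.
function addWithUnits(first, second) {
  if (first.unit !== second.unit) {
    throw new Error("Cannot add " + first.unit + " and " + second.unit);
  }
  return { value: first.value + second.value, unit: first.unit };
}

var total = addWithUnits(amountWithUnit, { value: 2, unit: "mol" });
console.log(total.value + " " + total.unit); // 8 mol
```

Calling addWithUnits(amountWithUnit, durationWithUnit) throws an error instead of silently returning 9.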

Complex Types

Exercise: What variable type would you use for that photo you took last week of that western blot?

It is clear that these basic types don't suffice. This touches on the topic of computer representation. How does a computer keep a western blot in memory? That photo you took with your Android digitized the western blot into a matrix of numbers: if it was a greyscale photo, then a single integer per position.

Programming languages have various complex types, and most even support the definition of more complicated data structures. But the more basic complex types first: the list. A list, vector, or array all refer to the same concept: a list of variables, typically of the same type. For example, a mathematical vector is a list of floats (written float[] in languages such as Java, where the [] refers to the list or array nature). A string, actually, which we marked as a "basic" variable type, is really a complex one too: it is a list of characters. That is, the string "cat" is a list of three characters. Importantly, each item in the list has an index, and the full list has a length. Depending on the programming language, the first item in the list has index 0 or 1.
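In JavaScript, lists are written with square brackets and the first item has index 0. A short sketch with made-up numbers:

```javascript
var mzValues = [ 55.05, 101.02, 129.02 ]; // made-up numbers
console.log(mzValues.length);               // 3
console.log(mzValues[0]);                   // 55.05, the first item
console.log(mzValues[mzValues.length - 1]); // 129.02, the last item

// A string behaves like a list of characters.
var word = "cat";
console.log(word.length);    // 3
console.log(word[0]);        // "c"
console.log(word.split("")); // [ 'c', 'a', 't' ]
```

Because counting starts at 0, the last item sits at index length - 1, which is a classic source of loops that miss the first or the last element.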

As said, a list typically contains variables of the same type, just because it is easier to work with. But the list can contain complex types too. For example, we can create a list of lists (we would write float[][]). Each element in the top list is a list again; that is, the first element of the outer list is again a list. This very closely matches the mathematical matrix.
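As an illustration, a tiny greyscale "photo" can be written as a list of lists, with one integer per position. The numbers are made up; a real photo would have far more rows and columns.

```javascript
// A made-up 3x4 greyscale "photo": 0 is black, 255 is white.
var image = [
  [ 255, 250, 252, 255 ],
  [ 240,  30,  25, 245 ],
  [ 255, 251, 249, 255 ]
];

// Walk over all rows, and over all columns in each row,
// to find the darkest position.
var darkest = 255;
for (var row = 0; row < image.length; row++) {
  for (var column = 0; column < image[row].length; column++) {
    if (image[row][column] < darkest) darkest = image[row][column];
  }
}
console.log(image[1][2]); // 25: second row, third column
console.log(darkest);     // 25
```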

A second complex type important in this course is the map. A map is basically a list of key-value pairs, where the keys take the role of the index in lists. Instead of asking for the list item with index 7, we ask for the value behind a certain key. And, like we could make a list of lists, we can also make a map of maps, etc. Keep this in mind! We will use this extensively in this course.
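A small sketch of a map of maps in JavaScript. The layout of the example data (the keys formula and targets) is chosen for this illustration and does not reflect any particular web service.

```javascript
// A map with compound names as keys, where each value is again a map.
var compounds = {
  "aspirin": { "formula": "C9H8O4", "targets": [ "COX-1", "COX-2" ] },
  "citric acid": { "formula": "C6H8O7", "targets": [] }
};

console.log(compounds["aspirin"]["formula"]);        // C9H8O4
console.log(compounds["aspirin"]["targets"].length); // 2

// The keys take the role of the index when looping.
var names = [];
for (var key in compounds) {
  names.push(key);
}
console.log(names); // [ 'aspirin', 'citric acid' ]
```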

Now that we know how the CPU uses memory, we turn back to what the processor must do, according to our program. First, I mentioned the step by step at the start. This is critical: the processor has a linear progression through the steps it must do. It can only go forward, and only step by step. It cannot go back. Yet, that is exactly what we write in a for-loop, like in this four-line JavaScript example:

var sum = 0;
for (var i=1; i<50; i=i+1) {
  sum = sum + i;
}

This code defines the variable sum in the first line, and then starts counting from 1 up to, but not including, 50, one by one, adding each number to the sum. This loop is only for our convenience. This is how the computer will run this program (and at a CPU machine instruction level it's even longer):

var sum = 0;
var i=1;
sum = sum + i;
i = i + 1;
sum = sum + i;
i = i + 1;
sum = sum + i;
i = i + 1;
sum = sum + i;
// ...

OK, I won't give the full sequence of steps the computer takes. I guess you can see the virtues of higher level programming languages :) Importantly, it is a linear list of steps it takes.

Another important control structure in programming languages is the if-statement. This gives us the power of making decisions. For example, we can skip the 7 in the above summation:

var sum = 0;
for (var i=1; i<50; i=i+1) {
  if (i == 7) {
    // skip the 7: do nothing
  } else {
     sum = sum + i;
  }
}

But I have not yet discussed another important concept: the operator. The operator tells the computer what operation to perform, and how. This last source code example uses various operators: =, <, +, and ==. The first is an assignment operator: it assigns the value '0' to the variable sum. This operation does not return anything. The < operator compares two variable values, or a variable value with a specific value. For example, the above code compares the value behind the 'i' variable with 50; indeed, it does not compare 50 with "i", which is the variable name. The + operator follows the mathematical + operator for floats and integers; for strings the + operator performs a concatenation: "cat" + "fish" is not one less fish, but a "catfish". Note that these two operators, < and +, return a new value. The < returns a boolean (yes, it's smaller; no, it's not smaller); the + returns an integer if it was summing integers, or a string when it concatenated two strings. The == operator also returns a boolean: true if the two variables are the same (in general). During the course, we will see several more operators. Look out for them!
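The operators can be tried out one by one. The last two lines of this sketch show where the "in general" matters in JavaScript: == converts types before comparing, while the stricter === operator (not used in the examples above) does not.

```javascript
var sum = 0; // = assigns the value 0 to the variable sum
var i = 7;

var isSmaller = i < 50;         // true: the value behind i is smaller than 50
var added = sum + i;            // 7: + adds numbers
var joined = "cat" + "fish";    // "catfish": + concatenates strings
var isSeven = (i == 7);         // true

console.log(isSmaller, added, joined, isSeven); // true 7 catfish true

// Comparing a number with a string:
console.log(i == "7");  // true, the string is converted to a number first
console.log(i === "7"); // false, the types differ
```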

In some way, this brings us to the next topic: functions and parameters. An operator is a special kind of function, and that will become more clear if I give an example function:

function add(first, second) {
  var sum = first + second;
  return sum;
}

Effectively, we just made an alias function "add" which internally just uses the + operator, with the exact same outcome.

Exercise: what would be returned by these two function calls? 1. add(1,2); 2. add("cat", "fish");

This function example is not so interesting, and only makes the code harder to read. However, when the "body" of the function becomes larger, it allows you to easily replace a complex list of steps with one function call. Consider: sumAllNumbers(1,50).
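The text only names sumAllNumbers; one possible body, assuming both the first and the last number are meant to be included in the sum, could look like this:

```javascript
// Sums all integers from first up to and including last.
function sumAllNumbers(first, last) {
  var sum = 0;
  for (var i = first; i <= last; i = i + 1) {
    sum = sum + i;
  }
  return sum;
}

console.log(sumAllNumbers(1, 50)); // 1275
console.log(sumAllNumbers(1, 3));  // 6
```

Whoever calls the function no longer needs to read the loop; the name and the two parameters say what happens.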

Now, if we collect many such functions, pretty much like books, we get a library. So, that one was easy.

That concludes this episode of the Programming in the Life Sciences series. I will continue later with the theory about Web Services and Clients, Serialization formats, and Other.

Sunday, October 20, 2013

calls for help

I don't think I mentioned this JISC project by David Shotton et al. yet, and should perhaps have done so earlier. But it is not too late, as Shotton is calling out for help in a Nature Comment this week (doi:10.1038/502295a). Now, I have been tracking what is citing the CDK literature using CiteULike since 2010, and just asked the project developers how I can contribute this data.

The visualization from the project is interesting, as it also shows papers citing papers that cite the CDK:

This image shows that the corpus is still small: this CDK paper is cited more than 250 times. In the comment, Shotton writes that "[i]deally, references will come directly from publishers at the time of article publication." I do hope that publishers soon start providing APIs to extract such data. But I like to complement the call out, by inviting everyone to start annotating their old papers with this information, e.g. using CiTO and CiteULike as I did. Importantly, the authors must type their citation, something that will greatly improve the paper itself, anyway.

Now, my own use case is to get an idea of how the CDK is used. Reason: people are not paying us, so I am limited to public reports that write up how they use the CDK. Direct citation is important, but I am even more interested in papers that do not cite the CDK, but cite a paper that describes a tool that depends on the CDK, like PaDEL (doi:10.1002/jcc.21707) which is cited already 73 times. Such papers are traditionally not counted as a measure of the impact of the CDK, but surely are. This work, combined with CiTO, allows just that.

D. Shotton (2013). Publishing: Open citations. Nature, 502 (7471), 295-297. doi:10.1038/502295a

Sunday, October 13, 2013

Forget Green and Gold Open Access: I only care about Copper Open Access

Update: because titanium OA, platinum OA, even white OA, were all already used (my apologies that I did not do my googling well enough), it is Copper Open Access: because it conducts knowledge so well.

I do not regularly attend Open Access publishing conferences. Not that I do not care, but more that I don't have the time. But I care enough to know about the Berlin Declaration (22 October 2003). Meetings like that set my Open Access mind set. Sadly, many publishers have fucked up the term Open Access. Pardon my strong language, but we all strongly suspect some publishers have done so deliberately. And not because they care about scientific dissemination and communication, although they will claim so.

The Berlin Declaration about Open Access (OA) is close to that of other Open things, and here are some critical lines (emphasis mine):

The author(s) and right holder(s) of such contributions grant(s) to all users a free, irrevocable, worldwide, right of access to, and a license to copy, use, distribute, transmit and display the work publicly and to make and distribute derivative works, in any digital medium for any responsible purpose, subject to proper attribution of authorship (community standards, will continue to provide the mechanism for enforcement of proper attribution and responsible use of the published work, as they do now), as well as the right to make small numbers of printed copies for their personal use.

Yeah, that does read a lot like a CC-BY license. We all know what Open Access means today: pretty much nothing, and once again we must look up the license of papers to see what the journal really is up to.

Now, because people started messing with the meaning, some terms were introduced: green OA and gold OA. This was after I thought the community had settled on things, and I never had to do much with it. I thought green was about self-archiving, and gold was about having a proper Open license.

However, as Andrea Scharnhorst pointed out, my understanding of gold OA was wrong. Apparently, the gold OA definition is now ambiguous too. For example, this page writes that with gold OA the author or author institution can pay a fee to the publisher at publication time, the publisher thereafter making the material available 'free' at the point of access (the 'gold' route). The two are not, of course, incompatible and can co-exist. Nothing here about an Open license. Wikipedia is closer to my impression, in that it says that OA often also comes without fees, but points to a vague Open Access Journal page, which means anything again. (Does anyone have texts for the original definitions of green and gold OA?) This mess is the reason for the Blue Obelisk movement not to include Open Access as a goal, where Open Data, Open Source, and Open Standards are.

So, forget Green and Gold Open Access: only care about Copper Open Access. Hereby, Copper Open Access is defined as:
  1. the author(s) remain copyright owners,
  2. the work is made available under an Open license that grants to all users a free, irrevocable, worldwide right of access to, and a license to copy, use, distribute, transmit and display the work publicly and to make and distribute derivative works in any digital medium for any purpose, subject to proper attribution of authorship, as well as any further rights we associate with Open as outlined by, for example, the Debian Free Software Guidelines.
If you agree with this definition, please support it: leave a comment that you approve, link to this blog, etc, etc. This definition is available under CC-BY.

Saturday, October 12, 2013

Programming in the Life Sciences #6: functions

One key feature of programming languages is the following: first, there is linearity. This is an important point that is not always clear to students who just start to program. In fact, ask yourself what the algorithm is for counting the chairs in the room where you are now sitting. Could a computer do that in the same way? How should your algorithm change? A key point is that the program is run step by step, in a linear way.

However, we very easily jump to functions. In fact, we use so many libraries nowadays, this linearity is not so clear anymore. Things just happen with magic library calls. But at the same time, the library calls make our life a lot easier: by using functions, we group functionality in easy to read and easier to understand blobs.

OK, the previous example showed that we could use the HTML @onClick attribute to provide further detail. But I did not show how. This is how:

  html += "Name: <span onClick=\"showDetails('" +
    escape(dataJSON) + "\')\">" + 
    response[i].prefLabel + "</span>";

This code adds the @onClick attribute and a function call to the showDetails() method which takes one parameter, where we pass escaped JSON. That is non-trivial, I understand, and may be due to my limited knowledge of JavaScript. The escaping of the JSON is needed to make quotes match in the generated HTML. In the function later, we can unescape it and get the original JSON again. Importantly, the dataJSON data contains all the details I like to show.

Now, these functions need to be defined. Yes, plural, because two functions are used in this code snippet: showDetails() and escape(). The last is defined by one of the used libraries. The showDetails() function, however, I made up. So, I had to define it elsewhere in the HTML document, and it looks like:

  var showDetails = function(dataJSON){
    data = JSON.parse(unescape(dataJSON));
    document.getElementById("details").innerHTML =

This example actually gives the exact same output as the code in the previous post, but with one major difference. We now can extend the function as much as we like, but the code to output the list of found compounds does not have to get more complex than it already is.
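To make the quote matching less mysterious, here is a small sketch of the round trip that can be run outside a web page. The record and its note field are made up for illustration (only prefLabel comes from the code above), and restore() is a hypothetical stand-in for the first line of showDetails():

```javascript
// Hypothetical search result; only prefLabel appears in the original code.
var record = {
  prefLabel: "Aspirin",
  note: "contains \"quotes\" and 'apostrophes'"
};

// escape() percent-encodes both kinds of quotes, so the result can sit
// inside the single-quoted argument of the onClick attribute.
var dataJSON = escape(JSON.stringify(record));
var html = "Name: <span onClick=\"showDetails('" + dataJSON + "')\">" +
  record.prefLabel + "</span>";

// Inside the function, unescape() and JSON.parse() restore the object.
var restore = function(dataJSON) {
  return JSON.parse(unescape(dataJSON));
};

console.log(html);
console.log(restore(dataJSON).note);
```

Note that escape() and unescape() are legacy JavaScript functions; they are used here only because the snippets above use them.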

Wednesday, October 09, 2013

Programming in the Life Sciences #5: converting the results into HTML

Now that we have the communication working with the Open PHACTS LDA, it is time to make a nice GUI. I will not go into details, but we can use basic JavaScript to iterate over the JSON results, and, for example, create an HTML table:

In fact, I hooked in some HTML onClick() functionality so that when you click one of the compound names, you get further details (under Compound Details), though that only outputs the ConceptWiki URI at this moment. A simple for-loop does the heavy work:

  html = "<table>";
  for (var i=0; i<response.length; i++) {
    html += "<tr>";
    html += "<td>";
    dataJSON = JSON.stringify(response[i]);
  //   dataJSON.replace(/"/g, "'");
    html += "Name: <span>" + response[i].prefLabel + "</span>";
    html += "</td>";
    html += "</tr>";
  }
  html += "</table>";
  document.getElementById("table").innerHTML = html;
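The same loop can be wrapped in a function that returns the HTML as a string, which makes it easy to try without a browser. This is only a sketch: buildTable() is a name made up here, and the sample response is a stand-in for what the LDA returns (only the prefLabel field is taken from the code above):

```javascript
// Hypothetical stand-in for the parsed LDA response.
var response = [
  { prefLabel: "Aspirin" },
  { prefLabel: "Aspirin anhydride" }
];

// Build one table row per result and return the HTML as a string.
function buildTable(response) {
  var html = "<table>";
  for (var i = 0; i < response.length; i++) {
    html += "<tr><td>Name: <span>" + response[i].prefLabel + "</span></td></tr>";
  }
  html += "</table>";
  return html;
}

console.log(buildTable(response));
```

In the web page, the returned string would be assigned to the innerHTML of the "table" element, as in the snippet above.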

So, we're set to teach the students all the basics of programming: loops, variables, functions, etc. 

Programming in the Life Sciences #4: communication from within HTML

The purpose of a web service is that you give it a question or task, and that it returns an answer. For example, we can ask the Open PHACTS platform what compounds it knows with aspirin in the name. We pass the question (with the API key) and get a list of matching compounds. Now, this communication is complex: it happens at many levels, which are spelled out in the Internet Model. There are various variants of the stack of communication layers, but we are interested mostly in the top layers, at the application layer. In fact, for this course this model only serves as supporting information for those who want to learn more.

Practically, what matters here is how to ask the question and how to understand the answer.

We are supported in these practicalities with JavaScript libraries, in particular the ops.js library and general JSON functionality provided by most browsers (unless the student decided to use a different programming language, in which case there are different libraries). Personally, I have only very limited JavaScript experience, and this mostly goes back to the good old Userscript and Greasemonkey days (wow! the paper is actually the 4th highest scoring BMC Bioinformatics article!). But because my JavaScript knowledge is limited and rusty, I spent a good part of today getting a basic example running. Very basic, and barely exceeding the communication details. That is, this is the output in the browser:

So, what does the question look like? The question is actually hardcoded in the HTML source, but the page does take two parameters: the app_key and app_id that come with your Open PHACTS account.
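How the page gets hold of these two parameters is not shown here. One way to do it, as a sketch and not necessarily how the example page does it, is to read them from the query string of the page address. The function name parseParams() and the placeholder values are made up:

```javascript
// Turn a query string such as "?app_id=...&app_key=..." into a plain object.
function parseParams(queryString) {
  var params = {};
  new URLSearchParams(queryString).forEach(function(value, key) {
    params[key] = value;
  });
  return params;
}

// In a browser one would pass window.location.search instead of this
// placeholder string.
var params = parseParams("?app_id=myid&app_key=mykey");
console.log(params["app_id"], params["app_key"]);
```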

The ops.js library helps us, and wraps the Open PHACTS LDA methods in JavaScript methods. Thus, rather than crafting special HTTP calls, we use two JavaScript calls:

var searcher = new Openphacts.ConceptWikiSearch(
  params["app_id"], params["app_key"]
  'Aspirin', '20', '4', '07a84994-e464-4bbf-812a-a4b96fa3d197',

The first statement creates an LDA method object, while the second makes an actual question. I have not defined the callback variable, which actually is a JavaScript function that looks like:

var callback = function(success, status, response){
  var result = searcher.parseResponse(response);
  document.getElementById("output").innerHTML =
    "Results: " + JSON.stringify(result);
};

When the LDA web service returns data, this method gets called, providing asynchronous functionality to keep the web page responsive. But when called, it first parses the returned data, and then puts the JSON output as text in the HTML. This is the output shown in the earlier screenshot.
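The asynchronous part can be tried without the real web service. In this sketch fakeSearch() is a made-up stand-in that answers after a short delay, and formatResult() is a made-up helper; neither is part of ops.js:

```javascript
// Helper: turn a result into the text that would be shown on the page.
function formatResult(result) {
  return "Results: " + JSON.stringify(result);
}

// Hypothetical stand-in for a web service call: it returns immediately and
// invokes the callback later, once the (fake) answer has arrived.
function fakeSearch(query, callback) {
  setTimeout(function() {
    callback(true, 200, [{ prefLabel: query }]);
  }, 100);
}

var callback = function(success, status, response) {
  console.log(formatResult(response));
};

fakeSearch("Aspirin", callback);
console.log("Request sent; the page stays responsive while waiting.");
```

The second message is printed first, because the callback only runs once the answer is there.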

So, hurdle taken. From here on it's easier. Regular looping over the results, creating some HTML output, etc. The full source code of this example is available as Gist.

Programming in the Life Sciences #3: the assessment

Now that I have written out the goals, what the students will practically do, and how to get started with the Open PHACTS platform, I will list how we will assess the students:
  1. a presentation on the second day, outlining the project and work plan, 
  2. working source code at the end of the course,
  3. a final presentation, showing the results and conclusions.
Primarily, they will be judged on their acquired programming skills. Working code is the minimum, but code quality will be taken into account too. I will show them how blogging works as a pre-print server for presentations. I hope it will also teach them what role this has in scientific communication.

Tuesday, October 08, 2013

Programming in the Life Sciences #2: accounts and API keys

I have outlined the scope of the six-day course: the students will learn to program while hacking on the Open PHACTS' Linked Data API (LDA). The first step is to get an account for the LDA. I have already done that to save time. But these are the steps to take. You go to

You then approve the account via your email account and you are set. The account is needed to get an API key. Using this key, Open PHACTS developers can contact you if your scripts go berserk. So, you are kindly invited to make crazy hypotheses and hack the hell out of the platform. That's what I hope my students will do.

To try your new key, go to the documentation page, and open, for example, the SMILES to URL method:

Here you can see what parameters this LDA method has. We focus now on the app_id and app_key fields. Each account comes by default with a, um, default app_id and default app_key. Just click on the field and select them:

Select the defaults and enter a SMILES (try: CC(=O)NC1=CC=C(C=C1)O). You can select the format you like (I like Turtle) and you get Linked Data back on this compound.
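Behind the documentation page, such a request is just a web address with parameters. The sketch below shows how the SMILES and the two keys could end up in one; the address and the name of the SMILES parameter are placeholders and not the real LDA ones, only app_id and app_key are the field names mentioned above:

```javascript
// Placeholder address and placeholder parameter name for the SMILES.
var smiles = "CC(=O)NC1=CC=C(C=C1)O";
var url = new URL("https://example.org/structure");
url.searchParams.set("app_id", "your_app_id");
url.searchParams.set("app_key", "your_app_key");
url.searchParams.set("smiles", smiles);

// Characters such as '=' and '(' are percent-encoded in the final address.
console.log(url.toString());
```

The encoding matters because a SMILES string is full of characters that have their own meaning in a web address.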

Now, go explore the LDA methods.

Sunday, October 06, 2013

Last CDK-JChemPaint release based on CDK 1.4.x

With CDK master improving day by day now, it is time to stop working with CDK 1.4. Thus, there is one final CDK-JChemPaint release based on this currently still stable CDK release: version 29. Nothing really changed, except for it being based on CDK 1.4.19. And I have also updated the Groovy-JCP code example, resulting in release 29 (yeah, the version numbers follow the CDK-JChemPaint patch). In fact, I have already rebased CDK-JChemPaint on CDK master, and will try to release Groovy-JCP 30 these days too. On the right there is a depiction of a structure as output of one of the scripts, showing CIP chirality labeling (and transparent element symbol labels too).

Saturday, October 05, 2013

Programming in the Life Sciences #1: a six day course

Update: the students will use ops.js and not swagger.js.

Our department will soon start the course Programming in the Life Sciences for a group of some 10 students from the Maastricht Science Programme. This is the first time we give this course, and over the next weeks I will be blogging about this course. First, some information. These are the goals, to use programming to:
  • have the ability to recognize various classes of chemical entities in pharmacology and to understand the basic physical and chemical interactions.
  • be familiar with technologies for web services in the life sciences.
  • obtain experience in using such web services with a programming language.
  • be able to select web services for a particular pharmacological question.
  • have sufficient background for further, more advanced, bioinformatics data analyses.
So, this course will be a mix of things. I will likely start with a lecture or two about scientific programming, such as the importance of reproducibility, licensing, documentation, and (unit) testing. To achieve these learning goals we have set a problem. The description is:
    In the life sciences the interactions between chemical entities are of key interest. Not only do these play an important role in the regulation of gene expression, and therefore all cellular processes, they are also one of the primary approaches in drug discovery. Pharmacology is the science that studies the action of drugs, and for many common drugs, this is studying the interaction of small organic molecules and protein targets.
    And with the increasing information in the life sciences, automation becomes increasingly important. Big data and small data alike, provide challenges to integrate data from different experiments. The Open PHACTS platform provides web services to support pharmacological research and in this course you will learn how to use such web services from programming languages, allowing you to link data from such knowledge bases to other platforms, such as those for data analysis.
So, it becomes pretty clear what the students will be doing. They only have six days, so it won't be much. It's just to teach them the basic skills. The students are in their 3rd year at the university, and because of the nature of the programme they follow, they have a mixed background in biology, mathematics, chemistry, and physics. So, I have good hope they will surprise me in what they will get done.

Pharmacology is the basic topic: drug-protein interaction, but the students are free to select a research question. In fact, I will not care that much what they like to study, as long as they do it properly. They will start with Open PHACTS' Linked Data API, but here too, they are free to complement data from the OPS cache with additional information. I hope they do.

Now, regarding the technology they will use. The default will be JavaScript, and in the next week I will hack up demo code showing the integration of ops.js and d3.js. Let's see how hard it will be; it's new to me too. But, if the students already are familiar with another programming language and prefer to use that, I won't stop them.

(For the Dutch readers, would #mscpils be a good tag?)

Monday, September 30, 2013

OpenTox Europe 2013 presentation: "The Open PHArmacological Concepts Triple Store"

On behalf of the Open PHACTS project, I have today presented the project at the OpenTox Europe 2013 meeting in Mainz. The session was about data management and analysis, and chaired by Dr. Nina Jeliazkova. Actually, I ended up in the same session as my colleague Martina Kutmon who gave a really nice presentation on PathVisio 3. Anyway, my slides looked like this and were based on earlier presentations from Open PHACTS colleagues, Gerhard Ecker and Chris Evelo in particular:

In the afternoon there were workshops where I presented Bioclipse-OpenTox, and in particular the scripting side of it. You can read that tutorial here.

Friday, September 27, 2013

Urgent Open Science needs for Drug Discovery: pKa and logP

There is quite some discussion right now on Open Source Drug Discovery, and questions about what is Open Source and what is not. But as I made clear yesterday, I do not think that a project that requires the assignment of specific rights independent from Open licenses is the way forward, and in many cases it is not even possible. In my humble opinion, an #openscience approach is critical. Hiding data pending a (open) patent does not work for me; I'm sorry. Not that I am against patents in general (primarily, the patent system is broken and misused, but the idea has merits)...

Instead, I very much prefer to focus on solutions. Like the CDK, Bioclipse, BODR, CML, and many other Blue Obelisk tools. These tools are enabling drug discovery and research into computational tools to aid drug discovery. Without strings. No fuss about having to submit your precious data before you can use these tools. We contribute, we pay forward. And seriously, I love to see a Nature Chemical Biology paper and learn it uses the CDK, even if I am not a co-author on that paper, as much as I could use that in my academic career (or any of the other 75 contributors of the CDK!).

We do get back, beyond that aforementioned satisfaction. We do see other projects donate data, donate tools built on top of the CDK, to further aid the community.

But if you really like to know, here's my wishlist of things that we really urgently need: Open Data for training (statistical) models for chemical properties. In particular, I need CCZero experimental data (annotated with experimental method, error, etc) for:

  1. logP (and/or LogD)
  2. pKa (please use this wiki)
We recently saw such initiatives for melting points and solubility already from Jean-Claude, Andrew Lang, Antony Williams, and others.

If you have data, please make it available as Open Data, by putting it online in a machine readable format, and with proper copyright and CCZero waiver information.

Thursday, September 26, 2013

Why do databases make sharing Open Data difficult?

I tend to feel quite isolated in these matters, but they matter to me: licenses, agreements, etc. Because I try to be a friendly guy and respect the wishes expressed by others.

However, this puts me in a situation where I cannot join many otherwise interesting initiatives. There are many examples, but I will isolate one, for no particular reason other than that they just published an interesting paper about DMSO solubility modeling (doi:10.1021/ci400213d): the Online Chemical Database.

The training data from this solubility study is available from this website, and is listed in the abstract as freely downloadable. Well, free as in free beer. I cannot even look at the data set metadata without signing a license. So, I started reading the license, and clauses like this worry me:
    4.1 The User grants to Helmholtz Zentrum Muenchen by submitting information, data, models and structures to the Online Chemical Environment a world-wide, non-exclusive, transferable and sub licensable right to use all information data, structures and models submitted, for research, teaching and any other (including commercial) purposes.
Originating from an open, academic culture of collaboration, I rarely am the sole copyright owner of a data set. And with my busy agenda I am really not going to chase down all owners and ask them if they are willing to assign these rights to the Helmholtz Zentrum Muenchen. Do you seriously think I have nothing better to do? So, I cannot contribute data to this database. Worse, this clause is probably not compatible with Open Data licenses in general. I fully understand the intention, but you are probably paying your legal experts a lot of money, so let them do their work and explicitly allow Open Data licenses, indicating that any such clauses do not apply to such data.

BTW, comparing this clause to 4.2 is awkward too. Not giving downloaders of data sets uploaded to the database the same rights as the uploader has given you, doesn't sound like being a good citizen.

Now, in no way is this database unique. Many databases I encounter, all with the best of intentions, come up with legal obstacles. Is that really what you wanted to do?