Almost three years ago I collaborated with others in the W3C Health Care and Life Sciences interest group. One of the results of that was a paper in the special issue around the semantic web conference at one of the biannual, national ACS meetings (look at this nice RDFa-rich meeting page!). My contribution was around ChEMBL-RDF, which I recently finally published, though it was already described earlier in an HCLS note.
Anyway, when this paper reached the most-viewed-paper position in the JChemInf journal, and I tweeted that event, I was asked for an update of the linked data graph (the darker nodes are the twelve the LODD task force worked on). A good question indeed, particularly if you consider the name, and that not all of the data sets were really Open (see some of the things on Is It Open Data?). UMLS is not open; parts of SIDER and STITCH are, but not all; CAS is not open at all; and KEGG Cpd has since been locked down. Etc. A further issue is that the Berlin node in the LODD network, which hosted many data sets (Open or not), is down. Chem2Bio2RDF seems down too.
Bio2RDF is still around, however (doi:10.1007/978-3-642-38288-8_14). At this moment, it is a considerable part of the current Linked Drug Data network. It provides 28 data sets. It even provides data from KEGG, but I still have to ask them what they had to do to be allowed to redistribute the data, and whether that applies to others too. Open PHACTS is new and integrates a number of data sets, like ChEMBL, WikiPathways, ChEBI, a subset of ChemSpider, and DrugBank. However, it does not expose that data as Linked Data. There is also the new (well, compared to three years ago :) Linked Life Data, which exposes quite a few data sets, some originating from the Berlin node.
Of course, DBpedia is still around too. Also important is that more and more databases themselves provide RDF, like UniProt, which has a SPARQL endpoint in beta, WikiPathways, PubChem, and ChEMBL at the EBI. And more will come, /me thinks.
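Such endpoints can be queried from the command line too. Here is a minimal sketch with curl (the endpoint URL and the query shape are my assumptions for illustration; check the service documentation before relying on them):

# Hypothetical example: list a few proteins from a SPARQL endpoint.
# The beta endpoint URL is an assumption; adjust as needed.
curl -G "http://beta.sparql.uniprot.org/sparql" \
  -H "Accept: application/sparql-results+xml" \
  --data-urlencode "query=PREFIX up: <http://purl.uniprot.org/core/> SELECT ?protein WHERE { ?protein a up:Protein } LIMIT 10"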
I am aggregating data in a Google Spreadsheet, but obviously this needs to go onto the DataHub. And a new diagram needs to be generated. And I need to figure out how things are linked. But the biggest question is: where are all the triples with the chemistry behind the drugs? Like organic syntheses, experimental physical and chemical data (spectra, pKa, logP/logD, etc.), crystal structures (I think the COD is working on an RDF version), etc., etc. And what data sets am I missing in the spreadsheet (for example, data sets exposed via OpenTox)?
Sunday, March 30, 2014
Friday, March 28, 2014
"Bridging WikiPathways and metabolomics data using the ChEBI ontology"
This week the 3rd ChEBI User Workshop took place, and I presented how WikiPathways is using ChEBI, how I have been using it in the BridgeDb identifier mapping database for metabolites, and how I map metabolites to WikiPathways using the ChEBI ontology.
Breaking News: CC-NC only for personal use!

(Image from Wikimedia.)
Sunday, March 16, 2014
Publishers #fail to innovate knowledge dissemination
(Image source: Wikipedia, public domain.)
And just to make the point: we do need tools like this. We did 20 years ago. And publishers have done way too little. I really understand innovation is slow and expensive. But, come on, use your imagination. I cannot solve everything in the world myself, and rely on others to implement stuff too. And here is an idea.
What if publishers could actually solve this problem? I know plenty of people are talking about it, and give it funny names, like nanopublications. That idea, too, has existed for more than 20 years now. In fact, CMLRSS is not far from the nanopublication (doi:10.1021/ci034244p). And it was functional. Really, the implementation and standard are not even the issue. The key is adoption. Adoption may be slow, but it must exist. And for adoption to happen, you need commitment. For example, by promising that the time and resources invested in the adoption will have a return on investment. For example, by guaranteeing that your solution won't go commercial at some point (causing a vendor lock-in!).
But that something must happen is clear if you return to the science. Have you ever tried to do a theoretical study of some phenomenon? Then you know that data availability is a problem. And this data scarcity is exactly the reason why it has become valuable, causing people to sit on top of it like a hen on her egg(s). If you have ever been involved in getting some good quality data together (ever noticed that much commercial data does not have the data you really need?), you know how expensive data is then. Recovering it costs more after the publishing process than before. Really, the original notebook has more information, and is likely more informative than the formal publication.
Not only has the publishing model itself become more expensive than needed (just think about the APCs of newer publishers, like PeerJ!), publishers also make access to the data more expensive than really needed.
This is a huge fail in the Western approach to science: we enormously disrespect data.
If you are not convinced, please give me answers to these questions (read active ingredient for "drug"):
- how were the CYP experiments performed for the top ten selling drugs and what are the main human transformations?
- what are the experimental errors on pKa measurements of the top ten selling drugs (uncharged and singly charged, positive and negative)?
- how were the logP values measured for the top ten selling drugs and at what pH?
- what are the size distributions of samples of nanomaterials reported in literature?
- what are the different forms of a protein (not shape, but in terms of structure; so, phosphorylation states, exact positions, relevant SNPs, etc.) of the top ten proteins relevant to pancreatic cancer?
If you can answer any of these questions in less than one hour with provenance (a list of DOIs and/or PubMed IDs), then I would love to hear that. It would give an estimate of the problem. However, my estimate currently is that you cannot fully answer these questions, and most certainly not within one day. Had publishers taken their goal of knowledge dissemination seriously in the past 20 years, it would have been a lot simpler. But they failed. Why should I trust them to do better in the next 20 years? Meanwhile, with the limited funding I get, I will keep being happy with the things I can contribute.
Now, if you do not understand why those details matter, start doing a multivariate statistics course. </rant>
Saturday, March 15, 2014
CiteULike to Twitter? IFTTT!
Twitter is great! Minutes after I asked this online:
can I have @citeulike automatically tweet new papers I bookmark?
— Egon Willighagen (@egonwillighagen) March 15, 2014
Sarah Pohl replied:
@egonwillighagen I would have suggested @IFTTT for that, but they don't support @citeulike, it seems.
— Sarah Pohl (@LilithElina) March 15, 2014
Alex Henderson complemented the answer, informing me that RSS is supported:
@egonwillighagen @citeulike Library RSS to @IFTTT then twitter should work fine.
— Alex Henderson (@AlexHenderson00) March 15, 2014
So, I signed up to IFTTT and made a recipe, and life is good:
bookmarked "UbiProt: a database of ubiquitylated proteins" http://t.co/7s2HsERYeVI wonder if I can get this to work with CMLRSS (see doi:10.1021/ci034244p) in some way... it would be brilliant to route molecular structures from the Crystallography Open Database into ChemSpider and PubChem automatically, wouldn't it?
— Egon Willighagen (@egonwillighagen) March 15, 2014
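The underlying trick, by the way, is just the library RSS feed feeding the IFTTT recipe. A quick sanity check from the command line (the feed URL pattern is my assumption; verify it against your own CiteULike profile):

# Hypothetical: fetch the CiteULike library RSS feed that drives the recipe.
curl "http://www.citeulike.org/rss/user/egonwillighagen"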
Saturday, March 08, 2014
Reviewing CDK patches in the Maven era
Three weeks ago the CDK project migrated from Ant to Maven as the primary build tool. That means that my workflow for making and, importantly, reviewing patches has been completely turned upside down. Well, that happens.
My patch reviewing workflow used to look like this:
- run the test suite and capture the number of JUnit Errors and Fails
- apply the patch and check if things still compile
- run the test suite and capture the number of JUnit Errors and Fails
- compare the number of Errors and Fails before and after
- check if JavaDoc is in order
- check if there are new unit tests where appropriate
- check for new PMD issues
For these issues I always had CDK Nightly as a backup, and this is now replaced by Jenkins; e.g. check this instance at the EBI. This workflow now translates to something like this (the extraction of the results was suggested by John; see also the consolidated sketch after the list):
- mvn clean compile test -Dmaven.test.failure.ignore=true
- cat */*/target/surefire-reports/* | grep "Tests run" | sed -e "s/, Time elapsed.* /\|/" | sort -t'|' -k2 > prepatch.txt
- git am / git cherry-pick
- repeat steps 1 and 2, and save the output as postpatch.txt
- diff -u prepatch.txt postpatch.txt
- repeat steps 1-5, if needed.
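And here is the consolidated sketch of steps 1-5 as one small shell script (the capture function and the patch file name are mine, for illustration only):

# Capture test results, once before and once after applying the patch.
capture() {
  mvn clean compile test -Dmaven.test.failure.ignore=true
  cat */*/target/surefire-reports/* | grep "Tests run" \
    | sed -e "s/, Time elapsed.* /\|/" | sort -t'|' -k2 > "$1"
}

capture prepatch.txt
git am patch.mbox            # or: git cherry-pick <commit>
capture postpatch.txt
diff -u prepatch.txt postpatch.txt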
And if all is good, then the diff should show no new fails, and possibly even fewer. During a set of patches, things may be temporarily failing, such as in this case:
diff -u prepatch.txt postpatch.txt
--- prepatch.txt 2014-03-08 11:41:13.520240111 +0100
+++ postpatch.txt 2014-03-08 12:59:21.022609259 +0100
@@ -3,6 +3,14 @@
Tests run: 22, Failures: 1, Errors: 0, Skipped: 0|org.openscience.cdk.atomtype.ReactionStructuresTest
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0|org.openscience.cdk.CDKTest
Tests run: 10, Failures: 1, Errors: 0, Skipped: 0|org.openscience.cdk.formula.rules.IsotopePatternRuleTest
+Tests run: 15, Failures: 0, Errors: 10, Skipped: 0|org.openscience.cdk.graph.CyclesTest
+Tests run: 14, Failures: 0, Errors: 14, Skipped: 0|org.openscience.cdk.graph.EdgeShortCyclesTest
+Tests run: 12, Failures: 0, Errors: 12, Skipped: 0|org.openscience.cdk.graph.EssentialCyclesTest
+Tests run: 31, Failures: 0, Errors: 18, Skipped: 0|org.openscience.cdk.graph.InitialCyclesTest
+Tests run: 14, Failures: 0, Errors: 12, Skipped: 0|org.openscience.cdk.graph.MinimumCycleBasisTest
+Tests run: 14, Failures: 0, Errors: 12, Skipped: 0|org.openscience.cdk.graph.RelevantCyclesTest
+Tests run: 13, Failures: 0, Errors: 11, Skipped: 0|org.openscience.cdk.graph.TripletShortCyclesTest
+Tests run: 14, Failures: 0, Errors: 14, Skipped: 0|org.openscience.cdk.graph.VertexShortCyclesTest
Tests run: 2, Failures: 2, Errors: 0, Skipped: 0|org.openscience.cdk.io.cml.QSARCMLRoundTripTest
Tests run: 14, Failures: 5, Errors: 0, Skipped: 0|org.openscience.cdk.modeling.builder3d.ForceFieldConfiguratorTest
Tests run: 15, Failures: 1, Errors: 0, Skipped: 0|org.openscience.cdk.qsar.descriptors.atomic.AtomDegreeDescriptorTest
Oh, and I can do intermediate compiles without running the tests with:
mvn compile -DskipTests