Sunday, April 16, 2017

GenX spill, national coverage, but where is the data

First (I have never blogged much about risk and hazard): I am not a toxicology expert nor a regulator. I have the deepest respect for both, as these studies are among the most complex ones I am aware of. They make rocket science look dull. However, I have quite some experience with the relation between chemical structure and properties, and with knowledge integration, which is a prerequisite for understanding that relation. Nothing I write here says what the right course of action is. Any new piece of knowledge (or technology) has pros and cons. It is science that provides the evidence to support finding the right balance. It is science I focus on.

The case
The AD national newspaper reported the spilling of a compound named GenX into the environment, where it reached drinking water. This was picked up by other newspapers, like de VK. The chemistry news outlet C2W commented on the latter on Twitter:

Translated, the tweet reports that we do not know if the compound is dangerous. Now, to me, there are two things here: first, spilling should not happen at all (I know this is controversial, as people are more than happy to repeatedly pollute the environment, out of self-interest and/or laziness); second, what do we know about the compound? In fact, what is GenX even? It certainly won't be "generation X", though we don't actually know the hazard of that either. (We have IUPAC names, but just like with the ACS disclosures, companies like to make up cryptic names.)

But having worked on predictive toxicology and data integration projects around toxicology, and simply out of chemical interest, I set out to search for what we know about this compound.

Of course, I need an open notebook for my science, but I tend to be sloppy and mix up blog posts like this with source code repositories and public repositories. For new chemicals, as you could read earlier this weekend, Wikidata is one of my favorites (see also doi:10.3897/rio.1.e7573). Using the same approach as for the disclosures, I checked if Wikidata had entries for the ammonium salt and the "active" ingredient FRD-903 (after all, chemically they are different, and so may their hazard and risk profiles be). Neither existed, so I added them using Bioclipse and QuickStatements (a wonderful tool by Magnus Manske): GenX and FRD-903. So, a seed of knowledge was planted.
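For those curious what such a QuickStatements submission looks like: creating a minimal compound item boils down to a few tab-separated instructions like the ones below. P31 is "instance of", Q11173 is "chemical compound", P233 is "canonical SMILES", and Len sets the English label. Note that the SMILES shown here is my own reconstruction of FRD-903 and should be double-checked against the dossier before reuse:

```
CREATE
LAST	Len	"FRD-903"
LAST	P31	Q11173
LAST	P233	"OC(=O)C(F)(OC(F)(F)C(F)(F)C(F)(F)F)C(F)(F)F"
```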
    A side topic... if you have not looked at this annotation tool yet, please do. It allows you to annotate web pages (yes, there are more tools that allow that, but I like this one), which I have done for the VK article:

I had a look around on the web for information, and there is not a lot. A Wikidata page with further identifiers then helps you track your steps. Antony Williams, previously of ChemSpider fame and now working on the EPA CompTox Dashboard, added the DTX substance IDs, but the entries in the dashboard will not show up for a while yet. For FRD-903 I found growth inhibition data in ChEMBL.

But Nina Jeliazkova pointed me to her LRI AMBIT database (poster abstract doi:10.1016/j.toxlet.2016.06.1469, PDF links) that makes (public) data from ECHA REACH dossiers available in a machine-readable way (see this ECHA press release), using their AMBIT software (doi:10.1186/1758-2946-3-18). (BTW, this makes the legal hassle Hartung had last year even more interesting; see doi:10.1038/nature.2016.19365.) After creating a free login, you can find a full (public) dossier with information about the toxicology of the compound (toxicity, ecotoxicity, environmental fate, and more):

I picked this slide, as the worry seems to be about drinking water, so oral toxicity seems appropriate (note: this is only acute toxicity). The LD50 is the median lethal dose, but it is only measured for mouse and rat (these are models for human toxicity, but only models, as humans are just not rats; well, not literally, anyway). Also, >1 gram per kilogram body weight ("kg bw"; my assumption) seems pretty high. In my naive understanding, the rat may be the canary in the coal mine. But let me refrain from making any conclusions. I leave that to the experts on risk management!

Experts like those from the Dutch RIVM, who wrote up this report. One piece of information they say is missing is that of biodistribution: "waar het zich ophoopt", or in English, where the compound accumulates.

Friday, April 14, 2017

The ACS Spring disclosures of 2017 #2: some history

Bethany Halford adds some history about the sessions (see part #1):
    I believe Stu Borman was the first to cover the Division of Medicinal Chemistry’s First Time Disclosures symposium for C&EN, but it was Carmen Drahl who began the practice of hand-drawing and tweeting the clinical candidates as they were disclosed in real time. This seems like an oddball practice to folks who aren’t at the meeting. Why not just take a picture of the relevant slide? Well, that’s against the rules: There are signs all over the ACS National Meeting stating that photos, video, and audio recording of presentations are strictly prohibited. In San Francisco, symposium organizer Jacob Schwarz repeatedly reminded attendees that this was the case. Carmen’s brilliant idea to get around this rule was to simply draw the structures as they were presented, snap a photo, and then tweet it out.

    I’ve inherited the task since Carmen left the magazine a couple of years ago. I find it incredibly stressful. For an event that’s billed as a disclosure, the actual disclosing is fairly fleeting. The structures are often not on the screen for very long, and I’m never confident that I’ve got it 100% right. Last year in San Diego I tweeted out one structure and I heard the following day from Anthony Melvin Crasto, a chemist in India, that based on the patent literature he thought I had an atom wrong. I was certain that I had written this structure correctly, so I contacted the presenting scientist. He had disclosed the wrong structure!

    I agree that there should be some sort of database established afterwards, and I think you all have done great work on that front. I think you’ll find the pharmaceutical companies reluctant to help you out in any way. They guard these compounds so fiercely that it often makes me wonder why we have this symposium to begin with.

The ACS Spring disclosures of 2017 #1

At the American Chemical Society meetings drug companies disclose recent new drugs to the world. Normally, the chemical structures are already out in the open, often as part of patents. But because these patents commonly discuss many compounds, the disclosures are a big thing.

Now, these disclosure meetings are weird. You will not get InChIKeys (see doi:10.1186/s13321-015-0068-4) or anything similar. No, people sit down with paper and manually redraw the structures. Carmen Drahl did this in the past, and Bethany Halford has taken over that role at some point. Great work from both! Chemical & Engineering News has aggregated the tweets into this overview.

Of course, a drug structure disclosure is not complete if it does not lead to deposition in databases. The first step is to convert the drawings into something machine readable. And thanks to the great work by John May on the Chemistry Development Kit and by the OpenSMILES team, I'm happy with this being SMILES. So, we (Chris Southan and I) started a Google Spreadsheet with CCZero data:

I drew the structures in Bioclipse 2.6.2 (which has CDK 1.5.13) and copy-pasted the SMILES and InChIKey into the spreadsheet. Of course, it is essential to get the stereochemistry right. The stereochemistry of the compounds was discussed on Twitter, and we think we got it right. But we cannot be 100% sure. For that, it would have been hugely helpful if the disclosures included the InChIKeys!

As I wrote before, I see Wikidata as a central resource in a web of linked chemical data. So, using the same code I used previously to add disclosures to Wikidata, I created Wikidata items for these compounds, except for one that was already in the database (see the right image). The code also fetches PubChem compound IDs, which are also listed in this spreadsheet.

The Wikidata IDs link to the SQID interface, giving a friendly GUI, one that I actually brought up before too. That said, until people add more information, it may be a bit sparsely populated:

But others are working on this series of disclosures too, so keep an eye on this blog post, as others may follow up with further information!

Saturday, April 01, 2017

Closed access book chapters, Bookmetrix, and job creation

Enjoying my Saturday morning (you can actually track down that I write more blog posts then than at any other time of the week) with a coffee (no, not beer, Christoph). I wanted to complete my Scholia profile (great work by Finn, arxiv:1703.04222; happy to have contributed ideas and small patches) a bit more (or perhaps that of the Journal of Cheminformatics), as that relaxes me, and it nicely complements rerunning some Bioclipse scripts to add metabolite/compound data to Wikidata (e.g. this post). Because this afternoon I want to do some serious work, like writing up outlines for a few cool grant applications. And if lucky, I may be able to do a bit of work on this below-the-radar project.

So, I started making a full-text version available for a peer-reviewed IEEE paper (doi:10.1109/BIBM.2014.6999367), as it is not gold Open Access, and I have to rely on green Open Access. Then I headed over to my ImpactStory profile and ran into a closed-access book chapter with Tony, Sean, and Ola (doi:10.1007/978-1-62703-050-2_10). But I have no idea if I can put a green Open Access version of this book chapter online.

Now, the reason I am blogging this (while, meanwhile, adding four new DTXSIDs to Wikidata) is two observations. First, I had not blogged about Bookmetrix yet, a cool project that reports the impact of book chapters. I always considered the ROI on writing book chapters not so high, but then I saw the #altmetrics for this chapter:

Five citations is not a lot, but then I do not cite book chapters much either. But look at that number of downloads: 2.39 thousand! Wow!

But there is another angle to that. We regularly report our societal impact nowadays. It's part of the Dutch Standard Evaluation Protocol, or at least selected by our research institute as something to assess researchers on. Hang on, no, citations are not part of that category. But this is: the chapter is sold for about 50 euro. Seriously? Yes, seriously. And apparently 2.39 thousand people bought this chapter. I am not sure if I should assume that this is mostly people buying the full book, which would make the chapter a lot cheaper; but the full book reports download numbers above 50 thousand, so it seems not. Now, let's assume that a good part of the bought copies came via package deals and the average payment is half the list price. That may sound high, but we ignore the 50 thousand downloads of the full book to compensate for that.

Doing the math means that our joint book chapter contributed some 60 thousand euro to the European market. That's a full job the four of us created with this single book chapter. I'm impressed.
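For transparency, the back-of-the-envelope calculation, with the assumptions from the text made explicit:

```javascript
// Back-of-the-envelope: chapter downloads times an assumed average price.
const downloads = 2390;           // reported chapter downloads (2.39 thousand)
const listPrice = 50;             // euro per chapter, as reported above
const averagePayment = listPrice / 2;  // assumption: package deals halve the average
const contribution = downloads * averagePayment;  // in euro, roughly 60 thousand
```

All numbers are the rough estimates from the text, so treat the outcome as an order-of-magnitude guess, not accounting.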

Thursday, March 30, 2017

March for Science #marchforscience

You cannot have missed it, and if you did, you know about it now. We're marching for science. Originating in the USA, the marches are spreading around the world, including Europe. The count was more than 400 a week ago. One by one, European cities joined with initiatives. Science March Stockholm was the first to get my interest, but Science March Amsterdam followed soon after. So, no reason to return to Stockholm this April. Here's a map with all planned marches around the world:

Zooming in on Europe (well, part of it), we get this map:

Quite a bit of choice. We see several countries with multiple marches. The Netherlands shows the Amsterdam march, but ideas have been posed to organize a Science March in Maastricht too.

Well, I will be marching. For what? For the importance of apolitical, nonreligious facts about the world. Facts that can be proven true. But also for a world where people value facts and fulfill the human rights of everyone, as facts don't care about race, gender, color, left, right, or nerdiness.

Our world is precious; humans and nature are precious. Whether we choose to destroy the world or to let mankind and nature prosper, let it be because of neutral facts. Not wishful thinking, money, or politics.

Let's show that science (of any domain: not just the life sciences, but also the humanities, etc.) is by everyone and for everyone. Access to knowledge is a human right and is to benefit everyone. The march is for everyone too: you do not have to work in scientific research to join the march and express your wish for a fact-based country.

April 22, Amsterdam and Maastricht! Join!

Tuesday, March 21, 2017

OpenAPI to bio.tools: the Ensembl example

Already many months ago I joined a bio.tools (doi:10.1093/nar/gkv1116) workshop in Amsterdam, organized by Gert Vriend et al (see this coverage). I learned there how to register services and search them, and that underneath, JSON is used in the bio.tools API to exchange information about the services. One neat feature is that bio.tools allows you to specify a lot of detail about the service calls.

Now, at the time we had already used OpenAPI (then still called Swagger) for Open PHACTS for some time, which we later picked up for other projects, like eNanoMapper (API), WikiPathways (API), and BridgeDb (API). OpenAPI configuration files also describe how web services work. So, the idea arose that it should be possible to convert an OpenAPI definition into a bio.tools entry. Simple. I started a GitHub repository, but, of course, did not really have time to implement it.

Then, half a year ago, at the ELIXIR track meeting at the ECCB in The Hague (where I presented this BridgeDb poster), I spoke with people from ELIXIR-DK who were just starting a studentship scheme. This led to a project idea, then a proposal, and then a small, approved project, allowing me to fund Jonathan Mélius to work on this part-time, for about a man-month of work spread over several months.

Jonathan has been doing great work, and because we wanted to demo the OpenAPI-to-bio.tools bridge with a major European resource, Ensembl was suggested (which just published a paper on their core software). An OpenAPI definition for Ensembl was set up, which is going to be the primary input for the new tool:

The next step was to take the JSON defining the content of this page (you can find the URL to the JSON file at the top of that page; it is hosted on GitHub too) and convert that to bio.tools fragments. That the approach works is shown by this test entry in bio.tools:

The observant eye will see that various bits of the descriptions of the API calls are annotated with EDAM ontology (doi:10.1093/bioinformatics/btt113) terms, a key feature of bio.tools. This information is currently not available in the OpenAPI JSON (we will be exploring how that specification could/should be extended to support this). Moreover, the web service API methods need ontological annotation in the first place, and we will not be able to totally remove human involvement there.

The EDAM IRIs are still hard-coded in the conversion tool at this moment, but are being factored out into a secondary JSON file. So, the conversion tool will take two input JSON files, OpenAPI + EDAM annotation, and create bio.tools JSON output. The latter can then be inserted into bio.tools. We will work on something based on the bio.tools API to automate that step too.
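To give an idea of the shape of the conversion, here is a minimal sketch in JavaScript. The bio.tools field names and the format of the secondary EDAM annotation file are illustrative assumptions of mine, not the actual tool in the GitHub repository:

```javascript
// Sketch: convert an OpenAPI (Swagger) definition plus a secondary EDAM
// annotation file into bio.tools-style "function" entries. Field names and
// the annotation file layout are assumptions for illustration only.
function openapiToBiotools(openapi, edamAnnotations) {
  const functions = [];
  for (const [path, methods] of Object.entries(openapi.paths || {})) {
    for (const [method, op] of Object.entries(methods)) {
      const edam = edamAnnotations[path] || {};   // hand-curated EDAM terms
      functions.push({
        note: op.summary || (method.toUpperCase() + ' ' + path),
        operation: edam.operation ? [ { uri: edam.operation } ] : [],
        input: (edam.inputs || []).map(uri => ({ data: { uri: uri } })),
        output: (edam.outputs || []).map(uri => ({ data: { uri: uri } }))
      });
    }
  }
  return {
    name: openapi.info.title,
    description: openapi.info.description || '',
    function: functions
  };
}
```

The key point is that the OpenAPI file supplies the structural part (paths, methods, summaries), while the EDAM operation/input/output terms have to come from the separate, human-curated annotation file.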

So, we still have some work to do, but I'm happy with the current progress. We're well on track to complete this project before summer and actually get a long way with the ontology annotation, which was a secondary goal in the original plan.

Feedback welcome!

Saturday, March 11, 2017

What an Open Science project does: eNanoMapper deliverables archived on ZENODO

eNanoMapper has ended. It was my first EC-funded project as PI. It was great to run a three-year Open Science project at this scale. I loved the collaboration with the other partners, and I'd like to thank Lucian and Markus for their weekly coordination of the project! Lucian also reflected on the project in this blog post. He describes the successful completion of the project, and we partly owe that to the uptake of ideas, solutions, and approaches by the NanoSafety Cluster (NSC) community. Many thanks to all NSC projects, including for example NANoREG, who were very early adopters!

Our legacy is substantial, I think. I have blogged about some aspects in the past. The project's output includes RRegrs for scanning the regression model space, extensions of AMBIT for substances, tools on top of the APIs, visualizations with JavaScript, etc. Things have been done Open Source: you can find many repositories on GitHub, and we used Jenkins to autobuild various components, not just source code but also the eNanoMapper ontology. Several software releases are archived on ZENODO, and the ontology is available from BioPortal, the Ontology Lookup Service, and AberOWL (thanks to the operators for their support in getting it properly online!).

Several publications have been published, along with many tutorials. On the website you could already access many of the deliverables of the project. And since last week, all public deliverables are archived on ZENODO (HT to Lucian):

Next time, I want to see if we can get the deliverables published in, for example, the Research Ideas and Outcomes (RIO) journal.

Finally, I'd like to thank everyone else in the Maastricht University team that worked on eNanoMapper: Cristian Munteanu, who was my first post-doc, Bart Smeets, Linda Rieswijk, Freddie Ehrhart, and, part-time, Nuno Nunes and Lars Eijssen. Without them I could not have completed our deliverables.

Sunday, March 05, 2017

Upcoming meeting: "Open science and the chemistry lab of the future"

Following the example of Henry Rzepa, here is an announcement of a meeting with a great program, organized by the Beilstein Institut in Germany. The meeting does also mean I cannot attend another really important meeting, WikiCite, which partially overlaps :(

At the Open science and the chemistry lab of the future meeting I will represent ELIXIR, which is quite a challenge, as they are doing so much and I only have so much time to cover it. Worse, I only work part-time on specific ELIXIR tasks, but fortunately I am getting great help from Rob Hooft of the Dutch Techcentre for Life Sciences (DTL, practically the Dutch ELIXIR node).

I am very much looking forward to meeting friends and seeing people I have so far only met online, like Stuart Chalk (who recently published the CCZero Open Spectral Database) and Matthew Todd of Open Source Malaria. Oh, and if you cannot attend the meeting in person, the hashtag to follow is #BeilsteinOS. If you can join, you can register for the meeting here.

Sunday, February 19, 2017

Talk: "Making open science a reality, from a researcher perspective"

Slide from the presentation, with a screenshot of the Woordenboek Organische Chemie.
Last week I was in Paris (wonderful, but, like London, a city that makes you understand Ankh-Morpork) for the AgreenSkills+ annual meeting. AgreenSkills+ is a program for postdoc funding in France, and the postdocs presented their work. Wednesday (#agreenskills) was a day to learn about Open Science, with other talks from Nancy Pontika and Ivo Grigorov from Foster Open Science, Martin Donnelly from the Edinburgh Digital Curation Centre about data management and the DMPonline tool, and Michael Witt of Purdue University about digital repositories and DataCite (which I should really make time to blog about too).

I was asked to talk about my experiences from a researcher perspective (which started with the Woordenboek Organische Chemie). Here are my slides:

Saturday, February 18, 2017

Open Science is already a thing in The Netherlands

It has been hard to miss: the Dutch National Plan Open Science (doi:10.4233/uuid:9e9fa82e-06c1-4d0d-9e20-5620259a6c65). It sets out an important step forward: it goes beyond Open Access publishing, which has become a tainted topic. After all, green Open Access does not provide enough rights. For example, teachers still cannot easily share green Open Access publications with their students.

I am happy I was able to give feedback on a draft version, and I hope it helped. During the weeks before the release I also looked at how the Open Science working group of the Open Knowledge International foundation(?) is doing, and I am happy that at least the Dutch mailing list is still active. Things are a bit in flux, as OKI is migrating to a new platform. Maybe more about that later.

But one of my main comments was that there already is a lot of Open Science going on in The Netherlands. And then I am not talking about all those scientists who already publish part of their work as (gold) Open Access, but about the many researchers who already share Open Data, Open Source, or other Open research outputs. In fact, I started a public (CCZero) spreadsheet with GitHub repositories of Dutch research groups, which now also covers many educational groups at our universities and "hogescholen". It now includes some forty(!) git repositories, mostly on GitHub but also on GitLab. Wageningen even has its own public git website!

Mind you, I had to educate myself a bit on the exact history of the term Open Science. It actually seems to go back to the USA Open Source community (see these references and particularly this article). And that's actually where I also knew it from, in particular from Dan Gezelter, founding author of the well-known Jmol viewer for small molecules and protein structures, and host of the openscience.org domain.

Tuesday, January 03, 2017

Wikidata-powered citation lists with citation.js

I don't get as much time with the kids as I would like, but if your son is doing interesting coding projects, that makes it a lot easier. One project he is working on is citation.js, a JavaScript library to edit bibliographies. It has become really powerful and totally awesome! We all hate formatting bibliographies and that every journal has its own format. LaTeX and the Citation Style Language have done wonders here, but it all should be even simpler. As an author I want to be able to just give a DOI, and that should be enough.

Or a Wikidata entity identifier.

And citation.js makes that last thing possible, and I spent some time with Lars to implement this for my homepage:

This is more or less what I had before too, but then with everything hard-coded. The citation.js way allows me to give just a list of two entity IDs (Q27062312 and Q27062639), and citation.js outputs the above. I just have this snippet in the HTML:

      <ul class="cite" id="cite1"></ul>
      <script class="code" type="text/javascript">
        var opt = { format: 'string', type: 'html', style: 'citation-apa' }
        var wikidata = new Cite()
        wikidata.set( [ "Q27062312", "Q27062639" ] )
        var htmlOutput = wikidata.get( opt )
          .replace( /&(lt|#60);/g, '<' )
          .replace( /&(gt|#62);/g, '>' )
        document.getElementById( 'cite1' ).innerHTML = htmlOutput
      </script>
The formatting is actually mostly done with a CSL template (though it needs a hack to get it to output HTML), adapted to also output the DOI hyperlink and Altmetric icon (you can find the customized CSL in the HTML source code, licensed CC-BY-SA 3.0). The citation.js library fetches the data from Wikidata and actually has to deal with the structure there, which includes a mixture of 'author' and 'author name string' fields for author information. Well done!

If you like this, make sure to check out Wikicite, OpenCitations, and Scholia, projects that enabled and triggered some of the ideas behind the above citation.js use!

"10 everyday things on the web the EU Commission wants to make illegal" #04

The fourth example is harder than the third, and I hope I translated Julia Reda's example well. The starting point is simple enough: bookmarking things where an image is used. However, I am less sure to what extent we use this in online science.

04. Pinning a photo to an online shopping list

Well, you can see how much trouble I had finding a good equivalent here. So, what is a science shopping list? The above example shows a Google+ post by Björn Brembs. Now, G+ is not really a shopping list, but then again, literature is what researchers buy. Literally. We pay millions and millions for it. Second, we do have dedicated shopping lists for these products, but they do not always support images. Of course, these shopping lists are our CiteULike, Mendeley, ResearchGate, etc. accounts.

The second limitation of this example is that we would not consider most of our literature to be of a journalistic nature. Hence the above example. Blogs are typically a mixture of science writing and a kind of journalism. It's a grey area. Now, under the new laws, Björn would have to ask my permission, and worse, G+ would need to install a monitoring system to check that Björn got a proper license, so as not to break my copyright.

So, back to the likes of ResearchGate and ScienceOpen. Under the current proposal, any system of this kind with some commercial model in mind (both are set up by SMEs) will have to install this monitoring system (after all, we also happily bookmark Nature News articles). The cost of that investment will have to come from somewhere, so this has an enormous impact on their sustainability.

Even worse, in the wording of the proposal I have seen so far, and to the extent I understand Julia's worries, there are no limitations set on this; few or no words on allowed behavior. So, what about dissemination systems in general? I think later examples (we still have six to go!) will shed more light on that.

(And make sure to read the original article by Julia Reda!)

Monday, January 02, 2017

EPA CompTox Dashboard IDs in Wikidata

After Antony Williams left the ChemSpider team, he moved on to the EPA. Since then, he has set up the EPA CompTox Dashboard (see also doi:10.1007/s00216-016-0139-z [€]). And in August he was kind enough to upload mappings between InChIKeys (doi:10.1186/s13321-015-0068-4) and their identifiers to Figshare (doi:10.6084/m9.figshare.3578313.v1) as a tab-separated values (TSV) file. Because this database is of interest to our pathway and systems biology work, I realized I wanted ID-ID mappings in our BridgeDb identifier mapping files (doi:10.1186/1471-2105-11-5). As I wrote earlier, I have adopted Wikidata (doi:10.3897/rio.1.e7573) as data source. So, entering these new identifiers in Wikidata is helpful.

Somewhere in the past few months I proposed the needed Wikidata property, P3117 ("DSSTOX substance identifier"), which was approved some time later. For entering the mappings, I opted to write a Bioclipse script (doi:10.1186/1471-2105-10-397) that uses the Wikidata SPARQL endpoint to get about 150 thousand Wikidata item identifiers (Q-codes) and their InChIKeys. It then parses the lines in the TSV file from Figshare and creates input for Wikidata for each match, based on exact InChIKey string equivalence.

The output is formatted as QuickStatements instructions, for a great tool set up by Magnus Manske. Each line looks like this (here for N6-methyl-deoxy-adenosine-5'-monophosphate, aka Q27456455):

Q27456455 P3117 "DTXSID30678817" S248 Q28061352

The P248 ("stated in") property is used to link the source information as a reference (hence: S248); it points to item Q28061352, the Figshare entry for Tony's mapping data. The result in this Wikidata item looks like:
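The matching and formatting step is conceptually just a dictionary lookup on InChIKeys. A minimal sketch in JavaScript (the actual script is a Bioclipse script; the function and variable names here are mine):

```javascript
// Sketch: match DSSTOX substance identifiers to Wikidata items via exact
// InChIKey string equality and emit QuickStatements lines. P3117 is the
// DSSTOX substance identifier property, S248 the "stated in" reference,
// and Q28061352 the Figshare entry for the mapping dataset.
function makeQuickStatements(wikidataByInchiKey, dashboardRows) {
  const statements = [];
  for (const row of dashboardRows) {          // rows from the Figshare TSV
    const qcode = wikidataByInchiKey[row.inchikey];
    if (qcode) {                              // only exact InChIKey matches
      statements.push(qcode + '\tP3117\t"' + row.dtxsid + '"\tS248\tQ28061352');
    }
  }
  return statements;
}
```

Rows without a matching InChIKey in Wikidata are simply skipped, which is exactly why the yield discussed below is well under 100%.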

I entered about 36 thousand such statements into Wikidata. Thus, the yield is about 5%, taking the CompTox Dashboard with its roughly 720 thousand identifiers as the starting point. From a Wikidata perspective, the yield is higher: there are about 150 thousand items with an InChIKey, so 24% of those could be mapped.
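The two yield percentages follow directly from the approximate counts:

```javascript
// Yields of the mapping, computed from the rounded counts in the text.
const mapped = 36000;                 // statements added to Wikidata
const dashboardIds = 720000;          // identifiers in the CompTox Dashboard
const wikidataWithInchiKey = 150000;  // Wikidata items with an InChIKey
const dashboardYield = mapped / dashboardIds;        // fraction of the Dashboard
const wikidataYield = mapped / wikidataWithInchiKey; // fraction of Wikidata items
```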

Based on constraints defined for the property, Wikidata does some automatic validation. For example, it is specified that any Wikidata item can have only one DSSTOX substance identifier, because it can only have one InChIKey too. Similarly, there cannot be two Wikidata items with the same DSSTOX identifier. Still, because of how Wikidata works, there can be isolated violations. With fewer than 25 constraint violations, the quality of the process turned out pretty high (>99.9%).

Some of the issues have been manually inspected. Causes vary. One issue was that the Wikidata item in fact had more than one InChIKey; a possible reason for that is that it did not distinguish between various forms of a compound. Two Wikidata items have been split up accordingly. Other problems are due to features of the CompTox Dashboard, and some issues have been tweeted to the Dashboard team.

This mashup of the two resources, as anticipated in our H2020 proposal (doi:10.3897/rio.1.e7573), makes it possible to easily make slices of data. For example, we can query for experimental data for compounds in the EPA CompTox Dashboard with a SPARQL query, like this one for the dipole moment:
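A sketch of what such a query against the Wikidata SPARQL endpoint can look like: the DSSTOX property P3117 is the one discussed above, while P2101 (melting point, to my knowledge) stands in for the dipole-moment property, whose ID is not given here:

```javascript
// Sketch of a Wikidata SPARQL query for experimental data on compounds that
// have a DSSTOX substance identifier (P3117). The physical property used
// here, P2101 (melting point), is a stand-in example.
const query = `
SELECT ?compound ?compoundLabel ?dtxsid ?value WHERE {
  ?compound wdt:P3117 ?dtxsid ;   # has a DSSTOX substance identifier
            wdt:P2101 ?value .    # stand-in experimental property
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}`;
// The query can be run against the public endpoint, e.g. with fetch():
const endpointUrl = 'https://query.wikidata.org/sparql?format=json&query=' +
  encodeURIComponent(query);
```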

Importantly, this query shows the source where this data comes from, one of the advantages of Wikidata.