Sunday, November 26, 2017

Winter solstice challenge: what is your Open Knowledge score?

Source: Wikimedia, CC-BY 2.0.
Hi all, welcome to this winter solstice challenge! Umm, to not put our southern hemisphere colleagues at a disadvantage, as their winter solstice has already passed, you're up for a summer solstice challenge!

So, you know ImpactStory and (if not, browse my blog); these are wonderful tools to see what people are doing with your work. I hope you already know about OpenCitations, a collaboration of publishers, CrossRef, and many others, to make all citation data available. They just passed the 50% milestone, congratulations on that amazing achievement! For the younger scientists it may be worth realizing that for the past 20 years, at least, this data was copyrighted and not to be used unless you paid. Elsevier is, BTW, the major culprit still claiming IP on this, but RT this if you are surprised.

So, the reason I introduce both ImpactStory and OpenCitations is the following. Scientific articles are data and knowledge dense documents. They would be even denser if we did not redirect the reader to other literature, which may give a more complete sketch of the context, describe a measurement protocol, describe how certain knowledge was derived, etc. Therefore, just having your article Open Access is not enough: the articles you cite should be Open Access too. That's the next phase of really making an effort to have all of humanity benefit from the fruits of science.

I know it is hard already to calculate an "Open Access" score, though ImpactStory does a great job at that! So, calculating this for your paper and the papers those papers cite is even harder. You may need to brush up your algorithm and programming skills.

Anyone is allowed to participate. Submission of your entry is done online, e.g. in your blog, in a public write up, or even an open notebook! However, you need at least one citable research object. That is, it needs a DOI. Otherwise, I cannot give you the prize (see below). The score should be based on all your products. Bonus points for those who include software and data citations. Excluding citable objects to boost your score (for example, I would have to exclude my book chapters) is seen as cheating the system.

Your article B may cite three articles (C, D, J) but
article D also cited articles (F, I). So, your
Open Knowledge score is recursive.
Source: Wikipedia, CC-BY-SA 4.0
Calculating your Open Knowledge score can be done at multiple levels. After all, your article depends on (cites) articles, and your software depends on libraries, but those cited articles and software dependencies recursively also cite articles and/or software. The complexity is non-trivial, making it a perfect solstice challenge indeed!
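To make the recursion a bit more concrete, here is a minimal sketch in Python. Mind you, this is not an official definition of the score: I simply assume the score is the fraction of Open works among an article and everything it cites, up to a chosen number of citation levels. The function name, the toy citation graph (taken from the figure: B cites C, D, and J; D cites F and I), and the open/closed flags are all made up for illustration. Works with unknown status are counted as not Open, which is also an assumption.

```python
from typing import Dict, List

# Toy citation graph from the figure: B cites C, D, J; D cites F, I.
cites: Dict[str, List[str]] = {
    "B": ["C", "D", "J"],
    "D": ["F", "I"],
}

# Hypothetical Open Access flags, invented for this example.
is_open: Dict[str, bool] = {
    "B": True, "C": True, "D": True, "F": True,
    "J": False, "I": False,
}


def open_knowledge_score(root: str,
                         cites: Dict[str, List[str]],
                         is_open: Dict[str, bool],
                         max_depth: int = 2) -> float:
    """Fraction of Open works among `root` and all works reachable
    from it within `max_depth` citation levels (assumed definition)."""
    seen = {root}          # every work counts once, even if cited twice
    frontier = [root]      # works discovered at the current level
    for _ in range(max_depth):
        next_frontier = []
        for work in frontier:
            for cited in cites.get(work, []):
                if cited not in seen:
                    seen.add(cited)
                    next_frontier.append(cited)
        frontier = next_frontier
    # Unknown status is treated as "not Open" (an assumption).
    open_count = sum(1 for work in seen if is_open.get(work, False))
    return open_count / len(seen)


if __name__ == "__main__":
    for depth in (0, 1, 2):
        print(depth, round(open_knowledge_score("B", cites, is_open, depth), 3))
```

The `seen` set keeps citation cycles from looping forever. In a real entry, the graph would have to come from a citation source such as OpenCitations and the flags from an Open Access lookup, which is where the actual work is.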

The prize I have to offer is my continued commitment to Open Science, but that you already get for free and may not be enough of a boon. So, instead, soon after the winter/summer solstice at the end of this year, I will blog about your research, boosting your #altmetrics scores. Yes, I will actually read and try to understand it!

And because there are the results and the method, neither of which exist yet, there are two categories! I just doubled your chance of winning! That's because humanity is worth it! One prize for the best tool to calculate your Open Knowledge score, and one prize for the researcher with the highest score.

Audience Prize
If someone feels a need to organize an audience prize, this is very much encouraged! (Assuming Open approaches, of course :)

Wednesday, November 22, 2017

Monitoring changes to Wikidata pages of your interest

Source: User:Cmglee, Wikipedia, CC-BY-SA 3.0
Wikidata is awesome! In just 5 years they have bootstrapped one of the most promising platforms for the future of science. Whether you like the tools more, or the CCZero, there is coolness for everyone. I'm proud to have been able to contribute my small 1x1 LEGO bricks to this masterpiece and hope to continue this for many years to come. There are many people doing awesome stuff, and many have way more time, have better skills, etc. Yes, I'm thinking here of Finn, Magnus, Andra, the whole Su team, and many, many more.

The point of this post is to highlight something that matters, something that comes up over and over again, and for which solutions simply exist, like the one implemented by Wikidata: provenance. We're talking a lot about FAIR data. Most of FAIR data is not technological, it's social. And most of the technical work going on now is basically to overcome those social barriers.

We teach our students to cite primary literature and only that. There is a clear reason for that: the primary literature has the arguments (experiments, reasoning, etc) that back a conclusion. Not just any citation is good enough: it has to be the exact right shape (think about that Lego brick). This track record of our experiments is a wonderful and essential idea. It removes the need for faith and even trust. Faith is for the religious, trust is for the lazy. Now, without being lazy, it is hard to make progress. But as I have said before (Trust has no place in science #2), every scholar should realize that "trust" is just a social equivalent of saying you are lazy. There is nothing wrong with being lazy: a side effect of it is innovation.

Ideally, we do not have to trust any data source. If we must, we just check where that source got its data from. That works for scholarly literature, and works for other sources too. Sadly, scholarly literature has a horrible track record here: we only cite stuff we find more trustworthy. For example, we prefer to cite articles from journals with high impact factors. Second, we don't cite data. Nor software. As a scholarly community, we don't care much about that (this is where lazy is evil, btw!).

Wikidata made the effort to make a rich provenance model. It has a rich system of referring to information sources. It has version control. And it keeps track of who made the changes.

Of all the awesomeness of Wikidata, Magnus is one of the people that know how to use that awesomeness. He developed many tools that make doing the right thing a lot easier. I'm a big fan of his SourceMD, QuickStatements, and two newer tools, ORCIDator and SPARQL-RC. This latter tool leverages SPARQL (and thus Wikidata RDF) and the version control system. By passing a query, it will list all changes in a given time period. I am still looking for a tool that can show me all changes for items I originally created, but this already is a great tool to monitor the quality of crowdsourcing for data in Wikidata I care about. No trust, but the ability to verify.
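For those who want to try something similar, here is a rough sketch of the kind of query one could pass. It is not how SPARQL-RC works internally, and I am not showing its URL parameters. It only builds a SPARQL query for works that list a given author (property P50) and, as a crude approximation of "changed in a time period", filters on schema:dateModified, which only tells you when an item was last edited, not what changed. The function name is made up, Q42 is just a placeholder author, and I assume the wd:, wdt:, schema:, and xsd: prefixes are predefined, as they are in the Wikidata Query Service.

```python
import re
from datetime import date


def recently_changed_works_query(author_qid: str, since: str) -> str:
    """Build a SPARQL query for works by `author_qid` (via P50, author)
    whose Wikidata item was last modified on or after `since` (YYYY-MM-DD).

    Hypothetical helper for illustration; it only builds the query text.
    """
    if not re.fullmatch(r"Q[1-9][0-9]*", author_qid):
        raise ValueError(f"not a Wikidata item identifier: {author_qid!r}")
    since_date = date.fromisoformat(since)  # raises ValueError on bad dates
    return f"""SELECT ?work ?workLabel ?modified WHERE {{
  ?work wdt:P50 wd:{author_qid} ;
        schema:dateModified ?modified .
  FILTER(?modified >= "{since_date.isoformat()}T00:00:00Z"^^xsd:dateTime)
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
ORDER BY DESC(?modified)"""


if __name__ == "__main__":
    # Q42 is a placeholder; replace it with the item of the author you care about.
    print(recently_changed_works_query("Q42", "2017-11-01"))
```

Note that P50 only finds works where the author is linked as an item; works that only have an author name string will be missed.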

Here's a screenshot of the changes to (some of) the scientific output I am an author of:

Sunday, November 12, 2017

New paper: "WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research"

Focus on metabolic pathways increases the
number of annotated metabolites, further improving
the usability in metabolomics. Image: CC-BY.
TL;DR: the WikiPathways project (many developers in the USA and Europe, contributors from around the world, and many people curating content, etc) has published a new paper (doi:10.1093/nar/gkx1064), with a slight focus on metabolism.

Full story
Almost six years ago my family and I moved back to The Netherlands for personal reasons. Workwise, I had a great time in Stockholm and Uppsala (two wonderful universities; thanks to Ola Spjuth, Bengt Fadeel, and Roland Grafström), but being an immigrant in another country is not easy, not even for a western immigrant in a western country. ("There is evil among us.")

We had decided to return to our home country, The Netherlands. By sheer coincidence, I spoke with Chris Evelo in the week directly following that weekend. I had visited his group in March that year, while attending a COST-action about NanoQSAR in Maastricht. I had never been to Maastricht University yet, and this group, with their Open Source and Open Data projects, particularly WikiPathways, would give us enough to talk about. Chris had a position on the Open PHACTS project open. I was interested, applied, and ended up in the European WikiPathways group led by Martina Kutmon (the USA node is the group of Alex Pico).

Fast forward to now. It was clear to me that biological text book knowledge was unusable for any kind of computation or machine learning. It was hidden, wrongly represented, and horribly badly annotated. In fact, it still is a total mess. WikiPathways offered machine readable text book knowledge. Just what I needed to link the chemical and biological worlds. The more accurate biological annotation we put in these pathways, or semantically link to these pathways, the more precise our knowledge becomes and the better computational approaches can find and learn patterns not obvious to the human eye (it goes both ways, of course! Just read my PhD thesis.)

Over the past 5-6 years I got more and more involved in the project. Our Open PHACTS tasks did involve WikiPathways RDF (doi:10.1371/journal.pcbi.1004989), but Andra Waagmeester (now Micelio) was the lead on that. I focused on the Identifier Mapping Service, based on BridgeDb (together with great work from Carole Goble's lab, e.g. Alasdair and Christian). I focused on metabolomics.

Indeed, there was plenty to be done in terms of metabolic pathways in WikiPathways. The database at the time had a strong focus on the gene and protein aspects of the pathways. In fact, many metabolites were not datanodes and therefore did not have identifiers. And without identifiers, we cannot map metabolomics data to these pathways. I started working on improving these pathways, and we did some projects using them for metabolomics data (e.g. a DTL Hotel Call project led by Lars Eijssen).

The point of this long introduction is that I am standing on the shoulders of giants. The top right figure shows, besides WikiPathways itself, and the people I just mentioned, more giants. This includes Wikidata, which we previously envisioned as a hub of metabolite information (see our Enabling Open Science: Wikidata for Research (Wiki4R) proposal). Wikidata allows me to solve the problem that CAS registry numbers are hard to link to chemical structures (SMILES): it has some 70 thousand CAS numbers.

SPARQL query that lists all CAS registry numbers in Wikidata, along with the matching
SMILES (canonical and isomeric), database entry, and name of the compound. Try it.
A lot more about CAS registry numbers is found in my blog.
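Since the query itself only lives in the image, here is a sketch of what such a query can look like, wrapped in a bit of Python. It is a reconstruction, not necessarily the exact query from the figure. It uses the Wikidata properties P231 (CAS Registry Number), P233 (canonical SMILES), and P2017 (isomeric SMILES) and the public query service at query.wikidata.org. The helper names and the User-Agent string are made up for this example.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# CAS numbers with optional canonical/isomeric SMILES and the English label.
CAS_QUERY = """SELECT ?compound ?compoundLabel ?cas ?smiles ?isoSmiles WHERE {
  ?compound wdt:P231 ?cas .
  OPTIONAL { ?compound wdt:P233 ?smiles . }
  OPTIONAL { ?compound wdt:P2017 ?isoSmiles . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}"""


def build_url(query: str) -> str:
    """URL that asks the query service for SPARQL JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return WDQS_ENDPOINT + "?" + params


def rows_from_json(data: dict) -> list:
    """Flatten SPARQL JSON results into one plain dict per result row."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in data["results"]["bindings"]
    ]


def run_query(query: str) -> list:
    """Send the query over the network (not called on import)."""
    request = urllib.request.Request(
        build_url(query),
        headers={"User-Agent": "cas-smiles-sketch/0.1 (example script)"},
    )
    with urllib.request.urlopen(request) as response:
        return rows_from_json(json.load(response))


if __name__ == "__main__":
    for row in run_query(CAS_QUERY + " LIMIT 5"):
        print(row.get("cas"), row.get("smiles"), row.get("compoundLabel"))
```

The OPTIONAL clauses keep compounds that have a CAS number but no SMILES yet, which is exactly the kind of gap worth finding and fixing.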
Last, but certainly not least, is Denise Slenter, who started this spring in our group. She picked up things I and others were doing very quickly (for example this great work from Maastricht Science Programme students), gave those her own twist, and is now leading the practical work in taking this to the next level. This new WikiPathways paper shows the fruits of her work.

Of course, there are plenty of other pathway databases. KEGG is still the gold standard for many. And there is the great work of Reactome, RECON, and many others (see references in the NAR article). Not to mention the important resources that integrate pathway resources. To me, unique strengths of WikiPathways include the community approach, very liberal licence (CCZero), many collaborations (do we have a slide on that?), and, importantly, its expressiveness. The latter allows our group to do the systems biology work that we do: analyzing microRNA/RNASeq data, studying diseases at a molecular interaction level, seeing the effects of personal genetics (SNPs, GWAS), and visually integrating and summarizing the combination of experimental data and text book knowledge.

OK, this post is now already long enough. And judging from the length, you can see how much I am impressed with WikiPathways and where it is going. Clearly, there is still a lot left to do. And I am just another person contributing to the project and honored that we could give this WikiPathways paper a metabolomics spin. HT to Alex, Tina, and Chris for that!

Slenter, D. N., Kutmon, M., Hanspers, K., Riutta, A., Windsor, J., Nunes, N., Mélius, J., Cirillo, E., Coort, S. L., Digles, D., Ehrhart, F., Giesbertz, P., Kalafati, M., Martens, M., Miller, R., Nishida, K., Rieswijk, L., Waagmeester, A., Eijssen, L. M. T., Evelo, C. T., Pico, A. R., Willighagen, E. L., Nov. 2017. WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research. Nucleic Acids Research.