Lernapparat

A short guide to using other people's code

July 26, 2020

The other day, I noticed that someone had copied code from me and not cared much about licensing. I found that quite outrageous at first, but it seems that while I have some kind of history with licenses and people caring about them, not many people are aware of licensing. (I still remember going through the OpenJDK files when I was a Debian Developer and FTP (archive) assistant, checking whether all license information indicated it was suitable for inclusion in Debian and whether its copyright summary was reasonably complete.) So here are a few thoughts.

Disclaimer: Sadly, we live in a lawyer's world and I'm not a lawyer and this isn't legal advice. Nonetheless I think there are a few things one can do better than what currently seems to be the norm.

Are you allowed to use the code?

People put out code on the internet with varying amounts of strings attached. Arguably, the most common way to specify what you are or are not allowed to do with it is to put in a LICENSE file.

Hint: Never say "the code has a license"; it is you who needs to have a license. Code is distributed under some license by the copyright holder.

Sometimes we see one of the major open source licenses like the Apache license, a BSD-type license, or the MIT license (for better or worse, the copyleft licenses like the GNU General Public License (GPL) don't play as much of a role in the deep learning ecosystem). The former licenses (Apache, BSD, MIT) are permissive in the sense that they allow you to do pretty much what you want with very few strings attached. The copyleft licenses come with a "share alike"-type requirement.

But there are notable exceptions, for example licenses that disallow commercial use: NVIDIA research seems to like Creative Commons NC licenses and then offers commercial licenses separately. Sometimes, apparently more often for datasets than for research software, universities do this, too.

All this assumes that the person distributing the code has done their homework, which according to my observation is not always the case (which is why I'm writing this in the first place).

Comply with the license

Now, if you found you have a license that allows your use case, you will have to make sure you actually comply with the terms. For the permissive licenses, this is typically very straightforward.

Do read the license to see what it requires. Most of them are written in a way that is easy to understand to make compliance easy. Obviously, if you don't comply with the license, you will have trouble justifying your use or redistribution of the code - in fact, some licenses, like the GPL, include termination clauses where you lose the right to distribute the code in the future if you don't comply.

One obligation you nearly universally have is to reproduce copyright notices, and often also the licensing conditions. So if the code has copyright notices, you usually just copy them. Some organizations - like the Apache foundation - put copyright notices in every file header, but even if the code you are copying does not do it, it is a good idea to put the required information next to the code to avoid any ambiguity (but also be sure to mention the fact that parts of the code have their own copyright statement in your own LICENSE file).
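As a minimal sketch of what "next to the code" can look like, assume you copy a single helper function from a project distributed under the MIT license. The project name, author, year, URL and the function itself are made up for illustration; the notice and permission text are what you would copy verbatim from the original LICENSE file.

```python
# ---------------------------------------------------------------------------
# The function `moving_average` below was copied from the (hypothetical)
# project "tinystats" by Jane Doe, https://example.org/tinystats
# It is distributed under the MIT license; the original notice follows.
#
# Copyright (c) 2019 Jane Doe
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
# ---------------------------------------------------------------------------


def moving_average(values, window):
    """Return the simple moving averages of `values` over `window` items."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

The header does double duty: the copied notice and permission text address the license requirement, and the first lines tell readers which part of the file is third-party code and where it came from.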

If the original code does not come with copyright statements but the license requires you to reproduce one, things get messier. The best option might be to check with the author of the code. If you see authorship information, another option could be to reproduce that instead. It is not entirely clear whether that is sufficient, but you might be able to argue that you made as good an effort as you could.

Sometimes, complying with a license can be tricky, in particular if multiple licenses are involved. Rarely, it even happens that licenses are widely deemed incompatible, i.e. it is not possible to combine code under them to create a new work while complying with both licenses. (I am ignoring patents here. That is another huge mess.)

So compliance with the license is one of the legal obligations you have, in particular when you redistribute the code (or compiled software containing the code). Not complying with the license may constitute copyright infringement, sometimes given the sensationalist label of software piracy.

As a contrasting example (and something that I saw often enough to prompt this blog post), people often appear to just put a link to the source of the code. This would seem to fall short of almost all license requirements, because it reproduces neither the copyright notices nor the license information. (But let's not put links here.)
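To make the difference concrete, here is a rough heuristic sketch that looks at the text of a source file and reports whether it carries something that looks like a copyright notice and a license reference, or merely a link. The function name, the regular expressions and the status labels are my own assumptions for illustration; passing such a check says nothing about actual compliance, it only helps to spot the link-only pattern.

```python
import re

# Very rough patterns; real notices come in many more shapes than these.
COPYRIGHT_RE = re.compile(r"copyright\s+(?:\(c\)|©)?\s*\d{4}", re.IGNORECASE)
LICENSE_RE = re.compile(r"licen[sc]e|permission is hereby granted", re.IGNORECASE)
URL_RE = re.compile(r"https?://\S+")


def notice_status(source_text):
    """Classify what kind of provenance information `source_text` contains."""
    has_copyright = COPYRIGHT_RE.search(source_text) is not None
    has_license = LICENSE_RE.search(source_text) is not None
    has_link = URL_RE.search(source_text) is not None
    if has_copyright and has_license:
        return "notice and license present"
    if has_copyright or has_license:
        return "incomplete"
    if has_link:
        return "link only"
    return "nothing found"


# A file that merely points at the place the code was taken from.
print(notice_status("# taken from https://example.org/repo\nx = 1\n"))
```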

Attribution in the code

The legal aspects of license compliance aside, one might wonder what to do about attribution. (Ross Whightman points out that the Apache License actually requires reproducing attribution notices. Thank you, Ross!) Just like in the scientific world you typically cite people with the (first) author's name, the title, and a journal or so, I would typically try to attribute the code with the author's name, a name of the software or title of the code if applicable, and the source where I got it from and where others could locate it (e.g. a URL). Sometimes, individual authorship is less clear and even the source collectively says it was produced by the foo contributors or so, and then using that seems OK, too.

By the time we have included copyright notices in order to comply with the license, we often - in particular when copyright holder and author coincide - already have most of the attribution and can just throw in a link and be done.
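A small sketch of how the pieces of such an attribution (author, title, source, and the copyright notice if there is one) could be assembled into a comment block. The helper name `format_attribution`, its parameters and the example values are hypothetical and only meant to illustrate the idea.

```python
def format_attribution(author, title, source_url, copyright_notice=None):
    """Build a comment block crediting third-party code.

    All arguments are plain strings. `copyright_notice` is the notice copied
    verbatim from the original code, if it has one.
    """
    lines = [
        f"# Adapted from: {title}",
        f"# Author: {author}",
        f"# Source: {source_url}",
    ]
    if copyright_notice:
        lines.append(f"# {copyright_notice}")
    return "\n".join(lines)


# Collective authorship works the same way as an individual author.
print(format_attribution(
    author="the foo contributors",
    title="foo",
    source_url="https://example.org/foo",
    copyright_notice="Copyright (c) 2020 the foo contributors",
))
```

Writing the block by hand works just as well; the point is only that author, title and source all appear, not just the link.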

While the attribution requirements in scientific journals (aka the definition of plagiarism, but the entire goal of this little text is to make it a positive thing) usually make no mention of code, using code, in particular from a third party, in your project is very similar to a citation. So in addition to courtesy, handling attribution of code in code similarly to attribution of scientific papers in scientific papers would seem to be good practice in terms of cleanliness.

People have sometimes expressed confusion (or, less charitably, tried to hide behind confusion) about the fact that plagiarism might not have been OK even before text explicitly forbidding it became standard in the certifications accompanying the submission of scientific works, so we might as well aim for best practices and cleanliness here before someone else finds that we should have been doing this all along.

Attribution in scientific papers

Perhaps the trickiest part is if and how to cite code when you write a scientific paper.

It's easy when you use the authors' official implementation of a paper. Then it would seem that you're done when you cite their paper.

Sometimes, authors explicitly ask or even state they require citations when code is used in scientific papers. I cannot say whether they have a right to decree that, but it is their express wish and the obvious alternative you have is to not use their code.

At other times it may be more a question of the importance the third-party code has for your experimentation and the importance of your experiments for your overall paper. For many papers there is a strong emphasis on experimentation (which may reflect the state of theory in deep learning and related fields) because the theory is virtually non-existent or only provides a vague intuition. (As a mathematician, I might say that formulas and saying you minimize the $y$ distance between points in $z$ space isn't theory to me but just a concise description of what is going on.) I have seen papers where more than a third of the experiment's code is from a single third-party author and it's not even the only third-party code. This might then have reached the point where your experiment crucially relies on third-party code. Again, the - at least hypothetical - alternative is not to use the code.

So here it is: when using third-party code, do take a moment to see if it is crucial to your experiments. You will typically cite lots of vaguely "related" papers, which may or may not set the bar. I don't think there is much obligation here, but it may be an opportunity to apply judgement and courtesy. If you want a commons to draw on, it likely doesn't hurt if, within reason, you credit contributions that are significant to your work.

Now, there is a limit to things. If you use a very particular model implemented in PyTorch, then going from the model implementation to PyTorch itself, to PyTorch's dependencies, to the operating system, to the hardware, there will be a level below which things count as "general infrastructure", even if they are indispensable.

For example, R. Peharz et al. Einsum Networks: Fast and Scalable Learning of Tractable Probabilistic Circuits sports an acknowledgement for Tim Rocktäschel's classic Einsum tutorial and my implementation of PyTorch's einsum. You would never cite that as a code bit, and even the acknowledgement is unusually kind, as it is a case where it would be very reasonable to consider PyTorch and its einsum function as ubiquitous infrastructure. (I do appreciate it! Most of the time, the feedback I get is only bug reports where people - rightfully - point out when things don't live up to their expectations. But it definitely goes beyond what can be expected.) At the same time, this might be a good option when you just want to give a nod to something you found useful but not critical.

Closing thoughts

So here we went from things that are arguably legal obligations to things that would appear to be best practice to things that are strictly complimentary. I am actually surprised that universities apparently don't teach their aspiring researchers how to conform to a license when using software, or at least not with much success.

In the end, I believe that this is something where everyone can only gain if we improve.

Again, keep in mind I'm not a lawyer and this isn't legal advice.

If you have comments or additions, do send me a mail.