Poisoning GitHub Copilot and Machine Learning

Update, 08 July 2021: GitHub has confirmed that all public code that was on their servers was used to train Copilot. That means some of my code was used, though only older versions of it.

I suggest deleting all code off of GitHub if you can.

GitHub has been on my radar before, and it wasn’t for a good reason.

Well, it’s on my radar again, and it’s once again not for a good reason: Copilot.

GitHub seems to think that hosting my code on their servers gives them the right to take my code and make it part of a machine learning model. I don’t like that, especially since Copilot can effectively change the license on code, something that someone on Hacker News called “code laundering.”

I want no part of that, so I made sure to delete all of my stuff off of GitHub.

There is one exception: my bc. I have enough users, and enough important users, that I can consider my bc critical infrastructure. Most of those users use my GitHub link, so I will keep it there. It’s under a BSD license anyway, and I have enough contributors with enough contributions that I don’t think I can change it.

I also keep forks there of projects that I contribute to that use GitHub as their development platform. Not much I can do about that.

Now, before this happened, I was already working on some new FOSS licenses:

The Yzena Open License, an Apache 2.0-like permissive license.
The Yzena Copyleft License, a non-viral copyleft license.
The Yzena Network License, a non-viral network copyleft license with a better definition for network use (in my opinion) than the AGPL.

I am not a lawyer!

Also, I have not run these licenses past a lawyer yet, so don’t use them. If you still decide to use them, you do so at your own risk.

I will, however, run these by a lawyer sometime soonish.

After thinking about the Copilot problem hard for several hours, I came up with what I think is a solution, and it consists of doing these three things:

I define “source code” in each of those licenses.
I define “this software” (the work under the license) recursively; the base case is the source code, and from there, it is any software produced by an algorithm using “this software” as input.
Then for the two copyleft licenses, I ensure that only the source code needs to be shared.

This conveniently coincides with several things:

GitHub claims that training a machine learning model is fair use, and it may be. However, these licenses do not claim otherwise; they just claim that the output of the model can be copyrighted.
This definition does not add a new restriction because we already assume that the output of certain algorithms that we call compilers is still software and still copyrightable.
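
The recursive definition of “this software” described above can be sketched as a small program. This is only an illustration of the idea, not the license text: the `Artifact` class, its `is_software` flag, and the `is_this_software` function are hypothetical names invented here. Whether something counts as software (for example, a trained model) is a legal judgment that the sketch simply takes as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """Hypothetical model of something an algorithm consumes or produces."""
    name: str
    is_software: bool = True  # assumption: someone has already judged this
    inputs: tuple = ()        # the Artifacts the producing algorithm was fed


def is_this_software(artifact: Artifact, source: Artifact) -> bool:
    """Return True if `artifact` falls under the recursive definition."""
    # Base case: the licensed source code itself.
    if artifact == source:
        return True
    # Output that is not itself software (a hash, a lint report) is excluded.
    if not artifact.is_software:
        return False
    # Recursive case: some input to the producing algorithm is "this software".
    return any(is_this_software(i, source) for i in artifact.inputs)


# Made-up example artifacts.
source = Artifact("example.c")
binary = Artifact("example (executable)", inputs=(source,))
model = Artifact("trained model", inputs=(source,))  # assumed to count as software
suggestion = Artifact("generated snippet", inputs=(model,))
digest = Artifact("sha256 of example.c", is_software=False, inputs=(source,))
unrelated = Artifact("other.c")

for a in (binary, suggestion, digest, unrelated):
    print(a.name, is_this_software(a, source))
```

In this toy model the compiled binary and the generated snippet are covered, while the hash digest and the unrelated file are not.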

There are bound to be more questions, all of which are answered in the FAQs for the licenses (Yzena Open License, Yzena Copyleft License, and Yzena Network License), but for the reader’s ease, I will reproduce some of the Copilot-related ones here:

What’s with the weird definition for “this software”?
GitHub Copilot. I want the licenses to poison the well for machine learning like that.
The reason is that GitHub is arguing that using FOSS code in Copilot is fair use because using data for training a machine learning algorithm has been labelled as fair use.
However, even though the training is supposedly fair use, that doesn’t mean that the distribution of the output of such algorithms is fair use.
The definition of “this software” is crafted to exploit this discrepancy.
But maybe the output of the algorithm is under fair use as well.
If it is, then copyright disappears entirely from software. The reason for this is that we already use algorithms to transform our software. We call those algorithms “compilers,” and their output “executables” or “libraries.” No one claims that a source code’s copyright does not apply to the binary forms output by a compiler.
What if the output of a machine learning algorithm is transformative? Would that not be enough to defeat copyright?
A compiler’s output is also transformative, especially if it does optimizations. This applies even more strongly if the compiler is doing link-time optimization using inlining with code from different sources. In that situation, a compiler is combining multiple sources in non-obvious ways, just as machine learning models do.
A compiler can even transform an O(n) algorithm into an O(1) algorithm!
In other words, unless GitHub Copilot wants to throw out copyright on software completely, this license will apply to the output of its model.
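
The O(n) to O(1) remark can be made concrete with a summation loop. The sketch below writes both forms by hand in Python; it does not show a compiler at work, and the function names are made up for illustration. Some optimizing compilers are reported to perform this kind of rewrite on simple loops, so the binary may no longer contain the loop the author wrote.

```python
def sum_loop(n: int) -> int:
    """O(n): add up 0, 1, ..., n - 1 one term at a time."""
    total = 0
    for i in range(n):
        total += i
    return total


def sum_closed_form(n: int) -> int:
    """O(1): the same value from the arithmetic-series formula."""
    return n * (n - 1) // 2 if n > 0 else 0


# Both forms agree on every input tried here, yet they look nothing alike.
print(all(sum_loop(n) == sum_closed_form(n) for n in range(1000)))
```

The two functions compute the same result by very different means, which is the sense in which the transformed form is still the same software.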
Why didn’t you just add a clause protecting the output of machine learning?
Compatibility with the GPL. The GPL does not allow extra restrictions to be added, but it technically already has the same restrictions as the YOL.
The reason is that the GPL allows distribution of binary code, and what is binary code but the output of an algorithm (the compiler) whose input is the source code of the software covered by the GPL?
If I decide to abandon GPL compatibility, however, I still think I will keep the definition because it is wonderfully broad in a way that is best for end users.
Your extra restrictions make your license incompatible with the GPL.
I don’t think so.
First, they are not extra restrictions; they codify something that I believe already exists in the GPL and friends.
Second, even if they are extra restrictions (which would make this license incompatible with the GPL), I think I am okay with that.
In fact, if my licenses are not compatible with the GPL and friends, then I will keep the terms I have and accept the incompatibility. I can do this because these licenses will mostly be used for code in a new language, which means that I wouldn’t be able to use existing code easily anyway.
Your extra restrictions make your license non-Open Source.
Once again, I don’t think so, and for why, see above. The new parts are not new restrictions; they are clarifications.
But even if the license is non-Open Source, I’m not sure I care.
The reason is this: I believe FOSS licenses have failed. We have so many companies that have used the freedoms we have tried to give users in order to extract value (data or something else) unethically from those very users. They do this by claiming the rights of distribution that the FOSS licenses give them and then using those rights to distribute Open Source software to users in such a way that they don’t realize that they are being taken advantage of.
The more I’ve understood that, the more I have come to realize that the current iteration of FOSS licenses does not work.
But what will work?
Remember how I said that companies claim the rights of distribution we give them? As it turns out, end users, so-called because they are at the end of a chain of distribution, don’t usually use distribution rights.
That means that the next generation of FOSS licenses can probably restrict more heavily how licensed software is distributed while placing no restrictions on the other two of the four freedoms.
I’m not ready to go that far yet. If I were, I’d add a clause forbidding ads in the software. But that is probably the sort of direction we need to go.
Your license is viral, like the GPL, since it makes the license apply to the full output of algorithms.
This is a good question, because I believe the virality of the GPL is parasitic and has caused people to use closed-source software instead of FOSS alternatives when those alternatives were licensed under the GPL.
The answer is that I carefully defined “source code” to ensure that was not the case.
You see, while the license does apply to the entire output, the requirement is still to provide only the source code of the original. There is no requirement to provide the source code of the entire software, as there is with the GPL.
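
One way to picture the difference: the license reaches the whole output, but the sharing obligation walks back only to the originals under this license. The sketch below is a toy model with hypothetical names (`Node`, `licensed`, `sources_to_share`); it is one reading of the answer above, not the license text, and it ignores everything else a real license requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical node in a build graph."""
    name: str
    licensed: bool = False  # True only for original source under this license
    inputs: tuple = ()      # empty for original source files


def sources_to_share(output: Node) -> set:
    """Collect names of the licensed originals that `output` was built from."""
    if not output.inputs:  # an original source file
        return {output.name} if output.licensed else set()
    found = set()
    for node in output.inputs:
        found |= sources_to_share(node)
    return found


lib = Node("lib.c", licensed=True)  # code under this license
app = Node("app.c")                 # the distributor's own code
product = Node("product", inputs=(lib, app))

print(sorted(sources_to_share(product)))  # ['lib.c']
```

In this model the distributor's own `app.c` never appears in the result, even though `product` as a whole was built from licensed code.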
Why did you say that “this software” only includes output of algorithms that is itself software?
Technically, you can run software through things like hash functions, which are algorithms, and the output is entirely unusable as software. Saying that the license would apply to output like that would never work.
The same goes for static analysis tools. They don’t usually output something that can be considered “software”; instead, they output a list of problems with the software. I wouldn’t want that to be affected by this license.
In other words, that little part of the definition is required to keep the definition from becoming so broad that it is useless.
Your definition of “this software” has “this software” in the definition.
That’s because the definition is recursive. The base case is the source code of the software, and then whenever the source code, or the result of transforming “this software,” is transformed by an algorithm, the definition recursively applies.
Why did you define “source code”?
Because of the recursive definition of “this software”. See the previous question.

Of course, those FAQ items don’t answer one crucial question: what if the output of algorithms cannot be copyrighted?

Well, no one knows yet whether that is the case. That uncertainty means that, at the very least, these licenses will poison the well for machine learning because they will sow doubt about the legality of using the source code as input to machine learning. That doubt may be all I need to prevent my code from being fed to Copilot.

This is especially important since GitHub seems to claim that it can take FOSS code from anywhere, even if not hosted on its servers. By using these licenses, even without a guarantee that they work, I am ensuring that companies have no guarantee that they won’t work, and they will be hesitant to use Copilot if it was fed my code.

Thus, I have relicensed all of the code under Yzena (bc is a personal project), which means that if GitHub uses any of my Yzena code after this point, I will come after them.

In conclusion, I want to answer one charge that may come up, the charge that I am attempting to make it so no companies can use my software for commercial purposes.

That is false. I am perfectly fine with companies using my code for good commercial purposes, and by good, I don’t mean “not for evil,” but rather, to do good for their customers, to provide actual value instead of extracting value.

Thus, all I ask is that they share changes back. That’s all Linus Torvalds wants for Linux, and that’s all I want for my code. I happen to disagree with him that GPLv2 does the job, but we both want the same thing.

But feeding licensed and copyrighted code through a black box and claiming that what comes out the other end is license-free? That is not what I want; it’s extracting value.
