GitHub sued over Copilot for alleged "software piracy"

GitHub Copilot “ignores, violates, and removes the Licenses offered by thousands -- possibly millions -- of software developers, thereby accomplishing software piracy on an unprecedented scale”.

That’s according to a class action lawsuit, likely to be closely watched, filed on November 3 by programmer Matthew Butterick and the Joseph Saveri Law Firm in federal court in San Francisco.

Critically, however, their case does not rest primarily on alleged copyright breaches.

The two are suing GitHub, its parent, Microsoft, and its AI-technology partner, OpenAI.

GitHub Copilot is effectively an “auto-complete” for coders. It was trained on public GitHub repositories and spits out lines of code in response to simple prompts. Whilst many coders welcome being spared the need to write essentially boilerplate code ad nauseam, it has met with some disquiet since its June 2021 launch.

https://twitter.com/mitsuhiko/status/1410886329924194309

GitHub sued: Copilot class action lawsuit is extensive

Open source advocates have been concerned since the product’s launch that it appears to auto-complete using verbatim chunks of code that were written under copyleft licenses, which may prohibit this kind of reuse.

Recent critics of Copilot include Tim Davis, professor of computer science at Texas A&M University.

https://twitter.com/DocSparse/status/1581461734665367554?

GitHub's then-CEO Nat Friedman noted in 2021: "In general: (1) training ML systems on public data is fair use (2) the output belongs to the operator, just like with a compiler. We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!"

Other prominent concerns have been that the product (charged at $10 per month) is "laundering bias through opaque systems" and perpetuating the use of bloated/lazily written code as coders get accustomed to not having to meaningfully think about or critically review the code Copilot is producing for them.

As the complaint alleges: “The Defendants stripped Plaintiffs’ and the Class’s attribution, copyright notice, and license terms from their code in violation of the Licenses and Plaintiffs’ and the Class’s rights. Defendants used Copilot to distribute the now-anonymized code to Copilot users as if it were created by Copilot.”

It adds: “Copilot often simply reproduces code that can be traced back to open-source repositories or open-source licensees. Contrary to and in violation of the Licenses, code reproduced by Copilot never includes attributions to the underlying authors.” The plaintiffs seek a trial by jury and, the complaint shows, “seek to recover injunctive relief and damages as a result and consequence of Defendants’ unlawful conduct”.

GitHub and Microsoft have been contacted for comment.

Copilot class action raises "numerous questions of law"

The complaint alleges that "numerous questions of law or fact common to the entire Class arise from Defendants’ conduct", including, but not limited to, the following:

DMCA Violations: Whether Defendants’ conduct violated the Class’s rights under the DMCA when GitHub and OpenAI caused Codex and Copilot to ingest and distribute Licensed Materials without including any associated Attribution, Copyright Notice, or License Terms.
Contract-Related Conduct: Whether Defendants violated the Licenses governing use of the Licensed Materials by using them to train Copilot and for republishing those materials without appending the required Attribution, Copyright Notice, or License Terms. Whether Defendants interfered in contractual relations between the Class and the public regarding the Licensed Materials by concealing the License Terms. Whether GitHub committed Fraud when it promised not to sell or distribute Licensed Materials outside GitHub in the GitHub Terms of Service and Privacy Statement.

Unlawful-Competition Conduct: Whether Defendants passed-off the Licensed Materials as its own creation and/or Copilot’s creation. Whether Defendants were unjustly enriched by the unlawful conduct alleged herein. Whether Defendants’ Copilot-related conduct constitutes Unfair Competition under California law.
Privacy Violations: Whether GitHub violated the Class’s rights under the California Consumer Privacy Act (“CCPA”), the GitHub Privacy Statement, and/or the California Constitution by, inter alia, sharing the Class’s sensitive personal information (or, in the alternative, by not addressing an ongoing data breach of which it is aware); creating a product that contains personal data GitHub cannot delete, alter, nor share with the applicable Class member; and selling the Class’s personal data. Whether GitHub committed Negligence when it failed to stop a still-ongoing data breach it was and continues to be aware of.

GitHub Copilot’s launch triggered an immediate firestorm around potential copyright and open source licensing breaches. Most commentators at the time suggested it would be tough to win a case on these grounds, whatever moral qualms many may have. The Class Action does not, interestingly, centre on this.

One commentator, former MEP Felix Reda, noted at the time that, in the EU at least, it was likely legal. He wrote: “Since the EU Copyright Directive of 2019, text & data mining is permitted. Even where commercial uses are concerned, rights holders who do not want their copyright-protected works to be scraped for data mining must opt-out in machine-readable form such as robots.txt. Under European copyright law, scraping GPL-licensed code, or any other copyrighted work, is legal, regardless of the licence used. In the US, scraping falls under fair use, this has been clear at least since the Google Books case.”

GitHub Copilot open source complaint: It’s going to be a tricky case

When GitHub Copilot launched and triggered immediate concerns, The Stack sought comment from a number of lawyers specialising in open source to assess the legal risks for Microsoft.

One (whom we won’t name as we were unable to contact them today and their views may have evolved) wrote at the time: “[With regard to inputs] copying and reading code in an AI engine is specifically permitted by every single open source license – it is freedom 0: the freedom to use code for any purpose. So, for the training phase, unless there are further details I am unaware of, I would definitely think that there is no infringement of publicly available open source code. I have also seen articles by colleagues (Neil Brown, Andres Guadamuz) who indicate that GH’s terms of business may also allow GH to scan any code, not just open source code on GH.”

They added: “[When it comes to outputs] the AI generated code (output) may not be covered by copyright at all. There have been many arguments about machine-generated works (or photographs taken by monkeys) and so far the general consensus if not court decisions are that for copyright protection, the work must be the intellectual creation of a human person. The AI engine itself is not a person, and while for example the UK regime provides that for machine-generated works, the copyright lies with the person that configured the machine, in the case the AI is self configuring, so much harder to argue that the engineer running the AI has any input into the resulting output/work. That’s regarding protection of the generated code.”