
Ok, my curiosity has been fired here...

I have conjured up two scenarios here:

Let's say I use Copilot to generate a bunch of code for an app, something substantial, and it regurgitates a load of bits and pieces from many sources it got from GitHub. I'd assume there won't be any attribution in it... it will be as if Copilot made the code itself (I know it sort of does, but let's not split hairs!). I'm guessing the prevailing theory (from GitHub, anyway) is that I'm legitimately allowed to do this.

Now, let's say I generated all that code by manually copying and pasting chunks of code from a whole bunch of repos, whether they are open source, unlicensed, whatever. Would I not be ripe for legal issues? I could potentially find all the code that copilot generated and just copy and paste it from each of the sources and not mention that in my license. What if I told everyone "yeah, I just copied and pasted this from loads of Github repos and didn't put any attribution in my code". I'd assume that (morality aside) I'd be asking for trouble!

Am I missing something? Am I misunderstanding the situation, or the capabilities of copilot?



There's a decent bit of caselaw indicating that computers reading and using a copyrighted work simply "don't count" in terms of copyright infringement -- only humans can infringe copyright. This article[0] does a pretty good job of summarizing the rationale that the courts have provided. My (non-lawyer) take is that GitHub is pushing this just half a step farther -- if computers can consume copyrighted material, and use it to answer questions like "was this essay plagiarized", then in GitHub's view they can also use it to train an AI model (even if it occasionally spits back out snippets of the copyrighted training data). Microsoft has enough lawyers on staff that I'm sure they have analyzed this in depth and believe they at least have a defensible position.

[0]: https://slate.com/technology/2016/08/in-copyright-law-comput...


Makes me wonder what would happen if a similar thing was done with books. If I train an AI on all the texts of Tom Clancy, or Stephen King, or every Star Wars novel, and the books it generates every so often produce paragraphs verbatim from one of those sources, would copyright owners be up in arms? What would the distinction be between the code case and the text case?


I am not a lawyer. I do photography and have a more than passing interest in copyright as it applies to the photographs I take and the material I photograph.

Copyright on art gets more interesting / fuzzier. The key part is substantial similarity - https://en.wikipedia.org/wiki/Substantial_similarity and https://www.photoattorney.com/copyright-infringement-for-sub...

Rather than text, consider my AI copyright hypothetical: a model trained on sunset photographs. You take a regular photograph, pass it through the model, and it transforms it into a sunset. The model was trained on copyrighted works, but the model is considered fair use.

Now, I go and take a photograph from some location during the day and then pass it through the transformer and get a sunset. Yay me! Unbeknownst to me, that location is a favorite location for photographers, and there were sunsets from that location used in the training data. My photograph, transformed to look like a sunset, is now similar to one of them in the training data.

Is my transformed photograph a derivative work of the one in the training data to which it bears similarity? How would a judge feel about it? How does the photographer whose photograph was used in the training data feel?


What would be interesting in that case would be how the transformed image would look if photos from that location were removed from the training set. That would help reveal whether it was just copying what it had seen or it actually remembered what sunsets looked like and transformed the image using its memory of sunsets in general.


This will surely happen within the next few years; but if the "new work" contains a full paragraph from an existing novel the copyright hammer would come down hard.

Maybe it needs to be paired with another network / hunk of code that checks for verbatim copying?
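A crude version of such a check is easy to sketch (a hypothetical helper, nothing like a production-grade filter): slide token n-grams over the generated output and see which ones appear verbatim in the training corpus.

```python
def verbatim_overlap(generated: str, corpus: str, n: int = 6) -> set:
    """Return every run of n consecutive tokens in `generated` that also
    appears verbatim in `corpus` -- a crude verbatim-copying signal."""
    def ngrams(text: str) -> set:
        toks = text.split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return ngrams(generated) & ngrams(corpus)
```

Real recitation detection would need normalization (whitespace, identifiers) and an indexed corpus, but the principle is the same: exact-match the output against what the model was trained on.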


> There's a decent bit of caselaw indicating that computers reading and using a copyrighted work simply "don't count" in terms of copyright infringement -- only humans can infringe copyright.

I have read variations of "computers don't commit copyright" more times than I can count in the past few days.

How is Copilot different from a compiler? (Please give me the legal answer, not the technical answer. I know the difference between Copilot and a compiler, technically.)

Isn't a compiler a computer program? How is its output covered by copyright?

Am I fundamentally misunderstanding something here?


What if I made a few tweaks to Copilot so that it is very likely to reproduce large chunks of verbatim code that I would like to use without attribution, such as the Linux kernel. Do you really think you can write a computer program that magically "launders" IP?

A compiler is run on original sources. I don't see any analogy here at all.


* They both process source code as input.

* They both produce software as output.

* They both transform their input.

* They both can combine different works to create a derivative work of each work. (Compilers do this with optimizations, especially inlining with link-time optimization.)

They really do the same things, and yet, we say that the output of compilers is still under the license that the source code had. Why not Copilot?


> Why not Copilot?

Because the sources used for input do not belong to the person operating the tool.

If you say that doesn't matter, then you are saying open source licenses don't matter because the same thing applies - I could just run a tool (compiler) on someone else's code, and ignore the terms of their license when I redistribute the binary.


No, I think that’s the point.

If I take some code I don’t have a license for, feed it to a compiler (perhaps with some -O4 option that uses deep learning because buzzwords), then is the resulting binary covered under fair use, and therefore free of all license restrictions?

If not, then how is what Copilot is doing any different?


> If I take some code I don’t have a license for, feed it to a compiler (perhaps with some -O4 option that uses deep learning because buzzwords), then is the resulting binary covered under fair use

No, the binary is not free of license restrictions. Read any open source license - there are terms under which you can redistribute a binary made from the code. For GPL you have to make all your sources available under the same terms, for example. For MIT you have to include attribution. For Apache you have to attribute and agree not to assert patent claims over the Apache-licensed work you use. This has been upheld in many court cases - though it is not always easy to find litigants who can fund the cases, the licenses are sound.


I think you have what I am saying backwards. I am saying that the licenses should apply to the output of Copilot, like they apply to the output of compilers.


Oh sorry, my mistake! Thank you.


That only makes it worse.


You just blew my mind with that analogy. I can only imagine some hair-splitting logic to rationalize a distinction.


The analogy goes even further if you consider compiler optimizations: https://gavinhoward.com/2021/07/poisoning-github-copilot-and... .


"Computers don't commit copyright" is a complete misreading or misunderstanding of another proposition, that "computers cannot author a work".

Authoring is the act that causes a work to be copyrightable. In most jurisdictions, authoring a work automatically causes copyright to subsist in the work to some degree. The purpose of the copyright system is to encourage people to author new, original works, by rewarding those who do with exclusive rights. It is well-known that only humans can author a work. Computers simply cannot do it. If your computer (by some kind of integer overflow UB miracle) accidentally prints out a beautiful artwork, NOBODY has exclusive copyright over it, and anyone may reproduce it without limitation. Same goes for that monkey who took a selfie.

What a compiler does, on the other hand, is adapt a work. Adapting a work is not authoring it. Sometimes when you adapt a work, you also author some original work yourself, like when you translate a book into another language. When a compiler (not a linker) transforms source code, it absolutely, 100% definitely does NOT add any original work; the executable or .so/.a/.dylib/.dll file is simply an adaptation of the original work. The copyright-holder of the source code is the copyright-holder of the machine code. An adaptation is also known as a "derivative work".

(Side note; copyleft licenses boil down to some variation of "if you adapt this, you have to share everything in the derivative work, not just the bits you copied.")

Adaptation is a form of reproduction. It's copying. "Distribution" also often involves copying, at least on the internet. (Selling or giving away a book you have purchased does not constitute copying.) Copying is one of the exclusive rights you have when you own the copyright in a work, that you may then license out.

It gets more complicated when the computer uses fancy ML methods to produce images/text out of things it has seen/read. You can't simplify the law around that to a simple adage digestible enough to share memetically on HN and Twitter. One thing is certain: if the computer did it, by itself, then no original work was authored in the process. That poses a problem for people who write the name of a function and get CoPilot to write the rest; if you do that, you are not the author of that part of the program. If you use it more interactively that's a different story.

There is, however, always a question of whether the copyright in the original works the computer used still subsists in the output.

My rough framing of the licensing issues around CoPilot is therefore as follows:

1. The source code to CoPilot is an original work, and the copyright is owned by GitHub.

2. When GH trained CoPilot's models on other people's works, was that copying? (This one is partially answered. It can spit out verbatim fragments, so it must be copying to some extent, rather than e.g. actually learning how to code from first principles by reading.) If it was not all copying, how much of it was copying and how much of it was something else? What else was it?

3. If GH adapted the originals, what is the derivative work? (i.e. where does the copyright subsist now? Is it a blob of random fragments of code with some weights in a neural network?)

4. Which works is it an adaptation of? You might think "all of them, and for each one, all of the code" but I'm not so sure. For example, imagine the ML blob contains many fragments, but some are shorter than others. If your program has "int x;" in it, and CoPilot can name a variable "x", you can hardly claim that as your own. I'm most interested in whether the mere fact of CoPilot having digested ALL of it, having fed this into the mix and producing a ML blob based on all that information, means that the ML blob is a derivative work of all of them. Or whether there is some question of degree.

5. Fair use. Was it fair use to train the model? Is it, separately or not, fair use to create a commercial product from the model and sell it? Fair use cares about commercial use, nature of the copied work, amount of copying in relation to the whole, and the effect on the market for / value of the copied work. Massive question.

6. If not fair use, then GH is subject to the licenses and how they regulate use of the works. What license conditions must GH comply with when they deal with the derivative work, and how? Many will be tempted to jump straight to this question and say GH must release the source code to CoPilot. I'm not yet convinced that e.g. GPL would require this. I can't believe I'm writing this, but is the ML blob statically or dynamically linked? Lol.

7. Final question, is there some way to separate out works which were copied with fair use (or not copied at all) from works which were copied with no fair use? People are worried about code laundering, e.g. typing the preamble to a kernel function and reproducing it in full. In that situation, it is fairly obvious that the end user has ultimately copied code from the kernel and needs to abide by GPL 2.0; moreover if they're using CoPilot to write out large swathes of text they will naturally be alert to this possibility and wary of using its output. But think of the converse: if there is no way to get CoPilot to reproduce something you wrote, what's the substance of your complaint? Is CoPilot's model really a derivative of your work, any more than me, having read your code, being better at coding now? Strategically, if you wanted to get GH to distribute the model in full, you might only need one copyleft-licensed, verbatim-reproducible work's owner to complain. But then they would just remove the complainant's code. You might be looking at forcing them to have a "do not use in CoPilot" button or something.


I think this is more cogent analysis than anything else I've seen yet on this topic. You should consider submitting a blog post so this can become a top-level topic.

Also, I loved this quote:

> Copying is one of the exclusive rights you have when you own the copyright in a work, that you may then license out.

I've been paying attention to software copyright topics for more than twenty years and never thought of it in exactly these terms. It's right there in the name - the right to copy it, and to determine the terms under which others can copy it, is exactly what a copyright is!


I don't doubt that an army of lawyers has pored over this, but they have size on their side: the cost of litigation vs. potential revenue will be a massive factor.

Edit: > There's a decent bit of caselaw indicating that computers reading and using a copyrighted work simply "don't count" in terms of copyright infringement.

That means their computer can read any code it wants, do whatever it wants with the code, then they can monetise that by giving YOU the code. Would they then be indemnified by saying "no Microsoft human read or used this code"?

However, if you then use the code and look at it, does that make you liable?


Again, not a lawyer, just a guy who likes reading this stuff. The devil is usually in the details of copyright cases. The Turnitin case hinged substantially on whether Turnitin's use of copyrighted essays was "fair use". There are four factors[0] which determine fair use; the two more relevant factors here are "the purpose and character of your use" and "the effect of the use upon the potential market". The court found that Turnitin's use was highly "transformative" (meaning they didn't just e.g. republish essays; they transformed the copyrighted material into a black-box plagiarism detection service) and also found that Turnitin's use had minimal effect on the market (this is where "computers don't count" comes in -- computers reading copyrighted material don't affect the market much because a computer wasn't ever going to buy an essay).

I would be shocked if GitHub's lawyers didn't argue that using copyrighted material as training data for an AI model is highly transformative. There may be snippets available from the original but they are completely divorced from their original context and virtually unrecognizable unless they happen to be famous like the Quake inverse square root algorithm. And I think GitHub's lawyers would also argue that Copilot's use does not affect the _original_ market -- e.g. it does not hurt Quake's sales if their algorithm is anonymously used in a probably totally unrelated codebase.

Your counterexample would probably fail both tests -- it's not transformative use if your software hands out complete pieces of copyrighted software, and it would definitely affect the market if Copilot gave me the entire source code of Quake for my own game.

[0]: https://fairuse.stanford.edu/overview/fair-use/four-factors


I thought I understood fair use but turns out I was wrong...

That being said, creating a transformative work from something else is considered fair use. So, for example, if I read a whole bunch of books and then, heavily influenced by them, create my own, similar book, that would be fair use I suppose... that makes sense.

But, where does the derivative works come in? Where do you draw the line?

If I am heavily influenced by billions of lines of other people's GPL code (ala Copilot!), then I create my own tool from it and keep my code hidden, does that not mean I am abusing the GPL license?


That's what I meant by the devil being in the details -- these gray area questions hinge on the specific facts. Lawyers on both sides will argue which factors apply based on past caselaw and available evidence, and the court renders a decision. For example, from the Stanford webpage I previously linked: "the creation of a Harry Potter encyclopedia was determined to be “slightly transformative” (because it made the Harry Potter terms and lexicons available in one volume), but this transformative quality was not enough to justify a fair use defense in light of the extensive verbatim use of text from the Harry Potter books". So you might be okay creating a Harry Potter encyclopedia in general, but not if your definitions are copy/pasted from the books, but you might still be okay quoting key lines from the books if the quotes are a small portion of your encyclopedia. The caselaw just doesn't lend itself to firm lines in the sand.


If you read a bunch of books and then create a similar book, that isn't transformative; transformative is like, you read a bunch of books and then create a machine translation service. The point of transformative is like "isn't going to conflict with the market or compete in any way with the original thing".


That’s funny, because the bedrock of copyright - insofar as software is concerned - is entirely predicated on the idea that a computer copying code into RAM to execute it is indeed a copyright violation outside of a license to do so.


I think you're right. Especially given that Copilot can reproduce significant blocks of code: https://twitter.com/mitsuhiko/status/1410886329924194309

Famous code: https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overv...
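For anyone who hasn't seen it, the famous bit works by reinterpreting the float's raw bits as an integer, subtracting from a magic constant, and polishing with one Newton-Raphson step. Here's a Python transliteration (the widely circulated original is C, from Quake III Arena):

```python
import struct

def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) via the famous bit-level hack
    (a Python transliteration of the well-known C original)."""
    i = struct.unpack("<I", struct.pack("<f", x))[0]  # reinterpret float bits as uint32
    i = 0x5F3759DF - (i >> 1)                         # the "magic" initial guess
    y = struct.unpack("<f", struct.pack("<I", i))[0]  # back to float
    return y * (1.5 - 0.5 * x * y * y)                # one Newton-Raphson refinement
```

After the single Newton step the result is within roughly 0.2% of the true value, which is why the trick was good enough for real-time lighting math.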


I see this held up as an example a lot, but the fast inverse square root algorithm didn't originate from Quake and is in hundreds of repositories - many with permissive licenses like WTFPL and many including the same comments.

GitHub claims they didn't find any "recitations" that appeared fewer than 10 times in the training data. That doesn't mean it's a completely solved issue (some code may be repeated in many repositories but always under the GPL, and there are limitations to how they detect recitations), but from rare cases of generating already-common solutions, people seem to be concluding that all it does is copy and paste.
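For what it's worth, the frequency cutoff being described is simple to sketch (hypothetical code, not GitHub's actual recitation detector): treat a generated snippet as plausibly attributable to a specific source only if it appears verbatim in fewer than some threshold of training files.

```python
def attributable_snippets(snippets, training_files, threshold=10):
    """Hypothetical filter in the spirit of the frequency argument:
    a snippet found verbatim in fewer than `threshold` training files
    is plausibly attributable to a specific source; one found everywhere
    is more likely boilerplate or a common idiom."""
    return [
        s for s in snippets
        if sum(s in text for text in training_files) < threshold
    ]
```

The obvious caveat, as noted above, is that a snippet can be common in the corpus and still carry a single license (e.g. GPL code vendored into many repos), so frequency alone doesn't settle attribution.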


That may be true, although even GitHub doesn't know for sure. But the problem remains: they're reproducing other people's code without regard to license status.


Copilot is a commercial paid service that generates money for Microsoft


Yeah, that bit I realise but the point I was getting at is this: if I take someone else's code, use chunks of it in my app, say that it's mine and make money from it is that not illegal? Or, at least in violation of the license?

Superficially at least, Copilot (from my understanding) is "copying" code, letting me use it in my app, and making money from it.

I'm just trying to wrap my head around it.

Let's be clear, I am not a lawyer, but it seems... strange!


Also NAL, but I think there's far more of a case that users of Copilot might violate copyright rather than Copilot itself:

- Only a very small proportion of Copilot generated code is reproduced verbatim, so if you specifically built a product just from copied-verbatim code, your act of selecting and combining those pieces of copyrighted code would be creating a derivative work.

- GitHub is not selling the copyrighted code, they are selling the tool itself. Google is literally the same thing: you could theoretically create a product by googling for prefixes of copyrighted code and then copying the remainder straight out of the search results. It's you who would be violating copyright, not google.


I think there is an argument to be made that Copilot is producing derivative code, though. It may produce copies verbatim, and that's a violation, but far more often, it produces a mixture of things it was trained on, most of which probably have some sort of license requiring attribution at the very least.


Both the copy machine and the VCR were found to be legal because they had substantial non-infringing uses. As is, I don't see how Copilot does. It could, if trained on public domain or attribution-free code only; unfortunately there probably isn't enough code out there to train the model adequately under such rules.


Does copilot seem strange, or maybe the concept of intellectual property does?


Copilot isn't strange from a technical perspective.

The strange bit is how they are allowed to use other people's code to create derivative works (this is how I see it from my non-legal perspective, anyway).

Even if it's legal (to the letter of the law, not the spirit) it leaves a sour taste.


Suppose Copilot was Composer, and it generated personalized songs for you after being trained on Spotify's library. If you started performing the resulting song and it contained recognizable clips of other songs, I guarantee you'd have lawyers coming after you.

I don't see this as fundamentally different. It's unlikely that the Free Software Foundation is going to track you down for including some GNU code in your single-user repo. If you used their stuff in a popular commercial project and they got wind of it, you might expect to receive a cease and desist at best.


Copying/pasting code from open source projects is considered fair use. Come on, who doesn't do that?

I mean, sure, you don't copy an entire file, but you tend to copy a snippet, or in the end you look at how it's done and do it the exact same way (which is the same as copying it!)

I would say there is not a problem in there.


If you are copy and pasting code from open source projects into your own project, then I think that is more likely to be considered copyright infringement than fair use. Fair use is generally for things like criticism, parody, teaching etc. Obviously this kind of thing would need to be judged on a case-by-case basis, but I think you are on shaky ground here.


Copilot is just a tool, legally it cannot "make code", you're the one making it.

See also : Napster, including how it was condemned for facilitating copyright infringement (what Microsoft is risking here, though the offense is likely to be much milder, of course).


"I'm guessing the prevailing theory (from GiitHub anyway) is that I'm legitimately allowed to do this."

No. Copilot is a technical preview. In the final release, if it reproduces code verbatim, it'll tell you and present the correct license.


Doesn't matter that it's a technical preview; people are using it now, GitHub has already used it internally. So if it infringes now, there is already code out there being used that does infringe.


GitHub appears to be tracking every snippet that they're generating during their trials:

https://docs.github.com/en/github/copilot/research-recitatio...

Are you doing that? If not, then I wouldn't use GitHub's use as justification to engage in copyright infringement.


Oh, I am not using Copilot. But other people not part of GitHub are. And those are still violations.


How will it find the “correct” license?

Will it check the LICENSE file? Simply having a LICENSE file is not a declaration that all the code in that repo is under that LICENSE.

What if specific lines/files are specified to be under different licenses?

What if the publisher of the repo is publishing it under an incorrect license in bad faith?

Will github be responsible if it tells me the wrong license?


Copilot isn't a retrieval model. It's a generative model. It learns coding techniques rather than retrieving snippets. Only 0.1% of the code it generates is regurgitated, and even that is usually pretty common code.





