Hacker News

How is this evidence of that fact? Honest question.

I can see how this might show that subtitles from online sub communities are used, or that maybe even original subtitles from e.g. DVDs are used. But isn't it already known and admitted (and allowed?) that AI uses all sorts of copyrighted material to train models?



> I can see how this might show that subtitles from online sub communities are used, or that maybe even original subtitles from e.g. DVDs are used.

Indeed, the captioning is copyrighted work and you are not legally allowed to copy and redistribute it.

> But isn't it already known and admitted (and allowed?)

No, and I don't see where you got that from. Meta [1], OpenAI [2] and everybody else is being sued as we speak.

1: https://petapixel.com/2025/01/10/lawsuit-alleges-mark-zucker...

2: https://www.reuters.com/legal/litigation/openai-hit-with-new...


> Indeed, the captioning is copyrighted work and you are not legally allowed to copy and redistribute it.

Unless you qualify for one of the many exceptions, such as fair use.


It’s not clear that training is fair use. That’s being contested in court I think.


Training isn’t recreating or distributing, so copyright won’t apply if the ruling is actually consistent with the intention of the law, which it may not be.

Using copyrighted materials and then meaningfully transforming them isn’t infringement. LLMs only recreate original work in the same way I did when I wrote the first sentence of this paragraph, because it probably exists word for word somewhere else too.


That's your interpretation, not the law.


> I don't see where you got that from

It’s been determined by the judge in the Meta case that training on the material is fair use. The suit in that case is ongoing to determine the extent of the copyright damages from downloading the material. I would not be surprised if there is an appeal to the fair use ruling but that hasn’t happened yet, as far as I know. Just saying that there is good reason for them to think it’s been allowed because it kind of has; that can be reversed but it happened.


That was specifically involving 13 authors.

There haven't been any trials yet about the millions of copyrighted books, movies and other content they evidently used.


There's no reason to think those cases will go any differently. As far as I know, the ruling would have to be appealed at this point. I am only commenting to say that there is reason to think this is true:

> But isn't it already known and admitted (and allowed?)

You seemed to be confused about why this person believed that:

> No, and I don't see where you got that from.

And I wrote a comment intended to dispel your confusion. The above commenter thought that it was allowed because a judge said it was allowed; that can be appealed but that's the reason someone thinks it's allowed.


> There's no reason to think those cases will go any differently. As far as I know, the ruling would have to be appealed at this point.

Trial court rulings aren't binding precedent even on the same court in different cases, so it's quite possible that different cases at the trial level will reach different conclusions on fair use on fairly similar facts, given the lack of appellate precedent directly on point with AI training.


Yea, no. I don't think I am confused.

A single verdict about a specific case (13 authors vs META) does not mean it's legal for companies to steal IP from other companies which has evidently been going on for some years now.

Those other companies have lawyers powerful enough to change jurisdiction in many countries in order to "protect their IP".


The Chinese subtitles for silence use a common mark for pirated media in that language, according to other commenters here. In general, it's pretty likely that if you're finding non-professional subtitles, they were distributed with pirated media in some form; that's where you get the most fansubs, after all.


> were distributed with pirated media in some form,

I disagree with this conclusion. I've used e.g. the opensubtitles dataset for some data analysis in the past. It's a huge dataset, freely available and precisely intended for such use. Now, whether all the data in the opensubtitles dataset is legal is another point.

So one might argue that using this opensubtitles dataset makes one complicit in the illegal activities of opensubtitles themselves. IDK: IANAL.


> How is this evidence of that fact?

The contention is that the specific translated text appears to come largely from illegal translations (i.e., fansubs) and not from authorized translations. And from a legal perspective, that would basically mean there's no way they could have legally obtained that material.

> But isn't it already known and admitted (and allowed?) that AI uses all sorts of copyrighted material to train models?

Technically, everything is copyrighted. But your question is really about permission. Some of the known corpuses for AI training include known pirate materials (e.g., libgen), but it's not known whether the AI companies are filtering out those materials from training. There's a large clutch of cases ongoing right now about whether AI training is fair use, and the ones that have resolved at this point have done so on technical grounds rather than answering the question at stake.



