Hacker News

This is correct. "Open source" means everything required to recreate the work from scratch and improve it, not "here's a massive binary, an interpreter script, and permission."


How can you even "open source" an AI model without all of the, presumably copyrighted and extremely voluminous, training data?


That could probably be solved with BitTorrent. I think the bigger obstacle is the hardware required for training. Maybe groups of people could reproduce/train open source models with a distributed, BOINC-like system?


You would open source the procedure and reference where the data came from. If there is any non-open source content used in training, then the project couldn’t qualify as “open source”.

But this thread is about misuse of the term as applied to the weights package. Those of us who know what open source means should not continue to dilute it by applying it to these LLMs.


You don't need the data itself, but you do need at least a reference to what was used: basically, the entire blueprint needed to recreate it.

It's just like true open source software: you still need to bring your own hardware to run it on.


You can't. But that's not an excuse to misuse the label.


That's how you know when you actually have AGI: when you have something that doesn't need every written word known to man shoveled in to make it work, but can instead be seeded with a few dense public-domain knowledge compendia and derive everything else for itself from those first principles, possibly going through several stages of from-scratch training and regeneration.


The reason why you need to shovel every written word known to man to make it work is because it needs to learn what words mean before it can do anything useful with them, and we don't currently know any better way of making a tabula rasa (like a blank NN) do that. Our own brains are hardwired for language acquisition by evolution, so we can few-shot it when learning and get there much faster; and if we understood how it works, we could start with something similarly hardwired and do exactly what you said.

But we don't actually know all that much about how language really works, for all the resources we spend on linguistics - as the old IBM joke about AI goes, "quality of the product increases every time we fire a linguist" (which is to say, we consistently get better results by throwing "every written word known to man" at a blank model than we do by trying to construct things from our understanding).

All that said, just because we're taking a different, and quite possibly slower / less compute-efficient route, doesn't mean that we can't get to AGI in this way.


> Our own brains are hardwired for language acquisition by evolution, so we can few-shot it when learning and get there much faster

No, we can't few-shot it, and we don't get there faster (but we develop a lot of other capabilities along the way). We train on a lot more data; the human brain, unlike an LLM, is training on all that data in the same processes it uses for "inference", and it receives sensory data estimated on the order of a billion bits per second. That means by the time we start using language we've trained on a lot of data (the 15 trillion tokens from a ~17-bit token vocabulary that Llama 3 was trained on amount to something like a few days of human sense data). Humans are just trained on, and process, vastly richer multimodal data instead of text streams.
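A rough back-of-the-envelope check of that comparison (the ~1e9 bits/s sensory-bandwidth figure and the ~128K-entry Llama 3 vocabulary are assumptions here, not claims from the thread):

```python
import math

# Rough figures, assumed for illustration:
# Llama 3 was reportedly trained on ~15 trillion tokens with a ~128K-entry vocabulary.
tokens = 15e12
bits_per_token = math.log2(128_000)   # ~17 bits per token
sensory_bits_per_second = 1e9         # estimated human sensory bandwidth

total_bits = tokens * bits_per_token
seconds = total_bits / sensory_bits_per_second
days = seconds / 86_400               # 86,400 seconds per day

print(f"{bits_per_token:.1f} bits/token")   # ~17.0
print(f"{days:.1f} days of sense data")     # ~2.9
```

Under those assumptions the entire Llama 3 training corpus works out to roughly three days of human sensory input, consistent with the "few days" estimate above.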


I was talking about language acquisition specifically. Most of the data that you reference is visual input and other body sensations that aren't directly related to that. OTOH humans don't take all that much text to learn to read and write.


> I was talking about language acquisition specifically.

Yeah, humans don't acquire language separately from other experience.

> Most of the data that you reference is visual input and other body sensations that aren't directly related to that.

Visual input and other body sensations are not unrelated to language acquisition.

> OTOH humans don't take all that much text to learn to read and write.

That generally occurs well after they have acquired both language and the ability to recognize and use symbolic visual communication, and they usually have considerable other input when learning to read and write besides the text they are presented with (e.g., someone else reading words out loud to them).


Feeling my inner Klingon: "Where is the honor in releasing a binary blob and calling it... open source? Pfah!"


Linux doesn't ship a compiler or CPU when you download it. It's not open source I guess.



