Google's Pathways Language Model and Chain-of-Thought (vaclavkosar.com)
77 points by vackosar on April 18, 2022 | 30 comments


Correction!! The model cost around $10M, not $10B! Thanks for raising that. Mistake during copying from the second slide :(


The article quotes the cost as roughly $10B in the first paragraph. Likely a typo? It quotes $10M in a later paragraph.


Ya, I was like there's no way Google spent $10B on this.


Yes, of course, thanks!


The amount of capital needed to train these high-quality models is eye watering (not to mention the costs needed to acquire the data). Does anyone know of any well capitalized startups exploring this space?


> The amount of capital needed to train these high-quality models is eye watering

It's relative. It would cost more to open a 40-room hotel (about $320k/room), and hotels can't be copied like software.
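
For scale, a quick back-of-the-envelope (the ~$10M training figure comes from the correction above; the hotel numbers are the rough ones quoted here):

  rooms = 40
  cost_per_room = 320_000             # rough per-room build cost quoted above
  hotel_cost = rooms * cost_per_room  # 12,800,000
  training_cost = 10_000_000          # ~$10M, per the correction in this thread
  print(hotel_cost > training_cost)   # True -- the hotel costs more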


It's not like that many people are opening 40 room hotels either. Such amounts are atypical within programming and CS communities.

A more relevant example is video games: imagine if the only viable ones were top-end AAA games whose completed versions could only be accessed via cloud gaming.


It's not the technology's fault that these companies don't publish their models.


I would not say that. Facebook, Microsoft and Google release plenty of useful models. EleutherAI have released 6 billion and 20 billion parameter language models. Huggingface has been training a 176B model [1].

The issue isn't a lack of models or data; it's that larger models are impossible to train without paying hundreds of thousands to millions of dollars. The hardware requirements for simply running the models already price them out of reach for most.

These models are rather powerful, but the immediate future is one of accessing them through cloud services. The GeForce GTX 1080 Ti launched 5 years ago, and since then memory has roughly doubled in consumer GPUs. To run the highest-end models on a single GPU, hardware will need 20x to 70x more memory, along with serious gains in flops/joule.

I suppose improvements in CPU parallelism and RAM speeds will also go a long way towards making such models runnable on reasonable consumer hardware, albeit at slower speeds.
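
As a rough sketch of where a 20x to 70x memory gap comes from (the parameter count, 16-bit weights, and 24 GB consumer card below are my own assumptions, not the article's numbers):

  # Weights-only memory for a large language model stored in 16-bit precision
  params = 540e9                       # a PaLM-sized model (assumption)
  bytes_per_param = 2                  # fp16 / bf16
  weights_gb = params * bytes_per_param / 1e9       # ~1080 GB just for weights
  consumer_gpu_gb = 24                 # a high-end consumer card (assumption)
  print(weights_gb / consumer_gpu_gb)  # ~45x, before activations or batching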

[1] https://huggingface.co/bigscience/tr11-176B-ml-logs


Saying people lack the equipment to run them for inference isn't a good reason to not publish them. The astronomical training cost is a good reason to publish them.


The data here is effectively free. I don't think they would exhaust The Pile, which you can download for free. This is also true for text2image models like DALL-E 2: while OA may have invested in its own datasets, everyone else can just download LAION-400M (or if they are really ambitious, LAION-5B https://laion.ai/laion-5b-a-new-era-of-open-large-scale-mult... ).


OpenAI would be the best example. However, these large language models also have limited business value today, making a startup a speculative bet that the team will beat Google/FB/AI labs/academics at making a language model and find a viable business model for the resulting model.

I'd take one of those bets or the other; both are tough to pull off. Considering that the first task of such a startup would be to hand ~$100-500MM to a hardware or cloud vendor, I'd be hesitant as an investor.


It costs less than $10M to train. Why hand so much to a hardware or cloud vendor? Soon enough there will be open-source GPT-3s; at least two are in training as we speak (BigScience and EleutherAI).

> these large language models also have limited business value today

The Instruct version of GPT-3 has become very easy to steer with just a task description. It can do so many tasks so well it's crazy. Try some interactions with the beta API.

I believe GPT-3 is already above average human level at cognitive tasks that fit in a 4000 token window. In 2-3 years I think all developers will have to adapt to the new status quo.
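
For anyone curious, this is roughly what "steering with just a task description" looks like against the beta API; a minimal sketch assuming the current OpenAI Python client, with the engine name and prompt as placeholders:

  import openai

  openai.api_key = "sk-..."  # your API key

  resp = openai.Completion.create(
      engine="text-davinci-002",  # an Instruct-series engine (assumption)
      prompt="Summarize the following in one sentence:\n\n<your text here>",
      max_tokens=64,
  )
  print(resp.choices[0].text)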


> I believe GPT-3 is already above average human level at cognitive tasks that fit in a 4000 token window.

How can you possibly make a claim like this without like 80 links justifying it? The claim is fuzzy and absurd, my least favorite combo


Gut feeling based on playing with it. Here's an example:

> Colorless green ideas sleep furiously, and other grammatical nonsense by Noam Chomsky

He was a man without a country, A linguist without a language, A mind without a thought, A dream without a dreamer. He was lost in a world of words, A world where ideas slept furiously, And grammar was a never-ending nightmare.

But he persevered, For he knew that language was the key to understanding the world. And so he continued to study, To learn all that he could, In the hopes that one day, He would find his way home.


> Gut feeling based on playing with it

You should check out the post we're commenting on, it has graphs for this exact metric.

Spoiler: Google's model, with 3x the parameters, does pass the average human in a couple of categories, but not in all. I don't think GPT-3 does in any.

It's doubly puzzling to me because you have access and are asserting it feels like an average human to you. It's awesome and it does magical stuff; I use it daily both for code and prose. It also majorly screws up sometimes. It's only at an average human level if we play word games with things like "well, the average human wouldn't know the Dart implementation of the 1D gaussian function. Therefore it's better than the average human."


> Gut feeling based on playing with it.

Ok, on my first reading your phrasing made it sound like some article or material had convinced you of this opinion; now I understand.

This is kind of my point about 80 links though - you're using a definition of "cognitive tasks" that more closely resembles knowledge, and then you're letting your personal feelings about profundity guide your conclusions on said cognition.

I don't deny that the machine can output pretty words and has a breadth of knowledge to put us each to shame on some simple queries, but "cognition in a 4000 token window" is an incredibly large place and I don't even understand how you would be able to claim a machine has above-human-average cognition based solely on your own interactions... That's a pretty crazy leap.

PS: I saw the downvotes. I was downvoted for questioning the validity of information that was actually just pure conjecture; be better with your votes.


I agree 100%, but I think viable businesses will begin to emerge especially as these large models move from text to images (and eventually to video and 3d models). If the examples shown of DALL-E 2 are indicative of its quality, then a large number of creative jobs could be replaced with a single "creative director" using the model. But the high entry cost just to attempt to train such a model will likely remain a hurdle until more business value is proven.


aye - I suspect the other concern is that the high entry costs can quickly lead to a "second mover" advantage. The first team spends all the money doing the hard R&D and the second team implements a slightly better version for a fraction of the money.


If nothing else, they'd enjoy slacking gain[1] by starting their computation with more advanced hardware.

[1] https://arxiv.org/pdf/astro-ph/9912202.pdf
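
The gist of the "slacking" argument, as a toy model (the 18-month doubling period and the runtimes are my own assumptions):

  # If compute per dollar doubles every ~18 months, waiting before starting
  # a long job can make it finish sooner overall.
  def finish_time(runtime_today_months, wait_months, doubling_months=18):
      speedup = 2 ** (wait_months / doubling_months)
      return wait_months + runtime_today_months / speedup

  print(finish_time(60, 0))   # 60.0 -- start now on today's hardware
  print(finish_time(60, 18))  # 48.0 -- slack for 18 months, still finish earlier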


I'd just solve some existing problem with the most basic language model you can get your hands on and then move up from there. Sell it first.


That's literally nothing compared to the benefits it could provide if applied in the real world.


Correction! Cost is around $10M not $10B.


I've talked about structural deficiencies in earlier language models; this one seems to be doing something about them.


Sounds interesting! Would you link to that or describe them here? Thanks!


A very simple one is "can you write a program that might never terminate?"

If a neural network does a fixed amount of computation, and that is that, it is never going to be able to do things that require a program that may not terminate.

There are numerous results in theoretical computer science that apply just as well to neural networks as to other algorithms, even though people seem to forget it.
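
A concrete instance of the point (the Collatz iteration is my choice of example): a fixed-depth network does a bounded amount of computation per input, while this loop has no known bound at all.

  def collatz_steps(n):
      # Count steps to reach 1; nobody has proven this terminates for every n.
      steps = 0
      while n != 1:
          n = 3 * n + 1 if n % 2 else n // 2
          steps += 1
      return steps

  print(collatz_steps(27))  # 111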

Another is "can an error discovered in late stage processing be fed back to an early stage and be repaired?" That's important if you are parsing a sentence like

   Squad helps dog bite victim.

It was funny because I saw Geoff Hinton give a talk in 2005, before he got super-famous. He was talking about the idea that led to deep networks, and he had a criticism of "blackboard" systems and other architectures that produce layered representations (say, the radar of an anti-aircraft system that starts with raw signals, turns those into a set of 'blips', coalesces the 'blips' into tracks, interprets the tracks as aircraft, etc.).

Hinton said that you should build the whole system in an integrated manner and train the whole thing end-to-end. I thought "what a neat idea", but also "there is no way this would work for the systems I'm building, because it doesn't have an answer for correcting itself."


I'm by no means an expert, but a lot of the choices machine learning algorithms make are more about training parallelization than anything. In many ways it feels like something like a recurrent neural network, or some architecture even more weird, should be better for language, but in practice it's harder to train an architecture that demands each new output depend on the one before. Introducing dependencies on prior output typically kills parallelization. Obviously this is less of a problem for, say, a brain that has years of training time, but more of a problem if you want to train one up in much less time using compute that can't do sequential things very quickly.
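
A toy sketch of that trade-off (the shapes and the tanh nonlinearity are my own simplification, not any particular paper's architecture):

  import numpy as np

  def recurrent(xs, W):
      # Each step needs the previous hidden state: inherently sequential.
      h = np.zeros(W.shape[0])
      out = []
      for x in xs:              # cannot be parallelized across time steps
          h = np.tanh(W @ h + x)
          out.append(h)
      return np.stack(out)

  def feedforward(xs, W):
      # Transformer-style layer: all positions processed in one matmul.
      return np.tanh(xs @ W.T)

  xs = np.random.randn(128, 64)  # 128 positions, 64-dim embeddings
  W = np.random.randn(64, 64)
  print(recurrent(xs, W).shape, feedforward(xs, W).shape)  # (128, 64) (128, 64)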


You're assuming here that there are discrete stages that do different things. I think a better way to conceptualise these deepnets is that they're doing exactly what you want - each layer is "correcting" the mistakes of the previous layer.
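
One way to picture "each layer correcting the previous one" is a residual-style stack; a minimal sketch of my own, not the article's architecture:

  def refine(x, layers):
      # Each layer proposes a small correction that is added to the running
      # representation, rather than recomputing it from scratch.
      for layer in layers:
          x = x + layer(x)
      return x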


Most "deep" networks are organized into layers and information flows in a particular direction although it doesn't have to be that way. Hinton wasn't saying we shouldn't have layers but that we should train the layers together rather than as black boxes that work in isolation.

Also, when people talk about solving problems they talk about layers; layers play a big role in the conceptual models people have for how they do tasks, even if they don't really do them that way.

For instance in that ambiguous sentence somebody might say it hinges on whether or not you think "bite" is a verb or a noun.
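
The two readings, written out as bracketings (a hand-made illustration, not the output of any parser):

  # "bite" as part of a noun compound: the squad helps a victim of a dog bite
  noun_reading = ["Squad", "helps", [["dog", "bite"], "victim"]]
  # "bite" as a verb: the squad helps the dog to bite a victim
  verb_reading = [["Squad", "helps", "dog"], ["bite", "victim"]]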

(Every concept in linguistics is suspect, if only because linguistics has proven to have little value for developing systems that understand language. For instance, I'd say a "word" doesn't really exist, because there are subword objects that behave like a word, such as "non-", and phrases that behave like a word (e.g. "dog bite" fills the same slot as "bite").)

Another ambiguous example is this notorious picture

https://www.livescience.com/63645-optical-illusion-young-old...

which most people experience as "flipping" between two states. Since you only see one at a time, there is some kind of inhibition between the two states. Who knows how people really see things, but if I'm going to talk about features I'm going to say that one part is the nose of one of the ladies or the chin of the other lady.

Deep networks as we know them have nothing like that.


$10M for a bag of numbers (i.e., the learned weights of the model matrices).




