
Do the transformer architecture and attention mechanisms actually provide any benefit other than scalability?

I thought the main insights were embeddings, positional encoding, and shortcuts through layers to improve backpropagation. A rough sketch of how those pieces fit together is below.
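A minimal sketch (my own illustration, not from the thread) of the pieces listed above: token embeddings plus sinusoidal positional encodings, scaled dot-product self-attention, and residual "shortcut" connections around each sub-layer. Shapes and initialization are arbitrary; layer norm and multi-head splitting are omitted for brevity.

    import numpy as np

    def positional_encoding(seq_len, d_model):
        # Sinusoidal positions as in "Attention Is All You Need".
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model)[None, :]
        angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles[:, 0::2])
        pe[:, 1::2] = np.cos(angles[:, 1::2])
        return pe

    def self_attention(x, Wq, Wk, Wv):
        # Scaled dot-product attention over the whole sequence at once.
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(k.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

    def block(x, Wq, Wk, Wv, W1, W2):
        # Residual shortcuts around attention and the feed-forward layer.
        x = x + self_attention(x, Wq, Wk, Wv)
        x = x + np.maximum(x @ W1, 0) @ W2
        return x

    seq_len, d_model, d_ff = 8, 16, 32
    rng = np.random.default_rng(0)
    x = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
    params = [rng.normal(size=s) * 0.1 for s in
              [(d_model, d_model)] * 3 + [(d_model, d_ff), (d_ff, d_model)]]
    out = block(x, *params)
    print(out.shape)  # (8, 16)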



When it comes to ML, there is no such distinction, though. Bigger models == more capable models, and for bigger models you need the algorithm to scale. It's like asking whether going to 2 nm fabs has any benefit other than putting more transistors on a chip. That's the entire point.
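A rough illustration of why the architecture scales (my own example, assuming the usual contrast with recurrent layers): self-attention covers every position with a few large matrix multiplies that parallelize well, whereas a recurrent layer has to step through positions one at a time.

    import numpy as np

    rng = np.random.default_rng(0)
    seq_len, d = 512, 64
    x = rng.normal(size=(seq_len, d))
    Wq, Wk, Wv, Wr = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))

    # Attention: one parallel pass over the whole sequence (easy to batch on accelerators).
    scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    attn_out = weights @ (x @ Wv)

    # RNN-style: an inherently sequential loop, each step depends on the previous one.
    h = np.zeros(d)
    rnn_out = []
    for t in range(seq_len):
        h = np.tanh(x[t] + h @ Wr)
        rnn_out.append(h)
    rnn_out = np.stack(rnn_out)

    print(attn_out.shape, rnn_out.shape)  # (512, 64) (512, 64)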



