When it comes to ML there is no such distinction, though. Bigger models == more capable models, and for bigger models you need the algorithm to scale. It's like asking whether going to 2nm fabs has any benefit other than putting more transistors on a chip. That's the entire point.
I thought the main insights were embeddings, positional encoding, and shortcuts through layers (residual connections) to improve backpropagation.
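For what it's worth, here's a minimal sketch of those three pieces together: an embedding lookup, the sinusoidal positional encoding from "Attention Is All You Need", and a residual shortcut around a layer. The dimensions and `some_layer` stand-in are toy assumptions, just to show the shapes and data flow:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2)
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy setup: vocab of 100 tokens, model width 16, a sequence of 5 token ids.
vocab_size, d_model = 100, 16
embedding_table = np.random.randn(vocab_size, d_model) * 0.02
token_ids = np.array([3, 14, 15, 92, 65])

x = embedding_table[token_ids]                                   # embedding lookup
x = x + sinusoidal_positional_encoding(len(token_ids), d_model)  # inject position info

def some_layer(h):
    # Stand-in for an attention/FFN sublayer: any differentiable transform.
    return np.tanh(h @ (np.random.randn(d_model, d_model) * 0.1))

# Residual "shortcut": gradients also flow through the identity path,
# which is what eases backpropagation through deep stacks.
x = x + some_layer(x)
```

None of these scale on their own, of course; the point upthread is that the attention mechanism parallelizes across the sequence, which is what lets you train the bigger models in the first place.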