Transformers were made for machine translation - someone had the insight that when going from one language to another the context mattered such that the tokens that came before would bias which ones came after. It just so happened that transformers we more performant on other tasks, and at the time you could demonstrate the improvement on a small scale.