Models like this are experimentally pretrained or tuned hundreds of times over many months to optimize the data mix, hyperparams, architecture, etc. When they say "ran parallel trainings" they are probably referring to parity tests performed along the way (possibly also for the final training runs). Different hardware means different lower-level libraries, which can introduce unanticipated numerical differences. It's good to know what those differences are so they can be ironed out. A rough sketch of what such a check might look like is below.
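Purely illustrative sketch of a parity test (not from any report, and the function/variable names are made up): dump logits for the same batch from both hardware stacks and check they agree within a tolerance.

```python
import numpy as np

def parity_check(logits_a: np.ndarray, logits_b: np.ndarray,
                 rtol: float = 1e-3, atol: float = 1e-5) -> bool:
    """Return True if two runs of the same forward pass agree within tolerance."""
    max_abs_diff = np.max(np.abs(logits_a - logits_b))
    print(f"max abs diff: {max_abs_diff:.3e}")
    return np.allclose(logits_a, logits_b, rtol=rtol, atol=atol)

# Synthetic stand-in for logits dumped from two different backends
# (e.g. GPU vs. in-house accelerator) -- just to show the shape of the check.
rng = np.random.default_rng(0)
ref = rng.standard_normal((4, 32000)).astype(np.float32)
other = ref + rng.standard_normal(ref.shape).astype(np.float32) * 1e-6  # tiny kernel-level drift
assert parity_check(ref, other)
```

In practice you'd also track divergence of loss curves and gradient norms over many steps, since tiny per-step differences can compound, but the basic idea is the same tolerance comparison.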
Part of it could also be that they'd prefer to move all operations to the in-house trn chips, but don't have full confidence in the hardware yet.
Def ambiguous though. In general, reporting of infra details for LLM training is left pretty vague in most reports I've seen.