chessgecko on April 18, 2024 | on: Meta Llama 3

I think it's almost certainly using at least two experts per token. It helps a lot during training to have two experts to contrast when putting losses on the expert router.
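For anyone who wants to see what "two experts per token" plus a router loss looks like mechanically, here is a rough pure-Python sketch. It is a generic top-2 gating setup with a Switch-style load-balancing term, not a claim about what any particular model does; the function names (`top2_route`, `load_balance_loss`) and the exact loss form are assumptions chosen for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top2_route(router_logits):
    """Pick the two highest-probability experts for one token.

    Returns ([(expert_index, gate_weight), ...], full_probs), where the two
    gate weights are renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2], probs

def load_balance_loss(batch_logits):
    """Hypothetical auxiliary loss (assumed form, not from the thread):
    num_experts * sum_i (share of assignments to expert i)
                        * (mean router probability of expert i).
    It equals 1.0 when the router probabilities are uniform and grows as
    traffic and probability mass pile onto the same experts.
    """
    n_experts = len(batch_logits[0])
    counts = [0.0] * n_experts
    mean_probs = [0.0] * n_experts
    for logits in batch_logits:
        chosen, probs = top2_route(logits)
        for i, _ in chosen:
            counts[i] += 1.0
        for i, p in enumerate(probs):
            mean_probs[i] += p
    n_tokens = len(batch_logits)
    share = [c / (2.0 * n_tokens) for c in counts]  # two assignments per token
    mean_probs = [p / n_tokens for p in mean_probs]
    return n_experts * sum(s * p for s, p in zip(share, mean_probs))

chosen, _ = top2_route([2.0, 1.0, 0.0, -1.0])
print(chosen)  # two (index, weight) pairs; the weights sum to 1
```

One way to read the "contrast" point: if the gate weights are renormalized over the selected experts, then with a single expert the weight is always 1 and carries no signal back to the router, whereas with two experts the relative weight between them does.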