Edit: I increased the validation to 10,000 Life grids run for 100 steps each (which took 16 minutes to check), which is hopefully somewhat more convincing. That's 1,000,000 Life steps computed without error in total, plus 32,000 steps computed without error during training.
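For anyone who wants to run a similar check, here is a minimal sketch of that kind of validation loop. It is not the actual validation script: `predict_step` is a hypothetical stand-in for the trained model's one-step prediction, and the wrap-around edges and uniformly random starting grids are assumptions.

```python
import random

def life_step(grid):
    """Reference Game of Life step on a square grid of 0/1 values.
    Edges wrap around (an assumption; the model's boundary rule may differ)."""
    n = len(grid)
    new = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            neighbours = sum(
                grid[(r + dr) % n][(c + dc) % n]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            alive = neighbours == 3 or (grid[r][c] == 1 and neighbours == 2)
            new[r][c] = 1 if alive else 0
    return new

def count_errors(predict_step, n_grids, n_steps, size=16, seed=0):
    """Count steps on which predict_step (hypothetical model wrapper)
    disagrees with the reference, over random starting grids."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_grids):
        grid = [[rng.randint(0, 1) for _ in range(size)] for _ in range(size)]
        for _ in range(n_steps):
            expected = life_step(grid)
            if predict_step(grid) != expected:
                errors += 1
            grid = expected  # continue from the true state, not the prediction
    return errors

# Sanity check: the reference trivially agrees with itself.
print(count_errors(life_step, n_grids=3, n_steps=5))
```

With `n_grids=10_000` and `n_steps=100` this would cover the 1,000,000 steps mentioned above, though a pure-Python reference like this one would be slow at that scale.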

When the attention matrix is manually computed (to be equivalent to a 3 by 3 conv), the model can be trained to be 100% perfect, which I verified by checking all 512 possible 3 by 3 grid states. (This manually computed attention matrix means that once the tokens reach the classifier layer, each token contains only the information of the relevant 3 by 3 grid, and the whole thing is deterministic, as you say.)
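The exhaustive check is small enough to sketch directly: there are only 2^9 = 512 possible 3 by 3 states. In the sketch below, `predict` is a hypothetical function standing in for the classifier applied to one 3 by 3 patch; the names are illustrative, not taken from the actual code.

```python
from itertools import product

def life_rule(patch):
    """Next state of the centre cell of a 3x3 patch.
    patch is a tuple of 9 ints (0 or 1) in row-major order."""
    centre = patch[4]
    neighbours = sum(patch) - centre
    return 1 if neighbours == 3 or (centre == 1 and neighbours == 2) else 0

def failing_patches(predict):
    """Return every 3x3 patch where predict (hypothetical model wrapper)
    disagrees with the Game of Life rule."""
    return [p for p in product((0, 1), repeat=9) if predict(p) != life_rule(p)]

all_patches = list(product((0, 1), repeat=9))
print(len(all_patches))                 # 512 patches in total
print(len(failing_patches(life_rule)))  # 0: the rule agrees with itself
```

An empty list from `failing_patches` is a complete proof of correctness only when each output depends on nothing but its own 3 by 3 patch, which is the case with the fixed attention matrix.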

However, when the model computes the attention matrix itself, checking that all 3 by 3 sub-grid states crop up is not enough, because both the position of a sub-grid and the state of the other cells can affect the attention matrix. So, as shown in the post, it does approximate a 3 by 3 conv, but if it doesn't get the approximation quite right, there could be errors. I would still say that it's computing the Game of Life algorithm in an interpretable way; it's just that it may have struggled to create a perfect 3 by 3 convolution via attention in that particular case. (Checking this exhaustively would require checking all 2^(16x16) = 2^256 possible grids.)
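To give a sense of scale, here is a rough back-of-the-envelope calculation. It assumes the throughput of the validation run above (1,000,000 steps in 16 minutes) and one step per grid; the variable names are just for illustration.

```python
GRID_SIDE = 16

# Every cell is either dead or alive, so there are 2^(16*16) distinct grids.
n_grids = 2 ** (GRID_SIDE * GRID_SIDE)

# Throughput assumed from the validation run: 1,000,000 steps in 16 minutes.
steps_per_second = 1_000_000 / (16 * 60)

# Time to check a single step on every possible grid at that rate.
seconds = n_grids / steps_per_second
years = seconds / (365.25 * 24 * 3600)

print(f"grids to check: {n_grids:.3e}")
print(f"years at validation speed: {years:.1e}")
```

That comes out on the order of 10^66 years, so random sampling is the only practical option for the learned-attention model.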