Standard RL algorithms will converge to optimal play against a fixed opponent, but will not necessarily find an optimal policy via self-play.
One intuitive way to see this: a sequence of improving pure policies A < B < C < ... will converge to optimal play in a perfect-information game like chess, but not necessarily in an imperfect-information game like rock/paper/scissors, where Rock < Paper < Scissors < Rock, and so on.
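The cycling argument can be sketched in a few lines. This is a minimal illustration, not any particular RL algorithm: each "improvement" step plays the pure best response to the previous pure policy, and in rock/paper/scissors that process revisits the same policies forever rather than converging (the helper names below are invented for this sketch).

```python
# Best-response dynamics in rock/paper/scissors: each pure policy is
# strictly beaten by the next, so "improvement" cycles instead of
# converging to the mixed equilibrium (1/3, 1/3, 1/3).
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

def best_response(opponent: str) -> str:
    """The pure strategy that beats a pure opponent strategy."""
    return BEATS[opponent]

def improvement_sequence(start: str, steps: int) -> list[str]:
    """Follow pure best responses for `steps` iterations."""
    history = [start]
    for _ in range(steps):
        history.append(best_response(history[-1]))
    return history

trajectory = improvement_sequence("rock", 6)
# trajectory cycles with period 3: rock, paper, scissors, rock, ...
```

Every step is a genuine improvement against the previous policy, yet the sequence never settles, which is exactly why a monotone chain of pure policies suffices in chess but not here.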