
Standard RL algorithms will converge to optimal play against a fixed opponent, but they are not guaranteed to find an optimal policy via self-play.

One intuitive way to see this is that a sequence of improving pure policies A < B < C < ... will converge to optimal play in a perfect-information game like chess, but not necessarily in an imperfect-information game like rock/paper/scissors, where Rock < Paper < Scissors < Rock and the chain of "improvements" just cycles.
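To make the cycle concrete, here is a minimal sketch in Python. It is not any particular RL algorithm: it assumes the crudest possible form of self-play, where each new policy is simply the pure best response to the previous one. The names (`best_response`, `self_play_sequence`, `uniform_value_against`) are made up for illustration.

```python
ACTIONS = ["rock", "paper", "scissors"]

# Maps each action to the action that beats it (Rock < Paper < Scissors < Rock).
BEATEN_BY = {"rock": "paper", "paper": "scissors", "scissors": "rock"}


def payoff(mine, theirs):
    """Payoff to `mine` against `theirs`: +1 win, 0 tie, -1 loss."""
    if mine == theirs:
        return 0
    return 1 if BEATEN_BY[theirs] == mine else -1


def best_response(opponent):
    """Pure action with the highest payoff against a fixed pure opponent."""
    return max(ACTIONS, key=lambda a: payoff(a, opponent))


def self_play_sequence(start, steps):
    """Naive self-play: each new policy is the best response to the last one."""
    policies = [start]
    for _ in range(steps):
        policies.append(best_response(policies[-1]))
    return policies


def uniform_value_against(opponent):
    """Expected payoff of the uniform mixed strategy against a pure opponent."""
    return sum(payoff(a, opponent) for a in ACTIONS) / len(ACTIONS)


if __name__ == "__main__":
    # Each policy beats its predecessor, yet the sequence never settles.
    print(self_play_sequence("rock", 6))
    # The uniform mix breaks even against every pure action.
    print([uniform_value_against(a) for a in ACTIONS])
```

Every step in the sequence is an improvement over the previous policy, but the sequence returns to where it started every three steps. The uniform mixed strategy, which no pure best-response step ever produces, is the one that cannot be exploited.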


