Since these are hyperparameters, some of which are annealed over the entire training period, and given that the training required ungodly amounts of computing time, I think it was simply impractical for them to fully check whether they were set optimally. They probably went with what seemed good and trusted deep networks to pick up the slack. (This is total speculation on my part.)
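For context, "annealed over the training period" usually means something like a linear decay schedule. This is just a generic sketch of the idea, not their actual schedule or values:

```python
def annealed_value(start, end, step, total_steps):
    """Linearly anneal a hyperparameter (e.g. exploration epsilon or a
    learning rate) from `start` to `end` over the course of training."""
    frac = min(step / total_steps, 1.0)  # clamp once training ends
    return start + frac * (end - start)

# e.g. epsilon decaying from 1.0 to 0.1 over 1M steps:
# at step 500k it's halfway through the decay
print(annealed_value(1.0, 0.1, 500_000, 1_000_000))  # 0.55
```

The point is that the schedule's shape and endpoints are themselves hyperparameters, so verifying them properly means rerunning entire training runs.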
I do think that if they'd used some more sophisticated RL algorithms, perhaps with intrinsic curiosity or some kind of hierarchical task learning, they might have been able to reduce their training time and tune their hyperparameters a bit more.
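The intrinsic curiosity idea (e.g. ICM, Pathak et al. 2017) boils down to paying the agent an extra reward wherever its learned forward model predicts badly, so it explores novel states without hand-tuned exploration schedules. A toy sketch of just that bonus term (the `scale` value here is an arbitrary illustration, not from any paper):

```python
import numpy as np

def curiosity_bonus(predicted_next_state, actual_next_state, scale=0.01):
    """Intrinsic reward proportional to the forward model's squared
    prediction error: states the agent can't yet predict are treated
    as 'interesting' and earn extra reward."""
    error = np.sum((predicted_next_state - actual_next_state) ** 2)
    return scale * error

# The agent then trains on extrinsic + intrinsic reward:
# total_reward = env_reward + curiosity_bonus(pred, actual)
print(curiosity_bonus(np.zeros(3), np.ones(3)))  # 0.03
```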
The numbers seem pretty arbitrary to me; that's probably what this blog post is getting at when it discusses why it lost.