In many problem settings, most notably in game playing, an agent receives a possibly delayed reward for its actions. Often, those rewards are handcrafted and not naturally given. Even simple terminal-only rewards, like winning equals 1 losing \(-1\), can be seen as unbiased statement, since these values chosen arbitrarily, the behavior of learner may change with different encoding. It is hard t...