Paper Sharing one - Hindsight Experience Replay

Name

Hindsight Experience Replay

Attachment

  1. Original paper: arXiv
  2. Code: GitHub
  3. Video: YouTube
  4. Two Minute Papers: YouTube
  5. Two Minute Papers video shared by 愛可可-愛生活 on Weibo: Weibo

Problem or Challenge:

Dealing with sparse rewards, avoiding hand-engineered reward functions, and improving sample efficiency.

Assumptions or hypotheses:

  1. Multiple different goals (a single goal also works).
  2. Sparse and binary rewards (a sketch of such a reward is given below).
  3. Off-policy RL algorithms.
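
To make assumption 2 concrete, the reward in HER is sparse and binary: roughly 0 when the achieved state is close enough to the goal and -1 otherwise. A minimal sketch, assuming a simple Euclidean-distance check (the function name and tolerance below are illustrative, not taken from the paper's code):

```python
import numpy as np

def sparse_binary_reward(achieved_goal, desired_goal, tol=0.05):
    """Return 0.0 if the goal is reached within tolerance, else -1.0.

    `tol` is an illustrative threshold, not a value from the paper.
    """
    distance = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal))
    return 0.0 if distance < tol else -1.0
```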

Inspiration or Source:

Humans can learn almost as much from achieving an undesired outcome as from the desired one.

Methods or Solutions:

  1. HER can be combined with any off-policy RL algorithm.
  2. Training universal policies [2]: the input is the current state concatenated with a goal state (see the policy sketch below).
  3. Key idea: whether an episode is useful for training depends on which goal it is judged against. Because off-policy RL algorithms learn from replayed transitions, we can replay each episode with additional, substituted goals.
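
As a sketch of item 2, the only architectural change relative to a standard policy network is that the goal is concatenated to the state at the input. The class name and layer sizes below are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """Maps (state, goal) -> action, in the spirit of UVFA [2]."""

    def __init__(self, state_dim, goal_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # bounded continuous actions
        )

    def forward(self, state, goal):
        # The goal is simply appended to the state before the forward pass.
        return self.net(torch.cat([state, goal], dim=-1))
```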

Story time

    1. Assume you bought a robot, and it is now in a new house whose environment it does not know at all.
    2. You ask it to go to room 1 in 30 seconds.
    3. After walking around for a while, it ends up in room 2. For a traditional RL algorithm, the robot cannot learn much from this episode because all of its rewards are 0.
    4. But what if you change the goal to room 2 and add the full episode with reward 1 to your replay buffer?
    5. You can see that this is a good episode for the robot to learn a policy for going to room 2. This is called hindsight experience replay.
    6. Algorithm: the paper's Algorithm 1; a hedged code sketch of the idea follows below.
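
A hedged Python sketch of the loop described above, using the simple relabelling from the story: replay the episode once with the original goal and once with the goal that was actually achieved (the paper's final strategy). `env`, `policy`, `replay_buffer`, and `compute_reward` are assumed interfaces, not the authors' code.

```python
def collect_episode_with_her(env, policy, replay_buffer, compute_reward, horizon=50):
    """Run one episode and store it twice: with the original goal and with
    the achieved goal substituted in hindsight (strategy 'final')."""
    state, goal = env.reset()                          # assumed: returns (state, desired goal)
    transitions, achieved_goal = [], None
    for _ in range(horizon):
        action = policy(state, goal)
        next_state, achieved_goal = env.step(action)   # assumed: returns (state, achieved goal)
        reward = compute_reward(achieved_goal, goal)   # sparse, binary reward
        transitions.append((state, action, reward, next_state, achieved_goal))
        state = next_state

    final_goal = achieved_goal                         # what the agent actually ended up reaching
    for state, action, reward, next_state, achieved in transitions:
        # 1) standard replay: the goal the agent was asked to reach
        replay_buffer.add(state, action, reward, next_state, goal)
        # 2) hindsight replay: pretend the finally achieved state was the goal all along
        her_reward = compute_reward(achieved, final_goal)
        replay_buffer.add(state, action, her_reward, next_state, final_goal)
    # Any off-policy algorithm (e.g. DQN or DDPG) then trains on replay_buffer as usual.
```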

Experiments or Results:

  1. MuJoCo[3]
  2. Three different tasks: pushing, sliding, and pick-and-place.
  3. For HER they store each transition in the replay buffer twice:
     1. Once with the goal used for the generation of the episode.
     2. Once with the goal corresponding to the final state of the episode (they call this strategy final).
  4. Learning curves for the multi-goal setup (figure in the paper).
  5. DDPG without HER was not able to solve any of the three tasks.
  6. HER learns faster when training episodes contain multiple goals rather than a single goal.
  7. HER + reward shaping: surprisingly, neither DDPG nor DDPG+HER was able to solve any of the tasks with the shaped reward functions (the paper discusses two reasons).
  8. Replaying each transition with k additional goals, comparing three sampling strategies: future (the best), episode, and random (see the sketch after this list).
  9. The future strategy on a physical robot: the initial success rate was 2/5; after retraining, it rose to 5/5.
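
A minimal sketch of the goal-sampling strategies compared in item 8 (plus final from item 3 for reference). Here `episode_achieved` is assumed to be the list of goals actually achieved at each timestep of a stored episode, and `buffer_achieved` a pool of achieved goals from the whole replay buffer; the names are illustrative.

```python
import random

def sample_substitute_goals(episode_achieved, t, k, strategy="future", buffer_achieved=None):
    """Sample k substitute goals for the transition at timestep t.

    - 'final':   the goal achieved at the end of the same episode
    - 'future':  goals achieved after t in the same episode (best in the paper)
    - 'episode': goals achieved anywhere in the same episode
    - 'random':  goals achieved anywhere in the replay buffer
    """
    if strategy == "final":
        return [episode_achieved[-1]]
    if strategy == "future":
        pool = episode_achieved[t + 1:] or episode_achieved[-1:]  # fall back at the last step
    elif strategy == "episode":
        pool = episode_achieved
    elif strategy == "random":
        pool = buffer_achieved
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [random.choice(pool) for _ in range(k)]
```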

Limitation or Weakness:

  1. Reward shaping does not work well (it requires a lot of domain knowledge).
  2. The goal substitution changes the distribution of experience in an unprincipled way. This bias can in theory lead to instabilities[4].
  3. It cannot be combined with on-policy algorithms.

Summary

  1. It is the first time that such complicated behaviors were learned using only sparse, binary rewards.
  2. It can be combined with any off-policy RL algorithm.

Reference

[1] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, "Hindsight Experience Replay," 2017.

[2] T. Schaul, D. Horgan, K. Gregor, and D. Silver, "Universal Value Function Approximators," in International Conference on Machine Learning, 2015.

[3] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 5026-5033.

[4] blog.openai.com/ingredi

[5] Title image source: blog.openai.com/content

