"Convergence Guarantees for Dynamical Neural Network Policy Learning"
Rawson, Michael

Policy learning is a rapidly growing area. In this paper we analyze the Deep Epsilon Greedy method, a randomized algorithm that chooses actions based on the predictions of a dynamically trained convolutional neural network. We establish a policy whose error, or regret, bound is shown to converge. We also show that an upper bound on the regret of the Epsilon Greedy method is minimized by cubic-root exploration. In experiments on a nonlinear reinforcement learning problem constructed from the real-world MNIST dataset, we observe that, under both high and low noise, some methods converge and others do not, consistent with our theoretical analysis.
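The epsilon-greedy selection with cubic-root exploration described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and interface are hypothetical, the predicted rewards are passed in as a plain array (the paper uses a dynamically trained convolutional network as the predictor), and the schedule epsilon_t = t^(-1/3) is one natural reading of "cubic root exploration".

```python
import numpy as np

def epsilon_greedy_action(q_values, t, rng):
    """Pick an action epsilon-greedily with a cubic-root exploration
    schedule: epsilon_t = t^(-1/3), so exploration decays over time.

    q_values: predicted rewards per action (e.g., from a neural network)
    t:        1-indexed time step
    rng:      numpy random Generator
    """
    epsilon = t ** (-1.0 / 3.0)  # cubic-root decay of the exploration rate
    if rng.random() < epsilon:
        # explore: choose a uniformly random action
        return int(rng.integers(len(q_values)))
    # exploit: choose the greedy action under current predictions
    return int(np.argmax(q_values))
```

At t = 1 the agent always explores (epsilon = 1); as t grows, epsilon shrinks toward zero and the agent increasingly trusts the learned predictions.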