This paper presents a new problem-solving approach that generates optimal policies for finite-state stochastic sequential decision-making problems with high data efficiency. The proposed algorithm iteratively builds and improves an approximate Markov Decision Process (MDP) model, along with cost-to-go value approximations, by generating finite-length trajectories through the state space. The approach creates a synergy between the evolving approximate model and the approximate cost-to-go values, producing a sequence of improving policies that converges to the optimal policy through an intelligent, structured search of the policy space. It modifies the policy-update step of policy iteration to yield fast and stable convergence to the optimal policy. We apply the algorithm to a non-holonomic mobile robot control problem and compare its performance with other Reinforcement Learning (RL) approaches: a) Q-learning, b) Watkins' Q(λ), and c) SARSA(λ).
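The general scheme the abstract describes — estimating an approximate MDP model from finite-length sampled trajectories and then running policy iteration on that model — can be illustrated with a minimal sketch. This is a generic model-based policy-iteration toy example on a hypothetical random MDP, not the authors' specific modified algorithm; all names and parameters (state/action counts, trajectory lengths, discount factor) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 4-state, 2-action stochastic MDP (ground truth, hidden from the learner).
n_states, n_actions = 4, 2
P_true = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transitions
cost = rng.uniform(0.0, 1.0, size=(n_states, n_actions))               # per-step costs
gamma = 0.9

# --- Step 1: estimate an approximate MDP model from finite-length trajectories ---
counts = np.zeros((n_states, n_actions, n_states))
for _ in range(500):                       # 500 sampled trajectories
    s = rng.integers(n_states)
    for _ in range(20):                    # each of length 20
        a = rng.integers(n_actions)        # exploratory (uniform) action choice
        s2 = rng.choice(n_states, p=P_true[s, a])
        counts[s, a, s2] += 1
        s = s2
# Smoothed empirical transition estimate P_hat(s' | s, a).
P_hat = (counts + 1e-6) / (counts + 1e-6).sum(axis=2, keepdims=True)

# --- Step 2: policy iteration on the learned approximate model ---
policy = np.zeros(n_states, dtype=int)
for _ in range(100):
    # Policy evaluation: solve (I - gamma * P_pi) J = c_pi for the cost-to-go J.
    P_pi = P_hat[np.arange(n_states), policy]
    c_pi = cost[np.arange(n_states), policy]
    J = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
    # Policy improvement: greedy one-step lookahead on the approximate model.
    Q = cost + gamma * P_hat @ J
    new_policy = Q.argmin(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("learned policy:", policy)
```

With enough trajectories, policy iteration on the estimated model recovers a policy close to the one computed on the true model; the paper's contribution lies in interleaving the model and value updates within the trajectory generation itself.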
@article{ijiccs:63893,
  author   = "R. Sharma and M. Gopal",
  title    = "Trajectory-Based Modified Policy Iteration",
  journal  = "International Journal of Information, Control and Computer Sciences",
  keywords = "Markov Decision Process (MDP), mobile robot, policy iteration, simulation",
  volume   = "1",
  number   = "12",
  pages    = "4065-6",
}