Stabilizing Dynamic State Feedback Controller
Synthesis: A Reinforcement Learning Approach

Miguel A. SOLIS1, Manuel OLIVARES2, Héctor ALLENDE1
1 Departamento de Informática,

Universidad Técnica Federico Santa María, Chile
2 Departamento de Electrónica,
Universidad Técnica Federico Santa María, Chile

Abstract: State feedback controllers are appealing due to their structural simplicity. Nevertheless, when stabilizing a given plant, its dynamics could force the static feedback gain to take higher values than desired. On the other hand, a dynamic state feedback controller is capable of achieving the same or even better performance by introducing additional parameters into the model to be designed. In this document, the Linear Quadratic Tracking problem is tackled using a (linear) dynamic state feedback controller whose parameters are chosen by means of reinforcement learning techniques, which have proved especially useful when the model of the plant to be controlled is unknown or inaccurate.

Keywords: Adaptive control, Furuta pendulum, reinforcement learning.

Cite as: Miguel A. SOLIS, Manuel OLIVARES, Héctor ALLENDE, Stabilizing Dynamic State Feedback Controller Synthesis: A Reinforcement Learning Approach, Studies in Informatics and Control, ISSN 1220-1766, vol. 25(2), pp. 245-254, 2016.

  1. Introduction

Reinforcement learning (RL) is typically concerned with solving sequential decision problems modelled by Markov Decision Processes (MDPs). Applications of RL, as a research field within Machine Learning, have been extended to areas such as Robotics [10] and Control Theory [7, 12, 16] by choosing a suitable representation of the problem to be solved.

The first RL applications to control systems can be traced to the work of Werbos [19, 20], where the regulation problem was tackled: the objective is to design a controller for a given process such that its internal state approaches zero as time increases without bound. An immediate extension was to apply policy iteration (PI) algorithms to solve the linear quadratic regulator (LQR) problem [4].

The LQR, i.e., the regulator problem when the system is assumed to be linear and the performance index is given by a quadratic function [1], is particularly appealing because its solution is obtained by solving an algebraic Riccati equation (ARE). PI algorithms basically start with an admissible control policy and then iterate between policy evaluation and policy improvement steps until variations in the policy or in the specified value function are negligible, as seen in [4, 13, 17].
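This evaluation/improvement loop can be sketched for a discrete-time LQR as follows. The matrices below are illustrative values, not taken from this work, and a fixed iteration count replaces the usual convergence check on successive gains:

```python
import numpy as np

# Illustrative system x_{k+1} = A x_k + B u_k with cost sum x'Qx + u'Ru
# (values assumed for the sketch, not from the paper).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

K = np.array([[1.0, 1.0]])  # initial admissible (stabilizing) gain

for _ in range(50):
    # Policy evaluation: the cost matrix P of the policy u = -Kx solves the
    # Lyapunov equation P = Q + K'RK + (A - BK)' P (A - BK); iterate to its
    # fixed point, which exists because A - BK is stable.
    Acl = A - B @ K
    Qcl = Q + K.T @ R @ K
    P = Qcl.copy()
    for _ in range(500):
        P = Qcl + Acl.T @ P @ Acl
    # Policy improvement: greedy gain with respect to the current P.
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
```

At convergence, P satisfies the algebraic Riccati equation and K is the optimal LQR gain; each outer pass is one policy-iteration (Newton) step.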

On the other hand, the linear quadratic tracking (LQT) problem also assumes a linear model for the process dynamics and a quadratic performance index, but the main objective is to design a controller such that the measured output of the process to be controlled follows an exogenous reference signal; the LQR can thus be considered a particular case of the LQT problem. Although, as mentioned before, RL algorithms have been extensively applied to the LQR problem, the LQT problem has received less attention in the literature, mainly because for most reference signals the infinite-horizon cost becomes unbounded [2]. The work in [14] tackles the problem in the continuous-time domain by solving an augmented ARE obtained from the original system dynamics and the reference trajectory dynamics, while [9] takes a similar approach in the discrete-time case, where a Q-learning algorithm is obtained for solving the LQT problem without any model knowledge.
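The augmented-system construction behind [14] and [9] can be sketched as follows. The system, reference generator, and discount factor below are assumptions for illustration; the discount is what keeps the infinite-horizon tracking cost bounded for a non-decaying reference:

```python
import numpy as np

# Illustrative LQT setup (values assumed, not from the paper):
# x_{k+1} = A x_k + B u_k, y_k = C x_k, reference r_{k+1} = F r_k,
# recast as a discounted LQR on the augmented state X_k = [x_k; r_k].
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
F = np.array([[1.0]])      # constant (non-decaying) reference generator
gamma = 0.9                # discount factor keeps the cost bounded

# Augmented dynamics T and input matrix B1.
T = np.block([[A, np.zeros((2, 1))], [np.zeros((1, 2)), F]])
B1 = np.vstack([B, np.zeros((1, 1))])

# Tracking-error weight: (y - r)' Qe (y - r) = X' Q1 X with C1 = [C, -I].
Qe = np.array([[1.0]])
C1 = np.hstack([C, -np.eye(1)])
Q1 = C1.T @ Qe @ C1
R = np.array([[0.1]])

# Value iteration on the discounted augmented Riccati equation.
P = np.zeros((3, 3))
for _ in range(2000):
    K = np.linalg.solve(R + gamma * B1.T @ P @ B1, gamma * B1.T @ P @ T)
    P = Q1 + gamma * T.T @ P @ T - gamma * T.T @ P @ B1 @ K
# u_k = -K X_k combines state feedback and reference feedforward in one gain.
```

The part of K acting on r is the feedforward term; without the discount, the fixed point would not exist here because F has an eigenvalue on the unit circle.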

When noisy systems are considered, the performance index and the notions of stability have to be modified accordingly. This problem has been extensively treated in the classical, model-based control literature [6, 8, 21], unlike in the learning paradigm. The work in [11] uses neural networks to reduce the computational effort of providing optimal control for the stochastic LQR, while other works focus on relaxing assumptions on the ARE under different scenarios, but still require knowledge of the system dynamics [5, 22]. The work in [9] can be considered the closest to our approach, given the LQT setup and the absence of model knowledge. Nevertheless, unlike the work therein, we consider the stochastic LQT problem, and we extend the structure of the (linear) state feedback controller to a more general form. When analysing experimental results, RL will prove especially useful when the model is unknown, but it remains useful when the dynamics are assumed to be given, since hand-tuning of controller parameters can be time consuming due to the number of degrees of freedom and the corresponding constraints.
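To illustrate the model-free ingredient, the following sketch runs a Q-learning-style policy iteration for a deterministic LQR in the spirit of [4] and [9]. The toy system is assumed (it is not the paper's plant or controller structure): a quadratic Q-function Q(x,u) = z'Hz with z = [x; u] is fitted by least squares from observed transitions, and the improved gain is read off the blocks of H, with no use of (A, B) by the learner:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.8, 0.2], [0.0, 0.7]])   # unknown to the learner
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
n, m = 2, 1

def svec(z):
    """Features of the quadratic z'Hz: upper triangle of z z',
    off-diagonal entries doubled so coefficients equal H's entries."""
    zz = np.outer(z, z)
    iu = np.triu_indices(len(z))
    w = np.where(iu[0] == iu[1], 1.0, 2.0)
    return w * zz[iu]

K = np.zeros((m, n))                      # initial policy (A is stable here)
for _ in range(10):                       # policy iteration on Q-functions
    Phi, y = [], []
    for _ in range(200):                  # transitions with exploration noise
        x = rng.standard_normal(n)
        u = -K @ x + 0.5 * rng.standard_normal(m)
        c = x @ Q @ x + u @ R @ u
        x1 = A @ x + B @ u
        u1 = -K @ x1                      # next action under current policy
        # Bellman equation Q(x,u) = c + Q(x',u') is linear in the entries of H.
        Phi.append(svec(np.concatenate([x, u])) - svec(np.concatenate([x1, u1])))
        y.append(c)
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(y), rcond=None)
    H = np.zeros((n + m, n + m))
    H[np.triu_indices(n + m)] = theta
    H = H + H.T - np.diag(np.diag(H))     # symmetrize
    # Policy improvement from the fitted Q-function: K = H_uu^{-1} H_ux.
    K = np.linalg.solve(H[n:, n:], H[n:, :n])
```

Since every quantity in the regression is measured along trajectories, the same loop applies when only input-output data are available, which is the setting exploited in [9].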

The remainder of this document is organized as follows: Section 2 presents a brief review of the basic concepts used in the subsequent sections, as well as the classical approach to the LQT problem using (static) state feedback controllers.

Then, Section 3 shows the procedure for obtaining a stabilizing dynamic controller that minimizes the LQT performance criterion, and presents the main results.

Section 4 illustrates simulation results obtained for an arbitrary plant. Finally, Section 5 draws some conclusions and gives some insight into future work.


E{·} denotes the expectation operator, λ(M) is used to describe the largest eigenvalue of matrix M, while M^T denotes the transpose of matrix M, and M^-1 its inverse when M is square.

R stands for the set of all real numbers, and when used with superscripts, R^n (or R^(n×m)) describes a vector (or matrix) with n rows (or n rows and m columns) whose elements are real-valued.


  1. ANDERSON, B. D. O., J. B. MOORE, Optimal Control: Linear Quadratic Methods, Courier Dover Publ., 2007.
  2. BARBIERI, E., R. ALBA-FLORES, On the Infinite Horizon LQ Tracker. Systems & Control Letters, vol. 40(2), 2000, pp. 77-82.
  3. BERTSEKAS, D. P., Dynamic Programming and Optimal Control, vol. 1, Athena Scientific, Belmont, MA, 1995.
  4. BRADTKE, S. J., B. E. YDSTIE, A. G. BARTO, Adaptive Linear Quadratic Control using Policy Iteration. In American Control Conference, 1994, volume 3, pp. 3475-3479.
  5. CHEN, S., X. LI, X. Y. ZHOU, Stochastic Linear Quadratic Regulators with Indefinite Control Weight Costs. SIAM Journal on Control and Optimization, vol. 36(5), 1998, pp. 1685-1702.
  6. DE SOUZA, C. E., M. D. FRAGOSO, On the Existence of Maximal Solution for Generalized Algebraic Riccati Equations Arising in Stochastic Control. Systems & Control Letters, vol. 14(3), 1990, pp. 233-239.
  7. HE, P., S. JAGANNATHAN, Reinforcement Learning-based Output Feedback Control of Nonlinear Systems with Input Constraints. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Trans., vol. 35(1), 2005, pp. 150-154.
  8. HUANG, Y., W. ZHANG, H. ZHANG, Infinite Horizon Linear Quadratic Optimal Control for Discrete Time Stochastic Systems. Asian Journal of Control, vol. 10(5), 2008, pp. 608-615.
  9. KIUMARSI, B., F. L. LEWIS, M. B. NAGHIBI-SISTANI, A. KARIMPOUR, Optimal Tracking Control of Unknown Discrete-time Linear Systems using Input-output Measured Data. IEEE Trans. on Cybernetics, vol. 45(12), 2015, pp. 2770-2779.
  10. KOBER, J., J. A. BAGNELL, J. PETERS, Reinforcement Learning in Robotics: A Survey. International Journal of Robotics Research, vol. 32(11), 2013, pp. 1238-1274.
  11. KUMARESAN, N., P. BALASUBRAMANIAM, Optimal Control for Stochastic Linear Quadratic Singular System using Neural Networks. Journal of Process Control, vol. 19(3), 2009, pp. 482-488.
  12. LEWIS, F. L., D. LIU, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, vol. 17, John Wiley & Sons, 2013.
  13. LEWIS, F. L., K. G. VAMVOUDAKIS, Reinforcement Learning for Partially Observable Dynamic Processes: Adaptive Dynamic Programming using Measured Output Data. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Trans., vol. 41(1), 2011, pp. 14-25.
  14. QIN, C., H. ZHANG, Y. LUO, Online Optimal Tracking Control of Continuous-time Linear Systems with Unknown Dynamics by using Adaptive Dynamic Programming. International Journal of Control, vol. 87(5), 2014, pp. 1000-1009.
  15. SÖDERSTRÖM, T., Discrete-time Stochastic Systems: Estimation and Control. Springer, 2002.
  16. SUTTON, R. S., A. G. BARTO, R. J. WILLIAMS, Reinforcement Learning is Direct Adaptive Optimal Control. IEEE Control Systems, vol. 12(2), 1992, pp. 19-22.
  17. TEN HAGEN, S., B. KROSE, Linear Quadratic Regulation using Reinforcement Learning. In 8th Belgian-Dutch Conference on Machine Learning, 1998, pp. 39-46.
  18. WATKINS, C. J. C. H., Learning from Delayed Rewards. PhD thesis, University of Cambridge, 1989.
  19. WERBOS, P. J., Neural Networks for Control and System Identification. In Proceedings of the 28th IEEE Conference on Decision and Control, 1989, pp. 260-265.
  20. WERBOS, P. J., Approximate Dynamic Programming for Real-time Control and Neural Modeling. Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, vol. 15, 1992, pp. 493-525.
  21. WONHAM, W. M., On a Matrix Riccati Equation of Stochastic Control. SIAM Journal on Control, vol. 6(4), 1968, pp. 681-697.
  22. ZHANG, W., G. LI, Discrete-time Indefinite Stochastic Linear Quadratic Optimal Control with Second Moment Constraints. Mathematical Problems in Engineering, vol. 2014, 2014.