Real-time Hybrid System Identification
with Online Deterministic Annealing

Christos N. Mavridis^∗ and Karl Henrik Johansson^∗ ^∗Division of Decision and Control Systems, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm. Emails: {mavridis,kallej}@kth.se. Research partially supported by the Swedish Foundation for Strategic Research (SSF) grant IPD23-0019.

Abstract

We introduce a real-time identification method for discrete-time state-dependent switching systems in both the input–output and state-space domains. In particular, we design a system of adaptive algorithms running in two timescales; a stochastic approximation algorithm implements an online deterministic annealing scheme at a slow timescale and estimates the mode-switching signal, and a recursive identification algorithm runs at a faster timescale and updates the parameters of the local models based on the estimate of the switching signal. We first focus on piece-wise affine systems and discuss identifiability conditions and convergence properties based on the theory of two-timescale stochastic approximation. In contrast to standard identification algorithms for switched systems, the proposed approach gradually estimates the number of modes and is appropriate for real-time system identification using sequential data acquisition. The progressive nature of the algorithm improves computational efficiency and provides real-time control over the performance-complexity trade-off. Finally, we address specific challenges that arise in the application of the proposed methodology in identification of more general switching systems. Simulation results validate the efficacy of the proposed methodology.

Index Terms:

Hybrid System Identification, Switching Systems, Piecewise Affine System Identification, Online Deterministic Annealing.

I Introduction

Hybrid systems, described by interacting continuous and discrete dynamics, are a powerful modeling tool in the analysis of systems where logic and continuous processes are interlaced, as in most complex cyber-physical systems. In addition to being able to describe switching dynamics, hybrid systems can be used as a tool to approximate highly non-linear dynamics by a collection of simpler models, and boost model explainability and robustness, by decomposing the behavior of a complex system into sub-systems where first principles and domain knowledge can be used for precise model tuning [1, 2]. As a result, hybrid systems have attracted significant attention in the control community.

However, first principles modelling is often too complicated and sub-optimal, and a hybrid model needs to be identified on the basis of observations. The majority of the work in this area is based on piece-wise affine (PWA) systems, a class of state-dependent switched systems with important applications in identification, verification, and control synthesis of hybrid and nonlinear systems [2, 3, 4, 5]. PWA systems are a collection of affine dynamical systems, indexed by a discrete-valued switching variable (mode) that depends on a partitioning of the state-input domain into a finite number of polyhedral regions [2, 3]. The input–output representation of PWA systems is the class of piece-wise affine auto-regressive exogenous (PWARX) systems with the switching signal depending on a partitioning of the domain of a vector containing the recent history of input–output pairs. The problem of identifying a PWA system can be quite challenging [6, 7]. As a result, most existing approaches focus on offline identification methods [8, 9].

I-A Contribution and Outline

In this work, we propose a two-timescale stochastic optimization approach for real-time state-dependent switched system identification in both input–output and state-space representations. We first focus on the well-studied case of PWA and PWARX systems. In Section II we present the realization and identifiability conditions for PWA systems, and in Theorem 1 of Section II-B we provide the identifiability conditions for state space PWA systems in the form of a persistence of excitation (PE) criterion. In Section III, we formulate the state-dependent switching system identification problem as a combined identification and prototype-based learning problem, and in Sections IV and V we develop a two-timescale stochastic approximation algorithm to solve it in real-time.

In particular, Theorem 3 of Section IV constructs a stochastic approximation algorithm based on online deterministic annealing that estimates the mode-switching signal, as well as the number of modes, through a bifurcation phenomenon, studied in Section IV-B. In Section V a second stochastic approximation algorithm based on standard adaptive filtering, running at a faster timescale, is developed to update the parameters of the local models based on the estimate of the switching signal. The convergence properties of this system of recursive algorithms are studied in Theorem 4 of Section V-B, and the applicability of the proposed approach in more general state-dependent switching systems is discussed in Section VI. Finally, in Section VII, simulation results validate the efficacy of the proposed approach in PWA systems.

I-B Related Work

Most existing switched system identification methods [2, 3, 4] can be categorized by the problem formulation used as optimization-based [4, 10, 8, 11], algebraic [12, 13], or clustering-based [9, 14, 15], and by the method used as offline [9, 16, 17] or recursive [18, 13]. Algebraic methods are based on transforming the SARX model to a “lifted” ARX model that does not depend on the switching sequence [12, 13]. Offline optimization-based methods often rely on solving a large mixed-integer program, which is tractable only for small data sets [11, 8], or relaxation techniques over the same problem [18]. Finally, clustering-based methods are optimization-based methods that make use of unsupervised learning to estimate the partition of the domain that is needed for the switching signal [9, 14, 15, 19, 20, 21]. Most hybrid identification approaches are offline methods that first classify each observation and estimate the local model parameters (either simultaneously or iteratively), and then reconstruct the partition of the switching signal [9, 14, 22]. In our recent work, we have proposed the use of the online deterministic annealing approach as a clustering method to estimate the partition of the switching signal [23, 24]. Compared to standard offline methods, this approach allows for real-time PWA system identification, provides computational benefits, and offers real-time control over the performance-complexity trade-off, desired in many applications. In this work, we modify and extend this approach and provide an extensive study of a real-time prototype-based learning method for more general switched systems in both input–output and state-space representations. Compared to [23, 24], the proposed method also provides a solution to the central problem of estimating a minimal number of modes.

I-C Notation

The sets $\mathbb{R}$ and $\mathbb{Z}$ represent the sets of real and integer numbers, respectively, while $\mathbb{Z}_{+}$ represents the set of non-negative integers. For a real matrix $A\in\mathbb{R}^{n\times m}$ , $A^{\mathrm{T}}\in\mathbb{R}^{m\times n}$ denotes its transpose and $\text{vec}(A)\in\mathbb{R}^{mn}$ the vectorization of $A$ . The $n\times n$ identity matrix is denoted $I_{n}$ . $A\succeq 0$ is a positive semi-definite matrix, and the condition $A\succeq B$ is understood as $A-B\succeq 0$ . Unless otherwise specified, random variables $\mathcal{X}:\Omega\rightarrow\mathbb{R}^{d}$ are defined in a probability space $(\Omega,\mathbb{F},\mathbb{P})$ . The probability of an event is denoted $\mathbb{P}\left[\mathcal{X}\in S\right]\coloneq\mathbb{P}\left[\omega\in\Omega:\mathcal{X}(\omega)\in S\right]$ , and the expectation operator $\mathbb{E}\left[\mathcal{X}\right]=\int_{\Omega}\mathcal{X}\textrm{d}\mathbb{P}$ . In case of multiple random variables $(\mathcal{X},\mathcal{Y})$ and a deterministic function $f$ , the expectation operator $\mathbb{E}\left[f(\mathcal{X},\mathcal{Y})\right]$ is understood with respect to the joint probability measure, while $\mathbb{E}\left[\mathcal{X}|\mathcal{Y}\right]\coloneq\mathbb{E}\left[\mathcal{X}|\sigma(\mathcal{Y})\right]$ denotes the expectation of $\mathcal{X}$ conditioned to the $\sigma$ -field of $\mathcal{Y}$ . Stochastic processes $\left\{\mathcal{X}(k)\right\}_{k}$ , $k\in\mathbb{Z}_{+}$ , are defined in the filtered probability space $(\Omega,\mathbb{F},\left\{\mathcal{F}_{n}\right\}_{n},\mathbb{P})$ , where $\mathcal{F}_{n}=\sigma(\mathcal{X}(k)|k\leq n)$ , $k\in\mathbb{Z}_{+}$ , is the natural filtration. The indicator function of the event $\left[\mathcal{X}\in S\right]$ is denoted $\mathds{1}_{\left[\mathcal{X}\in S\right]}$ and $\otimes$ denotes the Kronecker product. Finally, “ $\min$ ” defines the minimization operator while “ $\operatorname*{minimize}$ ” defines a minimization problem.

II Switched and Piecewise Affine Systems

A general discrete-time switched system is described by:

	$\displaystyle x_{t+1}$	$\displaystyle=f_{\sigma_{t}}(x_{t},u_{t})+w_{t}$		(1)
	$\displaystyle y_{t}$	$\displaystyle=g_{\sigma_{t}}(x_{t},u_{t})+v_{t},\quad t\in\mathbb{Z}_{+}$		(1)

where $x_{t}\in\mathbb{R}^{n}$ is the state vector of the system, $u_{t}\in\mathbb{R}^{p}$ the input, $y_{t}\in\mathbb{R}^{q}$ the output, and $w_{t}\in\mathbb{R}^{n}$ and $v_{t}\in\mathbb{R}^{q}$ are noise terms. The signal $\sigma_{t}\in\left\{1,\ldots,s\right\}$ defines the mode which is active at time $t$ . System (1) is a switched affine system when it can be expressed as:

	$\displaystyle x_{t+1}$	$\displaystyle=A_{\sigma_{t}}x_{t}+B_{\sigma_{t}}u_{t}+\bar{f}_{\sigma_{t}}+w_{t}$		(2)
	$\displaystyle y_{t}$	$\displaystyle=C_{\sigma_{t}}x_{t}+D_{\sigma_{t}}u_{t}+\bar{g}_{\sigma_{t}}+v_{t},\quad t\in\mathbb{Z}_{+}.$		(2)

The matrices $A_{i}\in\mathbb{R}^{n\times n}$ , $B_{i}\in\mathbb{R}^{n\times p}$ , $C_{i}\in\mathbb{R}^{q\times n}$ , $D_{i}\in\mathbb{R}^{q\times p}$ , $\bar{f}_{i}\in\mathbb{R}^{n}$ , and $\bar{g}_{i}\in\mathbb{R}^{q}$ define the affine dynamics for each mode $i\in\left\{1,\ldots,s\right\}$ . System (2) is PWA when $\sigma_{t}$ is defined according to a polyhedral partition of the state and input space, i.e., when

\sigma_{t}=i\iff\begin{bmatrix}x_{t}\\ u_{t}\end{bmatrix}\in R_{i}\subset R,

(3)

where $R_{i}$ , $i=1,\ldots,s$ , are convex polyhedra defining a partition of the state-input domain $R\subseteq\mathbb{R}^{n+p}$ , that is when $R_{i}\cap R_{j}=\emptyset$ for $i\neq j$ , and $\bigcup_{i}R_{i}=R$ .

Switched affine systems can be expressed in input–output form as Switched AutoRegressive eXogenous (SARX) systems of fixed orders $n_{a}$ , $n_{b}$ , such that for every component $y_{t}^{(i)}\in\mathbb{R}$ of the output vector $y_{t}\in\mathbb{R}^{q}$ it holds:

\displaystyle y_{t}^{(i)}=\bar{\theta}_{\sigma_{t}}^{(i)\mathrm{T}}\begin{bmatrix}r_{t}\\ 1\end{bmatrix}+\bar{e}_{t}^{(i)},\ i=1,\ldots,q,

(4)

where the regressor vector $r_{t}\in\mathbb{R}^{\bar{d}}$ , $\bar{d}=qn_{a}+p(n_{b}+1)$ , is defined by

r_{t}=[y_{t-1}^{\mathrm{T}}\ldots y_{t-n_{a}}^{\mathrm{T}}u_{t}^{\mathrm{T}}u_{t-1}^{\mathrm{T}}\ldots u_{t-n_{b}}^{\mathrm{T}}]^{\mathrm{T}}\in\mathbb{R}^{\bar{d}}.

(5)

The parameter vectors $\bar{\theta}_{j}^{(i)}\in\mathbb{R}^{\bar{d}+1}$ , $j\in\left\{1,\ldots,s\right\}$ , define each ARX mode, and $\bar{e}_{t}\in\mathbb{R}^{q}$ is a noise term. Similarly, (4) is PWARX if

\sigma_{t}=i\iff r_{t}\in P_{i}\subset P\subseteq\mathbb{R}^{\bar{d}},

(6)

and $\left\{P_{i}\right\}_{i=1}^{s}$ define a polyhedral partition of $P\subseteq\mathbb{R}^{\bar{d}}$ .
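As a minimal sketch of how the regressor in (5) can be assembled from recorded data, the snippet below stacks the lagged outputs and inputs in the order given there. The function name `build_regressor` and the array layout (one sample per row) are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def build_regressor(y_hist, u_hist, t, n_a, n_b):
    """Stack [y_{t-1}, ..., y_{t-n_a}, u_t, u_{t-1}, ..., u_{t-n_b}] as in (5).

    y_hist: array of shape (T, q); u_hist: array of shape (T, p).
    Requires t >= max(n_a, n_b) so that every lagged sample exists.
    """
    if t < max(n_a, n_b):
        raise ValueError("not enough history to form the regressor")
    past_outputs = [y_hist[t - k] for k in range(1, n_a + 1)]
    inputs = [u_hist[t - k] for k in range(0, n_b + 1)]
    return np.concatenate(past_outputs + inputs)
```

The returned vector has length $qn_{a}+p(n_{b}+1)$, which matches the dimension $\bar{d}$ stated above.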

II-A Realization and Identification of PWARX Models

Results in [25] show that every observable switched affine system admits a SARX representation. Necessary and sufficient conditions for input–output realization of SARX and PWARX systems are given in [26], and [27], respectively. It is worth mentioning, however, that the number of modes and parameters can grow considerably when a PWA state-space system is converted into a minimum-order equivalent PWARX representation [27].

Identifiability refers to whether or not identification of a given parameterized system from noise-free data is a well-posed problem. In spite of the increasing attention received by SARX and PWARX system identification, there are currently only a few results on the identifiability of these systems [2, 3]. Identifiability with respect to the input–output behavior of switched linear systems is investigated in [28]. The general identification problem for a PWARX system of the form (4)-(6) can be formulated as a stochastic optimization problem over the parameters $\left\{n_{a},n_{b},s,\left\{\theta_{i}\right\}_{i=1}^{s},\left\{P_{i}\right\}_{i=1}^{s}\right\}$ . We make the following assumption:

Assumption 1.

Upper bounds $(\tilde{n}_{a},\tilde{n}_{b})$ on the orders of the model $(n_{a},n_{b})$ are known.

Assumption 1 will allow us to concentrate on the properties of PWARX identification, assuming known $(\tilde{n}_{a},\tilde{n}_{b})$ subject to potential computational bounds.

II-B Realization and Identification of PWA State-Space Models

The problem of identifying a state-space representation of a switched affine system can be quite challenging. Traditionally, it has been handled by applying results from classical realization theory to each linear subsystem [7]. However, identifiability issues arise regarding the characterization of minimality of discrete-time switched linear systems. The first issue relates to the known fact that realizations of a switched affine system are not unique [6]. The lack of uniqueness stems from the facts that (i) the minimal realizations of the local linear systems from input–output observations are non-unique, and (ii) a realization of a switched affine system can be constructed for any arbitrary number of modes $s^{\prime}\geq s$ [6]. Here we address the problem of the realization of the local systems being unique only up to isomorphisms, even when the switching signal is known [7]. In particular, assuming no affine dynamics $\bar{f}_{\sigma_{t}}$ , $\bar{g}_{\sigma_{t}}$ , for any set of invertible matrices $P_{\sigma_{t}}\in\mathbb{R}^{n\times n}$ , $\sigma_{t}\in\left\{1,\ldots,s\right\}$ , the realization $\left\{P_{\sigma_{t}}A_{\sigma_{t}}P_{\sigma_{t}}^{-1},P_{\sigma_{t}}B_{\sigma_{t}},C_{\sigma_{t}}P_{\sigma_{t}}^{-1},D_{\sigma_{t}}\right\}$ corresponds to the transfer function associated with (2), i.e., the same input–output observations. To ensure uniqueness of the realizations, given that all subsystems $i\in\left\{1,\ldots,s\right\}$ share the same state space, and to simplify the presentation of our methodology, we make the following assumptions.

Assumption 2.

$C_{i}=C$ , $\forall i\in\left\{1,\ldots,s\right\}$ in system (2).

Assumption 2 implies that the order $n$ is known (observed) and enforces that the set of observations is acquired using the same observation mechanism, which leads to the realization of (2) being unique. In practice, this corresponds, for example, to the assumption of using a single sensor with the same world reference to measure the states of the system for every mode, without allowing any similarity transformation.

Assumption 3.

No affine dynamics, i.e., $\bar{f}_{\sigma_{t}}=0$ , $\bar{g}_{\sigma_{t}}=0$ , no feed-forward terms, i.e., $D_{\sigma_{t}}=0$ , the states are fully observable, i.e., $C=I_{n}$ , and the error terms $w_{t}$ and $v_{t}$ share the same zero-mean statistics for every mode of the system.

Assumption 3 is made to simplify the presentation of the proposed methodology without loss of generality.

In addition to the realizations of the local systems being non-unique, minimality and identifiability of the switched system do not necessarily imply that of the local subsystems [28]. In Theorem 1, we describe the conditions under which the local linear models of (2) (under Assumptions 2–3) can be identified, even when a subset of them is not controllable (minimal) in isolation.

Theorem 1.

Consider a bounded-input bounded-output linear discrete-time system of the form:

	$\displaystyle x_{t+1}$	$\displaystyle=Ax_{t}+Bu_{t},\quad t\in\mathbb{Z}_{+}$		(7)
	$\displaystyle y_{t}$	$\displaystyle=x_{t},$		(7)

where $x_{t}\in\mathbb{R}^{n}$ , $u_{t}\in\mathbb{R}^{p}$ , $A\in\mathbb{R}^{n\times n}$ , and $B\in\mathbb{R}^{n\times p}$ . Denote $r_{t}=[x_{t}^{\mathrm{T}}u_{t}^{\mathrm{T}}]^{\mathrm{T}}$ . Then, if there exist some $\alpha,\beta,T>0$ such that

\alpha I_{n+p}\preceq\sum_{\tau=t}^{t+T}r_{\tau}r_{\tau}^{\mathrm{T}}\preceq\beta I_{n+p},\quad\forall t\geq 0,

(8)

the augmented parameter matrix $\hat{\Theta}_{t}=[\hat{A}_{t}|\hat{B}_{t}]$ updated by the recursion

\hat{\Theta}_{t+1}=\hat{\Theta}_{t}-\gamma\left(\hat{\Theta}_{t}r_{t}-x_{t+1}\right)r_{t}^{\mathrm{T}},\quad t\geq 0,

(9)

for some $\gamma>0$ , asymptotically converges to $\Theta=[A|B]$ .

Proof.

See Appendix A. ∎
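The recursion (9) can be illustrated numerically with a short simulation. The sketch below is only a sanity check under stated assumptions: the matrices `A` and `B`, the step size `gamma`, and the uniformly distributed random input are hypothetical choices made so that the system is stable, the regressor stays bounded, and the excitation is rich. It does not reproduce the experiments of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stable, controllable system (illustrative values only).
A = np.array([[0.5, 0.2], [0.0, -0.5]])
B = np.array([[1.0], [1.0]])
Theta = np.hstack([A, B])            # true augmented matrix [A | B]

gamma = 0.1                          # step size, small relative to ||r_t||^2
Theta_hat = np.zeros_like(Theta)     # initial estimate
x = np.zeros(2)

for t in range(3000):
    u = rng.uniform(-1.0, 1.0, size=1)   # random input used as excitation
    r = np.concatenate([x, u])           # r_t = [x_t; u_t]
    x_next = A @ x + B @ u               # noise-free state update (7)
    # Recursion (9): move the estimate against the prediction error.
    Theta_hat = Theta_hat - gamma * np.outer(Theta_hat @ r - x_next, r)
    x = x_next

print(np.max(np.abs(Theta_hat - Theta)))  # largest remaining parameter error
```

With a bounded regressor, a step size satisfying $\gamma\|r_{t}\|^{2}<2$ keeps each update non-expansive in the parameter error, which is why a small `gamma` is used here.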

As a result of Theorem 1, throughout this paper, we make the following assumption to ensure identifiability of (2) under Assumptions 2–3:

Assumption 4.

All linear subsystems $i\in\left\{1,\ldots,s\right\}$ of (2) are asymptotically bounded, and the bounded control input $u_{t}$ is designed such that for every mode $i\in\left\{1,\ldots,s\right\}$ of (2), there exist some $\alpha_{i},\beta_{i},T_{i}>0$ for which the following persistence of excitation condition holds:

\alpha_{i}I_{n+p}\preceq\sum_{\tau=t}^{t+T_{i}}\begin{bmatrix}x_{\tau}x_{\tau}^{\mathrm{T}}&x_{\tau}u_{\tau}^{\mathrm{T}}\\ u_{\tau}x_{\tau}^{\mathrm{T}}&u_{\tau}u_{\tau}^{\mathrm{T}}\end{bmatrix}\preceq\beta_{i}I_{n+p},\ \forall t\geq 0.

(10)

Remark 1.

Informally, condition (10) states that not every subsystem in (2) needs to be controllable (minimal), as long as the boundaries of each mode (region $R_{i}$ in the state-input space) are visited often enough and form a rich-enough set of states.

Remark 2.

The assumption of asymptotic boundedness and controllability (thus, minimality) for all subsystems of (2) would simplify the condition (10) to a persistence of excitation criterion for the input $u_{t}$ for each subsystem separately. Although this assumption is usually adopted, it is a limiting assumption in a practical sense. The assumption that all the local systems share the same state space of order $n$ is a modeling assumption that facilitates the identification of the switching signal as a partition of the state-input space. However, it allows for situations when the minimal realization of some of the local models is of order $n^{\prime}<n$ , as long as the switched system as a whole is identifiable.

III Hybrid System Identification as an Optimization Problem

Consider a switched linear system of the form

	$\displaystyle\psi_{t}$	$\displaystyle=\Theta_{i}\phi_{t}+e_{t},$		(11)
		$\displaystyle=[\phi_{t}^{\mathrm{T}}\otimes I_{m}]\theta_{i}+e_{t},\text{ if }\phi_{t}\in S_{i},\ t\in\mathbb{Z}_{+},$		(11)

where $\psi_{t}\in\mathbb{R}^{m}$ , $\phi_{t}\in\mathbb{R}^{d}$ , $\sigma_{t}\in\left\{1,\ldots,s\right\}$ , $\Theta_{i}\in\mathbb{R}^{m\times d}$ , for all $i=1,\ldots,s$ , $\theta_{i}=\text{vec}(\Theta_{i})\in\mathbb{R}^{md}$ , $e_{t}\in\mathbb{R}^{m}$ is a zero-mean noise signal, and $\left\{S_{i}\right\}_{i=1}^{s}$ define a polyhedral partition of $S\subseteq\mathbb{R}^{d}$ . System (4) can be written in the form (11) with $\psi_{t}=y_{t}\in\mathbb{R}^{q}$ , $\phi_{t}=[r_{t}^{\mathrm{T}}1]^{\mathrm{T}}\in\mathbb{R}^{\bar{d}+1}$ , and $\Theta_{i}=[\bar{\theta}_{i}^{(1)}\ldots\bar{\theta}_{i}^{(q)}]^{\mathrm{T}}$ , where $m=q$ , and $d=\bar{d}+1$ . In addition, system (2) under Assumptions 2, 3 can be written in the form (11) with $\psi_{t}=x_{t+1}\in\mathbb{R}^{n}$ , $\phi_{t}=[x_{t}^{\mathrm{T}}u_{t}^{\mathrm{T}}]^{\mathrm{T}}\in\mathbb{R}^{n+p}$ , and $\Theta_{i}=[A_{i}|B_{i}]$ , where $m=n$ , and $d=n+p$ .
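The two equivalent forms in (11) rely on the identity $\Theta_{i}\phi_{t}=[\phi_{t}^{\mathrm{T}}\otimes I_{m}]\text{vec}(\Theta_{i})$, with $\text{vec}$ stacking the columns of $\Theta_{i}$. A small numerical check, using arbitrary random values and hypothetical dimensions, might look as follows.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 3, 4                                    # illustrative dimensions
Theta = rng.standard_normal((m, d))
phi = rng.standard_normal(d)

theta = Theta.flatten(order="F")               # vec(Theta): stack the columns
regressor_matrix = np.kron(phi[None, :], np.eye(m))  # [phi^T (x) I_m], (m, m*d)

lhs = Theta @ phi                              # matrix form of (11)
rhs = regressor_matrix @ theta                 # vectorized form of (11)
print(np.allclose(lhs, rhs))
```

The column-major flattening (`order="F"`) matters here: a row-major flattening would pair the Kronecker blocks with the wrong entries of $\Theta_{i}$.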

Under the identifiability conditions discussed in Section II, the general identification problem for a switching system of the form (11) can be formulated as a stochastic optimization problem over the parameters $\left\{s,\left\{\theta_{i}\right\}_{i=1}^{s},\left\{S_{i}\right\}_{i=1}^{s}\right\}$ , as follows:

\operatorname*{minimize}_{s,\left\{\theta_{i}\right\},\left\{S_{i}\right\}}\ \mathbb{E}\left[\sum_{i=1}^{s}\mathds{1}_{\left[\Phi\in S_{i}\right]}d_{\rho}\left(\Psi,[\Phi^{\mathrm{T}}\otimes I_{m}]\theta_{i}\right)\right],

(12)

where $\Psi\in\mathbb{R}^{m}$ and $\Phi\in\mathbb{R}^{d}$ represent random variables, realizations of which constitute the system observations, the nonnegative measure $d_{\rho}$ is an appropriately defined dissimilarity measure, and the expectation is taken with respect to the joint distribution of $(\Psi,\Phi)\in\mathbb{R}^{m+d}$ that depends on the system dynamics, the control input, and the noise term in (11).

It is clear that the optimization problem (12) is computationally hard and becomes intractable as the number of modes and states increases. In particular, the number of modes $s$ is unknown and completely alters the cardinality and the domain of the set of parameter vectors $\left\{\theta_{i}\right\}_{i=1}^{s}$ that represent the dynamics of the system. In addition, a parametric representation for the polyhedral regions $\left\{S_{i}\right\}$ should be defined.

To represent the regions $\left\{S_{i}\right\}$ , we will follow a Voronoi tessellation approach based on prototypes. We introduce a set of parameters $\hat{\phi}\coloneq\left\{\hat{\phi}_{i}\right\}_{i=1}^{K}$ , $\hat{\phi}_{i}\in S$ and define the regions:

\Sigma_{i}=\left\{\phi\in S:i=\operatorname*{arg\,min}_{j}d_{\rho}(\phi,\hat{\phi}_{j})\right\},\ i=1,\ldots,K.

(13)

The measure $d_{\rho}$ can be designed such that the Voronoi regions $\Sigma_{i}$ are polyhedral, e.g., when $d_{\rho}$ is a squared Euclidean distance or any Bregman divergence, as will be explained in Section IV-A. In this sense, each $S_{i}$ can be mapped to a region $\Sigma_{j}$ (for $K=s$ ) or the union of a subset of $\left\{\Sigma_{j}\right\}$ (for $K>s$ ), according to a predefined rule, as will be explained in Section IV-C. An illustration of this partition is given in Fig. 1.
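For the squared Euclidean case, membership in a region $\Sigma_{i}$ of (13) reduces to a nearest-prototype lookup. The sketch below assumes that choice of $d_{\rho}$; the function name `voronoi_index` and the example prototypes are hypothetical.

```python
import numpy as np

def voronoi_index(phi, prototypes):
    """Index of the nearest prototype under the squared Euclidean distance.

    prototypes: array of shape (K, d); phi: array of shape (d,).
    Ties are resolved in favour of the lowest index (np.argmin behaviour).
    """
    sq_dist = np.sum((prototypes - phi) ** 2, axis=1)
    return int(np.argmin(sq_dist))

# Three illustrative prototypes in the plane.
prototypes = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
print(voronoi_index(np.array([1.5, 0.2]), prototypes))
```

Indices are zero-based in the code, whereas the regions in (13) are numbered from one.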

In addition to the prototype parameters $\left\{\hat{\phi}_{i}\right\}_{i=1}^{K}$ , we also introduce a set of parameters $\hat{\theta}\coloneq\left\{\hat{\theta}_{i}\right\}_{i=1}^{K}$ , $\hat{\theta}_{i}\in\mathbb{R}^{md}$ , with each $\hat{\theta}_{i}$ associated with the region $\Sigma_{i}$ according to (13). Representing the augmented random vector

X=\begin{bmatrix}\Psi\\ \Phi\end{bmatrix}\in\Pi\subseteq\mathbb{R}^{m+d},

(14)

we can define a set of augmented codevectors $\mu\coloneq\left\{\mu_{i}\right\}_{i=1}^{K}$ as

\mu_{i}=\begin{bmatrix}z(\phi,\hat{\theta}_{i})\\ \hat{\phi}_{i}\end{bmatrix}\in\Pi,\ i=1,\ldots,K,

(15)

where the first component of each $\mu_{i}$ ¹ is a mapping $z(\phi,\hat{\theta}_{i})=[\phi^{\mathrm{T}}\otimes I_{m}]\hat{\theta}_{i}$ that simulates the local model dynamics in (11) with unknown parameters $\theta_{i}$ , and the second component is a set of unknown codevectors $\hat{\phi}_{i}$ that define the partition in (13).

¹ Throughout this paper we will use the notation $\mu_{i}$ , $\mu_{i}(\hat{\phi}_{i})$ , $\mu_{i}(\hat{\theta}_{i})$ , $\mu_{i}(\hat{\theta}_{i},\hat{\phi}_{i})$ , $\mu_{i}(\phi,\hat{\theta}_{i},\hat{\phi}_{i})$ interchangeably, to showcase the dependence on the variables of interest in each case.

Problem (12) can then be decomposed into two interconnected stochastic optimization problems. Assuming $\left\{\hat{\theta}_{i}\right\}_{i=1}^{K}$ are known, the optimization problem

\operatorname*{minimize}_{\hat{\phi}}\ \mathbb{E}\left[\sum_{i=1}^{K}\mathds{1}_{\left[\Phi\in\Sigma_{i}(\hat{\phi})\right]}d_{\rho}\left(X(\Psi,\Phi),\mu_{i}(\hat{\theta}_{i},\hat{\phi}_{i})\right)\right]

(16)

finds the optimal parameters $\left\{\hat{\phi}_{i}\right\}_{i=1}^{K}$ that define the partition $\left\{\Sigma_{i}\right\}_{i=1}^{K}$ subject to the joint distribution of $(\Psi,\Phi)$ , and is, therefore, a mode switching signal identification problem.

On the other hand, assuming the partition $\left\{\Sigma_{i}\right\}_{i=1}^{K}$ (and, therefore, $\left\{S_{i}\right\}_{i=1}^{s}$ ) is known, the optimization problem

\operatorname*{minimize}_{\hat{\theta}}\ \mathbb{E}\left[\sum_{i=1}^{K}\mathds{1}_{\left[\Phi\in\Sigma_{i}\right]}d_{\rho}\left(\Psi,[\Phi^{\mathrm{T}}\otimes I_{m}]\hat{\theta}_{i}\right)\right]

(17)

is a system identification problem for each mode of the system.

In Section IV we address the question of finding the optimal number $K$ according to a performance-complexity trade-off, as well as finding a mapping between $\left\{\Sigma_{i}\right\}_{i=1}^{K}$ and $\left\{S_{i}\right\}_{i=1}^{\hat{s}}$ for the lowest possible number $\hat{s}\geq s$ . In Section V we tackle the problem of solving (16) and (17) as a system of interconnected stochastic optimization problems in real-time using principles from two-timescale stochastic approximation theory.

Figure 1: Illustration of the partition $\left\{S_{i}\right\}_{i=1}^{s}$ of the state-input space $S$ and its connection to the artificial partition $\left\{\Sigma_{j}\right\}_{j=1}^{K}$ . The optimal parameters $\left\{\hat{\phi}_{j}\right\}$ induce a partition $\left\{\Sigma_{j}\right\}$ that minimizes the mode switching error.

IV Mode Identification with Online Deterministic Annealing

We aim to construct a recursive stochastic optimization algorithm to solve problem (16) while progressively estimating the number $K$ of the augmented codevectors $\left\{\mu_{i}\right\}_{i=1}^{K}$ , an estimate $\hat{s}$ of the actual number of modes, and a mapping between $\left\{\Sigma_{i}\right\}_{i=1}^{K}$ and $\left\{S_{i}\right\}_{i=1}^{\hat{s}}$ . Recall that the observed data are represented by the random variable $X\in\Pi$ in (14), and the augmented codevectors $\left\{\mu_{i}\right\}_{i=1}^{K}$ are normally treated as constant parameters to be estimated. To progressively estimate $K$ and $\hat{s}$ , we will adopt the online deterministic annealing approach [29, 30], and define a probability space over an arbitrary number of codevectors, while constraining their distribution using a maximum-entropy principle at different levels. First we define a quantizer $Q:\Pi\rightarrow\Pi$ as a stochastic mapping of the form:

Q(x)=\mu_{i}\ \text{ with probability }p(\mu_{i}|x).

(18)

Then we formulate the multi-objective optimization

\operatorname*{minimize}_{\hat{\phi}}\ F_{\lambda}(\mu)=(1-\lambda)D(\mu)-\lambda H(\mu),\ \lambda\in[0,1),

(19)

where the dependence on $\hat{\phi}$ comes through $\mu(\hat{\phi})$ , the term

\displaystyle D(\mu)=\mathbb{E}\left[d_{\rho}\left(X,Q\right)\right]=\int p(x)\sum_{i}p(\mu_{i}|x)d_{\rho}(x,\mu_{i})~{}\textrm{d}x

(20)

is a generalization of the objective in (16), and

	$\displaystyle H(\mu)$	$\displaystyle=\mathbb{E}\left[-\log P(X,Q)\right]$		(21)
		$\displaystyle=H(X)-\int p(x)\sum_{i}p(\mu_{i}|x)\log p(\mu_{i}|x)~{}\textrm{d}x$		(21)

is the Shannon entropy. This is now a problem of finding the locations $\left\{\hat{\phi}_{i}\right\}$ and the corresponding probabilities $\left\{p(\mu_{i}|x)=\mathbb{P}[Q=\mu_{i}|X=x]\right\}$ .

Notice that, for $p(\mu_{i}|x)=\mathds{1}_{\left[\phi\in\Sigma_{i}(\hat{\phi})\right]}$ and $\lambda=0$ , (19) is equivalent to (16). In that sense, (19) introduces extra optimization parameters in the probabilities $\left\{p(\mu_{i}|x)\right\}$ , and the parameter $\lambda$ that defines a homotopy $F_{\lambda}$ . However, the advantages of this approach are notable, and, perhaps counter-intuitively, lead to numerical optimization solutions with several computational benefits. On the one hand, the Lagrange multiplier $\lambda\in[0,1)$ controls the trade-off between $D$ and $H$ , which, as will be shown, is a trade-off between performance and complexity. On the other hand, the use of the conditional probabilities $\left\{p(\mu_{i}|x)\right\}$ allows for the definition of the entropy term $H$ , which introduces several useful properties [29, 30, 31, 32, 33]. In particular, as we will show in Section IV-B, reducing the values of $\lambda$ defines a direction that resembles an annealing process [29, 34] and induces a bifurcation phenomenon, with respect to which, the number of unique codevectors $K_{\lambda}$ depends on $\lambda$ and is finite for any given value of $\lambda>0$ . This process also introduces robustness with respect to initial conditions [29, 35].

IV-A Solving the Optimization Problem

To solve (19) for a given value of $\lambda$ , we successively minimize $F_{\lambda}$ first with respect to the association probabilities $\left\{p(\mu_{i}|x)\right\}$ , and then with respect to the codevector locations $\mu$ . The solution of the optimization problem

	$\displaystyle F_{\lambda}^{*}(\mu)$	$\displaystyle=\min_{\left\{p(\mu_{i}|x)\right\}}F_{\lambda}(\mu),$		(22)
	s.t.	$\displaystyle\sum_{i}p(\mu_{i}|x)=1,$		(22)

is given by the Gibbs distributions [36]:

p^{*}(\mu_{i}|x)=\frac{e^{-\frac{1-\lambda}{\lambda}d_{\rho}(x,\mu_{i})}}{\sum_{j}e^{-\frac{1-\lambda}{\lambda}d_{\rho}(x,\mu_{j})}},~{}\forall x\in\Pi.

(23)
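To make the role of $\lambda$ in (23) concrete, the following sketch evaluates the Gibbs association probabilities for a single observation, assuming a squared Euclidean $d_{\rho}$. The function name `gibbs_probabilities` and the example codevectors are hypothetical; the subtraction of the largest exponent is a standard numerical safeguard and does not change the result.

```python
import numpy as np

def gibbs_probabilities(x, codevectors, lam):
    """Association probabilities (23) with a squared Euclidean d_rho.

    codevectors: array of shape (K, dim); x: array of shape (dim,).
    lam must lie strictly between 0 and 1 for the exponent to be finite.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie in (0, 1)")
    dist = np.sum((codevectors - x) ** 2, axis=1)
    logits = -(1.0 - lam) / lam * dist
    logits -= logits.max()            # avoid underflow in every exponential
    weights = np.exp(logits)
    return weights / weights.sum()

codevectors = np.array([[0.0, 0.0], [1.0, 1.0]])
x = np.array([0.1, 0.0])
p_soft = gibbs_probabilities(x, codevectors, lam=0.999)  # close to uniform
p_hard = gibbs_probabilities(x, codevectors, lam=0.01)   # close to one-hot
```

Values of $\lambda$ near one give almost uniform associations, while small values approach the hard nearest-codevector assignment that appears in (16).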

In order to minimize $F_{\lambda}^{*}(\mu)$ with respect to $\hat{\phi}$ we set the gradients to zero

\frac{\mathrm{d}}{\mathrm{d}{\hat{\phi}}}F_{\lambda}^{*}(\mu)=\frac{\mathrm{d}}{\mathrm{d}\mu}F_{\lambda}^{*}(\mu)\frac{\mathrm{d}\mu}{\mathrm{d}{\hat{\phi}}}=0

(24)

where $\frac{\mathrm{d}\mu}{\mathrm{d}{\hat{\phi}}}=\begin{bmatrix}0_{m\times d}\\ I_{d}\end{bmatrix}$ , and

		$\displaystyle\frac{\mathrm{d}}{\mathrm{d}{\mu}}F_{\lambda}^{*}(\mu)=\frac{\mathrm{d}}{\mathrm{d}\mu}\left((1-\lambda)D(\mu)-\lambda H(\mu)\right)$		(25)
		$\displaystyle=\sum_{i}\int p(x)p^{*}(\mu_{i}|x)\frac{\mathrm{d}}{\mathrm{d}\mu_{i}}d_{\rho}(x,\mu_{i})~{}\mathrm{d}x=0,$		(25)

where we have used (23) and direct differentiation with similar arguments as in [36]. It follows that $\frac{\mathrm{d}}{\mathrm{d}\hat{\phi}}F_{\lambda}^{*}(\mu)=0$ which implies that

\displaystyle\int p(x)p^{*}(\mu_{i}|x)\frac{\mathrm{d}}{\mathrm{d}\mu_{i}}d_{\rho}(x,\mu_{i})~{}\mathrm{d}x\begin{bmatrix}0_{m\times d}\\ I_{d}\end{bmatrix}=0,\ \forall i.

(26)

Equation (26) has a closed-form solution if the dissimilarity measure $d_{\rho}$ belongs to the family of Bregman divergences [37, 29], information-theoretic dissimilarity measures that include the squared Euclidean distance and the Kullback-Leibler divergence, and are defined as follows:

Definition 1 (Bregman Divergence).

Let $\rho:S\rightarrow\mathbb{R}$ be a strictly convex function defined on a vector space $S\subseteq\mathbb{R}^{d}$ such that $\rho$ is twice F-differentiable on $S$ . The Bregman divergence $d_{\rho}:S\times S\rightarrow\left[0,\infty\right)$ is defined as:

\displaystyle d_{\rho}\left(x,\mu\right)=\rho\left(x\right)-\rho\left(\mu\right)-\frac{\partial\rho}{\partial\mu}\left(\mu\right)\left(x-\mu\right),

where $x,\mu\in S$ , and the continuous linear map $\frac{\partial\rho}{\partial\mu}\left(\mu\right):S\rightarrow\mathbb{R}$ is the Fréchet derivative of $\rho$ at $\mu$ .
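Definition 1 can be checked on the two examples mentioned in the text. The sketch below, with hypothetical helper names, evaluates the divergence for $\rho(x)=\|x\|^{2}$, which yields the squared Euclidean distance, and for the negative entropy $\rho(x)=\sum_{k}x_{k}\log x_{k}$ on positive vectors, which yields the generalized Kullback-Leibler divergence.

```python
import numpy as np

def bregman(rho, grad_rho, x, mu):
    """d_rho(x, mu) = rho(x) - rho(mu) - <grad rho(mu), x - mu>."""
    return rho(x) - rho(mu) - float(np.dot(grad_rho(mu), x - mu))

def sq_norm(v):
    return float(np.dot(v, v))

def sq_norm_grad(v):
    return 2.0 * v

def neg_entropy(v):
    # Defined for vectors with strictly positive entries.
    return float(np.sum(v * np.log(v)))

def neg_entropy_grad(v):
    return np.log(v) + 1.0

x = np.array([0.2, 0.8])
mu = np.array([0.5, 0.5])
d_euclid = bregman(sq_norm, sq_norm_grad, x, mu)      # equals ||x - mu||^2
d_kl = bregman(neg_entropy, neg_entropy_grad, x, mu)  # equals sum x log(x/mu) here
```

Because both example vectors sum to one, the generalized divergence coincides with the usual Kullback-Leibler divergence in this instance.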

Throughout this manuscript, we will assume that the dissimilarity measure $d_{\rho}$ in (13) is a Bregman divergence. Then the solution to the optimization problem

\operatorname*{minimize}_{\hat{\phi}}~{}F_{\lambda}^{*}\left(\mu(\hat{\phi})\right),

(27)

where $F_{\lambda}^{*}(\mu)$ is the solution of (22) for a given $\lambda\in[0,1)$ and $p^{*}(\mu_{i}|x)$ is given by (23), is given by Theorem 2.

Theorem 2.

If $d_{\rho}:\Pi\times\Pi\rightarrow\mathbb{R}_{+}$ is a Bregman divergence, then

\hat{\phi}_{i}^{*}=\frac{\int\phi p(x)p^{*}(\mu_{i}|x)~{}\mathrm{d}x}{p^{*}(\mu_{i})}

(28)

is a solution to the optimization problem (27).

Proof.

By definition, for a Bregman divergence $d_{\rho}:\Pi\times\Pi\rightarrow\mathbb{R}_{+}$ based on a strictly convex function $\rho:\Pi\rightarrow\mathbb{R}$ , it holds that $\frac{\partial d_{\rho}}{\partial\mu}(x,\mu)=-\left<\nabla^{2}\rho(\mu),(x-\mu)\right>$ . Similar to [30], with standard algebraic manipulations, (26) then becomes

\int(\phi-\hat{\phi}_{i}^{*})p(x)p^{*}(\mu_{i}|x)~{}\mathrm{d}x=0,\ \forall i,

(29)

where $p^{*}(\mu_{i}|x)$ is given by (23) and the integral is defined over the domain $\Pi$ . Eq. (29) is equivalent to (28) since $\int p(x)p^{*}(\mu_{i}|x)~{}\mathrm{d}x=p^{*}(\mu_{i})$ . ∎
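A sample-based reading of (28) may help: with a finite batch of observations, the integral becomes a probability-weighted average of the regressors $\phi$. The sketch below is a minimal batch version under stated assumptions. The function names are hypothetical, the Gibbs form of the association probabilities is taken from the expression used in (31), and the distances $d(x_{n},\mu_{i})$ are assumed to be computed by the caller.

```python
import numpy as np

def association_probs(D, prior, lam):
    """Gibbs weights p(mu_i | x_n) from distances D[n, i], in the form used in (31)."""
    logits = np.log(prior)[None, :] - (1.0 - lam) / lam * D
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability only
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def centroid_update(Phi, P):
    """Empirical version of (28): probability-weighted mean of phi per codevector."""
    return (P.T @ Phi) / P.sum(axis=0)[:, None]
```

At temperatures close to $\lambda=1$ the weights become uniform, so every row returned by `centroid_update` approaches the sample mean of the regressors.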

Remark 3.

The partition $\left\{\Sigma_{i}\right\}$ induced by (13) and a dissimilarity measure $d_{\rho}$ that belongs to the family of Bregman divergences, is separated by hyperplanes [37]. As a result, each $\Sigma_{i}$ is a polyhedral region for a bounded domain $S$ .

Based on Theorem 2, Theorem 3 below constructs a gradient-free stochastic approximation algorithm that recursively estimates (28).

Theorem 3.

The sequence $\hat{\phi}_{i}(t)$ constructed by the recursive updates

\begin{cases}\rho_{i}(t+1)&=\rho_{i}(t)+\beta(t)\left[\hat{p}_{i}(t)-\rho_{i}(t)\right]\\ \sigma_{i}(t+1)&=\sigma_{i}(t)+\beta(t)\left[\phi_{t}\hat{p}_{i}(t)-\sigma_{i}(t)\right],\end{cases}

(30)

where $x_{t}=[\psi_{t}^{\mathrm{T}}\phi_{t}^{\mathrm{T}}]^{\mathrm{T}}$ represents external input with $\psi_{t}\sim\Psi$ , $\phi_{t}\sim\Phi$ , $\sum_{t}\beta(t)=\infty$ , $\sum_{t}\beta^{2}(t)<\infty$ , and the quantities $\hat{p}_{i}(t)$ and $\hat{\phi}_{i}(t)$ are recursively updated as follows:

\displaystyle\hat{\phi}_{i}(t)=\frac{\sigma_{i}(t)}{\rho_{i}(t)},\quad\hat{p}_{i}(t)=\frac{\rho_{i}(t)e^{-\frac{1-\lambda}{\lambda}d(x_{t},\mu_{i}(t))}}{\sum_{j}\rho_{j}(t)e^{-\frac{1-\lambda}{\lambda}d(x_{t},\mu_{j}(t))}},

(31)

with $\mu_{i}(t)=[z_{i}^{\mathrm{T}}(\phi_{t},\hat{\theta}_{i}),\hat{\phi}_{i}(t)^{\mathrm{T}}]^{\mathrm{T}}$ , converges almost surely to $\hat{\phi}_{i}^{*}$ given in (28).

Proof.

The proof follows similar arguments as Theorem $5$ of [30]. ∎
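The recursion (30), (31) can be sketched in a few lines of Python. This is an illustrative simplification rather than the authors' implementation: the function name `oda_step` is hypothetical, all $K$ codevectors are updated at once instead of sequentially, and the distances $d(x_{t},\mu_{i}(t))$ are assumed to be supplied by the caller, since $\mu_{i}(t)$ depends on the local model prediction.

```python
import numpy as np

def oda_step(rho, sigma, phi_t, dist, lam, beta):
    """One pass of (30)-(31) over all K codevectors.

    rho:   (K,)    running estimates of the codevector probabilities
    sigma: (K, d)  running estimates of the probability-weighted regressors
    dist:  (K,)    d(x_t, mu_i(t)) for the current observation
    """
    # Subtracting dist.min() leaves the normalized weights unchanged and
    # avoids underflow at low temperature.
    w = rho * np.exp(-(1.0 - lam) / lam * (dist - dist.min()))
    p_hat = w / w.sum()
    rho_new = rho + beta * (p_hat - rho)
    sigma_new = sigma + beta * (phi_t[None, :] * p_hat[:, None] - sigma)
    phi_hat = sigma_new / rho_new[:, None]
    return rho_new, sigma_new, phi_hat
```

With a single codevector, $\hat{p}_{1}(t)=1$ and the recursion reduces to a running average of the observed regressors, which is a convenient sanity check.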

Remark 4.

Notice that the dynamics of (30) can be expressed as:

\displaystyle\hat{\phi}_{i}(t+1)=\hat{\phi}_{i}(t)+\frac{\beta(t)}{\rho_{i}(t)}\bigg[\frac{\sigma_{i}(t+1)}{\rho_{i}(t+1)}\left(\rho_{i}(t)-\hat{p}_{i}(t)\right)+\phi_{t}\hat{p}_{i}(t)-\sigma_{i}(t)\bigg],

(32)

where the recursive updates take place for every codevector $\hat{\phi}_{i}$ sequentially. This is a discrete-time dynamical system that presents bifurcation phenomena with respect to the parameter $\lambda$ , i.e., the number of equilibria of this system changes with respect to the value $\lambda$ which is hidden inside the term $\hat{p}_{i}(t)$ in (31). According to this phenomenon, the number of distinct values of $\hat{\phi}_{i}$ is finite, and the updates need only be taken with respect to these values that we call “effective codevectors”. This is discussed in Section IV-B.

IV-B Bifurcation Phenomena

In Section IV-A we described how to solve the optimization problem for a given value of the parameter $\lambda$ . The main idea of the proposed approach is to solve a sequence of optimization problems of the form (19) with decreasing values of $\lambda$ . This process then becomes a homotopy optimization method [38]. In particular, the usage of the entropy term resembles annealing optimization methods and grants $\lambda$ the name of a ’temperature’ parameter. Notice that, so far, we have assumed an arbitrary number of codevectors $K$ . We will show that the unique values of the set $\left\{\hat{\phi}_{i}\right\}$ that solves (19), form a finite set of $K_{\lambda}$ values that we will refer to as “effective codevectors”.

Notice that at high temperature ( $\lambda\rightarrow 1$ ), (23) yields uniform association probabilities $p(\mu_{i}|x)=p(\mu_{j}|x),\ \forall i,j,\forall x$ , and as a result of (28), all pseudo-inputs are located at the same point $\hat{\phi}_{i}=\mathbb{E}_{X}\left[\phi\right],\ \forall i$ , which means that there is one unique “effective” codevector given by $\mathbb{E}_{X}\left[\phi\right]$ . As $\lambda$ is lowered below a critical value, a bifurcation phenomenon occurs, when the number of “effective” codevectors increases, which describes an annealing process [29, 34]. Mathematically, this occurs when the existing solution $\hat{\phi}^{*}$ given by (28) is no longer the minimum of the free energy $F^{*}$ , as the temperature $\lambda$ crosses a critical value. Following principles from variational calculus, we can track the bifurcation by the condition:

\frac{d^{2}}{d\epsilon^{2}}F^{*}(\left\{\hat{\phi}+\epsilon\hat{\psi}\right\})\bigg|_{\epsilon=0}\geq 0,

(33)

for all choices of finite perturbations $\left\{\hat{\psi}\right\}$ . Using (33) and direct differentiation, one can show that bifurcation depends on the temperature coefficient $\lambda$ (and the choice of the Bregman divergence, through the function $\rho$ ) [30, 36]. In other words, the number of codevectors increases countably many times as the value of $\lambda$ decreases, and an algorithmic implementation needs only as many codevectors in memory as the number of “effective” codevectors.

In practice, we can detect the bifurcation points by introducing perturbing pairs of codevectors at each temperature level $\lambda$ . In this way, the codevectors $\hat{\phi}$ are doubled by inserting a perturbation of each $\hat{\phi}_{i}$ in the set of effective codevectors. The newly inserted codevectors will merge with their pair if a critical temperature has not been reached and separate otherwise. The merging criterion takes the form:

\frac{1-\lambda}{\lambda}d_{\rho}(\hat{\phi}_{i},\hat{\phi}_{j})\leq\epsilon_{n},\ \forall i,j,

(34)

for a given threshold $\epsilon_{n}$ . The pseudocode for this algorithm is presented in Alg. 1. A detailed discussion on the implementation of the original online deterministic annealing algorithm, its complexity, and the effect of its parameters, can be found in [29, 30, 36].
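The perturb-and-merge mechanism can be illustrated with the following minimal sketch. The function names, the use of the squared Euclidean distance as the Bregman divergence, and the greedy choice of which duplicate to keep are assumptions made for illustration only.

```python
import numpy as np

def perturb(codevectors, delta):
    """Replace every codevector by the pair (phi + delta, phi - delta)."""
    C = np.asarray(codevectors, float)
    return np.concatenate([C + delta, C - delta], axis=0)

def merge(codevectors, lam, eps_n, dist=lambda a, b: np.sum((a - b) ** 2)):
    """Keep one representative of every group that satisfies criterion (34)."""
    kept = []
    for c in np.asarray(codevectors, float):
        # c survives only if it is farther than eps_n (scaled) from all kept ones
        if all((1.0 - lam) / lam * dist(c, k) > eps_n for k in kept):
            kept.append(c)
    return np.array(kept)
```

In this sketch, pairs that have not separated after convergence collapse back to a single effective codevector, while pairs that drifted apart are both retained.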

IV-C Estimating the number of modes

As illustrated in Fig. 1, the problem formulation developed in Section III defines a possibly imperfect surjective mapping from $\left\{\Sigma_{j}\right\}_{j=1}^{K}$ to $\left\{S_{i}\right\}_{i=1}^{s}$ such that each $S_{i}$ is defined as a union of a subset of $\left\{\Sigma_{j}\right\}_{j=1}^{K}$ . According to Remark 3, it is possible for this mapping to be perfect in the sense that each $S_{i}$ is perfectly represented, inducing zero mode switching error. The design of an appropriate termination criterion for Alg. 1 is an open question and is subject to the trade-off between the number $K$ and the minimization of the identification error. In this work, we make use of the condition $K\leq K_{\max}$ as a termination criterion, where $K_{\max}$ represents the computational capacity of the identification device.

Recall that each $\Sigma_{j}$ is associated with a parameter vector $\hat{\theta}_{j}$ , $j=1,\ldots,K$ . Assuming a set $\bar{\theta}=\left\{\bar{\theta}_{i}\right\}_{i=1}^{\hat{s}}$ , we define each $\hat{\theta}_{j}$ as the mapping:

\hat{\theta}_{j}(\bar{\theta})=\bar{\theta}_{i},\ \text{if }i=\operatorname*{arg\,min}_{k}d_{\rho}(\hat{\theta}_{j},\bar{\theta}_{k}).

(35)

In this way $\Sigma_{j}\in S_{i}$ if $\hat{\theta}_{j}(\bar{\theta})=\bar{\theta}_{i}$ . Therefore, given (35), the goal now is to find $\hat{s}$ and $\bar{\theta}$ such that $\hat{s}=s$ , and $\bar{\theta}_{i}=\theta_{i}$ , $\forall i\in\left\{1,\ldots,s\right\}$ . We follow a similar approach to the bifurcation mechanism described in Section IV-B. Starting with one codevector $\hat{\phi}_{0}$ , we define $\bar{\theta}_{0}=\hat{\theta}_{0}$ . Every time a codevector $\hat{\phi}_{j}$ is split into a pair of perturbed codevectors, a new $\hat{\theta}_{j^{\prime}}$ is introduced. After convergence for a given $\lambda$ , merging of the codevectors is detected by (34). For the insertion of a new $\bar{\theta}_{i}$ we check the condition:

d_{\rho}(\hat{\theta}_{j},\bar{\theta}_{i})>\epsilon_{s},\ \forall j,

(36)

with respect to a given threshold $\epsilon_{s}$ . Notice that in contrast to (34), (36) does not depend on $\lambda$ . If (36) is satisfied, a new $\bar{\theta}_{i}$ is introduced and $\hat{s}\leftarrow\hat{s}+1$ . This process is integrated in the mode identification algorithm and its pseudocode is presented in Alg. 1.
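One possible reading of the rules (35) and (36) is sketched below: every local parameter vector is mapped to its nearest mode, and a new mode is created when it is farther than $\epsilon_{s}$ from all existing ones. The function names and the squared Euclidean distance are assumptions of this sketch, and other readings of (36) are possible.

```python
import numpy as np

def sq_dist(a, b):
    return float(np.sum((np.asarray(a, float) - np.asarray(b, float)) ** 2))

def assign_modes(theta_hat, theta_bar, eps_s, dist=sq_dist):
    """Map each local parameter vector to a mode index, adding modes as needed."""
    theta_bar = [np.asarray(t, float) for t in theta_bar]
    labels = []
    for th in theta_hat:
        d = [dist(th, tb) for tb in theta_bar]
        i = int(np.argmin(d))                  # nearest-mode rule, cf. (35)
        if d[i] > eps_s:                       # far from every known mode, cf. (36)
            theta_bar.append(np.asarray(th, float))
            i = len(theta_bar) - 1
        labels.append(i)
    return labels, theta_bar
```

For example, three regions with parameters close to $[1,2]^{\mathrm{T}}$, $[-1,0]^{\mathrm{T}}$, and $[1,2]^{\mathrm{T}}$ would be grouped into two modes, with the first and third regions sharing a label.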

Remark 5.

Note that $\left\{\hat{\theta}_{j}\right\}$ are only used as functions of $\bar{\theta}$ , and the parameters $\left\{\bar{\theta}_{i}\right\}$ are the ones that are being updated by the local system identification algorithm that will be presented in Section V.

Algorithm 1 Switched System Identification

Set parameters and initialize $\hat{\phi}=\left\{\hat{\phi}_{0}\right\}$ , $\bar{\theta}=\left\{\bar{\theta}_{0}\right\}$

while $K<K_{\textrm{max}}$ and $\lambda>\lambda_{\textrm{min}}$ do

Perturb $\hat{\phi}_{i}\leftarrow\left\{\hat{\phi}_{i}+\delta,\hat{\phi}_{i}-\delta\right\}$ , $\forall i$

Set $t\leftarrow 0$

repeat

Observe $x=(\psi,\phi)$ according to (11)

Update $\bar{\theta}_{w}$ , $w=\operatorname*{arg\,min}_{j}d_{\rho}(\phi,\hat{\phi}_{j})$ , using (39)

for $i=1,\ldots,K$ do

Update $\hat{\phi}$ using (30), (31)

end for

$t\leftarrow t+1$

until Convergence

Discard $\hat{\phi}_{i}$ if $\frac{1-\lambda}{\lambda}d(\hat{\phi}_{j},\hat{\phi}_{i})<\epsilon_{n}$ , $\forall i,j,i\neq j$

Insert $\hat{\theta}_{i}$ in $\bar{\theta}$ if $d_{\rho}(\hat{\theta}_{j},\hat{\theta}_{i})>\epsilon_{s},\ \forall j$

Lower temperature $\lambda\leftarrow\gamma\lambda$ , $0<\gamma<1$

end while

Define $\left\{\Sigma_{i}\right\}_{i=1}^{K}$ using (13)

Define $\hat{s}=\text{card}(\bar{\theta})$

Define $\left\{S_{i}\right\}_{i=1}^{\hat{s}}$ : $\Sigma_{j}\in S_{i}$ if $\hat{\theta}_{j}(\bar{\theta})=\bar{\theta}_{i}$

Estimated Model Parameters: $\hat{s}$ , $\left\{S_{i}\right\}_{i=1}^{\hat{s}}$ , $\left\{\bar{\theta}_{i}\right\}_{i=1}^{\hat{s}}$

V Piecewise Affine System Identification

In this section we review standard recursive system identification for estimating the parameters $\left\{\bar{\theta}_{i}\right\}$ of the local models given knowledge of the partition $\left\{S_{i}\right\}$ . We show that this kind of recursive identification can be formulated as a stochastic approximation algorithm, and that it can be combined using the theory of two-timescale stochastic approximation with the stochastic approximation method of estimating $\left\{S_{i}\right\}$ by $\left\{\Sigma_{i}\right\}$ as proposed in Section IV.

V-A Recursive Identification of Local Models

Recall that, given knowledge of the partition $\left\{S_{i}\right\}_{i=1}^{s}$ , each local linear model of the PWA system in (11) is completely defined by the parameters $\left\{\theta_{i}\right\}$ . In the following, we develop a stochastic approximation recursion to estimate $\left\{\bar{\theta}_{i}\right\}$ . First we define the error:

\epsilon(t)=\sum_{i}\mathds{1}_{\left[\phi_{t}\in S_{i}\right]}[\phi_{t}^{\mathrm{T}}\otimes I_{m}]\bar{\theta}_{i}-\psi_{t}.

(37)

A stochastic gradient descent approach aims to minimize the error:

\operatorname*{minimize}_{\bar{\theta}_{i}}~{}\frac{1}{2}\mathbb{E}\left[\|\epsilon(t)\|^{2}\right],

(38)

using the recursive updates:

\displaystyle\bar{\theta}_{i}(t+1)=\bar{\theta}_{i}(t)-\alpha(t)\left(\nabla_{\bar{\theta}_{i}}\epsilon(t)\right)\epsilon(t)=\bar{\theta}_{i}(t)-\alpha(t)[\phi_{t}^{\mathrm{T}}\otimes I_{m}]^{\mathrm{T}}\epsilon(t),

(39)

where $\sum_{n}\alpha(n)=\infty$ , $\sum_{n}\alpha^{2}(n)<\infty$ . Here the expectation is taken with respect to the joint distribution of $(\psi_{t},\phi_{t})$ as explained in Section III. This is a standard recursive identification method and constitutes a stochastic approximation sequence of the form:

\bar{\theta}_{i}(t+1)=\bar{\theta}_{i}(t)+\alpha(t)\left[h_{\theta}(\bar{\theta}_{i}(t))+M_{\theta}(t+1)\right],\ t\geq 0,

(40)

where $h_{\theta}(\bar{\theta}_{i})=-\nabla\frac{1}{2}\mathbb{E}\left[\|\epsilon(t)\|^{2}\right]$ is Lipschitz, and $M_{\theta}(t+1)=\nabla\frac{1}{2}\mathbb{E}\left[\|\epsilon(t)\|^{2}\right]-\nabla\frac{1}{2}\|\epsilon(t)\|^{2}$ is a martingale difference sequence. This sequence converges almost surely to the equilibrium of the differential equation

\dot{\bar{\theta}}_{i}=h_{\theta}(\bar{\theta}_{i}),\ t\geq 0,

(41)

which can be shown to be a solution of (38) with standard Lyapunov arguments. For more details the reader is referred to [39, 30]. Moreover, notice that (39) is a vectorized representation of (9), for $\gamma=\alpha(t)>0$ . Therefore, under the PE condition (10) of Assumption 10, and under the zero-mean noise assumption, it follows that $\bar{\theta}_{i}$ converges asymptotically to $\theta_{i}$ for all $i=1,\ldots,s$ , i.e., the minimum of (38) is achieved.
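A minimal numerical sketch of the update (39) for a single active mode is given below. The function name `local_model_step`, the noise-free data, the uniform sampling of the regressor, and the stepsize constants are assumptions chosen so that the demonstration is stable; they are not the settings used in the experiments.

```python
import numpy as np

def local_model_step(theta_bar, phi, psi, alpha):
    """One update of (39): theta <- theta - alpha * [phi^T (x) I_m]^T * eps."""
    m = psi.shape[0]
    Phi = np.kron(phi[None, :], np.eye(m))     # [phi^T (x) I_m], shape (m, d*m)
    eps = Phi @ theta_bar - psi                 # prediction error, cf. (37)
    return theta_bar - alpha * Phi.T @ eps

rng = np.random.default_rng(0)
theta_true = np.array([1.0, 2.0])              # slope and offset of a scalar affine map
theta = np.ones(2)                             # initial guess
for t in range(5000):
    r = rng.uniform(-4.0, 4.0)
    phi = np.array([r, 1.0])
    psi = np.array([theta_true @ phi])          # noise-free output for this demo
    theta = local_model_step(theta, phi, psi, alpha=0.1 / (1.0 + 0.001 * t))
```

The stepsize is kept below $2/\|\phi_{t}\|^{2}$ in this sketch so that each noise-free update does not increase the parameter error; larger initial stepsizes may require normalizing the regressor.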

V-B Combined Mode and Dynamics Identification

Recall that the mode identification method is based on the stochastic approximation updates (30) that can be written with respect to the vectors $\xi_{i}(t)=[\rho_{i}^{\mathrm{T}}(t)\sigma_{i}^{\mathrm{T}}(t)]^{\mathrm{T}}$ and a stepsize schedule $\beta(t)$ in the form:

\xi_{i}(t+1)=\xi_{i}(t)+\beta(t)\left[h_{\phi}\left(\xi(t),\bar{\theta}(t)\right)+M_{\phi}(t+1)\right],\ t\geq 0,

(42)

where $h_{\phi}$ is Lipschitz, $M_{\phi}(t)$ is a Martingale difference sequence and the dependence on $\bar{\theta}$ comes from the quantity $\hat{p}_{i}(t)$ in (31) given (35). At the same time, the recursive system identification technique to estimate $\bar{\theta}$ is a stochastic approximation sequence with a stepsize schedule $\alpha(t)$ of the form:

\bar{\theta}_{i}(t+1)=\bar{\theta}_{i}(t)+\alpha(t)\left[h_{\theta}\left(\xi(t),\bar{\theta}(t)\right)+M_{\theta}(t+1)\right],\ t\geq 0,

(43)

as given in (40). The dependence on $\xi$ comes through (37), since $\xi$ defines $\hat{\phi}$ , which defines $\left\{\Sigma_{i}\right\}_{i=1}^{K}$ through (13), which in turn defines $\left\{S_{i}\right\}_{i=1}^{\hat{s}}$ through the rule $\Sigma_{j}\in S_{i}$ if $\hat{\theta}_{j}(\bar{\theta})=\bar{\theta}_{i}$ .

Theorem 4 shows how the two recursive algorithms (42) and (43) can be combined using the theory of two-timescale stochastic approximation if $\nicefrac{{\beta(t)}}{{\alpha(t)}}\rightarrow 0$ , i.e., when the estimation of the partition $\left\{\Sigma_{i}\right\}_{i=1}^{K}$ is updated at a slower rate than the updates of the parameters $\left\{\bar{\theta}_{i}\right\}_{i=1}^{\hat{s}}$ .

Theorem 4.

Consider the sequence $\left\{\xi(t)\right\}_{t\in\mathbb{Z}_{+}}$ generated using the updates (42), where $\xi_{i}(t)=[\rho_{i}^{\mathrm{T}}(t)\sigma_{i}^{\mathrm{T}}(t)]^{\mathrm{T}}$ , and $(\rho_{i},\sigma_{i})$ are defined in (30). Consider the sequence $\left\{\bar{\theta}(t)\right\}_{t\in\mathbb{Z}_{+}}$ generated by the updates (43). Let the stepsizes $(\alpha(t),\beta(t))$ of (43) and (42), respectively, satisfy the conditions $\sum_{n}\alpha(n)=\sum_{n}\beta(n)=\infty$ , $\sum_{n}(\alpha^{2}(n)+\beta^{2}(n))<\infty$ , and $\nicefrac{{\beta(n)}}{{\alpha(n)}}\rightarrow 0$ , with the last condition implying that the iterations for $\left\{\xi(t)\right\}$ run on a slower timescale than those for $\left\{\bar{\theta}(t)\right\}$ . If the equation

\dot{\bar{\theta}}(t)=h_{\theta}(\xi,\bar{\theta}(t)),\ \bar{\theta}(0)=\bar{\theta}_{0},

(44)

has an asymptotically stable equilibrium $\lambda(\xi)$ for fixed $\xi$ and some Lipschitz mapping $\lambda$ , and the equation

\dot{\xi}(t)=h_{\phi}(\xi(t),\lambda(\xi(t))),\ \xi(0)=\xi_{0},

(45)

has an asymptotically stable equilibrium $\xi^{*}$ , then, almost surely, the sequence $(\xi(t),\bar{\theta}(t))$ generated by (42), (43), converges to $(\xi^{*},\lambda(\xi^{*}))$ .

Proof.

It follows directly from Theorem 2, Ch. 6 of [39]. ∎

Corollary 4.1.

Condition (44) of Theorem 4 is satisfied by the definition of $h_{\theta}$ in (41). Therefore, (45) implies the convergence of $\hat{\phi}$ through (31), and of the partition $\left\{\Sigma_{i}\right\}$ through (13).

Notice that the condition $\nicefrac{{\beta(t)}}{{\alpha(t)}}\rightarrow 0$ is of great importance. Intuitively, the stochastic approximation algorithm (42), (43) consists of two components running in different timescales, where the slow component is viewed as quasi-static when analyzing the behavior of the fast transient. In practice, the condition $\nicefrac{{\beta(t)}}{{\alpha(t)}}\rightarrow 0$ is satisfied by stepsizes of the form $(\alpha(t),\beta(t))=(\nicefrac{{1}}{{t}},\nicefrac{{1}}{{(1+t\log t)}})$ , or $(\alpha(t),\beta(t))=(\nicefrac{{1}}{{t^{\nicefrac{{2}}{{3}}}}},\nicefrac{{1}}{{t}})$ . Another way of achieving the two-timescale effect is to run the iterations for the slow component with stepsizes $\left\{\alpha_{t(k)}\right\}$ , where $t(k)$ is a subsequence of $t$ that becomes increasingly rare (i.e., $t(k+1)-t(k)\rightarrow\infty$ ), while keeping its values constant between these instants. A good policy is to combine both approaches and update the slow component with the slower stepsize schedule $\beta(t)$ along a subsequence, keeping its values constant in between (e.g., [30, 39]).
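The timescale-separation condition can be checked numerically for the two stepsize pairs mentioned above. The short script below is only a sanity check over a finite horizon (the horizon of $10^{6}$ steps is an arbitrary choice), not a proof of the limit.

```python
import numpy as np

t = np.arange(2, 10**6, dtype=float)
schedules = {
    "1/t vs 1/(1 + t log t)": (1.0 / t, 1.0 / (1.0 + t * np.log(t))),
    "t^(-2/3) vs 1/t": (t ** (-2.0 / 3.0), 1.0 / t),
}
# beta(t) / alpha(t) should shrink towards zero for a valid two-timescale pair
ratios = {name: beta / alpha for name, (alpha, beta) in schedules.items()}
for name, r in ratios.items():
    print(f"{name}: ratio at t=2 is {r[0]:.3f}, at the last step is {r[-1]:.3f}")
```

The first pair decays only logarithmically, as $1/\log t$, so the separation between the two timescales builds up slowly; the second pair decays polynomially, as $t^{-1/3}$.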

VI General Switched System Identification

So far, in Sections III, IV, and V we have developed a real-time identification method for PWA systems. Notice, however, that neither the proposed methodology, nor the algorithmic implementation of Alg. 1 are constrained to PWA systems, meaning that the proposed approach can, in principle, be applied to more general switching and hybrid systems. However, one must proceed with caution, as issues may arise with respect to the identifiability conditions, the mode-switching estimation error, and the possibly non-linear local system identification error. In this section, we discuss the applicability of the proposed approach in different cases often encountered in hybrid control systems, namely switched linear systems with non-polyhedral partition, and switched non-linear systems with polyhedral partition.

VI-A Switched linear systems with non-polyhedral partition.

In the case of linear local dynamics, the recursive identification method discussed in Section V-A remains unchanged, and the same convergence results hold as well. However, if the regions $S_{i}$ of the mode switching partition $\left\{S_{i}\right\}_{i=1}^{s}$ are non-polyhedral, they cannot be perfectly approximated by a finite union of polyhedral regions $\left\{\Sigma_{i}\right\}_{i=1}^{K}$ . Therefore, the mode switching estimation will have inherent non-zero error. It is worth pointing out that from the convergence results of the online deterministic annealing algorithm [36], it follows that the partition error can be arbitrarily small in the limit $K\rightarrow\infty$ (which is the case when $\lambda\rightarrow 0$ ). Albeit a nice analytical result, in practice there will always be non-zero error in the estimation of the partition $\left\{S_{i}\right\}_{i=1}^{s}$ .

We hereby discuss two ways to deal with this problem. The first is to assume the existence of a non-linear transformation that maps each $S_{i}$ to a polyhedral region $\bar{S}_{i}$ , and proceed with Alg. 1. General-purpose learning machines, such as artificial neural networks, can be incorporated in this process. Further assumptions and analysis are required for this method, which is beyond the scope of this paper. The second refers to mitigating the jumping effect of the identified system to decrease the closed-loop error that naturally occurs due to imperfect mode switching. To this end, recall that, given an observation $\phi_{t}$ , the dynamics of the identified model are given according to (11) by:

\hat{\psi}_{t}=[\phi_{t}^{\mathrm{T}}\otimes I_{m}]\bar{\theta}_{i},\text{ if }\phi_{t}\in\Sigma_{j}\text{ and }\hat{\theta}_{j}(\bar{\theta})=\bar{\theta}_{i}.

(46)

To mitigate the jumping behavior one can make use of the association probabilities

p(\phi_{i}|\phi_{t})=\frac{e^{-\frac{1-\lambda}{\lambda}d_{\rho}(\phi_{t},\phi_{i})}}{\sum_{j}e^{-\frac{1-\lambda}{\lambda}d_{\rho}(\phi_{t},\phi_{j})}},

(47)

to instead construct the weighted dynamics:

\hat{\psi}_{t}=\sum_{i=1}^{K}p^{*}(\phi_{i}|\phi_{t})[\phi_{t}^{\mathrm{T}}\otimes I_{m}]\hat{\theta}_{i}.

(48)

This jump-mitigation method has been used in the literature to preserve smoothness of the closed-loop dynamics and is particularly useful when hybrid system identification is used for non-linear function approximation, i.e., when the original system is not hybrid but is to be approximated by a hybrid system with simpler local dynamics.
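The weighted dynamics (48) can be sketched as a softmax-style blend of the local predictions. In the sketch below, the function name `soft_prediction` is hypothetical and the squared Euclidean distance is assumed as the Bregman divergence; the local parameter vectors are stacked row-wise in `thetas`.

```python
import numpy as np

def soft_prediction(phi_t, codevectors, thetas, lam):
    """Blend the K local affine predictions with the weights of (47), as in (48)."""
    d = np.sum((codevectors - phi_t) ** 2, axis=1)          # d_rho(phi_t, phi_i)
    w = np.exp(-(1.0 - lam) / lam * (d - d.min()))          # shift for stability
    w /= w.sum()
    m = thetas.shape[1] // phi_t.shape[0]                   # output dimension
    Phi = np.kron(phi_t[None, :], np.eye(m))                # [phi^T (x) I_m]
    return w @ (thetas @ Phi.T), w                          # prediction (m,), weights (K,)
```

At low temperature the weights concentrate on the nearest codevector and the blend approaches the hard-switching prediction (46), whereas a regressor equidistant from two codevectors receives the average of their local predictions.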

VI-B Switched non-linear systems with polyhedral partition.

In this case, often referred to as piece-wise non-linear hybrid systems [40], the mode switching partition $\left\{S_{i}\right\}_{i=1}^{s}$ is polyhedral, and can be perfectly approximated by a finite union of polyhedral regions $\left\{\Sigma_{i}\right\}_{i=1}^{K}$ . For the identification of the non-linear local dynamics, the recursive identification method discussed in Section V-A needs to be modified. In particular the recursive updates:

\displaystyle\bar{\theta}_{i}(t+1)

\displaystyle=\bar{\theta}_{i}(t)-\alpha(t)\left(\nabla_{\bar{\theta}_{i}}\epsilon(t)\right)\epsilon(t),

(49)

which have the same stochastic gradient descent structure as (39), are used, with the error term in this case given by

\epsilon(t)=\sum_{i}\mathds{1}_{\left[\phi_{t}\in S_{i}\right]}\hat{f}(\phi_{t},\bar{\theta}_{i})-\psi_{t},

(50)

where the functions $\hat{f}(\phi_{t},\bar{\theta}_{i})$ are local parametric models of known form, differentiable with respect to the parameters $\bar{\theta}_{i}$ . General-purpose learning machines, such as artificial neural networks, can be used. Notice that the identification updates remain stochastic approximation updates of the same form, which means that the convergence results of Theorem 4 continue to hold.
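To make the update (49) with the error (50) concrete, the sketch below performs one gradient step for a generic differentiable local model. The function name `nonlinear_step` and the scalar model $\hat{f}(\phi,\theta)=\theta_{0}\tanh(\theta_{1}\phi)$ are hypothetical examples chosen for illustration; any parametric model with an available Jacobian could be substituted.

```python
import numpy as np

def nonlinear_step(theta, phi, psi, alpha, f, jac):
    """One update of (49) for the active mode, with the error of (50)."""
    eps = f(phi, theta) - psi                        # shape (m,)
    # jac(phi, theta) has shape (m, n_params); its transpose is the gradient of eps
    return theta - alpha * jac(phi, theta).T @ eps

# Hypothetical scalar local model f(phi, theta) = theta_0 * tanh(theta_1 * phi).
def f(phi, theta):
    return np.array([theta[0] * np.tanh(theta[1] * phi)])

def jac(phi, theta):
    s = np.tanh(theta[1] * phi)
    return np.array([[s, theta[0] * phi * (1.0 - s ** 2)]])

theta0 = np.array([1.0, 1.0])
phi, psi = 0.5, np.array([2.0])
theta1 = nonlinear_step(theta0, phi, psi, alpha=0.01, f=f, jac=jac)
```

For a sufficiently small stepsize, a single step reduces the prediction error on the observation that generated it; unlike the affine case, the error surface is generally non-convex in the parameters.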

VII Experimental Results

We illustrate the properties and evaluate the performance of the proposed algorithm in two PWA systems, one in PWARX form and the other in state-space form.

VII-A PWARX System

The first one, adopted from [9], is given in the input–output representation of (51):

\displaystyle y_{t}=\begin{cases}\theta_{1}^{\mathrm{T}}\phi_{t}+e_{t},\quad\text{if }r_{t}\in P_{1}\\ \theta_{2}^{\mathrm{T}}\phi_{t}+e_{t},\quad\text{if }r_{t}\in P_{2}\\ \theta_{3}^{\mathrm{T}}\phi_{t}+e_{t},\quad\text{if }r_{t}\in P_{3}\end{cases},

(51)

where $y_{t}\in\mathbb{R}^{1}$ , $r_{t}\in P=[-4,4]$ , $\phi_{t}=[r_{t}\ 1]^{\mathrm{T}}$ , $(P_{1},P_{2},P_{3})=([-4,-1],(-1,2),[2,4])$ , and $(\theta_{1},\theta_{2},\theta_{3})=([1,2]^{\mathrm{T}},[-1,0]^{\mathrm{T}},[1,2]^{\mathrm{T}})$ . This example is chosen to showcase the properties of the proposed methodology, since its simplicity allows graphical representation of the signaling partition and the convergence of the model parameters. At the same time, it is a switching system that presents a jump at $r_{t}=2$ , and the same dynamics for different regions of the input space, i.e., $\theta_{1}=\theta_{3}$ while $P_{1}\neq P_{3}$ . System (51) can thus be written in the form:

\displaystyle y_{t}=\begin{cases}\theta_{2}^{\mathrm{T}}\phi_{t}+e_{t},\quad\text{if }\phi_{t}\in S_{2}\\ \theta_{1}^{\mathrm{T}}\phi_{t}+e_{t},\quad\text{otherwise}\end{cases},

(52)

where $S_{2}=\left\{\phi=[r\ 1]^{\mathrm{T}}:r\in P_{2}\right\}$ . The representation of (52) in the input–output ( $r-y$ ) space is shown in Fig. 2. A total of $N=150$ observations under Gaussian noise ( $e_{t}\sim N(0,0.2)$ ) are accessible sequentially.
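For readers who wish to reproduce the setting, the following sketch generates data from system (51). Two details are assumptions of this sketch and are not stated in the text: the inputs $r_{t}$ are drawn uniformly from $P$, and $0.2$ in $N(0,0.2)$ is interpreted as the noise variance.

```python
import numpy as np

THETA = {1: np.array([1.0, 2.0]), 2: np.array([-1.0, 0.0]), 3: np.array([1.0, 2.0])}

def region(r):
    """Index of the interval P_1=[-4,-1], P_2=(-1,2), P_3=[2,4] that contains r."""
    return 1 if r <= -1.0 else (2 if r < 2.0 else 3)

def pwarx_output(r, noise=0.0):
    """Output of (51) for input r with regressor phi = [r, 1]^T."""
    phi = np.array([r, 1.0])
    return THETA[region(r)] @ phi + noise

rng = np.random.default_rng(0)
N = 150
r = rng.uniform(-4.0, 4.0, size=N)                 # assumed input distribution
e = rng.normal(0.0, np.sqrt(0.2), size=N)          # assumes 0.2 is the variance
y = np.array([pwarx_output(ri, ei) for ri, ei in zip(r, e)])
```

Evaluating `pwarx_output` without noise on both sides of the boundaries shows that the map is continuous at $r=-1$ and discontinuous at $r=2$.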

Algorithm 1 is applied to the observations for $T=900$ iterations. The same observations can be reused by the algorithm. The temperature parameters used for the online deterministic annealing algorithm are $(\lambda_{\max},\lambda_{\min},\gamma)=(0.99,0.2,0.8)$ , and the stepsizes $(\alpha(t),\beta(t))=(\nicefrac{{1}}{{(1+0.01t)}},\nicefrac{{1}}{{(1+0.9t\log t)}})$ . At first ( $\lambda=\lambda_{\max}$ ), the algorithm keeps in memory only one codevector $\hat{\phi}_{1}$ and one model parameter vector $\bar{\theta}_{1}$ , essentially assuming that the system has constant dynamics in the entire domain, i.e., $\hat{S}_{1}=\Sigma_{1}=\left\{\phi=[r\ 1]^{\mathrm{T}}:r\in P\right\}$ . As new input–output pairs are observed, the estimated parameter $\bar{\theta}_{1}$ gets updated by the iterations (39). We have assumed $\bar{\theta}_{1}(0)=[1,1]^{\mathrm{T}}$ .

At the same time, the estimate of $\bar{\theta}_{1}$ is used to update the location of the codevector towards the mean of the observation domain as shown in (28). This process does not yield any accurate identification results since at this stage it is assumed that $P_{1}=P$ . However, it boosts the robustness of the identification algorithm with respect to initial conditions, since the converged values of the parameters for $\lambda=\lambda_{\max}$ will be used as initial conditions for the next value of $\lambda$ . As $\lambda$ is reduced, the bifurcation phenomenon described in Section IV-B takes place, and, after reaching a critical value, the single codevector splits into two duplicates. Now the algorithm assumes that there are two modes in the system and estimates the optimal model parameters $\left\{\bar{\theta}_{1},\bar{\theta}_{2}\right\}$ and partition $\left\{\Sigma_{1},\Sigma_{2}\right\}$ (through the location of the codevectors $\left\{\hat{\phi}_{1},\hat{\phi}_{2}\right\}$ ). This process continues until a desired termination criterion is reached. In this case it is the minimum temperature parameter $\lambda_{\min}$ , which reflects a potential time and computational constraint of the system. The bifurcation phenomenon is illustrated in Fig. 3, where the locations of the codevectors $\left\{\hat{\phi}_{i}\right\}$ , $\hat{\phi}_{i}\in P=[-4,4]$ , generated by Alg. 1 are shown. The algorithm progressively constructs a total of $K=5$ effective codevectors. The number of modes is estimated with the process explained in Section IV-C. Two modes are estimated, i.e., $\hat{s}=2$ with $\bar{\theta}=\left\{\bar{\theta}_{1},\bar{\theta}_{2}\right\}$ . The association of each effective codevector with each identified mode according to the rule (35) is shown in Fig. 3.

The final estimated partition, the output of the estimated model, and its error with respect to the true model without noise are shown in Fig. 4. As shown, the identification error is low with the exception of a single misclassification instance of the mode at the boundary of the true partition of the input–output domain. This mode switching error can be avoided by allowing $\lambda$ to go lower, which results in a larger number $K$ of effective codevectors and is indicative of the performance/complexity trade-off of the algorithm. Finally, the convergence of the parameters $\left\{\bar{\theta}_{i}\right\}$ of each of the $\hat{s}=2$ local models detected is shown in Fig. 5. Parameter values that do not appear at $t=0$ indicate that they belong to modes identified through the bifurcation phenomenon after a certain critical temperature value.

VII-B State-Space PWA system

The second system we evaluate is given by the following linearized PWA dynamics in the state-space domain:

\displaystyle\begin{cases}x_{t+1}&=(I_{2}+\textrm{d}t\begin{bmatrix}0&1\\ 0&0\end{bmatrix})x_{t}+\textrm{d}t\begin{bmatrix}0\\ 1\end{bmatrix}u_{t},\ \text{if }|u_{t}|>1\\ x_{t+1}&=(I_{2}+\textrm{d}t\begin{bmatrix}0&1\\ 0&-1\end{bmatrix})x_{t}+\textrm{d}t\begin{bmatrix}0\\ 0\end{bmatrix}u_{t},\ \text{if }|u_{t}|\leq 1\end{cases},

(53)

where $x_{t}\in\mathbb{R}^{2}$ and $u_{t}\in\mathbb{R}$ . System (53) has two modes ( $s=2$ ) and the switching signal is defined by the polyhedral regions $R_{1}=\left\{[x^{\mathrm{T}}|u^{\mathrm{T}}]^{\mathrm{T}}\in\mathbb{R}^{3}:u<-1\right\}$ , $R_{2}=\left\{[x^{\mathrm{T}}|u^{\mathrm{T}}]^{\mathrm{T}}\in\mathbb{R}^{3}:-1<u<1\right\}$ , and $R_{3}=\left\{[x^{\mathrm{T}}|u^{\mathrm{T}}]^{\mathrm{T}}\in\mathbb{R}^{3}:1<u\right\}$ with $S_{1}=R_{1}\bigcup R_{3}$ and $S_{2}=R_{2}$ . The dynamics of (53) consist of a controllable double integrator when the input is of sufficient magnitude, and a stable autonomous system that drives its velocity to zero, otherwise. An example of such a system can be an autonomous vehicle that avoids minimal, potentially accidental, gas pedal input. In this example, the linear system of the second mode ( $S_{2}$ ) is not minimal, and its identification relies on the mode switching behavior of the system, as explained in Section II-B. To preserve the PE conditions of Assumption 10, the input signal is chosen as $u_{t}=2\cos(2\pi t\cdot\textrm{d}t)$ , $t\in\mathbb{Z}_{+}$ , and the noise term $w_{t}$ is a zero-mean Gaussian random variable with $\sigma^{2}=0.1$ . The evolution of (53) over time, as well as the mode switching behavior, are shown in Fig. 6.
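The switching behavior of (53) under the chosen input can be reproduced with the short simulation below. The zero initial state, the horizon of $300$ steps with $\textrm{d}t=0.01$, and the omission of the noise term are assumptions of this sketch; noise is left out because (53) does not specify how $w_{t}$ enters the dynamics.

```python
import numpy as np

dt, N = 0.01, 300
A1 = np.eye(2) + dt * np.array([[0.0, 1.0], [0.0, 0.0]])    # |u| > 1: double integrator
B1 = dt * np.array([0.0, 1.0])
A2 = np.eye(2) + dt * np.array([[0.0, 1.0], [0.0, -1.0]])   # |u| <= 1: velocity decays
B2 = np.zeros(2)

u = 2.0 * np.cos(2.0 * np.pi * np.arange(N) * dt)           # input signal from the text
mode = np.where(np.abs(u) > 1.0, 1, 2)                      # active mode at each step
x = np.zeros((N + 1, 2))                                    # assumed zero initial state
for t in range(N):
    A, B = (A1, B1) if mode[t] == 1 else (A2, B2)
    x[t + 1] = A @ x[t] + B * u[t]
```

Since $|2\cos(\cdot)|>1$ holds for roughly two thirds of each period, the first mode is active about twice as often as the second one, and both modes are visited several times within the horizon.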

The system is allowed to run for $T=3s$ (seconds), with $\textrm{d}t=0.01$ , i.e., a total of $N=300$ observations are acquired online, based on which the proposed method identifies the switched system in real time. The temperature parameters used for the online deterministic annealing algorithm are $(\lambda_{\max},\lambda_{\min},\gamma)=(0.99,0.1,0.8)$ , and the stepsizes $(\alpha(t),\beta(t))=(\nicefrac{{1}}{{1+0.01t}},\nicefrac{{1}}{{1+0.9t\log t}})$ . At first ( $\lambda=\lambda_{\max}$ ), the algorithm keeps in memory only one codevector $\hat{\phi}_{1}$ and one model parameter vector $\bar{\theta}_{1}$ , essentially assuming that the system has constant dynamics in the entire domain. The estimated parameter $\bar{\theta}_{1}$ gets updated by the iterations (39). We have assumed $\bar{\theta}_{1}(0)=[1,1,1,1,1,1]^{\mathrm{T}}$ . As $\lambda$ is reduced, the bifurcation phenomenon described in Section IV-B takes place, and, after reaching a critical value, the single codevector splits into two duplicates. This process continues until the minimum temperature parameter $\lambda_{\min}$ is reached, which reflects a potential time and computational constraint of the system. The bifurcation phenomenon is illustrated in Fig. 7, where the third coordinate of the codevectors $\left\{\hat{\phi}_{i}\right\}$ , which gives an estimate of the control input representation of the mode, is depicted. A total of $K=4$ effective codevectors are estimated and the association of the regions $\left\{\Sigma_{j}\right\}_{j=1}^{4}$ with the identified modes $\left\{S_{i}\right\}_{i=1}^{2}$ is also depicted in Fig. 7.

The identification error and the estimated mode switching behavior are shown in Fig. 8 in comparison with the true mode switching behavior of the system. More specifically, the algorithm identifies a total of $\hat{s}=2$ modes with $S_{1}=\Sigma_{3}\bigcup\Sigma_{4}$ and $S_{2}=\Sigma_{1}\bigcup\Sigma_{2}$ . In Fig. 9, the convergence of the parameters $\left\{\bar{\theta}_{i}\right\}$ of each of the $\hat{s}=2$ detected local models to the actual parameters $\left\{\theta_{i}\right\}_{i=1}^{2}$ is shown. Parameter values that do not appear at $t=0$ indicate that they belong to modes identified through the bifurcation phenomenon after a certain critical temperature value.

VIII Conclusion

We proposed a real-time identification scheme for discrete-time switched state-space models. In contrast to most existing identification algorithms for piece-wise affine systems, the proposed approach is appropriate for online identification of both the modes and the subsystems of the switched system, and is computationally efficient compared to standard algebraic, mixed-integer programming, and offline clustering-based methods. The progressive nature of the algorithm also provides real-time control over the performance-complexity trade-off.

Future directions include extensions of the proposed approach to identification of both discrete- and continuous-time partially observable piece-wise affine models in the state-space domain using real-time observations.

References

[1] A. Garulli, S. Paoletti, and A. Vicino, “A survey on switched and piecewise affine system identification,” IFAC Proceedings Volumes, vol. 45, no. 16, pp. 344–355, 2012.
[2] D. Liberzon, Switching in Systems and Control. Springer, 2003, vol. 190.
[3] S. Paoletti, A. L. Juloski, G. Ferrari-Trecate, and R. Vidal, “Identification of hybrid systems: a tutorial,” European Journal of Control, vol. 13, no. 2-3, pp. 242–260, 2007.
[4] A. Moradvandi, R. E. Lindeboom, E. Abraham, and B. De Schutter, “Models and methods for hybrid system identification: a systematic survey,” IFAC-PapersOnLine, vol. 56, no. 2, pp. 95–107, 2023.
[5] A. Bemporad, G. Ferrari-Trecate, and M. Morari, “Observability and controllability of piecewise affine and hybrid systems,” IEEE Transactions on Automatic Control, vol. 45, no. 10, pp. 1864–1876, 2000.
[6] R. Vidal, A. Chiuso, and S. Soatto, “Observability and identifiability of jump linear systems,” in IEEE Conference on Decision and Control, vol. 4, 2002, pp. 3614–3619.
[7] M. Petreczky, L. Bako, and J. H. Van Schuppen, “Realization theory of discrete-time linear switched systems,” Automatica, vol. 49, no. 11, pp. 3337–3344, 2013.
[8] J. Roll, A. Bemporad, and L. Ljung, “Identification of piecewise affine systems via mixed-integer programming,” Automatica, vol. 40, no. 1, pp. 37–50, 2004.
[9] G. Ferrari-Trecate, M. Muselli, D. Liberati, and M. Morari, “A clustering technique for the identification of piecewise affine systems,” Automatica, vol. 39, no. 2, pp. 205–217, 2003.
[10] F. Lauer and G. Bloch, “Hybrid system identification,” Hybrid System Identification: Theory and Algorithms for Learning Switching Models, pp. 77–101, 2019.
[11] L. Bako, “Identification of switched linear systems via sparse optimization,” Automatica, vol. 47, no. 4, pp. 668–677, 2011.
[12] R. Vidal, S. Soatto, Y. Ma, and S. Sastry, “An algebraic geometric approach to the identification of a class of linear hybrid systems,” in IEEE International Conference on Decision and Control, vol. 1, 2003, pp. 167–172.
[13] R. Vidal, “Recursive identification of switched ARX systems,” Automatica, vol. 44, no. 9, pp. 2274–2287, 2008.
[14] M. Gegundez, J. Aroba, and J. M. Bravo, “Identification of piecewise affine systems by means of fuzzy clustering and competitive learning,” Engineering Applications of Artificial Intelligence, vol. 21, no. 8, pp. 1321–1329, 2008.
[15] H. Nakada, K. Takaba, and T. Katayama, “Identification of piecewise affine systems based on statistical clustering technique,” Automatica, vol. 41, no. 5, pp. 905–913, 2005.
[16] C. Li, Z. Huang, Y. Wang, and H. Jiang, “Rapid identification of switched systems: A data-driven method in variational framework,” Science China Technological Sciences, vol. 64, no. 1, pp. 148–156, 2021.
[17] A. L. Juloski, S. Weiland, and W. M. H. Heemels, “A Bayesian approach to identification of hybrid systems,” IEEE Transactions on Automatic Control, vol. 50, no. 10, pp. 1520–1533, 2005.
[18] L. Bako, K. Boukharouba, E. Duviella, and S. Lecoeuche, “A recursive identification algorithm for switched linear/affine models,” Nonlinear Analysis: Hybrid Systems, vol. 5, no. 2, pp. 242–253, 2011.
[19] R. Baptista, J. Y. Ishihara, and G. A. Borges, “Split and merge algorithm for identification of piecewise affine systems,” in American Control Conference, 2011, pp. 2018–2023.
[20] A. M. Ivanescu, T. Albin, D. Abel, and T. Seidl, “Employing correlation clustering for the identification of piecewise affine models,” in Workshop on Knowledge Discovery, Modeling and Simulation, 2011, pp. 7–14.
[21] M. G. Sefidmazgi, M. M. Kordmahalleh, A. Homaifar, and A. Karimoddini, “Switched linear system identification based on bounded-switching clustering,” in American Control Conference, 2015, pp. 1806–1811.
[22] A. Bemporad, A. Garulli, S. Paoletti, and A. Vicino, “A bounded-error approach to piecewise affine system identification,” IEEE Transactions on Automatic Control, vol. 50, no. 10, pp. 1567–1580, 2005.
[23] C. N. Mavridis and J. S. Baras, “Identification of piecewise affine systems with online deterministic annealing,” in IEEE Conference on Decision and Control, 2023, pp. 4885–4890.
[24] C. N. Mavridis, A. Kanellopoulos, J. S. Baras, and K. H. Johansson, “State-space piece-wise affine system identification with online deterministic annealing,” in European Control Conference, 2024, pp. 3110–3115.
[25] S. Weiland, A. L. Juloski, and B. Vet, “On the equivalence of switched affine models and switched ARX models,” in IEEE Conference on Decision and Control, 2006, pp. 2614–2618.
[26] S. Paoletti, A. Garulli, J. Roll, and A. Vicino, “A necessary and sufficient condition for input-output realization of switched affine state space models,” in IEEE Conference on Decision and Control, 2008, pp. 935–940.
[27] S. Paoletti, J. Roll, A. Garulli, and A. Vicino, “On the input-output representation of piecewise affine state space models,” IEEE Transactions on Automatic Control, vol. 55, no. 1, pp. 60–73, 2009.
[28] M. Petreczky, L. Bako, and J. H. van Schuppen, “Identifiability of discrete-time linear switched systems,” in ACM International Conference on Hybrid Systems: Computation and Control, 2010, pp. 141–150.
[29] C. N. Mavridis and J. S. Baras, “Online deterministic annealing for classification and clustering,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 10, pp. 7125–7134, 2023.
[30] C. Mavridis and J. S. Baras, “Annealing optimization for progressive learning with stochastic approximation,” IEEE Transactions on Automatic Control, vol. 68, no. 5, pp. 2862–2874, 2023.
[31] C. N. Mavridis, N. Suriyarachchi, and J. S. Baras, “Maximum-entropy progressive state aggregation for reinforcement learning,” in IEEE Conference on Decision and Control, 2021, pp. 5144–5149.
[32] C. N. Mavridis and J. S. Baras, “Progressive graph partitioning based on information diffusion,” in IEEE Conference on Decision and Control, 2021, pp. 37–42.
[33] C. N. Mavridis, N. Suriyarachchi, and J. S. Baras, “Detection of dynamically changing leaders in complex swarms from observed dynamic data,” in International Conference on Decision and Game Theory for Security. Springer, 2020, pp. 223–240.
[34] K. Rose, “Deterministic annealing for clustering, compression, classification, regression, and related optimization problems,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2210–2239, 1998.
[35] C. Mavridis, E. Noorani, and J. S. Baras, “Risk sensitivity and entropy regularization in prototype-based learning,” in Mediterranean Conference on Control and Automation, 2022, pp. 194–199.
[36] C. Mavridis and J. Baras, “Multi-resolution online deterministic annealing: A hierarchical and progressive learning architecture,” arXiv preprint arXiv:2212.08189, 2022.
[37] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, “Clustering with Bregman divergences,” Journal of Machine Learning Research, vol. 6, no. Oct, pp. 1705–1749, 2005.
[38] X. Lin, Z. Yang, X. Zhang, and Q. Zhang, “Continuation path learning for homotopy optimization,” in International Conference on Machine Learning. PMLR, 2023, pp. 21288–21311.
[39] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint. Springer, 2009, vol. 48.
[40] F. Lauer and G. Bloch, “Switched and piecewise nonlinear hybrid system identification,” in Hybrid Systems: Computation and Control: 11th International Workshop, HSCC 2008, St. Louis, MO, USA, April 22-24, 2008. Proceedings 11. Springer, 2008, pp. 330–343.
[41] B. M. Jenkins, A. M. Annaswamy, E. Lavretsky, and T. E. Gibson, “Convergence properties of adaptive systems and the definition of exponential stability,” SIAM Journal on Control and Optimization, vol. 56, no. 4, pp. 2463–2484, 2018.
[42] B. Anderson and J. Moore, “New results in linear system stability,” SIAM Journal on Control, vol. 7, no. 3, pp. 398–414, 1969.

Appendix A Proof of Theorem 1.

We construct the system

$$\hat{x}_{t+1}=\hat{A}x_{t}+\hat{B}u_{t},\quad t\in\mathbb{Z}_{+}, \tag{54}$$

where $\hat{A}\in\mathbb{R}^{n\times n}$ , and $\hat{B}\in\mathbb{R}^{n\times p}$ . Subtracting (7) from (54), we get:

$$e_{t+1}=\bar{\Theta}r_{t},\quad t\in\mathbb{Z}_{+}, \tag{55}$$

where $e_{t}=\hat{x}_{t}-x_{t}\in\mathbb{R}^{n}$ is the observation error, $r_{t}=[x_{t}^{\mathrm{T}}|u_{t}^{\mathrm{T}}]^{\mathrm{T}}\in\mathbb{R}^{n+p}$ is the augmented state-input vector as defined in (5), and $\bar{\Theta}=[(\hat{A}-A)|(\hat{B}-B)]$ is an augmented matrix of the system parameters of size $n\times(n+p)$ . Then (9) is equivalent to:

$$\bar{\Theta}_{t+1}=\bar{\Theta}_{t}-\gamma e_{t+1}r_{t}^{\mathrm{T}},\quad t\geq 0. \tag{56}$$

Notice that (56) can be written in the form of a linear time-varying dynamical system:

$$\bar{\Theta}_{t+1}=\bar{\Theta}_{t}(I_{n+p}-\gamma r_{t}r_{t}^{\mathrm{T}}),\quad t\geq 0. \tag{57}$$

By vectorizing $\bar{\Theta}_{t}$ such that $\bar{\theta}_{t}=\text{vec}(\bar{\Theta}_{t})$ , (57) becomes:

$$\bar{\theta}_{t+1}=(I_{n(n+p)}-\gamma\psi_{t}\psi_{t}^{\mathrm{T}})\bar{\theta}_{t}=\Xi_{t}\bar{\theta}_{t},\quad t\geq 0, \tag{58}$$

where $\otimes$ denotes the Kronecker product, and $\psi_{t}=[r_{t}^{\mathrm{T}}\otimes I_{n}]^{\mathrm{T}}$ is an $n(n+p)\times n$ matrix. We will show that (58) is exponentially stable in the large (Definition 1, [41]) as long as (8) is satisfied. Consider the Lyapunov function candidate $V(t,\bar{\theta})=\bar{\theta}_{t}^{\mathrm{T}}\bar{\theta}_{t}$. Clearly, there exist $k_{1},k_{2}>0$ such that $k_{1}\|\bar{\theta}\|^{2}\leq V(t,\bar{\theta})\leq k_{2}\|\bar{\theta}\|^{2}$. Notice that $V(t+1,\bar{\theta}_{t+1})-V(t,\bar{\theta}_{t})=\bar{\theta}_{t}^{\mathrm{T}}\left(\Xi_{t}^{\mathrm{T}}\Xi_{t}-I_{n(n+p)}\right)\bar{\theta}_{t}$. As a result, by summing the differences over the window $\tau=t,\ldots,t+T$, we get:

$$\begin{aligned}
V(t+T+1,\bar{\theta}_{t+T+1})-V(t,\bar{\theta}_{t})
&=\sum_{\tau=t}^{t+T}\left[V(\tau+1,\bar{\theta}_{\tau+1})-V(\tau,\bar{\theta}_{\tau})\right]\\
&=\sum_{\tau=t}^{t+T}\bar{\theta}_{\tau}^{\mathrm{T}}\left(\Xi_{\tau}^{\mathrm{T}}\Xi_{\tau}-I_{n(n+p)}\right)\bar{\theta}_{\tau}\\
&=\bar{\theta}_{t}^{\mathrm{T}}\left[\sum_{\tau=t}^{t+T}\Phi(\tau;t)^{\mathrm{T}}\left(\Xi_{\tau}^{\mathrm{T}}\Xi_{\tau}-I_{n(n+p)}\right)\Phi(\tau;t)\right]\bar{\theta}_{t}\\
&\leq-\alpha_{1}\bar{\theta}_{t}^{\mathrm{T}}I_{n(n+p)}\bar{\theta}_{t}=-\alpha_{1}V(t,\bar{\theta}_{t}),
\end{aligned} \tag{59}$$

for some $0<\alpha_{1}<1$. Here $\Phi(\tau;t)=\Xi_{\tau-1}\cdots\Xi_{t+1}\Xi_{t}$, with $\Phi(t;t)=I_{n(n+p)}$, is the transition matrix of (58), so that $\bar{\theta}_{\tau}=\Phi(\tau;t)\bar{\theta}_{t}$, and the inequality follows from condition (8). Notice that the first inequality in (8) is equivalent to $\alpha I_{n+p}\preceq\sum_{\tau=t}^{t+T}r_{\tau}r_{\tau}^{\mathrm{T}}$ and directly implies that $\alpha_{2}I_{n(n+p)}\preceq\sum_{\tau=t}^{t+T}\psi_{\tau}\psi_{\tau}^{\mathrm{T}}$ for some $\alpha_{2}>0$ as well. As a result, $\sum_{\tau=t}^{t+T}\Xi_{\tau}^{\mathrm{T}}\Xi_{\tau}\preceq\alpha_{3}TI_{n(n+p)}$ for some $0<\alpha_{3}<1$, and, therefore, $\sum_{\tau=t}^{t+T}\left(\Xi_{\tau}^{\mathrm{T}}\Xi_{\tau}-I_{n(n+p)}\right)\preceq-\alpha_{4}TI_{n(n+p)}$ for some $0<\alpha_{4}<1$. This in turn implies that $\sum_{\tau=t}^{t+T}\Phi(\tau;t)^{\mathrm{T}}\left(\Xi_{\tau}^{\mathrm{T}}\Xi_{\tau}-I_{n(n+p)}\right)\Phi(\tau;t)\preceq-\alpha_{1}I_{n(n+p)}$ for some $0<\alpha_{1}<1$ [42]. Notice that the second inequality of (8) is necessary to ensure non-singularity of the transition matrix $\Phi(\tau;t)$ [41]. Finally, as an immediate result of (59), $V(t+T+1,\bar{\theta}_{t+T+1})\leq(1-\alpha_{1})V(t,\bar{\theta}_{t})$, $\forall t\geq 0$, which implies uniform asymptotic stability in the large, and, due to linearity, exponential stability in the large.
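
The argument above can be illustrated numerically. The following is a minimal sketch, not the implementation used in the experiments: it applies the update (56) to data generated by a single linear mode and records the Frobenius norm of the parameter error $\bar{\Theta}_{t}$. The matrices `A` and `B`, the uniformly random input, the gain `gamma`, and the function names `identify` and `vectorized_step_matches` are all assumptions chosen for illustration. The values are picked so that the regressor stays bounded and exciting, in the spirit of condition (8). The second function checks that (57) and (58) describe the same step under column-major vectorization.

```python
import numpy as np


def identify(A, B, gamma=0.05, steps=3000, seed=0):
    """Run the gradient update (56) on data from x_{t+1} = A x_t + B u_t."""
    rng = np.random.default_rng(seed)
    n, p = B.shape
    theta_true = np.hstack([A, B])        # Theta = [A | B], size n x (n+p)
    theta_hat = np.zeros((n, n + p))      # initial estimate [A_hat | B_hat]
    x = np.zeros(n)
    err = np.empty(steps + 1)             # Frobenius norm of Theta_bar_t
    err[0] = np.linalg.norm(theta_hat - theta_true)
    for t in range(steps):
        u = rng.uniform(-1.0, 1.0, size=p)       # random input (assumption)
        r = np.concatenate([x, u])               # r_t = [x_t; u_t]
        x_next = theta_true @ r                  # true next state
        e = theta_hat @ r - x_next               # e_{t+1} = Theta_bar_t r_t
        theta_hat = theta_hat - gamma * np.outer(e, r)   # update (56)
        x = x_next
        err[t + 1] = np.linalg.norm(theta_hat - theta_true)
    return theta_hat, err


def vectorized_step_matches(theta_bar, r, gamma):
    """Check that (57) and (58) agree, using column-major vec."""
    n = theta_bar.shape[0]
    m = r.size
    psi = np.kron(r.reshape(-1, 1), np.eye(n))       # psi_t, size n(n+p) x n
    xi = np.eye(n * m) - gamma * psi @ psi.T         # Xi_t
    step_57 = theta_bar @ (np.eye(m) - gamma * np.outer(r, r))
    return np.allclose(step_57.flatten(order="F"),
                       xi @ theta_bar.flatten(order="F"))


# Hypothetical stable single-mode system with n = 2 states and p = 2 inputs.
A = np.array([[0.5, 0.2], [-0.1, 0.6]])
B = np.eye(2)
theta_hat, err = identify(A, B)
print("initial error:", err[0], "final error:", err[-1])
```

With these choices $\gamma\|r_{t}\|^{2}$ remains below 2, so each $\Xi_{t}$ is non-expansive and the error norm does not increase from one step to the next, while the random input supplies the excitation that makes it decay over a window.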
