On Sequential Learning for Parameter Estimation in Particle Algorithms for State-Space Models

Particle methods, also known as Sequential Monte Carlo, have become ubiquitous for Bayesian inference in state-space models, particularly in nonlinear non-Gaussian scenarios. However, in many practical situations the state-space model contains unknown parameters that must be estimated simultaneously with the state. In this paper we discuss a sequential analysis for combined parameter and state estimation. An online learning method is proposed that approximates the distribution of the model parameters by tuning a flexible proposal mixture distribution to minimize the Kullback-Leibler divergence between the two. We derive the sequential learning method using a truncated Dirichlet process normal mixture and present a general algorithm within the auxiliary particle filtering framework. The proposed algorithm is verified on a blind deconvolution problem, a typical state-space model with unknown parameters. Furthermore, in a more challenging application that we call meta-modulation, essentially a more complex blind deconvolution problem with sophisticated system evolution equations, the proposed method performs satisfactorily and achieves a promising result for high-efficiency communication.


Introduction
The state-space model, a class of probabilistic graphical models (Koller and Friedman, 2009) that describes the dependence between an unobserved state variable and an observed measurement, is a fundamental model for statistical inference with diverse applications in fields such as statistics, econometrics, and information engineering (West and Harrison, 1997; Cappé et al., 2005). A linear or 'weakly' nonlinear state-space model with Gaussian noise can be easily solved using the Kalman filter and its derivatives. In nonlinear non-Gaussian scenarios, particle methods (also known as Sequential Monte Carlo methods) have proven to be the most successful approach for numerical approximation and Bayesian inference of the unknown state, because of their simplicity, flexibility, and ease of implementation (Gordon et al., 1993; Liu and Chen, 1998; Doucet et al., 2000; Liu, 2001; Doucet et al., 2001). In most previous research, the inference has focused on filtering or smoothing for the state, under the assumption that the model parameters in the state-space model are known. However, in practical situations the model may contain unknown parameters that need to be estimated simultaneously with the state, and it might even be the case that inference for the model parameters is the primary problem of interest.
One early and straightforward way to deal with this problem is to extend the original state to an augmented state that includes the state and the parameters together, and then to apply a standard particle filter to perform inference for both. However, this naive approach is recognized to be inefficient, because the parameter space is not well explored (Kitagawa, 1998; Liu and West, 2001). Consequently, various improved methods have been developed over the past fifteen years (refer to Kantas et al. (2015) for a thorough review): maximum likelihood methods have been developed with different Monte Carlo evaluations of the likelihood of the model parameters (Hürzeler and Künsch, 2001; DeJong et al., 2013), and gradient-based optimization (Ionides et al., 2006; Ionides et al., 2011) or expectation-maximization methods (Andrieu et al., 2005; Cappé, 2009) have been introduced for online or offline estimation of the model parameters. The maximum likelihood approach generally converges rather slowly, but it may be a good choice for large data sets because of its low complexity. Bayesian methods apply directly to the augmented states, with Markov chain Monte Carlo steps utilized to improve the inference for the model parameters (Gilks and Berzuini, 2001; Fearnhead, 2002; Andrieu et al., 2010). In the inspiring work by Liu and West (2001), an artificial dynamic is introduced for the static model parameters, and a kernel density estimation method is proposed to capture the density of the parameters. The key idea of these authors is to use a shrinkage strategy for the kernel locations together with variance inflation, which removes the problem of information loss over time (West, 1993). However, the learning method proposed in that work is somewhat ad hoc and weak in explaining the underlying driving strategy. In this paper, a stochastic algorithm is proposed that tunes a flexible proposal mixture distribution to minimize the Kullback-Leibler divergence
between the distribution of the unknown model parameters and the proposal mixture: this leads to an online learning method for the numerical approximation of the distribution of the model parameters. We derive the sequential learning method using a truncated Dirichlet process normal mixture, and present the general algorithm using the proposed learning approach within the auxiliary particle filtering framework. The proposed algorithm is verified and compared with related algorithms on a blind deconvolution problem, a typical state-space model with unknown model parameters (Liu and Chen, 1995). Furthermore, a novel and more challenging application is presented, which is essentially a more complex blind deconvolution problem with sophisticated system evolution equations. The proposed method performs satisfactorily in the demodulation of this system and obtains a promising result for high-efficiency communication.
The paper is organized as follows. In Section 2 we present the general state-space model and the auxiliary particle filter algorithm for Bayesian inference in this model. In Section 3 we introduce the sequential learning method and a general framework for joint parameter and state estimation. In Section 4 we verify the proposed method on a simple blind deconvolution problem and a more challenging problem in wireless communication. Finally, we summarize our methods in Section 5.

Problem Statement and Particle Filter Algorithm
A state-space model can be defined by the following two processes: the Markovian evolution equation, or state equation, x_t = f(x_{t−1}, w_t; θ), which defines the transition density p(x_t | x_{t−1}, θ), where x_t is the state vector at time t, θ is the model parameter vector, f(·) is the system evolution function and w_t is the system noise; and the observation equation y_t = h(x_t, v_t; θ), which defines the observation density p(y_t | x_t, θ), where h(·) is the observation function and v_t is the observation noise. These expressions cover a very broad class of interesting dynamic models (West and Harrison, 1997).
With a known parameter θ the task of sequential Bayesian inference is to estimate the posterior distribution p(x_t | y_0:t, θ). However, to apply particle methods, a more general approach is to estimate the sequence of joint posteriors p(x_0:t | y_0:t, θ) recursively, which leads to the fundamental recursion

p(x_0:t | y_0:t, θ) ∝ p(x_0:t−1 | y_0:t−1, θ) p(x_t | x_t−1, θ) p(y_t | x_t, θ),

initialised with some distribution, for example p(x_0). Particle filtering is a class of importance sampling and resampling techniques designed to give a numerical approximation of this recursion. We present the auxiliary particle filter (APF) (Pitt and Shephard, 1999) here, as it covers a class of particle filter algorithms (Kantas et al., 2015) and is widely used in parameter and state estimation (Liu and West, 2001; Flury and Shephard, 2011). Let the proposal be q(x_t, y_t | x_t−1, ϕ) = q(x_t | y_t, x_t−1, ϕ) q(y_t | x_t−1, ϕ), where q(x_t | y_t, x_t−1, ϕ) is a probability density function that is easy to sample from and q(y_t | x_t−1, ϕ) is a nonnegative function that can be evaluated. The auxiliary particle filter sequentially draws samples from q(x_t | y_t, x_t−1, ϕ) and calculates the importance weights

w_t^(i) ∝ p(y_t | x_t^(i), θ) p(x_t^(i) | x_t−1^(j_i), θ) / [ q(y_t | x_t−1^(j_i), ϕ) q(x_t^(i) | y_t, x_t−1^(j_i), ϕ) ].

At iteration t ≥ 1, for all i = 1, ..., N, the APF algorithm can be summarized as follows. Step 1: Sample an auxiliary index j_i from {1, ..., N} with probabilities proportional to w_t−1^(i) q(y_t | x_t−1^(i), ϕ). Step 2: Sample x_t^(i) from q(x_t | y_t, x_t−1^(j_i), ϕ) and compute the weights w_t^(i) as above. Step 3: Normalize the weights. In many applications the model parameter θ is unknown, and it might even be the main object of inference. A straightforward solution is to define an extended state that includes the state x_t and the parameter θ together. For example, in Liu and West (2001) an 'artificial evolution' equation for the model parameter θ is introduced, θ_t = θ_t−1 + ε, where ε ∼ N(0, Σ_t) with some specified variance matrix Σ_t, and a kernel density estimation method is proposed to capture the density of the parameter θ_t. However, the choice of the kernel density estimation method is critical in such a problem. Liu and West (2001) use a shrinkage strategy for the kernel locations
and variances in order to remove the over-dispersion of the variances. Although this method works satisfactorily with a carefully chosen shrinkage strategy, more efficient approaches are still worth exploring (Kantas et al., 2015).
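The APF iteration described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code; `transition_sample`, `likelihood` and `predictive_likelihood` are hypothetical model-specific callables standing in for p(x_t | x_{t−1}), p(y_t | x_t) and q(y_t | x_{t−1}, ϕ).

```python
import numpy as np

def auxiliary_particle_filter_step(particles, weights, y_t, transition_sample,
                                   likelihood, predictive_likelihood, rng):
    """One APF iteration for a generic state-space model (a minimal sketch;
    the three callables are model-specific assumptions, not the paper's API)."""
    N = len(particles)
    # Step 1: first-stage weights use the cheap predictive likelihood q(y_t | x_{t-1})
    first_stage = weights * predictive_likelihood(y_t, particles)
    first_stage /= first_stage.sum()
    # resample auxiliary indices j with these probabilities
    idx = rng.choice(N, size=N, p=first_stage)
    ancestors = particles[idx]
    # Step 2: propagate each selected particle through the transition kernel
    new_particles = transition_sample(ancestors, rng)
    # second-stage weights correct for the first-stage approximation
    new_weights = likelihood(y_t, new_particles) / predictive_likelihood(y_t, ancestors)
    new_weights /= new_weights.sum()
    return new_particles, new_weights
```

For a linear-Gaussian model, `predictive_likelihood` can be the exact density of y_t given x_{t−1}; for nonlinear models a cheap approximation (e.g. evaluating the likelihood at the predicted mean) is the usual choice.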

Sequential Learning
A sequential learning method is proposed here to deal with the online learning of the unknown distribution of the model parameters. Assume that the parameter vector θ follows some unknown distribution π(θ). We propose to utilize a parametric distribution q(θ; ψ) with controlling parameter ψ to approximate the unknown π(θ). To measure the distance between these two distributions, the widely used Kullback-Leibler divergence (KL-divergence) is employed:

D[π(θ) ∥ q(θ; ψ)] = ∫ π(θ) log ( π(θ) / q(θ; ψ) ) dθ.

The idea of the proposed sequential learning approach is to tune the parameter ψ by learning from samples of π(θ), enabling q(θ; ψ) to approximate π(θ) in the sense of minimizing the KL-divergence. Other criteria, such as moment matching, can also be applied to measure the closeness of two distributions (Ji, 2006), but from them it is not straightforward to derive an algorithm for the optimal controlling parameter of the proposal distribution. Under the KL-divergence, the optimal parameter ψ* which minimizes D[π(θ) ∥ q(θ; ψ)] can be obtained by finding the root of the derivative of D (if it exists):

h(ψ) = ∂D[π(θ) ∥ q(θ; ψ)] / ∂ψ = −∫ π(θ) ∂ log q(θ; ψ)/∂ψ dθ = 0.

A closed-form solution of this integral equation is generally intractable, as h(ψ) involves an integral with respect to the unknown distribution π(θ). However, suppose that we can obtain samples θ^(1), ..., θ^(N) from π(θ); then we can numerically evaluate h(ψ) by Monte Carlo integration:

ĥ(θ^(1:N); ψ) = −(1/N) Σ_{i=1}^{N} ∂ log q(θ^(i); ψ)/∂ψ,

where ĥ(θ^(1:N); ψ) can be viewed as a noisy 'observation' of h(ψ). One available approach for obtaining roots of h(ψ) = 0 when only noisy evaluations of h(ψ) are available is the Stochastic Approximation (SA) algorithm (Kushner and Yin, 1997).
The SA algorithm iteratively updates ψ to approximate its optimal value by the formula

ψ_{t+1} = ψ_t − r_t ĥ(θ_t^(1:N); ψ_t),

where t is an iteration index, {ξ_t} is a sequence of 'noise' (so that the Monte Carlo estimate ĥ(θ_t^(1:N); ψ_t) can be interpreted as the ground truth h(ψ_t) plus noise ξ_t), and {r_t} is a sequence of decreasing step sizes satisfying Σ_t r_t = ∞ and Σ_t r_t² < ∞. The proof of convergence of this Monte Carlo estimation based sequential learning algorithm is detailed in the appendix.
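The SA recursion above can be illustrated on the simplest possible case: fitting the mean of a fixed-variance Gaussian proposal q(θ; µ) = N(µ, σ²) to samples from π(θ). Here h(µ) = −E_π[(θ − µ)]/σ², whose root is µ* = E_π[θ], so the SA iterates should converge to the mean of π. This is a sketch under those stated assumptions; `sample_pi` is a hypothetical sampler standing in for the particle approximation of π(θ).

```python
import numpy as np

def sa_fit_gaussian_mean(sample_pi, sigma2=1.0, iters=2000, seed=0):
    """Stochastic-approximation root finding for
    h(mu) = d KL(pi || N(mu, sigma2)) / d mu = -(E_pi[theta] - mu) / sigma2."""
    rng = np.random.default_rng(seed)
    mu = 0.0
    for t in range(1, iters + 1):
        theta = sample_pi(rng, 50)             # N samples from pi(theta)
        h_hat = -np.mean(theta - mu) / sigma2  # Monte Carlo gradient estimate
        r_t = 1.0 / t                          # step sizes: sum r_t = inf, sum r_t^2 < inf
        mu = mu - r_t * h_hat                  # SA update psi_{t+1} = psi_t - r_t * h_hat
    return mu
```

With r_t = 1/t and σ² = 1 the recursion reduces to a running average of the per-iteration sample means, which makes the Robbins-Monro convergence easy to see in this special case.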
We assume that q(θ; ψ) is a truncated Dirichlet process (TDP) normal mixture (Ishwaran and James, 2001; Ji, 2009). Let ψ denote the set of controlling parameters (V_k, µ_k, Σ_k), k = 1, ..., K. A TDP normal mixture can be expressed as

q(θ; ψ) = Σ_{k=1}^{K} p_k N(θ; µ_k, Σ_k), with stick-breaking weights p_k = V_k ∏_{l<k} (1 − V_l),

where V_l ∈ (0, 1) can be initially drawn from a beta distribution Beta(α_0, β_0). The partial derivatives of D[π(θ) ∥ q(θ; ψ)] with respect to V_k, µ_k and Σ_k can be derived accordingly (refer to Ji (2009) for the details of the derivations). Given the samples Θ_t = {θ_t^(i)}_{i=1}^{N} from π(θ), these partial derivatives are approximated by Monte Carlo integration as in the previous section.
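The stick-breaking construction and the evaluation of the mixture density can be sketched as follows for univariate θ. This is an illustrative sketch of the proposal family only; the parameter names are hypothetical and the last weight absorbs the truncation remainder so the weights sum to one.

```python
import numpy as np

def stick_breaking_weights(V):
    """Convert stick-breaking fractions V_k in (0,1) into mixture weights
    p_k = V_k * prod_{l<k} (1 - V_l); the last weight takes the remainder
    so that the K truncated weights sum to one."""
    V = np.asarray(V, dtype=float)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    p = V * remaining
    p[-1] = 1.0 - p[:-1].sum()   # truncation: last stick takes what is left
    return p

def tdp_mixture_density(theta, V, mu, sigma2):
    """Density of a truncated DP normal mixture q(theta; psi) with scalar
    components N(mu_k, sigma2_k) (a sketch of the proposal family)."""
    p = stick_breaking_weights(V)
    comps = np.exp(-0.5 * (theta - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    return float(np.dot(p, comps))
```

In the multivariate case used in the paper, the normal components N(θ; µ_k, Σ_k) simply replace the scalar densities; the stick-breaking weights are unchanged.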

Joint Parameter and State Estimation
Let us return to the inference of the augmented states under a particle filter framework. Assuming the particle samples (x_t^(i), θ_t^(i)) and weights w_t^(i) (i = 1, ..., N) are available to represent the joint posterior p(x_t, θ | y_0:t), and using the sequential learning algorithm to approximate the unknown distribution of θ from the samples θ_t^(i) (i = 1, ..., N), we now have an extended version of the auxiliary particle filter algorithm, incorporating parameter and state estimation together. The resulting general algorithm is presented as follows.
Step 1: At iteration t, the N particles contain (x_t^(i), θ_t^(i), c_t^(i)) (where c_t^(i) is the label of the TDP normal mixture component to which θ_t^(i) belongs), with weights w_t^(i). Calculate the auxiliary first-stage weights as in the APF.
Step 2: Sample the auxiliary indices from the set {1, ..., N} with probabilities proportional to the first-stage weights, obtaining a new index set j and a new label set c_t^(j).
Step 3: Sample new parameter particles from the TDP normal mixture components indicated by the resampled labels.
Step 4: Sample the new states from the proposal q(x_t | y_t, x_t−1^(j_i), ϕ) given the resampled particles and parameters.
Step 5: Compute and normalize the importance weights as in the APF.
Step 6: Update the controlling parameter ψ of the TDP normal mixture by one step of the sequential learning (SA) recursion using the weighted parameter samples.
Step 7: If the stopping criterion is satisfied, then stop; otherwise, set t := t + 1 and go back to step 1.
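The overall loop can be sketched on a toy model as follows. This is an illustrative simplification under stated assumptions, not the authors' exact scheme: the model is a scalar AR(1) with unknown coefficient θ, a single Gaussian proposal q(θ; ψ) = N(µ, var) is refit by one SA step per iteration in place of the full TDP-mixture learning, and the shrinkage mixing of θ is a stand-in for sampling from mixture components.

```python
import numpy as np

def sl_apf_toy(y, N=500, seed=0):
    """Sketch of joint state/parameter estimation on the toy model
    x_t = theta * x_{t-1} + w_t, y_t = x_t + v_t (unit noise variances)."""
    rng = np.random.default_rng(seed)
    xs = rng.normal(size=N)                     # state particles
    thetas = rng.uniform(-1.0, 1.0, size=N)     # parameter particles
    w = np.full(N, 1.0 / N)
    mu, var = 0.0, 1.0                          # controlling parameter psi
    for t, y_t in enumerate(y, start=1):
        # Steps 1-2: first-stage weights from the predictive mean, then resample
        pred = thetas * xs
        g = np.exp(-0.5 * (y_t - pred) ** 2)    # q(y_t | x_{t-1}, theta)
        first = w * g
        first /= first.sum()
        idx = rng.choice(N, size=N, p=first)
        xs, thetas, g_anc = xs[idx], thetas[idx], g[idx]
        # Step 3: refresh parameters partly from the learned proposal
        thetas = 0.9 * thetas + 0.1 * rng.normal(mu, np.sqrt(var), size=N)
        # Step 4: propagate states through the transition kernel
        xs = thetas * xs + rng.normal(size=N)
        # Step 5: second-stage APF weights
        w = np.exp(-0.5 * (y_t - xs) ** 2) / g_anc
        w /= w.sum()
        # Step 6: one SA step on psi = (mu, var) toward the weighted samples
        r_t = 1.0 / t
        mu += r_t * (np.dot(w, thetas) - mu)
        var = max(1e-3, var + r_t * (np.dot(w, (thetas - mu) ** 2) - var))
    return float(np.dot(w, thetas)), float(np.dot(w, xs))
```

The point of the sketch is the interleaving: every filtering iteration both reweights the augmented particles and moves the proposal parameters ψ one SA step, so the parameter proposal sharpens as data accumulate.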

Blind Deconvolution
Blind deconvolution of source signals is a subject that has been widely studied and applied in various fields, such as wireless communications, sonar and radar systems, audio and acoustics, and image processing (Benveniste and Goursat, 1984; Haykin, 1994). Take wireless communication over a multipath fading channel as an example (Chen and Liu, 2000; Ali et al., 2002): the unknown information signal propagates through the channel, which mixes the signal as it passes through a filter. The blind deconvolution problem is to recover the independent sources from the mixed signal without any prior knowledge of the original signal or the conditions of the channel. This problem is a typical state-space model problem and can be expressed as

y_t = Σ_{i=0}^{q} h_i s_{t−i} + ϵ_t,

where the information bits s_t ∈ {−1, 1} are independent, identically distributed (i.i.d.) samples from a Bernoulli(0.5) distribution, [h_0, h_1, ..., h_q]^T is a set of coefficients representing the channel conditions, q is the number of multipaths, y_t is the received signal for decoding, and the noise ϵ_t are i.i.d. normal with mean zero and variance σ.
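Simulating observations from this model is straightforward; the sketch below generates i.i.d. ±1 bits, convolves them with a channel vector, and adds Gaussian noise calibrated to a target SNR (the function name and the SNR convention SNR = signal power / noise power are our assumptions, not from the paper).

```python
import numpy as np

def simulate_blind_deconvolution(n_bits, h, snr_db, seed=0):
    """Generate y_t = sum_i h_i * s_{t-i} + eps_t for the blind deconvolution
    model: i.i.d. Bernoulli(0.5) bits s_t in {-1, +1} passed through an
    unknown channel h, plus Gaussian noise scaled to a target SNR (in dB)."""
    rng = np.random.default_rng(seed)
    s = rng.choice([-1.0, 1.0], size=n_bits)            # information bits
    x = np.convolve(s, h)[:n_bits]                       # channel mixing
    signal_power = np.mean(x ** 2)
    noise_var = signal_power / (10 ** (snr_db / 10.0))   # SNR = P_signal / P_noise
    y = x + rng.normal(scale=np.sqrt(noise_var), size=n_bits)
    return s, y, noise_var
```

A filter for this model carries the delayed bits (s_t, ..., s_{t−q}) as the state; the channel vector h is the unknown parameter θ targeted by the sequential learning algorithm.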
In the simulation study, we consider the following settings: the number of particles in the particle filter is 200, the number of components in the TDP normal mixture is 20, the initial values of µ_k are randomly chosen from (−0.5, 1), and Σ_k are 0.1 · I, where I is an identity matrix. The noise variance σ is an important parameter in the blind deconvolution problem, and we use the signal-to-noise ratio (SNR) to measure the noise level. For the first study, the SNR is 14 dB and the coefficients of the channel are shown in Table 1. We run 10 trials, and each trial has 10,000 information bits. The sequential learning algorithm outputs an online estimate of the coefficients θ = [h_0, h_1, h_2]^T. At each iteration we estimate the mean value of θ using the weighted particles. The resulting trajectories from iteration 1 to 2000 are shown in Figure 1, which demonstrates the convergence rate and stability of the sequential learning algorithm. The mean values of each h_i over a long period (iterations 1,000 to 10,000) are shown in Table 1, which shows the high accuracy of our algorithm in estimating the model parameters. The performance of blind deconvolution in the above communication system can be measured by the bit error rate (BER) at different SNRs (Chen and Liu, 2000; Ali et al., 2002). We compare our proposed method (SL+APF) with a naive bootstrap particle filter (BPF) with augmented state (x, θ), a BPF with known θ, an APF with augmented state (x, θ), an APF with known θ, and the matched filter bound (MFB), which represents a reference line for the BER in a Gaussian noise channel and can be calculated following Ali et al. (2002).
Figure 2 shows the BERs for different SNRs (in dB). The APF based methods generally outperform the BPF based methods. The proposed SL+APF method outperforms the APF with augmented states, and the resulting BER is close to the one achieved by the APF when θ is known. Moreover, the APF with augmented states has an apparently higher SNR requirement for low BERs below 10^−3, while our proposed method overcomes this drawback because of the efficient learning of the channel coefficients.

Figure 1. Estimated trajectories of the channel coefficients from iteration 0 to 2000. The estimate of each h_i comes close to the true value within a few hundred iterations and remains stable thereafter.

Meta-modulation
In this example we consider a realistic state-space model with sophisticated system evolution equations. In the previous example, only the channel response, expressed through the coefficients h, was considered as the mixing mechanism. Here we consider a more general situation, where the impulse response consists of all the filters and the channel response during the entire transmission. Let h_t = [h_0, ..., h_{q_t}]^T represent all the filter responses in the transmitter, such as the pulse shaping filter and the transmitter filter; let h_c = [h_0, ..., h_{q_c}]^T be the channel response; and let h_r = [h_0, ..., h_{q_r}]^T be the response of all the filters in the receiver, such as the receiver filter and the matched filter.
The received signal then follows the same convolutional model as before, where h = [h_0, h_1, ..., h_Q]^T represents the convolution of all the impulse responses, h = h_t ⊗ h_c ⊗ h_r, and Q = q_t + q_c + q_r is the total number of filter coefficients.
Let S = [s_0, s_τ, s_{2τ}, ...]^T denote the transmitted i.i.d. bit sequence. In a traditional communication system, S is upsampled N times to S′ = [s_0, 0, ..., 0, s_τ, 0, ..., 0, s_{2τ}, ...]^T, and then S′ passes through a pulse shaping filter h(nτ/N) (n = 0, ..., N − 1) with duration τ; the transmitted waveform x is the convolution of S′ with this filter. To demodulate such a signal, one can apply the matched filter to the received noisy signal y, where y = x + ϵ, and then decide the value of each s_i using signal detection theory. Obviously, such a demodulation is not a blind deconvolution problem. On the other hand, consider the scenario in which S is transmitted Q times faster; then S becomes
S′ = [s_0, s_{τ/Q}, ..., s_{(Q−1)τ/Q}, s_τ, ...]^T. Let S′ pass through a pulse shaping filter h(qτ/Q) (q = 0, ..., Q − 1) with duration τ; we then obtain a truly convolved signal. Consequently, this communication system violates the Nyquist intersymbol interference criterion, and traditional matched filter and detection approaches fail to demodulate such a signal; however, we can appeal to blind deconvolution techniques to separate the information bits at the receiver. The notable advantage of this system is that the bandwidth of the modulated signal x, which depends mainly on the pulse shaping filter h(qτ/Q), is comparable with the bandwidth in the traditional system since the pulse duration τ is the same, but the information sequence S is transmitted Q times faster.
Furthermore, it can be verified that this system essentially belongs to a parallel Gaussian channel model. Let the vector S_{Q×1} represent Q information bits, and let H_{(2Q−1)×Q} be a matrix each row of which contains the filter coefficients h^T shifted by one additional time slot delay; then the received signal Y_{(2Q−1)×1} is given in matrix form by Y = HS + ϵ. It is straightforward to verify that the rank of the matrix H_{(2Q−1)×Q} is Q. Applying a singular value decomposition, H = UΛV, where U_{(2Q−1)×(2Q−1)} and V_{Q×Q} are unitary matrices and Λ_{(2Q−1)×Q} is a diagonal matrix of rank Q, we obtain

Y′ = ΛS′ + ϵ′,

where Y′ = U^{−1}Y, S′ = VS and ϵ′ = U^{−1}ϵ. Hence this system is equivalent to Q independent parallel channels (Tse and Viswanath, 2005; Cover and Thomas, 2006). Consequently, the transmission rate R of this communication system is bounded by the capacity of a parallel Gaussian channel model: R ≤ C = (Q/2) log_2(1 + SNR/Q) bits per transmission. We propose to name this high-efficiency communication strategy 'meta-modulation': it utilizes an ordinary modulation framework but allows heavy interaction, such as convolution in a certain domain of the information bits, thereby forming independent parallel channels to achieve extraordinarily high spectrum efficiency.
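The parallel-channel equivalence is easy to check numerically. The sketch below uses one standard construction of the convolution matrix (shifted copies of a length-Q filter placed in the columns of H; the filter values and Q = 8 are illustrative, not from the paper), confirms rank(H) = Q, and verifies that rotating by the SVD factors decouples the first Q observations into scalar channels Y′_k = λ_k S′_k + ϵ′_k.

```python
import numpy as np

# Build the (2Q-1) x Q convolution matrix H from a length-Q filter h.
Q = 8
rng = np.random.default_rng(0)
h = rng.normal(size=Q)                      # filter coefficients (illustrative)
H = np.zeros((2 * Q - 1, Q))
for j in range(Q):                          # column j = h delayed by j time slots
    H[j:j + Q, j] = h
assert np.linalg.matrix_rank(H) == Q        # full column rank, as claimed

# SVD: H = U diag(lam) Vt, with U orthogonal so U^{-1} = U^T.
U, lam, Vt = np.linalg.svd(H, full_matrices=True)
S = rng.choice([-1.0, 1.0], size=Q)         # Q information bits
eps = rng.normal(scale=0.1, size=2 * Q - 1)
Y = H @ S + eps                             # received signal Y = H S + eps
Y_rot = U.T @ Y                             # Y' = U^{-1} Y
S_rot = Vt @ S                              # S' = V S
# First Q rotated observations follow the decoupled model Y'_k = lam_k S'_k + eps'_k.
resid = Y_rot[:Q] - lam * S_rot
assert np.allclose(resid, (U.T @ eps)[:Q])
```

Since the rotated noise ϵ′ = U^T ϵ has the same Gaussian distribution as ϵ, the check confirms that the residual after removing λ_k S′_k is exactly the rotated noise, i.e. Q independent scalar Gaussian channels.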
In this simulation study, we consider the case where Q is sufficiently large; it is then nontrivial to demodulate the convolved signal, particularly when some of the coefficients are unknown. For example, the Viterbi algorithm can be modified to deal with deconvolution or blind deconvolution (Ali et al., 2001; Li, 2014), but it becomes infeasible for large Q, as the number of potential paths in the Viterbi algorithm increases exponentially. However, our proposed SL+APF method is well suited for the deconvolution of this meta-modulation system. The simulation is set up as follows: h_t is a Gaussian filter with q_t = 100, h_r is also a Gaussian filter with q_r = 100, and h_c = [0.925, 0.430, −0.201, −0.117, 0.091]. Parameter settings of the algorithm are kept the same as in the previous blind deconvolution example, except that we simulate 10^5 bits. The coefficients of h_t and h_r are predefined and known, while the channel response h_c is unknown at the receiver. We compare our proposed method with an APF with augmented state and an APF with known channel coefficients. The BER plots in Figure 3 show that the proposed SL+APF algorithm still performs well, whereas the APF with augmented state is very unstable because of insufficient exploration of the unknown channel coefficients. The simulation study also demonstrates that the proposed algorithm achieves BER < 10^−5 at SNR = 44 dB, which is approximately 4 dB higher than the theoretical SNR of 40.1 dB required to reach the capacity of 200 bits per transmission. To verify that meta-modulation also works over real-world physical channels, we implemented the communication process on a standard verification system, the universal software radio peripheral (USRP). The received signal is demodulated by the SL+APF algorithm. As shown in Figure 3, the BER becomes slightly higher because of quantization error in the real communication; nevertheless, the hardware verification shows the potential of the method in real-world applications. In conclusion, this is an exciting result: not only has such a high spectrum efficiency system rarely been reported, but the SNR required for demodulation is also moderately low and can be satisfied in practice.

Figure 3. Comparisons of BERs for different methods. Both the APF with known parameters and the proposed SL+APF algorithm deconvolve all 10^5 bits correctly when SNR > 42 dB, so no points appear at 44 dB for these two curves.

Figure 2. Comparisons of BERs for different methods. Both the APF with known parameters and the proposed SL+APF algorithm deconvolve all 10,000 bits correctly when SNR ≥ 12 dB, so no points appear at 12 dB and 14 dB for these two curves.

Table 1. The coefficients of the channel response and their estimates via the sequential learning algorithm.