On Estimating Non-standard Discrete Distributions Using Adaptive MCMC Methods

In this paper, we compare empirically the performance of some adaptive MCMC methods, that is, Adaptive Metropolis (AM) algorithm, Single Component Adaptive Metropolis (SCAM) algorithm and Delayed Rejection Adaptive Metropolis (DRAM) algorithm. The context is the simulation of non-standard discrete distributions. The performance criterion used is the precision of the frequency estimator. An application to a Bayesian hypothesis testing problem shows the superiority of the DRAM algorithm over the other considered sampling schemes.


Introduction
The principle of Markov chain Monte-Carlo (MCMC) approach is to construct an ergodic Markov chain whose stationary distribution corresponds exactly to the simulating density.To do this, several sampling schemes have been proposed and the best known one is the Metropolis-Hastings (M-H) algorithm, see (Hastings, 1970).This algorithm uses an acceptancerejection mechanism on values generated from some proposal distribution.Its main advantage is that the convergence is guaranteed whatever the used proposal density.But, a bad choice of this distribution may be disastrous and leads to a very slow convergence.This is often the case when the density of interest is complicated or of high dimension.
To overcome this limit, diverse adaptive versions of the M-H algorithm are proposed in the literature [for an overview, see (Bardenet, Cappé, Fort & Kégl, 2015), (Liang, Liu & Carroll, 2010) and (Vihola, 2012)].These methods allow to adjust, automatically, the parameters of the proposal distribution in order to speed up the convergence.However, most of them seem to be problematic due to the difficulty in ensuring the ergodicity of the simulated chains.Consequently, the theoretical convergence results of M-H algorithm will not be applicable.
The Adaptive Metropolis (AM) algorithm is one of the most successfully applied adaptive methods.Developed originally by (Haario, Saksman & Tamminen, 2001), this sampling scheme has the advantage to preserve the correct ergodicity properties [see (Saksman & Vihola, 2010)], which ensures the convergence of the chain to the target distribution.Its basic principle is to use multi-normal proposal density whose covariance matrix is tuned at each iteration.Recently, the AM algorithm has aroused the interest of many scientists and new variants had been proposed.Thus, to improve the performance of this sampler, particularly for high dimensions, (Haario, Saksman & Tamminen, 2005) have developed the Single Component Adaptive Metropolis (SCAM) algorithm.Likewise, (Haario, Laine, Mira & Saksman, 2006) have proposed the Delayed Rejection Adaptive Metropolis (DRAM) algorithm which combines both the AM sampler and the Metropolis-Hastings algorithm with delayed rejection (MHDR).
In the literature, the most frequently used criteria to test the efficiency of MCMC methods are both convergence rate and accuracy of considered estimators.Although many works had revealed the rapid convergence of M-H adaptive algorithms, few studies focused on the assessment of these methods in terms of the accuracy of the calculated estimators.
In this article, we suggest to compare and analyze, experimentally, the performances of AM, SCAM and DRAM algorithms, in the context of simulating non-standard discrete distributions.The performance criterion considered is the precision of the frequency estimator.Hence, the remaining part of this work is organized as follows.In section 2, we explain the details of the AM sampler.Section 3 contains a description of the SCAM algorithm.In section 4, the foundations of the DRAM method are given with its main characteristics.Finally, our empirical study is presented in section 5, where the results are analyzed and discussed.

AM algorithm
To properly set up the M-H algorithm, it is crucial to well choose the parameters of the proposal distribution and particularly the variance.Indeed, if the variance is weak, the acceptance rate will be high but the chain will converge very slowly because the steps are very small.In case of the opposite sense, if the variance is large, the algorithm will reject most of the proposed candidates and, consequently, the speed of its convergence will be weak.In general, the adjustment of the proposal variance is heuristic and it forms a quite tricky problem.
The AM algorithm aims to palliate this particular difficulty.The idea is to modify, at each iteration, the covariance matrix of the proposal distribution in order to correspond it to the target density.Just one obstacle is that the target covariance matrix is unknown in general.In order to get around this problem, the AM algorithm estimates on each iteration the target covariance matrix from the past values of the simulated chain.
Suppose that the AM algorithm is used to simulate a target density π and (x t , t = 0, 1, • • • ) specifies the obtained Markov chain.The first step is to fix out the initial covariance matrix C 0 .On the (t + 1) th iteration of the algorithm, a candidate is generated from a multi-normal proposal distribution, centered at the current state x t and with covariance matrix C t+1 defined by The term t 0 indicates the time from which the proposal covariance matrix is adjusted, the parameter s d is a scaling factor that depends to the dimension d of the states space, the term ϵ insures the non-singularity of the matrix C t+1 and I d denotes the identity matrix of order d.
Similarly to the M-H sampling scheme, the AM algorithm uses an acceptance/rejection mechanism to select proposed candidates through the following steps: 1. Set x 0 , C 0 , t 0 , s d and ϵ.
(b) generate a candidate y from the proposal distribution N d (x t , C t+1 ).
3. Set t ← t + 1 and go to step 2.
It is important to note that the process (x t , t = 0, 1, • • • ) is far to be markovien since its future development depends on the whole history.But its ergodicity was rigorously proved by (Haario, Saksman & Tamminen, 2001).

SCAM Algorithm
In the literature, the AM algorithm has been applied with success for various problems and its advantages are identified by many authors.However, it is shown by (Haario, Saksman & Tamminen, 2005) that this simulation scheme looses efficiency in high dimension problems.In this particular case, adjusting the covariance matrix leads to very complicated calculations that slow down the convergence of the algorithm.
To overcome these difficulties, (Haario, Saksman & Tamminen, 2005) have suggested a new version of the AM algorithm named the SCAM algorithm.Its special feature is that the chain future state, represented by a vector, is generated component by component using univariate proposal densities.These distributions are normal and whose variance is adjusted at each iteration similarly to the AM method.In this way, we pass from simulating a multivariate variable to simulating many univariate variables which is often an easier task.
Let π be a d-dimensional distribution, that we desire to simulate through the SCAM algorithm.Suppose ( t are the d components of the vector x t .On the (t + 1) th iteration of the algorithm and to simulate the i th coordinate of x t+1 , we generate a candidate from the normal distribution, centered at x i t , and with variance C i t+1 such that where C i 0 is the initial value of the variance, t 0 is the moment at which the variance is adjusted, s is a scaling factor whose value exceeds 1 and ϵ is a positive constant such that 0 < ϵ < 1.
In order to accept/reject the proposed candidate, SCAM algorithm uses the M-H dynamic.Thus, starting from x i t , the component x i t+1 is obtained by the following scheme: 1. Determine C i t+1 using the formula (2).
2. Generate a candidate y i from the proposal distribution N ) .
3. Compute the acceptance probability 4. Take As for the AM sampler, the generated process (x t , t = 0, 1, • • • ) is no more a Markov chain.However and according to (Haario, Saksman & Tamminen, 2005), this process meets the ergodicity property that ensures its convergence to the distribution of interest π.

DRAM Algorithm
Originally developed by (Haario, Laine, Mira & Saksman, 2006), the DRAM sampling scheme aims to combine the rapidity of the AM algorithm with the high performance of the MHDR algorithm.The latter was suggested by (Mira, 1998) and constitutes an improved version of the M-H algorithm whose aim is to reduce the variance of the computed estimators.The idea is to propose more than one candidate within the same iteration in order to increase the acceptance rate, see (Baffoun, Hajlaoui & Farhat, 2017).

MHDR Algorithm
Suppose that we desire to construct a Markov chain (x t , t = 0, 1, • • • ) whose invariant distribution is exactly π.Analogously to the M-H algorithm, on the (t + 1) th iteration of MHDR algorithm, the first step is to generate a candidate y 1 from the proposal distribution q 1 (x t , •).This candidate is accepted as a new state of the chain with the probability ] .
In case of rejection and instead of staying in the current state, as suggested by M-H algorithm, we pass to step two by generating a new candidate y 2 from the proposal density q 2 (x t , y 1 , •).This candidate is accepted with the probability ] .
This sampler can be applied with fixed or random number of steps by iteration, see (Mira, 1998).It is interesting to note that all acceptance probabilities are computed so that reversibility of the chain, with respect to the density of interest, is preserved separately at each step.

Combining MHDR and AM Algorithms
The DRAM sampling scheme is based on a similar concept as MHDR method.Except that the candidate of the first step is now generated by AM algorithm rather than by M-H algorithm.For this fact, a normal proposal distribution is used which is centered on the current value of the chain and whose variance is adjusted at each iteration.For the second step, (Haario, Laine, Mira & Saksman, 2006) recommend that the proposal density of the previous step is maintained while reducing its variance.
Let π be a d−dimensional distribution that we desire to simulate through the DRAM algorithm.On the (t + 1) th iteration of the algorithm, we use, for the first step, the proposal density q 1 = N d (x t , C t+1 ) where C t+1 specifies the covariance matrix computed according to eq.( 1).The proposal density of the second step is q 2 = N d (x t , αC t+1 ) with α is a scaling factor chosen from the interval ]0, 1[.Thus, the transition from the current state x t to the future state x t+1 is done as follows: 1. Set the initial state x 0 and the initial covariance matrix C 0 .
(b) generate a candidate y 1 from the proposal distribution q 1 .
(d) set x t+1 = y 1 with probability α 1 (x t , y 1 ), otherwise i. generate a candidate y 2 from the proposal density q 2 , ii. calculate the acceptance probability ] .
As for the AM algorithm, the simulated process (x t , t = 0, 1, • • • ) does not form a Markov chain since its future development depends to all its past states.However and according to (Haario, Laine, Mira & Saksman, 2006), this process complies with the ergodicity property that ensures its convergence to the distribution of interest π.

Experimental Study
This section aims to compare, empirically, the performances of AM, SCAM, and DRAM algorithms.For this purpose, we use a real application from (Liang & Liu, 2005).It is a Bayesian hypothesis testing problem which consists of checking if one of the coefficients of a logistic regression model is positive or not.

Data Description
In this study, the dataset is taken from (Ashford, 1959).It concerns the proportion of British coal miners who present symptoms of severe pneumoconiosis and the length of service in the colliery.Following (Liang & Liu, 2005), we consider the logistic regression model given by where β 0 , β 1 , and β 2 are the regression coefficients, z i is the number of years of service of the i th miner and y i is the binary indicator of severe pneumoconiosis.By assuming that β 0 , β 1 , and β 2 are independently distributed with a common normal density of mean 0 and variance σ 2 = 100 and if we let β be the column vector of these coefficients, the log-posterior density of β is defined (up to an additive constant) by where D is the set of all observations.
The aim of this study is to test the null hypothesis β 2 < 0. In (Liang & Liu, 2005), this problem is reduced to the comparison of the posterior probabilities π(A 1 ) and π(A 2 ), where A 1 = {β : β 2 < 0} and A 2 = {β : β 2 ≥ 0}.The solution consists of sampling at first, through a MCMC algorithm, an ergodic chain (x t , t = 0, 1, • • • ) whose invariant distribution is the posterior π(β|D) to approximate later π(A i ), i = 1, 2, with the frequency estimator, where n is the number of iterations performed and I(•) is the indicator function of A i .In this study, we suggest to resolve this problem in a similar manner using AM, SCAM and DRAM algorithms.The objective is to compare the performances of these adaptive MCMC methods in terms of the precision of the frequency estimator.

Implementation
To set up a MCMC method, it is important to fix out some parameters that largely affect the simulation results.The choice of these parameters is based mainly on the characteristics of the considered problem.In the sequel, we detail our implementation for each considered sampling scheme.

Parameterizing of the AM Algorithm
Contrary to the M-H algorithm where the choice of the proposal distribution is arbitrary, the AM sampler uses a multinormal proposal density.The proper application of this method requires optimization of the proposal covariance matrix in order to better approximate the target covariance matrix.
As explained in section 2, AM algorithm generates the first t 0 observations from a fixed candidate density, centered at the current state of the chain and with covariance matrix C 0 .(Haario, Saksman & Tamminen, 2001) recommend that this first part of the simulated chain should be relatively short in order to speed up the convergence of the algorithm.
For this application, we choose t 0 = 1000 and C 0 = diag[1, 0.001, 0.0001].Then, to construct the rest of the chain, the AM sampler uses a proposal distribution whose covariance matrix is updated at each iteration according to eq.( 1).The calculation of the latter requires to set the parameters s d and ϵ.Following (Haario, Saksman & Tamminen, 2001), we take the scaling parameter to be s d = (2.4) 2 d .Concerning the parameter ϵ, which is chosen to be small (compared to the dimension of the states space), we assign it a value of 0.001.
To simulate from π(β|D), we choose β 0 , the null column vector of order 3, as the starting point of the AM algorithm.On (t + 1) th iteration, a value y is generated from the candidate distribution N 3 (β t , C t+1 ) where β t is the current state of the chain and C t+1 is the covariance matrix defined by The value y is then accepted with the probability ] .
In case of rejection, the produced chain stays in its current state i.e. β t+1 = β t .

Parameterizing of the SCAM Algorithm
Recall that in case of the SCAM algorithm, the realizations are simulated component by component using proposal normal distributions.To generate a Markov chain with stationary distribution π(β|D), each iteration of this sampler scheme contains three steps.Given the current state of the chain β t = (β t 0 , β t 1 , β t 2 ) ′ , the first step consists of replacing the first coordinate β t 0 by β t+1 0 to thereby produce the vector (β t+1 0 , β t 1 , β t 2 ) ′ .The second coordinate of the latter is then replaced during the second step by β t+1 1 .The last step finishes the movement from β t to β t+1 by replacing the third coordinate β t 2 by β t+1 2 .During the 1000 first iterations, we use to simulate β t+1 i , for i = 0, 1, 2, a proposal normal distribution, centered at the current value β t i , and with variance C 0 whose value is set to 0.01.In the remaining iterations, the proposal variance C 0 is replaced by another whose value changes at each step using eq.( 2).In this equation the choice of the parameter s can largely affect the performance of the SCAM algorithm.So, we take s = 2.4, corresponding to the optimum value presented in (Haario, Saksman & Tamminen, 2005).Concerning the parameter ϵ, which should be included between 0 and 1, we set its value to 0.001.

Parameterizing of the DRAM Algorithm
In order to combine MHDR algorithm with AM method, several strategies had been proposed and experimented in the literature.Among them, we retain here the one recommended by (Haario, Laine, Mira & Saksman, 2006).Its principle is to simulate a first part of the chain using the standard MHDR algorithm with a multi-normal proposal distribution.For the rest of iterations, the proposal covariance matrix of the first step is updated, at each step, using AM method.
To simulate from the target density π(β|D), we employ the MHDR algorithm to generate the first 1000 observations.On the (t + 1) th iteration of the algorithm, the first step is to suggest a value y 1 from the candidate density q 1 = N 3 (β t , C 0 ) where β t is the current state of the chain and C 0 denotes the covariance matrix such that C 0 = diag[1, 0.001, 0.0001].In the second step of the algorithm, a new value y 2 is generated from the candidate distribution q 2 = N 3 (β t , M 0 ).As advised by (Green & Mira, 2001), we set M 0 = 0.5C 0 so that, in stationarity, the acceptance rate would be optimum.
In the remaining iterations, we use, in the first step of the MHDR algorithm, the AM method, and this in order to adjust the proposal covariance matrix.Thus, on the (t + 1) th iteration, a first value y 1 is generated from the proposal distribution q 1 = N (β t , C t+1 ) where C t+1 is the covariance matrix obtained by Recall that this proposal distribution is the same used to simulate from π(β|D) via the AM algorithm.During the second step of the sampling scheme, a second value y 2 is generated from the candidate density q 2 = N 3 (β t , M t+1 ) where M t+1 denotes the covariance matrix such that M t+1 = 0.5C t+1 .
As indicated in section 4, the acceptance probability of the proposition y 1 is defined by ] .
In case of rejection, the second proposed candidate y 2 becomes the new state of the chain with the probability ] .

Experimental Results and Interpretation
To simulate from the posterior distribution π(β|D), we run 110000 iterations using AM, SCAM and DRAM algorithms with an initial common value β 0 , the null column vector of order three.After an initial burn-in period of 10000 iterations, we consider that each generated chain has reached its stationary distribution.Then the quantities π(A 1 ) and π(A 2 ) are solely estimated on the basis of the remaining 100000 iterations.These estimates have been repeated 100 times in order to know their variability.The results given in table-1 present the mean values of estimates and their standard errors.Whereas, those in table-2 concern the acceptance rates calculated in (Baffoun, Hajlaoui & Farhat, 2017) and in this work.Examining table-1 shows that all the tested sampling schemes lead practically to the same estimated quantities π(A 1 ) and π(A 2 ).But in terms of precision, we can note some disparities in the obtained results.According to this criterion, DRAM algorithm seems to be the most attractive method while, the M-H algorithm appears to have the worst performances.In what follows, we will try to analyze and interpret these results function of the behavior of each considered algorithm, particularly, using the acceptance rate whose values are provided in table-2.
It is important to remark that the AM and SCAM algorithms provide very close results for the precision of the considered estimators.This finding is perfectly coherent with the literature that indicates for small dimension, the performances of AM sampling scheme are identical to those of SCAM method.This similarity appears even in the values of the computed acceptance rate for each algorithm.
Compared with M-H algorithm, AM and SCAM methods seem to offer a tangible improvement since the standard error is decreased about 15%.Moreover, we note that these two sampling schemes tend to reject fewer propositions than MH algorithm, thus resulting in relatively high acceptance rates, quite close to the optimum value of 0.25.This shows clearly that the tuning of the proposal distribution allowed a good exploration of the target density support and, at the same time, a decrease in the correlation between the states of a same chain.It is precisely this last point which explains the observed reduction of the standard error by referring to Peskun's asymptotic variance ordering (Peskun, 1973).
Considering the performances of DRAM algorithm, we can notice that the obtained results are clearly better than that of AM method.In fact, the standard error has been decreased at about 10% while the acceptance rate is increased at about 35%.This constatation indicates that the proposed jumps, during the second step of the algorithm, are very small and then frequently accepted.Intuitively, this augmentation of the number of accepted candidates would decrease the correlation between the states of the simulated chain.This fact justifies the observed reduction of the standard error under the Peskun ordering.
Compared to the M-H method, the DRAM algorithm gives appreciably better results since the standard error is decreased by more than 20%.Moreover, the low acceptance rate realized by the M-H scheme indicates that most of the proposed values are localized in low density zones of π(β|D).This leaves to suppose that the used proposal distribution is significantly different from the target density.The DRAM algorithm allowed overcoming this failure by adjusting the covariance matrix of the instrumental distribution so as to correspond with that of the target density.In addition, this algorithm has the advantage to propose more than one value per iteration, that ensures a better exploration of the support of the distribution of interest.Thus, in view of its exceptional proprieties, the DRAM scheme constitutes an interesting alternative to the M-H algorithm.

Conclusion
In this paper, three advanced MCMC methods: AM, SCAM and DRAM algorithms are studied experimentally in the context of estimating non-standard discrete distributions.The performance criterion used is the precision of the frequency estimator.The obtained results reflect the superiority of the DRAM algorithm over to the other considered sampling schemes.We also deduced that the performance of AM and SCAM algorithms are almost similar and significantly superior to that of the M-H sampler, whose results are presented in (Liang & Liu, 2005).All these results prove the high performance of adaptive MCMC methods, and more particularly the DRAM sampler.However, this powerful algorithm has the disadvantage to be expensive in terms of computing time.

Table 1 .
Estimation results obtained by M-H, AM, SCAM and DRAM methods.

Table 2 .
Acceptance rates of MCMC algorithms.