Characterization of the Bayesian Posterior Distribution in Terms of Self-information

It is well known that the classical Bayesian posterior arises naturally as the unique solution of different optimization problems, without the necessity of interpreting data as conditional probabilities and then using Bayes' Theorem. Here it is shown that the Bayesian posterior is also the unique minimax optimizer of the loss of self-information in combining the prior and the likelihood distributions, and is the unique proportional consolidation of the same distributions. These results, direct corollaries of recent results about conflations of probability distributions, further reinforce the use of Bayesian posteriors, and may help partially reconcile some of the differences between classical and Bayesian statistics.


Introduction
In statistics, prior belief about the value of an unknown parameter θ ∈ Θ ⊆ R^n, obtained from experiments or other methods, is often expressed as a Borel probability distribution P_0(·) on Θ ⊆ R^n called the prior distribution. New evidence or information about the value of θ, based on an independent experiment or survey, i.e. a random variable X, is recorded as a likelihood L(·) = p(·|X), the conditional distribution of θ given an observable X. Given the prior distribution P_0 and the likelihood distribution L, a posterior distribution P_1 = P_1(P_0, L) for θ incorporates the new likelihood information about θ into the information from the prior, thereby updating the prior.
Bayesian and likelihood inference in general does not require the prior and/or the likelihood to be normalizable. This is the case, for instance, for improper priors, which are often used to convey the notion of lack of prior information about the parameter. Similarly, the likelihood, while conceived as a parametric family of probability distributions over the data, need not even, in principle, be measurable with respect to the parameter space.
We will assume the prior P_0 and the likelihood L to be non-negative measures. Such measures will be discrete, yielding a mass function (p.m.f.), or absolutely continuous (w.r.t. the Lebesgue measure), yielding a density function (p.d.f.). In both cases we will require the prior and the likelihood to be compatible, i.e. such that the measure defined by the product of the p.m.f.'s (p.d.f.'s, respectively) is normalizable. Here and throughout we will assume θ to be real-valued (with generalizations to the multidimensional framework left to the interested reader).
In the classical framework, the posterior distribution P_1 is the Bayes posterior distribution obtained as the conditional distribution of θ given the new likelihood information, but the same Bayes posterior distribution has also been derived in several information-theoretic contexts. Shore and Johnson (1980) give axiomatic foundations for deriving various probabilistic rules; more specifically, the combining mechanism for the Bayes rule is expected utility in Bernardo (1979), an information processing rule in Zellner (1988), and a maximum entropy principle in Zellner (1996). More recently, the self-information loss, together with the Kullback-Leibler divergence, has been employed in a proper Bayesian setting to derive objective prior distributions for specific discrete parameter spaces (Villa and Walker, 2015) and to estimate the number of degrees of freedom in the t-distribution (Villa and Walker, 2014).
The main goal of this note is to complement those characterizations by applying recent results for conflations of probability distributions (Hill, 2011) to show that the Bayesian posterior is the unique posterior that minimizes the maximum loss of self-information in combining the prior and likelihood distributions. A secondary goal is to show that the Bayesian posterior is the unique posterior that is a proportional consolidation of the prior and likelihood distributions. As another direct corollary of recent results for conflations of probability distributions (Hill and Miller, 2011), the problem of identifying the best posterior when the prior and likelihood distributions are not weighted equally is addressed, complementing results in Zellner (2002). This new weighted posterior, the unique distribution that minimizes the maximum loss of weighted self-information, coincides with the classical Bayesian posterior if the prior and likelihood are weighted equally, but in general is different. We conclude with an open question regarding the minimax likelihood ratio of the prior and likelihood distributions.

Combining Priors and Likelihoods into Posteriors
There are many different methods for combining several probability distributions (e.g., see Genest and Zidek (1986); Hill (2011)), and in particular, for combining the prior distribution P_0 and the likelihood distribution L into a single posterior distribution P_1 = P_1(P_0, L). For example, the prior and likelihood could simply be averaged, i.e. P_1 = (P_0 + L)/2, perhaps reflecting additional knowledge that the prior and likelihood distributions resulted from two different independent experiments, only one of which is assumed to be the "correct" experiment, and it is not known which.
The classical Bayesian posterior distribution P_B is defined via Bayes' Theorem: if P_0 and L are discrete with p.m.f.'s p_0 and p_L respectively, then P_B is discrete with p.m.f.

p_B(θ) = p_0(θ) p_L(θ) / ∑_{t∈Θ} p_0(t) p_L(t),   θ ∈ Θ,

and if P_0 and L are absolutely continuous with probability density functions (p.d.f.'s) f_0 and f_L respectively, then P_B is absolutely continuous with p.d.f.

f_B(θ) = f_0(θ) f_L(θ) / ∫_Θ f_0(t) f_L(t) dt,   θ ∈ Θ.

The same results hold true for improper prior or likelihood distributions, provided the denominators are positive and finite.
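For concreteness, the following is a minimal numerical sketch of the discrete case (the helper name bayes_posterior and the specific numbers are illustrative, not from the paper): the posterior p.m.f. is the pointwise product of the prior and likelihood p.m.f.'s, normalized by the sum in the denominator above.

```python
# Illustrative sketch of the discrete Bayes posterior as the normalized
# pointwise product of prior and likelihood p.m.f.'s on a common finite support.
import numpy as np

def bayes_posterior(p0, pL):
    """Return p_B with p_B(θ) ∝ p0(θ) * pL(θ); requires a compatible pair."""
    p0, pL = np.asarray(p0, dtype=float), np.asarray(pL, dtype=float)
    product = p0 * pL
    total = product.sum()  # must be strictly positive and finite (compatibility)
    if not 0.0 < total < np.inf:
        raise ValueError("prior and likelihood are not compatible on this support")
    return product / total

# Example values (hypothetical) on a three-point support
p0 = [0.5, 0.3, 0.2]
pL = [0.2, 0.5, 0.3]
print(bayes_posterior(p0, pL))  # -> approximately [0.3226, 0.4839, 0.1935]
```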

Minimax Loss of Self-information
When the goal is to consolidate information from a prior distribution and a likelihood distribution into a (posterior) distribution, replacing those two distributions by a single distribution will clearly result in some loss of information, however that is defined. Recall that the self-information (also called the surprisal or Shannon information, Shannon (1948)) of the random event A, S_P(A), is given by S_P(A) = −log₂ P(A). (N.B. The Shannon entropy of a probability distribution, on the other hand, is the expected value of the self-information, and in some contexts the terms surprisal or self-information are also used to mean this expected value.) The numerical value of the self-information of a given event is simply the number of binary bits of information reflected in its probability (so the smaller the value of P(A), the greater the information or surprise).
Example 3.1. If P is uniformly distributed on (0, 1) and A = (0, 0.25) ∪ (0.5, 0.75), then the self-information of A is S_P(A) = −log₂(P(A)) = −log₂(0.5) = 1, so if X is a random variable with distribution P, then exactly one binary bit of information is obtained by observing that X ∈ A, in this case that the value of the second binary digit of X is 0.
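A two-line computation reproduces this example (the helper name is hypothetical):

```python
import math

def self_information(prob):
    """S_P(A) = -log2 P(A), measured in binary bits."""
    return -math.log2(prob)

# P(A) for A = (0, 0.25) ∪ (0.5, 0.75) under the uniform distribution on (0, 1)
print(self_information(0.25 + 0.25))  # -> 1.0 bit
```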
Definition 3.2. The combined self-information associated with the event A under the prior distribution P_0 and the likelihood distribution L, S_{(P_0,L)}(A), is

S_{(P_0,L)}(A) = −log₂( P_0(A) · L(A) ).

Note that when P_0(A) and L(A) are finite, the combined self-information is simply the sum of the self-informations under the prior and likelihood distributions, S_{P_0}(A) + S_L(A), and that this is the self-information of the event that A is observed independently under both the prior and the likelihood distributions.
Similarly, the maximum loss between the self-information of a posterior distribution P_1 and the combined self-information of the prior and likelihood distributions P_0 and L, M(P_1; P_0, L), is

M(P_1; P_0, L) = max_A { S_{(P_0,L)}(A) − S_{P_1}(A) } = max_A log₂( P_1(A) / ( P_0(A) L(A) ) ),

where the maximum is taken over all Borel events A.
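As a rough illustration of this quantity, the following sketch (hypothetical names; assuming a small common finite support so that all events can be enumerated) computes M(P_1; P_0, L) by brute force over every event A with positive mass under all three measures:

```python
# Brute-force evaluation of M(P1; P0, L) = max_A log2( P1(A) / (P0(A) L(A)) )
# over all non-empty events A of a small finite support (illustrative only).
from itertools import combinations
import math

def max_loss(p1, p0, pL):
    """Maximum loss of combined self-information of the posterior p1 w.r.t. (p0, pL)."""
    n = len(p1)
    worst = -math.inf
    for r in range(1, n + 1):
        for A in combinations(range(n), r):
            P1A = sum(p1[i] for i in A)
            P0A = sum(p0[i] for i in A)
            LA = sum(pL[i] for i in A)
            if P1A > 0 and P0A > 0 and LA > 0:  # skip degenerate events
                worst = max(worst, math.log2(P1A / (P0A * LA)))
    return worst
```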
In the case of improper distributions we will assume that, when P_1(A) = ∞ and either P_0(A) = ∞ or L(A) = ∞ (or both), the ratio is 1. When all the distributions are proper, the quantity to be maximized over A is the difference between the combined self-information associated with the event A under the prior P_0 and the likelihood L, and the self-information of P_1 associated with the same event.

Definition 3.3. A prior distribution P_0 and a likelihood distribution L, both proper or improper, are compatible if P_0 and L are both discrete with p.m.f.'s p_0 and p_L satisfying 0 < ∑_{θ∈Θ} p_0(θ) p_L(θ) < ∞, or are both absolutely continuous with p.d.f.'s f_0 and f_L satisfying 0 < ∫_Θ f_0(θ) f_L(θ) dθ < ∞.

Example 3.4. Every two geometric distributions are compatible, every two normal distributions are compatible, and every exponential distribution is compatible with every normal distribution. Also, when improper priors are considered, they are chosen to be compatible with the likelihood. Distributions with disjoint support, discrete or continuous, are not compatible.
Remark. In practice, compatibility is not problematic when both P_0 and L are proper. Any two distributions may easily be transformed into two new distributions, arbitrarily close to the original distributions, so that the two new distributions are compatible, for instance by convolving each with a N(0, ε) distribution.
Theorem 3.5. Let P_0 and L be proper or improper discrete compatible prior and likelihood distributions. Then the Bayesian posterior P_B is the unique proper or improper posterior distribution that minimizes the maximum loss of self-information from the prior and likelihood distributions, i.e., that minimizes M(P_1; P_0, L) among all posterior distributions P_1. Moreover, M(P_1; P_0, L) ≥ −log₂( ∑_{θ∈Θ} p_0(θ) p_L(θ) ) for every posterior distribution P_1, with equality if and only if P_1 = P_B.

Proof. Since log₂(x) is strictly increasing, the maximum loss between the self-information of a posterior distribution P_1 and the combined self-information of the prior and likelihood distributions P_0 and L occurs for an event A where the ratio P_1(A)/(P_0(A) L(A)) is maximized. Clearly, M(P_1; P_0, L) = ∞ whenever P_1 is improper, while P_0 and L are compatible. Since our goal is to minimize M(P_1; P_0, L), we restrict our search for the optimal P_1 to proper distributions. The conclusion of Theorem 3.5 then follows as an application of Hill (2011, Corollary 4.4), where it is shown that the conflation of two discrete Borel probability distributions is the unique Borel probability distribution that minimizes the maximum loss of Shannon information between those distributions; the Bayesian posterior is precisely the conflation of the prior and the likelihood distributions. Consequently, the lower bound for the minimax loss of information, valid for the conflation of any finite number of discrete Borel probability distributions, can be applied to the Bayesian paradigm as well. Analogous results hold (see Hill (2011, Theorem 4.5)) for absolutely continuous distributions. □
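Reusing the brute-force max_loss sketch above with the same hypothetical p.m.f.'s, a quick numerical comparison illustrates the theorem: the Bayes posterior attains a strictly smaller maximum loss than, for example, the averaged posterior (P_0 + L)/2.

```python
p0 = [0.5, 0.3, 0.2]
pL = [0.2, 0.5, 0.3]

total = sum(a * b for a, b in zip(p0, pL))          # 0.31
p_bayes = [a * b / total for a, b in zip(p0, pL)]   # Bayes posterior
p_avg = [(a + b) / 2 for a, b in zip(p0, pL)]       # averaged posterior

print(max_loss(p_bayes, p0, pL))  # ≈ 1.69 bits, i.e. -log2(0.31)
print(max_loss(p_avg, p0, pL))    # ≈ 2.06 bits, strictly larger
```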

Proportional Posteriors
Another criterion to assess the quality of the posterior distribution is to require that it reflect the relative likelihoods of identical individual outcomes under both P_0 and L. For example, if the probability that the prior and the (independent) likelihood are both θ_a is twice the probability that both are θ_b, then P_1(θ_a) should also be twice as large as P_1(θ_b).

Definition 4.1. A discrete (posterior) distribution P* , proper or improper, with p.m.f. p* is a proportional posterior of a discrete prior distribution P_0 with p.m.f. p_0 and a compatible discrete likelihood distribution L with p.m.f. p_L, both proper or improper, if

p*(θ_a) / p*(θ_b) = p_0(θ_a) p_L(θ_a) / ( p_0(θ_b) p_L(θ_b) )   for all θ_a, θ_b ∈ Θ with p_0(θ_b) p_L(θ_b) > 0.

Similarly, a proper or improper posterior a.c. distribution P* with p.d.f. f* is a proportional posterior of an a.c. prior distribution P_0 with p.d.f. f_0 and a compatible likelihood distribution L with p.d.f. f_L, both proper or improper, if

f*(θ_a) / f*(θ_b) = f_0(θ_a) f_L(θ_a) / ( f_0(θ_b) f_L(θ_b) )   for all θ_a, θ_b ∈ Θ with f_0(θ_b) f_L(θ_b) > 0.

Theorem 4.2. Let P_0 and L be two proper or improper, compatible discrete or compatible absolutely continuous prior and likelihood distributions, respectively. Then the Bayesian posterior distribution P_B is a proportional posterior (i.e., a proportional consolidation) for P_0 and L.
Proof. Hill (2011, Theorem 5.5) shows that the conflation of two probability distributions is the unique proper proportional consolidation of those distributions. Consequently, the Bayesian posterior is the unique proper proportional consolidation of P_0 and L. No improper distribution shares the same property. In fact, if all the distributions are discrete, and Q is an improper proportional consolidation of P_0 and L with p.m.f. q, then q(θ_1) = k p_0(θ_1) p_L(θ_1) for some θ_1 ∈ Θ and k > 0. Since Q is a proportional consolidation, q(θ) = k p_0(θ) p_L(θ) for all θ ∈ Θ. Summing over all of Θ, compatibility yields a finite total mass for Q, a contradiction. A similar proof works for a.c. distributions. □
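The proportionality condition of Definition 4.1 is easy to verify numerically for the Bayes posterior; a minimal check with hypothetical values:

```python
# Check that p_B(θa) / p_B(θb) = p0(θa) pL(θa) / (p0(θb) pL(θb)) for all pairs.
import numpy as np

p0 = np.array([0.5, 0.3, 0.2])
pL = np.array([0.2, 0.5, 0.3])
p_bayes = p0 * pL / (p0 * pL).sum()

for a in range(len(p0)):
    for b in range(len(p0)):
        assert np.isclose(p_bayes[a] / p_bayes[b],
                          (p0[a] * pL[a]) / (p0[b] * pL[b]))
print("Bayes posterior satisfies the proportionality condition")
```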

Optimal Posteriors for Weighted Prior and Likelihood Distributions
Definition 5.1. Given a prior distribution P_0 with weight w_0 > 0 and a likelihood distribution L with weight w_L > 0, the combined weighted self-information associated with the event A, S_{(P_0,w_0;L,w_L)}(A), is

S_{(P_0,w_0;L,w_L)}(A) = ( 2w_0/(w_0+w_L) ) S_{P_0}(A) + ( 2w_L/(w_0+w_L) ) S_L(A) = −log₂( P_0(A)^{2w_0/(w_0+w_L)} · L(A)^{2w_L/(w_0+w_L)} ).

This definition ensures that only the relative weights are important; for instance, if w_0 = w_L, the combined weighted self-information of the prior and likelihood always coincides with the (unweighted) combined self-information of the prior and likelihood. The next theorem, a special case of Hill and Miller (2011, (8)), identifies the posterior distribution that minimizes the loss of weighted self-information in the case where the prior and likelihood distributions are compatible discrete distributions; the case of compatible absolutely continuous distributions is analogous.
Theorem 5.2. Let P_0 and L be compatible discrete prior and likelihood distributions, proper or improper, with p.m.f.'s p_0 and p_L and weights w_0 > 0 and w_L > 0, respectively. Then the unique posterior distribution that minimizes the maximum loss of self-information from the weighted prior and likelihood distributions, i.e., that minimizes, among all posterior distributions P_1, proper or improper, the maximum over events A of the difference between the combined weighted self-information of the prior and the likelihood distributions and the self-information of the posterior, is the distribution P_{(w_0,w_L)} with p.m.f.

p_{(w_0,w_L)}(θ) = p_0(θ)^{2w_0/(w_0+w_L)} p_L(θ)^{2w_L/(w_0+w_L)} / ∑_{t∈Θ} p_0(t)^{2w_0/(w_0+w_L)} p_L(t)^{2w_L/(w_0+w_L)},

provided the denominator is positive and finite. When w_0 = w_L, this weighted posterior coincides with the classical Bayesian posterior P_B.
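Under the normalization of Definition 5.1 above (exponents 2w_0/(w_0+w_L) and 2w_L/(w_0+w_L)), the weighted posterior can be sketched as follows (hypothetical names and values); with equal weights it reproduces the classical Bayes posterior, while up-weighting the prior pulls the result toward P_0.

```python
import numpy as np

def weighted_posterior(p0, pL, w0, wL):
    """Normalized p0^(2*w0/(w0+wL)) * pL^(2*wL/(w0+wL)); assumes a finite support."""
    p0, pL = np.asarray(p0, float), np.asarray(pL, float)
    a0, aL = 2 * w0 / (w0 + wL), 2 * wL / (w0 + wL)
    unnorm = p0**a0 * pL**aL
    return unnorm / unnorm.sum()

p0 = [0.5, 0.3, 0.2]
pL = [0.2, 0.5, 0.3]
print(weighted_posterior(p0, pL, 1, 1))  # equal weights: the Bayes posterior
print(weighted_posterior(p0, pL, 3, 1))  # prior weighted 3:1: closer to p0
```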
Remark. If both the prior and likelihood distributions are normally distributed, the Bayesian posterior is also a best linear unbiased estimator (BLUE) and a maximum likelihood estimator (MLE); e.g. see Hill (2011).

An Open Question
In classical hypothesis testing, a standard technique to decide which of several known distributions given data actually came from is to maximize the likelihood ratios, that is, the ratios of the p.m.f.'s or p.d.f.'s. Analogously, when the objective is to decide how best to consolidate a prior distribution P_0 and a likelihood distribution L into a single (posterior) distribution P_1 = P_1(P_0, L), one natural criterion is to choose P_1 so as to make the likelihood of observing θ under P_1 as close as possible, in the ratio sense, to the likelihood of observing θ under both the prior distribution P_0 and the likelihood distribution L. This motivates the following notion of minimax likelihood ratio posterior.
Definition 6.1. A proper discrete distribution P* with p.m.f. p* is the minimax likelihood ratio (MLR) posterior of a discrete prior distribution P_0 with p.m.f. p_0 and a compatible discrete likelihood distribution L with p.m.f. p_L if both

min_{P_1} max_{θ∈Θ} p_1(θ) / ( p_0(θ) p_L(θ) )   and   max_{P_1} min_{θ∈Θ} p_1(θ) / ( p_0(θ) p_L(θ) ),

where P_1 ranges over all proper discrete distributions with p.m.f. p_1 and the extrema in θ are taken over {θ : p_0(θ) p_L(θ) > 0}, are attained by p*. Similarly, a proper a.c. distribution P* with p.d.f. f* is the MLR posterior of an a.c. prior distribution P_0 with p.d.f. f_0 and a compatible a.c. likelihood distribution L with p.d.f. f_L if both

min_{P_1} sup_{θ∈Θ} f_1(θ) / ( f_0(θ) f_L(θ) )   and   max_{P_1} inf_{θ∈Θ} f_1(θ) / ( f_0(θ) f_L(θ) )

are attained by f*.
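The intuition behind this criterion can be seen numerically: under the Bayes posterior the pointwise ratio p_B(θ)/(p_0(θ) p_L(θ)) is constant in θ, so its maximum and minimum coincide. A small sketch with hypothetical values:

```python
import numpy as np

p0 = np.array([0.5, 0.3, 0.2])
pL = np.array([0.2, 0.5, 0.3])
p_bayes = p0 * pL / (p0 * pL).sum()

ratios = p_bayes / (p0 * pL)   # equals 1 / sum(p0 * pL) at every support point
print(ratios)                  # -> approximately [3.2258, 3.2258, 3.2258]
print(np.allclose(ratios, 1 / (p0 * pL).sum()))  # True: max and min ratios coincide
```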
The min-max terms in Definition 6.1 are similar to the min-max criterion for loss of self-information (Theorem 3.5), whereas the others are dual max-min criteria. Hill (2011, Theorem 5.2) can be used to prove that, when P_0 and L are both proper, the Bayesian posterior is the unique MLR consolidation of the prior and likelihood distributions among all proper Borel distributions. Whether the same result can be extended to prove that the Bayesian posterior is the unique MLR consolidation among both proper and improper distributions remains an open question.