Advantages of the probability amplitude over the probability density in quantum mechanics

We discuss reasons why the probability amplitude, which becomes a probability density after squaring, is considered one of the most basic ingredients of quantum mechanics. First, the Heisenberg/Schrödinger equation, the equation of motion in quantum mechanics, describes the time evolution of the probability amplitude rather than of a probability density; there may be reasons why the dynamics of a physical system are described at the level of the amplitude. In order to investigate the role of the probability amplitude in quantum mechanics, specialized codeword-transfer experiments are designed using classical information theory. Within this context, quantum mechanics based on the probability amplitude provides the following: i) a minimum error of the codeword transfer; ii) an error independent of the coding parameters; and iii) nontrivial and nonlocal correlations. These are considered essential advantages of the probability amplitude over the probability density.


Introduction
Quantum mechanics (QM) is considered the most basic theory of nature. All phenomena, including those of the gravitational force, are expected to be expressible in the language of QM. However, an essential understanding of the basic nature of QM has yet to be achieved, and efforts to look for more fundamental explanations continue. Of course, QM itself is a self-consistent theory and requires no fundamental reasoning to support its truths beyond what is gained from experiments. Still, it is worth pursuing more basic reasons why QM is the most fundamental law of nature. For instance, Wheeler asked "Why the quantum?" and discussed the relation between QM and information theory [1,2]. In this report we attempt to answer the same question from Wheeler's point of view. One of the most essential differences between quantum and classical mechanics is the former's need for a probabilistic treatment of theoretical predictions. One cannot avoid the probabilistic interpretation of the wave function proposed by Born [3], which is now known as the Copenhagen interpretation. The fundamental equation of QM, the Heisenberg/Schrödinger equation, describes neither the behavior of a physical observable nor its probability density; rather, it describes the probability amplitude, which is a characteristic of QM and possesses no classical counterpart. (In a narrow sense, a "quantum amplitude" is a complex number whose squared absolute value is a probability. In this report, we use the term "quantum amplitude" not only for complex numbers, but also for vectors whose squared absolute value is a probability.) This report considers reasons why fundamental laws of physics are described by a probability amplitude instead of a probability density, leaving aside the question of why probability itself is necessary. To clarify essential properties of the probability amplitude, codeword-transfer experiments are designed on the basis of classical information theory.
Taking into account the discussions on these experiments, three essential advantages of probability amplitude over probability density are pointed out in the following sections.
First, definitions of a quantum system and the probability amplitude are given in Section 2 under a very general mathematical framework. Then, codeword-transfer experiments are designed within classical information theory to investigate the role of the probability amplitude. Experiments using a stochastic algorithm cannot avoid statistical error due to the finite sample size. In Section 3, we show that a coding method based on the probability amplitude minimizes this statistical error. Moreover, the statistical error of the codeword transfer is independent of the parametrization used to transfer each character; this is shown in Section 4. Another essential feature of QM is its lack of local realism, which can be judged by Bell's inequality. Local realism and Bell's inequality are described using the terminology of classical information theory, again as a codeword-transfer experiment. A method based on the probability amplitude can induce a violation of Bell's inequality, as shown in Section 5. Throughout this report, classical information theory is used to describe the codeword-transfer experiments.

General quantum system
A general framework to define the probability amplitude appearing in QM is considered in this section. Here we emphasize algebraic aspects of QM and ignore dynamical ones. The question which must be asked here is "What minimum set of assumptions makes a system look like quantum mechanics?" We propose the following elements as indispensable ingredients of QM.

Definition 2.1. (Quantum Space) Let K be a field and V a linear (vector) space over it. K is named the base field and is associated with each point of a set M. State vectors and a probability measure are introduced on these spaces as follows.

1. A map from a point of M to a tensor product of copies of the vector space V,
$$\Psi : x \mapsto \psi_1(x) \otimes \cdots \otimes \psi_k(x),$$
is named a state vector. Here, M is named the base set and x is a point of it.

2. A map from the state vector to a real number, $\mu : \psi_i \mapsto \mu(\psi_i) \in \mathbb{R}$, is named a probability measure. The index i on $\psi_i$ runs from 1 to k. The sequential map is also called a probability measure and is represented by the same symbol, $\mu$, when V is obvious.

3. The probability measure must be normalized as $\int_\Gamma \mu(\psi_i)\,dx = 1$ for each i, where $\Gamma$ is an appropriate subset of the base set M. Since the probability measure is considered a Lebesgue measure, the integral should be interpreted as a summation when M is a discrete set.
4. The set $\{K, V, \Psi, \mu\}$ is named a "quantum space."

To construct QM, these conditions are necessary but not sufficient. For standard relativistic QM (or quantum field theory), we take a Hilbert space as the vector space V over the field of complex numbers $\mathbb{C}$. The state vector can be constructed using square-integrable functions on a given support. The state vector is associated with each point of the Minkowski manifold as the base set. (Sometimes a Fourier transform of $\psi_i$, defined on the momentum manifold, is used instead of $\psi_i$ itself. In that case, the corresponding Hilbert space is called a "Fock space".) The probability measure is introduced as $\mu(\psi_i) = |\psi_i|^2$. For the normalization, $\Gamma$ is taken as a hypersurface on M such that any two points on $\Gamma$ have a space-like separation. (Or the state is normalized in momentum space.) When the probability measure is defined as the square of the absolute value of the state vector, the state vector is called a "probability amplitude" in this report hereafter. In this report, simple quantum spaces are used, since only algebraic aspects of QM are of interest here.
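As a concrete illustration, the following sketch (a hypothetical toy example, not part of the formal definition) builds a finite quantum space over the reals where the probability measure is the squared amplitude; since the components of a unit vector are direction cosines, the normalization of item 3 holds automatically:

```python
import math

def probability_measure(state):
    # The probability measure of the quantum space: mu(psi_i) = |psi_i|^2.
    return [abs(a) ** 2 for a in state]

# A normalized state vector in R^3 whose components are direction cosines
# (angles 0.3 and 0.5 are arbitrary illustrative values).
state = [math.cos(0.3) * math.cos(0.5),
         math.cos(0.3) * math.sin(0.5),
         math.sin(0.3)]
probs = probability_measure(state)
# The squared direction cosines sum to 1, i.e. mu is normalized.
assert abs(sum(probs) - 1.0) < 1e-12
```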

Minimization of measurement error
First, let us consider the statistical error for measurements of a single physical observable on the quantum space defined in the previous section. A codeword-transfer experiment simulating standard QM in a much simpler quantum space, while retaining its essential properties, is introduced here. In information theory, an encoding method which minimizes the statistical error among methods using stochastic algorithms is known; the method using the probability amplitude is shown to be an example of such an encoding method giving minimum errors. The terminology of classical information theory used here can be found in Appendix A.1 and references [4,6]. An experiment satisfying the following conditions is called a stochastic codeword-transfer experiment:

1. Alice (A) transfers a set of m different codewords $W = \{\omega_1, \cdots, \omega_m\}$ to Bob (B) after converting them to state vectors $\psi \in V$, where V is an m-dimensional vector space.

2. B receives a state vector sent from A and obtains one of the codewords of W by measuring it. The meaning of "measuring" is explained in the following items.

3. The same probabilistic function,
$$\mu_c(\psi)(\omega_i) = p_i,$$
is given to B. The value $p_i$ is the probability of observing the codeword $\omega_i$. Only one vector space V appears here, so the function $\mu_c(\psi)(\omega_i)$ will be written as $\mu_c(\omega_i)$ hereafter.

4. The probabilistic function $\mu_c$ is normalized as
$$\sum_{i=1}^{m}\mu_c(\omega_i) = 1.$$

5. A can repeatedly send the same state vector to B a finite number of times (n times here).
6. B obtains n independent codewords by measuring the sets of state vectors sent from A, such as $X = \{x_1, \cdots, x_n\}$.
7. B has an unbiased estimator T to obtain a set of real numbers $\bar{x}_i \in [0,1]$ from the measured data. Here $\bar{x}_i$ converges in probability to $p_i$ as $n \to \infty$, thanks to the law of large numbers.
8. Finally, B obtains the sequence of numbers $\{\bar{x}_1, \cdots, \bar{x}_m\}$ which A intended to send.
This codeword-transfer experiment is constructed on the quantum space $\{W, V, \psi, \mu_c\}$ as defined above. In this case, the positions where A and B exist are not specified. No dynamical structure is assumed to transport a state vector from A to B; it is simply assumed that these two points are separated from each other and that there is no way to communicate other than the state-vector transfer. The question to ask here is how one may find the probability measure $\mu_c$, mapping the state vector $\psi$ to a real number $\mu_c(\psi)$, that minimizes the error of this experiment for any $\psi$. The answer is already known as a theorem, first obtained by Fisher [7]. Wootters stated this theorem [8] without proof but later provided one by introducing a statistical distance [9]. Recently, Wootters discussed this subject again in [10]. Here we state the theorem clearly again and give an independent and much simpler proof using information theory.
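Items 1-8 above can be simulated directly. The sketch below is a minimal illustration (the sampling loop stands in for B's measurement, and the relative frequency plays the role of the unbiased estimator of item 7): A encodes a probability list as amplitudes $y_i = \sqrt{p_i}$, and B recovers the numbers up to statistical error.

```python
import random

def transfer(p, n, seed=0):
    """Simulate the stochastic codeword-transfer experiment: A encodes the
    probabilities p as amplitudes y_i = sqrt(p_i); B draws n codewords with
    probabilities |y_i|^2 and estimates each p_i by its relative frequency."""
    rng = random.Random(seed)
    amplitudes = [pi ** 0.5 for pi in p]      # state vector prepared by A
    probs = [y * y for y in amplitudes]       # B's measurement probabilities
    counts = [0] * len(p)
    for _ in range(n):
        r, acc = rng.random(), 0.0
        for i, q in enumerate(probs):
            acc += q
            if r < acc:
                counts[i] += 1
                break
        else:                                  # guard against rounding at r ~ 1
            counts[-1] += 1
    return [c / n for c in counts]

# With n = 100000 repetitions the estimates approach the intended numbers.
estimate = transfer([0.5, 0.3, 0.2], 100000)
```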

Theorem 3.1. (Fisher-Wootters)
Among stochastic codeword-transfer experiments, the one which employs the following probability measure gives the smallest error when measuring a single codeword from a set of codewords: $\mu_c(\omega_i) = |y_i|^2$, where $y_i = \cos\omega_i$ is the direction cosine of a unit vector V on the sphere $S_1^{m-1}$.

Figure 1: Example for m = 3: assignment of a vector V on the two-dimensional sphere.

1. A selects a set of codewords $W = \{\omega_1, \cdots, \omega_m\}$ and a sequence of numbers $P = \{p_1, \cdots, p_m\}$ which is intended to be sent to B. P is normalized as $\sum_{i=1}^{m} p_i = 1$.

Proof.
1. The smallest error ⇒ $\mu_c(\omega_i) = |y_i|^2$: Data after n independent measurements are expressed as $X = (x_1, \cdots, x_n)$, obtained with the probabilities $\mu_c(\omega_i) = |y_i|^2$. The probability density to obtain a set of data X is assumed to be expressed as $f(X;\psi) = \prod_{j=1}^{n}\mu_c(x_j)$, where $\mu_c(\omega_i)$ is defined as in equation (3). Then the Fisher information matrix (FIM) [4] can be written as
$$J_{ij} = E\left[\frac{\partial\ln f}{\partial\omega_i}\,\frac{\partial\ln f}{\partial\omega_j}\right].$$
The functions $\mu_c(\omega_i)$ are not independent of each other owing to conservation of the total probability, $\sum_{i=1}^{m}\mu_c(\omega_i) = 1$. We can assume that all $\mu_c(\omega_i)$ ($i \geq 2$) are independent, with $\mu_c(\omega_1) = 1 - \sum_{j=2}^{m}\mu_c(\omega_j)$, without any loss of generality. Since all the $\mu_c(\omega_{i\neq 1})$, apart from this correlation due to the conservation of probability, can be made independent by an appropriate linear transformation of $\mu_c$, the FIM can be taken to be a diagonal matrix. Here we use the short-hand expressions $\mu_c(\omega_i) = \mu_i$, $d\mu_c(\omega_j)/d\omega_i = \mu_{j,i}$, and $\sum_{j=2}^{m}\mu_c(\omega_j) = \bar{\mu}$; then the diagonal components of the FIM can be written as
$$J_{ii} = n\,\mu_{i,i}^2\left(\frac{1}{\mu_i} + \frac{1}{1-\bar{\mu}}\right),$$
where the mutual independence of the $\mu_{k\geq 2}$ is used. The minimum value of $J_{ii}$ is obtained when $\bar{\mu} = \mu_i$, within the allowed region $\mu_i \leq \bar{\mu} \leq 1$. Then we get
$$J_{ii} \geq \frac{n\,\mu_{i,i}^2}{\mu_i(1-\mu_i)}.$$
On the other hand, the measured data after n independent measurements must follow a multinomial distribution, whose covariance matrix $\sigma$ is
$$\sigma_{ij} = n\left(\hat{p}_i\,\delta_{ij} - \hat{p}_i\hat{p}_j\right),$$
where $\hat{p}_i$ is the measured probability of the i-th codeword. Then, after n independent measurements through the estimator T defined above, a covariance matrix $\Sigma(X) = \sigma/n^2$ can be formed, whose diagonal components become
$$\Sigma_{ii} = \frac{\hat{p}_i(1-\hat{p}_i)}{n}.$$
In general, the measured probability $\hat{p}_i$ differs from the true probability $\mu_i$; however, the error $|\hat{p}_i - \mu_i|$ becomes smaller than any given value after a sufficient number of events accumulates, as a result of the law of large numbers and the assumption that the estimator is unbiased. Therefore, we use $\mu_i$ instead of $\hat{p}_i$ in the discussions that follow.
The probability $\mu_i$ that maximizes the diagonal components of the covariance matrix is $\mu_i = 1/2$, from $d(n\Sigma_{ii})/d\mu_i = 1 - 2\mu_i = 0$; the corresponding diagonal components are $\Sigma_{ii} = 1/(4n)$. The Cramér-Rao inequality [11,12,4] gives the lower bound of the covariance matrix as
$$\Sigma_{ii} \geq (J^{-1})_{ii}.$$
A possible range of the inverse of the FIM is
$$(J^{-1})_{ii} \leq \frac{\mu_i(1-\mu_i)}{n\,\mu_{i,i}^2},$$
where we use the fact that the FIM J is a diagonal matrix. Then a solution of the following differential equation gives the minimum variance in general:
$$\left(\frac{d\mu_i}{d\omega_i}\right)^2 = 4\,\mu_i(1-\mu_i).$$
The solution of this equation can be obtained as $\mu_i = \cos^2(\omega_i + \phi_i)$, where $\phi_i$ is an arbitrary phase factor. This phase factor corresponds to a rotation of the coordinate system prepared in Theorem 3.1 and has no essential effect on the result; we therefore set $\phi_i = 0$ hereafter, so that $\mu_i = \cos^2\omega_i$. Each $\omega_i$ gives the same differential equation; the parametrization $y_i = \sqrt{\mu_i} = \cos\omega_i$ thus gives the lowest value of the variance, and $y_i$ is nothing other than the direction cosine of the vector V whose endpoint lies on the unit sphere $S_1^{m-1}$. The method giving the minimum variance is therefore: i) normalize the codeword $\omega_i$ to $0 \leq \omega_i \leq \pi/2$; ii) map it onto $S_1^{m-1}$ with $\omega_i$ as the angle from the axis $\eta_i$; iii) set the probability of observing the codeword $\omega_i$ to $\cos^2\omega_i$. These are exactly the assumptions of the theorem.

2. $\mu_c(\omega_i) = |y_i|^2$ ⇒ the smallest error: When we set $\mu_i = |y_i|^2 = \cos^2\omega_i$, the diagonal components of the covariance matrix become
$$\Sigma_{ii} = \frac{\cos^2\omega_i\,(1-\cos^2\omega_i)}{n} = \frac{\sin^2 2\omega_i}{4n}.$$
The largest value of $\Sigma_{ii}$, namely $1/(4n)$, is obtained at $\omega_i = \pi/4$. On the other hand, the diagonal component of the FIM becomes
$$\tilde{J}_{ii} = \frac{n\,\mu_{i,i}^2}{\mu_i(1-\mu_i)} = \frac{n\,\sin^2 2\omega_i}{\cos^2\omega_i\,\sin^2\omega_i} = 4n.$$
Then $\tilde{J}_{ii}^{-1} = 1/(4n)$, which matches the worst-case value of $\Sigma_{ii}$: the Cramér-Rao bound is saturated.
In the above decoding method, the relation between the probability amplitude and the probability density is algebraically the same as in standard QM; this means that standard QM employs a coding method that minimizes the statistical error among stochastic methods. This is our first example outlining the advantage of the method using the probability amplitude.
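The key computation of the proof can be checked numerically. For a two-outcome measurement with probability $\mu(\omega)$, the per-measurement Fisher information is $J = (d\mu/d\omega)^2/(\mu(1-\mu))$; the sketch below verifies that the amplitude parametrization $\mu = \cos^2\omega$ makes J constant (equal to 4), saturating the Cramér-Rao bound uniformly in $\omega$, whereas a linear parametrization does not (the sample angles are arbitrary):

```python
import math

def fisher_info(mu, dmu):
    # Per-measurement Fisher information of a two-outcome (Bernoulli)
    # measurement with success probability mu(omega) and derivative dmu.
    return dmu ** 2 / (mu * (1.0 - mu))

# Amplitude parametrization mu = cos^2(omega): dmu/domega = -sin(2*omega),
# so J = sin^2(2w) / (cos^2(w) sin^2(w)) = 4 for every omega.
amp = [fisher_info(math.cos(w) ** 2, -math.sin(2 * w))
       for w in (0.2, 0.5, 0.8, 1.2)]

# Linear parametrization mu = omega: J = 1 / (mu (1 - mu)) varies with omega.
lin = [fisher_info(w, 1.0) for w in (0.2, 0.5, 0.8)]
```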

Parametrization independence of a measurement error
Related to Theorem 3.1, one can prove the following theorem, which was also given by Wootters [8,9] and is important for considering the role of the probability amplitude.

Theorem 4.1. (Wootters) For the probability measure of Theorem 3.1, the mean-square error of the estimated codeword is $\sigma^2 = 1/(4n)$, which means the mean-square error is determined by the statistics per degree of freedom and is independent of the position on the m-dimensional sphere. The factor $\sigma^2 \propto 1/n$ follows from the central limit theorem.

As a concrete example, consider the following photon-transfer experiment. A has a single-photon source and a polariser whose plane can be rotated; B has a $\lambda/4$ plate with a fixed plane and a photon detector with 100% efficiency. A knows the angle of the polariser plane of B, say $\theta_0$, and has a clock exactly synchronised to that of B. A assigns codewords to equally separated points on a unit circle and selects an integer, say j. Then A sets the angle of the polariser according to the codeword, $\theta = \theta_0 + \alpha$, where $\alpha = j/2\pi$. A transfers one photon per second, n photons in total. B measures photons behind the $\lambda/4$ plate: if B observes a photon, he records "1"; if not, he records "0". As a result, B obtains data $X_n = \{X_1, X_2, \cdots, X_n\} = \{1, 1, 0, 1, 0, \cdots\}$ and decodes them to one real number, the average $\bar{x} = \sum_i^n X_i/n$. According to quantum mechanics this number must be $\bar{x} = \sin\alpha$. Finally, B obtains the number which A intended to send. This codeword-transfer experiment satisfies Definition 3.1, which means that quantum mechanics gives codeword-transfer experiments with the smallest errors, as given by Theorem 3.1.
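As a rough numerical illustration of the photon example (a sketch only: the per-photon detection probability is taken to be $\sin\alpha$ as stated in the text, and the values of $\alpha$ and n are arbitrary):

```python
import math
import random

def photon_run(p, n, seed=1):
    """B's record: n Bernoulli trials ('1' = photon observed behind the
    plate) with per-photon detection probability p; returns the average."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n) if rng.random() < p) / n

alpha = 0.4                  # A's encoded angle (hypothetical value)
p = math.sin(alpha)          # per-photon probability, as stated in the text
xbar = photon_run(p, 100000)
# xbar converges to sin(alpha); its spread shrinks like 1/sqrt(n).
```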

Nonlocal realism
A point definitely distinguishing QM from classical mechanics is that QM does not have local realism. Related to this fact, there are two important theorems: the violation of Bell's inequality [13] and the Kochen-Specker theorem [14]. Both theorems concern the correlation between two independent measurements. It is shown in this section that these two theorems can be realized, again using the probability amplitude. In order to discuss the correlation between two independent measurements, a double codeword-transfer experiment is designed.

Definition 6.1. (Stochastic double codeword-transfer experiment)

2. (B) and (C) are placed opposite to A, receive the state vectors sent from A, and stochastically choose one of the two sets to be measured. Neither B nor C knows which set is chosen by the other (independence of set selection).

5. A can send the same set of state vectors to B and C a finite number of times (n times here).

6. Measurements:
(a) B obtains n independent codewords by measuring the sets of state vectors sent from A, such as $X^B = \{x_1, \cdots, x_n\}$.
(b) For C, the same as (a) with the replacement B → C.

7. Estimator:
(a) B has an unbiased estimator to obtain a set of real numbers $\bar{x}_i \in [0,1]$ from the measured data, where i runs from 1 to 2m.
(b) For C, the same as above with the replacement B → C.
8. After completing the measurements, B and C make a table $\bar{x}_{i,j} = (\bar{x}^B_i, \bar{x}^C_j)$, where $\bar{x}_{i,j}$ converges in probability to $p_{i,j}$ as $n \to \infty$, thanks to the law of large numbers.
Bell's inequality is a critical test to distinguish a nonlocal theory from a local one. This theorem can be expressed by the language of classical information theory [15]. We state this theorem and give a proof in the context of Definition 6.1.

Theorem 6.1. (Bell) Let us consider a case with a complete table giving the probability of observing any pair of codewords, $P(\alpha_{i1}, \alpha_{i2}, \beta_{j1}, \beta_{j2})$. These measurements are performed as the stochastic double codeword-transfer experiment defined above. In this case, the conditional entropies obey the inequality
$$S(\alpha_1|\beta_1) \leq S(\alpha_1|\beta_2) + S(\beta_2|\alpha_2) + S(\alpha_2|\beta_1).$$
Definitions and the formulae necessary for the following proof can be found in [4] and are summarized in Appendix A.2.

Proof. On the assumption that there exists a complete probability table $P(\alpha_{i1}, \alpha_{i2}, \beta_{j1}, \beta_{j2})$, a joint entropy $S(\alpha_1, \alpha_2, \beta_1, \beta_2)$ can be written. Using the chain rule of entropy sequentially, one gets
$$S(\alpha_1, \alpha_2, \beta_1, \beta_2) = S(\alpha_1|\alpha_2, \beta_1, \beta_2) + S(\beta_2|\alpha_2, \beta_1) + S(\alpha_2|\beta_1) + S(\beta_1).$$
On the other hand, this joint entropy satisfies
$$S(\alpha_1, \alpha_2, \beta_1, \beta_2) \geq S(\alpha_1, \beta_1) = S(\alpha_1|\beta_1) + S(\beta_1).$$
This inequality follows from the nonnegativity of entropy. From the property of the probability measure on the probability space and the definition of joint entropy, the inequalities
$$S(\alpha_1|\alpha_2, \beta_1, \beta_2) \leq S(\alpha_1|\beta_2), \qquad S(\beta_2|\alpha_2, \beta_1) \leq S(\beta_2|\alpha_2)$$
follow. Combining these relations, Bell's inequality is proved.
The necessary condition for Bell's inequality, the existence of the complete probability table P (α i1 , α i2 , β j1 , β j2 ), corresponds to local realism in the physical terminology. Here, we give an example where Bell's inequality is not maintained.

Definition 6.2. (Stochastic double codeword-transfer experiment without a complete probability table)
Here, the number of codewords in the set is m = 2 for simplicity.
1. Set m = 2 in Definition 6.1-1 for two sets of codewords, $W_A = \{\alpha_1, \alpha_2\}$ and $W_B = \{\beta_1, \beta_2\}$, and for state vectors
$$\alpha_1 = (\cos\theta_\alpha, \sin\theta_\alpha), \quad \alpha_2 = (-\sin\theta_\alpha, \cos\theta_\alpha),$$
$$\beta_1 = (\cos\theta_\beta, \sin\theta_\beta), \quad \beta_2 = (-\sin\theta_\beta, \cos\theta_\beta),$$
where $0 \leq \theta_\alpha, \theta_\beta \leq \pi$. This parametrization is an example of Theorem 3.1.
3. Encoding is performed using a probabilistic function $\mu_{dc}$ on the state vectors.

4. B and C each select for measurement one of the elements (codewords) of $W_A$ or $W_B$, independently. Before the measurement, B (C) rotates the detector angle by $\theta_b$ ($\theta_c$). Neither knows the rotation angle of the other. B and C correct for this rotation angle after completing all measurements. This rotation does not affect the error of the measurement, owing to Theorem 4.1.

(a) If the state vectors $\{\alpha_i\}$ and $\{\beta_i\}$ exist locally before B's measurement, the probability that B obtains each codeword can be computed after the rotation, where $R(\theta)$ is a rotation matrix, $\psi_\gamma \in V_\alpha \oplus V_\beta$, and $\theta_\gamma = \theta_\alpha$ or $\theta_\beta$ depending on $\psi_\gamma$. The probability for C is similar. In this case we do not observe any violation of Bell's inequality, since we can prepare the complete probability table.

(b) Suppose instead that the angles $\theta_\alpha$ and $\theta_\beta$ are not fixed before the measurement, but are fixed only when B or C measures a codeword from $W_A$ or $W_B$, and that the probability measure $\mu_{dc}$ depends on the result of their decision. Moreover, we require that the probability measure does not follow the functional composition condition (FUNC) [5]. In the context of this report, the FUNC is the requirement that, for any function f built from arithmetic operations on vectors and real numbers,
$$f\left(\mu_{dc}(\alpha_1, \psi_k),\, \mu_{dc}(\alpha_2, \psi_k)\right) = \mu_{dc}\left(f(\alpha_1, \alpha_2),\, \psi_k\right).$$
The f on the l.h.s. maps real numbers to a real number, while the f on the r.h.s. maps vectors to a real number; here we consider the natural correspondence between real numbers and vectors under addition, subtraction, and (scalar) multiplication, and represent both maps by the same symbol f. In the current example, the probability measure fails the FUNC; for instance,
$$\mu_{dc}(\alpha_1, \psi_k) + \mu_{dc}(\alpha_2, \psi_k) \neq \mu_{dc}(\alpha_1 + \alpha_2,\, \psi_k).$$
Suppose C obtains $\alpha_1$ ($\alpha_2$). The angle $\theta_\alpha$ for B is then fixed as $\theta_\alpha = \theta_c$ ($\theta_\alpha = \pi/2 + \theta_c$); i.e., the probability table is now situation-dependent, and the state vectors for B are fixed accordingly. If B decided to measure a codeword from the set $W_B$, nothing would happen. On the other hand, if B decided to measure a codeword from the same set as C, the probability of obtaining one of the $\alpha_i$ can again be calculated using only parameters local to B.

In both cases, B obtains a set of codewords that A intended to send. The probability table is situation-dependent, and there is the possibility that Bell's inequality is violated.
It is proved that the Kochen-Specker theorem is incompatible with the FUNC [5]. The above stochastic double codeword-transfer experiment is a model of QM that violates the FUNC so as to incorporate the Kochen-Specker theorem.
In order to confirm a violation of Bell's inequality, it is tested numerically according to the above example. A correlation $\Delta S$ between the measured codewords independently obtained by B and C is defined in the manner of CHSH [16]. According to the result of Theorem 6.1, $\Delta S$ is bounded by negative values when the complete probability table exists. If the theory is based on local realism, one can always prepare the complete table of the codewords observed by both B and C. In order to design an experiment that gives a stronger correlation ($\Delta S > 0$), one has to employ a rule for choosing the probability table that cannot be determined locally. Moreover, the rule must also satisfy the requirements of special relativity if one would like to interpret it as a physical law. The stochastic double codeword-transfer experiment defined by Definition 6.2 is an example of such a rule. Under Definition 6.2-4b, for instance, B cannot know which probability table he is using, because it depends on C's decision, which cannot be known by B. This lack of a complete probability table is deeply related to the Kochen-Specker theorem (KST). The KST asserts the absence, in QM, of a complete set of values of physical quantities prior to measurement, and this corresponds exactly to the lack of the complete probability table introduced in Definition 6.2. Moreover, if we look only at B's results, we cannot extract any information about C's choices and results; that means C's information cannot be transferred to B instantaneously, which is a requirement of special relativity. This coexistence of nonlocality and special relativity is realized by the rule of Definition 6.2-4b of the stochastic double codeword-transfer experiment. The probability tables $\mu_B(\alpha_1)$ and $\mu_B(\alpha_2)$ include $\theta_c$, even though they are tables for B; this is what is called "entanglement". However, B cannot extract the value of $\theta_c$, because $\theta_c$ appears only in the phase of the unitary transformation and disappears on averaging.
Violation of Bell's inequality can be judged by checking whether the correlation $\Delta S$ is greater than zero. Numerical results employing the rule of Definition 6.2-4b are shown in Fig. 3. One can clearly see the violation of Bell's inequality in some parameter regions. This trick can be implemented because the probability table is represented by probability amplitudes. For example, if B decides to measure a codeword from the same set as C, say $W_A$, the state vector for B is a superposition of the two possible states depending on the result of C's measurement, where $R(\theta)$ is a rotation matrix. Then the probabilities $\mu_{dc}(\alpha_i)$ are obtained as in Definition 6.2-4b. The nonlocal realism is induced by squaring the state vector after superposing the two possible states. This is another example outlining the advantage of the method using the probability amplitude.
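Since the explicit forms of $\Delta S$ and $\mu_{dc}$ are not reproduced above, the following sketch uses the standard singlet-form quantum correlation $E(a,b) = -\cos(a-b)$ as a stand-in (an assumption, not taken from this report) to show the kind of violation at stake: a CHSH-type combination, bounded by 2 under local realism, reaches $2\sqrt{2}$ for amplitude-based correlations at the usual optimal angles:

```python
import math

def E(a, b):
    # Two-outcome correlation for analyzer angles a and b, computed from
    # squared amplitudes of the superposed states (singlet form).
    return -math.cos(a - b)

def chsh(a, a2, b, b2):
    # CHSH combination; local realism bounds its absolute value by 2.
    return abs(E(a, b) - E(a, b2) + E(a2, b) + E(a2, b2))

# Standard angle choice a = 0, a' = pi/2, b = pi/4, b' = 3*pi/4 gives the
# maximal quantum value 2*sqrt(2) > 2.
S = chsh(0.0, math.pi / 2, math.pi / 4, 3 * math.pi / 4)
```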

Summary
In this report, a basic definition of quantum mechanics concerning its static aspects is proposed; dynamical aspects are not treated. A simple codeword-transfer experiment satisfying this definition is designed to investigate some of its aspects. It is then proved, using information theory, that a method based on the probability amplitude gives the minimum error on the physical observables. It is also shown that the size of the error does not depend on the parametrization of the coding. Nonlocal realism is one of the most essential parts of the nature of quantum mechanics. It is shown that the quantum mechanics defined here can include nonlocal realism in the double codeword-transfer experiment, introduced by extending the codeword-transfer experiment above. We showed that the quantum mechanics defined here can violate Bell's inequality, thanks to the properties of the probability amplitude.
In conclusion, the probability amplitude, rather than the probability density, gives the minimum mean-square error, independent of the parametrization. Moreover, it allows one to obtain nontrivial and nonlocal correlations between two independent measurements, which violate Bell's inequality in accordance with the Kochen-Specker theorem. It is worth pointing out that nonlocal realism is realized here without any complex-valued amplitude. The complex-valued amplitude may be a convenient representation for quantum mechanics, but it is not an indispensable ingredient of it.

A.1 Classical estimation theory
We define terms associated with physical measurement according to classical estimation theory [4] as follows. Let X be a random variable for a given physical system described by the N-tuple $\theta = \{\theta_1, \cdots, \theta_N\}$, where $\theta_i$ is the i-th physical parameter. The set of all possible values of $\theta_i \in \mathbb{R}$, denoted by $\Theta$, is called the parameter set. The random variable X is distributed according to the probability density function $f(x;\theta) \geq 0$, normalized as $\int_{x\in\Omega} dx\, f(x;\theta) = 1$, where $x \in \mathbb{R}$ is one possible value and $\Omega$ is the whole event space. For physical applications, we introduce the probability amplitude defined by $|\omega(x;\theta)|^2 = f(x;\theta)$.
A part of the experimental apparatus is assumed to output numbers distributed according to this probability density. Any resulting set of numbers $X_n = \{x_1, \cdots, x_n\}$, drawn independently and identically distributed (i.i.d.), is called the experimental data. The estimation of a physical parameter is called a measurement. Because the experimental data are i.i.d., the corresponding probability density function can be expressed as a product: $f(X_n;\theta) = \prod_{j=1}^{n} f(x_j;\theta)$.
A function mapping the experimental data to one possible value of the parameter set, $T_i : X_n \to \Theta : \{x_1, \cdots, x_n\} \mapsto \hat{\theta}_i$, is called an estimator for the i-th physical parameter, denoted by $T_i(X_n) = \hat{\theta}_i$. The experimental error in the i-th physical parameter is defined as the root-mean-square error $\sqrt{E_{\theta_i}[(\hat{\theta}_i - \theta_i)^2]}$, where $\theta_i$ is the true value of the i-th physical parameter. True values of physical parameters are typically unknown, but the mean-square error can be reduced below any desired value by accumulating a sufficiently large amount of experimental data, thanks to the law of large numbers. If the mean value of the experimental error converges to zero in probability, i.e., $\lim_{n\to\infty} E_{\theta_i}[\hat{\theta}_i - \theta_i] = 0$ (in probability), the estimator is called an unbiased estimator. Among such estimators, the one giving the least error is called the best estimator.
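The behaviour described above can be illustrated with a small Monte-Carlo sketch (all numbers are hypothetical): the mean-square error of the frequency estimator for a Bernoulli parameter p is p(1-p)/n, so it shrinks like 1/n as data accumulate:

```python
import random

def mse_of_mean(p, n, trials, seed=2):
    """Monte-Carlo mean-square error of the frequency estimator for a
    Bernoulli parameter p, averaged over `trials` experiments of n draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        est = sum(1 for _ in range(n) if rng.random() < p) / n
        total += (est - p) ** 2
    return total / trials

# Expected values: p(1-p)/n = 0.0021 for n = 100 and 0.00021 for n = 1000.
m1 = mse_of_mean(0.3, 100, 2000)
m2 = mse_of_mean(0.3, 1000, 2000)
```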

A.2 Information theory
For a probability space $(\Omega, A, P)$ and a probability variable X defined on it, the information entropy S(X) is defined as $S(X) = -\sum_{x\in\Omega} P(x)\log P(x)$.

$S(X) \geq 0$ immediately follows from $0 \leq P \leq 1$. For two probability variables X, Y whose domains are $\Omega_x, \Omega_y \subseteq \Omega$, the joint entropy is defined as
$$S(X, Y) = -\sum_{x\in\Omega_x}\sum_{y\in\Omega_y} P(x \cap y)\log P(x \cap y),$$
where $P(x \cap y)$ is the probability of observing x in X and y in Y simultaneously. The conditional entropy is defined as
$$S(Y|X) = -\sum_{x\in\Omega_x}\sum_{y\in\Omega_y} P(x \cap y)\log P(y|x),$$
where $P(y|x)$ is the conditional probability of observing y in Y when x in X is obtained. For these entropies, the following formulae hold: $S(X, Y) = S(X) + S(Y|X)$ and $S(X|Y) \leq S(X)$.
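These definitions and formulae can be checked directly on a small joint distribution (toy numbers; natural logarithm):

```python
import math

def S(p):
    # Shannon entropy of a probability list; 0*log(0) = 0 by skipping zeros.
    return -sum(q * math.log(q) for q in p if q > 0)

# Toy joint distribution P(x, y) on a 2x2 event space (hypothetical numbers).
P = [[0.4, 0.1],
     [0.2, 0.3]]
px = [sum(row) for row in P]                    # marginal P(x)
py = [P[0][j] + P[1][j] for j in range(2)]      # marginal P(y)
S_X, S_Y = S(px), S(py)
S_XY = S([q for row in P for q in row])         # joint entropy S(X, Y)

# Conditional entropies: S(Y|X) = -sum P(x, y) log P(y|x), and similarly.
S_YgX = -sum(P[i][j] * math.log(P[i][j] / px[i])
             for i in range(2) for j in range(2))
S_XgY = -sum(P[i][j] * math.log(P[i][j] / py[j])
             for i in range(2) for j in range(2))

assert abs(S_XY - (S_X + S_YgX)) < 1e-9   # chain rule S(X,Y) = S(X) + S(Y|X)
assert S_XgY <= S_X                        # conditioning cannot raise entropy
```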