Efficient Decomposition of Bayesian Networks With Non-graded Variables



Introduction
Bayesian Networks (BNs, Pearl, 1988) provide a formal framework to represent uncertain knowledge and to reason under uncertainty. A BN consists of a directed acyclic graph (DAG), encoding a factorization of the joint probability distribution over a set of random variables with finite sample space, and a Conditional Probability Table (CPT) for each variable, containing its probability distribution for each combination of the values of its parents in the DAG.
A CPT is defined by a number of free parameters which is exponential in the number of parent variables, thus elicitation may require considerable effort from domain experts (Druzdzel and van der Gaag, 2000), and estimation from collected data may be highly inefficient. Also, the DAG of a BN where variables have many parents involves few but very large CPTs, so that the joint tree algorithm for exact inference (Lauritzen and Spiegelhalter, 1988) may not be scalable. A widely adopted solution to this problem is the use of decompositions to break down each large CPT into several smaller CPTs. The most popular among such decompositions is the Noisy-OR, pioneered by Good (1961) and further studied by Pearl (1988). In the Noisy-OR, dichotomous parent variables are assumed to influence the value of a dichotomous response through independent latent causes. Each latent cause is 'activated' by a specific parent variable with a certain probability, and a single 'active' latent cause is sufficient for the response to change its value from the reference state to the non-reference one. The independence among latent causes implies the absence of interaction among parent variables, an assumption called causal independence (Heckerman and Breese, 1996). In Henrion (1987), the Noisy-MAX decomposition was introduced as a generalization of the Noisy-OR to graded variables, i.e., ordinal variables with the lowest state as reference. In the Noisy-MAX, latent causes have the same sample space as the response and each one may 'activate' one non-reference state of the response, which in turn takes value on the highest among the 'active' states. Further elaborations of the Noisy-MAX decomposition were provided by Díez (1993) and Srinivas (1993).
The Noisy-MAX decomposition simplifies elicitation from domain experts and estimation from collected data because, due to the assumption of causal independence, the number of free parameters is linear in the number of parent variables, instead of exponential. Also, parent divorcing can be recursively applied to latent causes until each node in the DAG has no more than two parents. In this way, the number of CPTs is increased, but they have a small size, thus the joint tree algorithm for exact inference is faster and, moreover, it may become scalable also for high-dimensional BNs. Unfortunately, real-world applications of BNs may also involve a number of non-graded variables, like ordinal variables with reference state in the middle of the sample space (double-graded variables) and variables with two or more unordered non-reference states (multi-valued nominal variables).
In this paper, we propose the causal independence decomposition for BNs, which includes the Noisy-MAX and two generalizations suited to double-graded and multi-valued nominal variables. The software implementing our proposal is contained in an R (R Core Team, 2020) package available on GitHub at https://github.com/alessandromagrini/cibn. The package can be installed from the R console by typing install_github("alessandromagrini/cibn") after loading the devtools package. This paper is structured as follows. Section 2 includes the definition of BN and the notation used in the paper. In Section 3, an overview of the Noisy-MAX decomposition is provided. In Section 4, the causal independence decomposition and its properties are detailed, together with the extension to causal interactions. In Section 5, the impact of our proposal is investigated on a published BN for the diagnosis of acute cardiopulmonary diseases. Section 6 includes concluding remarks.

Definitions and Notation
In this section, the definition of Bayesian network is provided following Pearl (1988), together with the notation used in the paper.
Bayesian network. A Bayesian Network (BN) consists of the following three elements: 1. a set of variables V with finite sample space; 2. a Directed Acyclic Graph (DAG) G defined on V; 3. a set of Conditional Probability Tables (CPTs), one for each variable in V, containing the probability distribution of the variable for each combination of the values of its parents in G.
For each CPT, we denote the response variable as Y (sample space Ω Y ) and its parent variables as X 1 , . . . , X n (sample spaces Ω X 1 , . . . , Ω X n ). The states of a variable are labelled by consecutive integer numbers reflecting their order (if one holds), with value 0 assigned to the reference state. If non-reference states are unordered, they are labelled starting from value 1. The cardinality of any set S is denoted by |S|. The number of non-reference states of the response variable is denoted as n Y ≡ |Ω Y | − 1.
Variables can be either random, i.e., defined by a non-degenerate probability distribution, or deterministic, i.e., defined by a deterministic function. Each combination of the values of a variable's parents is called parent configuration. A generic realization of a random variable is written in lower case, e.g., v denotes a realization of random variable V. A probability distribution is indicated with the symbol p, with the random variable to which it refers within brackets, e.g., p(V). The elements of a probability distribution are indicated within angle brackets, e.g., < π 0 , π 1 , . . . >. An unordered set is indicated within curly brackets, e.g., {X 1 , X 2 , . . .}. An ordered set is indicated within round brackets, e.g., (0, 1, . . .).
In a DAG, each node is labelled by the name of the variable it refers to, circles represent random variables, double circles represent deterministic variables, and squares indicate variables the response is conditioned on (i.e., parent variables). When possible, the representation through plates (Buntine, 1994) is used: a rectangle contains the nodes to be replicated as many times as shown by the index in the rectangle.
In this paper, we focus on the number of free parameters defining the CPTs in a BN, which determines the efficiency of elicitation and estimation, and on the size (number of cells) of each CPT, which determines the efficiency of exact inference. Proposition 1 states that both these two features increase exponentially with the number of parent variables.
Proposition 1. A CPT has size equal to $(n_Y + 1) \prod_{i=1}^{n} |\Omega_{X_i}|$ and is defined by a number of free parameters equal to $n_Y \prod_{i=1}^{n} |\Omega_{X_i}|$.
Proof. The total number of parent configurations is equal to the product of the cardinalities of the parents' sample spaces: $\prod_{i=1}^{n} |\Omega_{X_i}|$. Thus, the size of the CPT is equal to the cardinality of the sample space of $Y$ multiplied by the total number of parent configurations: $(n_Y + 1) \prod_{i=1}^{n} |\Omega_{X_i}|$. Also, since the number of free parameters of each conditional probability distribution of $Y$ is equal to the number of its non-reference states $n_Y$, the total number of free parameters is equal to $n_Y \prod_{i=1}^{n} |\Omega_{X_i}|$.
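As a quick numerical illustration of Proposition 1, the two counts can be computed directly from the parents' cardinalities. The sketch below is only illustrative: the function names are hypothetical and are not taken from the cibn package.

```python
from math import prod

def cpt_size(n_y, parent_cardinalities):
    """Number of cells of a CPT: (n_Y + 1) times the number of parent configurations."""
    return (n_y + 1) * prod(parent_cardinalities)

def cpt_free_parameters(n_y, parent_cardinalities):
    """Free parameters of a CPT: n_Y for each parent configuration."""
    return n_y * prod(parent_cardinalities)

# A response with two non-reference states and ten dichotomous parents.
print(cpt_size(2, [2] * 10))             # 3 * 2**10 cells
print(cpt_free_parameters(2, [2] * 10))  # 2 * 2**10 free parameters
```

Each additional dichotomous parent doubles both quantities, which is the exponential growth referred to in the text.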

The Noisy-MAX Decomposition
A graded variable is an ordinal variable with the lowest state as reference. According to the notation introduced in Section 2, the sample space of a graded response variable is Ω Y = (0, 1, . . . , n Y ). The Noisy-MAX decomposition for a graded response variable is defined below following Heckerman and Breese (1996).
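Noisy-MAX decomposition. The Noisy-MAX decomposition for a graded response variable $Y$ with parents $X_1, \ldots, X_n$ consists of the following steps:
1. latent cause $\Lambda_0$ is defined with sample space equal to $\Omega_Y$, with no parents and with an unconstrained probability distribution $p(\Lambda_0)$;
2. for $i = 1, \ldots, n$, latent cause $\Lambda_i$ is defined with sample space equal to $\Omega_Y$, with $X_i$ as parent, and such that $p(\Lambda_i \mid X_i = 0) = \langle 1, 0, \ldots, 0 \rangle$;
3. latent causes $\Lambda_0, \ldots, \Lambda_n$ determine the value of $Y$ through the MAX function, i.e., $Y = \max(\Lambda_0, \ldots, \Lambda_n)$.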
The Noisy-MAX decomposition hypothesizes the existence of one latent cause for each parent (i.e., $\Lambda_1, \ldots, \Lambda_n$) plus one for unmodeled causes (i.e., $\Lambda_0$), and assumes that each latent cause may 'activate' one non-reference state of the response variable $Y$, which in turn takes value on the highest among the 'active' states. Note that the constraint $p(\Lambda_i \mid X_i = 0) = \langle 1, 0, \ldots, 0 \rangle$ for all $i = 1, \ldots, n$ means that a parent taking value on its reference state cannot cause the response to take value on a non-reference state, a feature called amechanistic property in Heckerman and Breese (1996).
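To make the role of the MAX function concrete, the following sketch computes the distribution of Y for one parent configuration from the distributions of the latent causes. It relies on a standard fact for the maximum of independent variables, namely that P(Y ≤ y) is the product of the P(Λ i ≤ y). The function name, the argument layout and the numbers in the example are assumptions made for illustration only.

```python
from itertools import accumulate

def noisy_max_distribution(leak, latent_cpts, parent_values):
    """Distribution of Y = max(Lambda_0, ..., Lambda_n) for one parent configuration.

    leak            -- p(Lambda_0), a list of n_Y + 1 probabilities
    latent_cpts[i]  -- list indexed by the state j of the (i+1)-th parent,
                       each entry being p(Lambda | X = j)
    parent_values   -- observed states of the parents, in the same order
    """
    dists = [leak] + [cpt[x] for cpt, x in zip(latent_cpts, parent_values)]
    # Cumulative distribution P(Lambda <= y) of each latent cause.
    cdfs = [list(accumulate(d)) for d in dists]
    n_states = len(leak)
    # Independence of latent causes: P(Y <= y) is the product of the CDFs.
    cdf_y = []
    for y in range(n_states):
        p = 1.0
        for cdf in cdfs:
            p *= cdf[y]
        cdf_y.append(p)
    # Back from the cumulative distribution to point probabilities.
    return [cdf_y[0]] + [cdf_y[y] - cdf_y[y - 1] for y in range(1, n_states)]

# Dichotomous response, two dichotomous parents, no unmodeled causes (hypothetical numbers).
leak = [1.0, 0.0]
latent = [
    [[1.0, 0.0], [0.2, 0.8]],  # p(Lambda_1 | X_1 = 0), p(Lambda_1 | X_1 = 1)
    [[1.0, 0.0], [0.5, 0.5]],  # p(Lambda_2 | X_2 = 0), p(Lambda_2 | X_2 = 1)
]
print(noisy_max_distribution(leak, latent, (1, 1)))
```

With a dichotomous response this reduces to the Noisy-OR: the response stays in its reference state only if every latent cause fails to activate.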
The graphical representation of the Noisy-MAX decomposition is shown in Figure 1. Proposition 2 states that the Noisy-MAX decomposition is defined by a number of free parameters which is linear in the number of parent variables, instead of exponential.
Proposition 2. The Noisy-MAX decomposition is defined by a number of free parameters equal to $n_Y \left[ 1 + \sum_{i=1}^{n} (|\Omega_{X_i}| - 1) \right]$.
Proof. The only random variables in the Noisy-MAX decomposition are latent causes $\Lambda_0, \ldots, \Lambda_n$, each with $n_Y$ non-reference states. Since $\Lambda_0$ has no parents, its probability distribution is defined by $n_Y$ free parameters. For $i = 1, \ldots, n$, latent cause $\Lambda_i$ has only $X_i$ as parent, which has $|\Omega_{X_i}| - 1$ non-reference states, thus the probability distribution of $\Lambda_i$ is defined by $n_Y (|\Omega_{X_i}| - 1)$ free parameters. Thus, the total number of free parameters is equal to $n_Y \left[ 1 + \sum_{i=1}^{n} (|\Omega_{X_i}| - 1) \right]$.
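The difference between linear and exponential growth is easy to see numerically. The sketch below compares the count in Proposition 2 with the count for an unconstrained CPT; the function names are hypothetical.

```python
from math import prod

def noisy_max_free_parameters(n_y, parent_cardinalities):
    """Free parameters of the Noisy-MAX: n_Y * [1 + sum_i (|Omega_Xi| - 1)]."""
    return n_y * (1 + sum(c - 1 for c in parent_cardinalities))

def full_cpt_free_parameters(n_y, parent_cardinalities):
    """Free parameters of the unconstrained CPT: n_Y * prod_i |Omega_Xi|."""
    return n_y * prod(parent_cardinalities)

# Two non-reference states, a growing number of dichotomous parents.
for n in (2, 5, 10):
    cards = [2] * n
    print(n, noisy_max_free_parameters(2, cards), full_cpt_free_parameters(2, cards))
```

With ten dichotomous parents the Noisy-MAX needs 22 parameters where the unconstrained CPT needs 2048.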
The linearity of the number of free parameters in the number of parent variables is a good property when CPTs are elicited from domain experts, as they may focus on the influence on the response of one parent at a time, and/or estimated from collected data, as estimates have higher efficiency. Note that, in the Noisy-MAX decomposition, probabilities refer to the states of latent causes rather than of observable variables. For instance, parameter $\pi_{i,j,l}$, i.e., the probability that $\Lambda_i = l$ given $X_i = j$, can be elicited from a domain expert by asking a question like: 'What is the probability that the event represented by variable $X_i$ taking value $j$ causes the event represented by variable $Y$ taking value $l$?'.
A further property of the Noisy-MAX decomposition is that parent divorcing can be recursively applied to latent causes $\Lambda_0, \ldots, \Lambda_n$ in order to obtain an arbitrary (but no less than two) maximum number of parents for each node in the DAG. Auxiliary nodes introduced by parent divorcing are determined by the MAX function. We refer to the Noisy-MAX decomposition where parent divorcing is applied to obtain a maximum of two parents per node as maximal Noisy-MAX decomposition (Figure 2). The maximal Noisy-MAX decomposition introduces $2n$ new nodes in the DAG as stated by Proposition 3, thus replacing the original CPT with several new CPTs of smaller size as stated by Proposition 4. As a consequence, the joint tree algorithm for exact inference is faster and, moreover, it may become scalable also for high-dimensional BNs.
Proposition 3. The maximal Noisy-MAX decomposition introduces $2n$ new nodes in the DAG.
Proof. By definition, the Noisy-MAX decomposition introduces $n + 1$ new nodes (i.e., latent causes $\Lambda_0, \ldots, \Lambda_n$) in the DAG, and parent divorcing achieving the maximal decomposition requires $n - 1$ additional nodes. Thus, the total number of new nodes introduced in the DAG by the maximal Noisy-MAX decomposition is equal to $(n + 1) + (n - 1) = 2n$.
Proposition 4. The maximal Noisy-MAX decomposition implies one CPT of size $n_Y + 1$, $n$ CPTs of size $(n_Y + 1)|\Omega_{X_i}|$ ($i = 1, \ldots, n$) and $n$ CPTs of size $(n_Y + 1)^3$.
Proof. The maximal Noisy-MAX decomposition introduces $2n$ new nodes in the DAG (Proposition 3), thus implying $2n + 1$ CPTs: one for each of the $n + 1$ latent causes, one for each of the $n - 1$ auxiliary nodes, and one for $Y$. Latent cause $\Lambda_0$ has $n_Y + 1$ states and no parents, thus its CPT has size $n_Y + 1$. For $i = 1, \ldots, n$, latent cause $\Lambda_i$ has $n_Y + 1$ states and $X_i$ as parent, thus its CPT has size $(n_Y + 1)|\Omega_{X_i}|$. Auxiliary nodes and $Y$ have $n_Y + 1$ states and, as parents, two nodes among latent causes or auxiliary nodes, thus their respective CPT has size $(n_Y + 1)^3$.
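The CPT sizes derived in the proof above can be listed explicitly for a given response and set of parents. The following sketch does so under the stated counts (one CPT for Λ 0, one for each Λ i, and n CPTs for the auxiliary nodes plus Y); the function name is hypothetical.

```python
def maximal_noisy_max_cpt_sizes(n_y, parent_cardinalities):
    """Sizes of the 2n + 1 CPTs implied by the maximal Noisy-MAX decomposition."""
    n = len(parent_cardinalities)
    states = n_y + 1
    sizes = [states]                                     # Lambda_0: no parents
    sizes += [states * c for c in parent_cardinalities]  # Lambda_i: X_i as only parent
    sizes += [states ** 3] * n                           # n - 1 auxiliary nodes and Y
    return sizes

# Dichotomous response with ten dichotomous parents.
sizes = maximal_noisy_max_cpt_sizes(1, [2] * 10)
print(len(sizes), max(sizes), sum(sizes))
```

In this example the original CPT has 2 · 2^10 = 2048 cells, while the largest CPT after decomposition has 8 cells.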
Since each latent cause is influenced at most by a single parent variable, the Noisy-MAX decomposition implicitly assumes the absence of causal interaction among parents, i.e., that they independently influence the response variable. Such property is called causal independence in Heckerman and Breese (1996).

The Causal Independence Decomposition
The Noisy-MAX decomposition is suited to graded response variables only, but real-world applications of BNs may also involve a number of non-graded variables, like double-graded and multi-valued nominal ones. In this section, we propose the Causal Independence Decomposition (CID), which includes the Noisy-MAX and two generalizations suited to double-graded and multi-valued nominal variables. These two decompositions are detailed in Subsections 4.1 and 4.2. Subsection 4.3 provides the main properties of the CID, while the extension to causal interactions is addressed in Subsection 4.4.

Causal Independence Decomposition for a Double-graded Response Variable
A double-graded variable is an ordinal variable with reference state in the middle of the sample space. According to the notation introduced in Section 2, the sample space of a double-graded response variable is $\Omega_Y = \left( -\frac{n_Y}{2}, \ldots, -1, 0, 1, \ldots, \frac{n_Y}{2} \right)$. The definition of the CID for a double-graded response variable is provided below.
CID for a double-graded response variable. The CID for a double-graded response variable $Y$ with parents $X_1, \ldots, X_n$ consists of the following steps:
1. two latent causes $\Lambda_0^{(L)}$ and $\Lambda_0^{(R)}$ are defined, with no parents and with sample spaces $\left( -\frac{n_Y}{2}, \ldots, -1, 0 \right)$ and $\left( 0, 1, \ldots, \frac{n_Y}{2} \right)$, respectively;
2. for $i = 1, \ldots, n$, two latent causes $\Lambda_i^{(L)}$ and $\Lambda_i^{(R)}$ are defined with the same respective sample spaces, with $X_i$ as parent, and such that both take value 0 with probability 1 when $X_i = 0$;
3. variable $\xi^{(L)}$ is defined with sample space $\left( -\frac{n_Y}{2}, \ldots, -1, 0 \right)$ and such that latent causes $\Lambda_0^{(L)}, \ldots, \Lambda_n^{(L)}$ determine its value through the MIN function;
4. variable $\xi^{(R)}$ is defined with sample space $\left( 0, 1, \ldots, \frac{n_Y}{2} \right)$ and such that latent causes $\Lambda_0^{(R)}, \ldots, \Lambda_n^{(R)}$ determine its value through the MAX function;
5. variables $\xi^{(L)}$ and $\xi^{(R)}$ determine the value of $Y$ through the SUM function.
The CID for a double-graded response variable hypothesizes the existence of two sets of latent causes: one of type L and another one of type R. Latent causes of type L may 'activate' one non-reference state in the left side of Ω Y , and the lowest 'active' state is stored into variable ξ (L) , while latent causes of type R may 'activate' one non-reference state in the right side of Ω Y , and the highest 'active' state is stored into variable ξ (R) . Finally, the value of Y is determined as a balance between the 'active' states in the left and right sides of Ω Y by summing ξ (L) and ξ (R) .
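The deterministic part of this decomposition (MIN on the left side, MAX on the right side, then SUM) can be sketched in a few lines. The function name and the example values are hypothetical; states are labelled as in Section 2, with negative integers on the left of the reference state 0 and positive integers on the right.

```python
def double_graded_response(left_causes, right_causes):
    """Value of Y given realizations of the latent causes of type L and type R.

    left_causes  -- realizations in (-n_Y/2, ..., -1, 0)
    right_causes -- realizations in (0, 1, ..., n_Y/2)
    """
    xi_left = min(left_causes)    # lowest 'active' state on the left side
    xi_right = max(right_causes)  # highest 'active' state on the right side
    return xi_left + xi_right     # balance between the two sides

# One cause pushes Y two steps to the left, another one step to the right.
print(double_graded_response([0, -2, -1], [0, 1, 0]))  # -1
```

When the two sides are activated with equal strength, the sum returns the reference state, which is the 'balance' behaviour described above.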
The graphical representation of the CID for a double-graded response variable is shown in Figure 3. The maximal CID is obtained by applying parent divorcing to latent causes (nodes denoted by letter Λ) until ξ (L) and ξ (R) have a maximum of two parents. Auxiliary nodes introduced by parent divorcing as intermediary between latent causes of type L and ξ (L) are determined by the MIN function, while those between latent causes of type R and ξ (R) are determined by the MAX function. Figure 4 displays the maximal CID for a double-graded response variable with three parents.

Causal Independence Decomposition for a Multi-valued Nominal Response Variable
A multi-valued nominal variable is a variable with two or more unordered non-reference states. According to the notation introduced in Section 2, the sample space of a multi-valued nominal response variable is $\Omega_Y = (0, 1, \ldots, n_Y)$, where labels of non-reference states $1, \ldots, n_Y$ do not reflect any real order. Note that a dichotomous variable with unordered states can be considered as graded, provided that one of the two states can be chosen as reference. The definition of the CID for a multi-valued nominal response variable is provided below.
CID for a multi-valued nominal response variable. The CID for a multi-valued nominal response variable $Y$ with parents $X_1, \ldots, X_n$ consists of the following steps:
1. for $l = 1, \ldots, n_Y$, latent cause $\Lambda_0^{(l)}$ is defined with sample space $(0, 1)$ and no parents;
2. for $i = 1, \ldots, n$ and for $l = 1, \ldots, n_Y$, latent cause $\Lambda_i^{(l)}$ is defined with sample space $(0, 1)$, with $X_i$ as parent, and such that $p(\Lambda_i^{(l)} \mid X_i = 0) = \langle 1, 0 \rangle$;
3. for $l = 1, \ldots, n_Y$, variable $\xi^{(l)}$ is defined with sample space $(0, 1)$ and such that latent causes $\Lambda_0^{(l)}, \ldots, \Lambda_n^{(l)}$ determine its value through the MAX function;
4. variables $\xi^{(1)}, \ldots, \xi^{(n_Y)}$ determine the value of $Y$ through the following function: $g(\xi^{(1)}, \ldots, \xi^{(n_Y)}) = l$ if $\xi^{(l)} = 1$ and $\xi^{(k)} = 0$ for all $k \neq l$, and $g(\xi^{(1)}, \ldots, \xi^{(n_Y)}) = 0$ otherwise.
The CID for a multi-valued nominal response variable assumes the existence of one set of dichotomous latent causes for each non-reference state of the response. Each set of latent causes is merged through the MAX function into variable $\xi^{(l)}$ ($l = 1, \ldots, n_Y$) to determine the 'active' non-reference states, and the response takes value on a non-reference state if and only if there is a single 'active' non-reference state.
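The rule that the response leaves its reference state if and only if exactly one non-reference state is 'active' can be written as a small function. This is a sketch of that rule as described in the text; the function name is hypothetical.

```python
def nominal_response(xi):
    """Value of Y given the indicators xi = (xi^(1), ..., xi^(n_Y)).

    Returns l if xi^(l) is the only indicator equal to 1, and the reference
    state 0 otherwise (no active state, or more than one active state).
    """
    active = [l for l, value in enumerate(xi, start=1) if value == 1]
    return active[0] if len(active) == 1 else 0

print(nominal_response([0, 1, 0]))  # 2: only the second state is active
print(nominal_response([1, 1, 0]))  # 0: two active states conflict
print(nominal_response([0, 0, 0]))  # 0: nothing is active
```

Since the non-reference states are unordered, conflicting activations cannot be resolved by taking a maximum, which is why they fall back to the reference state.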
The graphical representation of the CID for a multi-valued nominal response variable is shown in Figure 5. The maximal CID is obtained by applying parent divorcing to latent causes (nodes denoted by letter Λ) until ξ (1) , . . . , ξ (n Y ) have a maximum of two parents. Figures 6 and 7 display the maximal CID for a multi-valued nominal response variable with three parents in the case, respectively, of two and three non-reference states. Note that the maximal CID applied to a multi-valued nominal response variable reduces the number of parents to two for each node, except for node Y, which keeps n Y parents, although each of them has only two states (see Figure 7).

Properties of the Causal Independence Decomposition
The CID is defined by the same number of free parameters as the Noisy-MAX decomposition whichever the type of response variable, as stated by Proposition 5.
Proposition 5. The CID is defined by a number of free parameters equal to $n_Y \left[ 1 + \sum_{i=1}^{n} (|\Omega_{X_i}| - 1) \right]$, whichever the type of response variable.
Proof. The only random variables in the CID are latent causes (nodes denoted by letter Λ), whichever the type of response variable. The CID for a graded response variable equates to the Noisy-MAX decomposition, which, according to Proposition 2, is defined by a number of free parameters equal to $n_Y \left[ 1 + \sum_{i=1}^{n} (|\Omega_{X_i}| - 1) \right]$. In the case of a double-graded response variable, latent causes have $\frac{n_Y}{2}$ non-reference states and include: $\Lambda_0^{(L)}$ and $\Lambda_0^{(R)}$ with no parents; $\Lambda_i^{(L)}$ and $\Lambda_i^{(R)}$, each with $X_i$ as parent ($i = 1, \ldots, n$). Thus, the number of free parameters is equal to $2 \cdot \frac{n_Y}{2} \cdot \left[ 1 + \sum_{i=1}^{n} (|\Omega_{X_i}| - 1) \right] = n_Y \left[ 1 + \sum_{i=1}^{n} (|\Omega_{X_i}| - 1) \right]$. In the case of a multi-valued nominal response variable, latent causes have one non-reference state and include: $\Lambda_0^{(1)}, \ldots, \Lambda_0^{(n_Y)}$ with no parent variables; $\Lambda_i^{(1)}, \ldots, \Lambda_i^{(n_Y)}$, each with $X_i$ as parent ($i = 1, \ldots, n$). Thus, the number of free parameters is equal to $n_Y \left[ 1 + \sum_{i=1}^{n} (|\Omega_{X_i}| - 1) \right]$.
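The argument in the proof can be checked numerically by counting, for each type of response, the number of sets of latent causes and the number of non-reference states per latent cause. The sketch below follows that counting; the function name and the string labels for the response types are hypothetical.

```python
def cid_free_parameters(n_y, parent_cardinalities, kind):
    """Free parameters of the CID, counted set by set as in the proof.

    kind -- 'graded', 'double-graded' or 'nominal' (illustrative labels)
    """
    if kind == "graded":
        n_sets, states_per_cause = 1, n_y          # one set, n_Y non-reference states
    elif kind == "double-graded":
        if n_y % 2 != 0:
            raise ValueError("a double-graded response needs an even n_Y")
        n_sets, states_per_cause = 2, n_y // 2     # sets L and R, n_Y / 2 states each
    elif kind == "nominal":
        n_sets, states_per_cause = n_y, 1          # one dichotomous set per state
    else:
        raise ValueError(f"unknown response type: {kind}")
    per_set = states_per_cause * (1 + sum(c - 1 for c in parent_cardinalities))
    return n_sets * per_set

# The three counts coincide, as stated by the proposition.
for kind in ("graded", "double-graded", "nominal"):
    print(kind, cid_free_parameters(4, [2, 3], kind))
```

For a response with four non-reference states and parents with two and three states, each count equals 4 · (1 + 1 + 2) = 16.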
Analogously to the maximal Noisy-MAX decomposition, the maximal CID introduces a number of new nodes in the DAG as stated by Proposition 6, thus replacing the original CPT with several new CPTs of smaller size as stated by Proposition 7. As a consequence, the joint tree algorithm for exact inference is generally faster and, moreover, it may become scalable also for high-dimensional BNs. The properties of the maximal CID are summarized in Table 1.
Figure 6. Maximal CID for a multi-valued nominal response variable with two non-reference states and three parents. Auxiliary nodes are denoted by letter A.
Figure 7. Maximal CID for a multi-valued nominal response variable with three non-reference states and three parents. Auxiliary nodes are denoted by letter A.
Proposition 6. The number of new nodes introduced in the DAG by the maximal CID is equal to 2n for a graded response, to 2(2n + 1) for a double-graded response, and to n Y (2n + 1) for a multi-valued nominal response.
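The three counts in Proposition 6 can be recovered by adding up the components of each decomposition: latent causes, auxiliary nodes from parent divorcing, and the merging variables ξ. The sketch below assumes, as in the Noisy-MAX case, that each set of n + 1 latent causes requires n − 1 auxiliary nodes; the function name and type labels are hypothetical.

```python
def new_nodes_maximal_cid(n, n_y, kind):
    """New nodes introduced by the maximal CID for a response with n parents."""
    if kind == "graded":
        n_sets, merge_nodes = 1, 0        # Y itself merges the latent causes
    elif kind == "double-graded":
        n_sets, merge_nodes = 2, 2        # xi^(L) and xi^(R)
    elif kind == "nominal":
        n_sets, merge_nodes = n_y, n_y    # xi^(1), ..., xi^(n_Y)
    else:
        raise ValueError(f"unknown response type: {kind}")
    latent = n_sets * (n + 1)             # Lambda_0, ..., Lambda_n in each set
    auxiliary = n_sets * (n - 1)          # parent divorcing within each set
    return latent + auxiliary + merge_nodes

n = 3
print(new_nodes_maximal_cid(n, 2, "graded"))         # 2n
print(new_nodes_maximal_cid(n, 4, "double-graded"))  # 2(2n + 1)
print(new_nodes_maximal_cid(n, 3, "nominal"))        # n_Y (2n + 1)
```

With three parents the counts are 6, 14 and, for three non-reference states, 21.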
Proof of Proposition 7 (graded response). The maximal CID for a graded variable equates to the maximal Noisy-MAX decomposition, which, according to Proposition 4, implies one CPT of size $n_Y + 1$, $n$ CPTs of size $(n_Y + 1)|\Omega_{X_i}|$ ($i = 1, \ldots, n$) and $n$ CPTs of size $(n_Y + 1)^3$.
Proof of Proposition 7 (double-graded response). The maximal CID for a double-graded variable introduces $2(2n + 1)$ new nodes (Proposition 6), thus implying $4n + 3$ CPTs: one for each of the $n + 1$ latent causes $\Lambda_i^{(L)}$ ($i = 0, \ldots, n$), one for each of the $n + 1$ latent causes $\Lambda_i^{(R)}$ ($i = 0, \ldots, n$), one for each of the $2(n - 1)$ auxiliary nodes to achieve the maximal decomposition, one for $\xi^{(L)}$, one for $\xi^{(R)}$ and one for $Y$. Since $\Lambda_0^{(L)}$ and $\Lambda_0^{(R)}$ have $\frac{n_Y}{2} + 1$ states and no parents, their respective CPT has size $\frac{n_Y}{2} + 1$. For $i = 1, \ldots, n$, latent causes $\Lambda_i^{(L)}$ and $\Lambda_i^{(R)}$ have $\frac{n_Y}{2} + 1$ states and $X_i$ as parent, thus their respective CPT has size $\left( \frac{n_Y}{2} + 1 \right) |\Omega_{X_i}|$. Auxiliary nodes to achieve the maximal CID, as well as $\xi^{(L)}$ and $\xi^{(R)}$, have $\frac{n_Y}{2} + 1$ states and, as parents, two nodes among latent causes and auxiliary nodes, thus their respective CPT has size $\left( \frac{n_Y}{2} + 1 \right)^3$. Finally, $Y$ has $\xi^{(L)}$ and $\xi^{(R)}$ as parents, thus its CPT has size $(n_Y + 1) \left( \frac{n_Y}{2} + 1 \right)^2$.
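The CPT sizes listed in the proof for the double-graded case can be enumerated for a concrete response. The sketch below follows the counts given there (4n + 3 CPTs in total); the function name is hypothetical.

```python
def maximal_cid_double_graded_cpt_sizes(n_y, parent_cardinalities):
    """Sizes of the 4n + 3 CPTs of the maximal CID for a double-graded response."""
    if n_y % 2 != 0:
        raise ValueError("a double-graded response needs an even n_Y")
    n = len(parent_cardinalities)
    half_states = n_y // 2 + 1                # states of each latent cause and of xi
    sizes = [half_states] * 2                 # Lambda_0^(L) and Lambda_0^(R)
    for c in parent_cardinalities:
        sizes += [half_states * c] * 2        # Lambda_i^(L) and Lambda_i^(R)
    sizes += [half_states ** 3] * (2 * n)     # 2(n - 1) auxiliary nodes, xi^(L), xi^(R)
    sizes.append((n_y + 1) * half_states ** 2)  # Y with parents xi^(L) and xi^(R)
    return sizes

# Four non-reference states and three dichotomous parents.
sizes = maximal_cid_double_graded_cpt_sizes(4, [2, 2, 2])
print(len(sizes), max(sizes))
```

In this example the decomposition yields 15 CPTs, the largest being the one of Y with 5 · 3² = 45 cells.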

Extension to Causal Interactions
Suppose that, for a response variable $Y$, causal interaction holds among the variables in a subset $X_S$ of the parents $X_1, \ldots, X_n$. The following two-step technique allows the CID to be applied in this case.
1. A new multi-valued nominal variable $Z$ is created with variables in $X_S$ as parents, sample space equal to the Cartesian product of the sample spaces of variables in $X_S$, and such that $Z$ takes value on a particular combination of states $x_S$ if and only if each variable in $X_S$ takes value on the respective state in $x_S$. The reference state of $Z$ is the one combining the reference states of variables in $X_S$.
2. The edges connecting variables in $X_S$ to $Y$ are deleted and an edge from $Z$ to $Y$ is added.
After this technique is implemented, causal independence holds among the new parents of Y, thus the CID can be applied.
As an example, suppose that $X_S = \{X_1, X_2\}$ and $\Omega_{X_1} = \Omega_{X_2} = (0, 1)$. In this case, we create the multi-valued nominal variable $Z$ with sample space $\Omega_Z = \left( (0, 0), (0, 1), (1, 0), (1, 1) \right)$, where the reference state is $(0, 0)$, because it combines the reference states of $X_1$ and $X_2$. Figure 8 illustrates the technique for a graded response variable, but it is identically implemented for a double-graded or a multi-valued nominal one.
Figure 8. Illustration of the technique allowing the CID to be applied in the presence of causal interactions. Here, $Y$ is a graded variable with parents $X_1$, $X_2$, $X_3$ and $X_4$, with a causal interaction holding between $X_1$ and $X_2$. Node $Z$ is introduced as intermediary between the interacting parents and $Y$ (left panel). In this way, the new set of parents $\{Z, X_3, X_4\}$ satisfies causal independence and the maximal CID can be applied (right panel).
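The construction of Z from the interacting parents amounts to enumerating the Cartesian product of their sample spaces and labelling the combination of reference states as 0. A minimal sketch follows, assuming that every parent has 0 as reference state (as in Section 2); the function name and the labelling order of the non-reference states are illustrative choices.

```python
from itertools import product

def build_interaction_variable(sample_spaces):
    """Sample space of Z and a map from parent configurations to integer labels.

    sample_spaces -- one tuple of states per interacting parent, each containing 0
    """
    reference = tuple(0 for _ in sample_spaces)   # combination of reference states
    others = [c for c in product(*sample_spaces) if c != reference]
    states = [reference] + others                 # reference state listed first
    labels = {state: k for k, state in enumerate(states)}
    return states, labels

states, labels = build_interaction_variable([(0, 1), (0, 1)])
print(states)          # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(labels[(1, 1)])  # 3
```

Since Z is nominal, the integer labels of its non-reference states carry no order; only the label 0 of the reference state is meaningful.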

Practical Application
We investigate the impact of the maximal CID on the BN for the diagnosis of acute cardiopulmonary diseases developed by Magrini et al. (2018). The BN contains 278 variables, which are distinguished by the authors into dichotomous, polytomous and continuous. The reference state and, where one exists, an order on the states are established for each dichotomous and polytomous variable, while a reference range is defined for continuous variables, which can be at the left side of the sample space (restricted continuous variable), or in its middle (non-restricted continuous variable).
We considered dichotomous, ordinal polytomous and restricted continuous variables as graded (249 variables), nonrestricted continuous variables as double-graded (12 variables), and non-ordinal polytomous variables as multi-valued nominal (17 variables). Most variables have a single non-reference state (207 variables), all the double-graded variables have four non-reference states, and, among multi-valued nominal variables, most have two or three non-reference states (14 variables out of 17). Also, there are 10 variables with at least two sets of interacting parents, for a total of 36 sets, all with cardinality equal to two. The BN is characterized by a high structural complexity: the mean size of the parent sets (inner degree) is 2.1 with a maximum of 10, the mean size of the child set (outer degree) is 2.1 with a maximum of 31, and the mean size of Markov blankets (i.e., number of parents, children and parents of the children) is 8.5 with a maximum of 50. The distribution of the main structural characteristics of the BN is summarized in Table 2. An illustration of the maximal CID applied to three variables in the BN is provided in the Appendix, and the decomposed BN is available as an R object at https://github.com/alessandromagrini/cibn. Table 3 summarizes the impact of the maximal CID on the BN with regard to the number of new nodes, the number of free parameters and the size of CPTs. We see that, at the cost of increasing by 5.7 times the number of nodes (from 278 to 1574) and by 6.6 times the number of CPTs (from 278 to 1828), the number of free parameters is reduced by 23.5 times (from 47560 to 2023), while the size of the CPTs is decreased by 24.1 times in mean (from 234 to 9.7) and by 108 times in maximum (from 23328 to 216).

Concluding Remarks
We have proposed an extension of the Noisy-MAX decomposition to non-graded variables, called Causal Independence Decomposition (CID). Our proposal maintains the two desirable properties of the Noisy-MAX: linearity of the number of free parameters with respect to the number of parent variables and significant reduction of the size of CPTs. The first property is important for elicitation from domain experts and estimation from collected data, as the addition of a new parent variable entails an increase in the number of parameters which is proportional to the number of that parent's non-reference states. The second property is important for exact inference, because speed and scalability of the joint tree algorithm depend inversely on the size of CPTs.
The CID is maximally efficient if nominal variables have no more than two non-reference states and all the sets of interacting parents have cardinality equal to two, a situation in which the DAG can be decomposed until each node has no more than two parents. In the Bayesian network for the diagnosis of acute cardiopulmonary diseases developed by Magrini et al. (2018), the number of nominal variables with more than two non-reference states is small compared to the dimension of the model, and all the sets of interacting parents have cardinality equal to two. The CID has a very significant impact in this case, with the number of free parameters and the mean size of CPTs reduced by 23.5 and 24.1 times, respectively.