1 Introduction
Sum-Product Networks (SPNs) have recently attracted considerable interest because of their flexibility in modeling complex distributions as well as the tractability of performing exact marginal inference (Poon and Domingos, 2011; Gens and Domingos, 2012, 2013; Peharz et al., 2015; Zhao et al., 2015, 2016a, 2016b; Peharz et al., 2017). They are general-purpose inference machines over which one can perform exact joint, marginal and conditional queries in time linear in the size of the network. It has been shown that discrete SPNs are equivalent to arithmetic circuits (ACs) (Darwiche, 2003; Park and Darwiche, 2004) in the sense that one can transform each SPN into an equivalent AC and vice versa in linear time and space with respect to the network size (Rooshenas and Lowd, 2014). SPNs are also closely connected to probabilistic graphical models: by interpreting each sum node in the network as a hidden variable and each product node as a rule encoding context-specific conditional independence (Boutilier et al., 1996), every SPN can be equivalently converted into a Bayesian network where compact data structures are used to represent the local probability distributions (Zhao et al., 2015). This relationship characterizes the probabilistic semantics encoded by the network structure and allows practitioners to design principled and efficient parameter learning algorithms for SPNs (Zhao et al., 2016a,b).

Most existing batch learning algorithms for SPNs can be straightforwardly adapted to the online setting, where the network updates its parameters after it receives one instance at each time step. This online setting makes SPNs applicable in real-world scenarios where either the data set is too large to store at once, or the network needs to adapt to changes in the external data distribution. Recently, Rashwan et al. (2016) proposed an online Bayesian Moment Matching (BMM) algorithm to learn the probability distribution of the model parameters of SPNs based on the method of moments. Later, Jaini et al. (2016)
extended this algorithm to the continuous case where the leaf nodes in the network are assumed to be Gaussian distributions. At a high level, BMM can be understood as an instance of the general assumed density filtering framework (Sorenson and Stubberud, 1968), where the algorithm finds an approximate posterior distribution within a tractable family of distributions by the method of moments. Specifically, BMM for SPNs works by matching the first- and second-order moments of the approximate tractable posterior distribution to those of the exact but intractable posterior. An essential subroutine of the above two algorithms (Rashwan et al., 2016; Jaini et al., 2016) is to efficiently compute the exact first- and second-order moments of the one-step update posterior distribution (cf. Section 3.2). Rashwan et al. (2016) designed a recursive algorithm to achieve this goal in linear time when the underlying network structure is a tree, and this algorithm is also used by Jaini et al. (2016) in the continuous case. However, the algorithm only works when the underlying network structure is a tree, and a naive computation of such moments in a DAG scales quadratically with the network size. Often this quadratic computation is prohibitively expensive even for SPNs of moderate size.

In this paper we propose a linear time (and space) algorithm that is able to compute any moments of all the network parameters simultaneously even when the underlying network structure is a DAG. There are three key ingredients in the design and analysis of our algorithm: 1). a linear time reduction from the moment computation problem to the joint inference problem in SPNs; 2). a succinct evaluation procedure of the network polynomial by differentiation, without expanding it; and 3). a dynamic programming method to further reduce the quadratic computation to linear. The differential approach (Darwiche, 2003) used for polynomial evaluation can also be applied for exact inference in Bayesian networks. This technique has also been implicitly used in the recent development of a concave-convex procedure (CCCP) for optimizing the weights of SPNs (Zhao et al., 2016b).
Essentially, by reducing the moment computation problem to a joint inference problem in SPNs, we are able to exploit the fact that the network polynomial of an SPN computes a multilinear function of the model parameters, so we can efficiently evaluate this polynomial by differentiation even if the polynomial may contain exponentially many monomials, provided that the polynomial admits a tractable circuit representation. Dynamic programming can then be used to trade off a constant factor in space complexity (using two additional copies of the network) to reduce the quadratic time complexity to linear, so that all the edge moments can be computed simultaneously in two passes of the network. To demonstrate the usefulness of our linear time subroutine for computing moments, we apply it to design an efficient assumed density filtering (ADF) algorithm (Sorenson and Stubberud, 1968) to learn the parameters of SPNs in an online fashion. ADF runs in linear time and space due to our efficient subroutine. As an additional contribution, we also show that ADF and BMM can both be understood under a general framework of moment matching, where the only difference lies in which moments are chosen to be matched and how they are matched.
2 Preliminaries
We use $[n]$ to abbreviate the set $\{1, 2, \ldots, n\}$, and we reserve $\mathcal{S}$ to represent an SPN and use $|\mathcal{S}|$ to mean the size of an SPN, i.e., the number of edges plus the number of nodes in the graph.
2.1 Sum-Product Networks
A sum-product network $\mathcal{S}$ is a computational circuit over a set of random variables $X = \{X_1, \ldots, X_n\}$. It is a rooted directed acyclic graph. The internal nodes of $\mathcal{S}$ are sums or products and the leaves are univariate distributions over the $X_i$. In its simplest form, the leaves of $\mathcal{S}$ are indicator variables $\mathbb{I}_{X_i = x_i}$, which can also be understood as categorical distributions whose entire probability mass is on a single value. Edges from sum nodes are parameterized with positive weights. A sum node computes a weighted sum of its children and a product node computes the product of its children. If we interpret each node in an SPN as a function of the leaf nodes, then the scope of a node in an SPN is defined as the set of variables that appear in this function. More formally, for any node $v$ in an SPN, if $v$ is a terminal node, say, an indicator variable over $X_i$, then $\text{scope}(v) = \{X_i\}$; otherwise $\text{scope}(v) = \bigcup_{u \in Ch(v)} \text{scope}(u)$, where $Ch(v)$ denotes the children of $v$. An SPN is complete iff each sum node has children with the same scope, and is decomposable iff for every product node $v$, $\text{scope}(u_1) \cap \text{scope}(u_2) = \emptyset$ for every pair $(u_1, u_2)$ of children of $v$. It has been shown that every valid SPN can be converted into a complete and decomposable SPN with at most a quadratic increase in size (Zhao et al., 2015) without changing the underlying distribution. As a result, in this work we assume that all the SPNs we discuss are complete and decomposable.

Let $x$ be an instantiation of the random vector $X$. We associate an unnormalized probability $V_k(x; w)$ with each node $k$ when the input to the network is $x$ and the network weights are set to be $w$:

$$V_k(x; w) = \begin{cases} p(X_i = x_i) & \text{if } k \text{ is a leaf node over } X_i \\ \prod_{j \in Ch(k)} V_j(x; w) & \text{if } k \text{ is a product node} \\ \sum_{j \in Ch(k)} w_{kj} V_j(x; w) & \text{if } k \text{ is a sum node} \end{cases} \qquad (1)$$

where $Ch(k)$ is the child list of node $k$ in the graph and $w_{kj}$ is the edge weight associated with sum node $k$ and its child node $j$. The probability of a joint assignment $X = x$ is computed by the value at the root of $\mathcal{S}$ with input $x$ divided by a normalization constant $V_{\text{root}}(1; w)$: $p(x) = V_{\text{root}}(x; w) / V_{\text{root}}(1; w)$, where $V_{\text{root}}(1; w)$ is the value of the root node when all the values of the leaf nodes are set to be 1. This essentially corresponds to marginalizing out the random vector $X$, which ensures that $p(x)$ defines a proper probability distribution. Remarkably, all queries w.r.t. $p(x)$, including joint, marginal, and conditional queries, can be answered in time linear in the size of the network.
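The recursion in (1) is simple to implement. The following is an illustrative sketch, not code from the paper: the dictionary-based node representation is our own assumption, and memoizing on node identity keeps one evaluation pass linear in the size of the DAG.

```python
def evaluate(node, cache=None):
    """Bottom-up evaluation of V_k(x; w) following the recursion in Eq. (1)."""
    if cache is None:
        cache = {}
    key = id(node)
    if key in cache:              # shared DAG nodes are evaluated once
        return cache[key]
    if node["type"] == "leaf":
        val = node["value"]       # p(X_i = x_i) for the current input x
    elif node["type"] == "product":
        val = 1.0
        for child in node["children"]:
            val *= evaluate(child, cache)
    else:                         # sum node: weighted sum over children
        val = sum(w * evaluate(c, cache)
                  for w, c in zip(node["weights"], node["children"]))
    cache[key] = val
    return val

# A tiny SPN over two variables: a sum over two products of leaves.
x1a = {"type": "leaf", "value": 0.5}
x2a = {"type": "leaf", "value": 0.4}
x1b = {"type": "leaf", "value": 0.6}
x2b = {"type": "leaf", "value": 0.2}
p1 = {"type": "product", "children": [x1a, x2a]}
p2 = {"type": "product", "children": [x1b, x2b]}
root = {"type": "sum", "weights": [0.3, 0.7], "children": [p1, p2]}

print(evaluate(root))  # 0.3*0.5*0.4 + 0.7*0.6*0.2 = 0.144
```

Evaluating the same circuit with all leaf values set to 1 yields the normalization constant $V_{\text{root}}(1; w)$, so both the unnormalized and normalized probabilities come out of the same linear-time pass.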
2.2 Bayesian Networks and Mixture Models
We provide two alternative interpretations of SPNs that will be useful later in designing our linear time moment computation algorithm. The first relates SPNs to Bayesian networks (BNs). Informally, any complete and decomposable SPN $\mathcal{S}$ over $X = \{X_1, \ldots, X_n\}$ can be converted into a bipartite BN of size $O(n|\mathcal{S}|)$ (Zhao et al., 2015). In this construction, each internal sum node in $\mathcal{S}$ corresponds to one latent variable in the constructed BN, and each leaf distribution node corresponds to one observable variable in the BN. Furthermore, the constructed BN is a simple bipartite graph with one layer of local latent variables pointing to one layer of observable variables. An observable variable is a child of a local latent variable if and only if the observable variable appears as a descendant of the latent variable (sum node) in the original SPN. This means that the SPN can be understood as a BN where the number of latent variables per instance is $O(|\mathcal{S}|)$.
The second perspective is to view an SPN as a mixture model with exponentially many mixture components (Dennis and Ventura, 2015; Zhao et al., 2016b). More specifically, we can decompose each complete and decomposable SPN into a sum of induced trees, where each tree corresponds to a product of univariate distributions. To proceed, we first formally define what we call induced trees:
Definition 1 (Induced tree SPN).
Given a complete and decomposable SPN $\mathcal{S}$ over $X = \{X_1, \ldots, X_n\}$, a subgraph $\mathcal{T}$ is called an induced tree SPN from $\mathcal{S}$ if 1). $\text{Root}(\mathcal{S}) \in \mathcal{T}$; 2). if $v \in \mathcal{T}$ is a sum node, then exactly one child of $v$ in $\mathcal{S}$ is in $\mathcal{T}$, and the corresponding edge is in $\mathcal{T}$; 3). if $v \in \mathcal{T}$ is a product node, then all the children of $v$ in $\mathcal{S}$ are in $\mathcal{T}$, and the corresponding edges are in $\mathcal{T}$.
It has been shown that Def. 1 produces subgraphs of $\mathcal{S}$ that are trees as long as the original SPN is complete and decomposable (Dennis and Ventura, 2015; Zhao et al., 2016b). One useful result based on the concept of induced trees is:
Theorem 1 (Zhao et al., 2016b).
Let $\tau_{\mathcal{S}} = V_{\text{root}}(1; 1)$, i.e., the value of the network when all leaf values and all edge weights are set to 1. Then $\tau_{\mathcal{S}}$ counts the number of unique induced trees in $\mathcal{S}$, and $V_{\text{root}}(x; w)$ can be written as $\sum_{t=1}^{\tau_{\mathcal{S}}} \prod_{(k,j) \in \mathcal{T}_t} w_{kj} \prod_{i=1}^{n} p_t(X_i = x_i)$, where $\mathcal{T}_t$ is the $t$th unique induced tree of $\mathcal{S}$ and $p_t(X_i)$ is the univariate distribution over $X_i$ that appears in $\mathcal{T}_t$ as a leaf node.
Thm. 1 shows that $\tau_{\mathcal{S}}$ can be computed efficiently by setting all the edge weights and leaf values to be 1 and evaluating the network. In general, counting problems are in the complexity class #P (Valiant, 1979), and the fact that both the probabilistic inference problem and the counting problem are tractable in SPNs implies that SPNs work on a subset of distributions that have succinct/efficient circuit representations. Without loss of generality, assuming that sum layers alternate with product layers in $\mathcal{S}$, we have $\tau_{\mathcal{S}} = \Omega(2^h)$, where $h$ is the height of $\mathcal{S}$. Hence the mixture model represented by $\mathcal{S}$ has a number of mixture components that is exponential in the height of $\mathcal{S}$. Thm. 1 characterizes both the number of components and the form of each component in the mixture model, as well as their mixture weights. For the convenience of later discussion, we call $V_{\text{root}}(x; w)$ the network polynomial of $\mathcal{S}$.
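Thm. 1 suggests a one-pass procedure for counting induced trees: set every edge weight and leaf value to 1 and evaluate bottom-up, so sum nodes add (choose one child) and product nodes multiply (take all children). A hedged sketch under a toy node representation of our own (schematic only; scopes are omitted):

```python
def count_induced_trees(node, cache=None):
    """tau_S: evaluate the network with all weights and leaves set to 1."""
    if cache is None:
        cache = {}
    key = id(node)
    if key in cache:              # memoization keeps the pass linear on a DAG
        return cache[key]
    if node["type"] == "leaf":
        count = 1
    elif node["type"] == "product":
        count = 1
        for child in node["children"]:
            count *= count_induced_trees(child, cache)
    else:                         # sum node: one branch per child choice
        count = sum(count_induced_trees(c, cache) for c in node["children"])
    cache[key] = count
    return count

# A DAG with a shared sum node: both products reuse `shared`, so the
# two choices inside it multiply through each path to the root.
leaf = lambda: {"type": "leaf"}
shared = {"type": "sum", "children": [leaf(), leaf()]}   # 2 choices
p1 = {"type": "product", "children": [shared, leaf()]}   # 2 trees below p1
p2 = {"type": "product", "children": [shared, leaf()]}   # 2 trees below p2
root = {"type": "sum", "children": [p1, p2]}
print(count_induced_trees(root))  # 2 + 2 = 4
```

The shared node illustrates why DAG structure matters: the number of induced trees grows multiplicatively along paths even though the circuit itself stays small.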
Corollary 1.
The network polynomial $V_{\text{root}}(x; w)$ is a multilinear function of $w$ with positive coefficients on each monomial.
Corollary 1 holds since each monomial corresponds to an induced tree and each edge appears at most once in each tree. This property will be crucial in our derivation of a linear time algorithm for moment computation in SPNs.
3 Linear Time Exact Moment Computation
3.1 Exact Posterior Has Exponentially Many Modes
Let $m$ be the number of sum nodes in $\mathcal{S}$. Suppose we are given a fully factorized prior distribution $p_0(w) = \prod_{k=1}^{m} p_0(w_k)$ over the model parameters $w$. It is worth pointing out that the fully factorized prior distribution is well justified by the bipartite graph structure of the equivalent BN we introduced in Section 2.2. We are interested in computing the moments of the posterior distribution after we receive one observation from the world. Essentially, this is the Bayesian online learning setting where we update the belief about the distribution of the model parameters as we observe data from the world sequentially. Note that $w_k$ corresponds to the weight vector associated with sum node $k$, so $w_k$ is a vector that satisfies $w_k > 0$ and $\sum_j w_{kj} = 1$. Let us assume that the prior distribution for each $w_k$ is Dirichlet, i.e.,

$$p_0(w) = \prod_{k=1}^{m} \text{Dir}(w_k; \alpha_k) = \prod_{k=1}^{m} \frac{\Gamma(\sum_j \alpha_{kj})}{\prod_j \Gamma(\alpha_{kj})} \prod_j w_{kj}^{\alpha_{kj} - 1}$$
After observing one instance $x$, by Bayes' rule the exact posterior distribution is $p(w \mid x) = p_0(w)\, p(x \mid w) / p(x)$. Let $Z = p(x)$ and observe that the network polynomial also computes the likelihood: $p(x \mid w) = V_{\text{root}}(x; w)$. Plugging the expression for the prior distribution as well as the network polynomial (Thm. 1) into the above formula, we have

$$p(w \mid x) = \frac{1}{Z} \sum_{t=1}^{\tau_{\mathcal{S}}} \left( \prod_{i=1}^{n} p_t(x_i) \right) \prod_{k=1}^{m} \text{Dir}(w_k; \alpha_k) \prod_{(k,j) \in \mathcal{T}_t} w_{kj}$$

Since the Dirichlet is a conjugate distribution to the multinomial, each term in the summation is an updated Dirichlet with a multiplicative constant, so the above equation shows that the exact posterior distribution becomes a mixture of $\tau_{\mathcal{S}}$ Dirichlets after one observation. After a data set of $d$ instances, the exact posterior becomes a mixture of $\tau_{\mathcal{S}}^d$ components, which is intractable to maintain since $\tau_{\mathcal{S}} = \Omega(2^h)$.
The hardness of maintaining the exact posterior distribution calls for an approximate scheme where we can sequentially update our belief about the distribution while efficiently maintaining the approximation. Assumed density filtering (Sorenson and Stubberud, 1968) is such a framework: after observing each instance, the algorithm chooses an approximate distribution from a tractable family of distributions. A typical choice is to match the moments of the approximation to those of the exact posterior.
3.2 The Hardness of Computing Moments
In order to find an approximate distribution that matches the moments of the exact posterior, we need to be able to compute those moments under the exact posterior. This is not a problem for traditional mixture models such as mixtures of Gaussians or latent Dirichlet allocation, since the number of mixture components in those models is assumed to be a small constant. However, this is not the case for SPNs, where the effective number of mixture components is $\tau_{\mathcal{S}} = \Omega(2^h)$, which also depends on the size of the input network $\mathcal{S}$.
To simplify the notation, for each $t \in [\tau_{\mathcal{S}}]$ we define $c_t \triangleq \prod_{i=1}^{n} p_t(x_i)$ and $u_t \triangleq M_{p_0}\big(\prod_{(k,j) \in \mathcal{T}_t} w_{kj}\big)$.¹ That is, $c_t$ corresponds to the product of leaf distributions in the $t$th induced tree $\mathcal{T}_t$, and $u_t$ is the moment of $\prod_{(k,j) \in \mathcal{T}_t} w_{kj}$, i.e., the product of the tree edge weights, under the prior distribution $p_0(w)$. Realizing that the posterior distribution needs to satisfy the normalization constraint, we have:

$$\frac{1}{Z} \sum_{t=1}^{\tau_{\mathcal{S}}} c_t u_t = 1 \qquad (2)$$

¹For ease of notation, we omit the explicit dependency of $c_t$ on the instance $x$.

Note that the prior distribution for each sum node is a Dirichlet distribution. In this case we can compute a closed form expression for $u_t$ as:

$$u_t = M_{p_0}\Big(\prod_{(k,j) \in \mathcal{T}_t} w_{kj}\Big) = \prod_{(k,j) \in \mathcal{T}_t} \frac{\alpha_{kj}}{\sum_{j'} \alpha_{kj'}} \qquad (3)$$

More generally, let $f(\cdot)$ be a function applied to each edge weight in an SPN. We use the notation $M_p(f(w_{kj}))$ to mean the moment of $f(w_{kj})$ evaluated under the distribution $p$. We are interested in computing such moments under $p_1 \triangleq p(w \mid x)$, which we call the one-step update posterior distribution. More specifically, for each edge weight $w_{kj}$, we would like to compute the following quantity:

$$M_{p_1}(f(w_{kj})) = \int f(w_{kj})\, p(w \mid x)\, dw \qquad (4)$$
We note that (4) is not trivial to compute since, after expanding $p(w \mid x)$, it involves $\tau_{\mathcal{S}} = \Omega(2^h)$ terms. Furthermore, in order to conduct moment matching, we need to compute the above moment for each edge $(k, j)$ emanating from a sum node. A naive computation will lead to a total time complexity of $O(|\mathcal{S}|^2)$. A linear time algorithm to compute these moments was designed by Rashwan et al. (2016) for the case where the underlying structure of $\mathcal{S}$ is a tree. This algorithm recursively computes the moments in a top-down fashion along the tree. However, it breaks down when the underlying graph is a DAG.

In what follows we present an $O(|\mathcal{S}|)$ time and space algorithm that is able to compute all the moments simultaneously for general SPNs with DAG structures. We first show a linear time reduction from the moment computation in (4) to a joint inference problem in $\mathcal{S}$, and then proceed to use the differential trick to efficiently compute (4) for each edge in the graph. The final component is a dynamic program that simultaneously computes (4) for all edges in the graph by trading a constant factor in space complexity for a reduction of the time complexity from quadratic to linear.
3.3 Linear Time Reduction from Moment Computation to Joint Inference
Let us first compute (4) for a fixed edge $(k, j)$. Our strategy is to partition all the induced trees based on whether they contain the edge $(k, j)$ or not. Define $\mathcal{F}_{\bar{e}} \triangleq \{\mathcal{T}_t : (k,j) \notin \mathcal{T}_t\}$ and $\mathcal{F}_{e} \triangleq \{\mathcal{T}_t : (k,j) \in \mathcal{T}_t\}$. In other words, $\mathcal{F}_{\bar{e}}$ corresponds to the set of induced trees that do not contain the edge $(k, j)$ and $\mathcal{F}_{e}$ corresponds to the set of induced trees that do contain it. Then,

$$M_{p_1}(f(w_{kj})) = \frac{1}{Z} \sum_{\mathcal{T}_t \in \mathcal{F}_{\bar{e}}} c_t \int f(w_{kj})\, p_0(w) \prod_{(k',j') \in \mathcal{T}_t} w_{k'j'}\, dw + \frac{1}{Z} \sum_{\mathcal{T}_t \in \mathcal{F}_{e}} c_t \int f(w_{kj})\, p_0(w) \prod_{(k',j') \in \mathcal{T}_t} w_{k'j'}\, dw \qquad (5)$$

For the induced trees that contain the edge $(k, j)$, we have

$$\int f(w_{kj})\, p_0(w) \prod_{(k',j') \in \mathcal{T}_t} w_{k'j'}\, dw = u_t\, M_{p_0'}(f(w_{kj})) \qquad (6)$$

where $p_0'$ is the one-step update posterior Dirichlet distribution for sum node $k$ after absorbing the term $w_{kj}$. Similarly, for the induced trees that do not contain the edge $(k, j)$:

$$\int f(w_{kj})\, p_0(w) \prod_{(k',j') \in \mathcal{T}_t} w_{k'j'}\, dw = u_t\, M_{p_0}(f(w_{kj})) \qquad (7)$$

The above equation holds by changing the order of integration and realizing that, since $(k, j)$ is not in tree $\mathcal{T}_t$, the product $\prod_{(k',j') \in \mathcal{T}_t} w_{k'j'}$ does not contain the term $w_{kj}$. Note that both $M_{p_0}(f(w_{kj}))$ and $M_{p_0'}(f(w_{kj}))$ are independent of the specific induced trees, so we can combine the above two parts to express $M_{p_1}(f(w_{kj}))$ as:

$$M_{p_1}(f(w_{kj})) = \Big(\frac{1}{Z} \sum_{\mathcal{T}_t \in \mathcal{F}_{\bar{e}}} c_t u_t\Big)\, M_{p_0}(f(w_{kj})) + \Big(\frac{1}{Z} \sum_{\mathcal{T}_t \in \mathcal{F}_{e}} c_t u_t\Big)\, M_{p_0'}(f(w_{kj})) \qquad (8)$$

From (2) we have

$$\frac{1}{Z} \sum_{\mathcal{T}_t \in \mathcal{F}_{\bar{e}}} c_t u_t + \frac{1}{Z} \sum_{\mathcal{T}_t \in \mathcal{F}_{e}} c_t u_t = 1$$

This implies that $M_{p_1}(f(w_{kj}))$ is in fact a convex combination of $M_{p_0}(f(w_{kj}))$ and $M_{p_0'}(f(w_{kj}))$. In other words, since both $M_{p_0}(f(w_{kj}))$ and $M_{p_0'}(f(w_{kj}))$ can be computed in closed form for each edge $(k, j)$, in order to compute (4) we only need to be able to compute the two coefficients efficiently. Recall that for each induced tree $\mathcal{T}_t$, we have the closed form expression of $u_t$ from (3). The term $\frac{1}{Z} \sum_{t} c_t u_t$ can thus be expressed as:

$$\frac{1}{Z} \sum_{t=1}^{\tau_{\mathcal{S}}} c_t u_t = \frac{1}{Z} \sum_{t=1}^{\tau_{\mathcal{S}}} \prod_{(k,j) \in \mathcal{T}_t} \frac{\alpha_{kj}}{\sum_{j'} \alpha_{kj'}} \prod_{i=1}^{n} p_t(x_i) \qquad (9)$$
The key observation that allows us to find the linear time reduction lies in the fact that (9) shares exactly the same functional form as the network polynomial, with the only difference being the specification of edge weights in the network. The following lemma formalizes our argument.
Lemma 1.
The normalizer $Z = \sum_{t=1}^{\tau_{\mathcal{S}}} c_t u_t$ can be computed in $O(|\mathcal{S}|)$ time and space in a bottom-up evaluation of $\mathcal{S}$.
Proof.
Compare the form of (9) to the network polynomial:

$$V_{\text{root}}(x; w) = \sum_{t=1}^{\tau_{\mathcal{S}}} \prod_{(k,j) \in \mathcal{T}_t} w_{kj} \prod_{i=1}^{n} p_t(x_i) \qquad (10)$$

Clearly (9) and (10) share the same functional form; the only difference is that the edge weight used in (9) is given by $\alpha_{kj} / \sum_{j'} \alpha_{kj'}$ while the edge weight used in (10) is given by $w_{kj}$, both of which are constrained to be positive and locally normalized. This means that in order to compute the value of (9), we can replace each edge weight $w_{kj}$ with $\alpha_{kj} / \sum_{j'} \alpha_{kj'}$, and a bottom-up evaluation pass of $\mathcal{S}$ will give us the desired result at the root of the network. The linear time and space complexity then follows from the linear time and space inference complexity of SPNs. ∎
In other words, we reduce the original moment computation problem for edge $(k, j)$ to a joint inference problem in $\mathcal{S}$ with a set of weights determined by the prior hyperparameters $\alpha$.
3.4 Efficient Polynomial Evaluation by Differentiation
To evaluate (8), we also need to compute $\frac{1}{Z} \sum_{\mathcal{T}_t \in \mathcal{F}_{e}} c_t u_t$ efficiently, where the sum is over the subset of induced trees that contain the edge $(k, j)$. Again, due to the exponential lower bound on the number of unique induced trees, a brute force computation is infeasible in the worst case. The key observation is that we can use the differential trick to solve this problem: $V_{\text{root}}(x; w)$ is a multilinear function of $w$, and it has a tractable circuit representation since it shares the same network structure as $\mathcal{S}$.
Lemma 2.
$\sum_{\mathcal{T}_t \in \mathcal{F}_{e}} \prod_{(k',j') \in \mathcal{T}_t} w_{k'j'} \prod_{i=1}^{n} p_t(x_i) = w_{kj}\, \partial V_{\text{root}}(x; w) / \partial w_{kj}$, and it can be computed in $O(|\mathcal{S}|)$ time and space in a top-down differentiation of $\mathcal{S}$.
Proof.
Define $f(w) \triangleq V_{\text{root}}(x; w)$. Then

$$w_{kj} \frac{\partial f(w)}{\partial w_{kj}} = w_{kj} \frac{\partial}{\partial w_{kj}} \left( \sum_{\mathcal{T}_t \in \mathcal{F}_{e}} c_t \prod_{(k',j') \in \mathcal{T}_t} w_{k'j'} + \sum_{\mathcal{T}_t \in \mathcal{F}_{\bar{e}}} c_t \prod_{(k',j') \in \mathcal{T}_t} w_{k'j'} \right) = w_{kj} \sum_{\mathcal{T}_t \in \mathcal{F}_{e}} c_t \prod_{(k',j') \in \mathcal{T}_t \setminus \{(k,j)\}} w_{k'j'} = \sum_{\mathcal{T}_t \in \mathcal{F}_{e}} c_t \prod_{(k',j') \in \mathcal{T}_t} w_{k'j'}$$

where the second equality is by Corollary 1: the network polynomial is a multilinear function of $w$, so each monomial in $\mathcal{F}_{e}$ is linear in $w_{kj}$, while the monomials in $\mathcal{F}_{\bar{e}}$ do not contain $w_{kj}$ and vanish under differentiation. The last equality follows by simple algebraic transformations. In summary, the lemma holds because the differential operator applied to a multilinear function acts as a selector for all the monomials containing a specific variable. To show the linear time and space complexity, recall that the differentiation w.r.t. $w_{kj}$ can be efficiently computed by back-propagation in a top-down pass of $\mathcal{S}$ once we have computed the node values $V_k(x; w)$ in a bottom-up pass of $\mathcal{S}$. ∎
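The selector property in Lemma 2 — that $w\,\partial f/\partial w$ of a multilinear $f$ recovers exactly the monomials containing $w$ — can be sanity-checked numerically. A small illustrative sketch (the polynomial and its constants are invented for the example; central finite differences happen to be exact here because $f$ is linear in each variable, whereas the paper computes the derivative exactly by back-propagation):

```python
# f(w) = w1*w3*a + w1*w4*b + w2*w3*c: a toy multilinear "network polynomial"
# with three monomials. w1 * df/dw1 should recover exactly the sum of the
# monomials that contain w1, i.e. w1*w3*a + w1*w4*b.
def f(w1, w2, w3, w4, a=0.2, b=0.5, c=0.3):
    return w1 * w3 * a + w1 * w4 * b + w2 * w3 * c

w = dict(w1=0.6, w2=0.4, w3=0.7, w4=0.3)
eps = 1e-6
df_dw1 = (f(w["w1"] + eps, w["w2"], w["w3"], w["w4"])
          - f(w["w1"] - eps, w["w2"], w["w3"], w["w4"])) / (2 * eps)

selected = w["w1"] * df_dw1                                # selector output
expected = w["w1"] * w["w3"] * 0.2 + w["w1"] * w["w4"] * 0.5  # monomials with w1
print(abs(selected - expected) < 1e-8)  # True
```

The monomials without $w_1$ contribute nothing to the derivative, which is precisely why the trick isolates the trees through a given edge without ever enumerating them.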
Remark. The fact that we can compute the differentiation w.r.t. $w_{kj}$ using the original circuit without expanding it underlies many recent advances in the algorithmic design of SPNs. Zhao et al. (2016b, a) used the above differential trick to design a linear time collapsed variational algorithm and the concave-convex procedure (CCCP) for parameter estimation in SPNs. A different but related approach, where the differential operator is taken w.r.t. the input indicators rather than the model parameters, is applied in computing the marginal probability in Bayesian networks and junction trees (Darwiche, 2003; Park and Darwiche, 2004). We finish this discussion by concluding that when the polynomial computed by the network is a multilinear function of the model parameters or input indicators (as is the case in SPNs), the differential operator w.r.t. a variable can be used as an efficient way to compute the sum of the subset of monomials that contain that variable.

3.5 Dynamic Programming: from Quadratic to Linear
Define $D_k(x; w) \triangleq \partial V_{\text{root}}(x; w) / \partial V_k(x; w)$, with $D_{\text{root}}(x; w) = 1$. Then the differentiation term in Lemma 2 can be computed via back-propagation in a top-down pass of the network as follows:

$$D_k(x; w) = \sum_{k' \in Pa(k)} \begin{cases} w_{k'k}\, D_{k'}(x; w) & \text{if } k' \text{ is a sum node} \\ D_{k'}(x; w) \prod_{j \in Ch(k') \setminus \{k\}} V_j(x; w) & \text{if } k' \text{ is a product node} \end{cases} \qquad (11)$$

where $Pa(k)$ denotes the parents of node $k$. Let $\bar{w}_{kj} \triangleq \alpha_{kj} / \sum_{j'} \alpha_{kj'}$ denote the prior mean weights and define $\lambda_{kj} \triangleq \bar{w}_{kj}\, V_j(x; \bar{w})\, D_k(x; \bar{w}) / V_{\text{root}}(x; \bar{w})$. Then the final formula for computing the moment of the edge weight $w_{kj}$ under the one-step update posterior is given by

$$M_{p_1}(w_{kj}) = (1 - \lambda_{kj})\, M_{p_0}(w_{kj}) + \lambda_{kj}\, M_{p_0'}(w_{kj}) \qquad (12)$$
Corollary 2.
For each edge $(k, j)$, (8) can be computed in $O(|\mathcal{S}|)$ time and space.
The corollary follows from Lemma 1 and Lemma 2, together with the fact that the moments under the Dirichlet prior have closed form solutions. By definition, we also have $\lambda_{kj} = \frac{1}{Z} \sum_{\mathcal{T}_t \in \mathcal{F}_{e}} c_t u_t$, hence $0 \le \lambda_{kj} \le 1$. This shows that $\lambda_{kj}$ computes the (weighted) proportion of the induced trees that contain the edge $(k, j)$ relative to the whole network polynomial. Roughly speaking, $\lambda_{kj}$ measures how important the contribution of a specific edge is to the whole network polynomial. As a result, we can interpret (12) as follows: the more important the edge is, the more of the moment comes from the new observation. We visualize our moment computation method for a single edge in Fig. 1.
Remark. CCCP for SPNs was originally derived using a sequential convex relaxation technique, where in each iteration a concave surrogate function is constructed and optimized. The key update in each iteration of CCCP (Zhao et al. (2016b), (7)) takes the form $w_{kj} \leftarrow w_{kj}\, V_j(x; w)\, D_k(x; w) / V_{\text{root}}(x; w)$ (up to normalization), where the R.H.S. is exactly the same quantity as $\lambda_{kj}$ defined above. From this perspective, CCCP can also be understood as implicitly applying the differential trick to compute $\lambda_{kj}$, i.e., the relative importance of edge $(k, j)$, and then taking updates according to this importance measure.
In order to compute the moments of all the edge weights $w_{kj}$, a naive computation would scale as $O(|\mathcal{S}|^2)$, because there are $O(|\mathcal{S}|)$ edges in the graph and, by Cor. 2, each such computation takes $O(|\mathcal{S}|)$ time. The key observation that allows us to further reduce the complexity to linear comes from the structure of $\lambda_{kj}$: it only depends on three terms, i.e., the forward evaluation value $V_j(x; w)$, the backward differentiation value $D_k(x; w)$, and the original weight of the edge, $w_{kj}$. This implies that we can use dynamic programming to cache both $V_k(x; w)$ and $D_k(x; w)$ at every node in a bottom-up evaluation pass and a top-down differentiation pass, respectively. At a high level, we trade off a constant factor in space complexity (using two additional copies of the network) to reduce the quadratic time complexity to linear.
Theorem 2.
For all edges $(k, j)$, (8) can be computed in $O(|\mathcal{S}|)$ total time and space.
Proof.
During the bottom-up evaluation pass, in order to compute the value $V_{\text{root}}(x; w)$ at the root of $\mathcal{S}$, we also obtain the values $V_k(x; w)$ at each node $k$ in the graph. Instead of discarding these intermediate values, we cache them by allocating additional space at each node. Hence, after one bottom-up evaluation pass of the network, we have $V_k(x; w)$ for every node $k$, at the cost of one additional copy of the network. Similarly, during the top-down differentiation pass of the network, because of the chain rule, we also obtain all the intermediate derivatives $D_k(x; w)$ at each node $k$; again, we cache them. Once we have both $V_j(x; w)$ and $D_k(x; w)$ for each edge $(k, j)$, from (12) we can get the moments of all the weighted edges in $\mathcal{S}$ simultaneously. Because the whole process only requires one bottom-up evaluation pass and one top-down differentiation pass of $\mathcal{S}$, the time complexity is $O(|\mathcal{S}|)$. Since we use two additional copies of $\mathcal{S}$, the space complexity is also $O(|\mathcal{S}|)$. ∎

We summarize the linear time algorithm for moment computation in Alg. 1.
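The two-pass structure of Thm. 2 can be sketched as follows. This is our own illustrative implementation over a toy dictionary-based node representation (assumed, not the paper's Alg. 1): one bottom-up pass caches the values $V_k$, one top-down pass over a topological order accumulates the derivatives $D_k$ as in (11), and every sum edge's importance $\lambda_{kj} = w_{kj} V_j D_k / V_{\text{root}}$ is read off along the way.

```python
def edge_fractions(root):
    """Compute lambda_kj = w_kj * V_j * D_k / V_root for every sum edge
    in two linear passes: O(|S|) time, O(|S|) extra space."""
    # --- bottom-up pass: cache V_k at every node ---
    value = {}
    def forward(n):
        key = id(n)
        if key in value:
            return value[key]
        if n["type"] == "leaf":
            v = n["value"]
        elif n["type"] == "product":
            v = 1.0
            for c in n["children"]:
                v *= forward(c)
        else:  # sum node
            v = sum(w * forward(c)
                    for w, c in zip(n["weights"], n["children"]))
        value[key] = v
        return v
    v_root = forward(root)

    # --- topological order: reverse DFS post-order puts parents first ---
    order, seen = [], set()
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for c in n.get("children", []):
            visit(c)
        order.append(n)
    visit(root)
    order.reverse()

    # --- top-down pass: accumulate D_k = dV_root/dV_k per Eq. (11) ---
    D = {id(n): 0.0 for n in order}
    D[id(root)] = 1.0
    lam = {}
    for n in order:
        if n["type"] == "leaf":
            continue
        dk = D[id(n)]
        if n["type"] == "sum":
            for i, (w, c) in enumerate(zip(n["weights"], n["children"])):
                D[id(c)] += w * dk
                lam[(id(n), i)] = w * value[id(c)] * dk / v_root
        else:  # product node; assumes nonzero child values for brevity
            for c in n["children"]:
                D[id(c)] += dk * value[id(n)] / value[id(c)]
    return lam

# Toy SPN: a sum over two products of leaves (leaf "value" = p(X_i = x_i)).
x1a = {"type": "leaf", "value": 0.5}; x2a = {"type": "leaf", "value": 0.4}
x1b = {"type": "leaf", "value": 0.6}; x2b = {"type": "leaf", "value": 0.2}
p1 = {"type": "product", "children": [x1a, x2a]}
p2 = {"type": "product", "children": [x1b, x2b]}
root = {"type": "sum", "weights": [0.3, 0.7], "children": [p1, p2]}
lam = edge_fractions(root)
print(lam[(id(root), 0)], lam[(id(root), 1)])  # the two fractions sum to 1
```

The fractions across the children of the root sum to 1 here because every induced tree passes through the root; in a deeper DAG they measure the share of trees using each edge, as in the discussion of (12).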
4 Applications in Online Moment Matching
In this section we use Alg. 1 as a subroutine to develop a new Bayesian online learning algorithm for SPNs based on assumed density filtering (Sorenson and Stubberud, 1968). To do so, we find an approximate distribution by minimizing the KL divergence between the one-step update posterior and the approximate distribution. Let $\mathcal{F} \triangleq \{q : q(w) = \prod_{k=1}^{m} \text{Dir}(w_k; \beta_k)\}$, i.e., $\mathcal{F}$ is the space of products of Dirichlet densities that are decomposable over all the sum nodes in $\mathcal{S}$. Note that since each $q \in \mathcal{F}$ is fully decomposable, the minimization decomposes over the sum nodes. One natural choice is to find an approximate distribution $\hat{p} \in \mathcal{F}$ that minimizes the KL divergence between the one-step update posterior $p_1$ and $\hat{p}$, i.e.,

$$\hat{p} = \arg\min_{q \in \mathcal{F}} \text{KL}(p_1 \,\|\, q)$$
It is not hard to show that when $q$ is an exponential family distribution, which is the case in our setting, the minimization problem corresponds to solving the following moment matching equation:

$$\mathbb{E}_{p_1}[T(w_k)] = \mathbb{E}_{q}[T(w_k)] \qquad (13)$$

where $T(w_k)$ is the vector of sufficient statistics of $q(w_k)$. When $q(w_k)$ is a Dirichlet, we have $T(w_k) = \log w_k$, where the $\log$ is understood to be taken elementwise. This principle of finding an approximate distribution is also known as reverse information projection in the information theory literature (Csiszár and Matus, 2003). As a comparison, information projection corresponds to minimizing $\text{KL}(q \,\|\, p_1)$ within the same family of distributions $\mathcal{F}$. By utilizing our efficient linear time algorithm for exact moment computation, we propose a Bayesian online learning algorithm for SPNs based on the above moment matching principle, called assumed density filtering (ADF). The pseudocode is shown in Alg. 2.
In the ADF algorithm, for each edge $(k, j)$ the above moment matching equation amounts to solving the following equation:

$$\psi(\beta_{kj}) - \psi\Big(\sum_{j'} \beta_{kj'}\Big) = \mathbb{E}_{p_1}[\log w_{kj}]$$

where $\psi(\cdot)$ is the digamma function. This is a system of nonlinear equations in $\beta$ whose R.H.S. can be computed using Alg. 1 in $O(|\mathcal{S}|)$ time for all the edges $(k, j)$. To solve it efficiently, we take $\exp(\cdot)$ on both sides of the equation and approximate the L.H.S. using the fact that $\exp(\psi(a)) \approx a - 1/2$ for $a > 1$. Expanding the R.H.S. of the above equation using the identity from (12), we have:

$$\frac{\beta_{kj} - 1/2}{\sum_{j'} \beta_{kj'} - 1/2} \approx M_{p_0}(w_{kj})^{1 - \lambda_{kj}}\, M_{p_0'}(w_{kj})^{\lambda_{kj}} \qquad (14)$$

Note that $M_{p_0}(w_{kj})$ is the mean of the prior Dirichlet $p_0$ and $M_{p_0'}(w_{kj})$ is the mean of $p_0'$, where $p_0'$ is the posterior obtained by adding one pseudo-count to $\alpha_{kj}$. So (14) essentially finds a posterior with hyperparameters $\beta$ such that the posterior mean is approximately the weighted geometric mean of the means given by $p_0$ and $p_0'$, weighted by $\lambda_{kj}$.

Instead of matching the moments given by the sufficient statistics, also known as the natural moments, BMM tries to find an approximate distribution by matching the first order moments, i.e., the means of the prior and the one-step update posterior. Using the same notation, we want to match the following equation:

$$\frac{\beta_{kj}}{\sum_{j'} \beta_{kj'}} = (1 - \lambda_{kj})\, M_{p_0}(w_{kj}) + \lambda_{kj}\, M_{p_0'}(w_{kj}) \qquad (15)$$

Again, we can interpret the above equation as finding the posterior hyperparameters $\beta$ such that the posterior mean is given by the weighted arithmetic mean of the means given by $p_0$ and $p_0'$, weighted by $\lambda_{kj}$. Notice that due to the normalization constraint, we cannot solve for $\beta$ directly from the above equations; to solve for $\beta$, one more equation would need to be added to the system. However, from line 1 of Alg. 1, what we need in the next iteration of the algorithm is not $\beta$ itself, but only its normalized version. So we can get rid of the additional equation and use (15) as the update formula directly in our algorithm.
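The contrast between the two update rules can be made concrete for a single sum node. A hedged sketch (function names are ours; both updates consume the prior hyperparameters $\alpha_k$ of one sum node and the edge importances $\lambda_{kj}$ produced by Alg. 1, and both return target means rather than hyperparameters, since by the discussion above only the normalized vector is needed):

```python
def one_step_means(alpha, j):
    """Prior mean and one-pseudo-count posterior mean of w_kj (Dirichlet)."""
    s = sum(alpha)
    m0 = alpha[j] / s            # mean under p_0
    m1 = (alpha[j] + 1) / (s + 1)  # mean under p_0' (one pseudo-count added)
    return m0, m1

def bmm_update(alpha, lam):
    """BMM, Eq. (15): weighted *arithmetic* mean of the two means."""
    out = []
    for j, lj in enumerate(lam):
        m0, m1 = one_step_means(alpha, j)
        out.append((1 - lj) * m0 + lj * m1)
    return out

def adf_update(alpha, lam):
    """ADF, Eq. (14): weighted *geometric* mean of the same two means
    (after inverting the digamma via exp(psi(a)) ~ a - 1/2)."""
    out = []
    for j, lj in enumerate(lam):
        m0, m1 = one_step_means(alpha, j)
        out.append(m0 ** (1 - lj) * m1 ** lj)
    return out

alpha = [2.0, 3.0]   # prior Dirichlet hyperparameters of one sum node
lam = [0.4, 0.6]     # edge importances lambda_kj from the moment pass
print(bmm_update(alpha, lam))
print(adf_update(alpha, lam))
```

By the AM-GM inequality, each ADF (geometric) target is never larger than the corresponding BMM (arithmetic) target before renormalization; the two rules coincide when $\lambda_{kj}$ is 0 or 1.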
Using Alg. 1 as a subroutine, both ADF and BMM enjoy linear running time, sharing the same order of time complexity as CCCP. However, since CCCP directly optimizes the data log-likelihood, in practice we observe that CCCP often outperforms ADF and BMM in log-likelihood scores.
5 Conclusion
We propose an optimal linear time algorithm to efficiently compute the moments of the model parameters in SPNs under online settings. The key techniques used in the design of our algorithm include the linear time reduction from moment computation to joint inference, the differential trick that efficiently evaluates a multilinear function, and the dynamic programming that further reduces redundant computations. Using the proposed algorithm as a subroutine, we improve the time complexity of BMM from quadratic to linear on general SPNs with DAG structures. We also use the proposed algorithm as a subroutine to design a new online algorithm, ADF. As a future direction, we hope to apply the proposed moment computation algorithm in the design of efficient structure learning algorithms for SPNs. We also expect that the analysis techniques we developed may find other uses in learning SPNs.
Acknowledgements
HZ thanks Pascal Poupart for providing insightful comments. HZ and GG are supported in part by ONR award N000141512365.
References

Boutilier et al. (1996)
C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller.
Contextspecific independence in Bayesian networks.
In
Proceedings of the Twelfth international conference on Uncertainty in artificial intelligence
, pages 115–123. Morgan Kaufmann Publishers Inc., 1996.  Csiszár and Matus (2003) I. Csiszár and F. Matus. Information projections revisited. IEEE Transactions on Information Theory, 49(6):1474–1490, 2003.
Darwiche (2003) A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM (JACM), 50(3):280–305, 2003.
Dennis and Ventura (2015) A. Dennis and D. Ventura. Greedy structure search for sum-product networks. In International Joint Conference on Artificial Intelligence, volume 24, 2015.
Gens and Domingos (2012) R. Gens and P. Domingos. Discriminative learning of sum-product networks. In Advances in Neural Information Processing Systems, pages 3248–3256, 2012.
Gens and Domingos (2013) R. Gens and P. Domingos. Learning the structure of sum-product networks. In Proceedings of The 30th International Conference on Machine Learning, pages 873–880, 2013.
Jaini et al. (2016) P. Jaini, A. Rashwan, H. Zhao, Y. Liu, E. Banijamali, Z. Chen, and P. Poupart. Online algorithms for sum-product networks with continuous variables. In Proceedings of the Eighth International Conference on Probabilistic Graphical Models, pages 228–239, 2016.
Park and Darwiche (2004) J. D. Park and A. Darwiche. A differential semantics for jointree algorithms. Artificial Intelligence, 156(2):197–216, 2004.
Peharz et al. (2015) R. Peharz, S. Tschiatschek, F. Pernkopf, and P. Domingos. On theoretical properties of sum-product networks. In AISTATS, 2015.
Peharz et al. (2017) R. Peharz, R. Gens, F. Pernkopf, and P. Domingos. On the latent variable interpretation in sum-product networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(10):2030–2044, 2017.
Poon and Domingos (2011) H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Proc. 12th Conf. on Uncertainty in Artificial Intelligence, pages 2551–2558, 2011.
Rashwan et al. (2016) A. Rashwan, H. Zhao, and P. Poupart. Online and distributed Bayesian moment matching for parameter learning in sum-product networks. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 1469–1477, 2016.
Rooshenas and Lowd (2014) A. Rooshenas and D. Lowd. Learning sum-product networks with direct and indirect variable interactions. In ICML, 2014.
Sorenson and Stubberud (1968) H. W. Sorenson and A. R. Stubberud. Nonlinear filtering by approximation of the a posteriori density. International Journal of Control, 8(1):33–51, 1968.
Valiant (1979) L. G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8(2):189–201, 1979.
Zhao et al. (2015) H. Zhao, M. Melibari, and P. Poupart. On the relationship between sum-product networks and Bayesian networks. In ICML, 2015.
Zhao et al. (2016a) H. Zhao, T. Adel, G. Gordon, and B. Amos. Collapsed variational inference for sum-product networks. In ICML, 2016a.
Zhao et al. (2016b) H. Zhao, P. Poupart, and G. Gordon. A unified approach for learning the parameters of sum-product networks. In NIPS, 2016b.