WMIS - Editorial

kevinsogo · June 21, 2015, 10:14pm

PROBLEM LINK:

Author: Sergey Kulik
Tester: Istvan Nagy and Utkarsh Lath and Kevin Atienza
Editorialist: Kevin Atienza

PREREQUISITES:

Memoization, backtracking, tree hashing

PROBLEM:

A greedy way to compute any maximal independent set (MIS) is to select an ordering of the nodes, and adding each node into the independent set in that order, as long as it doesn’t violate independence. What is the expected size of the maximal independent set among all possible orderings of nodes?

QUICK EXPLANATION:

The first node to be picked is equally likely to be any of the nodes of the tree. We fix the first node to be removed, and remove this node, and all of its neighbors, and all edges connected to them. This leaves us with a forest with fewer nodes. Each tree in this forest can be solved independently because they don’t affect each other, and the overall expected value is simply the sum of the expected values of all trees in the forest, plus 1 (to account for the node removed initially). The answer is the average of these sums.

Now, to reduce the number of subtrees traversed, we memoize the function and compute the function only once for each distinct subtree (it can be shown that the number of such distinct subtrees is not too large). To detect whether a subtree has been computed before, we hash the tree. However, we must discuss how to construct a hashing function/algorithm that has the following properties:

If two unrooted trees are isomorphic, then they must have the same hash.
If two unrooted trees are not isomorphic, then they must have different hashes with high probability.

Hashing rooted trees is simple:

Recursively compute the hashes of all subtrees
Combine these hashes using some hash function on a sequence where the order doesn’t matter.

However, hashing unrooted trees is different. One way to do it is to root the tree at its center (because a tree can only have either one or two centers) and hash the corresponding rooted tree. In case the tree has two centers, compute the hashes when the tree is rooted in either center, and take the larger (or smaller) hash.

Hashing a rooted or unrooted tree with N nodes takes O(N) time.

EXPLANATION:

Brute-force

The answer is simply the average of all the sizes of maximal independent sets for all permutations of the N nodes. Clearly, trying out all permutations is too slow, because there are simply too many permutations (27! is around 10^{28}). We have to find a different solution.

A faster way would be to use memoization; At any point in time, we only care about the following:

The set of nodes that can still be added into the MIS (say S).
The set of nodes not in S but haven’t already been processed, say T. Actually we’ll only need the size of T, say k.

Let f(S,k) be the average size of a maximal independent set assuming S and k are defined above. Note that the answer is f(\{1,2,\ldots,N\},0). Let n(i) be the set of neighbors of i. Then we have the following:

f(S,k) = \frac{1}{k + |S|}\left(k\cdot f(S,k-1) + \sum_{i \in S} (1 + f(S - n(i) - \{i\}, k + |n(i) \cap S|))\right)

This is because we can choose k + |S| in the next turn. The k\cdot f(S,k-1) addend is for the case where we choose from T, and the other is for the case where we choose from S.

This takes O(N^3 2^N) time, but if we represent S and n(i) as bitmasks, we have a program that is 32 times faster. A sample implementation is the following:

#include <stdio.h>
#define LIM 20

double f[1<<LIM][LIM+1];
int n[LIM];
int bitcount[1<<LIM];
int main() {
    for (int i = 0; i < 1<<LIM; i++) {
        bitcount[i] = bitcount[i<<1] + (i&1);
    }
    int z;
    scanf("%d", &z);
    while (z--) {
        int N;
        scanf("%d", &N);
        for (int i = 0; i < N; i++) {
            n[i] = 0;
        }
        for (int i = 1; i < N; i++) {
            int a, b;
            scanf("%d%d", &a, &b);
            a--, b--;
            n[a] |= 1 << b;
            n[b] |= 1 << a;
        }
        for (int S = 0; S < 1 << N; S++) {
            for (int k = !S; k <= N - bitcount[S]; k++) {
                double total = 0;
                if (k) total += k * f[S][k-1];
                for (int i = 0; i < N; i++) {
                    if (S & (1<<i)) {
                        total += 1 + f[S & ~n[i] & ~(1<<i)][k + bitcount[n[i] & S]];
                    }
                }
                f[S][k] = total / (k + bitcount[S]);
            }
        }
        printf("%.11lf\n", f[(1<<N)-1][0]);
    }
}

However, what you’ll find is that if we let LIM be 27, this uses too much memory (it might not even compile!), and even if we set LIM to be 20, this is still too slow. We need to optimize this.

The first optimization we have is to notice that the k paremeter is irrelevant. This is because selecting any node from T doesn’t affect the MIS whatsoever, and that we can decide the positions of the nodes in T in the permutation afterwards (and there are equally many ways to distribute the nodes in T in the permutation in every case). Therefore, f(S,k) = f(S,0) for any k, and we can define f(S) as f(S,0). We get the following:

\begin{aligned} f(S) &= \frac{1}{|S|}\left(\sum_{i \in S} (1 + f(S - n(i) - \{i\}))\right) \\\ &= 1 + \frac{1}{|S|}\left(\sum_{i \in S} f(S - n(i) - \{i\})\right) \end{aligned}

This now runs in O(N^2 2^N) time (still with the 1/32 constant). The following is a sample implementation:

#include <stdio.h>
#define LIM 27

double f[1<<LIM];
int n[LIM];
int bitcount[1<<LIM];
int main() {
    for (int i = 0; i < 1<<LIM; i++) {
        bitcount[i] = bitcount[i>>1] + (i&1);
    }
    int z;
    scanf("%d", &z);
    while (z--) {
        int N;
        scanf("%d", &N);
        for (int i = 0; i < N; i++) {
            n[i] = 0;
        }
        for (int i = 1; i < N; i++) {
            int a, b;
            scanf("%d%d", &a, &b);
            a--, b--;
            n[a] |= 1 << b;
            n[b] |= 1 << a;
        }
        for (int S = 1; S < 1 << N; S++) {
            double total = 0;
            for (int i = 0; i < N; i++) {
                if (S & (1<<i)) {
                    total += f[S & ~n[i] & ~(1<<i)];
                }
            }
            f[S] = 1 + total / bitcount[S];
        }
        printf("%.11lf\n", f[(1>>N)-1]);
    }
}

But this is still too slow!

Independent trees

We can speed up the above algorithm with the following crucial observation: connected components (after removing the nodes in S) can be answered independently of others! This is because for all selections of permutations for each connected component, there are equally many ways to combine them into a single permutation. Therefore, we can answer each subtree independently, and the answer is simply the sum of expected sizes of the MIS for all connected components, and we can restrict our S's to be connected subtrees, rather than general subsets of nodes.

But the number of subtrees can be very large. For example, consider the star graph of N nodes, i.e. N-1 nodes attached to a single “center” node. This has at least 2^{N-1} subtrees, by removing any subset of the N-1 noncentral nodes. Therefore, we must only consider distinct subtrees of N, which is provably not very large for N \le 27 But to consider only distinct subtrees, we must have a way to check whether two subtrees are equal, or whether the current subtree has been processed before. One way to do so is via tree hashing.

Tree hashing

We wish to find a way to hash unrooted trees. A hash of an unrooted tree is simply a value associated with the tree. We will define the hash so that we satisfy the following property: If two unrooted trees are isomorphic, then they must have the same hash. The hash we will use must be “random” enough so that if two unrooted trees are not isomorphic, then they must have different hashes with high probability.

Hashing an unrooted tree

How do we define an unrooted tree hash so that two isomorphic unrooted trees have the same hash? Well, one way is to root the tree on some node and hash the rooted tree (there are some well-known ways of hashing a rooted tree). However, the hash is not necessarily the same for every choice of root, so this violates our desired property of having the same hash for isomorphic trees.

One way to fix this is to either get the sum of the hashes for all roots, or simply get the minimum (or maximum). But this hash could be slow to compute, because there are N possible roots and each one requires \Omega(N) time to compute. In this case, we can restrict the possible roots to a few well-defined set of nodes, such as the centers of a tree. Since there are at most two centers in a tree, this hashing is much faster!

All that remains is to discuss how to hash a rooted tree itself.

Hashing a rooted tree

In order to hash a rooted tree, we can recursively hash all its subtrees to get a sequence of hashes, and somehow combine them into a single hash. In order to compute the same hash for isomorphic rooted trees, we should define the hash of a sequence so that the order doesn’t matter, i.e. two sequences have the same hash if their multiset of elements are the same. To do this, we could use simple techniques of hashing a sequence such as sorting and then polynomial hashing, or modifying Zobrist hashing.

This hashing could take O(N \log N) time or O(N) time depending on what hash for sequences is used.

Time Complexity:

\displaystyle O\left(\frac{N}{32} \sum_{i=1}^{N-1} T(i) i^2\right), where T(i) is the number of distinct subtrees of size i that can be formed by a node and its neighbors (even repeatedly).

AUTHOR’S AND TESTER’S SOLUTIONS:

To be uploaded soon.

akash4983 · July 2, 2015, 2:47pm

I saw few DP solutions for this as well like this( CodeChef: Practical coding for everyone ). Can anyone explain the dp approach?

shinigami_ryuk · July 17, 2015, 6:57am

How to calculate bounds on T(i)?