Hybrid Tutorial #-2: Centroid Decomposition

galencolin · August 19, 2020, 4:06am

See the original post on Codeforces for a playlist of all similar tutorials and also more information about what this is.

“Intro”

Timestamp: 00:00

This is a “hybrid tutorial” - it has both a blog and a video component. See my original post for more explanation of what this is and also a playlist of all past tutorials. You can view the full video here.

This tutorial is on centroid decomposition, a cool concept that has many uses in different problems. It’s also another “decomposition” technique, one of many. I feel like this subject in particular doesn’t have many great tutorials online that really capture the essence of what we do with the technique, so hopefully, I can improve that with this one.

Notation

We’ll use 0-indexing for the whole tutorial, and the example problem will also be 0-indexed. In addition, “vertex” and “node” mean the same thing and may be used interchangeably.

Example problem

Timestamp: 02:04

If you’ve read other centroid decomposition tutorials, you’ve probably seen a problem called Xenia and Tree. While it’s rather well-known now, it’s a great example to use for an introductory tutorial because most of the implementation is simple.

I’ll restate the problem here as well:

You have a tree of n vertices, where each vertex is initially blue except for vertex 0, which is red. There are two types of queries (q in total):

Set the color of a vertex v to red
Given a vertex v, compute the distance to the closest red vertex to v (which could be itself)

There are other ways to solve this such as sqrt decomposition, but we’ll do it with centroid decomposition.

Prerequisite

Timestamp: 05:16

Trees + traversal (dfs)

Not many this time. It’s a rather lightweight topic overall, yet possibly difficult to understand.

Reformulating the problem

Timestamp: 06:06

First, let’s consider a natural yet inefficient solution, which we will later optimize with centroid decomposition. Root the tree arbitrarily, let’s say, at vertex 0. Then let’s maintain, for each vertex v, the closest red node to it that’s within its subtree, and call it best[v].

Now there’s a caveat here - this solution requires efficiently finding the distance between two vertices. I’ll use the notation dist(u, v) to denote the distance between vertex u and v. We can compute this in O(1) or O(\log{n}) with LCA, but this is not necessary to know for this tutorial. Just assume this function exists and can be computed efficiently. We’ll assume for the rest of the tutorial that it takes O(\log{n}) because the optimization to O(1) is unnecessary and harder to implement in my view.

Reformulating updates

Timestamp: 08:14

Back to the problem now. When we paint a node v red, to maintain the information we want, we have to update the distances to all of v's ancestors. This, in implementation, is simply setting best[u] := max(best[u], dist(u, v)) for all ancestors u of v (including v itself). How long does this take per query? At worst, v has O(n) ancestors, so it can take O(n\log{n}) taking the complexity of dist into account.

Reformulating queries

Timestamp: 15:29

But let’s ignore the complexity, we can optimize it later with the actual centroid decomposition. Let’s handle the queries now. To compute the closest red node to v, there are two cases:

The closest node is in v's subtree. In this case, we use the value best[v] that we already have stored.
The closest node is not in v's subtree. Then it’s in the subtree of some ancestor u of v but not in v's subtree (and of course the distance is best[u] since we want the closest red node in u's subtree). Therefore the path from the closest node to v goes through u, so we can represent its length as best[u] + dist(u, v). What’s most important here is that the path goes through u.

Of course, when computing a query, we don’t know which case is optimal, so we have to try both of them and all possible u. Just like updates, this can also take O(n\log{n}) in the worse case. You may note that we don’t have to recompute dist(u, v) each time as it just increases by 1, but this optimization won’t work later so we’ll ignore it.

Timestamp: 20:41

Now, you may say, what the hell even is this tutorial? I’ve given you an O(n^2\log{n}) solution when there’s a stupid O(n^2) brute-force that will work even faster! And you would be correct to think this. But we can improve it by noticing that a bottleneck is the height of the tree, that is, some vertices can have up to O(n) ancestors. As it turns out, we can actually construct a different, related tree of height O(\log{n}) and solve the problem in the same way with the new tree. But first, we need to talk about ~~parallel universes~~ centroids.

Centroids

Timestamp: 20:57

A centroid is defined as a vertex with the following property: when removed, all of the resulting subtrees have a size of at most half that of the original tree (that is, \lfloor \frac{n}{2} \rfloor). There is always at least one in any tree (there may even be multiple - consider a bamboo of length 4), which we’ll prove right now.

Proof

Let’s root the tree at an arbitrary vertex r. If r is a centroid, we’re done, otherwise, one subtree has a size of more than \lfloor \frac{n}{2} \rfloor. Let’s then go into that subtree’s root, and call it s. Now let’s treat s the same way: either it’s a centroid, or we move into the subtree that has a size greater than \lfloor \frac{n}{2} \rfloor.

With this process, the size of the subtree that’s “above” s - the subtree containing the parent of s - will never be greater than \lfloor \frac{n}{2} \rfloor. If this was true, then it would be impossible that the subtree we just moved into also has size greater than \lfloor \frac{n}{2} \rfloor. Both subtrees must have size at least \lfloor \frac{n}{2} \rfloor + 1 in this case, and 2 \cdot (\lfloor \frac{n}{2} \rfloor + 1) > n.

Because of that, in this process, the size of the subtrees of our current s will always decrease until they’re small enough that s is a centroid (since we move down the tree, the size decreases by at least 1 each iteration), and the size of the subtree above s will also always be small enough. Therefore this process will terminate and we’ll eventually find a centroid. This takes O(n) in the worst case. Conveniently, this is also the implementation of finding a centroid.

The “centroid tree”

Timestamp: 30:09

Now, why are centroids useful? Let’s define a recursive process where we create a new tree - a “centroid tree” - from our input tree. Say we root the tree at a centroid of the tree. Then all the remaining subtrees have size at most half of the original tree. We’ll run the recursive process to create new trees for each of those subtrees, then attach them as children to our root. See the video for a full illustration of how this works.

This new tree has two important properties:

Timestamp: 35:42

The height is at most O(\log{n}). This is because, at each step, the size of every subtree created is at most half of the original tree. So we ask “how many times can we divide n by 2 before getting \leq 1”, which is the definition of \log_2{n}. This also means the process of constructing the tree is O(n\log{n}). Common sense says this is true, because there are at worst O(\log{n}) depths and each depth takes at worst O(n) to compute. But if you want formality, you can also use the master theorem with the worst-case recurrence T(n) = 2 * T(\frac{n}{2}) + O(n).
Timestamp: 37:11

Each subtree in the centroid tree forms a connected component in the original tree. This is because of our process of construction: for every vertex, when it’s added to the centroid tree, it’s part of some connected subtree that hasn’t been removed yet. Once we remove that vertex, every vertex in that subtree will be added as a child of that vertex, and thus all of its children in the centroid tree will be part of that connected subtree.

Both of these will be necessary to solve the original problem.

Returning to the problem

Timestamp: 39:52

Why is it important that subtrees of the centroid tree are contiguous in the original tree? It means that in the original tree, for a vertex v, every other vertex not part of v's subtree (in the centroid tree) is either an ancestor of v in the centroid tree or “blocked off” from v by at least one such ancestor.

Think about that claim for a second. It also comes from the way we construct the centroid tree - the subtree of v is always “surrounded” by its centroid-tree ancestors, as the removal of those ancestors “cuts” the subtree of v into what it is. It’s impossible for a vertex adjacent to a vertex from v's subtree to not be an ancestor of v in the centroid tree, because if that was true, that vertex would just be in v's subtree because we construct the centroid tree that way.

Now let’s modify best[v] to be the distance to the closest red vertex in v's subtree in the centroid tree, rather than in the original tree. With that, we end up with the exact same two cases as we had before:

The shortest red vertex to v is in v's subtree in the centroid tree. In that case, we just take best[v].
The shortest red vertex to v is “blocked off” by some centroid-tree ancestor u of v, and therefore it passes through u. Then the distance is best[u] + dist(u, v), and we take the minimum of this over all u.

And updates are also the same, we just set best[u] := min(best[u], dist(u, v)) for all centroid-tree ancestors u of v (including v).

But now, there’s a crucial difference: the height of the centroid tree is always O(\log{n}). That means each update and each query only affects O(\log{n}) vertices, and also only takes O(\log^2{n}) time to compute if dist is O(\log{n}). That means we’ve done it - that’s our solution, with a complexity of O(n\log{n} + q\log^2{n}), the first part for constructing the tree and the second for answering queries. Again, you could improve this to O((n + q)\log{n}) if you want with faster LCA.

Implementation

Timestamp: 53:22

Despite the difficulty of the concept, the implementation is rather simple. I’ll break it into sections.

Creating the tree

We first find the centroid c of the input tree. Then we mark it as the root, recursively solve for the subtrees created by removing c (in the implementation, marking vis[c] as true is equivalent to removing c), and set the parents of the resultant subtrees to c. This goes on recursively until we create the whole tree, like a modified DFS that traverses centroids.

Code

void init_centroid(int v = 0, int p = -1) {
  find_size(v);

  int c = find_centroid(v, -1, sz[v]);
  vis[c] = true;
  par[c] = p;

  for (int x: edges[c]) {
    if (!vis[x]) {
      init_centroid(x, c);
    }
  }
}

Finding centroids

We’ve basically gone over this in the proof that a centroid always exists, but I’ll re-hash it. There are two parts: finding subtree sizes and using them to find a centroid. In both functions, we avoid visited nodes to stay within our subtree, as those are centroids that have already been solved for.

Code (finding subtree sizes)

int find_size(int v, int p = -1) {
  if (vis[v]) return 0;
  sz[v] = 1;

  for (int x: edges[v]) {
    if (x != p) {
      sz[v] += find_size(x, v);
    }
  }

  return sz[v];
}

To find a centroid (this is the same as the algorithm mentioned in the proof), let’s root the tree at an arbitrary vertex r. If r is a centroid, we’re done, otherwise, one subtree has a size of more than \lfloor \frac{n}{2} \rfloor. Let’s then go into that subtree’s root, and call it s. Now let’s treat s the same way: either it’s a centroid, or we move into the subtree that has a size greater than \lfloor \frac{n}{2} \rfloor. As we proved above, this will always terminate.

Code (finding a centroid)

In this implementation, n represents the size of the subtree we’re solving for, not the size of the input. Note that n / 2 implies floor division.

int find_centroid(int v, int p, int n) {
  for (int x: edges[v]) {
    if (x != p) {
      if (!vis[x] && sz[x] > n / 2) {
        return find_centroid(x, v, n);
      }
    }
  }

  return v;
}

Problem-specific implementation

There are a few parts specific to this problem: the dist function, applying updates, and applying queries. Since centroid decomposition is just a technique rather than an algorithm, this will be different for each problem. I’ll briefly touch all of them.

Computing dist

Timestamp: 56:42

We’ll use LCA to do this. If you’ve seen my HLD tutorial, you’ll know that any path u → v on a tree can be broken into two parts: the path from u to lca(u, v) and the path from lca(u, v) to v. So now we want dist(u, lca(u, v)) + dist(lca(u, v), v). This is simpler because the LCA is always an ancestor of u and v, so the total distance can be computed using depths as (depth[u] - depth[lca(u, v)]) + (depth[v] - depth[lca(u, v)]). This is simplifiable to depth[u] + depth[v] - 2 \cdot depth[lca(u, v)].

If you don’t know LCA but still want to try the problem, you can copy my LCA template in the full solution below.

(Partial) code

You can decide on how to implement LCA for yourself. I won’t teach it here, since this tutorial is long enough as it is. The LCA and depths can be done with an arbitrary root.

int dist(int u, int v) {
  int lc = lca(u, v); // implement this
  return depth[u] + depth[v] - 2 * depth[lc];
}

Applying updates

This is exactly how it was described: just set best[u] to max(best[u], dist(u, v)) for all centroid-tree ancestors of v.

Code

void update(int v) {
  best[v] = 0;
  int u = v;
  while (centroid_tree.par[u] != -1) { // this is pseudocode, replace this with however you make the parents
    u = centroid_tree.par[u];
    best[u] = min(best[u], dist(u, v));
  }
}

Answering queries

This is also exactly how it was described, find the minimum of best[u] + dist(u, v) over all centroid-tree ancestors of v. Assume best[u] = \infty initially.

Code

int query(int v) {
  int ans = best[v];
  int u = v;
  while (centroid_tree.par[u] != -1) { // this is pseudocode, replace this with however you make the parents
    u = centroid_tree.par[u];
    ans = min(ans, best[u] + dist(u, v));
  }
  return ans;
}

Full implementation

You can find my full implementation here, and my solution for Xenia and Tree here. My solution also has an LCA template that you can copy to try the problem if you don’t know the concept.

Steps for using it:

Read in n, and call init(n).
Read in all edges, and call edge(u, v) for an edge between u and v.
Call init_centroid (the root v can be arbitrary, make sure p is -1).
That’s it - now the tree is represented by the parents for each vertex.

More practice

Centroid decomposition can be used in many ways. This application was a “bottom-up” approach, starting from a query/update vertex and going up the centroid tree, updating or querying the closest vertex in each subtree. In this case, the only special part about the centroid tree is that its height is O(\log{n}).

Other problems can require centroid decomposition in a “top-down” manner - that is, divide-and-conquer. For example, you could solve for all paths going through the centroid, then recursively solve for the resultant subtrees after removing the centroid. This method is upper-bounded by O(\log{n}) \cdot f(n), where f(n) denotes the time to solve for all paths through a specific vertex since there are O(\log{n}) “levels”.

I’m not big-brain enough to create a whole mashup contest for this subject, but I can do the next best thing: yoink other people’s problems. So here’s a list of what I’ve found so far:

Factories (JOIOC '14)
LWDB (Codeforces) (Clicking the URL doesn’t work if you have coach mode on apparently, so you have to copy it in manually if you have that enabled: Problem - D - Codeforces)
All problems from this blog (Codeforces)
TESTERS (CodeChef)
BTREE (CodeChef)
GERALD2 (CodeChef)
PRIMEDST (CodeChef)
Can Bash Save The Day? (Codeforces)

Feel free to recommend more practice problems or just leave feedback in general. I will likely add more in the future.

therealnishuz · August 19, 2020, 4:26am

It works for me.

galencolin · August 19, 2020, 4:29am

Weird. I think having coach mode on somehow causes that, because it worked after I disabled it.

phibrain · November 9, 2020, 12:42am

Hey colin, add this one: Problem - 757G - Codeforces