WAYPA - Editorial

kevinsogo · December 14, 2015, 6:09am

PROBLEM LINK:

Author: Fedor Korobeinikov
Tester: Sergey Kulik
Editorialist: Kevin Atienza

PREREQUISITES:

String hashing, polynomial hashing, tree centroid decomposition, binary search

PROBLEM:

Given an unrooted tree with N nodes, where each edge contains a single digit. What is the length of the longest simple path in the tree such that the sequence of digits along the path is palindromic?

(The original problem asks for the maximum number of nodes in such a path, but here we’ll define “length” as the maximum number of edges in such a path. The answer is simply one plus this number.)

QUICK EXPLANATION:

Binary search on the solution (performing two binary searches, one for even answers and another for odd answers). Thus, we want to know for a given length l whether there is a palindromic path of length l.

Root the tree at its centroid (so that each child has at most half the size of the whole tree). Then, recursively check for each subtree whether there is a length l palindromic path. Finally, we want to check whether there is a palindromic path passing through the root.

Compute polynomial hashes H(x) and H_{\text{rev}}(x) of each string that begins from the root, and their reverses. Then, for each node x of depth d, let y be its $(l-d)$th ancestor. Then there is a palindromic string of length l starting from x and passing through the root if the following things hold:

H(y) = H_{\text{rev}}(y) (because the string from y to the root represents the middle 2d-l characters, so it must also be palindromic).
There exists a node z from another subtree such that the hash of the string from y to x is equal to H(z). (This can be checked quickly with the use of sets, after some preprocessing)

If these conditions hold for some node x, then we declare that there is a palindromic path of length l passing through the root.

EXPLANATION:

Binary search

As with all problems involving minimizing something, one thing that immediately comes to mind is binary search. The general idea is: Suppose \phi(x) is some statement, and we want to find the largest x such that \phi(x) is true. Let L < R be two numbers such that \phi(L) is true and \phi(R) is false. Then we can perform binary search on the range [L,R] until we find the maximum x. But this only works if \phi(x) is monotonic in the following way: \phi(x) \implies \phi(x-1) for all x.

In our case, the statement \phi(x) is: “There exists a palindromic path of length x in the tree.” A lower bound could be L = 1 (every path of length 1 is palindromic) and an upper bound R = N (the longest path in any tree is N-1). However, this statement is not monotonic. For example, The string babab contains palindromic substrings of length 1, 3 and 5 but not of length 2 and 4. This means binary search will not work.

But \phi(x) is almost monotonic; specifically it satisfies the following for all x: \phi(x) \implies \phi(x-2). This condition allows us to perform two binary searches, one for even x and odd x (because the statement becomes monotonic when only considering x s with the same parity), and the largest x such that \phi(x) is true is simply the larger result of the two!

Thus, what we need to do now is to compute \phi(x). In other words:

Given x, is there a palindromic path of length x?

Finding a palindromic path of length x

We’re trying to implement the function f(T,x), which returns true if there is a palindromic path of length x in the tree T.

Let’s consider rooting T at, say, node 1. We can consider two kinds of paths: Those that pass through the root (node 1) and those that don’t. A path that doesn’t pass through the root is contained in one subtree, so we can simply call f(T',x) recursively for all subtrees T. Thus, we now only need to consider paths that pass through the root.

A path through the root consists of two edge-disjoint paths from two nodes a and b to the root, where a and b belong to different subtrees. When is such a path palindromic of length x? Let d_a and d_b be the depths (distance from the root) of a and b. Without loss of generality, assume d_a \ge d_b. The first condition would be that d_a + d_b = x, because the total length of the path is just d_a + d_b.

Now, let c be the $d_b$th ancestor of x. The second condition would be that the path from c to the root must be palindromic, because this path represents the “middle d_a - d_b characters” of the path a \rightsquigarrow b. A final condition would be that the path from a to c must be the same string as the path from b to the root. Once all these conditions are satisfied, the path a \rightsquigarrow b is now palindromic.

We can now determine whether there is a palindromic path through the root with length x. For a node a, let d_a be its depth, and c be its $(x-d_a)$th ancestor. Then there is a palindromic path starting at a if the following conditions are satisfied:

The path from c to the root is palindromic
There is a node b belonging to a different subtree such that the path from a to c is the same string as the path from b to the root.

If any node a satisfies this condition, then we have found our palindromic path of length x through the root. Otherwise, we can see that there is no such path.

Now, how fast this run depends on how we implement the following operations:

Given a node (c), check whether the path from that node to the root is palindromic.
Given some path from a node to one of its ancestors (a \rightsquigarrow c), check whether there is a path from a node in a different subtree (b) to the root but with the same string.
Given a, find its $(x-d_a)$th ancestor.

It also depends on the shape of the tree. If the tree is highly degenerate, then we expect to have lots of heavy recursive calls, and the running time will be slow. Our algorithm works best with balanced trees. We’ll deal with this problem later.

String processing

Two of the operations we want to implement above require some sort of way to compare strings with each other. For example, consider the first operation. Palindromic string is simply a string whose reverse is equal to itself. Thus, checking whether the path to the root is palindromic is simply checking whether that path is equal to its reverse. Clearly the second operation requires string comparison too.

Comparing strings naïvely runs in \Theta(N) time, and since we need to perform the operations above for each node, the overall running is at least \Theta(N^2) which is unacceptable. Clearly we need to be able to compare strings without iterating through all their characters.

One common way to check whether two strings are “equal” is to hash them with some sort of hash function and check if their hashes are equal. A hash function is simply a function that maps things from a set of objects to a smaller set of objects. Now, since hashes belong to a smaller set of objects, inevitably there will be collisions, or two distinct objects that have the same hash. Thus, this method of comparing strings is not completely correct. However, if our hashes are “random” enough in some precise way, then it might be hard to generate two things with the same hash, and we can just assume that we won’t encounter collisions during our algorithm.

In our case, we can use hashing to compare two strings. One common way to hash strings to integers is via the polynomial hash: Let c_0c_1c_2c_3\ldots c_{k-1} be the string. Then the polynomial hash of this string is the following:

(c_0' + c_1'B + c_2'B^2 + \cdots + c_{k-1}'B^{k-1}) \bmod M

where B and M are fixed constants, and where each character c is assigned to some distinct (nonzero) number c'. For this to be “random” enough, one way would be to pick M to be some large prime, say 10^9+7, and B to be some number with a large (multiplicative) period modulo M.

The main advantage of the polynomial hash is the ease at which to update the hash when a new character is appended to the beginning or end of the string. For example, if H is the hash of the original string, then appending a new character c at the beginning results in the new hash (c + H\cdot B)\bmod M, while appending it at the end results in (H + cB^k) \bmod M. This turns out to be very useful for us since we’re hashing many related strings in a tree.

For each node a, we define H(a) to be the hash of the string from a to the root. We also define H_{\text{rev}}(a) to be the hash of the reversal of this string. Using H(a) and H_{\text{rev}}(a), we can now implement the first operation: The path from a to the root is palindromic “if and only if” H(a) = H_{\text{rev}}(a)! (Note that we quote “if and only if” because we are assuming that we won’t encounter collisions in our hash function.)

Calculating H(a) and H_{\text{rev}}(a) is easy enough because of the properties of the polynomial hash. If a is the root (i.e. a = 1), we have H(a) = H_{\text{rev}}(a) = 0, because the empty string hashes to 0. If a is not the root,

Let d be the depth of a,
Let p be the parent of a, and
Let c be the digit at the edge (p,a).

Then we have H(a) = (c' + H(p)\cdot B)\bmod M and H_{\text{rev}}(a) = (H_{\text{rev}}(p)) + c'B^d)\bmod M. Thus, with a single pass on the tree, these hashes can be computed in overall O(N) time (assuming powers of B have been precomputed.)

Now, let’s try to implement the second operation: Given some path from a node to its ancestor (a \rightsquigarrow c), check whether there is a path from a node in a different subtree (b) to the root with the same string. We can still use the $H(a)$s for this task, but we will need additional tricks. First, hash of the string for the path a \rightsquigarrow c can be obtained from H(a) and H(c) as the following:

(H(a) - H(c)B^{d_a - d_c})\bmod M

Let V be this value. Thus, the problem is now to find a node b in a different subtree such that H(b) = V. Here’s one way to do it:

Collect all $H(b)$s in a set S, for all nodes b from other subtrees.
Check whether V is in S.

This obviously works, but constructing S takes time. O(N) time in fact, which is pretty slow. But the thing is, many of these S'es are the same for many nodes. Specifically, the set S is the same for all nodes belonging in the same subtree.

Thus, one way to optimize this operation would be to process the nodes subtree by subtree. Before processing any subtree, first construct the multiset S of all $H(b)$s for all nodes. Then for a given subtree:

Remove the $H(a)$s for all nodes a in the subtree from the multiset.
Perform the operation above for all nodes in this subtree.
When you’re done, put the $H(a)$s back in the multiset.

Clearly, the S that we use for all operations is correct. But with this scheme, every H(a) is removed and added at most once, so the overall running time is O(N\cdot k), where k is the cost of an insertion and deletion in the multiset! (For example k = O(\log N) for balanced trees.)

Getting the $(x-d)$th ancestor of every node.

Next, we also need to get a very specific ancestor of every node in order for the algorithm above to work. This is the standard level ancestor problem, and there are many common ways to do this, such as using power-of-two ancestor pointers, and things called ladders.

But our case is an offline version of the problem, (all the queries are known beforehand) and there is in fact a simpler way to do it. Simply run a DFS, but always maintain a stack of nodes toward the root. As you go up and down the tree you push and pop nodes in this stack. The nice thing about it is that we have the sequence of nodes to the root every time we encounter a new node, and if we use a self-resizing array, we can get the $(x-d_a)$th ancestor of node a with a single array lookup!

Overall, this step requires a total of O(N) time.

Recursion

From the above, we can now check for the existence of a length-x palindromic path passing through the root. With recursion, we can check for all other paths. However, the speed of our algorithm greatly depends on the shape of the tree: the more imbalanced the tree, the slower our algorithm is! Thus, we must find a way to somehow reduce the overhead of recursion.

Thankfully, there is a simple way to do it: via centroid decomposition. The idea is to not root the tree arbitrarily, rather, we root it at a special node called the centroid. A centroid is a node such that when the tree is rooted at it, there is no heavy subtree, or a subtree with more than half the number of nodes in the original tree.

If we root the tree and all subsequent trees in all recursion levels at their centroids, then we guarantee that the depth of the recursion is at most O(\log N), because you can only halve N at most 1 + \log_2 N times. Thus, this allows us to perform the above step just a few times, which gives us a fast algorithm as long as we can find the centroid of a tree quickly.

How do you find a centroid of a tree? First, root the tree arbitrarily, and compute the sizes of all subtrees (including the tree itself). Now, remember that a tree of N nodes can only have at most one heavy subtree. (Why?) So the algorithm is this: start at the root. Then if there is a heavy subtree rooted at, say r, then rotate the tree to the new root r (and update the sizes), and repeat until you find a root with no heavy subtrees, in which case you now have a centroid. It can be shown that this algorithm always halts.

We now have the solution to the problem! The overall complexity is O(N \log^3 N): O(N \log N) per step, performed on O(\log N) levels of recursion, and O(\log N) times due to binary search.

If you use a hash set instead, the running time is expected O(N \log^2 N).

Implementation notes

Here we describe a few more things worth mentioning.

On the hash function

First, the above algorithm assumes that we won’t ever encounter collisions in our algorithm. However, if we choose M = 10^9 + 7 or some other prime with a similar magnitude, then we might run into the famous birthday paradox, which roughly describes that you don’t need very many people to have a very high chance of a collision of birthdays. (In fact, only 23 people are required to have a more than 50% chance of two people having the same birthday!) In our case, this means that there is quite a high chance that the will be collisions in our hash function, purely because of the number of strings we are considering. ({10^5 \choose 2} \approx 5\times 10^9 in the worst case.)

One way to reduce the chance of collision would be to use a much higher modulus, say an 18-digit prime. But in languages without big integer arithmetic, this can be quite hard to implement.

Another way would be to use two relatively prime moduli, M_1 and M_2, each around 1 billion, and perform two hashes, one for each modulus. This enlarges our hash space similarly but without requiring big integer arithmetic! (This is equivalent to using the modulus M_1\cdot M_2, but implicitly expressing the hash modulo M_1 and M_2 separately via the Chinese remainder theorem.) Thus, instead of a single integer, the hash of our string is now a pair of 32-bit integers. (In fact, you can express it as a single 64-bit integer by storing the two numbers in the 32 higher- and lower-order bits.)

Optimizations

Being a hard problem, the time limit for this problem was intentionally set a bit tighter than usual. Thus, one must really ensure to optimize their implementation of the algorithm. Here are a few details:

Instead of binary searching the whole answer and then performing the algorithm, you can instead push the binary search part inwards. This is because many parts of the algorithm such as rooting the tree at the centroid and computing the hashes take a lot of time. By pushing the binary search inwards, these things will be done less frequently (specifically, by a \log N factor).
When starting a new binary search, use the best result so far as your lower bound. This can shave off a few steps in later binary searches!
Even though hash sets seem theoretically better than tree sets in some sense, it isn’t necessarily so in practice, because it depends on the library implementation. For example, in C++, the tree set (std::set) seems to be 10 times faster than the unordered set (std::unordered_set), which is quite surprising (or maybe I’m just missing something in my implementation, that’s why it’s slow.)
Sometimes, using less memory results in faster running time. This is because using less memory means more things can be stored in the cache, and this helps increase cache hits. So if you have a Node/Tree class, you could try to use the smallest possible data type for each variable in the class to reduce memory usage.
In general, try to be more friendly to your cache. For example, don’t use parallel arrays when implementing your tree, especially if you need multiple fields at once many times.

Time Complexity:

O(N \log^3 N) or O(N \log^2 N)

AUTHOR’S AND TESTER’S SOLUTIONS:

Setter
Tester
Editorialist

xellos0 · December 14, 2015, 3:41pm

I wrote this, but it failed on some tests (even after hashing precautions, I probably have some bug). But it passed all large tests. So I wrote an O(N^3) bruteforce - DFS-ing over triples (R,u,v), where R is a root from which (or from an edge) we grow a palindrome along the path u-v by moving deeper into a son of u and a son of v at once. It passed all but one large test! I merged the wrong and slow solution and got 100 points

animesh_f · December 14, 2015, 4:07pm

Your editorials are always so awesome and wonderfully written! Thanks a lot!