TREEP - Editorial

xellos0 · November 14, 2015, 7:44pm

PROBLEM LINK:

Author: Vsevolod Stepanov
Tester: Istvan Nagy
Editorialist: Xellos

DIFFICULTY:

MEDIUM-HARD

PREREQUISITES:

String hashing, heavy-light decomposition, lowest common ancestor, binary-indexed trees (BITs, Fenwick trees), modular inverse.

PROBLEM:

You’re given a tree with a letter on each edge and queries of 2 types:

update the letter on some edge
find the smallest period of the string on some path

Answer the queries of the second type.

QUICK EXPLANATION:

Test periodicity (only for some divisors) using hashes. Computing hashes with updates can be done using HLD and Fenwick trees over their paths.

EXPLANATION:

We’ll consider the tree rooted at vertex 1. It’s easy to DFS through the tree and find depths of vertices.

At first glance, it seems like we’d have too many periods to consider. However, we know that for a string of length l (which means a path between u and v with length l), its period must divide l, so we can just look at each divisor of l, check if it’s a period of our string and pick the smallest valid period.

How would we check whether p is a valid period? We can’t generate the strings, that’s way too slow. In that case, the usual approach is hashing. In particular, polynomial hashing is very useful. Think about it: if we have a string s with hash H=\left(\sum_{k=0}^{|s|-1} b^k s_k\right)\% MOD for some chosen b,MOD and its prefix of length p repeats in it periodically, then s_k=s_{k\%p} and

H=\sum_{k=0}^{p-1} s_k\sum_{j=0}^{|s|/p-1} b^{k+jp}=\sum_{k=0}^{p-1} s_k b^k \sum_{j=0}^{|s|/p-1} (b^p)^j = \left(\sum_{k=0}^{p-1} s_k b^k\right) \frac{(b^p)^{|s|/p}-1}{b^p-1} = H_p \frac{b^{|s|}-1}{b^p-1}

where we just used the formula for the sum of a finite geometric series; the sum \sum_{k=0}^{p-1} s_k p^k is the hash H_p of our prefix of length p and if we choose b and MOD carefully (so that b would be coprime to MOD for all p \le N; MOD should be large, and it’s best to make it a prime close to 10^9), we can precompute all possible powers of b modulo MOD and their modular inverses, which allows us to check in O(1) whether that equality holds and p is a period of s. The problem reduces to different queries: update character / query hash along a path.

If there were no updates, we could just precompute hashes of strings read from each vertex to the root and from the root to each vertex (let’s call them direct and reverse hashes) in a DFS using the same formulas as for a linear string - if a parent vertex at depth d has direct hash h_p and reverse hash h^r_p, then its child will have direct hash bh_p+c and reverse hash h^r_p+c b^{d-1}, where c is the letter on the edge between them; from them, we can find the hash on any path that goes only up or only down (we also need to know the length of such a path, which is just the difference of depths of its endpoints). Afterwards, any path u\texttt{-> }v can be split into 2 parts by the lowest common ancestor (LCA) of u,v; one part goes up and the other down, we can compute their corresponding hashes and combine them.

However, we need to process updates. Note that this approach with sums from the root to each vertex generally works if we compute some property which can be recomputed when merging or splitting paths (think about it as a generalisation of prefix sums). When we add updates, too many of those values would change too often, so we need something more efficient. And that something is usually heavy-light decomposition (HLD) with a binary-indexed tree over each path.

It works for this problem as well. Specifically, we’ll remember the contribution of each edge to the direct or reverse hash (a power of b times the value of the letter on it) of the string on the path it’s in, using two BITs F^d and F^r for that path. Updates are easy, just change the contribution of a given edge in both BITs. In at most 2 queries on one BIT, we can compute the hash of any subpath of an HLD-path in any direction - the sum of contributions of edges in the subpath divided by a suitable power of b. Depths don’t change, so we can find lengths of substrings easily and then merge them using formulas.

As before, we only need one direct and one reverse hash to the LCA. The approaches to computing them differ only in the specific formulas, so let’s just describe the one for direct hashes. We’ll move up from u to l and store the current direct hash. There are multiple cases to consider:

l and u lie on the same HLD-path: compute the hash on the path u\texttt{ -> }v, add it to the current hash, quit
u lies on top of an HLD-path: just move up an edge to the parent of u, updating the current hash by one letter and exiting the current HLD-path
otherwise: compute the hash from u to the top of its HLD-path, add it to the current hash

The formulas for computing and merging direct and reverse hashes are some dank weed, there’s no point in writing them here. Just try to read some codes or derive them yourself.

Since we can only encounter O(\log{N}) HLD-paths when moving up from any u and for each of those paths and spend O(\log{N}) time on BIT-queries on each path, this allows us to compute hashes in O(\log^2{N}). However, we can compute a set of K prefix hashes in parallel even faster. We only need independent BIT-queries in case 1, and that only occurs once for each prefix. For cases 2 and 3, we’re adding a hash of the same substring to current hashes of some prefixes; that hash can be calculated in O(\log{N}) and adding it to some of the K current hashes takes O(K).

Let’s also note one small optimisation: if we make u the deeper vertex of the given path, then each prefix that we need to test for a period doesn’t cross the LCA, so it can be computed only in the part for direct hashes.

Overall, we’ve got an algorithm in O(Nd(N)\log{N}) time, where d(N) is the max. number of divisors of some number up to N. With a good implementation (especially with not too many modulo operations), this is enough for full score. However, we don’t need to try all divisors due to one simple reason: if p_1 and p_2 are periods, then their GCD will also be a period - if p_2 > p_1 and s_k=s_{k+p_1}=s_{k+p_2} (in order to not deal with string finiteness, we can just imagine it’s repeated infinitely), then s_k=s_{k+p_2-p_1}, so p_2-p_1 is also a period; subtracting the smaller from the larger number is the basis of Euclid’s GCD algorithm, so by repeatedly doing that, we’ll arrive to a period equal to GCD(p_1,p_2).

This leads to the following observation: let’s have a path with length decomposed into prime powers as l=\prod_i p_i^{e_i} with e_i > 1. Then, we just need to try periods of the form l/p_i^{\alpha_i} for all 1 \le \alpha_i \le e_i and the “period” of l (for which we have to find the hash anyway); their GCD is the correct period. The point is that this way, we’ll find out which maximum power of each prime appears in the optimal period. And since the number of such periods to test = the sum of exponents + 1 = O(\log{N}) (use the estimate p_i \ge 2), we have \log{N} instead of d(N) in the complexity, which then becomes O(N\log^2{N}).

ALTERNATIVE SOLUTION:

The author prefers a segment tree over an Eulerian tour of the tree to HLD.

AUTHOR’S AND TESTER’S SOLUTIONS:

The author’s solution can be found here.
The tester’s solution can be found here.
The editorialist’s solution can be found here.