PROBLEM LINK:Author: Ke Bi DIFFICULTY:MediumHard PREREQUISITES:Suffix array, longest common prefix, string hashing, range minimum query, segment tree PROBLEM:A tandem is a nonempty string repeated thrice, i.e., a string of the form $AAA$ for a nonempty string $A$. A tandem is interesting if the character following each instance of $A$ are not all the same, otherwise it is boring. Given a string $s$ of length $N$, how many interesting and boring tandems does $s$ have? QUICK EXPLANATION:We enumerate interesting and boring tandems per length of $A$. Let's say $A = L$. Thus, the length of the tandem is $3L$. Fix this $L$. Consider the positions $0, L, 2L, 3L, \ldots$. A tandem of length $3L$ will cover exactly three such positions. So we count boring and interesting tandems per three positions $i$, $j$, $k$ with $(i,j,k) = (i,i+L,i+2L)$.
Let $V = \min(LCP(i,j,k),L) + \min(LCS(i,j,k),L)  1$. Then:
The answers are the sums of all these contributions across all $L$s and across all $(i,j,k)$s. EXPLANATION:We will use the notation $s_{i, j}$ to denote the substring from $s_i$ to $s_j$. Thus, $s_{i,j} = ji+1$. Faster than $O(N^3)$The obvious $O(N^3)$ bruteforce algorithm here doesn't pass, because $N$ is very large. (And actually, even $O(N^2)$ shouldn't pass.) So let's try to improve on $O(N^3)$. Here we'll try to show a solution that will lead to the intended one. Let's enumerate the tandems according to the length of $A$. Suppose $A = L$, so the tandem $AAA$ has length $3L$. For now, let's fix this $L$ and then count all boring and interesting tandems of length $3L$. Ideally our solution will be fast enough so we can do this for each $L \in [1,N/3]$. Consider the characters $s_0, s_L, s_{2L}, s_{3L}, \ldots$. There are approximately $N/L$ such positions. But since these are spaced $L$ characters apart from each other, it follows that every tandem of length $3L$ covers exactly three of these. Therefore, let's fix three adjacent characters, say $s_i, s_j, s_k$ where $(i,j)$ and $(j,k)$ are spaced $L$ characters away from each other, then count all boring and interesting tandems covering these three. The tandem must start at some position in $[iL+1,i]$. Otherwise, if it starts at some position $> i$ it will not be able to cover $s_i$, and if it starts at some position $\le iL$, it will not be able to cover $s_k$. But it can't just start at any such position! If we want the string $s_{a, a+3L1}$ to be a tandem, the following must hold (by definition): $$s_{a,a+L1} = s_{a+L,a+2L1} = s_{a+2L,a+3L1}$$ But due to the way we selected the position $a$, we find that $a \le i \le a+L1$, $a+L \le j \le a+2L1$ and $a+2L \le k \le a+3L1$. It means that we can decompose the equality above into two parts: $$s_{a,i} = s_{a+L,j} = s_{a+2L,k}$$ $$s_{i,a+L1} = s_{j,a+2L1} = s_{k,a+3L1}$$ We can restate these two equalities in another way. The first one is the same as: The strings $s_{0, i}$, $s_{0, j}$ and $s_{0, k}$ must have a common suffix of length $ia+1$. The second one is the same as: The strings $s_{i, N1}$, $s_{j, N1}$ and $s_{k, N1}$ must have a common prefix of length $a+Li$. This now gives us a way to compute the number of tandems covering $s_i$, $s_j$ and $s_k$. Let's first define two things:
There is a tandem starting at $a$ if $ia+1 \le LCS(i,j,k)$, $a+Li \le LCP(i,j,k)$, and $a\in [iL+1,i]$. Rearranging these inequalities gives: $$\max(iLCS(i,j,k)+1, iL+1) \le a \le \min(LCP(i,j,k)L+i, i)$$ which means the number of valid $a$s is $$\min(LCP(i,j,k)L+i, i)  \max(iLCS(i,j,k)+1, iL+1) + 1$$ or $0$ if this number is negative. We can actually simplify that number to the following: $$\min(LCP(i,j,k),L) + \min(LCS(i,j,k),L)  L$$ But which among these are boring, and which are interesting? Notice that if $s_{a,a+3L1}$ and $s_{a+1,a+3L}$ are both tandems, then $s_{a,a+3L1}$ is automatically boring, because the following characters are the same. It means that only the last tandem, $s_{i,i+3L1}$ can be interesting. To check if it's interesting, simply check if $s_{i+L} = s_{j+L} = s_{k+L}$. If this is true, then the string is boring, otherwise it is interesting! Computing $\min(LCP(i,j,k),L)$ and $\min(LCS(i,j,k),L)$ naïvely takes $O(L)$ time each. Since we will do this approximately $N/L$ times for each length $L$, the running time is therefore $O(N/L\cdot L) = O(N)$ per $L$. So if we do this for each $L\in [1,N/3]$, the running time is $O(N^2)$. This is much better than $O(N^3)$! Faster solutionTo improve that solution, we need a faster way to compute $LCP(i,j,k)$ and $LCS(i,j,k)$. We'll describe various solutions here. Solution 1: Hashing + Binary SearchIf we can compare any two substrings for equality in $O(1)$ time, then we can improve computing $LCP$ and $LCS$ quickly using binary search. One way to do substring comparisons quickly is with hashing! Specifically, we will use polynomial hashing. We fix a base $B$ and a modulus $M$. Let's now define the hash function $H(S)$. For the string $s = s_0s_1s_2\ldots s_{N1}$, its hash will be the following value: $$H(s) = (v(s_0) + v(s_1)B^1 + v(s_2)B^2 + v(s_3)B^3 + \ldots + v(s_{N1})B^{N1}) \bmod M$$ where $v(c)$ is another function that maps a character to a number. We will use this hash to compare substrings in $O(1)$:
If two strings are equal, then their hashes must be the same. Unfortunately, if they are not equal, the hashes might still be the same. For this reason, this hashing algorithm fails sometimes. But if you choose $B$, $M$ and $v(c)$ well enough, then you'll probably still get accepted :D So how do we preprocess the string? The key advantage of polynomial hashing is the following properties: (assuming we have precomputed powers of $B$ modulo $M$)
These properties are what we can use to preprocess. We simply compute the hashes of all suffixes of the string. This can be done in $O(N)$ by using the concatenation property above. Then, to compute the hash of a substring $s$, find the appropriate suffixes $st$ and $t$, then compute $H(s)$ from $H(st)$ and $H(t)$ using the splitting property! So what's the running time of this algorithm? Notice that $LCP$ and $LCS$ are computed with binary search, which takes $O(\log N)$ time each, so doing that $N/L$ times gives $O(N/L \log N)$. Summing from $1$ to $L$ gives $$O\left(\sum_{L=1}^{N/3} N/L \log N\right) = O\left(N \log N \sum_{L} 1/L\right) = O(N \log^2 N)$$ owing to the fact that the Harmonic series grows as $O(\log N)$. Solution 2: Suffix array + segment treeComputing longest common prefixes is a standard step when you already have the suffix array. After constructing the suffix array, constructing the LCP array takes just $O(N)$. Afterwards, to find the longest common prefix of two suffixes, $s_{i,N1}$ and $s_{j,N1}$:
Modifying this for three suffixes is straightforward. Thus, we have reduced LCP to a range minimum query, which is pretty standard :) Similarly, you can construct the "$LCS$ array" by constructing a suffix array for the reverse of $s$, so $LCS$ can be reduced to a range minimum query as well. How do we compute range minimum queries? The most standard way is to use segment trees. This way, each range minimum query takes $O(\log N)$ time to compute. Thus, the running time is again $O(N \log^2 N)$ like above. (Note that the suffix array can be constructed in $O(N \log^2 N)$ time.) But if you implement a segment tree normally, you might get TLE, even though the time complexity is correct! If this happens to you, here's an improvement: Notice that we actually only need $\min(LCP(i,j,k),L)$. Thus, we can add a third argument to our
Notice that the partial results are getting passed everywhere, so that the pruning done by Solution 3: Suffix array + sparse tableAnother way to compute range minimum queries is by computing a table of minimums. Let $M[i][k]$ be the range minimum in the range $[i,i+2^k1]$. We can build the table $M[i][k]$ using the following property: Notice that we moved the $\log N$ factor from query time to construction time; whereas segment trees take $O(N)$ time to construct and has $O(\log N)$ query time, this solution takes $O(N \log N)$ time to construct and has $O(1)$ query time. But this change is significant, because the number of queries that we have is actually: $$O\left(\sum_L N/L\right) = O(N \log N)$$ which means that the overall running time (after constructing the suffix array) is $$O(N \log N) + O(N \log N) = O(N \log N)$$ which is an improvement over $O(N \log^2 N)$! (Note that there are $O(N\log N)$, and even $O(N)$ algorithms to construct the suffix array.) Time Complexity:$O(N \log^2 N)$ AUTHOR'S AND TESTER'S SOLUTIONS:
This question is marked "community wiki".
asked 22 May '16, 22:03
