You are not logged in. Please login at www.codechef.com to post your questions!

×

TANDEM - Editorial

PROBLEM LINK:

Contest
Practice

Author: Ke Bi
Tester: Kevin Atienza
Translators: Vasya Antoniuk (Russian), Team VNOI (Vietnamese) and Hu Zecong (Mandarin)
Editorialist: Kevin Atienza
Contest Admin: Praveen Dhinwa

DIFFICULTY:

Medium-Hard

PREREQUISITES:

Suffix array, longest common prefix, string hashing, range minimum query, segment tree

PROBLEM:

A tandem is a nonempty string repeated thrice, i.e., a string of the form $AAA$ for a nonempty string $A$. A tandem is interesting if the character following each instance of $A$ are not all the same, otherwise it is boring. Given a string $s$ of length $N$, how many interesting and boring tandems does $s$ have?

QUICK EXPLANATION:

We enumerate interesting and boring tandems per length of $A$. Let's say $|A| = L$. Thus, the length of the tandem is $3L$. Fix this $L$.

Consider the positions $0, L, 2L, 3L, \ldots$. A tandem of length $3L$ will cover exactly three such positions. So we count boring and interesting tandems per three positions $i$, $j$, $k$ with $(i,j,k) = (i,i+L,i+2L)$.

  • Let $LCP(i,j,k)$ be the length of the longest common prefix of $s[i\ldots N-1]$, $s[j\ldots N-1]$ and $s[k\ldots N-1]$.
  • Let $LCS(i,j,k)$ be the length of the longest common suffix of $s[0\ldots i]$, $s[0\ldots j]$ and $s[0\ldots k]$.

Let $V = \min(LCP(i,j,k),L) + \min(LCS(i,j,k),L) - 1$. Then:

  • If $V < L$, then there are no tandems.
  • If $V \ge L$ and $LCP(i,j,k) > L$, then there are $V-L+1$ boring tandems and $0$ interesting tandems.
  • If $V \ge L$ and $LCP(i,j,k) \le L$, then there are $V-L$ boring tandems and $1$ interesting tandem.

The answers are the sums of all these contributions across all $L$s and across all $(i,j,k)$s.

EXPLANATION:

We will use the notation $s_{i, j}$ to denote the substring from $s_i$ to $s_j$. Thus, $|s_{i,j}| = j-i+1$.

Faster than $O(N^3)$

The obvious $O(N^3)$ brute-force algorithm here doesn't pass, because $N$ is very large. (And actually, even $O(N^2)$ shouldn't pass.) So let's try to improve on $O(N^3)$. Here we'll try to show a solution that will lead to the intended one.

Let's enumerate the tandems according to the length of $A$. Suppose $|A| = L$, so the tandem $AAA$ has length $3L$. For now, let's fix this $L$ and then count all boring and interesting tandems of length $3L$. Ideally our solution will be fast enough so we can do this for each $L \in [1,N/3]$.

Consider the characters $s_0, s_L, s_{2L}, s_{3L}, \ldots$. There are approximately $N/L$ such positions. But since these are spaced $L$ characters apart from each other, it follows that every tandem of length $3L$ covers exactly three of these. Therefore, let's fix three adjacent characters, say $s_i, s_j, s_k$ where $(i,j)$ and $(j,k)$ are spaced $L$ characters away from each other, then count all boring and interesting tandems covering these three.

The tandem must start at some position in $[i-L+1,i]$. Otherwise, if it starts at some position $> i$ it will not be able to cover $s_i$, and if it starts at some position $\le i-L$, it will not be able to cover $s_k$. But it can't just start at any such position! If we want the string $s_{a, a+3L-1}$ to be a tandem, the following must hold (by definition): $$s_{a,a+L-1} = s_{a+L,a+2L-1} = s_{a+2L,a+3L-1}$$ But due to the way we selected the position $a$, we find that $a \le i \le a+L-1$, $a+L \le j \le a+2L-1$ and $a+2L \le k \le a+3L-1$. It means that we can decompose the equality above into two parts: $$s_{a,i} = s_{a+L,j} = s_{a+2L,k}$$ $$s_{i,a+L-1} = s_{j,a+2L-1} = s_{k,a+3L-1}$$ We can restate these two equalities in another way. The first one is the same as:

The strings $s_{0, i}$, $s_{0, j}$ and $s_{0, k}$ must have a common suffix of length $i-a+1$.

The second one is the same as:

The strings $s_{i, N-1}$, $s_{j, N-1}$ and $s_{k, N-1}$ must have a common prefix of length $a+L-i$.

This now gives us a way to compute the number of tandems covering $s_i$, $s_j$ and $s_k$. Let's first define two things:

  • $LCP(i,j,k)$ is the length of the longest common prefix of $s_{i, N-1}$, $s_{j, N-1}$ and $s_{k, N-1}$.
  • $LCS(i,j,k)$ is the length of the longest common suffix of $s_{0, i}$, $s_{0, j}$ and $s_{0, k}$,. (Not to be confused with the longest common subsequence.)

There is a tandem starting at $a$ if $i-a+1 \le LCS(i,j,k)$, $a+L-i \le LCP(i,j,k)$, and $a\in [i-L+1,i]$. Rearranging these inequalities gives: $$\max(i-LCS(i,j,k)+1, i-L+1) \le a \le \min(LCP(i,j,k)-L+i, i)$$ which means the number of valid $a$s is $$\min(LCP(i,j,k)-L+i, i) - \max(i-LCS(i,j,k)+1, i-L+1) + 1$$ or $0$ if this number is negative. We can actually simplify that number to the following: $$\min(LCP(i,j,k),L) + \min(LCS(i,j,k),L) - L$$ But which among these are boring, and which are interesting? Notice that if $s_{a,a+3L-1}$ and $s_{a+1,a+3L}$ are both tandems, then $s_{a,a+3L-1}$ is automatically boring, because the following characters are the same. It means that only the last tandem, $s_{i,i+3L-1}$ can be interesting. To check if it's interesting, simply check if $s_{i+L} = s_{j+L} = s_{k+L}$. If this is true, then the string is boring, otherwise it is interesting!

Computing $\min(LCP(i,j,k),L)$ and $\min(LCS(i,j,k),L)$ naïvely takes $O(L)$ time each. Since we will do this approximately $N/L$ times for each length $L$, the running time is therefore $O(N/L\cdot L) = O(N)$ per $L$. So if we do this for each $L\in [1,N/3]$, the running time is $O(N^2)$. This is much better than $O(N^3)$!

Faster solution

To improve that solution, we need a faster way to compute $LCP(i,j,k)$ and $LCS(i,j,k)$. We'll describe various solutions here.

Solution 1: Hashing + Binary Search

If we can compare any two substrings for equality in $O(1)$ time, then we can improve computing $LCP$ and $LCS$ quickly using binary search. One way to do substring comparisons quickly is with hashing!

Specifically, we will use polynomial hashing. We fix a base $B$ and a modulus $M$. Let's now define the hash function $H(S)$. For the string $s = s_0s_1s_2\ldots s_{N-1}$, its hash will be the following value: $$H(s) = (v(s_0) + v(s_1)B^1 + v(s_2)B^2 + v(s_3)B^3 + \ldots + v(s_{N-1})B^{N-1}) \bmod M$$ where $v(c)$ is another function that maps a character to a number.

We will use this hash to compare substrings in $O(1)$:

  • Preprocess the string so that $H(s)$ can be computed in $O(1)$ time for any substring $s$.
  • To compare two substrings $s$ and $t$, simply compare their hashes, $H(s)$ and $H(t)$.

If two strings are equal, then their hashes must be the same. Unfortunately, if they are not equal, the hashes might still be the same. For this reason, this hashing algorithm fails sometimes. But if you choose $B$, $M$ and $v(c)$ well enough, then you'll probably still get accepted :D

So how do we preprocess the string? The key advantage of polynomial hashing is the following properties: (assuming we have precomputed powers of $B$ modulo $M$)

  • Concatenation. Given $H(s)$ and $H(t)$, $H(st)$ can be computed in $O(1)$ time as $H(st) = (H(s) + H(t)B^{|s|})\bmod M$.
  • Splitting. Given $H(st)$ and $H(t)$, $H(s)$ can be computed in $O(1)$ time as $H(s) = (H(st) - H(t)B^{|s|})\bmod M$.

These properties are what we can use to preprocess. We simply compute the hashes of all suffixes of the string. This can be done in $O(N)$ by using the concatenation property above. Then, to compute the hash of a substring $s$, find the appropriate suffixes $st$ and $t$, then compute $H(s)$ from $H(st)$ and $H(t)$ using the splitting property!

So what's the running time of this algorithm? Notice that $LCP$ and $LCS$ are computed with binary search, which takes $O(\log N)$ time each, so doing that $N/L$ times gives $O(N/L \log N)$. Summing from $1$ to $L$ gives $$O\left(\sum_{L=1}^{N/3} N/L \log N\right) = O\left(N \log N \sum_{L} 1/L\right) = O(N \log^2 N)$$ owing to the fact that the Harmonic series grows as $O(\log N)$.

Solution 2: Suffix array + segment tree

Computing longest common prefixes is a standard step when you already have the suffix array. After constructing the suffix array, constructing the LCP array takes just $O(N)$. Afterwards, to find the longest common prefix of two suffixes, $s_{i,N-1}$ and $s_{j,N-1}$:

  • Find the positions of $i$ and $j$ in the suffix array. Let's say they are $i'$ and $j'$. Without loss of generality, assume $i' \le j'$. (We can swap $i$ and $j$ otherwise.)
  • The LCP is now $\min(LCP[i'+1],LCP[i'+2],\ldots,LCP[j'])$.

Modifying this for three suffixes is straightforward. Thus, we have reduced LCP to a range minimum query, which is pretty standard :) Similarly, you can construct the "$LCS$ array" by constructing a suffix array for the reverse of $s$, so $LCS$ can be reduced to a range minimum query as well.

How do we compute range minimum queries? The most standard way is to use segment trees. This way, each range minimum query takes $O(\log N)$ time to compute. Thus, the running time is again $O(N \log^2 N)$ like above. (Note that the suffix array can be constructed in $O(N \log^2 N)$ time.) But if you implement a segment tree normally, you might get TLE, even though the time complexity is correct! If this happens to you, here's an improvement: Notice that we actually only need $\min(LCP(i,j,k),L)$. Thus, we can add a third argument to our range_min query call, which denotes the current upper bound. range_min now looks like this:

def range_min(i, j, L):
    if this.min >= L: // special pruning.
        return L      // if this.min >= L, it's impossible to improve upon 'L'.
    if i <= this.i and this.j <= j:
        return this.min
    if this.j < i or j < this.i:
        return L
    return left.range_min(i, j, right.range_min(i, j, L))

Notice that the partial results are getting passed everywhere, so that the pruning done by this.min >= L is maximized. With this, I was able to get the segment tree solution accepted!

Solution 3: Suffix array + sparse table

Another way to compute range minimum queries is by computing a table of minimums. Let $M[i][k]$ be the range minimum in the range $[i,i+2^k-1]$. We can build the table $M[i][k]$ using the following property:
$$\begin{align*} M[i][0] &= A_i \\\ M[i][k] &= \min(M[i][k-1], M[i+2^{k-1}][k-1]) \end{align*}$$ Thus, this construction takes $O(N \log N)$ time. Afterwards, each range query runs in $O(1)$ time, because the minimum in the range $[i,j]$ is $\min(M[i][k], M[j-2^k+1][k])$, where $2^k$ is the largest power of two $\le j-i+1$. (Now, computing $k$ takes $O(\log N)$, but you can just precompute it in $O(N)$ time for each length $j-i+1$.)

Notice that we moved the $\log N$ factor from query time to construction time; whereas segment trees take $O(N)$ time to construct and has $O(\log N)$ query time, this solution takes $O(N \log N)$ time to construct and has $O(1)$ query time. But this change is significant, because the number of queries that we have is actually: $$O\left(\sum_L N/L\right) = O(N \log N)$$ which means that the overall running time (after constructing the suffix array) is $$O(N \log N) + O(N \log N) = O(N \log N)$$ which is an improvement over $O(N \log^2 N)$! (Note that there are $O(N\log N)$, and even $O(N)$ algorithms to construct the suffix array.)

Time Complexity:

$O(N \log^2 N)$

AUTHOR'S AND TESTER'S SOLUTIONS:

Setter
Tester-hash
Tester-tabl
Tester-tree

This question is marked "community wiki".

asked 22 May '16, 22:03

kevinsogo's gravatar image

7★kevinsogo
1.7k586142
accept rate: 11%

edited 23 May '16, 00:56


That was a good one!

link

answered 24 May '16, 18:24

vatsalsharma_3's gravatar image

4★vatsalsharma_3
1785
accept rate: 11%

toggle preview
Preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported
  • mathemetical formulas in Latex between $ symbol

Question tags:

×15,639
×1,755
×1,240
×90
×62
×56
×36
×24

question asked: 22 May '16, 22:03

question was seen: 2,236 times

last updated: 24 May '16, 18:24