PROBLEM LINK:Author: Ankit Srivastava PREREQUISITES:String processing, suffix arrays, LCP array, segment trees PROBLEM:You are given a string $S$, and you need to answer $Q$ queries. Each query is a pair of integers $(\text{idx}, \text{len})$, and the answer is the index of (the first letter of) a substring satisfying the following conditions:
If there are many such strings, output the smallest index. If there are none, output $1$. QUICK EXPLANATION:Construct the suffix array and LCP array of $S$. To answer the query $(\text{idx}, \text{len})$, we need to follow a couple of steps:
The first and the last steps can be done easily in $O(1)$ time, while the rest can be done in $O(\log S)$ time using segment trees. EXPLANATION:We are given a substring of $S$ of a certain length, and we are asked to find the starting index of the next strictly lexicographically larger substring of the same length, and in case there are many such occurrences, give the one with the smaller starting index (In this problem, we define the index of a substring as the index of its first letter in the overall string). This problem would have been easier if we have a list of all the subtrings of $S$ in sorted order. In this list we consider two substrings different if they start, even if they represent the same sequence of characters, so there are $N(N+1)/2$ string in this list (assuming $S = N$). We break ties according to the index of the substring. Using this list, we can answer the question easily: walk through this array starting from the position of the input substring and find the next substring that has the same length and is not equal to the input substring! An example implementation of this idea in Python is the following:
Unfortunately, this is too slow (obviously). This implementation runs in $O(N^3 + QN^2)$ time. It can be optimized though, but however you optimize this, we cannot avoid the fact that there are $N(N+1)/2$ substrings in the list. Therefore, we need to find a better algorithm. Suffix array and LCP arrayThe key insight is to notice that the list of sorted substrings is implicitly represented in the wellknown data structure known as the suffix array. The suffix array of $S$ is simply the array of suffixes of $S$ in sorted order. There are wellknown algorithms to construct the suffix array in $O(N \log^2 N)$ time (and possibly even faster). A tutorial can be found here. Consider the string
Now, try to enumerate the distinct substrings of
Since this is a sorted list, it's clear that the suffixes will appear in the correct order in the list of substrings. It's also clear that all the remaining strings will occur as a prefix of the next suffix because, well, a substring is simply a prefix of some suffix. Therefore, the suffix array implicitly contains all the substrings in sorted order! But there's a minor blemish. Notice that, after Computing the array of LCP values can be done in $O(N \log N)$ (using binary search and hashing) or $O(N)$ using a different approach. Here's the LCP values for the string
and the sorted list of substrings:
Now that we have the suffix array and the LCP array, let's discuss how to answer the query: get the next strictly lexicographically larger substring of the same length. Let us set $s_i$ as the index of the $i$th substring in the suffix array (in the Suppose we are given the substring at index $i$ and with length $l$, i.e. $S[i..i+l1]$, and suppose we ignore first the requirement of finding the smallest index, i.e. we only want to find the next lexicographically larger substring than $S[i..i+l1]$ with the same length, say $T$. In this case, we can instead find the first occurence of $T$ in the suffix array, i.e. find the $j$ with the smallest $p_j$ such that $S[j..j+l1] = T$. What can we say about $j$ in relation to $i$? Here are a few properties of $j$:
These three properties also define $j$, i.e. $j$ is the index with the smallest $p_j$ satisfying these properties! Therefore, here's a possible approach to compute $j$:
It's easy to see why this is correct: In the second step we just searched for the next lexicographically larger suffix from $S[i..]$, and then in the third step we searched for the nearest one that has length at least $l$. However, a naïve implementation of the algorithm is still too slow. Luckily, the operations above can be efficiently implemented with segment trees! Segment treesLet's discuss the second step first: Given $I$, find the smallest index $K > I$ such that $S[s_I..s_I+l1] \not= S[s_K..s_K+l1]$, or $s_K + l  1 > N$. This step is complicated by the string comparison operation (a really naïve implementation might even take $O(N^2)$ time!). Luckily, we are walking on the suffix array, and when comparing suffixes, we can exploit the LCP array, i.e.: For any two indices $I < K$, $S[s_I..s_I+l1] = S[s_K..s_K+l1]$ if and only if $\min(LCP[I+1], LCP[I+2], \ldots, LCP[K]) \ge l$. This is because the strings are sorted in the suffix array, so if two strings from different parts of the suffix array have an equal prefix of length $l$, then all intermediate strings also have the same prefix of length $l$. Given this, we now seek the smallest $K > I$ such that $\min(LCP[I+1], LCP[I+2], \ldots, LCP[K]) < l$. This can be easily answered by building a segment tree on top of the LCP array. Each node in this segment tree represents a subinterval of the LCP array $LCP[I..J]$, and contains the minimum of that subinterval. Such a segment tree can be constructed in $O(N)$ time. We start at the node represented by $LCP[I..I]$ (this can be found in $O(\log N)$ time). Next, we walk up the tree, seeking the first node $[A..B]$ to the right of $LCP[I..I]$ such that $\min(LCP[A..B]) < l$. If we don't find any such node, we conclude that there is no lexicographically larger string to $S[i..i+l1]$ of the same length, so we immediately stop and output $1$. Otherwise, we now have a subtree $[A..B]$ that definitely contains our $K$. Now we walk down the tree and find the smallest $K$ such that $LCP[K] < l$. The overall algorithm runs in $O(\log N)$ time because we only walk the tree once upwards and then downwards! Now that we have the $K$, we now discuss the third step: Find the smallest index $J \ge K$ such that $s_J + l  1 \le N$, in other words $s_J \le N  l + 1$. A very similar approach can be done: Create another segment tree, this time on top of the suffix array $[s_0, s_1 \ldots, s_N]$. This steps also takes $O(\log N)$ time. Combining the above steps, we now have an $O(N)$time preprocessing, $O(\log N)$time query to find some index $j$ of the next larger substring! Getting the smallest indexNow we have the smallest $J$ such that $S[s_J..s_J+l1]$ is our target string $T$, we now seek to find the smallest index $j'$ such that $S[j'..j'+l1] = T$. In other words, we need to find the smallest $s_{J'}$ such that $S[s_{J'}..s_{J'}+l1] = S[s_J..s_J+l1] = T$. However, remember that we are still in the suffix array, so our job is made a lot easier by the fact that all such strings appear consecutively! And since $J$ is the first such occurrence of $T$, we just need to find the index $M$ of the last occurrence of $T$. In other words, we seek the largest $M$ such that:
But this equality is equivalent to $\min(LCP[J+1], LCP[J+2], \ldots, LCP[M]) \ge l$. Therefore the same segment tree above can be used to find this $M$ in $O(\log N)$ time! Finally, now we have the first and last occurrences of $T$, we now seek the smallest index in that range. But since we already have a segment tree on top of the suffix array, this can easily be computed with a regular range minimum query to find the answer! Time Complexity:$O(S \log^2 S + Q \log S)$ AUTHOR'S AND TESTER'S SOLUTIONS:
This question is marked "community wiki".
asked 12 Jun '15, 05:46
