ANKLEX - Editorial

kevinsogo · June 12, 2015, 5:46am

PROBLEM LINK:

Author: Ankit Srivastava
Tester: Kevin Atienza and Gedi Zheng
Editorialist: Kevin Atienza

PREREQUISITES:

String processing, suffix arrays, LCP array, segment trees

PROBLEM:

You are given a string S, and you need to answer Q queries. Each query is a pair of integers (\text{idx}, \text{len}), and the answer is the index of (the first letter of) a substring satisfying the following conditions:

Its length is exactly \text{len}.
It is the next strictly lexicographically larger substring from S[\text{idx}..\text{idx}+\text{len}-1].

If there are many such strings, output the smallest index. If there are none, output -1.

QUICK EXPLANATION:

Construct the suffix array and LCP array of S.

To answer the query (\text{idx}, \text{len}), we need to follow a couple of steps:

First we need to find the index I of S[\text{idx}..] in the suffix array.
Next, we need to find the smallest index J > I such that LCP[J] < \text{len}.
Next, we need to find the smallest index K \ge J such that the K th string in the suffix array has length at least \text{len}.
Next, we need to find the smallest index L > K such that LCP[L] < \text{len}.
Next, find the longest suffix from the K th string to the $(L-1)$th string in the suffix array.
The answer is the position of the suffix acquired in the previous step.

The first and the last steps can be done easily in O(1) time, while the rest can be done in O(\log |S|) time using segment trees.

EXPLANATION:

We are given a substring of S of a certain length, and we are asked to find the starting index of the next strictly lexicographically larger substring of the same length, and in case there are many such occurrences, give the one with the smaller starting index (In this problem, we define the index of a substring as the index of its first letter in the overall string).

This problem would have been easier if we have a list of all the subtrings of S in sorted order. In this list we consider two substrings different if they start, even if they represent the same sequence of characters, so there are N(N+1)/2 string in this list (assuming |S| = N). We break ties according to the index of the substring.

Using this list, we can answer the question easily: walk through this array starting from the position of the input substring and find the next substring that has the same length and is not equal to the input substring!

An example implementation of this idea in Python is the following:

s = raw_input()
n = len(s)

# construct sorted list of substring
substrings = [(i,l) for i in xrange(n) for l in xrange(1,n-i+1)]
substrings.sort(key=lambda (i,l): (s[i:i+l],i))

# position of each substring in sorted array
position = {(i,l): I for I,(i,l) in enumerate(substrings)}

def find(i, l):
    i -= 1 # to zero-indexing
    pos = position[i, l] + 1
    while pos < len(substrings):
        I, L = substrings[pos]
        if l == L and s[i:i+l] != s[I:I+L]:
            return I + 1 # to one-indexing
        pos += 1
    return -1

for qq in xrange(input()):
    i, l = map(int, raw_input().strip().split())
    print find(i, l)

Unfortunately, this is too slow (obviously). This implementation runs in O(N^3 + QN^2) time. It can be optimized though, but however you optimize this, we cannot avoid the fact that there are N(N+1)/2 substrings in the list. Therefore, we need to find a better algorithm.

Suffix array and LCP array

The key insight is to notice that the list of sorted substrings is implicitly represented in the well-known data structure known as the suffix array. The suffix array of S is simply the array of suffixes of S in sorted order. There are well-known algorithms to construct the suffix array in O(N \log^2 N) time (and possibly even faster). A tutorial can be found here.

Consider the string reread. For simplicity, we add an EOF character at the end, say !, to form reread!. The suffix array of this string is the following:

STRING   | INDEX
---------+------
!        | 7
ad!      | 5
d!       | 6
ead!     | 4
eread!   | 2
read!    | 3
reread!  | 1

Now, try to enumerate the distinct substrings of reread in sorted order:

SUBSTRINGS | SUFFIX ARRAY
-----------+------------
a          |                 
ad         | ad!                
d          | d!                
e          |      
ea         |       
ead        | ead!                
er         |       
ere        |        
erea       |         
eread      | eread!                
r          |      
re         |       
rea        |        
read       | read!                
rer        |        
rere       |         
rerea      |          
reread     | reread!

Since this is a sorted list, it’s clear that the suffixes will appear in the correct order in the list of substrings. It’s also clear that all the remaining strings will occur as a prefix of the next suffix because, well, a substring is simply a prefix of some suffix. Therefore, the suffix array implicitly contains all the substrings in sorted order!

But there’s a minor blemish. Notice that, after read in the above list, we immediately begin with rer instead of starting with r (and re). This is because the re part is a common prefix of the suffixes read! and reread!; in fact it is the longest common prefix (or LCP). This means that the strings r and re have already occurred, so we don’t enumerate them anymore and start instead at rer.

Computing the array of LCP values can be done in O(N \log N) (using binary search and hashing) or O(N) using a different approach. Here’s the LCP values for the string reread!:

STRING  | INDEX | LCP
--------+-------+-----
!       | 7     | 0
ad!     | 5     | 0
d!      | 6     | 0
ead!    | 4     | 0
eread!  | 2     | 1
read!   | 3     | 0
reread! | 1     | 2

and the sorted list of substrings:

SUBSTRINGS | SUFFIX ARRAY | LCP
-----------+--------------+----
a          |              |                
ad         | ad!          | 0                             
d          | d!           | 0                             
e          |              |     
ea         |              |      
ead        | ead!         | 0                             
er         |              |      
ere        |              |       
erea       |              |        
eread      | eread!       | 1                             
r          |              |     
re         |              |      
rea        |              |       
read       | read!        | 0                             
rer        |              |       
rere       |              |        
rerea      |              |         
reread     | reread!      | 2

Now that we have the suffix array and the LCP array, let’s discuss how to answer the query: get the next strictly lexicographically larger substring of the same length. Let us set s_i as the index of the i th substring in the suffix array (in the reread! example above, we have [s_0,s_1,s_2,s_3,s_4,s_5,s_6] = [7,5,6,4,2,3,1]). Also, let us define p_I as the inverse of the suffix array, i.e. p_I is the position of S[I..] in the suffix array.

Suppose we are given the substring at index i and with length l, i.e. S[i..i+l-1], and suppose we ignore first the requirement of finding the smallest index, i.e. we only want to find the next lexicographically larger substring than S[i..i+l-1] with the same length, say T. In this case, we can instead find the first occurence of T in the suffix array, i.e. find the j with the smallest p_j such that S[j..j+l-1] = T. What can we say about j in relation to i? Here are a few properties of j:

The suffix S[j..] appears later than S[i..] in the suffix array.
The suffix S[j..] has length at least l, i.e. j + l - 1 \le N.
The substring S[i..i+l-1] is not equal to S[j..j+l-1].

These three properties also define j, i.e. j is the index with the smallest p_j satisfying these properties! Therefore, here’s a possible approach to compute j:

Let I = p_i.
Find the smallest index K > I such that S[i..i+l-1] \not= S[s_K..s_K+l-1], or s_K + l - 1 > N.
Find the smallest index J \ge K such that s_J + l - 1 \le N.
Set j = s_J.

It’s easy to see why this is correct: In the second step we just searched for the next lexicographically larger suffix from S[i..], and then in the third step we searched for the nearest one that has length at least l. However, a naïve implementation of the algorithm is still too slow. Luckily, the operations above can be efficiently implemented with segment trees!

Segment trees

Let’s discuss the second step first: Given I, find the smallest index K > I such that S[s_I..s_I+l-1] \not= S[s_K..s_K+l-1], or s_K + l - 1 > N. This step is complicated by the string comparison operation (a really naïve implementation might even take O(N^2) time!). Luckily, we are walking on the suffix array, and when comparing suffixes, we can exploit the LCP array, i.e.:

For any two indices I < K, S[s_I..s_I+l-1] = S[s_K..s_K+l-1] if and only if \min(LCP[I+1], LCP[I+2], \ldots, LCP[K]) \ge l.

This is because the strings are sorted in the suffix array, so if two strings from different parts of the suffix array have an equal prefix of length l, then all intermediate strings also have the same prefix of length l.

Given this, we now seek the smallest K > I such that \min(LCP[I+1], LCP[I+2], \ldots, LCP[K]) < l. This can be easily answered by building a segment tree on top of the LCP array. Each node in this segment tree represents a subinterval of the LCP array LCP[I..J], and contains the minimum of that subinterval. Such a segment tree can be constructed in O(N) time.

We start at the node represented by LCP[I..I] (this can be found in O(\log N) time). Next, we walk up the tree, seeking the first node [A..B] to the right of LCP[I..I] such that \min(LCP[A..B]) < l. If we don’t find any such node, we conclude that there is no lexicographically larger string to S[i..i+l-1] of the same length, so we immediately stop and output -1. Otherwise, we now have a subtree [A..B] that definitely contains our K. Now we walk down the tree and find the smallest K such that LCP[K] < l.

The overall algorithm runs in O(\log N) time because we only walk the tree once upwards and then downwards!

Now that we have the K, we now discuss the third step: Find the smallest index J \ge K such that s_J + l - 1 \le N, in other words s_J \le N - l + 1. A very similar approach can be done: Create another segment tree, this time on top of the suffix array [s_0, s_1 \ldots, s_N]. This steps also takes O(\log N) time.

Combining the above steps, we now have an O(N)-time preprocessing, O(\log N)-time query to find some index j of the next larger substring!

Getting the smallest index

Now we have the smallest J such that S[s_J..s_J+l-1] is our target string T, we now seek to find the smallest index j' such that S[j'..j'+l-1] = T. In other words, we need to find the smallest s_{J'} such that S[s_{J'}..s_{J'}+l-1] = S[s_J..s_J+l-1] = T.

However, remember that we are still in the suffix array, so our job is made a lot easier by the fact that all such strings appear consecutively! And since J is the first such occurrence of T, we just need to find the index M of the last occurrence of T. In other words, we seek the largest M such that:

S[s_J..s_J+l-1] = S[s_M..s_M+l-1]

But this equality is equivalent to \min(LCP[J+1], LCP[J+2], \ldots, LCP[M]) \ge l. Therefore the same segment tree above can be used to find this M in O(\log N) time!

Finally, now we have the first and last occurrences of T, we now seek the smallest index in that range. But since we already have a segment tree on top of the suffix array, this can easily be computed with a regular range minimum query to find the answer!

Time Complexity:

O(|S| \log^2 |S| + Q \log |S|)

AUTHOR’S AND TESTER’S SOLUTIONS:

Setter
Tester 1
Tester 2

johnathan79717 · June 13, 2015, 8:54am

I didn’t use segment trees. I use RMQ by sparse tables, and then do binary search.

My code