QSTRING-Editorial

djdolls · October 13, 2014, 3:01pm

PROBLEM LINK:

Practice
Contest

Author: Minako Kojima
Tester: Jingbo Shang
Editorialist: Ajay K. Verma

DIFFICULTY:

Medium

PREREQUISITES:

Suffix Array, LCP Array, Persistent Segment Tree

PROBLEM:

Given a string S, support the following queries:

Given integers k1 and k2, compute the k1-th lexicographic smallest substring S1 among all unique substrings of S. The substring S1 may occur several times in S, find its k2-th occurrence.
Given a substring S[l…r], find its rank k1 among the unique substrings of S, Also this substring may occur several times in the string, find how many of these substrings start before l.

EXPLANATION:

First let us ignore the k2, and instead consider a simpler problem, which supports the following two queries:

Given k1, find the k1-th lexicographic smallest substring S1 among all unique substrings of S.
Given a substring S[l…r], find its rank k1 among the unique substrings of S.

The two queries can be answered easily using the suffix array and lcp array of the string S. In order to see the construction algorithm for suffix and lcp array consider looking at this.

Briefly, suffix array of an string S[1…N], consists of of an array SA[1…N], where SA[i] is the index x such that S[x…N] is the i-th lexicographically smallest suffix of string S. Usually, we also create the inverse array of SA, known as POS[1…N]. The element POS[i] gives the rank of suffix S[i…N] among all suffixes of S.

the lcp array lcp[1…N] is used to compute the longest common prefix of suffixes of S. The value lcp[i] represents the length of the longest common prefix of the suffix S[SA[i]…N] and S[SA[i - 1]…N] for i > 1. A combination of lcp array and range minimum data structure can be used to compute the length of the common prefix of arbitrary suffixes of S in O(1) time as shown here.

Unique Substrings:

Suffix array can be used to compute the number of unique substrings of S. We count these substrings in lexicographic order. Since the array SA[] consist of the string indices in lexicographic order of suffixes, hence, in the lexicographic enumeration of unique substrings of S, the substrings starting at SA[i] will be counted before counting the substrings starting at SA[j] for i < j.

Suppose we are considering the substrings starting at index SA[i]. There are (N + 1 - SA[i]) such substrings, however, many of these substrings have already been counted. More specifically, if the length of the longest common prefix of suffix starting at index SA[i] and the suffix starting at SA[i - 1] is x, then x of these (N + 1 - SA[i]) substrings have already been counted. So the number of additional substrings will be (N + 1 - SA[i] - x) = (N + 1 - SA[i] - lcp[i]).

Adding these values together will give us the number of unique substrings of S. Now, let us create an array substr[1…N], such that substr[i] gives the number of unique substrings that have been counted after considering the suffixes starting at SA[1], SA[2], …, SA[i].

substr[1] = N + 1 - SA[1],
substr[i] = substr[i - 1] + N + 1 - SA[i] - lcp[i]

Using the array substr[], we can compute the k1-th lexicographic smallest substring easily. Basically, we perform a binary search for k1 in the array substr[]. Let us say substr[i - 1] < k1 <= substr[i], this means the k1-th smallest substring must start at SA[i]. More specifically, the substring will be S[l…r], where,
l = SA[i]
r = SA[i] + lcp[i] + k1 - substr[i - 1]

On the other hand, if we want to compute the rank of substring S[l…r], we need to find out when this substring was considered in the enumeration of unique substrings. Suppose this string was counted for the first time when we were considering the substrings starting at SA[x], that means x is the smallest index such that the suffix starting at SA[x] contains S[l…r] as a prefix, i.e., the the length of the longest common prefix of the suffix starting at l and SA[x] is >= (r - l + 1).

Since the suffix starting at index l contains the substring S[l…r] as a prefix, x must not exceed POS[l]. Moreover, for all x <= y <= POS[l], the length of the longest common prefix between the suffix starting at SA[y] and the suffix starting at l will be >= (r - l + 1). Hence, x can be found using a binary search.

Once we have found x, the rank k1 can be found easily. In fact the rank k1 will be
k1 = substr[x - 1] + (r - l + 1) - lcp[x]

This means using the suffix and lcp array, we can handle the simplified queries in O (lg N) time. Next, we discuss how to handle the queries which also consider the ranking of same substring based on their starting points.

Ranking of Substrings based on Starting Index:

Suppose that we are handling the first query, where k1 and k2 are given and we want to compute the substring S[l…r]. Using the algorithm described in previous section, we can easily find one occurrence of this substring, which is at
l = SA[i]
r = SA[i] + lcp[i] + k1 - substr[i - 1]

where i is the smallest index such that substr[i] >= k1.

Note that the substring S[l…r], will occur at position x, iff the length of the longest common prefix of the suffix starting at x and the suffix starting at l is >= (r - l + 1). We need to find all such x’s. It can be seen that the set of these x’s will be of the form
{SA[low], SA[low + 1], …, SA[i], …, SA[high - 1], SA[high]}

The value of low and high can be computed using binary search and the lcp computation. Once, we have the set of all possible x’s, we need to find the k2-th smallest value among these. We will discuss how to do this efficiently a bit later. First let us discuss how to handle the second type of query.

For the second query, we are given the substring S[l…r]. Using the method described in previous section, we can compute its rank k1 in O (lg N) time. This substring will occur many times in the string, and the starting indices of these occurrence will form the set
{SA[low], SA[low + 1], …, SA[POS[l]], …, SA[high - 1], SA[high]}.

Here also the value of low and high can be computed using a binary search, and we want to compute the rank of SA[POS[l]] in this set, which will be the value of k2.

Persistent Segment Tree:

In the above section we have seen that the problem of handling the two kind of queries finally reduces to design a data structure that supports rank and select queries on the subarray. More specifically,

Given an array SA[1…N], design a data structure structure that supports the following queries:

Given l, r, and k find the k-th smallest element in the subarray SA[l…r]
Given l, r, and x, find the rank of x in the subarray SA[l…r].

The data structure that handles these queries efficiently is known as Persistent Segment Tree, which can handle the two queries in O (lg N) time. The details of this data structure can be seen at here and here (in Russian, I guess).

Hence, we can handle the two kind of queries in O (lg N) time.

Time Complexity:

Preprocessing time: O (N lg N)
Query time: O (lg N)

Weak Test Cases:

Unfortunately, the test cases for this problem were weak (or the time limit was too relaxed), that allowed O (QN) solutions to run in time. Some contestants informed the admins about this during the contest, however, those mails slipped through the cracks, and the setter never became aware of this issue during the contest. The setter has now added the strong test cases, which are available in practice section. However, at this point it does not seem possible to rejudge all accepted solutions.

AUTHOR’S AND TESTER’S SOLUTIONS:

Author’s solution will be put up soon.
Tester’s solution will be put up soon.

chalubhalu · October 13, 2014, 3:06pm

Even far worse complexities work! Query at worst case of nlogn also works!

triveni · October 13, 2014, 3:24pm

I am surprised to see so many solutions having worst case complexity O(Q * N) are accepted !!

I am more surprised because, I had mailed codechef about weakness of the test data of this problem but they did not consider that, I don’t know why (may be because they were busy with ICPC online round). My this solution is O(N * Q) in worst case, but took less time than my this solution which is (N log N + Q log N) !!

xiaodao · October 13, 2014, 10:08pm

It seems that the time-limits are too large …
I haven’t received your E-mail … and nobody told me about this issue …
I will try to add more data and definitely put the problem to rejudge!.

Thank you.

xellos0 · October 14, 2014, 12:06am

After I spent a lot of time optimizing my O(N log N) suffix array (time-wise and timetocode-wise), and made the rest O(N log^2 N) (with a fairly good constant) worst-case…

logic_max · October 13, 2014, 4:10pm

Well, now I don’t feel bad about my O(N*log^3 N) solution.

coda · October 13, 2014, 4:13pm

what the hell… i didnt submit the solution thinking my solution would time out… whats wrong with codechef?

wittyceaser · October 13, 2014, 4:46pm

Will they be rejudging the solutions?

dcod · October 13, 2014, 10:52pm

Well they may but they should’nt as the contest is over.

rishul_nsit · October 13, 2014, 11:22pm

Rejudging after the contest is over? Strange!

triveni · October 14, 2014, 3:25am

@xiaodao : I had sent my mail to feedback[at]codechef.com (since I have no idea how to contact you).

fauzdar65 · October 15, 2014, 8:50am

If i used O(Nlog^2N) suffix array construction, was it supposed to give TLE?

ma5termind · October 15, 2014, 12:52pm

I solved this problem with the complexity of O(log(N))^2. For the whole two days i was thinking that my solution will surely run out of time. I dont dare to code. So finally on the last day i finally code it and got it accepted.

dull_jester · June 19, 2021, 11:54am

I believe there is an error in this problem’s judge I/O. Some reasons:

the Russian translation does not exactly correspond to the English one
over the 8 years, there is only one successful solution
Could anyone look into this?