ANUSAR Editorial
#PROBLEM LINK:
[Practice][444]
[Contest][555]
**Author:** [Anudeep Nekkanti][111]
**Tester:** [Gerald Agapov][112]
**Editorialist:** [Praveen Dhinwa][113]
#DIFFICULTY:
MEDIUM HARD
#PREREQUISITES:
Suffix Array, Suffix Tree, dfs, Segment tree, Fenwick Tree or BIT.
#PROBLEM:
Given a string S and Q queries. In each query you are given an integer F. You need to find out number of substrings of S
which occur at least F times in S.
#QUICK EXPLANATION
**Suffix Array solution:**
Construct suffix array SA of the string S. Create an array LCP such that LCP[i] = lcp(SA[i], SA[i + 1]). Full form of lcp is longest common prefix.
Now the LCP array can be seen as a histogram with each bar having height LCP[i]. Then for some range [i,j] if minimum of [LCP[i], LCP[i+1].. LCP[j]] is m, then it means that there is a substring of length m, which is present at j-i+2 indices in string S.
Using this property we can find the solution in following way. Let D[i] represent the number of substrings which repeat exactly i times. If we can compute the array D, we can easily answer all the queries. For a given frequency F, required answer will be D[F] + D[F+1] + ... D[length of S].
For a bar of height H (LCP[i] = H), let us assume that we know the largest interval (L[i], R[i]) such that minimum of LCP array over the interval is equal to H, then it means that we have got a substring of length H, which repeats exactly R[i]-L[i]+2 times. (See examples given in the explanation).
We will update the D array accordingly.
**Suffix Tree Solution**
Construct suffix tree of the string S, then you just need to do a dfs over it and keep updating the count of appearance of substrings.
#EXPLANATION
First we will explain more about significance of array D and how it helps to solve the problem. Then the next section will elaborate on
different ways of computing L[i] and R[i] for a given histogram H.
Let us take an example string and create its suffix array. Consider the string S = "ABABBAABB".
<table>
<tr>
<td>
Suffix position in S
</td>
<td>
Suffixes of S
</td>
<td>
LCP array of S
</td>
</tr>
<tr>
<td>
6
</td>
<td>
AABB
</td>
<td>
1
</td>
</tr>
<tr>
<td>
1
</td>
<td>
ABABBAABB
</td>
<td>
2
</td>
</tr>
<tr>
<td>
7
</td>
<td>
ABB
</td>
<td>
3
</td>
</tr>
<tr>
<td>
3
</td>
<td>
ABBAABB
</td>
<td>
0
</td>
</tr>
<tr>
<td>
9
</td>
<td>
B
</td>
<td>
1
</td>
</tr>
<tr>
<td>
5
</td>
<td>
BAABB
</td>
<td>
2
</td>
</tr>
<tr>
<td>
2
</td>
<td>
BABBAABB
</td>
<td>
1
</td>
</tr>
<tr>
<td>
4
</td>
<td>
BBAABB
</td>
<td>
2
</td>
</tr>
<tr>
<td>
8
</td>
<td>
BB
</td>
<td>
Not Defined
</td>
</tr>
</table>
****
Now A appears 4 times in S. A is the prefix of first 4 suffixes in the suffix array. minimum of Minimum of LCP[0], LCP[1], LCP[2], LCP[3] should be LCP[2] (0 based indexing) is 1 representing
which represents that A has appeared 4 times.
As said pointed out in the quick explanation, explanation section, we can express LCP array as histogram with following values [1, 2, 3, 0, 1, 5, 1, 2]. You can also very easily
verify the fact that if for some range[i, j] if j], minimum of LCP array is m, then their exactly exists a substring of length m occurring at j - i + 2 positions.
positions in the string S.
L[i] for given LCP will be [0, 1, 2, 0, 4, 5, 6, 7].
R[i] for given LCP will be [2, 2, 2, 7, 7, 5, 7, 7].
Now let us find out how to update D using LCP.
We will iterate over each possible value in the LCP array (in whichever order you wish), Let the value on which we are currently iterating be V.
As we know that for the range [L[i], R[i]], minimum value of LCP array is equal to V.
As Note that there are at least exactly V substrings (consider any prefix of the suffix corresponding to LCP[i] (ie LCP[i], ie longest common prefix of i and i + 1 in the suffix array) (it array, it has size V and hence there will be V prefixes of it)). it). Hence there will V * (D[R[i] - L[i] + 2) substrings repeating (R[i] - L[i] + 2) times. So we will update D[R[i] - L[i] + 2] by adding val * D[R[i] - L[i] + 2] to it.
eg. Let us suppose we have following suffixes in the sorted order.
AA
AAB
AAC
LCP = {2, 2}.
L = {0, 0}.
R = {1, 1}.
If
When we are iterating over V = 2, Assume we are at position i = 0
As 0.
V = 2, 2 and R[i] - L[i] + 2 = 3.
We will add 2 * 3 = 6 into D[3]. In other words it refers to adding both substrings A and AA 3 times.
The above mentioned approach has some small *pitfalls* which need to be taken care of. Pay attention to the following points.
We are doing some over-counting in the approach. We need to fix that up. There are two kind of issues with the above approach which are explained in detail here.
*************
**Issue 1**
Let us consider that a string S = ABABAA has following suffixes
A
AA
ABAA
ABABAA
BAA
BABAA
LCP = [1, 1, 3, 0, 2]
L = [0, 0, 2, 0, 4]
R = [2, 2, 2, 4, 4]
When we are iterating over LCP array, Let us assume that current V = 1, If we are at i = 0 (ie at suffix A (V = 1), A) we will count that A has occurred exactly 4 times (R[0] (because R[0] - L[0] + 2 = 4, LCP[0] = 1).
But when are considering V = 2, If we are at position i = 2 (ie we are at suffix ABAA (V = 2), ABAA), we have R[2] = 2, L[2] = 2. We will increment the count of D[(R[2] - L[2] + 2)] ie D[2] by 2,
(ie according to (we are updating count of occurrences of substrings A and AB which are prefixes of suffix at position by 2 in suffix array). more).
But this is not correct, because in this way, we will count that
approach we have counted A has appeared 6 times times, which is certainly wrong (You can easily see that A appears exactly 4 times). times).
So There is over counting in our current method. We need to somehow take care get rid of this. Note that at suffix ABAA, the suffix just preceding it is AA
AA.
For V = 2 and note that suffix ABAA, when we count for A we have already counted the fact that substring A has appeared four times. times when the V = 1.
So we can see that
we can not say that all prefixes of length LCP[2] are counted exactly once, But one thing we can say for sure once. Notice that as LCP[L[i] - 1] is 1,
1 and hence we are going to recount just the first prefix of suffix ABAA (ie A). So instead of directly saying that there are exactly LCP[i] substrings which appear exactly R[i] - L[i] + 2 times, we will say that there are exactly min(LCP[i], LCP[i] - LCP[L[i] - 1], LCP[i] - LCP[R[i] + 1]) unique substrings appearing exactly R[i] - L[i] + 2 times. If you have not understood this fact, please **please** take more examples.
examples and convince yourself.
*****
**Issue 2**
Now let us consider that our suffix array have following suffixes.
A
AB
ACD
Our LCP array will be [1, 1].
As said earlier, we will iterate iterating over all possible values in the array LCP. For now, we have only one value in it ie 1. (So we have LCP and our current V = 1).
Assume 1.Assume that we are currently at the suffix A (ie i = 0 position in the array LCP), 0), we have R[i] = 1.
We will update D[3] by 1 * 3.
When we go to the i = 1, we will also do the same. This amounts to overcounting. over counting.
For a particular V in the LCP array, When at some point I have considered the range betwen between L[i] to and R[i] corresponding to some i (st LCP[i] = V), then I should not re-count the values corresponding to index j (LCP[j] = V) such that j lies in the range L[i] to R[i] because this range has already been considered and it will only amount to overcounting. over counting.
We can implement this by a two pointer method, We can maintain a pointer 'right' which denote the rightmost index which has been considered. Note that right is always non-decreasing, Hence it will guarantee that our algorithm takes O(N) time.
***********
**Pseudo Code:**
Let P be a list of lists, P[i] denotes the positions of the the occurrence of i in LCP array where i have occurred.
array.
eg. If LCP = [1, 2, 0, 1, 2, 1]
Then P will contain three list. P[0] = {2}, P[1] = {0, 3, 5}, P[2] = {1, 4}.
for V = 0 to N:
sz = P[val].size()
// right is maintained for using implementing two pointer method.
right = 0;
for j = 0 to sz:
index = P[val][j]
// to take care of the issue 2, dont overcount, always check whether your current element under
// consideration is not inside the range which has already been considered.
if (index >= right):
lo = L[index], hi = R[index]
// len denotes the current number of prefixes of the suffix corresponding to size V, we are going to add.
V.
len = V;
// to take care of the issue 1.
if (hi is defined, ie 0 <= hi < LCP.size()):
len = min(len, val - LCP[hi])
if (lo is defined, ie 0 <= lo < LCP.size()):
len = min(len, val - LCP[hi]);
// as add the current prefixes
D[hi-lo+2] += len * (hi - lo + 2);
// update the right pointer
right = index + 1;
Now only difficulty in the entire algorithm described is to how to compute L[i] and R[i] for a given array H.
LCP.
#Ways of Computing L[i], R[i] for all elements of Array A
So now Now you have to solve the following problem. You are given an array A. For each element A[i] you have to find L[i] and R[i] where L[i] is the largest j <= i such that L[k] >= L[i] for all k lying between i and j. Similarly define R[i].
(In other words, For each index i, [L[i], R[i]] denotes the largest range such that all elements have minimum value equal to A[i].)
Now if we can solve the problem of finding L[i], we can easily solve the problem of finding R[i] for each i (We can just reverse the array A and then finding R[i] is exactly the same as finding L[i]). Hence
from now on, we will only look forward to solve problem of finding L[i].
If you have not solved the problem [HISTOGRA][119] on spoj, then try solving that. There are a lot of ways of solving that this problem, you can read [University of Ulm Local Contest judges solutions for problem H][1110] and
[solution mentioned geeks of geeks sire][1111]. on geeksforgeeks site][1111]. I will very briefly go through all the methods. of the methods mentioned there. Note that all these methods are explained in the details in the
above given links. You are highly recommended to read those.
**Segment Tree Based Solution**
As we know the largest value in the array A can go up to 10^5, 10<sup> 5 </sup>, Hence our segment tree is made on new array B of size [10^5 + 1]
where B[j] shows that largest index i where j has occurred in the array A.
We scan the array A from left to right. For finding out L[i], we can query the segment tree built on array B to find out the
maximum value in the range [0, B[A[i]]].
For updating the array B, we can simply add B[A[i]] = i and update the segment tree accordingly. Essentially we have to maintain
the information in the segment tree about the indices of the elements and maximum of each node and
we will go from left to right and will do update and query operations the segment tree accordingly.
For reference solution, you can see [editorialists solution][99] to get an idea of its implementation. There are two segment trees, minSegmentTree
and maxSegmentTree representing finding L[i] and R[i] respectively. Rest details are just similar to things explained before.
**Stack Based Solution**
Assume that we make maintain a stack. Initially the stack is empty. We go from left to right in the array A, we will pop out the elements in stack which are >= A[i], then the element at the top is the element which
we were searching for, we can easily find its index (For that we can store pair of [value, index] in the stack). L[i] will be the value of the of index of the topmost element in the stack. Now we can We also have to add the current element in
into the stack.
Observe the following important properties:
1. Elements in the stack are always in the increasing order. This fact is very important for correctness of the algorithm.
2. Each element is inserted at most once. Only the inserted elements in the stack are popped out, Hence this ensures that complexity of
this step the entire algorithm is O(N).
For sample implementation of this idea, you can refer [setter solution][66].
**Disjoint set union Set Union Based solution**
This is also an interesting solution, You should look into this [link][1110] for getting more ideas. ideas about the solution. It is left as an a home work for the
reader to understand and implement this method. readers. You can look at [setter solution][77] to get an example of working solution.
##Suffix Tree Approach
You can solve the task easily by using suffix tree structure, You should use do dfs over the suffix tree nodes. To get exact implementation
details, you can view [tester's solution][88].
#AUTHOR'S, TESTER'S AND EDITORIALIST's SOLUTIONS:
[Author's solution based on stack + suffix array][66]
[Author's solution based on DSU + suffix array][77]
[Tester's solution based on suffix tree][88]
[Editorialists's solution based on segment tree + suffix array][99]
[66]:http://www.codechef.com/download/Solutions/COOK46/Setter/SAR-STACK.cpp
[77]: http://www.codechef.com/download/Solutions/COOK46/Setter/SAR-DSU.cpp
[88]: http://www.codechef.com/download/Solutions/COOK46/Tester/ANUSAR.cpp
[99]: http://ideone.com/cUy6i8
[444]: http://www.codechef.com/problems/ANUSAR
[555]: http://www.codechef.com/COOK46/problems/ANUSAR
[111]: http://www.codechef.com/users/anudeep2011
[112]: http://www.codechef.com/users/gerald
[113]: http://www.codechef.com/users/dpraveen
[119]: http://www.spoj.com/problems/HISTOGRA/
[1110]: http://www.informatik.uni-ulm.de/acm/Locals/2003/html/judge.html
[1111]: http://www.geeksforgeeks.org/largest-rectangle-under-histogram/