TASTR - Editorial
#PROBLEM LINKS
[Practice][1]
[Contest][2]
#DIFFICULTY
MEDIUM
#PREREQUISITES
[Suffix Array][5], [Longest Common Prefix Array][6]
#PROBLEM
Given two strings <b>a</b> and <b>b</b>, let <b>A</b> be the set of all substrings of <b>a</b>, and <b>B</b> be the set of all substrings of <b>b</b>.
Find the number of <b>unique</b> strings in <b>A</b> plus the number of <b>unique</b> strings in <b>B</b>, that are not common to both <b>A</b> and <b>B</b>.
#QUICK EXPLANATION
Efficient solutions to a large set of problems on strings are approachable by use of <b>Suffix</b> <b>Arrays</b> (or [Trees][7]). If you have not yet added <b>Suffix</b> <b>Sorting</b> (for construction of Suffix Arrays) to your skill-set, then this is the perfect problem to do so.
Suffix Sorting can be done using
* <b>Manber Myers algorithm in O(L log L) time</b>
* <b>Karkkainan Sander's algorithm in O(L) time</b>
<b>Longest Common Prefixes</b> of adjacent items in the list of sorted suffixes is a classical problem in strings. Most texts discuss construction of Suffix Arrays immediately followed by calculation of <b>LCP arrays</b>.
The <b>LCP Array</b> can be found by <b>augmenting</b> the Suffix Sorting algorithms above. The time complexity for LCP Array calculation will be <b>exactly</b> equal to the time complexity of the Suffix Sorting. You can find it as an entirely different step as well, following the Suffix Sorting.
Both <b>O(L)</b> and <b>O(L log L)</b> approaches will fit within the time limit for this problem. Hence, choose as you please :)
Let us assume that we can find the <b>number of unique sub-strings of a string</b> by use of the <b>LCP array</b> for the suffixes of that string. (We will see the algorithm to do so in the EXPLANATION section)
<pre>U(A) = number of unique strings in A</pre>
We are given two strings <b>a</b> and <b>b</b> and their set of substrings <b>A</b> and <b>B</b> respectively. We wish to find
<pre>U(A) + U(B) - U(A <b>intersection</b> B)</pre>
Which is equal to
<pre>U(A) + U(B) - (U(A <b>union</b> B) - U(A) - U(B))</pre>
Or
<pre>2*U(A) + 2*U(B) - U(A <b>union</b> B)</pre>
#EXPLANATION
Let us see how to find the number of unique substrings of a string by using the LCP array.
* If we consider <b>all the prefixes of all the suffixes</b>, we would be considering the entire set of sub-strings.
* The LCA array helps us determine how many prefixes to <b>ignore</b> for each suffix
* We of course do not ignore any prefix of the first suffix. These are all valid and unique substrings.
<pre>
Let s be given string
indexes in s are 1-based
Let S be the suffix array
S stores the list of 1-based indexes
that represent the start position of the suffix of s
Let L be the LCP array
L(i) is the longest common prefix between
the suffixes starting from S(i) and S(i-1)
Thus, L is only defined from 2 to |s|
uniq_sub_strings = |s| - S[1] + 1
// thus we count all prefixes of the first suffix
for i = 2 to N
uniq_sub_strings += |s| - S[i] + 1 - L[i]
</pre>
Let us try this with an example to see why this is correct.
<pre>
s = abaabba
S = [
7, // a
3, // aabba
1, // abaabba
4, // abba
6, // ba
2, // baabba
5 // bba
]
L = [
0, // not defined for L[1]
1, // a is the common prefix
1, // a
2, // ab
0, // nothing
2, // ba
1 // b
]
uniq_sub_strings =
1 + 4 + 6 + 2 + 2 + 4 + 2 = 21
</pre>
Thus, there are <b>21</b> substrings (out of <b>28</b>) that are unique. You can work this out by deducing that the sub-strings that are not unique (and hence weren't counted are)
<pre>
{
a, // prefix of aabba
// since a was counted as prefix of "a"
a, // prefix of abaabba
a, // prefix of abba
ab, // prefix of abba
// since ab was counted as prefix of "abaabba"
b, // prefix of baabba
// since b was counted as prefix of "ba"
ba, // prefix of baabba
// since ba was counted as prefix of "ba"
b // prefix of bba
}
</pre>
Now, you can find <b>U(A)</b> and <b>U(B)</b> by using the above algorithm. The only part that remains is to calculate <b>U(A union B)</b>.
This can be done by considering a string
<pre>c = a + "$" + b</pre>
We can find all the unique substrings of <b>c</b> and reduce from the result, all the strings that contain the character <b>'$'</b>. We know that all the different strings that contain <b>'$'</b> are unique anyway, since <b>'$'</b> is not part of the alphabet.
There are <b>`(|a|+1) * (|b|+1)`</b> substrings of <b>c</b> that contain <b>'$'</b>.
Thus, we can find <b>U(A union B)</b>, the number of unique substrings that exists in either <b>A</b>, or <b>B</b>.
We can find the number of unique sub-strings of either <b>a</b>, or <b>b</b>, but not both, in time <b>O(F(|a|) + F(|b|) + F(|a+b|)</b>, where <b>F(n)</b> is the complexity of calculating the Suffix Array or the LCA Array (which ever is more). In the best implementation, <b>F(n) = O(n)</b>. But the problem may be solved even with <b>F(n) = O(n log n)</b>.
#SETTER'S SOLUTION
Can be found [here][3].
#TESTER'S SOLUTION
Can be found [here][4].
[1]: http://www.codechef.com/problems/TASTR
[2]: http://www.codechef.com/COOK32/problems/TASTR
[3]: http://www.codechef.com/download/Solutions/COOK32/Setter/TASTR.cpp
[4]: http://www.codechef.com/download/Solutions/COOK32/Tester/TASTR.cpp
[5]: http://en.wikipedia.org/wiki/Suffix_array
[6]: http://en.wikipedia.org/wiki/LCP_array
[7]: http://en.wikipedia.org/wiki/Suffix_tree