PROBLEM LINK:
Author and Editorialist: Hussain Kara Fallah
Tester: Michael Nematollahi
PREREQUISITES:
sorting
PROBLEM:
A string is called special if the set of all of its non-empty different subsequences IS equal to the set of all of its non-empty substrings. Refer to the problem statement for more detailed explanation.
Now given a string S with length up to 10^6. Count the number of its distinct special non-empty substrings.
EXPLANATION:
We should figure first what are the possible forms of a special string?
Clearly, a string consisting of 1 letter only is special (like “aaaa…”). What about more than 1 letter?
If the string consists of more than 2 letters it’s not special. Consider our string’s letters from left to right. Take the first appearing letter and the last appearing letter (if we consider the first appearance of each letter, the last letter that appears).
In case (“aabbbccaa”) the first is ‘a’ and the last is ‘c’
In case (“abcccddbb”) the first is ‘a’ and the last is ‘d’
Now take the subsequence consisting of all occurrences of both letters. In the first case it’s “aacc”, in the second it’s “add”. This subsequence doesn’t appear as a substring, because there’s at least one extra different letter that appeared before our “last appearing” letter. So as a result, the substring containing all occurrences of both letters will contain the extra letter.
(Note these are only detailed examples, but the observation can be applied to any string).
Now we are left with strings consisting of 2 letters. A string of 2 letters is special if it is only like (“aaaa…bbbb…”) (the first letter repeated X times followed by the second letter repeated Y times.
Let’s assume that it doesn’t follow the mentioned form. (Let’s suppose that our string consists of ‘a’ and ‘b’ only). According to our assumption, then there must be at least one letter ‘b’ with at least one ‘a’ to the left and at least one ‘a’ to the right. So if we take the subsequence of all occurrences of letter ‘a’ it’s not present as a substring.
Now how to answer our question?
First let’s break our string into the minimum number of disjoint parts, and each part consisting of consecutive identical letters.
For example “aaabbccddeff” will be broken into {“aaa”,“bb”,“cc”,“dd”,“e”,“ff”}.
The single letter case is easy, we maintain the largest group occurred for each letter and add up their sizes.
Now, the 2 letters case can be achieved from taking 2 adjacent groups (let’s say the first one size is x and the second is y so we can have x*y special substrings.
But we need to count only the distinct ones. Let’s maintain for each pair of letters (let_1,let_2) , a vector containing pairs (x,y) such that some point in our string there was a sequence of x letters of let1 followed by y letters of let2 (consecutively). For each pair, maintain all occurrences of matching consecutive groups. In the previous example, some possible groups are (“aaa”,bb") , (“dd”,“e”) , and so.
For each pair of letters (let1,let2) let’s count the number of distinct special substrings formed only by these letters (and let1 is the first letter). Let’s consider our letters’ vector (the letters pair vector).
Let’s suppose that we have 2 pairs (x1,y1) , (x2,y2) such that (x1 >= x2 and y1 >= y2), then clearly the second pair is redundant, because all substrings formed from the groups of our second pair can be formed from the groups of the first. Let’s remove all redundant pairs (Hint: sort them according to x values and maintain decreasing y values).
Now for every 2 pairs in our vector (x1,y1) , (x2,y2) if (x1 > x2) then it implies that (y1 < y2). Now, all pairs are sorted according to the value of x. Let’s take two consecutive pairs as mentioned, what substrings should we add from the first pair that won’t occur from any other pair in the vector? It’s clearly *(x1-x2)y1. (Assuming the first pair is (x1,y1) and the second is (x2,y2)).
We do the mentioed process for all pairs of letters.
Complexity O(|S| \, log \, |S|)