CHSTR - Editorial

balajiganapath · June 14, 2015, 2:53pm

PROBLEM LINK:

Author: Vasia Antoniuk
Tester: Mahbubul Hasan and Sunny Aggarwal
Editorialist: Balajiganapathi Senthilnathan
Russian Translator: Sergey Kulik
Mandarian Translator: Minako Kojima

DIFFICULTY:

Medium

PREREQUISITES:

Z function, combinatorics

PROBLEM:

Given a string and a query k, output the number of ways we can choose k equal substrings from the string.

SHORT EXPLANATION

Let cnt[i] be the number of different substrings that occur in the string exactly i times. Then the answer for a particular k ( \le n) will be :
\sum_{i = k}^{n} cnt[i] * \binom{i}{k} we can calculate cnt[i] using suffix array/Z function.

EXPLANATION:

Let cnt[i] be the number of different substrings that occur in the string exactly i times.

For example, for the string “ababa”, the array will be:
cnt[1] = 4 → {“ababa”, “abab”, “baba”, “bab”} occur exactly once
cnt[2] = 4 → {“aba”, “ab”, “ba”, “b”} occur exactly twice
cnt[3] = 1 → {“a”} occur exactly thrice
cnt[4] = 0
cnt[5] = 0

Now suppose we want the answer for k = 2, i.e. number of ways to choose 2 equal strings. How do we calculate this using the cnt array?

cnt[1] is of no use as we need 2 equal string.
cnt[2] → There are 4 different substrings that occur 2 times. So, we can add 4 to the answer.
cnt[3] → There is only 1 substring that occurs 3 times. But we need only 2 times, so we can choose any 2 of the 3. So, \binom{3}{2} = 3

Summing them, for k = 2 we get 7.

Now let us focus on calculating the cnt array.
To calculate this let us loop through all suffixes of S.

For the suffix P = S[i..N], let us calculate the array Z[i..N] where Z[i] is the maximal equal substring starting from i that matches a prefix of S[i..N].

For example, for the above string consider i = 1, the whole string is the suffix
“ababa”.
Here the Z array would be
Z[1] = 5 → maximum prefix matching of “ababa” and “ababa”
Z[2] = 0 → maximum prefix matching of “ababa” and “baba”
Z[3] = 3 → maximum prefix matching of “ababa” and “aba”
Z[4] = 0 → maximum prefix matching of “ababa” and “ba”
Z[5] = 1 → maximum prefix matching of “ababa” and “a”

How is this array useful for calculating cnt?

From the above array we can conclude that we have 1 substring which can be choosen atleast 3 times. How? Because there are 3 entries in the above array that are greater than or equal to 1.
Similarly we can deduce for 2 (2 times), 3 (2 times), 4 (1 time ) and 5(1 time).
So we increment cnt[3] once, cnt[2] twice and cnt[1] twice.

We do this for all suffix of S. The only thing left is to observe that cnt[i] now actually has how many different strings can be choosen atleast i times. We want it to be exactly i times. This is simple: we just subtract from cnt[i] (cnt[i + 1] + cnt[i + 2] … + cnt[N]).

We can compute the Z function in O(n) time. See here for how to do that.

Time Complexity:

There are N suffixes and we take O(N) time for each suffix so total time is:
$O(N^2).

AUTHOR’S AND TESTER’S SOLUTIONS:

Author’s solution
Tester’s solution

rajatgoyal · June 15, 2015, 3:08pm

Can we solve it without z algorithm?

geek_geek · June 15, 2015, 3:15pm

I tried using a Trie data structure , and finding out cnt array in O(N^2). and the use DP to store the answers for 1 to n and the print appropriate answer or 0 if query is greater than N.
But i am getting TLE for three cases in Subtask 3?
is anymore optimizations required ???

geek_geek · June 15, 2015, 3:16pm

I am using a Trie data structure and using that finding out how the cnt array in O(N^2) .and using DP to answer the queries in O(1) time .I am still getting TLE for SUbtask 3 except two cases … i am using FAST IO is any further optimizations required??

ohil17yo36 · June 15, 2015, 3:35pm

Even I had initially used trie data structure to solve the problem and I am pretty sure my complexity was O(3*(n^2)). I got only one of the last subtask set accepted and rest all tle…The link to my solution is CodeChef: Practical coding for everyone . Is it possible that the constant(3) multiplied to n^2 in the big Omega notation lead to TLE?? Does that happen often??
Plz someone answer.

dragonslayerx · June 15, 2015, 3:37pm

Suffix Array + RMQ can also be used for solving the problem.

tapasjain01 · June 15, 2015, 3:38pm

This question was very similar to ANUSAR - Editorial - editorial - CodeChef Discuss.

evil999man · June 15, 2015, 3:53pm

I solved it using suffix trie and got AC 100 pts . Link : RKRiZX - Online C++0x Compiler & Debugging Tool - Ideone.com

rachitnaruzu · June 15, 2015, 3:53pm

Even I was using Trie at first O(n^2), then realized that i’m using pointers to create trie thats why i was getting tle, then i went for suffix array (O(nlognlogn) creation time) pretty fast as compared to trie, O(1) time query, got accepted. Array for solving queries took O(n^2) creation time (same as in case of trie), and also removed unnecessary mod (caused me 2 TLEs), overall O(n^2) on trie was pretty much slower than the the O(n^2) on suffix array. code with Trie took 5 - 6s to solve 5000 length string with 10^5 queries (which i had randomly generated) on my pc, and same input took approx 1 second in case of suffix array.

sourabh123 · June 15, 2015, 4:09pm

My solution passed subtask 1, and task no 4, 5 from subtask 2. For others it gave WA.
Can anyone suggest what can be problem with the solution. link

I’ve used lcp array to calculate number of occurance, of each distinct substring, in the string.
And used this for calculating answer.

jackal02 · June 15, 2015, 4:20pm

The problem can be solved with a Trie in O(n^2) but you need to add at one step all the substrings that start at position i by changing all the nodes on the path in the trie and compute cnt[i] inside the insert function. I used the bits in a int variable to know which sons I already created for the current node because if you initialize all the nodes with NULL you’ll get TLE.

dejko95 · June 15, 2015, 4:37pm

I tried so solve using hash, hashing all substrings, firstly with length 1, then 2, and so on. The rest of algorithm was the same. But somehow, it was WA on the last test case only. I tried with different hash functions, but didn’t help. Anyone has idea why? Can I see test cases now after the competition ended?

anh1l1ator · June 15, 2015, 5:02pm

Instead of using Z function ,I used a map instead and it caused me TLE in the third subtask well the complexity being N^2*logN + i think a big constant factor is hidden in implementation of maps!!

geek_geek · June 15, 2015, 5:37pm

Can this question be solved using Polynomial Hashing and using maps?

bholagabbar · June 16, 2015, 2:35am

Could anyone explain, in a little detail, how this can be solved using Suffix Trees? I read the StackOverflow post on how to construct a suffix tree. What next?

alankar63 · June 16, 2015, 3:38am

It was possible to get it ACed with trie too…,
i too had this problem but then removed this line

for(int x=0;x< ALPHABET_SIZE;x++)

q->link[x] = NULL;

from the below code as there is no need to waste time assigning NULL to those pointers as it is done automatically.

struct node* create_node() {

struct node *q = (struct node*)malloc(sizeof(struct node));
for(int x=0;x<ALPHABET_SIZE;x++)
    q->link[x] = NULL;
q->data = -1;
return q;

}

i think those who were getting tle will get AC with this optimisation.

nishaant · June 16, 2015, 8:23pm

Please can anyone help me understanding suffix array method…i couldn’t get it how to use suffix array…

thunderclash · June 16, 2015, 10:05pm

I haven’t heard about the trie DS till now, but I think I used a similar approach, and after a lot of submissions, I finally got AC 100pts.

I don’t know anything about suffix arrays either, I guess I’ll read about them now.

Basically I went through the entire string and saved the indices of all a’s,b’s,c’s…z’s separately in vectors.
Then, I would pick any one vector, and for each index look at index+1, and again store all indexes of a’s,b’s,c’s…z’s. If the number of alphabets is more than 1, only then will it be used, so whenever any such vector had more than 1 alphabet, then I would make an entry in a multiset of the number of alphabets.

Example: You start with the vector of all 1st a’s. Now you look at index+1 for each a, and find say 3 a’s 2 b’s 1 c
this means you have 3 strings of the form ‘aa’ , two of the form ‘ab’ and 1 of the form ‘ac’
The contribution we care about is only from where we have 3 equal strings ‘aa’ and 2 equal ‘ab’, we don’t care about ac,
further, we don’t need to recursively use the vector containing only this one c, because it will only lead to a string which will never match any other, as we only have 1 ‘ac’, we will always have at most 1 ‘aca’,‘acb’,‘acc’ etc.

Using this approach, I was able to calculate the required multiset, in which there were entries such as 2 2 2 3 3 4 4 etc, denoting the number of equal strings of various types.

I further shortened this to a simple vector which had pair of values, the actual number and number of occurrences, eg: in 2 2 2 3 4 4
we have entries for vector as pairs
(2 3)
(3 1)
(4 2)

As I had used a multiset, these values were already sorted, even in the created vector.

After this I simply used the logic of inverse modulo and saving inverse factorials to compute nCr. I later precomputed all values till 5000! % MOD and stored them in an array, and then when required computed r, n-r inverse factorials if they weren’t already computed and stored.

I was also sorting the given queries, so as to answer the same query value without having to recompute, like if k=3 is given 5 times in all, then it will be computed once, and then the result will be reused.

I was getting AC in all tasks except 1 of the last subtask.

To my surprise, in that task, if I would just terminate the program before the actual nCr calculating part, and only let the entire multiset and vector creation take place, it was taking a running time of 0.08 s, but after the nCr calc, it was giving TLE.

I tried using putchar_unlocked and using getchar for string input instead of scanf/printf, but the running times were almost exactly the same.

For a little more shocking experience, in one of the statements of the nCr calculating function, there were 3 terms being multiplied, and if I would just comment out 1 of them, it was giving WA with a running time of about 0.8s but no TLE.

Anyway, after a lot of tweaking around, I finally managed to get AC by removing 1 % operation in a certain loop. I was stunned that such a small change got my submission accepted! but well, at least it did get submitted finally.

Here’s my final solution CodeChef: Practical coding for everyone

njr07121977 · June 16, 2015, 10:15pm

Editorials are supposed to provide an opportunity for us to learn. Not point to a third party link! Wake up guys!

nirmit_goyal · June 17, 2015, 12:43am

@antoniuk1, Sir, your Z transform function should fill the vector z till ‘n’, as given in the example of editorial, also the program is giving compilation error