AMR14F - Editorial

PROBLEM LINK:

Editorialist: Adury Surya Kiran

DIFFICULTY:

HARD.

PREREQUISITES:

Bit-Masking

PROBLEM:

Given a String with hexadecimal characters, find the number of pairs of substrings which have atleast one non empty common substring.
Let us assume the length of the string is N and number of distinct characters possible be K through out the editorial. In this question N is given to be atmost 10^5 and K to be 16. Let the string be named S.

QUICK EXPLANATION:

We can see that for each substring, if we take set of distinct characters in it, it will be one of 2^K combinations of subsets, of the K possible characters. Now for each particular subset we can find the number of substrings which have this subset of distinct characters in O(N * K) using first occurance table. After that we can find number of pairs of substrings which don’t share atleast one distinct character in their subsets and subtract it from the all possible pairs to get the final answer using a technique similar to FFT in O(K * 2^K) steps.

EXPLANATION:

It is simple to notice that for a pair of substrings to have non empty common substring, it is sufficient to have atleast one character in common, because each character is also a substring of length one. So we just need to look at the subsets of distinct characters while checking a pair of substrings. For simplicity we can represent a subset of distinct characters as an integer bitmask, where i’th bit is set to 1 if and only if i’th character is present in the subset.

Naive Solution:
For each pair of substrings, check if they have a common character or not. This will take O(N^4) steps because there are (N*(N+1))/2 substrings in the string and for comparing each pair we can just keep a boolean array of size K and check if any character is present in both substrings in O(K). We can speed up the O(K) part to O(1) by using integer bitmask and using ‘&’ (Bitwise AND) operator in C++ or JAVA to check if two bitmasks have common bit. This will take roughly around 10^20 operations for each test case and certainly has no use because roughly we can do only 10^8 operations per sec.

O(2^(2*k) + N^2) solution :
As there are only 2^K distinct subsets, if we can count the number of substrings present in S for each integer bitmask of subsets in a count[] array, our final answer will be sum of count[x]*count[y] for all x and y such that Bitwise AND of x and y is non zero. Simple way of finding count[] array is iterate through all substrings in O(N^2) time and increase the count[] of respective bitmask by one. This take roughly around 10^10 operations per sec per each test case and is also very slow for this question.

O(2^k + N*K) solution:
There are two parts in this solution also, similar to the above solution.
First Let us try to improvise the part where we were calculating the count[] array of the bitmasks. Let us say, we are calculating the count[] of bitmasks of all the substrings which start at some particular index. Now as we traverse towards right increasing the ending index of the substring, the bitmask of the subset of distinct characters changes only when a new character which is not present in the current subset of characters is encountered. So the bitmask variable changes at K indices only, let us call them super indices. We can see that all those substring which end between any two consecutive super indices have same bitmask.

So we need to find FOC[i][j] array which represents First Occurence of Character j in the suffix starting at position i (the super indices), which can be calcualted in O(N*K) using simple recurrence relation explained below. It is easy to see that FOC[i][j] will be equal to FOC[i + 1][j] for all j except the character which is present at position i.

The pseudocode for this part is :

for j = 0 to K - 1 {
    foc[n][j] = N
}
for i = N - 1 to 0 {
    for j = 0 to K - 1 {
        foc[i][j] = foc[i + 1][j]
    }
    foc[i][id(S[i])] = i
}
Here in the above psuedocode id() is a function which returns integer index of any character.
Now to count the bitmasks of all the substrings starting at index i, we need to sort all the foc[i][j]s and iterate through them and increase the count of the distinct bitmaks which can be atmost K in number.

The psuedocode for this will be :

// count[b] stores count of number of occurences of bitmask b
// arr[] is a array of pair of integers which stores the foc[i][j] and j

for i = 0 to N - 1 {
    for j = 0 to K - 1 {
        arr[j].first = foc[i][j]
        arr[j].second = j
    }
    arr[K].first = N // Dummy Character
    arr[K].second = K
    sort(arr, arr + K + 1)
    b = 0 // integer variable which stores bitmask
    for j = 0 to K - 1 {
        b = b + (1 << arr[j].second) // setting the respective bit to 1 in the bitmask
        count[b] += arr[j + 1].first - arr[j].first
    }
}
Now that we got the count array very fastly, we only need to know how to compute the sum of products of kind count[x] * count[y] such that Bitwise AND of x and y is non zero. Rather than finding the number of pairs of substrings which have atleast one character in common, we can easily find the number of pairs of substrings which do not have any character in common and subtract this number from total number of pairs to get the final answer.

So to find the number of pairs of substrings which do not have any character in common, we have to find the sum of products of the kind count[x] * count[y] such that bitwise AND of x and y is zero. If we see carefully the values of count[y] which get multiplied to count[x] are all the subsets of (2^K - 1 - x). For this for every z we need to compute sum of values of count[w] such that the subset of characters represented by w is itself a subset of the subset of characters represent by z, to state it simply (bitwise AND of z and w) = w. There is a standard way of doing this part in O(K*2^K).

Let us take 2^K values and apply the following steps:

  1. Divide the values intor two parts of size 2^(K-1). Half of the values have MSB (Most Significant Bit) as 1 and the other half of them have 0.
  2. Treat the two halves as two separate arrays of size 2^(K-1) and for each index calculate the sum of values of all the subset indices.
  3. Now for each index in the half which has MSB as 1, add the value which is present at the respective index in the half which has MSB 0.

So how does the above algo work?
Let us say K = 3 here and let’s look how the sum of values at subsets of 5 is being calculated.
5 = 101(in binary).
We have 8 indices (0 to 7) to be operated on, we split them into two halves (0 to 3) and (4 to 7). Let us call them arrays A[0…3] and B[0…3], here A[0…3] has values from (0 to 3) and B[0…3] has values from (4 to 7) of count[] array. Now repeat the process for both the arrays A and B recursively.
Now sum_count[5] = sum_A1 + sum_B1, or to generalize sum_count[x]=sum_A[x] for x < 4 and sum_count[x] = sum_A[x-4] + sum_B[x-4] for x >= 4.
What we are doing here is, we are removing the MSB and calculating the sum of values at all the subset positions of rest of K-1 bits and adding the values at respective positions of both arrays for MSB = 1, because both of them are subsets.

Using the following psuedocode we can do this iteratively:
for i = 0 to (2^K - 1) {
sum_count[i] = count[i]
}
for j = 0 to K - 1 {
for i = 0 to (2^K) - 1 {
if ( (i & (1<<j)) != 0){
sum_count[i] += sum_count[i - (1<<j)]
}
}
}
I want the readers to analyse this psuedocode to find how it is computing the values as explained in the above three steps iteratively. Analysing from the code is also a good exercise.

So finally our answer will be equal to (N*(N+1)/2)^2 - sum of (count[x]sum_count[2^K-1-x]) for all x in range (0 to 2^K-1) which can be calculated by using a loop.
Our complexity boils down to O(N
K) for computing the count[] array and O(K2^K) for calculating the sum_count[] array.
Substituting the values given in this question we get O(N
K + K2^K) to take around 210^6
which could run in under time for 200 test cases.
Please note that all the above operations are needed to be calculated modulo 10^9 + 7.

3 Likes