How to solve this problem based on strings and queries.

arpit728 · January 16, 2017, 6:11pm

Can anyone explain how to solve this problem.

meooow · January 18, 2017, 1:44am

This problem can be solved in two steps

STEP 1: Preprocessing
If you see this problem, you’ll realize that in order to answer the range queries from L to R, we should be able to efficiently find out how many occurrences of a particular character are there between L and R. If the reason for this is not yet clear, don’t worry it will be soon. So how do we do this? As long as the data is immutable, we can build a prefix sum array of the occurrence of a character. Suppose we are only concerned with some character c. Then our prefix sum array would be

Let S be the input string of length N
pre = array of length N+1
pre[0] = 0
for i in [0..N-1]
    pre[i+1] = pre[i]
    if S[i]==c
        pre[i+1] = pre[i+1] + 1

Now if we want to know how many times c appears between L and R (1-indexed), we just calculate (pre[R]-pre[L-1]). Since we have a 26 letter alphabet here and we are concerned with each of them, we need 26 such arrays, one for each character.

STEP 2: Answering queries
For each query, we are required to find out the minimum number of character we need to delete in the string S between L and R, so that each distinct character occurring in the range occurs equal number of times. First let’s find out how many times each character appears in the given range, and put them in an array f. This is easily done as said above.
Now we have an array f of 26 values. If each distinct character appear equal number of times, all values in this f array will be either a constant value k (the number of times each character appears) OR 0. Note that we are only allowed to delete characters. How do we achieve that? We can do that by choosing a particular character c that appears k times, and removing all characters that occur less than k times and making sure all characters that occur more than k times now occur only k times This amounts to choosing a value k in the array and setting all values less than k to 0, and bringing down all values greater than k to k.
But how do we know which is the optimum k? We don’t… so we need to try it for all 26 values of k and pick the one that results in least number of characters deleted.

Let L, R be the query 
f = array of size 26
f contains number of occurrence of each character in S[L..R]
m = INFINITY
for i in [0..25]
    d = 0
    k = f[i]
    for j in [0..25]
        if f[j] < k
            d = d + f[j]
        else
            d = d + (f[j] - k)
    m = min(m, d)
m is our required answer

If a is the alphabet size then preprocessing takes \mathcal{O}(aN) and answering each query takes \mathcal{O}(a^2).
It is possible to answer queries in \mathcal{O}(a\; log\; a) but since a is quite small, this is unnecessary.
My implementation is here.

I hope this was easy enough to understand. Feel free to ask if something is not clear

arpit728 · January 30, 2017, 8:39pm

@meoow

Very nice explanation, thank you.

What would be the approach if a is larger 1<a<100000

meooow · January 30, 2017, 10:13pm

@arpit728 that’s way too high considering both N and Q are ≤10⁵. However the efficient way to answer queries is by sorting the f array and iterating over it. For any index i we know that f[j]≤f[i] when j < i, and f[j]≥f[i] when j > i so we do not need the inner loop.

sort f in increasing order
s = sum of all elements in f
m = INFINITY
for i in [0..a-1]
    keep = f[i] * (a-i)
    remove = s - keep;
    m = min(m, remove)
m is our required answer

The sorting is most costly, so the complexity per query is \mathcal{O}(a\;log\;a).