ISOARRAY - Editorial

Problem Link

Practice

Contest

Setter: Trung Nguyen

Tester: Hasan Jaddouh

Editorialist: Bhuvnesh Jain

Difficulty

MEDIUM - HARD

Prerequisites

Sqrt Decomposition, Hashing, Online Queries

Problem

You are given an array of size N and you need to process Q online queries on it. For each query, you must determine whether the multisets formed from the two given subarrays are isomorphic or not.

Explanation

For the editorial, assume that f(Z) denotes the value in the second subarray that Z from the first subarray is mapped to, as defined in the problem statement.

Let us first understand when the multisets formed from the two subarrays are isomorphic. Clearly, the number of occurrences of Z in the first subarray and the number of occurrences of f(Z) in the second subarray must be the same. For example, if the two multisets are {1, 2, 3, 3} and {4, 2, 5, 5}, then the mapping f(1) = 4, f(2) = 2, f(3) = 5 works, i.e. the multisets are isomorphic. But neither of these multisets is isomorphic to {1, 3, 5, 6}, which has four distinct elements instead of three. This condition is both necessary and sufficient for isomorphism. Thus, if we can build the frequency list of the numbers present in each subarray and compare the two frequency lists efficiently, we can solve the problem.
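To make the condition concrete, here is a minimal brute-force check (my own illustrative code, not from the linked solutions): build the frequency map of each multiset and compare the sorted lists of counts.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Two multisets are isomorphic exactly when their sorted lists of
// frequency counts coincide.
bool isIsomorphic(const vector<int>& a, const vector<int>& b) {
    if (a.size() != b.size()) return false;
    map<int, int> fa, fb;
    for (int x : a) fa[x]++;
    for (int x : b) fb[x]++;
    vector<int> ca, cb;
    for (auto& p : fa) ca.push_back(p.second);
    for (auto& p : fb) cb.push_back(p.second);
    sort(ca.begin(), ca.end());
    sort(cb.begin(), cb.end());
    return ca == cb;
}

int main() {
    cout << isIsomorphic({1, 2, 3, 3}, {4, 2, 5, 5}) << "\n"; // 1
    cout << isIsomorphic({1, 2, 3, 3}, {1, 3, 5, 6}) << "\n"; // 0
}
```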

The first thing to notice is that the size of the frequency list of a subarray can be of the order of the length of the subarray (as all numbers in the subarray can be distinct). Thus, even building the frequency list in time linear in the subarray length would be too costly per query and would time out. This is where hashing comes into play. Before proceeding, I recommend you read this blog by “rng_58” and understand it thoroughly.

The blog clearly describes a way to check, with high probability, whether 2 multisets (here we are considering the multisets of the frequencies) are equal or not. So, assume the frequency multiset of a subarray looks like {f_1, f_2, \ldots, f_x}. We need to efficiently calculate the function \prod_{i=1}^{x}(f_i + R), where R is a number chosen uniformly at random in the range [0, MOD). We also need to calculate this function online, and then simply check whether the hashed values of the 2 subarrays are the same or not.
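As a toy illustration (with an assumed helper multisetHash, not taken from the linked solutions), the hash depends only on the multiset of frequencies, not on their order:

```cpp
#include <bits/stdc++.h>
using namespace std;

const long long MOD = 1'000'000'007LL;

// Hash of a frequency multiset {f_1, ..., f_x} as prod (f_i + R) mod MOD.
long long multisetHash(const vector<long long>& freqs, long long R) {
    long long h = 1;
    for (long long f : freqs) h = h * ((f + R) % MOD) % MOD;
    return h;
}

int main() {
    mt19937_64 rng(chrono::steady_clock::now().time_since_epoch().count());
    long long R = rng() % MOD;
    // {1, 2, 3, 3} and {4, 2, 5, 5} both have frequency multiset {1, 1, 2}.
    cout << (multisetHash({1, 1, 2}, R) == multisetHash({2, 1, 1}, R)) << "\n"; // 1
    cout << (multisetHash({1, 1, 2}, R) == multisetHash({1, 1, 3}, R)) << "\n"; // 0 w.h.p.
}
```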

To efficiently calculate the above function, we rely on the technique of sqrt decomposition. The basic idea of sqrt decomposition is to split the array into blocks, precompute the function for the blocks, and handle the boundary elements (if any) separately in each query. But here the function we want to evaluate depends on the frequencies of numbers, which are difficult to update block by block as in ordinary sqrt decomposition.

The algorithm for the sqrt decomposition part is as follows (a code sketch follows the list):

  • Split the array into blocks of size K.

  • Calculate the frequency of each number inside each block.

  • We now have \frac{N}{K} blocks, each of size K, in the array. We also compute the frequency counts and the function value for every contiguous range of blocks. This part leads to a memory usage of O(N * \frac{N}{K}).

  • To answer a query on a range [l, r], let blk[i] denote the block to which index i belongs. If l and r belong to the same block, we can simply brute force over the range. Otherwise, we take the precomputed function value for blocks (blk[l]+1) to (blk[r]-1). Then we need to update the answer for the elements whose frequency might be affected, i.e. those lying in blocks blk[l] and blk[r] and inside the given range. For this, we traverse these boundary elements, calculate their frequencies, and for each such value remove the contribution of its old frequency in blocks (blk[l]+1) to (blk[r]-1) (this frequency was also pre-computed in step 2, so this is an O(1) lookup) before multiplying in the contribution of its combined frequency. In all, each query can be performed in O(K).
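Below is a minimal sketch of the pre-computation in C++. All names (pref, H, inv, extra, build) are mine, not from the linked solutions, and the array values are assumed to be compressed into the range [1, N]:

```cpp
#include <bits/stdc++.h>
using namespace std;

const long long MOD = 1'000'000'007LL;

long long power(long long b, long long e) {          // fast modular power
    long long r = 1; b %= MOD;
    for (; e; e >>= 1, b = b * b % MOD)
        if (e & 1) r = r * b % MOD;
    return r;
}

int N, K, nb;                 // array size, block size, number of blocks
long long R;                  // random value in [0, MOD)
vector<int> a;                // the array (values compressed into [1, N])
vector<vector<int>> pref;     // pref[i][v] = occurrences of v in blocks 0..i-1
vector<vector<long long>> H;  // H[i][j]    = hash of blocks i..j
vector<long long> inv;        // inv[f]     = modular inverse of (f + R)
vector<int> extra;            // scratch frequency table, kept all-zero globally

void build() {
    nb = (N + K - 1) / K;

    // Step 2: prefix frequency tables, O(N * N / K) memory.
    pref.assign(nb + 1, vector<int>(N + 1, 0));
    for (int i = 0; i < nb; i++) {
        pref[i + 1] = pref[i];
        for (int p = i * K; p < min(N, (i + 1) * K); p++)
            pref[i + 1][a[p]]++;
    }

    // Pre-computed inverses of (f + R) for every possible frequency f.
    inv.resize(N + 1);
    for (int f = 0; f <= N; f++) inv[f] = power((f + R) % MOD, MOD - 2);

    // Step 3: hash of every contiguous range of blocks.
    H.assign(nb, vector<long long>(nb, 1));
    vector<int> cnt(N + 1, 0);
    for (int i = 0; i < nb; i++) {
        fill(cnt.begin(), cnt.end(), 0);
        long long h = 1;
        for (int j = i; j < nb; j++) {
            for (int p = j * K; p < min(N, (j + 1) * K); p++) {
                int v = a[p];
                if (cnt[v]) h = h * inv[cnt[v]] % MOD; // undo factor (f + R)
                cnt[v]++;
                h = h * ((cnt[v] + R) % MOD) % MOD;    // apply (f + 1 + R)
            }
            H[i][j] = h;
        }
    }
    extra.assign(N + 1, 0);
}
```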

Some caveats to the above solution (both are reflected in the query sketch below):

  • To remove the contribution of a previous frequency value, we need its modular inverse, which adds an O(\log(MOD)) factor to each query and can lead to “TLE”. To avoid this, we can pre-compute all the needed modular inverses, i.e. the inverses of (f + R) for every possible frequency f, beforehand.

  • To find the frequencies of elements in blocks blk[l] and blk[r], we would either need a map or set per query, or we would have to reset the frequencies of the entire range of numbers to 0, which is a costly O(N) operation. To avoid this, observe that after every query, only the elements whose frequency was changed from 0 need to be set back to 0. Since there are O(K) of them, this can be handled efficiently using a vector of touched values and a global frequency table initialized to 0.
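Continuing the sketch above, a query merges the boundary elements into the precomputed hash of the middle blocks. It uses the precomputed inverses (first caveat) and afterwards resets only the touched entries of the global scratch table (second caveat):

```cpp
// Hash of the frequency multiset of a[l..r], 0-indexed and inclusive.
long long queryHash(int l, int r) {
    int bl = l / K, br = r / K;
    long long h = 1;
    vector<int> touched;          // values whose scratch entry became non-zero

    if (br - bl <= 1) {
        // Same or adjacent blocks: brute force over at most 2K elements.
        for (int p = l; p <= r; p++) {
            int v = a[p];
            if (extra[v]) h = h * inv[extra[v]] % MOD;   // undo old factor
            else touched.push_back(v);
            extra[v]++;
            h = h * ((extra[v] + R) % MOD) % MOD;        // apply new factor
        }
    } else {
        h = H[bl + 1][br - 1];    // precomputed hash of the middle blocks
        for (int p = l; p < (bl + 1) * K; p++) {         // left boundary
            if (!extra[a[p]]) touched.push_back(a[p]);
            extra[a[p]]++;
        }
        for (int p = br * K; p <= r; p++) {              // right boundary
            if (!extra[a[p]]) touched.push_back(a[p]);
            extra[a[p]]++;
        }
        for (int v : touched) {
            // Frequency of v inside the middle blocks, O(1) via prefixes.
            int fmid = pref[br][v] - pref[bl + 1][v];
            if (fmid) h = h * inv[fmid] % MOD;           // remove old factor
            h = h * ((fmid + extra[v] + R) % MOD) % MOD; // add merged factor
        }
    }
    for (int v : touched) extra[v] = 0;  // reset only what changed, O(K)
    return h;
}

int main() {
    N = 8; K = 2; a = {1, 2, 3, 3, 4, 2, 5, 5};
    mt19937_64 rng(712367); R = rng() % MOD;
    build();
    // a[0..3] = {1,2,3,3} and a[4..7] = {4,2,5,5} are isomorphic:
    cout << (queryHash(0, 3) == queryHash(4, 7)) << "\n"; // 1
}
```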

The last and most important part of this problem is choosing the value of K. In typical problems, generally N \approx Q, but here the constraints are a bit different. Moreover, there is the extra step of computing the function for all pairs of contiguous blocks. So, the value of K must be chosen carefully. The complexities are as follows:

Pre-Computation = 2 * N * \frac{N}{K} + {(\frac{N}{K})}^{2} \approx 2 * N * \frac{N}{K}

Query part = 2 * Q * K

Thus, we should choose K such that both the above parts are equal, i.e. K = \frac{N}{\sqrt{Q}}. Not choosing K carefully can lead many solutions to TLE.
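For instance, balancing the two parts explicitly (the concrete numbers below are hypothetical, just to show the orders of magnitude):

2 * N * \frac{N}{K} = 2 * Q * K \implies K^2 = \frac{N^2}{Q} \implies K = \frac{N}{\sqrt{Q}}

With N = 10^5 and Q = 10^4, this gives K = 1000, and both parts then cost about 2 * 10^7 operations.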

Another thing to note is that in hashing problems the number of mod operations can increase the constant factor of your code, sometimes leading to TLE. That is the case here as well, so it is important to keep the number of mod operations (\%) down to O(N + Q).

For more details, refer to the author’s or tester’s approach. The author has the same approach but used a different hashing mechanism.

If you had some other approach, feel free to share the same.

Time Complexity

O((N + Q) * Block), per test case, where Block = size of block chosen.

Space Complexity

O(N * Block)

Solution Links

Setter’s solution

Tester’s solution

Editorialist’s solution


Thanks for the editorial!

I have a question which isn’t exactly related to the idea of the solution, but to the difference between the hashing used by the setter and by the editorialist.

The idea by the setter/tester sounds to me like “every number is given a random key; two multisets are most likely equal if the sums of keys for these multisets are equal”. I’m actually more used to this kind of hashing for such problems; it also doesn’t require you to precompute modular inverses :) Now the question is: since this hashing is quite different from the one described in the blog by Makoto, can somebody share some proofs/statements about how good it is? I’m used to thinking about it as “the value of the sum is pretty much random, so let’s say the collision chance is 2^{-64}”, but that is far from a rigorous proof.

"Let P(x_1, ..., x_k) be a non-zero (multi-variable) polynomial on a finite field F_MOD. If the variables are chosen independently and uniformly at random, the probability that P(x_1, ..., x_k) = 0 is at most D/MOD, where D is the total degree of the polynomial."

Suppose the two multisets are a_1, a_2, \ldots, a_k and b_1, b_2, \ldots, b_d. Let P(x_{a_1}, \ldots, x_{a_k}, x_{b_1}, \ldots, x_{b_d}) = x_{a_1} + x_{a_2} + \cdots + x_{a_k} - x_{b_1} - x_{b_2} - \cdots - x_{b_d}. This polynomial has total degree 1, so by the statement above the collision probability is at most \frac{1}{MOD}.
Also, I see mine is not much different from normal string hashing :)
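For completeness, a tiny sketch of this additive scheme (hypothetical code; 64-bit unsigned overflow is assumed to act as arithmetic mod 2^{64}):

```cpp
#include <bits/stdc++.h>
using namespace std;

// Every value gets an independent random 64-bit key; a multiset is hashed
// as the sum of the keys of its elements.
int main() {
    mt19937_64 rng(chrono::steady_clock::now().time_since_epoch().count());
    map<int, unsigned long long> key;
    auto getKey = [&](int v) {
        if (!key.count(v)) key[v] = rng();
        return key[v];
    };
    auto hashMultiset = [&](const vector<int>& ms) {
        unsigned long long h = 0;
        for (int v : ms) h += getKey(v);   // order-independent by construction
        return h;
    };
    cout << (hashMultiset({1, 3, 3, 7}) == hashMultiset({3, 7, 1, 3})) << "\n"; // 1
    cout << (hashMultiset({1, 3, 3, 7}) == hashMultiset({1, 3, 7, 7})) << "\n"; // 0 w.h.p.
}
```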