Please share Approach to solve DISTRING of OCT18 using hashing.

Problem link. I have seen many solutions uses hashing which is far easier than the official intended solution. So, can someone please share how to solve it using hashing?


1 Like

I used hashing as part of the solution: when N < M

for row we go trough all cells in that row and calculate a rolling hash for that cell and all before.
hash[i][j] = matrix[i][j]*(a^i) + grid[i][j-1] mod b where hash[-1][*] = 1. For a and b I had chosen 80471168 and 500050000739.

next I go trough all ranges (left,right) of columns.
for each of these ranges I go trough all rows from top to bottom. for each row I check if I have seen the string matrix[row] … matrix[row][right] before. If I haven’t seen it yet I add (row+1)\cdot(M-row) to the total. Otherwise I add (row-prevRow)\cdot(M-row) to the total.
Checking if I have seen the string before uses the rolling hashes. Calculate rowHash = hash[row][right]-hash[row][left-1]. the rowHash is compared against all previous hashes using a unordered_map (Which coincidentally uses hashing).

this part of the complete solution is O(N^2\cdot M)
The other part (N \geq M) doesn’t use hashing at all.

1 Like

Here is what I used.

1: Hashing

The big thing with hashing is that it is possible to compare two strings s_1 and s_2 of length m in O(\log(m)). Basically finding out if s_1 \lt s_2 or s_1 \gt s_2 or s_1=s_2.

This is done by first precalculating hashing for the two strings. With the hashing done it is possible to use binary search to find out their LCP. With the LCP found just check the next character.

Using this method in DISTRING allows sorting the rows of the matrix in O(n \log(n) \log(m)). This is so quick that it is possible to do it m times, once for every column.

2: A “brute” algorithm

There is an easy O(m^2n \log(n) \log(m)) algorithm to solve this problem. The idea is to only look at the sub-matrices starting at column x_1 and ending at column x_2. Use the hash-based comparisons to sort the rows on the interval x_1 to x_2, taking O(n \log(n) \log(m)) time. From this it is pretty simple to figure out the sum of strengths of all submatrices beginning at x_1 and ending at x_2.

3: The improved algorithm

It is possible to get a quicker running algorithm by taking care of all x_2 at the same time at a cost of something like an extra factor of O(\log(n)). This is a bit complicated to do but is basically just a speedup of the brute algorithm, trading a O(m) for O(\log(n)). The main idea is to use that the larger the x_2 is, the fewer rows will be identical, and changes can only happen something like O(n \log(n)) times. This is a bit tricky to explain and I suggest you to analyse how to do the speedup yourself.

In the end I got the algorithm running in something like O(m n \log(n)^2 \log(m)). The algorithm is a bit of a mess but the hashing part is both very simple and easy to use.


I use rolling hash as explained by @spaanse. The only difference is that I believe in taking more than one base and mod. What I think is that it is difficult to find collisions with 2/3 base and mod?
My Soln (20 pts). That’s the reason why I keep 3 small primes and 3 large in my template so when required I can directly uncomment that portion.
Other difference between @spaanse and my soln is that I did that for all cases.

To use rolling hash I converted every no in 6bit base 26 alphabets. and then used rolling hash.

Can anyone plzz tell the difference between rolling hash and normal hash . Why the rolling hash is preferred over the hash?

“Other difference between @spaanse and my soln is that I did that for all cases.” Did you mean that u used hashing for N >= M also? If Yes then love to know the approach!

Will you please elaborate the 3rd part a bit more?