I learnt something practically useful today. I suggest you head over to this problem and try it out before reading further.
As you may have noticed, it's a fairly easy problem, but if you did what I did, you surely got a TLE at least once: using unordered_set. That's right: for around 100000 elements, unordered_set took more than 5 minutes on my machine! Then I switched to set and, wham, 17 milliseconds.
It was quite surprising to me, since set also has to maintain the order of its elements. I read up a bit and found something surprising: unordered_set in the STL uses a hash of \text{hash}(x) = x, at least initially. And apparently its hashing scheme is not efficient at all: there can be many collisions, and each one can force a complete rehash. There seem to be only two solutions to this. One, write your own hash function. I was going to do this with bloom filters but no, too much work. Or two, lose a factor of \log n if you can afford it and use the ordered set, since its \log n is a worst-case time complexity, not a probabilistic one.
I didn't know why it's worst-case at first, but the reason is that it doesn't use hashing: std::set is typically implemented as a self-balancing binary search tree (a red-black tree in most implementations), which maintains ordering as well as \log n queries.
Either way, I thought this was something useful I learnt. Definitely better than discovering it in the middle of a two-hour contest.
I.e. unordered_set is good in general, but for large data with a lot of expected duplicates (like in the problem above) it can perform poorly.
In the second link, even though it is about unordered_map, a reply below states that many STL implementations start with a \text{hash}(x) = x function.
Thanks for sharing this! It'd be great if you could also include links to references, for example, where you read that \text{hash}(x) = x. Also, the title could be more informative.
I never use std::unordered_set, precisely because of its poor worst-case performance: as shown in the link provided here by @raisinten, some problem testers will design their tests to exploit this and make your life difficult.
I’ll just add something more to this conversation. All of what I say is available in the Codeforces article linked in the post linked in @ssgz 's reply (yes, you read that right).
Basically, all hash containers in the STL use std::hash, which initializes the hash function to the identity function in the case of int or long long. As we can see, if the first insert is around 10^9, that's quite bad. So whenever hash values collide (or something like the above happens), a rehash occurs. What's the new hash? In general, the value is reduced modulo a fixed prime (the bucket count). If another collision pile-up occurs, a bigger prime is chosen, and so on. There are two issues with this:
the primes are deterministic, i.e. on platforms like Codeforces you’re guaranteed to be hackable.
a more real-life problem is when your data has a very specific form. For example, if my input values are all even, then all hash values are even, and with a power-of-two bucket count (as some implementations use) 50% of the buckets are never touched! Even worse, there are examples where > 90% of the buckets go unused.
So what hash should you use? You could of course randomize the prime you reduce by, but as mentioned in the referred article, that can still be broken. You could use MD5, SHA-256, etc., but these are secure rather than fast. Hence I suggest you spend some time writing a custom hash, and use it henceforth in your competitions. And if you’re op enough, maybe you can include some good stuff too, like linear probing, robin hood hashing, bloom filters, etc. A lengthy investment, but one-time and useful.
Verdict: don’t use unordered_set, especially if hacking is supported on the platform.
I typed this from my phone, so please excuse formatting mistakes.
I don’t have any test cases to hack as such, but for almost any problem using unordered_set, you can refer to the link I added, which describes how to get the worst-case linear behaviour out of unordered_set.