I learnt something practically useful today. I suggest you head over to this problem and try it out before reading further.
As you may have noticed, it's a fairly easy problem, but if you did what I did, you surely got a TLE at least once: using unordered_set. That's right: for around 100000 elements, unordered_set took more than 5 minutes on my machine! Then I switched to set and, wham, 17 milliseconds.
It was quite surprising to me, since set also has to maintain the order of its elements. I read up a bit and found something surprising: unordered_set in the STL uses a hash of \text{hash}(x) = x, at least initially. And apparently its hashing scheme is not efficient at all: there can be many collisions, and each one can force a complete rehash. There seem to be only two solutions to this. One, write your own hash function. I was going to do this with bloom filters but no, too much work. Or two, lose a factor of \log n if you can afford it and use the ordered set, since its \log n is a worst-case time complexity, not a probabilistic one.
I didn't know why it's worst-case at first, but the reason is that it doesn't use hashing: std::set is typically implemented as a self-balancing binary search tree (a red-black tree in most implementations), which maintains ordering as well as \log n queries.
Either way, I thought this was something useful I learnt. Definitely better than discovering it in the middle of a two-hour contest.
I.e. unordered_set is good in general, but for large data with a lot of expected duplicates (like in the problem above) it can perform poorly.
In the second link, even though it is about unordered_map, a reply below states that many STL implementations start with a \text{hash}(x) = x function.
Thanks for sharing this! It'd be great if you could also include links to references, for example, where you read that \text{hash}(x) = x. Also, the title could be more informative.
I never use std::unordered_set, precisely because of its poor worst-case performance: as shown in the link provided here by @raisinten, some problem testers will design their tests to exploit this and make your life difficult.
I’ll just add something more to this conversation. All of what I say is available in the Codeforces article linked in the post linked in @ssgz 's reply (yes, you read that right).
Basically, all hash containers in the STL use std::hash, which initializes the hash function to the identity function in the case of int or long long. As we can see, if the first insert is around 10^9, that's quite bad. So whenever hash values collide (or something like the above happens), a rehash occurs. What's the new hash? In general, the value is reduced modulo a fixed prime (the bucket count). If another collision pile-up occurs, a bigger prime is chosen, and so on. There are two issues with this:
the primes are deterministic, i.e. on platforms like Codeforces you’re guaranteed to be hackable.
a more real-life problem is when your data has a very specific form. For example, if my input values are all even, then all hash values are even, and with a power-of-two bucket count (as some implementations use) 50% of the buckets are never touched! Even worse, there are examples where > 90% of the buckets go unused.
So what hash should you use? You could of course randomize the prime you reduce by, but as mentioned in the referred article, that can still be broken. You could use MD5, SHA-256, etc., but these are secure rather than fast. Hence I suggest you spend some time writing a custom hash, and use it henceforth in your competitions. And if you’re op enough, maybe you can include some good stuff too, like linear probing, robin hood hashing, bloom filters, etc. A lengthy investment, but one-time and useful.
Verdict: don’t use unordered_set, especially if hacking is supported on the platform.
I typed this from my phone, so please excuse formatting mistakes.
I don’t have any test cases to hack as such, but for almost any problem using unordered_set, you can refer to the link I added, which describes how to get the worst-case linear behaviour out of unordered_set.