For now we are publishing the author's solution to the problem. Let us discuss the strategies you used for it; later we will merge them into the editorial to create a more detailed version.
Author’s solution of the problem
There were no special prerequisites needed to solve this problem, as even the simplest strategy (don't store anything and respond with -1 -1 to every query) was enough to score some points, and this was the starting strategy for a lot of contestants. There are obviously better strategies, so this one will only get you something close to 0. One small observation: it is better to buy the cheapest HDD before any events happen. This way some of the crash events will hit a disk you own, and those events are free. You will get a much better score (still close to zero, though) if you buy something first. This might seem counterintuitive and it was a nice thing to discover during testing.
I will describe some of my ideas first.
One basic strategy would be to implement RAID 1 (i.e. keep a (secondary) copy of each (primary) HDD and basically store each piece of information twice). This has a clear disadvantage: you have to buy twice as many HDDs as you need. For this strategy it is better to buy HDDs from the second category (expensive, but with low costs for reading/writing), as you'll do a lot of copying. One can notice that it is better to copy data to secondary storage as soon as possible, because of a possible crash. Crashes are 'solved' by resyncing the two paired HDDs (copying from primary to secondary, or from secondary to primary, as needed).
When to buy a new HDD is also important. There is no point in buying them all at the very beginning. You can buy another one as soon as you believe you won't have enough storage space for the next event (i.e. you have less than 1000 units of free space left, or even 500 if you are willing to risk a little).
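To make the RAID 1 bookkeeping above a bit more concrete, here is a minimal sketch. It only tracks in-memory state; the `buyDisk()` helper, the `Disk` fields and the 1000-unit threshold are illustrative assumptions, not the actual judge interface.

```cpp
#include <vector>

struct Disk {
    int capacity;
    int used;
    int mirror;   // index of the paired disk, -1 if unmirrored
    bool alive;
};

std::vector<Disk> disks;

// Hypothetical purchase helper: in the real interaction this would print a
// "buy" command and read back the id assigned by the judge.
int buyDisk(int capacity) {
    disks.push_back({capacity, 0, -1, true});
    return (int)disks.size() - 1;
}

// Buy disks in mirrored pairs, and only when the remaining free space might
// not fit the next store request (assumed to be at most 1000 units).
void ensureCapacity(int pairCapacity) {
    long long freeUnits = 0;
    for (const Disk& d : disks)
        if (d.alive) freeUnits += d.capacity - d.used;
    freeUnits /= 2;                            // half of the space holds mirrors
    if (freeUnits >= 1000) return;             // enough room, buy nothing yet
    int primary   = buyDisk(pairCapacity);
    int secondary = buyDisk(pairCapacity);
    disks[primary].mirror   = secondary;
    disks[secondary].mirror = primary;
}

// A crash is 'solved' by resyncing the pair: the surviving copy is written
// back onto the crashed/replaced disk so both members match again.
void resyncAfterCrash(int crashed) {
    int other = disks[crashed].mirror;
    if (other == -1) return;                   // unmirrored disk: data is lost
    disks[crashed].used = disks[other].used;   // bookkeeping only; a real
                                               // solution also issues the copies
}
```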
Other RAID levels could also be implemented, but that is much harder, and one could not expect RAID 6 to be required by a task in this kind of contest. You can learn more about RAID levels by reading the Wikipedia page: https://en.wikipedia.org/wiki/Standard_RAID_levels
Another strategy, harder to code but more adaptive, is to decide whether you need to store everything twice at all. If you have a lot of crashes (10% or 15% of the events) it might be a good idea to store data twice. For 1% crashes you might prefer to pay the penalty (a couple of times) rather than buy twice as many HDDs and do all the copying associated with RAID 1.
How can you detect that you only have 1% crashes? By running a simpler strategy (the one above, for instance) on a few hundred events and computing the probabilities of each type of event. This is why there were only 6 event types, and why they were described in the Test generation section.
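A rough sketch of how this estimation could look; the six event-type codes, the 200-event warm-up and the 5% threshold are illustrative assumptions, not values fixed by the problem.

```cpp
#include <array>

// Count how often each of the 6 event types appeared so far and estimate
// their probabilities from those counts.
struct EventStats {
    std::array<long long, 6> count{};   // one counter per event type
    long long total = 0;

    void record(int type) { ++count[type]; ++total; }

    double frequency(int type) const {
        return total == 0 ? 0.0 : (double)count[type] / (double)total;
    }
};

// After a few hundred events, decide whether mirroring (RAID 1) is worth the
// extra disks: with ~1% crashes we simply accept the occasional penalty.
bool shouldMirror(const EventStats& s, int crashType) {
    if (s.total < 200) return false;        // not enough data yet
    return s.frequency(crashType) >= 0.05;  // 10-15% crashes: mirror everything
}
```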
I’d like to use this chance to invite a couple of contestants to explain their approaches.
Some (funny) notes about this problem:
there were 5375 (3450 AC) submissions for this problem (when this editorial was written)
the judge is (currently) 531 lines long and was easily hacked a couple of times by our tester @kingofnumbers
the problem originally contained another type of operation, fragmentation (i.e. store only a fraction of a client's data, or store multiple parts on multiple HDDs). This operation can currently be simulated by using a high-capacity HDD; it was removed because the judge was getting way too complicated.
@mugurelionut’s solution is 1546 lines long (I believe it is the longest for this problem)
We had to increase the number of submissions to 500 because @allllekssssa managed to exhaust the standard set of 200 submissions.
Okay. My solution (68.6 pts., #10 on scoreboard for this problem):
First we buy all our 1050 hard disks. We choose the type with the cheapest read/write costs, but if two disk types have similar read/write costs (within 65%) and one has higher capacity, we choose that one. The next tiebreaker is disk price.
After that there are two strategies we use to encode data:
If the number of stores (so far) is greater than the number of indexes (so far) or the crash percentage is less than 5%, we store the data on the nearest available hard disk.
Otherwise (provided we have at least two separate hard disks available), we store the data twice - on a main disk and a backup disk. (A rough sketch of this decision follows the list.)
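Here is that sketch (not the contestant's actual code): `pickNearestDisk()` and `pickBackupDisk()` are placeholder helpers, and "indexes" is read here as the number of read queries seen so far.

```cpp
#include <utility>

// Placeholder placement helpers; a real solution would scan the owned disks
// for free space. These stubs are illustrative assumptions.
int pickNearestDisk()         { return 0; }
int pickBackupDisk(int avoid) { return avoid == 0 ? 1 : 0; }

// Decide whether to store a client's data once or twice: once if stores
// outnumber reads so far, or if the crash percentage is below 5%; twice
// otherwise. Returns {primaryDisk, backupDisk}; backupDisk == -1 means the
// data is stored only once.
std::pair<int, int> choosePlacement(long long stores, long long reads,
                                    long long crashes, long long events) {
    double crashPct = events ? 100.0 * crashes / events : 0.0;
    int primary = pickNearestDisk();
    if (stores > reads || crashPct < 5.0) return {primary, -1};
    int backup = pickBackupDisk(primary);
    return {primary, backup == primary ? -1 : backup};
}
```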
I’ve also tried to implement crash backup. In basic terms, when a disk crashes, we backup a random hard disk onto the crashed disk. Unfortunately, those solutions ended up inferior.
First of all, I bought a fixed number of hard disks (about 1000). The hard disks that were bought were mostly of 3 or 4 different types, depending on the "avg_cost" (defined in the node class based on the capacity, read cost, write cost and price of the hard disk). I also identified a storage-heavy hard disk type for backup purposes (using at most 50 hard disks for backup, when required).
For storing information on a hard disk, I use the best-fit model (similar to the ones used by many operating systems for memory management). For this, I store the intervals into which the hard disk is partitioned. This ensures most of the space on the hard disk is used and there is minimal wastage, which matters because it was stated that a client's data must be stored on a single hard disk and not split across multiple ones (i.e. no fragmentation of data).
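As an illustration, a best-fit allocator over free intervals might look roughly like this; the data structure is an assumption about how "storing the intervals" could be implemented, not the exact code from the solution.

```cpp
#include <map>
#include <climits>

// Free space on one disk, kept as a map from interval start to interval
// length. The smallest free interval that still fits the request is chosen,
// which keeps wastage low.
struct DiskSpace {
    std::map<int, int> freeIntervals;               // start -> length

    explicit DiskSpace(int capacity) { freeIntervals[0] = capacity; }

    // Returns the start position of the allocated block, or -1 if no free
    // interval can hold `size` units (no fragmentation is allowed).
    int allocate(int size) {
        int bestStart = -1, bestLen = INT_MAX;
        for (const auto& [start, len] : freeIntervals)
            if (len >= size && len < bestLen) { bestStart = start; bestLen = len; }
        if (bestStart == -1) return -1;
        freeIntervals.erase(bestStart);
        if (bestLen > size)                          // keep the leftover tail
            freeIntervals[bestStart + size] = bestLen - size;
        return bestStart;
    }

    // Frees a previously allocated block (a full implementation would also
    // merge neighbouring free intervals here).
    void release(int start, int size) { freeIntervals[start] = size; }
};
```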
I used heuristics such as: if the hard disk holds a large amount of data, or the number of clients whose data it stores exceeds a fixed limit, we back up its data (and only the data that has not been backed up yet; this is tracked with a boolean array that tells which positions are backed up). This effectively maintains RAID 1, as every piece of data is backed up exactly once.
When querying for data, if the data is stored on multiple (here, 2) disks, we read it from the disk where the cost is minimal.
In the end I had 2 solutions, one without backing up data and one with it. The overall score of the backed-up version was lower.
Some things that could be improved:
Buying hard disks on the fly, i.e. only when storage runs low. This might be useful in cases where most of the hard disks go to waste because the total amount of data to store is quite small.
Finding better heuristics for choosing hard disks.
I forgot to implement RAID 1 correctly: in case of a crash, I should have copied the data back from secondary to primary. (This might have cost me some score.)
If crashes are rare, the backup hard disks could instead be used for regular storage, while maintaining a list of such positions.
Implementing a better hierarchy of data storage where backup data is stored in fragments in a tree-like structure, trying to minimise the number of failures. (This is based on a research talk I heard somewhere but can't find the link now.)
I am glad to explain my solution (83% of the score). It uses most of the observations from the editorial above.
First we buy all 1050 disks; that way we reduce the probability of a crash hitting a hard disk that holds a lot of data. I used only one type of hard disk (the one with capacity bigger than 200 and the smallest value X, where X = reading cost + writing cost + average price for one unit of HDD). Somehow I didn't find a way to use more HDD types; possibly that cost me a better score.
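For clarity, the selection of that single disk type could look like this sketch; the `DiskType` fields are assumed names for the problem's disk parameters.

```cpp
#include <vector>
#include <limits>

struct DiskType {
    int capacity;
    double readCost, writeCost, price;
};

// Among types with capacity above 200, pick the one minimising
// X = read cost + write cost + price per unit of capacity.
int chooseType(const std::vector<DiskType>& types) {
    int best = -1;
    double bestX = std::numeric_limits<double>::max();
    for (int i = 0; i < (int)types.size(); ++i) {
        if (types[i].capacity <= 200) continue;              // too small
        double x = types[i].readCost + types[i].writeCost
                 + types[i].price / types[i].capacity;       // avg price per unit
        if (x < bestX) { bestX = x; best = i; }
    }
    return best;                                             // -1 if none qualify
}
```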
After that, we should think about how the random generator produces queries of the second type (reading one unit of some user's data). First it chooses a user with equal probability, then it chooses one unit of that user's data with equal probability. So the best strategy is to keep the users with smaller amounts of data - that way more users are stored on the hard disks and the probability of a hit is bigger. This part can be done with a multiset/priority_queue holding the lengths of all stored users' data. When we want to store new data, first check whether there is free space on some HDD (if there is, just write it there). If there is not, check whether we have stored data of some user with a bigger length; if so, remove that data and write the new data in its place (if the lengths are nearly the same we do not need to change anything - the probability is nearly the same and every operation costs something).
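A minimal sketch of that multiset bookkeeping (names are illustrative, and a real solution still has to issue the actual delete/write commands):

```cpp
#include <set>
#include <utility>
#include <iterator>

// Stored users ordered by the size of their data, so the largest one can be
// compared against an incoming request and evicted if it is strictly bigger.
struct StoredUsers {
    std::multiset<std::pair<int, int>> bySize;   // {dataLength, userId}

    void add(int userId, int len) { bySize.insert({len, userId}); }

    // Returns the id of the user to evict to make room for a request of
    // length `len`, or -1 if the newcomer is not worth storing.
    int pickVictim(int len) {
        if (bySize.empty()) return -1;
        auto largest = std::prev(bySize.end());
        if (largest->first <= len) return -1;    // similar or smaller: keep things as they are
        int victim = largest->second;
        bySize.erase(largest);
        return victim;
    }
};
```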
The third and most important observation is RAID: at the beginning I had half regular HDDs and the other half as copies, but by tuning constants I found that 160-170 pairs of copy HDDs is the optimal value.
In the end we can predict what kind of test case we are dealing with. We can count the number of queries of each type over the first 100/200 moves and then choose the number of 'copy' disks (if we have around 1% crashes, do not save data twice; if we have 10%-15% crashes, we use nearly 200 copies).
I hope someone will find this approach interesting and helpful.
At the end I want to thank the setter for this real-life problem, and congratulations to the winners, who obviously have more experience in solving approximate tasks!
I am not mad that people got ~26 points for coding the trivial strategy. If you are able to read and understand that statement, you definitely deserve some points. I am just surprised the -1 -1 idea worked so well compared with much more complex strategies.