SEASORT2 - Editorial

shangjingbo · March 22, 2014, 8:06am

PROBLEM LINK:

Practice
Contest

Author: Sergey Nagin
Tester: Mahbubul Hasan
Editorialist: Jingbo Shang

DIFFICULTY:

Challenge

PREREQUISITES:

Heuristic, Greedy, Mixure of Methods, Test Generation Plans

PROBLEM:

Sort a sequence only using reverse operations. Minimize the cost function, S/N+Q, where Q is the total operations, N is the length of the sequence, and S is the sum of lengths of intervals in all reverse operations.

EXPLANATION:

This is a challenge problem. The most important thing in solving challenge problem is to find some approximation/partial methods using the combination of search, greedy, dynamic programming, etc…, because this challenge problem is NP-hard. Also, it is important to write a local test system to help you find the best strategy (almost all top ranked guys wrote their own test codes).

In this problem, there are two parts in the objective function: S/N and Q. Because N is a constant once the input is given, we need to find a trade-off between minimizing S and Q.

Test Cases

And, for the challenge problem, usually we can have the generation plans of test data. But, this time, the plan is omitted. Therefore, we need to get some senses of them in some special ways. For example, we can check, whether N lies in a range [N_min, N_max]. If not, make our submission to TLE or RE, or something you want. And then, try different ranges, and finally collect the information about the generation plans.

For the generation plans, you guys can have a look at the kgcstar’s code and mugurelionut’s code. Specifically, in kgcstar’s code, the solving methods are highly specialized for different files. And the comments in mugurelionut’s code tell us that

// Structure of the tests:
// 1 test   : 6 <= N <= 10
// 39 tests : 9000 < N <= 10000
//     - 4 tests: 1000 <= ndif < 1050
//     - 4 tests: 250 <= ndif <= 300
//     - 11 tests: 100 < ndif <= 175
//     - 4 tests: 80 < ndif <= 100
//     - 4 tests: 60 < ndif <= 80
//     - 4 tests: 30 < ndif <= 60
//     - 8 tests: ndif <= 10

where, ndif is the number of different numbers in the input sequence.

Methods

After got some senses about the test cases, we need to find some strategies.

Note that there is a constraint that Q<=N. It is easy to achieve that, if we put numbers to their own position from left to right.

And then, if there are some numbers are already putted on correct places, i.e. [1…l] and [r… N] are correct, we can clearly reduced it to a problem of [l + 1 … r - 1].

So, the most straightforward algorithm is to greedily choose a smallest/largest number and put it to its correct position. Here, you can try different heuristic methods to choose the proper interval to reverse.

Further, you can not only put the smallest/largest number, but also try some complex methods to put some internal numbers to their correct positions, or divide the whole intervals into some separated intervals and then combine the results.

Because the winners’ codes are tooooo complex to read, I would like to invite @kgcstar, @mugurelionut to explain their great methods.

AUTHOR’S AND TESTER’S SOLUTIONS:

Author’s solution can be found here.
Tester’s solution can be found here.

samjay · March 22, 2014, 4:03pm

I want to add another simple method, which I think many top 20 competitor’s used:

Loop over the resting interval and group the same numbers together in the direction they are expected to end

Example: 1 2 5 3 7 3 8 3 (solving it small to bigger, 1 and 2 are already in place)

intuitive method from editorial:
1 2 5 3 7 3 8 3
1 2 3 5 7 3 8 3 // 1
1 2 3 3 7 5 8 3 // 3 
1 2 3 3 3 8 5 7 // 4

bringing the 3's 1 by 1 together:
1 2 5 3 7 3 8 3
1 2 5 3 7 3 3 8 // 1
1 2 5 3 3 3 7 8 // 2
1 2 3 3 3 5 7 8 // 3

For this simple case our total window is only 6 vs 8

Plus the fact that you can take advantage of the fact when 2 same numbers are already together. Than you don’t have to reverse. In that case we win an entire window. (according to how the score is calculated)

kcahdog · March 22, 2014, 5:13pm

@shangjingbo @samjay @mugurelionut The editorial places a lot of weightage on determining the nature of the test cases in the input file. Could you please some details or resources on how this is done ? What sort of code is used to do this ? It would be great if you could provide some hints.

kuruma · March 22, 2014, 5:33pm

Hello @kcahdog,

At least the part of:

“Therefore, we need to get some senses of them in some special ways. For example, we can check, whether N lies in a range [Nmin, Nmax]. If not, make our submission to TLE or RE, or something you want. And then, try different ranges, and finally collect the information about the generation plans.”

I think I can help you.

The good thing about online judges, is that usually the test data sets which are generated are static, in the sense that the I/O files will never change once created (the ones against your code is judge might change in this problem format, but the file’s contents, is static, as far as I know).

You can use this to your advantage…

By using assert statements in your code to check for the values of Nmin and Nmax, as described in the editorial you can usually get a good “insight” about how the files are built.

This happens because when an assertion fails to be verified, the code exits with runtime error.

So, suppose you choose to assert something like:

1000 <= N <= 10000

and you get a runtime error at a given time.

This means that somewhere in the test data you might have test case files where either:

N < 1000;
N > 10000;

And this knowledge alone might somehow help you to locally generate better test cases… Or at least, test cases which are more approximated to the test cases agains which your code is being judged.

To locally generate test cases, well, it’s not that different of generating test cases for a problem…@mugurelionut did it and used his own local test generator to improve his scores… You can see the flag LOCAL defined in his code… Possibly he would turn it on while testing his code locally and then set it to 0 when submitting on codechef

Regarding the solutions itself, I tried to read and understand his solution and it seems to me that he uses what @samjay said in his method GroupGreedy, although I’m not sure of it and also there are many things which I don’t fully understand…

Bruno

mugurelionut · March 22, 2014, 7:07pm

I will discuss in this answer how to determine information about test cases. First of all you need to think what test parameters you want to find out. N may be one of them, but in this problem, after giving it a bit of thought, you will see that the number of distinct numbers in the input sequence is also very important. So let’s say you come up with a condition you want to test, let’s say: (N >= 9000 and the number of distinct numbers <= 10). Testing if the condition is true on some test case (or false on all test cases) has already been discussed - you can generate a Runtime error or Time limit exceeded if the condition is true, and solve the test case correctly otherwise. The result of your submission will tell you which situation occurred. However, sometimes it’s more useful to just count on how many test cases a condition is true. In this case you make an initial submission (which correctly solves all test cases) and see its time T. The submission should also be sufficiently fast (i.e. it should be at least 0.2 sec faster than the time limit). Then, you make another submission in which, if the condition you are interested in is true, you make your code wait for an extra X sec, by looping and doing nothing (say X=0.1 sec or, better, X=0.2 sec). Then you see the new total time T’ of your submission. Obviously, we have T’=T+k*X (with some small variations possible). Since we know T, T’ and X, we can easily compute k, which is the number of test cases on which the condition you tested is true.

That’s what I did in order to find out the structure of the test cases (which I included as a comment in my source code, and which was also included by the editorialist in the editorial). I didn’t go all the way to find out the exact values of N and the number of distinct numbers for each of the 40 test cases, but that would have certainly been possible (with many more submissions). However, if you dig through @kgcstar’s source, you will definitely find out all this information for each test case

The goal of finding test parameters is to be able to generate offline test cases which are similar to the official test cases (there is no other way to do it when the test generation method is not given). Then, once you have test cases which are similar to the official test cases, you can try to optimize your solution on these test cases. That’s what I did. Of course, I could never be sure how close the test cases I generated were to the official test cases. Sure, they had about the same values for N and the number of distinct numbers, but there are many ways of generating a test case with a given value of N and a given number of distinct numbers. Let’s say we have N=10000 and we want to have 10 distinct numbers in the input sequence only. I generated the sequence randomly, which meant that the 10 distinct values were more or less evenly distributed throughout the sequence and that there was more or less the same number of values of each type. However, I didn’t know if the official test cases were also randomly generated. They could have been, for instance, almost sorted or almost unsorted. Or they may have, let’s say, many more values equal to 1 than values equal to 10. Of course, I could have checked other conditions to better understand if the test cases were random or not (e.g. the difference between the smallest number of times and the largest number of times each value occurs), but it would have taken me even more time and submissions.

What I find interesting is that after finding information about the test cases in his initial submissions, @kgcstar optimized his solution entirely offline (he barely made any submissions before the end of the contest to try to optimize his score on the 10 test case for which the score during the contest was computed). That means he really trusted that the test cases he generated were very close to the official test cases (I didn’t check, but maybe he extracted more information about the test cases, which allowed him to be very sure of this). I didn’t trust my own tests that much, so I kept making submissions throughout the contest to try to improve my score. It’s true that usually score improvements on my own test cases also lead to score improvements on the 10 cases for which the score was computed during the contest, which was a good sign that my test cases were similar enough to the official ones.

kaidul · March 23, 2014, 5:02pm

Tester’s Solution link is not working!

shangjingbo · March 24, 2014, 5:19am

Thanks very much! A perfect answer to teach contestants compete in the challenge problems.

shangjingbo · March 24, 2014, 5:20am

mugurelionut has posted his methods.

samjay · March 29, 2014, 2:17am

Thanks for sharing. I still think it’s a time expensive job to find these things out. While I would rather enjoy finding a fine/general algorithm based on the constraints given and not for the 40 testcases in particular.