Issue with challenge problems

What happened?

In the May long contest, contestants were able to make many submissions and extract data about test cases. Few users say this is against rules, others say only the smart ones were able to do it and hence it is part of fair game. I start this thread to let you all know about my discussion with codechef team and to seek your view on this.

Here is what I have to tell about Challenge problems. I had a discussion with codechef team regarding the same and we have concluded with the following. I request all of you to respond with your views on this, as this is a common problem and not related just one contest.

Why should we avoid solutions that try to get information from test cases and optimize accordingly?

Testing is never complete. Though we are testing the solutions on only 20-40 test files, our goal is to test the solutions in most robust way. Solutions should be good with given CONSTRAINTS and never depend on given TEST DATA.

For the same reasons codechef introduced the concept of Partial Score. Partial scoring is introduced to make sure that the solutions are more robust and can handle any kind of test data with given constraints. Now if the users are able to extract details about test data and make optimizations accordingly, then the actual purpose of partial scoring is useless.

That being said, I would like to say that the solutions to challenge problems should not be optimized based on the test data. I propose the following :

  1. We should not show the total runtime. It is better to show maximum run time of all the test cases. We have 2 advantages with this. One is that the user cannot simply force run for larger time and extract information about number of test cases. Second is many a times users asks questions like “If the time limit is 2 seconds how is a submission time shown as 15 seconds”, this is because we are showing the sum of all time taken, by taking maximum (or even average) we can remove this ambiguity as well. Note that it is still possible to extract test cases (I think) but not easy.

Issues found: None from user point of view.

  1. This idea is not great. To reduce post size and save time of readers. I am removing it and adding at end of post.

  2. Too many test files : This is simple method and still works. Contestants are able to extract data as the number of data sets are low. What if we follow the following method. Say we wanted to have N data sets. Let us create 5N data sets. Submitted programs will run on all 5N data sets, score for N/5 (20% of N) first data sets will be given, and AC only if it passes all 5N data sets. As generally we use N=20 data sets, 5N is 100 and it is very difficult to extract information about all those data files. Now after the contest we will choose N random files out of 5*N files, and remove other extra files, and rejudge all the submissions on those N files. What happens with this method is, the user cannot extract data for all 100 test cases it is very hard, even if they start extracting for 20-30 cases, he cannot be sure if they will be included in final tests. This method is fair because : Initial partial score is on same test files for all the users, final full score is on same test files for all users. Only thing is that the setter has to generate a lot of test cases. As most of setters use automated generators, I think number of tests will not matter much. Another issue I see is, it will take a lot of time for each submission to be tested during contest as it has to run on 100 files. Rejudge after the contest wont be a problem as we will scale down to N tests only.

Issues Found: There is an software limitation of only 64 test files. However codechef team reported that it is not advised to use more than 20 test files.

The following idea is proposed by Gedi Zheng

Separate the first 20% of test data and the full test data, and limit the frequency of full data submission.
We can have two submit button for the challenge problem. One for the first 20% of test data, which only return the score and verdict on 20% of the test data. The other for the full test data, which only return verdict during contest. And we limit the frequency of submission for full data to something like once every 4 hours, so that the maximum number of total submissions can be made on the full test during will be limited to 60. It’ll be very difficult to get much information about the full test data, while it’ll be enough for contestants to ensure that their program can get AC on full data.

Issues Found: What if the user is getting AC on 20% test data. And when he submits it on larger file he is getting WA? As we are limiting the number of submissions, this case can be really tricky.
Also what should be the limit on larger data set? Some thing like 40 will be too less? some thing like 80 will be enough to extract data?

The following ideas is proposed by mugurelionut

  1. generate (part of) each test case randomly at each run (this has been done in the past)

  2. make the output of each test case sufficiently large so that it cannot be found entirely with 500 submissions

Issue: What can we do with interactive problems? Also how does it stop user from extracting details about test data?

  1. make the judge answer interactive queries adaptively (e.g. in order to try to generate the worst case situation for each contestant on the fly)

I agree with what Gedi Zheng suggested. That seems like a good solution. Please leave your comments on this.

Removed content

  1. Always AC : Let us say the problem is to minimize score. Select some value MAX, which is large enough score for a test case. It has to be higher than the score generated by very simple brute force submission. Now if a test file generates any verdict other than AC, we will add MAX as score of that file instead of showing the verdict. This way always the verdict is AC only, and at the same time the wrong submissions will not have any good score. There are a lot of things we need to discuss about. Note that the user will never know if he is solving all the test data correctly, his submission may give WA to some test cases. As we will display score of only 20% test data (If we are using this method, we should display score for 40% test data, it is better). So for the rest of test data, even if the program is wrong and gives WA or RTE, he will never know. It is still okay. The tops should be able to write good programs which will not fail. Even if their program is failing on one or 2 cases, it is okay as we will add MAX as score and we will take average later. Good thing about this method is that the user cannot extract any details about hidden test data. One last doubt may be that, say a user did very well and managed to get low score on all test data, but getting WA on one test data. Even then he managed to get global best average score. In this situation the winner of challenge problem actually has a bug and gets WA on one test case. If we are okay to accept this as this method will make the programs more robust, we can think of implementing this method. Also because we want the solutions to follow the constraints, and we are not allowing them to extract any information about test data, we should provide with clear test generation plans.

Issues Found: Turned out to be a bad idea. The user will never know if his solution is correct or wrong and very hard to solve it.
And also what if the best solution is failing on one test case. Can we accept it as the best solution?

7 Likes

IMHO, i really like the following ideas :

  • show maximum run time among test cases instead of the sum of run times (not harmful, easier to understand)
  • way more test cases (20 is really too few. make it not constant between each month ?)
  • the Gedi Zheng’s idea : REAL partial scoring, but keep the ability to know if the code gets AC on the full set

Moreover, these propositions do not need the current judge to be modified (or just slightly).

Just need to add a button to test on a different set of test cases, and give less information for them.

This whole set of features seem really great !

Edit:

the always-AC possibility seems really confusing. beginners would probably be lost. i’m not very fond of it.

Edit 2:

the “issue found” for the Gedi Zheng’s idea is not an issue. :slight_smile: if you get AC on the 20% and WA on the whole set, it’s because your code solves only partially the problem : that is to say it’s wrong ! you have to make it work.

3 Likes

I like the idea of Gedi Zheng. Maybe instead of having a duration constraint between two full submissions, we can have a constraint on the number of full submissions. Some people start working on challenge problem very late, the duration based constraint will allow them fewer full submissions. On the other hand, it may force them to work on the challenge problem earlier.

I also like the idea of having more test cases and then choosing a random subset of it as final test data. The only issue is that it will take much longer to process each submission.

7 Likes

I dont agree with any half measures . I want final test cases not to be evaluated till the contest is running . If a person fails on one or more of the final test cases , then either the submission can be rejected or some max amount (for minimization problem) or some min amount ( for maximization problem ) can be awarded .

Since everyone tries multiple approaches to solve the challenge problem , getting unlucky during final tests should not be an issue .

Maybe as a first step we could reduce the number of submissions even further (e.g. 250 instead of 500?) and at the same time increase the number of test cases (more test files and more test cases per file, e.g. 40 test files with 10 test cases each?). This should also come with proper explanations of test generation plans. This wouldn’t prevent extracting some information about the test data, but much less information could be extracted this way.

2 Likes

Just a comment about the number of test cases. As far as I know, there is a limitation of the system to at most 64 test files (the first time I submitted a problem for a Codechef contest I tried to add more files than this and it wasn’t possible). Of course, there could be multiple tests per file, which would still mean many test cases (although in general I don’t particularly like this, it has also been used in previous challenge problems).

3 Likes

ok, good to know. do we know if this limitation has practical reasons behind it (time required to judge a program, for instance), or is it just a software limitation which could eventually be raised up ?

I have added about an issue. This issue was raised by codechef team. Please check it above.

Don’t you mean until the contest is *over? I don’t agree with your last statement either about getting unlucky during final tests should not be an issue. But I do agree with your overall idea. Perhaps create two separates test sets. One for the contest and another for final evaluation after the contest is over. The final set could be test during the contest without displaying the points for those tests to make sure the submission gets AC in every case.

1 Like