PROBLEM LINKSDIFFICULTYCHALENGE PREREQUISITESAdHoc, Brute Force PROBLEMYou are given N servers, out of which a vast majority are honeypots. You have to guess the passwords at these servers. You know that the servers mix the passwords with some random salt, and then compare SHA1 hashes of both the strings. The score is equal to the sum of the square of the number of bits correctly matched. Honeypot servers do not affect the score at all. EXPLANATIONFirstly, you can try to find which servers are honeypot servers. This can be done by first getting a reference score by using some a set of passwords initially; then
Since we know that the density of honeypot servers is high, we can be smarter and try to find blocks of honeypot servers together. In case we find the score did change for the block of servers we were hoping to be honeypot, we can then try them one by one. After finding the honeypot servers, we must now try to find the passwords. The chances of finding the exact password in this problem for any server are extremely narrow. Because of the nature of SHA1, you cannot even hope to change a string ever so slightly to try and get a better result. Your best bet is to generate random strings and see which one gets you closer to better and better scores. One method is to only generate random strings and then try to run the corpus of random strings for each server (of course not for honeypots) while keeping all the other strings constant to try and maximize the score. This is what the tester's solution does. You have to be careful that you should not exceed the number of attepts you have. SETTER'S SOLUTIONCan be found here. TESTER'S SOLUTIONCan be found here.
This question is marked "community wiki".
asked 16 May '13, 05:56
showing 5 of 6
show all

My score during the contest (the winning score) was approx. 9461.81, but I optimized my solution even further afterwards and reached a score of approx. 9480.06 (I was keeping this last optimization idea in case I needed it during the contest, but in the end it wasn't necessary to implement it then). I will explain just the fully optimized solution (the changes from my contest solution are quite small). First of all I should mention that my solution is fully deterministic (no randomization involved) and the score differences come from the fact that the test data is randomly generated at each run. The only candidate strings I used in my solution are composed of the letters a, b and c  imagine them as base 3 numbers, where a=0, b=1 and c=2. Actually, I only use strings of length 12 corresponding to consecutive base 3 numbers (the most significant position is the first one; the least significant position  the last one). However, I do not consider these base 3 numbers starting from 0, but rather starting from a fixed offset (I used offsets like 29, 77, 25, etc. for various classes of test cases in my solution). At some point during the contest I also tried random strings and other bases and offsets, but the score differences were quite large  so I can say that a good part of my score is due to the candidate strings I used, but I have no idea why these strings lead to significantly better scores than others (and these betters scores are consistent, almost independent from the actual test data, meaning that there might be some intrinsic property of these strings which makes them work). Ok.. now let's get to the actual algorithm. First of all, we want to find the honeypots in order to ignore them afterwards. I started by asking a base question (a question is a full query, meaning that a string is asked for each of the N servers and an answer is read back)  let its answer be Sbase. Then I grouped the servers into NH (almost) equallysized blocks (I also tried other numbers of blocks, but this seemed to work better). Then I considered each block at a time. Let's assume that the current block consists of G servers. I ask a question by modifying the strings of only the servers in the block (i.e. incrementing their base 3 number by 1 compared to the base question) and read the answer S back. If S=Sbase then I assumed that all the servers in the block are honeypot servers. If S != Sbase then I considered each server in the block independently. Let the servers be Se(1), Se(2), ..., Se(G). Let's assume that we reached the server Se(i). I ask a question consisting of the base strings, except that the string of Se(i) is set to the base 3 number from the initial question for the block. Let S2 be the answer. If S2=Sbase then I assumed Se(i) is a honeypot server and moved on. If S2!=Sbase then I stored this "guess" somewhere and marked Se(i) as not being a honeypot server. Moreover, we know now that the server Se(i) contributes with a value S2Sbase to the answer S, so we decrement S by (S2Sbase). If the new updated value of S is Sbase then we can stop and all the servers Se(i+1), ..., Se(G) are honeypot servers. If we reach the server Se(G) then we do not need to ask a question for it  the current value of S would be the answer to the asked question, so we have a "guess" for Se(G). After this step all the honeypot servers are identified and we have an extra "guess" (compared to the base question) for each nonhoneypot server. The total number of asked questions for each block is between 1 and G  thus, the total number of questions asked so far is at most N (but it can be significantly smaller than N)  since T>N we are sure to have enough questions available so far. The next step consists of estimating the initial number of guessed bits for each nonhoneypot server. For each of these servers we have two different values now: Sbase and Sguess. SguessSbase is equal to the difference y^2  x^2, where x is the initial number of guessed bits (in the base question) and y is the number of guessed bits in the second question. If there is only one possible value of x for which a possible value y exists, then we found with total certainty the initial number of guessed bits for this server. In general x may not be uniquely determined. We will ask further questions for the nonhoneypot servers for which the initial number of bits is not known with certainty: we will stop asking questions for a nonhoneypot server i when the set of guesses uniquely determines the initial number of bits for the server i. Because we have a limited number of questions, we will prioritize the servers based on the most likely number of maximum number of bits guessed so far (even if we have multiple possible initial numbers of bits, the one closest to the initial average number of bits is considered more likely) and the number of questions asked for this server so far (lower numbers indicate higher priority). Now we know the initial number of bits for each nonhoneypot server and we also have a "best number of bits" for each of them (i.e. the maximum number of bits guessed so far, either in the base questions or in the other questions we asked). The next step consists of improving the guessed number of bits for the nonhoneypot servers. The main idea is to try not to waste questions. For instance, if we have two nonhoneypot servers, one for which we guessed 80 bits (at best) and the other for which we guessed 100 bits (at best), it is much more likely to find an improvement (more guessed bits) for the server with 80 guessed bits (i.e. a question regarding this server is much more likely to improve our score instead of a question concerning the other server). Thus, as long as we have questions left, we will choose the K servers (K <= KMAX) with the smallest number of maximum number of guessed bits (in case of ties, the servers with the smallest number of questions asked so far are chosen). We use the base strings, except that for each of the K servers we use their next base 3 numbers, and we ask the question. Let the answer be S. Since we know the initial number of bits guessed for each server, we will know the sum of the squares of the numbers of bits of the K servers in the current answer  let this number be Sum. By itself, Sum doesn't help us find the number of guessed bits for each of the K servers Se(1), ..., Se(K), because Sum can be obtained in multiple ways as a sum of K squares between 0 and 160. Let vmax[K][Sum] be the maximum number of bits x such that Sum can be obtained as a sum of K squares of numbers from 0 to 160 and one of them is x. If vmax[K][Sum] <= maximum number of guessed bits for Se(1) then we do not need to ask any further questions, because there will be no improvement for any of the K servers. Otherwise, we will ask a question in which only Se(1)'s string is changed compared to the base, thus finding the number of guessed bits for Se(1). After this we update Sum and compare vmax[K1][Sum] to the maximum number of guessed bits for Se(2), and so on... Note that the servers were ordered in increasing order of their priority (i.e. Se(1) has the smallest number of maximum number of guessed bits so far, etc.). You can easily notice that when K servers are selected, at least 1 and at most K questions are asked  except for the first question, all the other questions are asked only if they have a good probability of improving the maximum number of guessed bits for one of the servers. I used KMAX=3 (I tried other values of KMAX, from 1 to 10, but 3 produced the best results). Finally, because I mentioned "good probability", note that it is unlikely for the number of bits to be too small or too high. Thus, when computing vmax[K][Sum] I only considered squares of numbers between BMIN and BMAX (I used BMIN=69 and BMAX=106, instead of 0 and 160). This made the strategy skip questions which are very unlikely to improve the score. I also used similar lower and upper bounds when trying to guess the initial number of bits for each nonhoneypot server. Ok.. actually, for K=1 I considered the whole range of bits, from 0 to 160 (I didn't want to guess the full 160 bits and ignore the answer because it was outside the range [BMIN,BMAX] :) ). At the end we ask a final question with the string producing the maximum number of bits for each server  this last question is the one producing the maximum score for the test case. In order to "squeeze" the best score possible out of my solution I tried different parameters for various types of tests (first of all, I guessed the T value for 20 test cases, but this took a long time ; so for the rest I simply used some ranges I came up with)  the idea was to simply differentiate between the tests as good as possible in order to be able to tune the parameters independently for each group of tests (ideally, each group should consist of just 1 test). In the end, the only parameter that I tuned was the initial offset for the base 3 numbers. answered 17 May '13, 13:32
3
Amazing, what choosing good candidates strings can do. Random is not so good after all. Is it like an SHA1 insight? Diff: 8803.53pts => 9100.45pts, changing the offset in base 3 using my algorithm
(18 May '13, 17:57)
@samjay: So did you just change the strings you used in your solution from random to consecutive base 3 strings ? (and you kept everything else the same ?) If that's the case, then yes, it seems that random is not the best choice. My guess is that strings corresponding to consecutive numbers "cover" more subsets of bits when applying SHA1 than just using random strings  but that's just a wild guess (and very likely wrong) ; however, I don't know why base 3 works better than other bases (in my solution that was the case).
(19 May '13, 00:07)
I know you have guessed but still, "strings corresponding to consecutive numbers "cover" more subsets of bits when applying SHA1 than just using random strings" any proof or logical support that says this statement is strong enough?
(19 May '13, 00:15)
@bugkiller: Nope, no proof whatsoever. I just thought that when using strings corresponding to consecutive numbers the SHA1 hashes would be more different between them (i.e. "cover" more subsets of bits) than when using random strings. Note that I did not analyze in details how SHA1 works, but a small change in the string is expected to generate a large change in the SHA1 hash  that's all I had.
(19 May '13, 00:21)
@mugurelionut: Yes, from completely random. I must admit that I've done ~8950pts with completely random too, but just one attempt of both gave that big difference already. Also, I don't tried your trick yet to have offsets/case (which I thought was pretty funny you really succeeded in doing that). But for me now, changing the one offset I use, also makes extremely big differences(29 wins so far).Also for the tests I did so far, just starting with 'aaaaaaaaaaaa' gave better results than a random extra offset. But haven't read a thing about the SHA1 algorithm yet, only use it in apps =)
(19 May '13, 00:23)
@mugurelionut >> I had initially the same idea of using "aaaaaaaaaaaa" as a password, plus you have used all passwords of length 12, that again increases the chance of "more" matching bits rather than choosing random length between 8 and 12 or any length less than 12, I guess. Would you please elaborate more the thing how you split the total number of guesses? I do not get the way you have treated them as blocks and then attacking each of the block.
(19 May '13, 00:36)
Espl. this part "If S=Sbase then I assumed that all the servers in the block are honeypot servers. If S != Sbase then I considered each server in the block independently."
(19 May '13, 00:43)
@mugurelionut Dude have you got a brain in there or some kind of AI factory ?? :D just reading this entire thing gives me the chills.. motivates beginners (like me) to work hard to reach that level !! <3 love it !
(19 May '13, 14:30)
showing 5 of 8
show all

Hey, SRM Hackathon held in SRM Institute of Science and Technology, (Tamil Nadu) would be a great opportunity for you to horne your technical and development skills. It is a National level Hackathon with great mentors, judges and teams coming up every year. They are going to host their third edition this 26th and 27th September. You can register for it on this link https://srmhacks.hackerearth.com. The registration is completely free and there are many exciting prizes worth INR 3 lakhs and innumerable learning and connecting experiences to be earned by every participant It's surely worth giving a try :) answered 18 Aug '18, 22:16

@mugurelionut : Can you please explain your solution to this problem .
I don't understand  since a salt is appended to the strings, and we don't know the salt, how can we know the SHA1 hashes of the strings?
@neil_812 I have the same question. To quote @gamabunta: " we can ignore any string whose SHA1 hash matches a lot of bits from this string". But due to the unknown salt this may not give the intended result. Please clarify.
We just dont know the SHA1 but we try to calculate the number of matched bits after every try using x^2  y^2 = score1  score2 where x are the current matched bits. We could run a loop for y from 0 to 160 and a loop for x from y+1 to 160 and check the condition and also check that previously we had matched y bits or not.
supporting @vineetpaliwal ... all top scorers to this problem should explain their logic and expecting reply from @anton_lunyov to this post :)
@balajiganapath indeed. This was my stab at understanding something (if anything at all) from the top submissions. I am going over them again and will update the last section of the editorial.