CHALENGE - Editorial

PROBLEM LINKS

Practice
Contest

DIFFICULTY

CHALENGE

PREREQUISITES

Ad-Hoc, Brute Force

PROBLEM

You are given N servers, out of which a vast majority are honeypots. You have to guess the passwords at these servers. You know that the servers mix the passwords with some random salt, and then compare SHA-1 hashes of both the strings.

The score is equal to the sum of the square of the number of bits correctly matched. Honeypot servers do not affect the score at all.

EXPLANATION

Firstly, you can try to find which servers are honeypot servers.

This can be done by first getting a reference score by using some a set of passwords initially; then

  • Change a single password to some other string
  • If the score changes, then that server is not honeypot
  • Otherwise the server is honeypot

Since we know that the density of honeypot servers is high, we can be smarter and try to find blocks of honeypot servers together. In case we find the score did change for the block of servers we were hoping to be honeypot, we can then try them one by one.

After finding the honeypot servers, we must now try to find the passwords.

The chances of finding the exact password in this problem for any server are extremely narrow. Because of the nature of SHA-1, you cannot even hope to change a string ever so slightly to try and get a better result.

Your best bet is to generate random strings and see which one gets you closer to better and better scores.

One method is to only generate random strings and then try to run the corpus of random strings for each server (of course not for honeypots) while keeping all the other strings constant to try and maximize the score.

This is what the tester’s solution does. You have to be careful that you should not exceed the number of attepts you have.

SETTER’S SOLUTION

Can be found here.

TESTER’S SOLUTION

Can be found here.

4 Likes

My score during the contest (the winning score) was approx. 9461.81, but I optimized my solution even further afterwards and reached a score of approx. 9480.06 (I was keeping this last optimization idea in case I needed it during the contest, but in the end it wasn’t necessary to implement it then). I will explain just the fully optimized solution (the changes from my contest solution are quite small).

First of all I should mention that my solution is fully deterministic (no randomization involved) and the score differences come from the fact that the test data is randomly generated at each run. The only candidate strings I used in my solution are composed of the letters a, b and c - imagine them as base 3 numbers, where a=0, b=1 and c=2. Actually, I only use strings of length 12 corresponding to consecutive base 3 numbers (the most significant position is the first one; the least significant position - the last one). However, I do not consider these base 3 numbers starting from 0, but rather starting from a fixed offset (I used offsets like 29, 77, 25, etc. for various classes of test cases in my solution). At some point during the contest I also tried random strings and other bases and offsets, but the score differences were quite large - so I can say that a good part of my score is due to the candidate strings I used, but I have no idea why these strings lead to significantly better scores than others (and these betters scores are consistent, almost independent from the actual test data, meaning that there might be some intrinsic property of these strings which makes them work).

Ok… now let’s get to the actual algorithm. First of all, we want to find the honeypots in order to ignore them afterwards. I started by asking a base question (a question is a full query, meaning that a string is asked for each of the N servers and an answer is read back) - let its answer be Sbase.

Then I grouped the servers into N-H (almost) equally-sized blocks (I also tried other numbers of blocks, but this seemed to work better). Then I considered each block at a time. Let’s assume that the current block consists of G servers. I ask a question by modifying the strings of only the servers in the block (i.e. incrementing their base 3 number by 1 compared to the base question) and read the answer S back. If S=Sbase then I assumed that all the servers in the block are honeypot servers. If S != Sbase then I considered each server in the block independently. Let the servers be Se(1), Se(2), …, Se(G). Let’s assume that we reached the server Se(i). I ask a question consisting of the base strings, except that the string of Se(i) is set to the base 3 number from the initial question for the block. Let S2 be the answer. If S2=Sbase then I assumed Se(i) is a honeypot server and moved on. If S2!=Sbase then I stored this “guess” somewhere and marked Se(i) as not being a honeypot server. Moreover, we know now that the server Se(i) contributes with a value S2-Sbase to the answer S, so we decrement S by (S2-Sbase). If the new updated value of S is Sbase then we can stop and all the servers Se(i+1), …, Se(G) are honeypot servers. If we reach the server Se(G) then we do not need to ask a question for it - the current value of S would be the answer to the asked question, so we have a “guess” for Se(G).

After this step all the honeypot servers are identified and we have an extra “guess” (compared to the base question) for each non-honeypot server. The total number of asked questions for each block is between 1 and G - thus, the total number of questions asked so far is at most N (but it can be significantly smaller than N) - since T>N we are sure to have enough questions available so far.

The next step consists of estimating the initial number of guessed bits for each non-honeypot server. For each of these servers we have two different values now: Sbase and Sguess. Sguess-Sbase is equal to the difference y^2 - x^2, where x is the initial number of guessed bits (in the base question) and y is the number of guessed bits in the second question. If there is only one possible value of x for which a possible value y exists, then we found with total certainty the initial number of guessed bits for this server. In general x may not be uniquely determined. We will ask further questions for the non-honeypot servers for which the initial number of bits is not known with certainty: we will stop asking questions for a non-honeypot server i when the set of guesses uniquely determines the initial number of bits for the server i.

Because we have a limited number of questions, we will prioritize the servers based on the most likely number of maximum number of bits guessed so far (even if we have multiple possible initial numbers of bits, the one closest to the initial average number of bits is considered more likely) and the number of questions asked for this server so far (lower numbers indicate higher priority).

Now we know the initial number of bits for each non-honeypot server and we also have a “best number of bits” for each of them (i.e. the maximum number of bits guessed so far, either in the base questions or in the other questions we asked).

The next step consists of improving the guessed number of bits for the non-honeypot servers. The main idea is to try not to waste questions. For instance, if we have two non-honeypot servers, one for which we guessed 80 bits (at best) and the other for which we guessed 100 bits (at best), it is much more likely to find an improvement (more guessed bits) for the server with 80 guessed bits (i.e. a question regarding this server is much more likely to improve our score instead of a question concerning the other server).

Thus, as long as we have questions left, we will choose the K servers (K <= KMAX) with the smallest number of maximum number of guessed bits (in case of ties, the servers with the smallest number of questions asked so far are chosen). We use the base strings, except that for each of the K servers we use their next base 3 numbers, and we ask the question. Let the answer be S. Since we know the initial number of bits guessed for each server, we will know the sum of the squares of the numbers of bits of the K servers in the current answer - let this number be Sum. By itself, Sum doesn’t help us find the number of guessed bits for each of the K servers Se(1), …, Se(K), because Sum can be obtained in multiple ways as a sum of K squares between 0 and 160. Let vmax[K][Sum] be the maximum number of bits x such that Sum can be obtained as a sum of K squares of numbers from 0 to 160 and one of them is x. If vmax[K][Sum] <= maximum number of guessed bits for Se(1) then we do not need to ask any further questions, because there will be no improvement for any of the K servers. Otherwise, we will ask a question in which only Se(1)'s string is changed compared to the base, thus finding the number of guessed bits for Se(1). After this we update Sum and compare vmax[K-1][Sum] to the maximum number of guessed bits for Se(2), and so on… Note that the servers were ordered in increasing order of their priority (i.e. Se(1) has the smallest number of maximum number of guessed bits so far, etc.). You can easily notice that when K servers are selected, at least 1 and at most K questions are asked - except for the first question, all the other questions are asked only if they have a good probability of improving the maximum number of guessed bits for one of the servers. I used KMAX=3 (I tried other values of KMAX, from 1 to 10, but 3 produced the best results).

Finally, because I mentioned “good probability”, note that it is unlikely for the number of bits to be too small or too high. Thus, when computing vmax[K][Sum] I only considered squares of numbers between BMIN and BMAX (I used BMIN=69 and BMAX=106, instead of 0 and 160). This made the strategy skip questions which are very unlikely to improve the score. I also used similar lower and upper bounds when trying to guess the initial number of bits for each non-honeypot server.

Ok… actually, for K=1 I considered the whole range of bits, from 0 to 160 (I didn’t want to guess the full 160 bits and ignore the answer because it was outside the range [BMIN,BMAX] :slight_smile: ).

At the end we ask a final question with the string producing the maximum number of bits for each server - this last question is the one producing the maximum score for the test case.

In order to “squeeze” the best score possible out of my solution I tried different parameters for various types of tests (first of all, I guessed the T value for 20 test cases, but this took a long time ; so for the rest I simply used some ranges I came up with) - the idea was to simply differentiate between the tests as good as possible in order to be able to tune the parameters independently for each group of tests (ideally, each group should consist of just 1 test). In the end, the only parameter that I tuned was the initial offset for the base 3 numbers.

22 Likes

At first note that if we change one string in the query and the score does not change, it does not mean that the server is honeypot. It is true only in about 95.5% cases. So often we could miss one or two normal servers. Earlier I do additional tries in this case to identify all normal servers.

But in the end I discovered very cool approach that allows to find all normal servers in O(K * log(N/K)) tries, where K is the number of normal servers.

We have some initial try {q0[1],...,q0[N]} with score S0.
Then we do another random try {q1[1],...,q1[N]} with score S[1,N] and compare their results.
If they differ then we have some normal servers. So we make try {q0[1],...,q0[N/2], q1[N/2+1],...,q1[N]}.
Let it score be S[1,N/2].
Then score of try {q1[1],...,q1[N/2], q0[N/2+1],...,q0[N]} equals to S[N/2+1, N] = S[1,N] - S[1,N/2] + S0.
And we proceed recursively to each of the segments [1, N/2] and [(N+1)/2, N].

So at each moment we are in some segment of servers [L, R]
and know the score S[L,R] of try {q0[1],...,q0[L-1], q1[L],...,q1[R], q0[R+1],...,q0[N]}.
If S[L,R] = S0 then potentially all servers from L-th and R-th are honeypots and we stop the recursion.
Otherwise if R = L + 1 we definitely know that L-th server is normal and we do corresponding processing of scores using squares difference approach described above.
Otherwise we do a try {q0[1],...,q0[L-1], q1[L],...,q1[M], q0[M+1],...,q0[N]}.
Let it score be S[L, M]. Here M = (L+R)/2.
Then as above score of flipped try {q0[1],...,q0[M], q1[M+1],...,q1[R], q0[R+1],...,q0[N]}
is S[M+1, R] = S[L, R] - S[L, M] + S0.
And we proceed recursively to segments [L, M] and [M+1, R].

If after this recursion we did not identify all normal servers we could apply this step again but for efficiency it is better to delete known normal servers. Actually in my solution I do this in loop while not find all normal servers. Even if we have to identify only one normal server we should spend only O(log H) ties which is not very much.

This approach allows me to jump from 9150 to 9250 in average and from 9190 to 9273 at maximum.

8 Likes

Article on SHA1 collisions here

2 Likes

SRM Hackathon


Hey, SRM Hackathon held in SRM Institute of Science and Technology, (Tamil Nadu) would be a great opportunity for you to horne your technical and development skills. It is a National level Hackathon with great mentors, judges and teams coming up every year. They are going to host their third edition this 26th and 27th September. You can register for it on this link https://srmhacks.hackerearth.com.
The registration is completely free and there are many exciting prizes worth INR 3 lakhs and innumerable learning and connecting experiences to be earned by every participant
It’s surely worth giving a try :slight_smile:

@mugurelionut : Can you please explain your solution to this problem .

7 Likes

I don’t understand - since a salt is appended to the strings, and we don’t know the salt, how can we know the SHA-1 hashes of the strings?

4 Likes

@neil_812 I have the same question. To quote @gamabunta: " we can ignore any string whose SHA-1 hash matches a lot of bits from this string". But due to the unknown salt this may not give the intended result. Please clarify.

We just dont know the SHA1 but we try to calculate the number of matched bits after every try using x^2 - y^2 = score1 - score2
where x are the current matched bits.
We could run a loop for y from 0 to 160 and a loop for x from y+1 to 160 and check the condition and also check that previously we had matched y bits or not.

supporting @vineetpaliwal … all top scorers to this problem should explain their logic
and expecting reply from @anton_lunyov to this post :slight_smile:

1 Like

@balajiganapath indeed. This was my stab at understanding something (if anything at all) from the top submissions. I am going over them again and will update the last section of the editorial.

1 Like

Amazing, what choosing good candidates strings can do. Random is not so good after all. Is it like an SHA-1 insight?
Diff: 8803.53pts => 9100.45pts, changing the offset in base 3 using my algorithm

4 Likes

I also implemented this approach, but while doing this I took at each step a new random string for each server in the interval. And if it was better, remember it, hoping to obtain a better score when T is small. And when the interval(L,R) contains a non-honeypot server, keep trying the intervals (L, (L+R)/2), ((L+R)/2),R) until one also finds one (countering the ~5% case as much as possible).
But your approach, using only two strings/server is a big win comparing to mine.

Don’t forget u don’t need to find the lastone, you can just group all resting servers if there is only one to find left.

@samjay: So did you just change the strings you used in your solution from random to consecutive base 3 strings ? (and you kept everything else the same ?) If that’s the case, then yes, it seems that random is not the best choice. My guess is that strings corresponding to consecutive numbers “cover” more subsets of bits when applying SHA-1 than just using random strings - but that’s just a wild guess (and very likely wrong) ; however, I don’t know why base 3 works better than other bases (in my solution that was the case).

I know you have guessed but still, “strings corresponding to consecutive numbers “cover” more subsets of bits when applying SHA-1 than just using random strings” any proof or logical support that says this statement is strong enough?

@bugkiller: Nope, no proof whatsoever. I just thought that when using strings corresponding to consecutive numbers the SHA-1 hashes would be more different between them (i.e. “cover” more subsets of bits) than when using random strings. Note that I did not analyze in details how SHA-1 works, but a small change in the string is expected to generate a large change in the SHA-1 hash - that’s all I had.

@mugurelionut: Yes, from completely random. I must admit that I’ve done ~8950pts with completely random too, but just one attempt of both gave that big difference already.
Also, I don’t tried your trick yet to have offsets/case (which I thought was pretty funny you really succeeded in doing that). But for me now, changing the one offset I use, also makes extremely big differences(29 wins so far).Also for the tests I did so far, just starting with ‘aaaaaaaaaaaa’ gave better results than a random extra offset.

But haven’t read a thing about the SHA-1 algorithm yet, only use it in apps =)

@mugurelionut >> I had initially the same idea of using “aaaaaaaaaaaa” as a password, plus you have used all passwords of length 12, that again increases the chance of “more” matching bits rather than choosing random length between 8 and 12 or any length less than 12, I guess. Would you please elaborate more the thing how you split the total number of guesses? I do not get the way you have treated them as blocks and then attacking each of the block.

Espl. this part “If S=Sbase then I assumed that all the servers in the block are honeypot servers. If S != Sbase then I considered each server in the block independently.”

@mugurelionut Dude have you got a brain in there or some kind of AI factory ?? :smiley: just reading this entire thing gives me the chills… motivates beginners (like me) to work hard to reach that level !! <3 love it !