Unofficial Editorial - August Challenge 2019 Div 2

vijju123 · August 13, 2019, 2:07pm

I post setter’s notes of editorials I was unable to complete so that nobody loses their enthusiasm.

ssjgz · August 13, 2019, 2:13pm

Ohhhhh … I honestly didn’t see that, sorry. Yes, that looks like it will be very helpful!

PS - for ZOMBCAV, the editorial gives O(N) time, which seems to neglect the O(N log N) from sorting the radiation levels and zombie healths (?)

Although, we can get a properly O(N) time solution with a little bit of thought, I think - exercise for the reader (what is the max value of a radiation level? What, then, would be the max zombie health if all zombies could be killed? Can we use this fact to check if zombie healths are a permutation of radiation levels in O(N)?)

vijju123 · August 13, 2019, 2:15pm

Right, setter’s solution is O(NLogN). Oversight from me - I got confused a bit as O(N) was possible and did not pay attention to setter’s solution a lot

anon62928982 · August 13, 2019, 2:43pm

@viju123 how soon can we expect the editorials for CHGORAM and CSTREE? For a beginner like me an explanation from complete scratch is more helpful, though I appreciate the efforts of @ssjgz and @tomsyd.

vijju123 · August 13, 2019, 2:46pm

CSTREE was the hardest problem of the set. Although I will try to explain it the best I can, you wont benefit much if you don’t fulfil the pre requisites.

For starters, try to grasp Kirchoff’s law of spanning tree and cayley’s formula (generalised one as well).

anon62928982 · August 13, 2019, 2:58pm

What exactly are the pre-requisites? Just Kirchhoff’s Law and Cayley’s formula? Though it was my third contest on Codechef I have some fair bit of experience solving problems and I was able to solve 5 problems from Div2(till KS1) and want to give a shot to the remaining problems.

alei · August 13, 2019, 3:03pm

No, the hardest was SYNBAC it has less ACs than the hardest, and even people that solved it got 0.01 points.

vivek_1998299 · August 13, 2019, 3:34pm

True:disappointed_relieved: spent 5 full days for 0.02 pts…can you tell the idea behind solution that got >1 score?

alei · August 13, 2019, 4:47pm

I have some general ideas, but I’m not sure how much they scores. The tie-break problem is kind of a real world problem (It probably has applications in bioinformatics).

If we know the original genome T (T is defined in test generation section), then we have to find a way to transform G to T using reversals and mutations. Using only reversals is known as “sort reversals” there are some studies about it http://fileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/SortRevPres.pdf

The problem is that we don’t know T, we can use the smallest string that contains all proteins (this problem is similar to the DNA sequencing, for which there are some algorithms that use eulerian paths). Also there are some details because the same amino acid can be generated in distinct ways.

Another idea is to find the best position to divide P in chunks, generate each chunk with mutations and then move the chunks with reversals.

I noticed that the solution of wjli generate all proteins. I didn’t read his code in detail though.

vijju123 · August 13, 2019, 5:15pm

Plagiarism alert . These notes are too familiar hehehe

BTW, on a lighter note, you’re too happy that the tie breaker problem turned out to be hardest
Hehehehe XD

taran_1407 · August 13, 2019, 7:39pm

Agreed, Challenge was the hardest problem. Spent 19 submissions and 5 and a half day to get first AC, and overall 73 submissions to get 0.0098 score.

This is in contrast to 39 submissions on all remaining 10 problems combined within three and a half day (Including 23 submissions on TSEQ alone).

thang040602 · August 14, 2019, 2:18am

Can you explain how to calculate the result knowing the number of numbers bigger than the current node and the number of numbers smaller than the current node. I did know how to use fenwick tree and dfs to find the number of numbers bigger and smaller than the current node but I couldn’t find the formula for the result

tomsyd · August 14, 2019, 9:02am

Suppose tot_small, the total number of nodes smaller than the current node and tot_big the number of nodes bigger than the current node. tot_small and tot_big are trivial to calculate because the statement says that all nodes are numbered from 1 to N.

Now for each subtree of the current node, we will calculate the number of nodes bigger and smaller than the current node. Let’s call them big and small. Here, we will have 3 cases to do, p2 = 1, p2 = 2, p2 =3.

If p2 == 1, than we will add (tot_big - big) * big to the current answer.
If p2 == 2 than we will add (tot_big - big) * small + (tot_small - small) * big to the current answer.
If p2 == 3 than we will add (tot_small - small) * small to the current answer.

Now we need to add to the answer : current answer / 2.

Hope that this helps

thang040602 · August 14, 2019, 9:17am

yeah now I understand it. How simple it is. I quite regret that I didn’t figure that out during the contest. Anw thanks for the formula

wjli · August 15, 2019, 11:26pm

Thank you for the mention. Great problem setting alei! I am waiting for the official editorial on SYNBAC

I had a lot of fun improving the solution by 10,000-fold over 9 days but it’s a bit disappointed that my scorebard leadership was taken in the final day.

Here is my unofficial editorial:

If you take a look at the “Test Generation” section on how P1, P2, … Pm is chosen, you can see they are all from the original genome T. T is later scrambled (with reversal and mutation) to be G in the input.

This mean you can always converge P1, P2, … Pm together into one string (regardless of G). In practice, I took several steps:

Step 1:

Hash the 10-character header of each Pi; then go over each 10-character substring of Pi and find if it matches any of the header. If so, join these two strings. (10 was chosen over experiments to minimize hash collision).

For example, if you have one protein AAAABCDEFGHIJKL… and another protein ABCDEFGHIJKLZZZZ… you can connect them into one string of AAAABCDEFGHIJKLZZZ…

Since hash is used, the complexity is linear to the total length of all proteins (about 0.03 seconds in my experience)

The result is when M = 4096, I always ended up with 3 strings. (They were generated off T with 1 nucleotide offset. i.e. every protein starts with 1st, 4th, 7th, 10th nucleotide gets connected using the hash in this step; same for 2nd, 5th, 8th, etc. and 3rd, 6th, 9th, etc.)

When M = 1024, I ended up with 10-15 strings.

Just doing Step 1 will allow 100x improvement over the naive method of connecting the 50 shortest proteins.

Step 2:

Then we need to further consolidate these 3-15 strings into 1. For each two unconnected strings A and B, at every position of A, I have to expand every character (or amino acid) into all possible combination of codons (expressed as a sequence of nucleotides), and then check if I can match the sequence of nucleotides with any possible codon combination of header of B.

This step is pretty complex. It’s very much brute-force by expanding every single position into every possible combination. The time complexity in theory is the square of sum of all string lengths. Luckily, since we are only talking about an alphabet of 4x4x4=64 codons translating into 20 amino acids, the time complexity it’s not too bad. I estimate it’s linear with a reasonable constant factor.

With this step we connected all proteins into one target string T within the total length of G (or 32K letters)

Step 2 will further improve score by a factor of 5-10 over Step 1.

Step 3:

We generated target string T without considering G at all. Now we need to find the best way to transform G into T with minimal cost.

I started from the beginning of T, then try to find the longest appearance of T’s header in string G. For example:

T = ABCDEFG…
G = AAA…ABCDEFG…ZZZ

Then I can use two reversals to transform G to match the header of T.

G0 = AAA…[ABCDEFG…ZZZ]
G1 = [AAA…ZZZ…GFEDCBA]
G2 = [ABCDEFG…ZZZ…AAA]

Note thist technique works even if ABCDEFG is in the middle of T.

Since one side of the reversal is always the end of G, I only need to pay K.

Similarly, I also try to find the reversal of T’s header. For example:

T = ABCDEFG…
G = AAA…GFEDCBA…ZZZ

In this case, I just need to transform using one reversal:

G0 = [AAA…GFEDCBA]…ZZZ
G1 = [ABCDEFG…AAA]…ZZZ

There are also some incremental improvements by allowing up to 1 mutation. For example:

T = ABCDEFG…
G = AAA…GF[Z]DCBA…ZZZ

Which can be transformed by adding one mutation step:

G0 = [AAA…GFZDCBA]…ZZZ
G1 = [ABCDZFG…AAA]…ZZZ
G2 = ABCD[E]FG…AAA…ZZZ

In practice I have to set a minimal match size of 4 before attempting the reversal technique above. Otherwise, I will just use plain-vanilla mutation to match G to T and move on to the next position in T.

That’s it. A lot of fun trying different ways to improve the score. I never imagined there can be an 10,000x improvement over the naive method.

Here is a lightly commented C++ solution:

https://www.codechef.com/viewsolution/25885715

Any question, just let me know.

admin · August 16, 2019, 1:38am

Thanks a lot for explaining your approach, @wjli! Much appreciated.