Unofficial Editorial - August Challenge 2019 Div 2

Here’s my solution for CHGORAM: CodeChef: Practical coding for everyone

Contains heavily-commented code, plus a high-level overview of the approach used and why it works.

It’s probably about the most detailed resource on this question until the official Editorial comes out :confused:

Let’s take the exemple of the question :

1
4
2 1 3
1 2
2 3
2 4

My algorithm for this exemple :

Take a node (let’s say 2). Add 2 to the indexed set.

Then, I will go through all of the unvisited neighbours of 2. Add them (node 1,3 and 4) to the indexed set and calculate* the number of answers for them. After this we can calculate* the number of answers for 2.

(*) We can see that we can calculate the number of solutions for a current node (taking it as the middle), knowing the number of nodes bigger than the current node and smaller than the current node.

My solution CodeChef: Practical coding for everyone
(functions : function isStar is for checking if the graph is a star and function star is to calculate the star case in O(n))

Did the setter’s notes not help?

Ah - I didn’t realise there were any (yet) for CHGORAM - whereabouts are they? :slight_smile:

1 Like

I post setter’s notes of editorials I was unable to complete so that nobody loses their enthusiasm. :slight_smile:

1 Like

Ohhhhh … I honestly didn’t see that, sorry. Yes, that looks like it will be very helpful! :slight_smile:

PS - for ZOMBCAV, the editorial gives O(N) time, which seems to neglect the O(N log N) from sorting the radiation levels and zombie healths (?)

Although, we can get a properly O(N) time solution with a little bit of thought, I think - exercise for the reader :slight_smile: (what is the max value of a radiation level? What, then, would be the max zombie health if all zombies could be killed? Can we use this fact to check if zombie healths are a permutation of radiation levels in O(N)?)

Right, setter’s solution is O(NLogN). Oversight from me - I got confused a bit as O(N) was possible and did not pay attention to setter’s solution a lot :stuck_out_tongue:

1 Like

@viju123 how soon can we expect the editorials for CHGORAM and CSTREE? For a beginner like me an explanation from complete scratch is more helpful, though I appreciate the efforts of @ssjgz and @tomsyd.

1 Like

CSTREE was the hardest problem of the set. Although I will try to explain it the best I can, you wont benefit much if you don’t fulfil the pre requisites.

For starters, try to grasp Kirchoff’s law of spanning tree and cayley’s formula (generalised one as well).

1 Like

What exactly are the pre-requisites? Just Kirchhoff’s Law and Cayley’s formula? Though it was my third contest on Codechef I have some fair bit of experience solving problems and I was able to solve 5 problems from Div2(till KS1) and want to give a shot to the remaining problems. :slight_smile:

1 Like

No, the hardest was SYNBAC :wink: it has less ACs than the hardest, and even people that solved it got 0.01 points.

3 Likes

True​:disappointed_relieved::disappointed_relieved: spent 5 full days for 0.02 pts…can you tell the idea behind solution that got >1 score?

I have some general ideas, but I’m not sure how much they scores. The tie-break problem is kind of a real world problem (It probably has applications in bioinformatics).

If we know the original genome T (T is defined in test generation section), then we have to find a way to transform G to T using reversals and mutations. Using only reversals is known as “sort reversals” there are some studies about it http://fileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/SortRevPres.pdf

The problem is that we don’t know T, we can use the smallest string that contains all proteins (this problem is similar to the DNA sequencing, for which there are some algorithms that use eulerian paths). Also there are some details because the same amino acid can be generated in distinct ways.

Another idea is to find the best position to divide P in chunks, generate each chunk with mutations and then move the chunks with reversals.

I noticed that the solution of wjli generate all proteins. I didn’t read his code in detail though.

Plagiarism alert :stuck_out_tongue: . These notes are too familiar hehehe

BTW, on a lighter note, you’re too happy that the tie breaker problem turned out to be hardest :stuck_out_tongue:
Hehehehe XD

Agreed, Challenge was the hardest problem. Spent 19 submissions and 5 and a half day to get first AC, and overall 73 submissions to get 0.0098 score. :expressionless:

This is in contrast to 39 submissions on all remaining 10 problems combined within three and a half day (Including 23 submissions on TSEQ alone).

Can you explain how to calculate the result knowing the number of numbers bigger than the current node and the number of numbers smaller than the current node. I did know how to use fenwick tree and dfs to find the number of numbers bigger and smaller than the current node but I couldn’t find the formula for the result :frowning:

1 Like

Suppose tot_small, the total number of nodes smaller than the current node and tot_big the number of nodes bigger than the current node. tot_small and tot_big are trivial to calculate because the statement says that all nodes are numbered from 1 to N.

Now for each subtree of the current node, we will calculate the number of nodes bigger and smaller than the current node. Let’s call them big and small. Here, we will have 3 cases to do, p2 = 1, p2 = 2, p2 =3.

If p2 == 1, than we will add (tot_big - big) * big to the current answer.
If p2 == 2 than we will add (tot_big - big) * small + (tot_small - small) * big to the current answer.
If p2 == 3 than we will add (tot_small - small) * small to the current answer.

Now we need to add to the answer : current answer / 2.

Hope that this helps :smile:

yeah now I understand it. How simple it is. I quite regret that I didn’t figure that out during the contest. Anw thanks for the formula :smiley:

Thank you for the mention. Great problem setting alei! I am waiting for the official editorial on SYNBAC :slight_smile:

I had a lot of fun improving the solution by 10,000-fold over 9 days but it’s a bit disappointed that my scorebard leadership was taken in the final day.

Here is my unofficial editorial:

If you take a look at the “Test Generation” section on how P1, P2, … Pm is chosen, you can see they are all from the original genome T. T is later scrambled (with reversal and mutation) to be G in the input.

This mean you can always converge P1, P2, … Pm together into one string (regardless of G). In practice, I took several steps:

Step 1:

Hash the 10-character header of each Pi; then go over each 10-character substring of Pi and find if it matches any of the header. If so, join these two strings. (10 was chosen over experiments to minimize hash collision).

For example, if you have one protein AAAABCDEFGHIJKL… and another protein ABCDEFGHIJKLZZZZ… you can connect them into one string of AAAABCDEFGHIJKLZZZ…

Since hash is used, the complexity is linear to the total length of all proteins (about 0.03 seconds in my experience)

The result is when M = 4096, I always ended up with 3 strings. (They were generated off T with 1 nucleotide offset. i.e. every protein starts with 1st, 4th, 7th, 10th nucleotide gets connected using the hash in this step; same for 2nd, 5th, 8th, etc. and 3rd, 6th, 9th, etc.)

When M = 1024, I ended up with 10-15 strings.

Just doing Step 1 will allow 100x improvement over the naive method of connecting the 50 shortest proteins.

Step 2:

Then we need to further consolidate these 3-15 strings into 1. For each two unconnected strings A and B, at every position of A, I have to expand every character (or amino acid) into all possible combination of codons (expressed as a sequence of nucleotides), and then check if I can match the sequence of nucleotides with any possible codon combination of header of B.

This step is pretty complex. It’s very much brute-force by expanding every single position into every possible combination. The time complexity in theory is the square of sum of all string lengths. Luckily, since we are only talking about an alphabet of 4x4x4=64 codons translating into 20 amino acids, the time complexity it’s not too bad. I estimate it’s linear with a reasonable constant factor.

With this step we connected all proteins into one target string T within the total length of G (or 32K letters)

Step 2 will further improve score by a factor of 5-10 over Step 1.

Step 3:

We generated target string T without considering G at all. Now we need to find the best way to transform G into T with minimal cost.

I started from the beginning of T, then try to find the longest appearance of T’s header in string G. For example:

T = ABCDEFG…
G = AAA…ABCDEFG…ZZZ

Then I can use two reversals to transform G to match the header of T.

G0 = AAA…[ABCDEFG…ZZZ]
G1 = [AAA…ZZZ…GFEDCBA]
G2 = [ABCDEFG…ZZZ…AAA]

Note thist technique works even if ABCDEFG is in the middle of T.

Since one side of the reversal is always the end of G, I only need to pay K.

Similarly, I also try to find the reversal of T’s header. For example:

T = ABCDEFG…
G = AAA…GFEDCBA…ZZZ

In this case, I just need to transform using one reversal:

G0 = [AAA…GFEDCBA]…ZZZ
G1 = [ABCDEFG…AAA]…ZZZ

There are also some incremental improvements by allowing up to 1 mutation. For example:

T = ABCDEFG…
G = AAA…GF[Z]DCBA…ZZZ

Which can be transformed by adding one mutation step:

G0 = [AAA…GFZDCBA]…ZZZ
G1 = [ABCDZFG…AAA]…ZZZ
G2 = ABCD[E]FG…AAA…ZZZ

In practice I have to set a minimal match size of 4 before attempting the reversal technique above. Otherwise, I will just use plain-vanilla mutation to match G to T and move on to the next position in T.

That’s it. A lot of fun trying different ways to improve the score. I never imagined there can be an 10,000x improvement over the naive method.

Here is a lightly commented C++ solution:

https://www.codechef.com/viewsolution/25885715

Any question, just let me know.

4 Likes

Thanks a lot for explaining your approach, @wjli! Much appreciated.

1 Like