Thank you for the mention. Great problem setting alei! I am waiting for the official editorial on SYNBAC
I had a lot of fun improving the solution by 10,000-fold over 9 days but it’s a bit disappointed that my scorebard leadership was taken in the final day.
Here is my unofficial editorial:
If you take a look at the “Test Generation” section on how P1, P2, … Pm is chosen, you can see they are all from the original genome T. T is later scrambled (with reversal and mutation) to be G in the input.
This mean you can always converge P1, P2, … Pm together into one string (regardless of G). In practice, I took several steps:
Hash the 10-character header of each Pi; then go over each 10-character substring of Pi and find if it matches any of the header. If so, join these two strings. (10 was chosen over experiments to minimize hash collision).
For example, if you have one protein AAAABCDEFGHIJKL… and another protein ABCDEFGHIJKLZZZZ… you can connect them into one string of AAAABCDEFGHIJKLZZZ…
Since hash is used, the complexity is linear to the total length of all proteins (about 0.03 seconds in my experience)
The result is when M = 4096, I always ended up with 3 strings. (They were generated off T with 1 nucleotide offset. i.e. every protein starts with 1st, 4th, 7th, 10th nucleotide gets connected using the hash in this step; same for 2nd, 5th, 8th, etc. and 3rd, 6th, 9th, etc.)
When M = 1024, I ended up with 10-15 strings.
Just doing Step 1 will allow 100x improvement over the naive method of connecting the 50 shortest proteins.
Then we need to further consolidate these 3-15 strings into 1. For each two unconnected strings A and B, at every position of A, I have to expand every character (or amino acid) into all possible combination of codons (expressed as a sequence of nucleotides), and then check if I can match the sequence of nucleotides with any possible codon combination of header of B.
This step is pretty complex. It’s very much brute-force by expanding every single position into every possible combination. The time complexity in theory is the square of sum of all string lengths. Luckily, since we are only talking about an alphabet of 4x4x4=64 codons translating into 20 amino acids, the time complexity it’s not too bad. I estimate it’s linear with a reasonable constant factor.
With this step we connected all proteins into one target string T within the total length of G (or 32K letters)
Step 2 will further improve score by a factor of 5-10 over Step 1.
We generated target string T without considering G at all. Now we need to find the best way to transform G into T with minimal cost.
I started from the beginning of T, then try to find the longest appearance of T’s header in string G. For example:
T = ABCDEFG…
G = AAA…ABCDEFG…ZZZ
Then I can use two reversals to transform G to match the header of T.
G0 = AAA…[ABCDEFG…ZZZ]
G1 = [AAA…ZZZ…GFEDCBA]
G2 = [ABCDEFG…ZZZ…AAA]
Note thist technique works even if ABCDEFG is in the middle of T.
Since one side of the reversal is always the end of G, I only need to pay K.
Similarly, I also try to find the reversal of T’s header. For example:
T = ABCDEFG…
G = AAA…GFEDCBA…ZZZ
In this case, I just need to transform using one reversal:
G0 = [AAA…GFEDCBA]…ZZZ
G1 = [ABCDEFG…AAA]…ZZZ
There are also some incremental improvements by allowing up to 1 mutation. For example:
T = ABCDEFG…
G = AAA…GF[Z]DCBA…ZZZ
Which can be transformed by adding one mutation step:
G0 = [AAA…GFZDCBA]…ZZZ
G1 = [ABCDZFG…AAA]…ZZZ
G2 = ABCD[E]FG…AAA…ZZZ
In practice I have to set a minimal match size of 4 before attempting the reversal technique above. Otherwise, I will just use plain-vanilla mutation to match G to T and move on to the next position in T.
That’s it. A lot of fun trying different ways to improve the score. I never imagined there can be an 10,000x improvement over the naive method.
Here is a lightly commented C++ solution:
Any question, just let me know.