dna sequence matching problem....

i am getting a trouble finding an approach to solve this problem…

input-output sequences are as follows

input1 : aaagctgctagag

output1 : a3gct2ag2


input2 : aaaaaaagctaagctaag

output2 : a6agcta2ag

input nsequence can be of 10^6 characters and largest continuous patterns will be considered. For example in input2 “agctaagcta” it will not be agcta2gcta but it will be “agcta2”.

any help appreciated.

Extra Examples :

  1. input aabbaabb
    output is aabb2 not a2b2a2b2

  2. input aaaaaaaaabbbbbbbbbaaaaaaaaabbbbbbbbb
    output is a9b9a9b9 not aaaaaaaaabbbbbbbbb2

It shows that smaller the encode it is most likely to be an answer.

1 Like

I would read the input string character by character, and look if the next one is the first of a suffix of already processed data. If so, i increment a counter for this chunk, else i just continue. I don’t know if this method is correct, but that would be my first approach.

@rakeshbubli143: why downvoting ?