SSTORY - Editorial

PROBLEM LINK:

Practice
Contest

Author: Bruno Oliveira
Tester: Mahbubul Hasan
Editorialist: Jingbo Shang

DIFFICULTY:

Medium

PREREQUISITES:

Suffix Array, Suffix Automaton, Binary Search, Hash

PROBLEM:

Find the longest common substring (LCS, consecutive) of two given strings S, T.

We have added test cases before the contest ended. Therefore, we extend the contest for one day.

EXPLANATION:

This problem is a classical problem. The setter is inspired by SPOJ LCS. However, in this problem, the only approach which gets AC is Suffix Automaton, so, by allowing different approaches like Hashing and Suffix Arrays to pass in the Codechef problem. And the main purpose of author is to teach new data structures to many new people

There are several solutions such as Suffix Array, Suffix Automaton, Binary Search + Hash. Here, I would like to introduce two of them, Suffix Automaton and Binary Search + Hash.

Binary Search + Hash

It is clear that the length of the LCS can be binary search. That is, if we have a common substring with length L, we can always find common substring with length less than L. Therefore, the binary search framework is as following:

lower = 0, upper = maxlength + 1; // LCS in [lower, upper).
while (lower + 1 < upper) {
    middle = (lower + upper) / 2;
    if (there is some common substring with length middle) {
        lower = middle;
    } else {
        upper = middle;
    }
}
LCS = lower;

So, the key point here is to check whether there is some common substrings with length middle. A usual way is to adopt Hash. For my preference, I would like to use Rabin-Karp hash, which is treat a string like a MAGIC based number. That is,

Hash(S[0..n-1]) = (S[n-1] * MAGIC^0 + S[n-2] * MAGIC^1 + .. + S[n-1-i] * MAGIC^i + ... ) % MOD

The most convenient point here, is that we can use Hash(S[0…i]) to calculate the Hash(S[l…r]) in O(1) time, with O(N) preparation. That is,

Hash(S[l..r]) = (Hash(S[0..r]) - Hash(S[0..l-1]) * MAGIC^(r-l+1)) % MOD

Therefore, we can get all hash values of substring with length middle from two given strings, and then, check whether there is any overlap. This procedure can be done via Hash Table in O(|S|) or Set (Balanced Binary Search Tree) in O(|S|log|S|). Therefore, Binary Search + Hash can solve this problem in O(|S| log|S|) time.

Suffix Automaton (Thanks to Bruno, the setter)

An automaton M is a 5-tuple which is composed in a very broad and general way by:

  1. A set of states, Q;
  2. A special state, defined as q0, also known as the start state;
  3. A set of accept states, A, which is a subset of Q.
  4. A given input alphabet.
  5. A special function which will allow for transitions between states, known as transition function.

While the theory on finite automaton is very vast and somewhat complex, from the point of view of our problem, we just need to know how to “map” all of the above concepts to what can be called a String-matching Automaton, which is just like a regular automaton with some added information in between the states and with additional characteristics which rely on the texts we intend to match, and also, of course, on the alphabet which is being to used to compose the texts themselves.

Below is a simple drawing of an automation, which accepts strings which end in the pattern “GCAGAGAG” (contains it as a suffix). 8 is the accept state. 0 is the start state. The arrows show the transition function.
alt text

The main idea here is that the automaton will only reach its accepting state (the one on the node 8, circled twice), when the entire pattern is found on the string.

Then, the high level idea, to actually make a matcher based upon an automaton possible is that we will use an “online” algorithm in the sense that we process each character individually by applying the transition function mentioned earlier. When we reach an accepting state, we can simply output the shift with which the accepting state occurred in the original string, and we will have the exact location of the matching.

An automaton in practice

After this very brief introduction about automatons, we are now ready to see some code, which is actually quite simple if the above explanation is taken into account. The main support structure for our automaton is what a state is, which can be defined in C++ as:

struct state {
    int len, link ;
    map < char , int > next; // if the alphabet is small, we can also use the array of constant size.
} ;

where we have the total length of the automaton so far, the link which is responsible to simulate the “connection” between nodes and a map<char, int> next which is responsible for denoting the “next” state in the automaton, composed obviously of a character and a link. Taken into account that the construction of the automaton is done “online”, by adding characters one by one, we can actually do it by extending the existing automaton with a specific character. That is done in the function called, sa_extend, presented below and which stands for suffix automaton extend:

void sa_extend ( char c ) {
    int cur = sz ++ ;
    st [ cur ] . len = st [ last ] . len + 1 ;
    int p ;
    for ( p = last ; p ! = - 1 && ! st [ p ] . next . count ( c ) ; p = st [ p ] . link )
        st [ p ] . next [ c ] = cur ;
    if ( p == - 1 )
        st [ cur ] . link = 0 ;
    else {
        int q = st [ p ] . next [ c ] ;
        if ( st [ p ] . len + 1 == st [ q ] . len )
            st [ cur ] . link = q ;
        else {
            int clone = sz ++ ;
            st [ clone ] . len = st [ p ] . len + 1 ;
            st [ clone ] . next = st [ q ] . next ;
            st [ clone ] . link = st [ q ] . link ;
            for ( ; p ! = - 1 && st [ p ] . next [ c ] == q ; p = st [ p ] . link )
                st [ p ] . next [ c ] = clone ;
            st [ q ] . link = st [ cur ] . link = clone ;
        }
    }
    last = cur ;
}

Now, all there is left to be done is to show how to initialize the automaton, done in the function sa_init by adding a “fake” state which links to nowhere, since the automaton initially only has its start state q0:

void sa_init ( ) {
    sz = last = 0 ;
    st [ 0 ] . len = 0 ;
    st [ 0 ] . link = - 1 ;
    ++ sz ;

` }

To find the LCS of two strings with an automaton, we can augment our variables such that we maintain a current length and a current state and simply choose to perform a transition between states if no match is found, or, we might choose to augment and update a link in order to “tell” to the automaton that we can keep on extending our links because we are in the middle of the matching.

In pseudo-code:

lcs( s, t ) {
    sa_init ( ) ;
    for ( i=0 ; i < length(s) ; ++ i )
        sa_extend ( s [ i ] ) ;
    int v = 0 , l = 0 ,
    best = 0 , bestpos = 0 ;
    for ( i=0 ; i < length (t) ; ++ i ) {
        while ( v && ! st [ v ] . next . count ( t [ i ] ) ) {
            v = st [ v ] . link ;
            l = st [ v ] . length ;
       }
       if ( st [ v ] . next . count ( t [ i ] ) ) {
            v = st [ v ] . next [ t [ i ] ] ;
            ++ l ;
       }
       if ( l > best )
            best = l, bestpos = i ;
   }
   return best_substring ;
}

In summary, we can solve this problem using Suffix Automaton in O(|S|) time.

AUTHOR’S AND TESTER’S SOLUTIONS:

Author’s solution can be found here.
Tester’s solution can be found here.

22 Likes

Can any one post an example with 4 to 6 letters?
For Suffix Automation part.

1 Like

I am quite surprised that editorialist did not mention the suffix array approach. I think suffix array approach is more intuitive. First of all sort all suffixes using suffix array, then you need to do maintain a sliding window of suffixes which belong to 2 different strings, Let the current window be [i, j] then current longest common substring is lcp (suffix at i and suffix at j). Take the maximum over all the windows.

4 Likes

Kudos to problem setter for this nice problem, It teaches me use of suffix array, hashing + binary search + suffix automaton, Quite a lot of stuff =D

4 Likes

I used suffix arrays to solve this question and my solution sustained all retests successfully. The algorithm is:

Construct the suffix array sorted lexicographically.
Construct the lcp array from the suffix array’s consecutive entries.
Mark the entries in lcp array whose suffixes start from different strings.
Find the max length of possible lcs (longest common substring) from the lcp array (equal to the max value of marked entries in lcp array).
Traverse over lcp array again and for all marked entries equal to max value, construct the lcs from the suffix array entries. Insert this lcs into a set. This is a set of possible lcs.
Now traverse over the second string, generate all substrings of length equal to max possible lcs length ( This can be done in O(n) ), search them in the set of possible lcs and break as soon as you find a match. The first match is your answer. This is the lcs which occurs first in the second string.

Link to the solution: CodeChef: Practical coding for everyone

If number of possible lcs are ‘k’, then the complexity of my solution is O(nlgk).

Basically any possible mistakes are avoided because all possible lcs are generated and the one which occurs first in second string is printed always.

Learnt a lot from this question, thanks to @kuruma :slight_smile:

3 Likes

Hello @all,

It’s a joy that I see so many talented programmers enjoying all my problems :slight_smile: Especially people who are definitely more skilled than I am!!

I have also started to delve a bit into Suffix Arrays with this problem and I hope that with time and practice I can start applying these concepts on other contests where I will be just another contestant :slight_smile:

Thanks to everyone for all your enthusiasm and kind words, it means a lot to me,

Bruno Oliveira

16 Likes

@author
I appreciate other approaches mentioned in this editorial but can you provide some tricky cases where the usual Suffix Array solution would fail. I have been continuously getting WA with that approach.

Thanks.

3 Likes

I did something a bit different. What I did was create the generalized suffix tree of s1 and s2, and look for a node with greatest depth that had a leaf from s1 and a leaf from s2 (breaking ties minimizing the index of the node, which is when it was first added). This runs in linear time overall (linear time for creating the suffix tree, then linear time for a DFS). I believe this is the “standard” “textbook” way of doing it with suffix trees, since I remember talking about this in my computational biology class.

7 Likes

Can you give a little details of

int q = st [ p ] . next [ c ] ;
if ( st [ p ] . len + 1 == st [ q ] . len )
st [ cur ] . link = q ;

can any one provide some tricky test case

I used the suffix array, lcp approach. And then searched for the max lcp between consecutive elements in suffix array such that they come from different strings.
I got WA if anyone can help me out where I went wrong, I will be grateful. :slight_smile:
Here is my effort
http://www.codechef.com/viewplaintext/3611533

I tried it using Rabin-Karp hash but am not able to get my solution accepted.
So can anybody explain tester’s solution?
What are vis[], head[], nextcell[]… etc. used for ??
What do mark and cnt denote ??
How do insert() and check() work ??
It is very difficult to decipher such a code without a single comment.

How can I get all the test cases for this problem? In general does codechef publish test cases for the challenge problems?

I got AC after considering these two testcases

abcadlfghijkl

abcdlfabcde

Answer is 3 abc and not 3 dlf

similarly ,

abcmdefghijkl

abcdefabcdl

Answer is 3 abc and not 3 def

3 Likes

sa_extend in suffix automaton is diificult to understand.

1 Like

Please provide a c++ code link for binary search and hashing solution …

I go through this sa_extend() function but i am getting difficulties to understand it.

can anyone please elaborate it in more specific way.

Thanks

can any one please explain me what is happening in sa_extend?
It would be a great help.

any one please explain extendSA() function

Is Suffix Automaton explained in any book/resource in more formal way rather than code ?
I am confused because we don’t have fixed string to create an automaton.