You are not logged in. Please login at www.codechef.com to post your questions!

×

# SSTORY - Editorial

Author: Bruno Oliveira
Tester: Mahbubul Hasan
Editorialist: Jingbo Shang

Medium

# PREREQUISITES:

Suffix Array, Suffix Automaton, Binary Search, Hash

# PROBLEM:

Find the longest common substring (LCS, consecutive) of two given strings S, T.

We have added test cases before the contest ended. Therefore, we extend the contest for one day.

# EXPLANATION:

This problem is a classical problem. The setter is inspired by SPOJ LCS. However, in this problem, the only approach which gets AC is Suffix Automaton, so, by allowing different approaches like Hashing and Suffix Arrays to pass in the Codechef problem. And the main purpose of author is to teach new data structures to many new people

There are several solutions such as Suffix Array, Suffix Automaton, Binary Search + Hash. Here, I would like to introduce two of them, Suffix Automaton and Binary Search + Hash.

## Binary Search + Hash

It is clear that the length of the LCS can be binary search. That is, if we have a common substring with length L, we can always find common substring with length less than L. Therefore, the binary search framework is as following:

lower = 0, upper = maxlength + 1; // LCS in [lower, upper).
while (lower + 1 < upper) {
middle = (lower + upper) / 2;
if (there is some common substring with length middle) {
lower = middle;
} else {
upper = middle;
}
}
LCS = lower;

So, the key point here is to check whether there is some common substrings with length middle. A usual way is to adopt Hash. For my preference, I would like to use Rabin-Karp hash, which is treat a string like a MAGIC based number. That is,

Hash(S[0..n-1]) = (S[n-1] * MAGIC^0 + S[n-2] * MAGIC^1 + .. + S[n-1-i] * MAGIC^i + ... ) % MOD

The most convenient point here, is that we can use Hash(S[0..i]) to calculate the Hash(S[l..r]) in O(1) time, with O(N) preparation. That is,

Hash(S[l..r]) = (Hash(S[0..r]) - Hash(S[0..l-1]) * MAGIC^(r-l+1)) % MOD

Therefore, we can get all hash values of substring with length middle from two given strings, and then, check whether there is any overlap. This procedure can be done via Hash Table in O(|S|) or Set (Balanced Binary Search Tree) in O(|S|log|S|). Therefore, Binary Search + Hash can solve this problem in O(|S| log|S|) time.

## Suffix Automaton (Thanks to Bruno, the setter)

An automaton M is a 5-tuple which is composed in a very broad and general way by:

1. A set of states, Q;
2. A special state, defined as q0, also known as the start state;
3. A set of accept states, A, which is a subset of Q.
4. A given input alphabet.
5. A special function which will allow for transitions between states, known as transition function.

While the theory on finite automaton is very vast and somewhat complex, from the point of view of our problem, we just need to know how to "map" all of the above concepts to what can be called a String-matching Automaton, which is just like a regular automaton with some added information in between the states and with additional characteristics which rely on the texts we intend to match, and also, of course, on the alphabet which is being to used to compose the texts themselves.

Below is a simple drawing of an automation, which accepts strings which end in the pattern "GCAGAGAG" (contains it as a suffix). 8 is the accept state. 0 is the start state. The arrows show the transition function.

The main idea here is that the automaton will only reach its accepting state (the one on the node 8, circled twice), when the entire pattern is found on the string.

Then, the high level idea, to actually make a matcher based upon an automaton possible is that we will use an “online” algorithm in the sense that we process each character individually by applying the transition function mentioned earlier. When we reach an accepting state, we can simply output the shift with which the accepting state occurred in the original string, and we will have the exact location of the matching.

### An automaton in practice

After this very brief introduction about automatons, we are now ready to see some code, which is actually quite simple if the above explanation is taken into account. The main support structure for our automaton is what a state is, which can be defined in C++ as:

struct state {
int len, link ;
map < char , int > next; // if the alphabet is small, we can also use the array of constant size.
} ;

where we have the total length of the automaton so far, the link which is responsible to simulate the "connection" between nodes and a map<char, int=""> next which is responsible for denoting the "next" state in the automaton, composed obviously of a character and a link. Taken into account that the construction of the automaton is done “online”, by adding characters one by one, we can actually do it by extending the existing automaton with a specific character. That is done in the function called, sa_extend, presented below and which stands for suffix automaton extend:

void sa_extend ( char c ) {
int cur = sz ++ ;
st [ cur ] . len = st [ last ] . len + 1 ;
int p ;
for ( p = last ; p ! = - 1 && ! st [ p ] . next . count ( c ) ; p = st [ p ] . link )
st [ p ] . next [ c ] = cur ;
if ( p == - 1 )
st [ cur ] . link = 0 ;
else {
int q = st [ p ] . next [ c ] ;
if ( st [ p ] . len + 1 == st [ q ] . len )
st [ cur ] . link = q ;
else {
int clone = sz ++ ;
st [ clone ] . len = st [ p ] . len + 1 ;
st [ clone ] . next = st [ q ] . next ;
st [ clone ] . link = st [ q ] . link ;
for ( ; p ! = - 1 && st [ p ] . next [ c ] == q ; p = st [ p ] . link )
st [ p ] . next [ c ] = clone ;
st [ q ] . link = st [ cur ] . link = clone ;
}
}
last = cur ;
}

Now, all there is left to be done is to show how to initialize the automaton, done in the function sa_init by adding a "fake" state which links to nowhere, since the automaton initially only has its start state q0:

void sa_init ( ) {
sz = last = 0 ;
st [ 0 ] . len = 0 ;
st [ 0 ] . link = - 1 ;
++ sz ;

` }

To find the LCS of two strings with an automaton, we can augment our variables such that we maintain a current length and a current state and simply choose to perform a transition between states if no match is found, or, we might choose to augment and update a link in order to “tell” to the automaton that we can keep on extending our links because we are in the middle of the matching.

In pseudo-code:

lcs( s, t ) {
sa_init ( ) ;
for ( i=0 ; i < length(s) ; ++ i )
sa_extend ( s [ i ] ) ;
int v = 0 , l = 0 ,
best = 0 , bestpos = 0 ;
for ( i=0 ; i < length (t) ; ++ i ) {
while ( v && ! st [ v ] . next . count ( t [ i ] ) ) {
v = st [ v ] . link ;
l = st [ v ] . length ;
}
if ( st [ v ] . next . count ( t [ i ] ) ) {
v = st [ v ] . next [ t [ i ] ] ;
++ l ;
}
if ( l > best )
best = l, bestpos = i ;
}
return best_substring ;
}

In summary, we can solve this problem using Suffix Automaton in O(|S|) time.

# AUTHOR'S AND TESTER'S SOLUTIONS:

Author's solution can be found here.
Tester's solution can be found here.

This question is marked "community wiki".

asked 18 Mar '14, 20:41

161446375
accept rate: 0%

19.7k350498541

5

Problems like these are the reason I love Codechef long contest. Educative problem backed by a superb editorial.

(18 Mar '14, 22:30) 3★
2

Those 2 pages of wrong answers taught me a lot of new things :) by the HARD way.. Grats to bruno!

(19 Mar '14, 00:31) 5★

 17 Hello @all, It's a joy that I see so many talented programmers enjoying all my problems :) Especially people who are definitely more skilled than I am!! I have also started to delve a bit into Suffix Arrays with this problem and I hope that with time and practice I can start applying these concepts on other contests where I will be just another contestant :) Thanks to everyone for all your enthusiasm and kind words, it means a lot to me, Bruno Oliveira answered 18 Mar '14, 23:16 3★kuruma 17.7k●72●143●209 accept rate: 8%
 6 I did something a bit different. What I did was create the generalized suffix tree of s1 and s2, and look for a node with greatest depth that had a leaf from s1 and a leaf from s2 (breaking ties minimizing the index of the node, which is when it was first added). This runs in linear time overall (linear time for creating the suffix tree, then linear time for a DFS). I believe this is the "standard" "textbook" way of doing it with suffix trees, since I remember talking about this in my computational biology class. answered 19 Mar '14, 02:20 7★lg5293 511●2●14 accept rate: 10%
 4 I am quite surprised that editorialist did not mention the suffix array approach. I think suffix array approach is more intuitive. First of all sort all suffixes using suffix array, then you need to do maintain a sliding window of suffixes which belong to 2 different strings, Let the current window be [i, j] then current longest common substring is lcp (suffix at i and suffix at j). Take the maximum over all the windows. answered 18 Mar '14, 22:01 2.5k●52●136●170 accept rate: 20% 2 I agree with you. The setter want me to emphasize the Suffix Automaton. So I omit the Suffix Array solution. (18 Mar '14, 23:53) 3 We can still have the other approach mentioned. The Editorial should be as elaborate as possible and may capture multiple ways of solving the problem. It is also made a community wiki so that anyone can edit/add to it. (20 Mar '14, 22:29) admin ♦♦0★
 4 Kudos to problem setter for this nice problem, It teaches me use of suffix array, hashing + binary search + suffix automaton, Quite a lot of stuff =D answered 18 Mar '14, 22:02 2.5k●52●136●170 accept rate: 20%
 3 I used suffix arrays to solve this question and my solution sustained all retests successfully. The algorithm is: Construct the suffix array sorted lexicographically. Construct the lcp array from the suffix array's consecutive entries. Mark the entries in lcp array whose suffixes start from different strings. Find the max length of possible lcs (longest common substring) from the lcp array (equal to the max value of marked entries in lcp array). Traverse over lcp array again and for all marked entries equal to max value, construct the lcs from the suffix array entries. Insert this lcs into a set. This is a set of possible lcs. Now traverse over the second string, generate all substrings of length equal to max possible lcs length ( This can be done in O(n) ), search them in the set of possible lcs and break as soon as you find a match. The first match is your answer. This is the lcs which occurs first in the second string. Link to the solution: http://www.codechef.com/viewsolution/3549875 If number of possible lcs are 'k', then the complexity of my solution is O(nlgk). Basically any possible mistakes are avoided because all possible lcs are generated and the one which occurs first in second string is printed always. Learnt a lot from this question, thanks to @kuruma :) answered 18 Mar '14, 23:08 783●5●10●20 accept rate: 0% @rishul_nsit can you please explain this with help of one example. thankx in advance (16 Apr '14, 19:48) zealf4★
 3 @author I appreciate other approaches mentioned in this editorial but can you provide some tricky cases where the usual Suffix Array solution would fail. I have been continuously getting WA with that approach. Thanks. answered 19 Mar '14, 01:03 201●2●8 accept rate: 33% Try going trough this link, maybe it will help you out: http://discuss.codechef.com/questions/39834/wa-in-sstory-suffix-arrays (19 Mar '14, 01:23) kuruma3★ @kuruma Thanks but my code works on all the test cases mentioned there. Can you provide some tricky case? (19 Mar '14, 01:52)
 2 I got AC after considering these two testcases abcadlfghijkl abcdlfabcde Answer is 3 abc and not 3 dlf similarly , abcmdefghijkl abcdefabcdl Answer is 3 abc and not 3 def answered 23 Mar '14, 00:15 3★hatim009 464●2●11●22 accept rate: 8% One can create many testcases based on above cases..... (23 Mar '14, 00:40) hatim0093★
 1 sa_extend in suffix automaton is diificult to understand. answered 29 Mar '14, 04:56 14●2 accept rate: 0%
 0 Can any one post an example with 4 to 6 letters? For Suffix Automation part. answered 18 Mar '14, 21:55 2★jangwa 172●2●9 accept rate: 10%

## Can you give a little details of

int q = st [ p ] . next [ c ] ; if ( st [ p ] . len + 1 == st [ q ] . len ) st [ cur ] . link = q ;

answered 19 Mar '14, 13:48

2★jangwa
17229
accept rate: 10%

Try to look at my code along with explanation :)

(19 Mar '14, 14:01) 3★
 0 can any one provide some tricky test case answered 19 Mar '14, 17:57 4★mca_123 15●1 accept rate: 0%
 0 I used the suffix array, lcp approach. And then searched for the max lcp between consecutive elements in suffix array such that they come from different strings. I got WA if anyone can help me out where I went wrong, I will be grateful. :) Here is my effort http://www.codechef.com/viewplaintext/3611533 answered 19 Mar '14, 20:57 99●3●5●9 accept rate: 0% 1 Check your solution for this test case s1="aecd" and s2="acad" .. the answer should be a but your solution gives c.. (20 Mar '14, 01:39)
 0 I tried it using Rabin-Karp hash but am not able to get my solution accepted. So can anybody explain tester's solution? What are vis[], head[], nextcell[]... etc. used for ?? What do mark and cnt denote ?? How do insert() and check() work ?? It is very difficult to decipher such a code without a single comment. answered 20 Mar '14, 00:31 1●1●2●3 accept rate: 0%
 0 How can I get all the test cases for this problem? In general does codechef publish test cases for the challenge problems? answered 22 Mar '14, 20:23 1 accept rate: 0%
 0 Please provide a c++ code link for binary search and hashing solution ... answered 01 Apr '14, 19:48 3★yoyosonu 31●1●1●4 accept rate: 0% 1 Hi! You can see tester's solution :D (01 Apr '14, 20:23) kuruma3★
 0 I go through this sa_extend() function but i am getting difficulties to understand it. can anyone please elaborate it in more specific way. Thanks answered 19 Apr '14, 00:39 404●3●9●15 accept rate: 23%
 0 can any one please explain me what is happening in sa_extend? It would be a great help. answered 26 May '15, 15:12 15●1●9 accept rate: 0%
 0 any one please explain extendSA() function answered 27 May '15, 00:12 15●1●9 accept rate: 0%
 0 Is Suffix Automaton explained in any book/resource in more formal way rather than code ? I am confused because we don't have fixed string to create an automaton. answered 10 Mar '16, 12:54 1●1 accept rate: 0%
 0 Another solution using Suffix Array can be found here : http://www.roman10.net/2012/03/16/suffix-array-part-3-longest-common-substring-lcs/ answered 28 Mar '17, 12:23 3★rohitjv -1●1●1●2 accept rate: 0% This algorithm itself won't work. We need to separate the 2 strings with a special character. (28 Mar '17, 12:33) rohitjv3★ http://lpcs.math.msu.su/~pritykin/csr2008presentations/starikovskaya.pdf (28 Mar '17, 12:42) rohitjv3★
 0 answered 30 Mar '17, 04:59 1 accept rate: 0%
 0 How to find LCS given Suffix Automaton? If the first character of the pattern doesn't exist in the text, how would the state change? Btw, my solution with Suffix Arrays gets TLE https://www.codechef.com/viewsolution/14290941. answered 22 Jun '17, 13:29 4★asgarj 1 accept rate: 0%
 toggle preview community wiki:
Preview

### Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

Markdown Basics

• *italic* or _italic_
• **bold** or __bold__
• image?![alt text](/path/img.jpg "title")
• numbered list: 1. Foo 2. Bar
• to add a line break simply add two spaces to where you would like the new line to be.
• basic HTML tags are also supported
• mathemetical formulas in Latex between \$ symbol

Question tags:

×15,482
×2,557
×1,023
×342
×90
×20
×7

question asked: 18 Mar '14, 20:41

question was seen: 12,096 times

last updated: 23 Jul '17, 12:31