PROBLEM LINK:Author: Bruno Oliveira DIFFICULTY:Medium PREREQUISITES:Suffix Array, Suffix Automaton, Binary Search, Hash PROBLEM:Find the longest common substring (LCS, consecutive) of two given strings S, T. We have added test cases before the contest ended. Therefore, we extend the contest for one day. EXPLANATION:This problem is a classical problem. The setter is inspired by SPOJ LCS. However, in this problem, the only approach which gets AC is Suffix Automaton, so, by allowing different approaches like Hashing and Suffix Arrays to pass in the Codechef problem. And the main purpose of author is to teach new data structures to many new people There are several solutions such as Suffix Array, Suffix Automaton, Binary Search + Hash. Here, I would like to introduce two of them, Suffix Automaton and Binary Search + Hash. Binary Search + HashIt is clear that the length of the LCS can be binary search. That is, if we have a common substring with length L, we can always find common substring with length less than L. Therefore, the binary search framework is as following:
So, the key point here is to check whether there is some common substrings with length middle. A usual way is to adopt Hash. For my preference, I would like to use RabinKarp hash, which is treat a string like a MAGIC based number. That is,
The most convenient point here, is that we can use Hash(S[0..i]) to calculate the Hash(S[l..r]) in O(1) time, with O(N) preparation. That is,
Therefore, we can get all hash values of substring with length middle from two given strings, and then, check whether there is any overlap. This procedure can be done via Hash Table in O(S) or Set (Balanced Binary Search Tree) in O(SlogS). Therefore, Binary Search + Hash can solve this problem in O(S logS) time. Suffix Automaton (Thanks to Bruno, the setter)An automaton M is a 5tuple which is composed in a very broad and general way by:
While the theory on finite automaton is very vast and somewhat complex, from the point of view of our problem, we just need to know how to "map" all of the above concepts to what can be called a Stringmatching Automaton, which is just like a regular automaton with some added information in between the states and with additional characteristics which rely on the texts we intend to match, and also, of course, on the alphabet which is being to used to compose the texts themselves. Below is a simple drawing of an automation, which accepts strings which end in the pattern "GCAGAGAG" (contains it as a suffix). 8 is the accept state. 0 is the start state. The arrows show the transition function. The main idea here is that the automaton will only reach its accepting state (the one on the node 8, circled twice), when the entire pattern is found on the string. Then, the high level idea, to actually make a matcher based upon an automaton possible is that we will use an “online” algorithm in the sense that we process each character individually by applying the transition function mentioned earlier. When we reach an accepting state, we can simply output the shift with which the accepting state occurred in the original string, and we will have the exact location of the matching. An automaton in practiceAfter this very brief introduction about automatons, we are now ready to see some code, which is actually quite simple if the above explanation is taken into account. The main support structure for our automaton is what a state is, which can be defined in C++ as:
where we have the total length of the automaton so far, the link which is responsible to simulate the "connection" between nodes and a map<char, int=""> next which is responsible for denoting the "next" state in the automaton, composed obviously of a character and a link. Taken into account that the construction of the automaton is done “online”, by adding characters one by one, we can actually do it by extending the existing automaton with a specific character. That is done in the function called, sa_extend, presented below and which stands for suffix automaton extend:
Now, all there is left to be done is to show how to initialize the automaton, done in the function sa_init by adding a "fake" state which links to nowhere, since the automaton initially only has its start state q0:
` } To find the LCS of two strings with an automaton, we can augment our variables such that we maintain a current length and a current state and simply choose to perform a transition between states if no match is found, or, we might choose to augment and update a link in order to “tell” to the automaton that we can keep on extending our links because we are in the middle of the matching. In pseudocode:
In summary, we can solve this problem using Suffix Automaton in O(S) time. AUTHOR'S AND TESTER'S SOLUTIONS:Author's solution can be found here.
This question is marked "community wiki".
asked 18 Mar '14, 20:41

Hello @all, It's a joy that I see so many talented programmers enjoying all my problems :) Especially people who are definitely more skilled than I am!! I have also started to delve a bit into Suffix Arrays with this problem and I hope that with time and practice I can start applying these concepts on other contests where I will be just another contestant :) Thanks to everyone for all your enthusiasm and kind words, it means a lot to me, Bruno Oliveira answered 18 Mar '14, 23:16

I did something a bit different. What I did was create the generalized suffix tree of s1 and s2, and look for a node with greatest depth that had a leaf from s1 and a leaf from s2 (breaking ties minimizing the index of the node, which is when it was first added). This runs in linear time overall (linear time for creating the suffix tree, then linear time for a DFS). I believe this is the "standard" "textbook" way of doing it with suffix trees, since I remember talking about this in my computational biology class. answered 19 Mar '14, 02:20

Kudos to problem setter for this nice problem, It teaches me use of suffix array, hashing + binary search + suffix automaton, Quite a lot of stuff =D answered 18 Mar '14, 22:02

I used suffix arrays to solve this question and my solution sustained all retests successfully. The algorithm is: Construct the suffix array sorted lexicographically. Construct the lcp array from the suffix array's consecutive entries. Mark the entries in lcp array whose suffixes start from different strings. Find the max length of possible lcs (longest common substring) from the lcp array (equal to the max value of marked entries in lcp array). Traverse over lcp array again and for all marked entries equal to max value, construct the lcs from the suffix array entries. Insert this lcs into a set. This is a set of possible lcs. Now traverse over the second string, generate all substrings of length equal to max possible lcs length ( This can be done in O(n) ), search them in the set of possible lcs and break as soon as you find a match. The first match is your answer. This is the lcs which occurs first in the second string. Link to the solution: http://www.codechef.com/viewsolution/3549875 If number of possible lcs are 'k', then the complexity of my solution is O(nlgk). Basically any possible mistakes are avoided because all possible lcs are generated and the one which occurs first in second string is printed always. Learnt a lot from this question, thanks to @kuruma :) answered 18 Mar '14, 23:08
@rishul_nsit can you please explain this with help of one example. thankx in advance
(16 Apr '14, 19:48)

@author I appreciate other approaches mentioned in this editorial but can you provide some tricky cases where the usual Suffix Array solution would fail. I have been continuously getting WA with that approach. Thanks. answered 19 Mar '14, 01:03
Try going trough this link, maybe it will help you out: http://discuss.codechef.com/questions/39834/wainsstorysuffixarrays
(19 Mar '14, 01:23)
@kuruma Thanks but my code works on all the test cases mentioned there. Can you provide some tricky case?
(19 Mar '14, 01:52)

I got AC after considering these two testcases abcadlfghijkl abcdlfabcde Answer is 3 abc and not 3 dlf similarly , abcmdefghijkl abcdefabcdl Answer is 3 abc and not 3 def answered 23 Mar '14, 00:15

sa_extend in suffix automaton is diificult to understand. answered 29 Mar '14, 04:56

Can any one post an example with 4 to 6 letters? For Suffix Automation part. answered 18 Mar '14, 21:55

Can you give a little details ofint q = st [ p ] . next [ c ] ; if ( st [ p ] . len + 1 == st [ q ] . len ) st [ cur ] . link = q ; answered 19 Mar '14, 13:48

I used the suffix array, lcp approach. And then searched for the max lcp between consecutive elements in suffix array such that they come from different strings. I got WA if anyone can help me out where I went wrong, I will be grateful. :) Here is my effort http://www.codechef.com/viewplaintext/3611533 answered 19 Mar '14, 20:57
1
Check your solution for this test case s1="aecd" and s2="acad" .. the answer should be a but your solution gives c..
(20 Mar '14, 01:39)

I tried it using RabinKarp hash but am not able to get my solution accepted.
So can anybody explain tester's solution? answered 20 Mar '14, 00:31

How can I get all the test cases for this problem? In general does codechef publish test cases for the challenge problems? answered 22 Mar '14, 20:23

Please provide a c++ code link for binary search and hashing solution ... answered 01 Apr '14, 19:48

I go through this sa_extend() function but i am getting difficulties to understand it. can anyone please elaborate it in more specific way. Thanks answered 19 Apr '14, 00:39

can any one please explain me what is happening in sa_extend? It would be a great help. answered 26 May '15, 15:12

Is Suffix Automaton explained in any book/resource in more formal way rather than code ? I am confused because we don't have fixed string to create an automaton. answered 10 Mar '16, 12:54

Another solution using Suffix Array can be found here : http://www.roman10.net/2012/03/16/suffixarraypart3longestcommonsubstringlcs/ answered 28 Mar '17, 12:23
This algorithm itself won't work. We need to separate the 2 strings with a special character.
(28 Mar '17, 12:33)
(28 Mar '17, 12:42)

answered 30 Mar '17, 04:59

How to find LCS given Suffix Automaton? If the first character of the pattern doesn't exist in the text, how would the state change? Btw, my solution with Suffix Arrays gets TLE https://www.codechef.com/viewsolution/14290941. answered 22 Jun '17, 13:29

Problems like these are the reason I love Codechef long contest. Educative problem backed by a superb editorial.
Those 2 pages of wrong answers taught me a lot of new things :) by the HARD way.. Grats to bruno!