You are not logged in. Please login at www.codechef.com to post your questions!

×

LYRC - Editorial

6
3

PROBLEM LINKS

Practice

Contest

DIFFICULTY

MEDIUM

PREREQUISITES

String Matching, Aho-Corasick, Dynamic Programming

PROBLEM

You are given a dictionary D, of W words.

You are given a set of N strings, S, to search for the frequency of occurance of the W words.

EXPLANATION

This is a classical problem in multiple string searching.

If the task gave a single word, then using Knuth-Morris-Pratt would have yeilded linear time complexity. But, repeating the same approach for each word will lead to O(summa(|strings in D|) + W * summa(|strings in S|)) computations, which is too much for the time limit.

The indended solution requires the use of the Aho-Corasick string matching algorithm. The idea is that you build a data structure for all the given set of words in O(summa(|strings in D|)) computations. This can then be used to search for each occurance of any of the words in the strings in S.

This is primarily an extension of the Knuth-Morris-Pratt structure's failure function, where you allow transitions across different strings on failure - called the suffix arc in the wikipedia article.

The wikipedia article is quite informative and contains further links to help you with implementing this datastructure, including, links to implementation in C.

The time complexity of using the Aho-Corasick algorithm to find each match is O(summa(|strings in D|) + summa(|strings in S|) + summa(frequency of each word)).

The sum of the frequencies of the words is quite large.

count the frequency of words

Thus, we cannot look for individual matches and add them up. We do a clever dynamic programming on the Aho-Corasick data structure.

  • It is important to understand the meaning of the dictionary suffix arc (the reference .fail on each node in the tester's code) to understand the remaining discussion.

Aho-Corasick iterates through the matches by backtracking through the data structure while following only the dictionary suffix arcs. Since they lead to matched dictionary words, the complete backtrack prints every match.

The data structure, along with the dictionary suffix arcs, creates a directed acyclic graph which is expected to be traversed to print each match. Since we don't want to print each match

  • we can defer the backtracking by only storing the count of times each node was reached
  • lazily update the counts on each node by adding them upwards by following the dictionary suffix arcs
  • finally, print the counts on each node representing a matched word

example

Let us say that there are two words

  • ab
  • cab

The Aho-Corasick data structure looks like

There is one dictionary suffix arc in light green.

We process the string acabbab. When we have processed the first 4 characters, the standard algorithm expects to print cab and then follow the dictionary suffix arc and print ab.

But, we only store a count of 1 on the node b that is reached at cab. We refrain from backtracking at this point. We continue and eventually store 1 when we end up on the node b that is reached at ab.

Later, we follow the links and add up the count values by visiting the graph bottom-up in breadth first manner. This means that we add the 1 at the b in cab to the 1 at the b in ab because they hßave a dictionary suffix arc between them.

Refer to the tester's solution for details.

SETTER'S SOLUTION

Can be found here.

TESTER'S SOLUTION

Can be found here.

This question is marked "community wiki".

asked 12 Aug '13, 16:01

gamabunta's gravatar image

1★gamabunta
2.4k128183169
accept rate: 14%

edited 22 Dec '13, 02:03

yashkumar18's gravatar image

5★yashkumar18
82661224

@gamabunta: Just fixed the setter/tester solution links.

(12 Aug '13, 23:13) tyrant2★

the practice and contest links aren't working correctly..

(15 Aug '13, 23:04) ks_dharanish2★

The practice and contest links are link LYRC

(17 Aug '13, 09:14) tusharmakkar083★

Please fix the practice and contest links to below respectively. http://www.codechef.com/submit/LYRC http://www.codechef.com/AUG13/problems/LYRC

(31 Aug '13, 00:18) shiwakant4★

I tried using Aho-Corasick algorithm but got TLE , perhaps because I didn't optimize it . Suffix trees and Knuth-Morris-Pratt also gave me TLE . Finally I did it using Suffix Arrays and it got accepted .

link

answered 12 Aug '13, 17:18

vineetpaliwal's gravatar image

6★vineetpaliwal
12.4k47107171
accept rate: 12%

5

@vineetpaliwal : it seems you have used DC3 algorithm for suffix array generation.could you pls write idea for this algorithm.i am pretty sure many more people along with me will get benifited.

(12 Aug '13, 18:11) sandeepandey2★

Well , suffix array solution is :

1 . build suffix array in O(N) // EDITED

2 . for each pattern use binary search in sorted suffix array for find first occurance in suffix and last occurance in suffix then solution for this pattern is last - first + 1 .

3 . here is implementation

link

answered 12 Aug '13, 17:20

contesant's gravatar image

5★contesant
11014
accept rate: 0%

edited 15 Aug '13, 19:47

I tried that approach but it timed out.

(12 Aug '13, 18:06) bondoc734★

it's DC3 algorithm for suffix array construction and only need o(n) to build.

(12 Aug '13, 18:13) zeulb6★

It teased me that even DC3 gave TLE n I didn't know y.

(12 Aug '13, 18:41) bondoc734★

yes zeulb is right implementation is O(n) but O(nlogn) is enough when n = 5,000,000 = 100 * 50,000 .

(12 Aug '13, 19:34) contesant5★

I also got AC using suffix array. O(n log n) suffix array construction timed out for me but O(n) DC3 algorithm worked.

(12 Aug '13, 20:20) n2n_5★

can you give me a link to your code please?

(13 Aug '13, 02:35) bondoc734★

I also got AC with O(n log n) suffix array. The important observation for me is to create a suffix array for each of the 100 strings, instead of a single suffix array for their concatenation. If not, I get TLE.

(15 Aug '13, 01:54) kevinsogo7★
showing 5 of 7 show all

So the suffix automaton solution will time out?

link

answered 12 Aug '13, 16:18

panditkipython's gravatar image

3★panditkipython
11
accept rate: 0%

As a matter of fact a large number of accepted solutions use Suffix Automaton.

They manage to sneak well into the time limit by binary searching for a word in the sequence of sorted suffixes.

(12 Aug '13, 16:22) gamabunta1★

Could you link some of the solutions which used Suffix Automaton ? Tks

(12 Aug '13, 16:29) tungnk19934★

my solution is giving SIGSEGV in practice link. I'm freeing memory too.

(12 Aug '13, 16:30) panditkipython3★

you need to impement suffix arrray in O(NlogN) then you can get ac with suffix arrays.

(12 Aug '13, 17:16) contesant5★

can't it be done by purely suffix tree (w/o bsearch etc) ?? http://bit.ly/163JdyG (3rd slide)

link

answered 12 Aug '13, 16:58

amitbaranwal53's gravatar image

3★amitbaranwal53
10114
accept rate: 0%

i implemented boyre moore algo .. and i was getting wa .. i dont know on which test case is my solution failing .. pl give me a test case on which it fails and ans for that test case..

http://www.codechef.com/viewsolution/2525073

thanks :)

link

answered 13 Aug '13, 17:57

gauravdrocker's gravatar image

3★gauravdrocker
91246
accept rate: 0%

edited 13 Aug '13, 17:58

can u elaborate aho corasick lil bit more......

link

answered 20 Aug '13, 00:20

akashverma_123's gravatar image

3★akashverma_123
1197
accept rate: 3%

Can somebody, who knows c++ better explain ?

http://www.codechef.com/viewsolution/6842375

http://www.codechef.com/viewsolution/6842373

"#define where _end gives" runtime error, but without Accepted. Why is that?

link

answered 06 May '15, 05:53

madiyar's gravatar image

6★madiyar
211
accept rate: 0%

toggle preview
Preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "title")
  • image?![alt text](/path/img.jpg "title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported
  • mathemetical formulas in Latex between $ symbol

Question tags:

×15,852
×2,212
×1,298
×643
×27
×12

question asked: 12 Aug '13, 16:01

question was seen: 8,439 times

last updated: 06 May '15, 05:53