What is the best algorithm to find the most occurrence of a word in text?

tyrant · August 28, 2013, 5:31am

Does anyone know what is the best algorithm to find the most occurrence of a word in text (very large), so hash map is not a solution I’m looking for. I’m thinking of either “Trie” or “Suffix Array” but I haven’t had a clear idea how to solve it. Any idea or suggestion would be greatly appreciated.

For example: This is is is an example.
Then the most occurrence is “is”.

betlista · August 29, 2013, 9:41pm

What about this pseudo code:

c = first char
while c not text end
    c = get char
    if c terminates word
        get actual word from trie
        increment counter and check if it is max
    else
        move in trie pointer using c

so in your example

after this word

ROOT - t - h - i - s (1) // <- most occurences

after first is

ROOT + t - h - i - s (1) // <- most occurences
     - i - s (1)

after second is

ROOT + t - h - i - s (1)
     - i - s (2) // <- most occurences

…

at the EOF

ROOT + t - h - i - s (1)
     + i - s (3) // <- most occurences
     + a - n (1)
     - e - x - a - m - p - l - e (1)

edit: but this works only if you can split the text to words… If your question is about substrings instead of words, then the question is if you are looking something better than substrings of length 1 (just characters).

avinashmail · June 11, 2014, 8:24pm

you can use KMP or Aco-Corasick both used for pattern matching, if u have large number of pattern match with a a input text of large size u can use Aco-Corasick

japoorv · June 11, 2014, 8:35pm

I think KMP is the best one with O(n+m) refer to Needle in a haystack problem on SPOJ

redd · June 11, 2014, 8:12pm

what could be the reason behind not using STL map? someonoe please give me an example where STL map will be inefficient for this problem. thnks.

redd · June 11, 2014, 8:32pm

the question was asked 1 year ago so i think the op would have got his answer by now. its showing up at the top of the feed because of my cmment “what could be the reason behind not using STL map? someonoe please give me an example where STL map will be inefficient for this problem. thnks.” can you please answer it. also please add some details in ur current answer. i really dont understand how KMP or aho corasick could be used here…

redd · June 11, 2014, 9:03pm

please give some more details.

tad123 · November 29, 2020, 8:24am

For each sentence, for each pair of words <w1,w2> of the sentence, compute the distance (i.e. the number of words between w1 and w2 in the sentence). For instance considering the pair <”this", “example”>, we have two occurrences in s1, 1 occurrence in s2 and 1 occurrence in s3. In s1 distances are 3 and 6, in s2 3 and in s3 1. Finally, the average distance for the pair <”this”, "example”> is 3.25 All pairs, with relative average distance must be written in a table of a relational database

how to solve this problem in python.

which algorithm or methods or function is best solution for this problem ? if there any suitable code please share here.

please help me out for this problem.

Thanks