 # What is the best algorithm to find the most occurrence of a word in text?

Does anyone know what is the best algorithm to find the most occurrence of a word in text (very large), so hash map is not a solution I’m looking for. I’m thinking of either “Trie” or “Suffix Array” but I haven’t had a clear idea how to solve it. Any idea or suggestion would be greatly appreciated.

For example: This is is is an example.
Then the most occurrence is “is”.

What about this pseudo code:

``````c = first char
while c not text end
c = get char
if c terminates word
get actual word from trie
increment counter and check if it is max
else
move in trie pointer using c
``````

so in your example

after this word

``````ROOT - t - h - i - s (1) // <- most occurences
``````

after first is

``````ROOT + t - h - i - s (1) // <- most occurences
- i - s (1)
``````

after second is

``````ROOT + t - h - i - s (1)
- i - s (2) // <- most occurences
``````

at the EOF

``````ROOT + t - h - i - s (1)
+ i - s (3) // <- most occurences
+ a - n (1)
- e - x - a - m - p - l - e (1)
``````

edit: but this works only if you can split the text to words… If your question is about substrings instead of words, then the question is if you are looking something better than substrings of length 1 (just characters).

you can use KMP or Aco-Corasick both used for pattern matching, if u have large number of pattern match with a a input text of large size u can use Aco-Corasick

I think KMP is the best one with O(n+m) refer to Needle in a haystack problem on SPOJ

what could be the reason behind not using STL map? someonoe please give me an example where STL map will be inefficient for this problem. thnks.

the question was asked 1 year ago so i think the op would have got his answer by now. its showing up at the top of the feed because of my cmment “what could be the reason behind not using STL map? someonoe please give me an example where STL map will be inefficient for this problem. thnks.” can you please answer it. also please add some details in ur current answer. i really dont understand how KMP or aho corasick could be used here…

please give some more details.

For each sentence, for each pair of words <w1,w2> of the sentence, compute the distance (i.e. the number of words between w1 and w2 in the sentence). For instance considering the pair <”this", “example”>, we have two occurrences in s1, 1 occurrence in s2 and 1 occurrence in s3. In s1 distances are 3 and 6, in s2 3 and in s3 1. Finally, the average distance for the pair <”this”, "example”> is 3.25 All pairs, with relative average distance must be written in a table of a relational database

how to solve this problem in python.

which algorithm or methods or function is best solution for this problem ? if there any suitable code please share here.

please help me out for this problem.

Thanks