Does anyone know what is the best algorithm to find the most occurrence of a word in text (very large), so hash map is not a solution I’m looking for. I’m thinking of either “Trie” or “Suffix Array” but I haven’t had a clear idea how to solve it. Any idea or suggestion would be greatly appreciated.
For example: This is is is an example.
Then the most occurrence is “is”.
c = first char
while c not text end
c = get char
if c terminates word
get actual word from trie
increment counter and check if it is max
else
move in trie pointer using c
so in your example
after this word
ROOT - t - h - i - s (1) // <- most occurences
after first is
ROOT + t - h - i - s (1) // <- most occurences
- i - s (1)
after second is
ROOT + t - h - i - s (1)
- i - s (2) // <- most occurences
…
at the EOF
ROOT + t - h - i - s (1)
+ i - s (3) // <- most occurences
+ a - n (1)
- e - x - a - m - p - l - e (1)
edit: but this works only if you can split the text to words… If your question is about substrings instead of words, then the question is if you are looking something better than substrings of length 1 (just characters).
you can use KMP or Aco-Corasick both used for pattern matching, if u have large number of pattern match with a a input text of large size u can use Aco-Corasick
the question was asked 1 year ago so i think the op would have got his answer by now. its showing up at the top of the feed because of my cmment “what could be the reason behind not using STL map? someonoe please give me an example where STL map will be inefficient for this problem. thnks.” can you please answer it. also please add some details in ur current answer. i really dont understand how KMP or aho corasick could be used here…
For each sentence, for each pair of words <w1,w2> of the sentence, compute the distance (i.e. the number of words between w1 and w2 in the sentence). For instance considering the pair <”this", “example”>, we have two occurrences in s1, 1 occurrence in s2 and 1 occurrence in s3. In s1 distances are 3 and 6, in s2 3 and in s3 1. Finally, the average distance for the pair <”this”, "example”> is 3.25 All pairs, with relative average distance must be written in a table of a relational database
how to solve this problem in python.
which algorithm or methods or function is best solution for this problem ? if there any suitable code please share here.