hashing, tries etc.
Given dictionary and text string. Text string contains errors in the words of below given kinds. You had to correct
Possible errors in the words:
- Swapping two letters.
- Skipping a letter.
- Insert an irrelevant letter.
- Mistype a letter.
Most common approach for the problem was to compute hashes for the words in text for all possible ways of error and checking
whether the hash exists in the dictionary or not. For this, you can pre-compute set of hashes present in the dictionary.
Simplest way of correcting errors in a word:
- Swapping two letters
You can try creating all possible words with a single swap.
- Skipping a letter
You can correct this error in two ways. First way would to be try inserting each possible character at all possible positions of
current wrong word. This requires a lot of time. One simple optimization was to pre-compute all possible error words that you can create
by making errors in dictionary. You can see that this will give you speed up of size of alphabet.
- Insert an irrelevant letter
You can try deleting each letter from wrong word.
- Mistype a letter
You can try the brute force way of checking all possible changes in current wrong word. Also you can do some optimization on basis
that in English texts, the there are high chances that a consonant is followed by a vowel.
Some people have also used tries for storing the dictionary and correcting the errors. One benefit of trie is reduction in space.
I think a better way would be mix tries and hashing for different kind of error corrections. Most high scoring submissions have
Also one can note that there will be many words in a novel that will be repeated. So we should store answers for those cases so as to
avoid repeated calculations.
Also most common English pronouns, articles are of very small size. Also their count in text will be high too. You can try dealing with them
in a different way. You can even pre-compute information for them. You can also store some common verbs too.
One thing one must note that there may be more than one way of correcting a word, context information is important for correcting errors.
eg. “thes” can be corrected to “this” and “thus”. One can optimize his program with assigning priorities/ probabilities to different words based on various
parameters like frequency of their use.