SPELL - Editorial

dpraveen · November 24, 2014, 3:53am

PROBLEM LINK

contest
practice

Author: Sergey Kulik
Tester: Praveen Dhinwa
Editorialist: Praveen Dhinwa

DIFFICULTY

Challenge

PREREQUISITES

hashing, tries etc.

PROBLEM

Given dictionary and text string. Text string contains errors in the words of below given kinds. You had to correct
those errors.

Possible errors in the words:

Swapping two letters.
Skipping a letter.
Insert an irrelevant letter.
Mistype a letter.

EXPLANATION

Most common approach for the problem was to compute hashes for the words in text for all possible ways of error and checking
whether the hash exists in the dictionary or not. For this, you can pre-compute set of hashes present in the dictionary.

Simplest way of correcting errors in a word:

Swapping two letters
You can try creating all possible words with a single swap.
Skipping a letter
You can correct this error in two ways. First way would to be try inserting each possible character at all possible positions of
current wrong word. This requires a lot of time. One simple optimization was to pre-compute all possible error words that you can create
by making errors in dictionary. You can see that this will give you speed up of size of alphabet.
Insert an irrelevant letter
You can try deleting each letter from wrong word.
Mistype a letter
You can try the brute force way of checking all possible changes in current wrong word. Also you can do some optimization on basis
that in English texts, the there are high chances that a consonant is followed by a vowel.

Some people have also used tries for storing the dictionary and correcting the errors. One benefit of trie is reduction in space.
I think a better way would be mix tries and hashing for different kind of error corrections. Most high scoring submissions have
used tries.

Also one can note that there will be many words in a novel that will be repeated. So we should store answers for those cases so as to
avoid repeated calculations.

Also most common English pronouns, articles are of very small size. Also their count in text will be high too. You can try dealing with them
in a different way. You can even pre-compute information for them. You can also store some common verbs too.

One thing one must note that there may be more than one way of correcting a word, context information is important for correcting errors.
eg. “thes” can be corrected to “this” and “thus”. One can optimize his program with assigning priorities/ probabilities to different words based on various
parameters like frequency of their use.

There are many other strategies and optimization possible. Some natural learning algorithm might be applied too. It seems
like mugurelionut applied N-gram model.

AUTHOR’S AND TESTER’S SOLUTIONS:

Setter’s Solution:
Tester’s Solution:

pranjalranjan · November 24, 2014, 3:55am

@dpraveen And since “is” wasn’t present in the dictionary, like many other words, how it was to be decided which word has to be changed and which not to be?

cyberax · December 3, 2014, 11:07pm

it would have been great to have a paragraph about custom offline test case generation.

pranjalranjan · November 24, 2014, 3:58am

and regarding the priority thing mentioned in the text… How do you assign priority to some 3-4 lakh elements present in the dictionary? And which words do i compare them with to assign the priority?

cyberax · December 3, 2014, 10:57pm

access denied on setter and tester’s solutions