
autocorrect.exe Specifications

The idea is to create my own implementation of an autocorrect program.

To start out, there will be no graphical interface and words will not actually be corrected. It will be a basic command-line interface: the user types in a word and a list of the most similar words is printed out. The implementation is quite complex.

PREPROCESSING:

There are 3 important data structures that will be instantiated at the beginning of the program:

(1) A character array of all the characters that can compose a word. They are all capital and lower-case letters, plus the apostrophe: {a, A, b, B, ..., z, Z, '}.

(2) A map that represents the pairwise distances between the letters of a word on a standard keyboard. For example, the distance between 'a' and 'a' is 0; the distance between 'a' and 'd' is 2; the distance between 'd' and 'a' is also 2; the distance between 'f' and 'm' is 4 because, to get from 'f' to 'm', you have to travel down from 'f' to 'v' 1 unit, then right from 'v' to 'm' 3 units. Think of this as a mathematical mapping f:X->Y where

X = {(a,a), (a,b), (b,a), ..., (',')},
Y = {0, 1, 2, ..., 11},

f((a,a)) = 0,
f((a,b)) = 5,
f((b,a)) = 5,
...
f((q,')) = 11,
...
f((',')) = 0

The map will be instantiated by reading from a text file that contains the information for all 27x27=729 pairwise distances. The text file reads like

aa0
ab5
ba5

etcetera, for 729 lines. That information will be put into a map whose key-value pairs are accessible in logarithmic time (std::map in C++).

(3) A dynamic array of strings. These will be the 5000 most common words in the English language according to: http://www.englishclub.com/vocabulary/common-words-5000.htm. I copied the list from that site and pasted it into a text file. The words will each be extracted and pushed back into a dynamic array of strings (std::vector<std::string> in C++). That will be my dictionary of words.

THE ALGORITHM:

My big idea here is to use the map of distances between characters to create a function that measures the distance between words. That way, if the user types in "dcience" then the top suggested word will be "science" because it is the exact same word except for the letter 'd', which is distance 1 from 's' on the keyboard. The program comes up with that as the top suggestion by checking the distance between "dcience" and every one of the 5000 most common words stored in the dynamic array of strings. All words that are below a certain distance will show up as suggested words. They will appear in sorted order with the lowest distances first.

MY PROBLEMS:

(1) My current problem is figuring out how to define a function that gives a distance between any two words. For two words of the same length, it would be easy: just sum up the distances between the letters at matching indices. For example, d("cat", "pet")=7+2+0=9. The problem becomes hard when you try to define a distance between words of different lengths. For example, d("cat", "catalyst")=??????. If one word is contained in another, should I define them as having distance 0 from each other? The surplus characters need to somehow factor into the distance. d("eating", "swimming")=???. They both have "ing" endings, but that won't be detected if the distance between words only compares characters at the same indices from left to right.

(2) The next problem is figuring out the tolerance for words that will show up as suggested words. What should be the distance under which words are considered similar enough to show up as suggested words? Should it be 3.14? How should it be determined?
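The parts of the plan that are already settled (reading the distance file, the same-length word distance, and the thresholded, sorted suggestion list) can be sketched roughly as follows. This is only a minimal sketch under assumptions: the names DistanceMap, Suggestion, parseDistances, sameLengthDistance and suggest are made up for illustration, the file is assumed to list lower-case pairs only (so typed letters are lower-cased before lookup), and the tolerance is simply passed in because its value is still an open question.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alias: (from, to) -> keyboard distance.
using DistanceMap = std::map<std::pair<char, char>, int>;

// Hypothetical record for one suggested word and its distance.
struct Suggestion {
    std::string word;
    int distance;
};

// Parses lines such as "aa0" or "ab5": two characters followed by an integer.
DistanceMap parseDistances(std::istream& in) {
    DistanceMap distances;
    std::string line;
    while (std::getline(in, line)) {
        if (line.size() < 3) continue;  // skip blank or too-short lines
        distances[std::make_pair(line[0], line[1])] = std::stoi(line.substr(2));
    }
    return distances;
}

// Sums the key distances at matching indices; only defined for equal lengths.
// Assumption: the map holds lower-case pairs, so letters are lower-cased first.
int sameLengthDistance(const DistanceMap& distances,
                       const std::string& a, const std::string& b) {
    assert(a.size() == b.size());
    int total = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        char x = static_cast<char>(std::tolower(static_cast<unsigned char>(a[i])));
        char y = static_cast<char>(std::tolower(static_cast<unsigned char>(b[i])));
        // at() throws std::out_of_range if the pair is missing from the file.
        total += distances.at(std::make_pair(x, y));
    }
    return total;
}

// Returns dictionary words whose distance to `typed` is below `tolerance`,
// lowest distance first. Words of a different length are skipped here,
// because their distance is the open problem (1).
std::vector<Suggestion> suggest(const DistanceMap& distances,
                                const std::vector<std::string>& dictionary,
                                const std::string& typed, int tolerance) {
    std::vector<Suggestion> result;
    for (const std::string& word : dictionary) {
        if (word.size() != typed.size()) continue;
        int d = sameLengthDistance(distances, typed, word);
        if (d < tolerance) result.push_back({word, d});
    }
    std::stable_sort(result.begin(), result.end(),
                     [](const Suggestion& l, const Suggestion& r) {
                         return l.distance < r.distance;
                     });
    return result;
}
```

In the real program parseDistances would be fed an std::ifstream opened on the 729-line distance file; an std::istringstream holding a few lines works the same way for quick experiments.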

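For problem (1), one option that the text above does not commit to is a weighted edit distance (a Levenshtein-style dynamic program), where replacing one letter with another costs their keyboard distance and each surplus character costs a fixed penalty. The sketch below is only an illustration of that swapped-in technique: weightedEditDistance and indelCost are hypothetical names, and what the penalty should be is as open as the tolerance question in (2).

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical alias: (from, to) -> keyboard distance.
using DistanceMap = std::map<std::pair<char, char>, int>;

// Weighted edit distance between words of any length.
// Substituting x by y costs distances(x, y); inserting or deleting one
// character costs indelCost (a hypothetical tuning parameter).
int weightedEditDistance(const DistanceMap& distances,
                         const std::string& a, const std::string& b,
                         int indelCost) {
    assert(indelCost >= 0);
    const std::size_t n = a.size();
    const std::size_t m = b.size();

    // cost[i][j] = cheapest way to turn the first i letters of a
    // into the first j letters of b.
    std::vector<std::vector<int>> cost(n + 1, std::vector<int>(m + 1, 0));
    for (std::size_t i = 1; i <= n; ++i) cost[i][0] = static_cast<int>(i) * indelCost;
    for (std::size_t j = 1; j <= m; ++j) cost[0][j] = static_cast<int>(j) * indelCost;

    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            // at() throws std::out_of_range if the pair is not in the map.
            int substitute = cost[i - 1][j - 1] +
                             distances.at(std::make_pair(a[i - 1], b[j - 1]));
            int remove = cost[i - 1][j] + indelCost;
            int insert = cost[i][j - 1] + indelCost;
            cost[i][j] = std::min({substitute, remove, insert});
        }
    }
    return cost[n][m];
}
```

Under this definition d("cat", "catalyst") is no longer 0: it is the cost of the five surplus characters, so containment still counts as a difference. Because the dynamic program is free to skip characters, shared endings such as the "ing" in "eating" and "swimming" can line up even though they sit at different indices. For two words of the same length the result can be smaller than the plain index-by-index sum whenever a delete plus an insert is cheaper than a substitution.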