A while ago I wrote an implementation of the Soundex Algorithm which attempts to assign the same encoding to words which are pronounced the same but spelled differently. In this post I'll cover the Levenshtein Word Distance algorithm which is a related concept measuring the "cost" of transforming one word into another by totalling the number of letters which need to be inserted, deleted or substituted.
The Levenshtein Word Distance has a fairly obvious use in helping spell checkers decided which words to suggest as alternatives to mis-spelled words: if the distance is low between a mis-spelled word and an actual word then it is likely that word is what the user intended to type. However, it can be used in any situation where strings of characters need to be compared, such as DNA matching.
The Levenshtein Word Distance
I usually leave the program's output until near the end of each post but this time I'll put it up front as it is the easiest way of showing how the algorithm works. I'll actually show two program outputs, the first being after the required data structure has been initialized but before the algorithm has been run, and the second after the algorithm has been run.
Program Output - Data structure initialized
-----------------------------
| codedrome.com |
| Levenshtein Word Distance |
-----------------------------
b a n a n a
0 1 2 3 4 5 6
b 1 0 0 0 0 0 0
a 2 0 0 0 0 0 0
n 3 0 0 0 0 0 0
a 4 0 0 0 0 0 0
m 5 0 0 0 0 0 0
a 6 0 0 0 0 0 0
Here we have a hypothetical situation where somebody has typed "banama" and we want to see how close it is to "banana". The source word "banama" is shown down the left and the target word "banana" is along the top. We also have a grid of integers with one more row than the number of letters in the source, and one more column than the number of letters in the target.
Most values are initially set to 0, but the first row of numbers represent the cumulative number of letters which need to be inserted to form the target, and the first column shows the cumulative number of deletions to remove the source. In other words, to transform "banama" into "banana" you could delete six letters and insert six more. That's not the most efficient way though as only one operation needs to be done, the substitution of the "m" with an "n". Let's now look at the program output after the algorithm has been run.
Program Output - Algorithm run
-----------------------------
| codedrome.com |
| Levenshtein Word Distance |
-----------------------------
b a n a n a
0 1 2 3 4 5 6
b 1 0 1 2 3 4 5
a 2 1 0 1 2 3 4
n 3 2 1 0 1 2 3
a 4 3 2 1 0 1 2
m 5 4 3 2 1 1 2
a 6 5 4 3 2 2 1
Minimum cost of transforming "banama" to "banana" = 1
All the other numbers have been filled in, and the number in the bottom right is the total cost of transforming "banama" to "banana" using the minimum number of operations, in this case 1.
The values for each of the 36 numbers initialized to 0 are calculated as follows:
- Find the value diagonally top-left and if the corresponding letters are different add 1 (substitution cost)
- Find the value above and add 1 (deletion cost)
- Find the value to the left and add 1 (insertion cost)
- Set the current value to the lowest of the above three values
(For this project I will use 1 as the "cost" of all three possible operations: substitution, insertion and deletion. However, this is not mandatory and in some circumstances different costs might be considered appropriate. For example in a spell checker you might feel someone is more likely to type the wrong letter than to miss out a letter or type an extra letter. Setting the insert and delete cost to > 1 will therefore make words with the same number of letters more likely to be suggested as alternatives than words where inserts or deletes are required.)
The overall plan of the implemention of Levenshtein's Word Distance should now be clear - given two words we just need to create a suitably sized 2D array, initialize the numbers and then iterate the rows and columns, setting the elements to their final values.
Coding
Create a folder and within it create the following empty files. You can download the source code as a zip or clone/download from Github if you prefer.
- levenshtein.h
- levenshtein.c
- main.c
Source Code Links
Open levenshtein.h in your favourite editor and enter or paste this.
levenshtein.h
#include<stdlib.h> #include<stdio.h> #define INSERT_COST 1 #define DELETE_COST 1 #define SUBSTITUTE_COST 1 //-------------------------------------------------------- // STRUCT levenshtein //-------------------------------------------------------- typedef struct levenshtein { int** grid; char source_word[64]; char target_word[64]; int source_length; int target_length; int minimum_cost; } levenshtein; //-------------------------------------------------------- // FUNCTION PROTOTYPES //-------------------------------------------------------- levenshtein* lev_init(char source_word[64], char target_word[64]); int lev_calc(levenshtein* lp); void lev_free(levenshtein* lp); void lev_print_grid(levenshtein* lp); void lev_print_cost(levenshtein* lp);
I have #defined the costs of the three possible operations, and then typedef'ed a struct which will hold all the data needed for the project. Finally there are a few function prototypes which I'll describe as they are implemented.
Let's now move on to levenshtein.c.
levenshtein.c part 1
#include<stdlib.h> #include<stdbool.h> #include<stdio.h> #include<string.h> #include<math.h> #include"levenshtein.h" #define GREEN "\x1B[32m" #define RESET "\x1B[0m" //-------------------------------------------------------- // FUNCTION PROTOTYPES //-------------------------------------------------------- static int min(int a, int b, int c);
After a few #includes I have defined a couple of values for printing and resetting the print colour to green, and then a static function for calculating the minimum of three ints.
levenshtein.c part 2
//-------------------------------------------------------- // FUNCTION lev_init //-------------------------------------------------------- levenshtein* lev_init(char* source_word, char* target_word) { levenshtein* lp = malloc(sizeof(levenshtein)); if(lp != NULL) { *lp = (levenshtein){.grid = NULL, .source_length = strlen(source_word), .target_length = strlen(target_word), .minimum_cost = 0}; strcpy(lp->source_word, source_word); strcpy(lp->target_word, target_word); lp->grid = malloc((lp->source_length + 1) * sizeof(int*)); if(lp->grid != NULL) { for(int r = 0; r <= lp->source_length; r++) { lp->grid[r] = malloc((lp->target_length + 1) * sizeof(int)); if(lp->grid[r] != NULL) { for(int c = 0; c <= lp->target_length; c++) { if(r == 0) { lp->grid[r][c] = c; } else if(c == 0) { lp->grid[r][c] = r; } else { lp->grid[r][c] = 0; } } } else { lev_free(lp); return NULL; } } return lp; } else { lev_free(lp); return NULL; } } else { return NULL; } }
The lev_init function looks rather complex, but basically just creates and populates a levenshtein struct for the supplied words. I'll list the salient points of this function below:
- Firstly we use malloc to grab enough memory for a levenshtein struct, and check it is not NULL
- Then we assign the pointer to an actual struct (including the word lengths isn't strictly necessary but does streamline code requiring the length)
- The source and target words are then set. (These are char arrays so don't need dynamic memory. For industrial-strength code we ought to check the words aren't too long*, or use dynamic memory.)
- We then get memory for a suitably sized array of int pointers, assigned to the struct's grid
- In a loop we assign each int pointer to a the required number of actual ints. These are set to 1, 2, 3... for the first row and column, and 0 to the rest.
- An issue here is we have a block of dynamic memory (the struct) containing more dynamic memory (the int pointers) which in turn contain even more malloc'ed memory, the ints themselves. If the second or third malloc fails to find enough memory we can't leave the previously malloc'ed memory orphaned so call the lev_free function. As we'll see later this only frees up stuff which is not NULL so can safely be called with a partially-allocated struct. In these circumstances we also return NULL, returning the new levenshtein pointer if everything works.
* The longest English word is the chemical name of titin, the largest known protein. It has 189,819 letters. In the unlikely event of anybody misspelling it they would break my code. The "longest word in a major dictionary" is pneumonoultramicroscopicsilicovolcanoconiosis with a measly 45 letters, well within our 64 (well, 63 allowing for the \0 at the end).
levenshtein.c part 3
//-------------------------------------------------------- // FUNCTION lev_calc //-------------------------------------------------------- int lev_calc(levenshtein* lp) { int total_substitution_cost; int total_deletion_cost; int total_insertion_cost; for(int sourceletter = 0; sourceletter < lp->source_length; sourceletter++) { for(int targetletter = 0; targetletter < lp->target_length; targetletter++) { if(lp->target_word[targetletter] != lp->source_word[sourceletter]) { total_substitution_cost = lp->grid[sourceletter][targetletter] + SUBSTITUTE_COST; } else { total_substitution_cost = lp->grid[sourceletter][targetletter]; } total_deletion_cost = lp->grid[sourceletter][targetletter+1] + DELETE_COST; total_insertion_cost = lp->grid[sourceletter+1][targetletter] + INSERT_COST; lp->grid[sourceletter+1][targetletter+1] = min(total_substitution_cost, total_deletion_cost, total_insertion_cost); } } }
Here we calculate the values, iterating the rows (letters of the source) and the columns (letters of the target). The three costs are calculated as described above. Note we only add in the substitution cost if the corresponding letters are different.
Each value is set to the lowest of the three costs using the min function which I'll come to shortly. Note that as we have an extra row and column at the beginning we need to add 1 to the word indexes to set the grid values.
levenshtein.c part 4
//-------------------------------------------------------- // FUNCTION lev_print_grid //-------------------------------------------------------- void lev_print_grid(levenshtein* lp) { if(lp != NULL) { for(int c = 0; c <= lp->target_length; c++) { if(c > 0) { printf(GREEN " %2c " RESET, lp->target_word[c-1]); } else { printf(" "); } } printf("\n\n"); for(int r = 0; r <= lp->source_length; r++) { if(r > 0) { printf(GREEN " %2c " RESET, lp->source_word[r-1]); } else { printf(" "); } for(int c = 0; c <= lp->target_length; c++) { printf(" %2d", lp->grid[r][c]); } puts(" \n"); } } } //-------------------------------------------------------- // FUNCTION lev_print_cost //-------------------------------------------------------- void lev_print_cost(levenshtein* lp) { if(lp != NULL) { printf("Minimum cost of transforming \"%s\" to \"%s\" = %d\n", lp->source_word, lp->target_word, lp->grid[lp->source_length][lp->target_length]); } }
The lev_print_grid function takes a pointer to a levenshtein struct, and firstly prints out the target word in green. We then iterate the rows, first printing out each source letter in green and then the values in the default colour, all this nicely spaced out for clarity.
The lev_print_cost function simply prints out the bottom-right value with a suitable bit of text.
levenshtein.c part 5
//-------------------------------------------------------- // FUNCTION lev_free //-------------------------------------------------------- void lev_free(levenshtein* lp) { if(lp != NULL) { if(lp->grid != NULL) { for(int r = 0; r <= lp->source_length; r++) { if(lp->grid[r] == NULL) { printf("%d NULL\n", r); } free(lp->grid[r]); } free(lp->grid); } free(lp); } } //-------------------------------------------------------- // FUNCTION min //-------------------------------------------------------- static int min(int a, int b, int c) { int m; (a <= b && a <= c) ? (m = a) : (b <= a && b <= c) ? (m = b) : (m = c); return m; }
The lev_free function takes a levenshtein pointer and attempts to free it's various blocks of dynamic memory. It should be possible to call free on memory which is NULL without any errors but I have found that to be a bit erratic so have included the "if != NULLs". At least they don't do any harm. As I mentioned above this function can safely be called with a partially-malloc'ed struct.
The min function simply returns the lowest of three ints. (I could have made this a variadic function using the "..." notation to take any number of ints but too many people waste time writing general-purpose functions for specific problems.)
That's the main coding done so we just need to demonstrate it in the main function.
main.c
#include<stdio.h> #include<stdlib.h> #include<stdbool.h> #include"levenshtein.h" //-------------------------------------------------------- // FUNCTION main //-------------------------------------------------------- int main(int argc, char* argv[]) { puts("-----------------------------"); puts("| codedrome.com |"); puts("| Levenshtein Word Distance |"); puts("-----------------------------\n"); levenshtein* lp = NULL; lp = lev_init("banama", "banana"); //lp = lev_init("banama", "elephant"); //lp = lev_init("levinstein", "levenshtein"); if(lp != NULL) { lev_calc(lp); lev_print_grid(lp); lev_print_cost(lp); lev_free(lp); return EXIT_SUCCESS; } else { puts("Cannot create Levenshtein :("); return EXIT_FAILURE; } }
Here we create a levenshtein pointer and initialize it with three different pairs of words (obviously only uncommenting one at a time). If lev_init works we then call the calc and print functions before calling the free function.
Compile and run the code with these commands.
Compile and Run
gcc main.c levenshtein.c -std=c11 -lm -o main
./main
I have already included the banama/banana output above so won't repeat it here. The word distance there was 1, so banana easily qualifies as a sensible suggestion if somebody mis-types banama. However, if you typed banama you wouldn't expect your word processor to suggest elephant, so let's try those two words.
Program Output = "banama" and "elephant"
---------------------------------------------
| codedrome.com - Levenshtein Word Distance |
---------------------------------------------
e l e p h a n t
0 1 2 3 4 5 6 7 8
b 1 1 2 3 4 5 6 7 8
a 2 2 2 3 4 5 5 6 7
n 3 3 3 3 4 5 6 5 6
a 4 4 4 4 4 5 5 6 6
m 5 5 5 5 5 5 6 6 7
a 6 6 6 6 6 6 5 6 7
Minimum cost of transforming "banama" to "elephant" = 7
The word distance here is 7 as banama needs to be almost completely re-typed to get to elephant. Clearly not a valid alternative.
Program Output = "levinstein" and "levenshtein"
---------------------------------------------
| codedrome.com - Levenshtein Word Distance |
---------------------------------------------
l e v e n s h t e i n
0 1 2 3 4 5 6 7 8 9 10 11
l 1 0 1 2 3 4 5 6 7 8 9 10
e 2 1 0 1 2 3 4 5 6 7 8 9
v 3 2 1 0 1 2 3 4 5 6 7 8
i 4 3 2 1 1 2 3 4 5 6 6 7
n 5 4 3 2 2 1 2 3 4 5 6 6
s 6 5 4 3 3 2 1 2 3 4 5 6
t 7 6 5 4 4 3 2 2 2 3 4 5
e 8 7 6 5 4 4 3 3 3 2 3 4
i 9 8 7 6 5 5 4 4 4 3 2 3
n 10 9 8 7 6 5 5 5 5 4 3 2
Minimum cost of transforming "levinstein" to "levenshtein" = 2
Finally levinstein/levenshtein. One substitution and one insertion give us a word distance of 2, also a sensible suggestion. You would probably consider word distances of 2 or perhaps 3 to be reasonable for alternative suggestions, but no more.