Monday, July 04, 2016

Edit Distance

We use edit distance to do approximate matching between strings. It's time to get more familiar of it.

It is also known as Levenshtein distance, It is usually divided by maximum length of the two strings to get a percentage of similarity between two strings. getLenvenshtein(s1,s2)/max(s1.length, s2.length).


  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
package string;

import java.util.Arrays;
import java.util.Objects;

/**
1.If last characters of two strings are same, nothing much to do. 
 Ignore last characters and get count for remaining strings. 
So we recur for lengths m-1 and n-1.
2. Else (If last characters are not same), we consider all operations on 
str1, consider all 
three operations on last character of first string, recursively compute 
minimum cost for all three operations and take minimum of three values.

Insert: Recur for m and n-1
Remove: Recur for m-1 and n
Replace: Recur for m-1 and n-1

https://en.wikipedia.org/wiki/Levenshtein_distance

 * @author Andrew Ma
 *
 */
public class EditDistance {
 static int min(int x, int y, int z){
  return Math.min(x,Math.min(y, z));
 }
 //this method is from right to left, it can be left to right too
 static int editDistanceRecursive(String s1, String s2, int s1Len, int s2Len){
  if(s1Len==0)return s2Len;
  if(s2Len==0)return s1Len;
  //last characters are same, do not increase the distance, moving to next characters and continue
  if(s1.charAt(s1Len-1)==s2.charAt(s2Len-1))
   return editDistanceRecursive(s1,s2,s1Len-1,s2Len-1);
  //for last characters are not same
  return 1+min(
    //remove the difference from s1, mark one difference, then continue to check rest of strings
    //after removing, it 's no guaranteed that new last is same as last one in second string
    //so this operation does not remove last character in s2
    editDistanceRecursive(s1,s2,s1Len-1,s2Len),
    //insert a matched character to s1, mark one difference, then continue to check rest of strings
    //after adding one character, s1's length becomes s1Len+1, we adding one distance for s1Len+1 and s2Len
    //then move to next positions that are s1Len and s2Len-1
    //(when we add, we add same character, so guarantee the last ones match)
     editDistanceRecursive(s1, s2, s1Len,s2Len-1),
     //replace, now we now last ones are same after adding one edit, then we can move to next positions
      editDistanceRecursive(s1,s2,s1Len-1,s2Len-1));
  
 }
 /**
  * straight forward way of dp to remove overlapping calculation of subproblems
  * @param s1
  * @param s2
  * @param s1Len
  * @param s2Len
  * @param dp
  * @return
  */
 static int editDistanceRecursiveDP(String s1, String s2, int s1Len, int s2Len,int[][] dp){
  if(s1Len==0)return s2Len;
  if(s2Len==0)return s1Len;
  if(dp[s1Len][s2Len]>0)return dp[s1Len][s2Len];
  //last characters are same, do not increase the distance, moving to next characters and continue
  if(s1.charAt(s1Len-1)==s2.charAt(s2Len-1)){
   dp[s1Len][s2Len] = editDistanceRecursive(s1,s2,s1Len-1,s2Len-1);
   return dp[s1Len][s2Len];
  }
  //for last characters are not same
  dp[s1Len][s2Len] = 1+min(
    //remove the difference from s1, mark one difference, then continue to check rest of strings
    //after removing, it 's no guaranteed that new last is same as last one in second string
    //so this operation does not remove last character in s2
    editDistanceRecursive(s1,s2,s1Len-1,s2Len),
    //insert a matched character to s1, mark one difference, then continue to check rest of strings
    //after adding one character, s1's length becomes s1Len+1, we adding one distance for s1Len+1 and s2Len
    //then move to next positions that are s1Len and s2Len-1
    //(when we add, we add same character, so guarantee the last ones match)
     editDistanceRecursive(s1, s2, s1Len,s2Len-1),
     //replace, now we now last ones are same after adding one edit, then we can move to next positions
      editDistanceRecursive(s1,s2,s1Len-1,s2Len-1));
  return dp[s1Len][s2Len];
  
 }
 /**
  * bottom up dp. advancing to here after understanding the recursive algorithm
  * @param s1
  * @param s2
  * @return
  */
 static int editDistanceDP(String s1, String s2){
  Objects.requireNonNull(s1);
  Objects.requireNonNull(s2);
//  if(s1.length()==0)return s2.length();
//  if(s2.length()==0)return s1.length();
  //i,j means difference so far up to i-1 in s1 and j-1 in s2
  // so the last examined positions are last characters in both string, in dp[][], that is dp[m][n]
  int[][] dp = new int[s1.length()+1][s2.length()+1];
  for (int i=0; i<=s1.length(); i++)
        {
            for (int j=0; j<=s2.length(); j++)
            {
             if(i==0)
              dp[i][j]=j;
             else if (j==0)
              dp[i][j]=i;
             //when it runs to here , i and j would be greater than 0
             //i-1 and j-1 are real index to strings
             //0 is just for dp[][]
             else if (s1.charAt(i-1)==s2.charAt(j-1))
              dp[i][j] = dp[i-1][j-1];
             else{
              //edit operation happens on s1
              dp[i][j] = 1+min(
                dp[i][j-1],//remove:means i skip one to next
                dp[i-1][j],//insert: means i keeps same, j goes to next one
                dp[i-1][j-1]//replace: both skip one
                );
             }
            }
        }
  return dp[s1.length()][s2.length()];
 }
 public static void main(String[] args) {
  String s1 = "andrewma";
  String s2 = "andyMa";
  int[][] dp = new int[s1.length()+1][s2.length()+1];
  for(int i=0;i<dp.length;i++)
   Arrays.fill(dp[i], -1);
  System.out.println(editDistanceRecursive(s1, s2, s1.length(), s2.length()));
  System.out.println(editDistanceRecursiveDP(s1, s2, s1.length(), s2.length(),dp));
  System.out.println(editDistanceDP(s1, s2)); 
 }

}
Below is from wiki, easiler to understand. But I have a different understan don deletion and insertion operation.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
 static int editDistanceDPWiki(String s1, String s2){
  Objects.requireNonNull(s1);
  Objects.requireNonNull(s2);
   // for all i and j, dp[i,j] will hold the Levenshtein distance between
    // the first i characters of s and the first j characters of t
    // note that dp has (m+1)*(n+1) values
  int[][] dp = new int[s1.length()+1][s2.length()+1];//initially they are 0s
  //target is empty string. source can be transformed by deleting all characters
  for (int i=1; i<=s1.length(); i++)
   dp[i][0]=i;
  //source is empty. target can be transformed by dropping all characters
  for (int i=1; i<=s2.length(); i++)
   dp[0][i]=i;
  
  for (int j=1; j<=s2.length(); j++)
        {
            for (int i=1; i<=s1.length(); i++)
            {
             if (s1.charAt(i-1)==s2.charAt(j-1))//
              dp[i][j] = dp[i-1][j-1];//no difference
             else{
              //edit operation happens on s1
              dp[i][j] = 1+min(
                dp[i][j-1],//deletion:
                dp[i-1][j],//insertion: 
                dp[i-1][j-1]//replace: both skip one
                );
             }
            }
        }
  return dp[s1.length()][s2.length()];
 }
There is another way to use two one dimension arrays instead of a matrix to save space as demonstrated in the wiki page. Other places to find good information are from Algorithms on Strings, Trees and Sequences by Dan Gusfield and Chas Emerick's implementation of the Levenshtein distance algorithm from http://www.merriampark.com/ld.htm

No comments: