Tuesday, September 13, 2011

Approximate String Matching


Here I used Levenshtein Distance method to calculate the amount of similarity between two strings. The method LevenshteinDistance(String source, String target) will return the cost to convert the source string to target string, by considering insertion, deletion and substitution of a character, each of them costs 1.

In the Tester class you can use an appropriate threshold to get the approximate matching strings.

public class StringUtils {

    public static int LevenshteinDistance(String source, String target) {
        // d[i,j] will hold the Levenshtein distance between
        // the 1st i characters of s & the 1st j characters of t;
        // note that d has (m+1) x (n+1) values

        int m = source.length(), n = target.length();
        int[][] d = new int[m][n];
        char[] s = source.toLowerCase().toCharArray();
        char[] t = target.toLowerCase().toCharArray();

        for (int i = 0; i < m; i++) {
            d[i][0] = i; // any 1st string to an empty 2nd string
        }
        for (int j = 0; j < n; j++) {
            d[0][j] = j; // any 2nd string to an empty 1st string
        }

        for (int j = 1; j < n; j++) {
            for (int i = 1; i < m; i++) {
                if (s[i] == t[j]) {
                    d[i][j] = d[i - 1][j - 1]; // no operation
                } else {
    //minimum of DELETION, INSERTION, SUBSTITUTION (each costs 1)
    d[i][j] = Math.min(d[i - 1][j] + 1, Math.min(d[i][j - 1] + 1, d[i - 1][j - 1] + 1));
                }
            }
        }
        return d[m - 1][n - 1];
    }
}

public class Tester {

    public static void main(String[] args) {
        List<String> suggestionList = new ArrayList<String>();
        List<String> sourceList = new ArrayList<String>();
sourceList.add("abc");
sourceList.add("abcefg");
sourceList.add("abcfgh");
sourceList.add("abcdhfgi");
sourceList.add("abcd ghi");
sourceList.add("abcdefgh");
        for (String s : sourceList) {
            int editCost = StringUtils.LevenshteinDistance(s,"abcdefgh");
            if (editCost <= 3) {
                suggestionList.add(editCost + s);
            }
        }
//Sort, best match first
        Collections.sort(suggestionList);
        int count = 0;
        for (String s : suggestionList) {
            System.out.println(s);
        }
    }
}

Guess the output...

No comments:

Post a Comment