How to Calculate the Edit Distance in Python? – Finxter

0
39
Python One-Liners


Motivation

Type "helo world" into your Google search bar and Google will ask you: "Did you mean: hello world". How is this done?

A simple method to detect these typos is the Levenshtein distance (also called edit distance). In fact, Google’s algorithm seems to use some variant of it. (source)

By studying this article, you’ll learn about the important practical algorithm to calculate the “Levenshtein distance” or “edit distance”.

Applications: The Levenshtein distance has important applications. Think about the auto-correction functionality on your smartphone. Say, you type in "helo" in your WhatsApp messenger. Your smartphone then selects several high probability words and sorts them (e.g. by Levenshtein distance). For example, the one with minimal Levenshtein distance (and, hence, maximal similarity) is the string "hello". Thus, it may automatically correct "helo" to "hello".

Defining the Edit Distance

The Levenshtein distance is a metric to calculate the distance between two strings. It helps you to quantify how “similar” two strings are. The Levenshtein distance is also called “edit distance” which describes precisely what it measures:

Definition: The edit/Levenshtein distance is defined as the number of character edits (insertions, removals, or substitutions) that are needed to transform one string into another.

The intuition is the following: the smaller the Levenshtein distance, the more similar the strings.

Example Edit Distance

Let’s consider an example with two strings "cat" and "chello". How to calculate the Levenshtein distance in this scenario?

We already know that the Levenshtein distance computes the minimal number of edits (insert, delete, or replace) to reach the second string starting from the first string.

Here is one minimal sequence:

  • "cat"
  • "cht" (replace "a" by "h")
  • "che" (replace "t" by "e")
  • "chel" (insert "l" at position 3)
  • "chell" (insert "l" at position 4)
  • "chello" (insert "o" at position 5)

In this way, we can transform the string "cat" in the string "chello" in five editing steps – the Levenshtein distance is 5.

Calculating the Edit Distance in Python Using a Library

If you’re not interested in creating your own implementation, you can simply install the editdistance library by using pip:

pip install editdistance

Now, you can run it using the editdistance.eval() function with the two strings as arguments:

import editdistance
editdistance.eval('banana', 'bahama')
# 2L

Okay, let’s have a look at a more beautiful one-liner solution with detailed explanation next.

Python Recursive Edit Distance

Problem Statement: Write a Python one-liner that calculates the Levenshtein distance of two strings a and b.

## The Data
a = "cat"
b = "chello"
c = "chess"

## The One-Liner
ls = lambda a, b: len(b) if not a else len(a) if not b 
         else min(ls(a[1:],b[1:]) + (a[0]!=b[0]),
                  ls(a[1:],b) + 1,
                  ls(a,b[1:]) + 1)

## The Result
print(ls(a,b))
print(ls(a,c))
print(ls(b,c))

Listing: Calculating the Levenshtein distance of two strings in one line.

Exercise: What’s the output of this code snippet?

Before I explain the one-liner to you, let’s first rewrite this naive recursive algorithm to a normal multi-line Python function if, unlike me, you don’t like concise Python code:

a = "cat"
b = "chello"
c = "chess"


def ls(a, b):
    # Recursion base cases
    if not a:
        return len(b)
    if not b:
        return len(a)

    # Replace first character
    if a[0] != b[0]:
        d_1 = ls(a[1:], b[1:]) + 1
    else:
        d_1 = ls(a[1:], b[1:])

    # Remove first character
    d_2 = ls(a[1:], b) + 1

    # Insert first character
    d_3 = ls(a, b[1:]) + 1

    # Edit distance is minimum
    return min(d_1, d_2, d_3)


print(ls(a, b))
# 5
print(ls(a, c))
# 4
print(ls(b, c))
# 3

Before we dive into the code, let’s quickly explore an important Python trick we heavily exploit in the one-liner.

In Python, every object has a truth value – while you are either good or bad in the world of Harry Potter, you are either True or False in the world of Python! Most objects are in fact True. But a few objects are False:

  • Zero 0 and 0.0 is False
  • The empty string '' is False
  • The empty list [] is False
  • The empty dict or set {} is False

💡 Remember: As a rule of thumb, Python objects are considered False if they are empty or zero.

Equipped with this information, you can now easily understand the first part of the Levenshtein function:

We create a lambda function that returns the number of edits required to transform a string a into a string b.

There are two trivial cases:

  • Suppose string a is empty. In this case, the minimal edit distance is len(b) insertions of the characters in string b. We cannot do better.
  • Similarly, if string b is empty, the minimal edit distance is len(a).

Thus, we can directly return the correct edit distance if either of the strings is empty.

Let’s say both strings are non-empty (otherwise the solution is
trivial as shown previously). Now, we can simplify the problem in three ways.

First, we ignore the leading characters of both strings a and b and calculate the edit distance from slices (i.e., substrings) a[1:] to b[1:] in a recursive manner. If the leading characters a[0] and b[0] are different, we have to fix it by replacing a[0] by b[0]. Hence, we increment the edit distance by one if they are different.

Second, we remove the first character a[0]. Now, we check the minimal edit distance recursively for this smaller problem. As we have removed a character, we increment the result by one.

Third, we (conceptually) insert the character b[0] to the beginning of the word a. Now, we can reduce this problem to the smaller problem that arises if we remove the first character of b. As we have performed one edit operation (inserting), we increment the result by one.

Finally, we simply take the minimum edit distance of all three
results (replace the first character, remove the first character, insert the
first character).

This one-liner solution demonstrates once again the importance of training your recursion skills – recursion may not come naturally to you but rest assured that it will after studying many recursive problems like this one.

Python One-Liners Book: Master the Single Line First!

Python programmers will improve their computer science skills with these useful one-liners.

Python One-Liners

Python One-Liners will teach you how to read and write “one-liners”: concise statements of useful functionality packed into a single line of code. You’ll learn how to systematically unpack and understand any line of Python code, and write eloquent, powerfully compressed Python like an expert.

The book’s five chapters cover (1) tips and tricks, (2) regular expressions, (3) machine learning, (4) core data science topics, and (5) useful algorithms.

Detailed explanations of one-liners introduce key computer science concepts and boost your coding and analytical skills. You’ll learn about advanced Python features such as list comprehension, slicing, lambda functions, regular expressions, map and reduce functions, and slice assignments.

You’ll also learn how to:

  • Leverage data structures to solve real-world problems, like using Boolean indexing to find cities with above-average pollution
  • Use NumPy basics such as array, shape, axis, type, broadcasting, advanced indexing, slicing, sorting, searching, aggregating, and statistics
  • Calculate basic statistics of multidimensional data arrays and the K-Means algorithms for unsupervised learning
  • Create more advanced regular expressions using grouping and named groups, negative lookaheads, escaped characters, whitespaces, character sets (and negative characters sets), and greedy/nongreedy operators
  • Understand a wide range of computer science topics, including anagrams, palindromes, supersets, permutations, factorials, prime numbers, Fibonacci numbers, obfuscation, searching, and algorithmic sorting

By the end of the book, you’ll know how to write Python at its most refined, and create concise, beautiful pieces of “Python art” in merely a single line.

Get your Python One-Liners on Amazon!!

Additional Implementations Edit Distance Python

There’s a wealth of code already implemented in Python to solve the edit distance problem. Next, I’ll list the most relevant resources for your convenience:



Source link

Leave a reply

Please enter your comment!
Please enter your name here