A tutorial on generalized median strings

Introduction

Generalized median strings have been successfully applied to many pattern recognition problems. In many situations, noise often ends up the culprit of recognition failure. Median strings can help by averaging the results and finding the best estimate among many noisy samples. It is also possible to use median string algorithms to compute among a set of noisy samples the best prototypes for machine learning.

The nice properties of generalized median strings are useful in diverse areas for many applications such as symmetry analysis, shape recognition, optical character recognition, bar code reading, speech recognition, molecular biology, other pattern recognitions problems, signal estimation in multisensory measurements, clustering, and averaging of many kinds of data, among others: binary feature maps, 3D rotations, other geometric features, brain models, anatomical structures, and facial images.

In this tutorial, we will first define what is a string and how to compute the distance or difference between two strings. Next, we will extend the concept by defining weighted means of strings which will allow us among other things to compute strings that are halfway between a pair of strings. Then, we will introduce the concept of generalized median strings, the goal of this tutorial. Finally, we will describe a concrete application to 2D shapes, and let you play with a fun applet that will let you draw your own shapes and have the computer compute median shapes.

I, Samuel Audet, made this tutorial as my final course project for COMP 644 – Pattern Recognition at McGill University in the fall term of 2006, a course given by Professor Godfried Toussaint.

Strings

So, what is a string? A string is a list of symbols. Symbols can be, for example, letters, numbers or vectors, but any arbitrary finite alphabet of symbols can be used. x = a, b, c and y = (1,2), (2,3) are two examples of strings. The first uses letters as symbols and has three of them, so its length equals 3. The second uses two-dimensional vectors and since it contains two vectors, its length equals 2.

Edit operations

Now, we can perform what we will call edit operations on a string. There exists three kinds of operations and we call them insertions, substitutions, and deletions, as illustrated in figure 1.

edit operations
Figure 1. Illustration of the three possible edit operations: insertions, substitutions and deletions.

The operations in figure 1 are denoted as follows:

Edit operatione =
Insertion of Cepsilon -> C
Substitution of D with CD->C
Deletion of DD->epsilon

Note that the length of a string is increased by one after an insertion, reduced by one after a deletion, and remains unchanged after a substitution.

By definition, an edit operation has a cost equal to or greater than 0. The cost is denoted c(e) where e is an edit operation. When no changes occurs as the result of an edit operation, it is called an identical substitution. In such cases, the cost of the operation must be 0. Also, the cost function must follow the triangle inequality:

c(a->b) \leq c(a->d) + c(d->b)

That is, in cases where one edit operation gives the same result as two other edit operations, the cost of the former must always be smaller or equal to the sum of the costs of the latter. We can of course have a sequence of edit operations S = e_1,...,e_k, and the cost of this sequence is simply the sum of the costs of the edits, in other words:

c(S) = sum from i=1 to k  c(e_i)

Edit distance

Now, we are ready to define the edit distance between two strings. The edit distance between two strings is important because it tells us by how much two strings differ from one another. Also, the algorithm used will allow us to find an optimal edit sequence that can transform a string into another. Before that though, these two strings that we will denote as x and y must belong to the same space, in other words, they must use the same alphabet of symbols. One popular definition of their distance is equal to the minimum cost over all possible sequences S of edit operations that can change x into y, in other words:

d(x,y) = min over S (c(S))

This is commonly known as the Levenshtein edit distance.

The assumption of the first inequality also implies the triangle inequality for string distances:

d(x,y) <= d(x,z) + d(z,y)

There exists a standard algorithm using dynamic programming to compute the distance between two strings x = x_1, ..., x_n  and  y = y_1, ..., y_n. This algorithm is described by the recursive relationships of the following equations:

algorithm to find distance using dynamic programming

Wagner and Fischer [1] proved that d(x,y) = D(n,m). We can also extract from the path taken in the matrix an optimal edit sequence S which has minimal cost.

Here is a small example of the algorithm in action on two smalls strings x = a, d and y = a, b, c so that n = 2 and m = 3. For simplicity, all edit operations have a cost of 1, except identical substitutions which must have a cost of 0. The dynamic programming algorithm described above will compute the matrix D as shown in figure 2. We can see that d(x,y) = D(n, m) = D(2, 3) = 2. This is in fact true since to transform x into y, we can substitute the 'd' with 'b' and insert a 'c', or we can insert a 'b' and substitute the 'd' with 'c'. Actually, the former sequence of operations is depicted with the red line, and the latter, the blue line. This can easily be understood by realizing that diagonal paths into the matrix are substitution operations. So, in the case of the red line, we have a substitution of 'a' with 'a' (cost 0), then a substitution of 'd' with 'b' (cost 1), and finally, an insertion of 'c' (again, cost 1, so total 2). Follow the same logic for the blue line as exercise.

example of the distance algorithm
Figure 2. Illustration of the distance algorithm in action on two strings. This is the matrix D.

The blue and red lines are both optimal edit sequences that can transform string x into string y with minimal cost. It should be clear with this example that not even an optimal edit sequence S is necessarily unique.

Weighted means of strings

In this section, we will talk about the weighted means of a pair of strings. This concept will guide us one step further into our adventure in understanding generalized median strings. A weighted mean of a pair of strings x and y is a string z that satisfies:

d(x,y) = d(x,z) + d(z,y)

In other words, the distance between x and y must be equal to the distance between x and the weighted mean z plus the distance between, again, the weighted mean z and y. It is clear that z can be equal to either x or y , so that x and y are themselves weighted means as well. We can arbitrarily call alpha = d(x,z) our weight. With this definition, when alpha = 0 then z = x and when alpha = 1 then z = y. For intermediate values of alpha, z is not necessarily unique. For example, let us define two strings x = a and y = a, b, c where edit operations all have a cost of 1 for simplicity. With alpha = 1, we can think of two edit operations that would bring us closer to y. They are insertions epsilon -> b and epsilon -> c . These give us two different weighted means z = ab and z = ac where d(x,z) = alpha = 1 for both.

As we will see in the next section, a weighted mean is also usually a generalized median of a set of two strings.

To find a weighted mean at distance alpha from x, we can use the following procedure:

  1. Compute the distance d(x,y) to find an optimal edit sequence S.
  2. Take a subset S' of operations from S whose cost c(S') = alpha, assuming such a subset exists.
  3. Apply the operations of S' to x to produce z, a weighted mean.

Here is a proof that this algorithm is correct. Since c(S') = alpha, then d(x,z) <= alpha, since there might be a sequence other than S' that is cheaper and that can transform x to z. Then, since we do not need to apply the operations in S' to transform z into y, the distance between z and y is reduced by at least that much, i.e.: d(z,y) <= d(x,y) - alpha. From these two inequalities, we now know that d(z,y) <= d(x,y) - d(x,z) or alternatively d(x,y) >= d(x,z) + d(z,y). But we also know that d(x,y) <= d(x,z) + d(z,y) , the triangle inequality. So it must be that d(x,y) = d(x,z) + d(z,y) . qed

Now, a subset S' with cost alpha might not exist, but in that case Bunke and others [2] proved that if we can decompose edit operations into sub-operations, we still arrive to the same conclusion.

Generalized median strings

We can finally start talking about generalized median strings! Just before that however, we have to introduce the notion of Sum Of Distances (SOD). Given a set of strings T and an additional string p the Sum Of Distances (SOD) is the sum of distances from string p to all strings q in set T, in other words:

SOD(p) = sum from i=1 to t  d(p,q_i)

Assuming the strings are part of an arbitrary space U, a string p bar that belongs to U and that has the smallest Sum Of Distances (SOD) is called a generalized median string, or in other words:

p bar = arg min over p in U SOD(p)

Similarly, a string p hat in the set T with the smallest Sum Of Distances (SOD) is called a set median, or in other words:

p hat = arg min over p in T SOD(p)

As we have seen, the weighted means of strings are not necessarily unique, so neither are median strings. Moreover, the concept of generalized median strings is equivalent to the concept of weighted means of a pair of strings under some conditions. If we assume that the cost function for a given problem is symmetric, i.e.: c(epsilon->a) = c(a->epsilon) and c(a->b) = (b->a) for any given a and b, then the distance is also symmetric, i.e.: d(x,y) = d(y,x) for any given x or y. In this case, it is clear that if the set T contains only two strings, the notion of generalized medians reduces to the notion of weighted means, since the SOD = d(x,z) + d(z,y) = d(x,y). In addition, note that the SOD does not actually depend on z so it is equal for any weighted mean string z, including x and y themselves.

In general, however, we are interested in much larger sets of strings. The generalized median is the more general concept and therefore provides a better representation of the set than the set median, but de la Higuera and Casacuberta [3] proved that the computation of the former is NP-complete, in other words very hard to compute. On the other hand, the computation of the set median string is very straightforward. We can apply an algorithm directly modeled after the equation above, which suggests the following: Compute the Sum Of Distances (SOD) for all the strings in the set, and a string that gives the smallest SOD is a set median. This brute-force approach has time complexity O(n2) in the number of strings in the set. Computing the generalized median string is much harder and more often than not, only approximations can be computed. Case in point, in the applet below, the heuristic we used to compute an approximation is the dynamic computing approach by Jiang and others [4]. You will notice that under some circumstances it produces worse estimates (higher SOD) than the set median algorithm! The authors have also tried a genetic algorithm approach [5], which is also quite experimental. Casacuberta and de Antoni [6] have for their part taken a greedy approach. Those heuristics are all quite complex and mostly empirical, so they provide little insight into the issue of computing generalized median strings. In any case, we invite the interested reader to consult those papers.

Application to shapes

We will now describe an application to (two-dimensional) shapes where we are interested in finding the median shapes. We will use a vectorial representation where all vectors have the same length Delta (a constant, that could be for example 1). We have depicted in figure 3 an S-like shape using such a representation. All vectors have the same length Delta (You can imagine Delta = 1). In this representation, the tail of the next vector (e.g.: a_2) follows the head of the previous vector (e.g.: a_1), and so on.

edit operations
Figure 3. Illustration in vectorial form of a shape similar to an S where all nine vectors have the same length.

We can denote this shape in this representation as a string of vectors: s = a_1, a_2, a_3, a_4, a_5, a_6, a_7, a_8, a_9 .

For this application, we define the costs of the edit operations as follows: c(a -> epsilon) = c(epsilon -> a) = |a| = delta and c(a -> b) = |a-b|. In other words, the cost of insertions and deletions are equal to the length of the vectors. Moreover, |.| is implied to be a Minkowski distance between two vectors. For example, we could use the well known Euclidean distance: |a-b| = sqrt(a_x^2 + b_y^2). With this definition c(a-b) = c(b-a) and thus we end up with a symmetric distance metric for the whole string as discussed before. Notice that the minimum cost is 0, and the maximum cost is 2 Delta when a and b have the same orientation, but opposite directions, such as vectors a_1 and a_5 in the figure 3.

We can then apply the algorithms discussed previously to shapes represented with such strings. In particular, we can find a generalized median string of a set of shapes, which would give us in some perceptual way a "middle shape" of the set. For example, if we have many shapes of the letter S, they might look very similar to a human, but a machine does not know they are all supposed to be S's. We could tell the computer to find a generalized median string to help it find a prototype S. Since this single prototype has the minimal distance to shapes that we humans see as S's, it would make it much easier for the computer to recognize S's in the future, since it would only need to carry around the one prototype instead of the whole set. You will be able to play with the applet below and see what kind of prototypes the computer computes given 10 shapes that you can draw.

Fun applet

With this applet, you are allowed to draw up to 10 shapes and find out what the computer will decide are the median strings. Note that the shapes are limited to a single curve, so that you cannot draw shapes with disconnected components.

Next, the algorithm used to find an approximation of a generalized median string is the dynamic algorithm developed by Jiang and others [4]. It is dynamic in the sense that it computes an approximation of a generalized median string incrementally, one string at a time. Therefore, we get intermediate results of approximations computed for subsets of the whole set of strings. The applet exploits this fact by displaying the approximation as computed after each string is added to the computation. Watch for the blue background. Note that since this computation is only an approximation, we used randomness in the implementation, and this is why for the same set of shapes, the approximation is different each time the computation is launched.

The set median string is computed using the brute-force approach as described above. Watch for the green background.

Some interesting things to try with the applet:

Note: If you do not see the applet that is supposed to appear here, make sure your browser is equipped with a Java Runtime Environment (JRE) of version 1.4.2 or later. Such JRE for Windows and Linux can be freely downloaded from Sun.

Download the source code of this applet.

References

[1]Robert A. Wagner and Michael J. Fischer, "The string-to-string correction problem," Journal of the ACM, vol. 21, no. 1, pp. 168–173, 1974.
[2]Horst Bunke, Xiaoyi Jiang, Karin Abegglen and Abraham Kandel, "On the Weighted Mean of a Pair of Strings," Pattern Analysis and Applications, vol. 5, no. 1, pp. 23–30, 2002.
[3]Colin de la Higuera and Francisco Casacuberta, "Topology of Strings: Median String is NP-Complete," Theoretical Computer Science, vol. 230, no. 1–2, pp. 39–48, 2000.
[4]Xiaoyi Jiang, Karin Abegglen, Horst Bunke, and János Csirik, "Dynamic computation of generalised median strings," Pattern Analysis and Applications, vol. 6, no. 3, pp. 185–193, 2003.
[5]Xiaoyi Jiang, Horst Bunke, and János Csirik, "Median Strings: A Review," Data Mining in Time Series Databases, World Scientific, pp. 173–192, 2004.
[6]F. Casacuberta and M.D. de Antoni, "A greedy algorithm for computing approximate median strings," Proceedings of National Symposium on Pattern Recognition and Image Analysis, pp. 193–198, Barcelona, Spain, 1997.


Author – Samuel Audet <saudet at ok.ctrl.titech.ac.jp>
Last Modified – 2015-01-21