guest - flak
Flaakle

patience diffing algorithm

I needed a (text) diff algorithm, and if you search for one you mostly come up with the Myers algorithm. But then I stumbled across something called patience diffing, and it turns out to be just what I wanted. It’s already described elsewhere, but it seems more people could stand to know about it, so here we are. It’s easy to understand, and more importantly, usually makes pretty diffs (often prettier than Myers).

At first I was worried “patience” might refer to the required user attribute. One way to make a pretty diff is to calculate all possible diffs and pick the best one. That would certainly require a lot of patience. But no, the name refers to the card game which I would call solitaire, because one step of the algorithm uses the so called patience algorithm to find a common subsequence of matching lines.

I’ll provide some notes here, but there are some external links of interest. Patience diffing was apparently invented by Bram Cohen, although I don’t have a reference for the first implementation. An early description. Bram’s notes on patience diff advantages. A very nice and thorough explanation with a followup implementing patience diff. I would definitely recommend the previous links versus trying to implement anything from my description.

The Myers algorithm never quite clicked for me. I remember implementing something like it for a class a while ago, and I’ve looked at the algorithm a few times, but there’s no chance I could implement it from memory. Maybe I just don’t like snakes. The patience algorithm is very simple to understand. I think this comes from the fact that it relates closely to what we’re trying to do. I want to diff two files, not navigate a maze.

High level description: Find the matching lines in both files. Diff the regions in between the matching lines.

More complete description: Begin with iterators at the start and end of both files. While the first line in both files matches, move forward. While the last line in both files matches, move backward. You now have a single partially mismatched region.

Within a single mismatched region, find in order matching lines from both sides (longest subsequence problem) to subdivide into smaller regions. For each region, either recurse by returning to step 1, or simply dump the left as subtractions and the right side as additions. Or do Myers, etc.

You have a few choices. As originally described, the matching lines are found only among unique lines. This works quite well in practice. It keeps the structure of the file intact by preserving function declarations, etc, instead of focusing on all the “return 0” lines. But sometimes there aren’t any unique lines, especially after the first pass. In this case, we will consider a common subsequence of any lines. In my implementation, I do one pass looking for unique lines, then the first recursion considers any line, and then instead of recursing further just starting dumping differences.

Compared to Myers, we’re not trying to navigate the full array from one corner to the next. Instead, we’re just looking for all the diagonal lines. Or another way to put it, we’re not seeking to minimize edits. We’re seeking to preserve content. Per file unique lines are indicative of structure, so we align those first. In a coding context, this appears to work very well.

With some knowledge of the algorithm it’s possible to construct edits that perform poorly. If you double all the lines, there won’t be any unique matches. But you can create one single matching line to trick the algorithm at inconvenient points.

In practice, a peephole optimizer can improve diffs after the fact. Of course, an optimizer can improve most any diffing algorithm, so this isn’t really related to patience diffing, but I found it quite beneficial because sometimes the way patience subdivides the file leads to two awkward adjacent chunks. It’s pretty cheap to fix these up, so why not? If we’re going to add a “boring” line, either blank or just generic syntax like a closing brace, look ahead to see if the same line is already included. If so, swap the existing line to before the edit, and add the boring line after the diff.

Weren’t we promised pretty diffs at the start? I think there are some implementation strategies for Myers that it generally avoids these cases. You have more control over walking the graph. The recursion in patience diffing means each chunk has less information about the surrounding text. Overall though, I quite like the results. An occasionally larger than minimal diff is an acceptable compromise to avoid the incomprensible diffs that find a way to preserve only curly braces.

Example cribbed from diff adding --patience option to git. Before:

--- a/f.c
+++ b/f.c
@@ -1,26 +1,25 @@
 #include <stdio.h>
 
-// Frobs
-int frobnitz(int foo)
+int fib(int n)
 {
-       int i;
-       for (i = 0; i < 10; i++)
+       if (n > 2)
        {
-               printf("Your answer");
-               printf("%d\n", foo);
+               return fib(n-1) + fib(n-2);
        }
+       return 1;
 }
 
-int fact(int n)
+// Frobs
+int frobnitz(int foo)
 {
-       if (n > 1)
+       int i;
+       for (i = 0; i < 10; i++)
        {
-               return fact(n-1) * n;
+               printf("%d\n", foo);
        }
-       return 1;
 }
 
 int main(int argc, char **argv)
 {
-       frobnitz(fact(10));
+       frobnitz(fib(10));
 }

After:

--- file1.c     date1
+++ file2.c     date2
@@ -1,26 +1,25 @@
 #include <stdio.h>
 
+int fib(int n)
+{
+       if (n > 2)
+       {
+               return fib(n-1) + fib(n-2);
+       }
+       return 1;
+}
+
 // Frobs
 int frobnitz(int foo)
 {
        int i;
        for (i = 0; i < 10; i++)
        {
-               printf("Your answer");
                printf("%d\n", foo);
        }
 }
 
-int fact(int n)
-{
-       if (n > 1)
-       {
-               return fact(n-1) * n;
-       }
-       return 1;
-}
-
 int main(int argc, char **argv)
 {
-       frobnitz(fact(10));
+       frobnitz(fib(10));
 }

Of course, this is not the only algorithm to generate the second diff. Both Mercurial and OpenBSD diff produce the second result, although they can be tricked by other inputs.

Here’s a real world example, from my implementation of diff itself. The first diff is from mercurial, hg log -p.

diff -r f0529006659a -r 43062c24d1ce diff.go
--- a/diff.go   Thu Feb 14 04:13:23 2019 -0500
+++ b/diff.go   Thu Feb 14 04:16:30 2019 -0500
@@ -119,17 +119,18 @@
        return true
 }
 
-func Unidiff(name1, date1 string, d1 []byte, name2, date2 string, d2 []byte) string {
-       hasnl1 := len(d1) == 0 || len(d1) > 0 && d1[len(d1)-1] == '\n'
-       l1 := strings.Split(string(d1), "\n")
-       if hasnl1 && len(d1) > 0 {
-               l1 = l1[:len(l1)-1]
+func bytestolines(d []byte) ([]string, bool) {
+       hasnl := len(d) == 0 || len(d) > 0 && d[len(d)-1] == '\n'
+       l := strings.Split(string(d), "\n")
+       if hasnl && len(d) > 0 {
+               l = l[:len(l)-1]
        }
-       hasnl2 := len(d2) == 0 || len(d2) > 0 && d2[len(d2)-1] == '\n'
-       l2 := strings.Split(string(d2), "\n")
-       if hasnl2 && len(d2) > 0 {
-               l2 = l2[:len(l2)-1]
-       }
+       return l, hasnl
+}
+
+func Unidiff(name1, date1 string, d1 []byte, name2, date2 string, d2 []byte) string {
+       l1, hasnl1 := bytestolines(d1)
+       l2, hasnl2 := bytestolines(d2)
        diffs := Diff(l1, l2)
        var buf strings.Builder
        WriteUnidiff(&buf, name1, date1, hasnl1, name2, date2, hasnl2, diffs)

Here’s how I render that change.

diff -r f0529006659a -r 43062c24d1ce diff.go
--- diff.go     2019-02-14 04:13:23 -0500 EST
+++ diff.go     2019-02-14 04:16:30 -0500 EST
@@ -119,17 +119,18 @@
        return true
 }
 
+func bytestolines(d []byte) ([]string, bool) {
+       hasnl := len(d) == 0 || len(d) > 0 && d[len(d)-1] == '\n'
+       l := strings.Split(string(d), "\n")
+       if hasnl && len(d) > 0 {
+               l = l[:len(l)-1]
+       }
+       return l, hasnl
+}
+
 func Unidiff(name1, date1 string, d1 []byte, name2, date2 string, d2 []byte) string {
-       hasnl1 := len(d1) == 0 || len(d1) > 0 && d1[len(d1)-1] == '\n'
-       l1 := strings.Split(string(d1), "\n")
-       if hasnl1 && len(d1) > 0 {
-               l1 = l1[:len(l1)-1]
-       }
-       hasnl2 := len(d2) == 0 || len(d2) > 0 && d2[len(d2)-1] == '\n'
-       l2 := strings.Split(string(d2), "\n")
-       if hasnl2 && len(d2) > 0 {
-               l2 = l2[:len(l2)-1]
-       }
+       l1, hasnl1 := bytestolines(d1)
+       l2, hasnl2 := bytestolines(d2)
        diffs := Diff(l1, l2)
        var buf strings.Builder
        WriteUnidiff(&buf, name1, date1, hasnl1, name2, date2, hasnl2, diffs)

The Unidiff() line in the middle is important. It hasn’t changed. The second diff clearly captures what actually happened here. I added a new function, moving some code from an existing function, but the existing function itself was not removed.

Posted 2019-02-13 21:34:14 by tedu Updated: 2019-02-20 10:06:24
Tagged: programming