How does Fossil's excellent diff algorithm work?

(1) By MBL (RoboManni) on 2022-12-30 10:45:36 [link] [source]

I was inspired by drh's Devoxx post when I came to this question about the diff algorithm used by fossil.

Looking through the manual of pijul I found this section about the Diff Algorithm and was interested to learn more about it. The Myers diff shows every detail with simulation steps of how it works.

Then I became interested to see how fossil is doing its diff'ing but -unfortunately- could not find that in the documentation. Question: Is there a good description of fossil's algorithm and how it is applied? Can someone post me a link, please? Thanks in advance.

(I was using the index of Fossil Documentation and searched for keywords diff algorithm. The 2nd match refers to chapter 3.0 the Fossil Concepts with the matching text of: Fossil has an excellent built-in "diff" algorithm that works fine for most people.)

(2) By Warren Young (wyoung) on 2022-12-30 11:08:44 in reply to 1 [link] [source]

Here.

(3) By Richard Hipp (drh) on 2022-12-30 11:50:32 in reply to 2 [link] [source]

The delta algorithm is different from the diff algorithm. The delta algorithm strives to find minimal deltas. The diff algorithm strives to generate a diff that is understandable by humans. These are different goals.

I originally though that I could use the delta algorithm to implement diff. But experiments shows me that that did not work. One problem was that the delta algorithm will descramble common sections of code into the correct order even if they occur in a different order in the original document. But for human readability, it is very important that the order of common blocks be preserved. Another issue is that the delta algorithm is byte-oriented whereas the diff algorithm is line oriented.

The only documentation on the diff algorithm is comments in the code. The key routine to look at is longestCommonSequence().

(4) By Daniel Dumitriu (danield) on 2023-01-04 12:03:36 in reply to 1 [source]

[computer science history talk ahead]

The matter of diff(erenc)ing two files has got a lot of attention due to its importance in programming, and derived from that, in version control systems. The underlying problem is called the longest common subsequence and is well-studied in theoretical computer science (read: algorithmics); while NP-hard in the general case, polynomial solutions can be obtained by dynamic programming. The main goal is then to go from quadratic to subquadratic complexity.

The first algorithm (now known as Hunt-Szymanski) has been initiated by the original implementors of the Unix diff¹, Douglas McIlroy and James Hunt, and is seemingly still used in some implementations (OpenBSD).
GNU diff uses Myers' algorithm (and later), but defaults to a faster heuristic by Paul Eggers. (As a detour, Gene Myers became famous for BLAST, which solves the DNA/RNA sequence alignment problem, itself a particular case for the longest common subsequence).
Git uses by default Myers' as well; the much slower patience ("Solitaire") and slightly faster histogram variants are supposedly more "meaningful" for source code.
Mercurial uses the difflib library, which is based on the Ratcliff-Obershelp algorithm.

According to its fossilized history, Fossil's diff started using a variation of Myers' algorithm, and switched after a couple of months to its own hash-based heuristic, which in some cases and for small inputs (about 40 lines) delegates to a subroutine computing the expensive (quadratic) optimal solution. The study of this algorithm is left as an exercise to the reader 😉.

^{^} Interesting enough, Larry Wall's patch did not come until 10 years later.

(5) By MBL (RoboManni) on 2023-01-04 18:00:25 in reply to 4 [link] [source]

Thank you very much ...

Now I will have a lot to read and study to satisfy my curiosity.

Fossil User Forum

How does Fossil’s excellent diff algorithm work?