Fossil: Branching, Forking, Merging, and Tagging

Background

In a simple and perfect world, the development of a project would proceed linearly, as shown in Figure 1.

Figure 1

Each circle represents a check-in. For the sake of clarity, the check-ins are given small consecutive numbers. In a real system, of course, the check-in numbers would be long hexadecimal hashes since it is not possible to allocate collision-free sequential numbers in a distributed system. But as sequential numbers are easier to read, we will substitute them for the long hashes in this document.

The arrows in Figure 1 show the evolution of a project. The initial check-in is 1. Check-in 2 is derived from 1. In other words, check-in 2 was created by making edits to check-in 1 and then committing those edits. We say that 2 is a child of 1 and that 1 is a parent of 2. Check-in 3 is derived from check-in 2, making 3 a child of 2. We say that 3 is a descendant of both 1 and 2 and that 1 and 2 are both ancestors of 3.
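
The parent, child, ancestor, and descendant relations above can be sketched in a few lines of Python. This is only an illustrative model, not Fossil's internal representation; the `PARENTS` map and the helper names are hypothetical, and the four-check-in chain is our reading of Figure 1.

```python
# Figure 1 as a parent map: each check-in lists its direct parents.
# The four-check-in chain 1 -> 2 -> 3 -> 4 is an assumption based on the text.
PARENTS = {1: [], 2: [1], 3: [2], 4: [3]}


def ancestors(checkin, parents=PARENTS):
    """Return every check-in reachable by following parent links."""
    seen = set()
    stack = list(parents[checkin])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def descendants(checkin, parents=PARENTS):
    """Return every check-in that has `checkin` among its ancestors."""
    return {c for c in parents if checkin in ancestors(c, parents)}


print(sorted(ancestors(3)))    # [1, 2]
print(sorted(descendants(1)))  # [2, 3, 4]
```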

DAGs

The graph of check-ins is a directed acyclic graph, commonly shortened to DAG. Check-in 1 is the root of the DAG since it has no ancestors. Check-in 4 is a leaf of the DAG since it has no descendants. (We will give a more precise definition of "leaf" later.)

Alas, reality often interferes with the simple linear development of a project. Suppose two programmers make independent modifications to check-in 2. After both changes are committed, the check-in graph looks like Figure 2:

Figure 2

The graph in Figure 2 has two leaves: check-ins 3 and 4. Check-in 2 has two children, check-ins 3 and 4. We call this state a fork.
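
As a rough illustration, the leaves and the fork in Figure 2 can be found by inverting a parent map. The sketch below ignores branch names, so it uses the simple definition of "leaf" given so far; the data structure and function names are hypothetical.

```python
from collections import defaultdict

# Figure 2: check-ins 3 and 4 are both children of check-in 2.
PARENTS = {1: [], 2: [1], 3: [2], 4: [2]}


def children_of(parents):
    """Invert the parent map into a parent -> children map."""
    kids = defaultdict(list)
    for child, its_parents in parents.items():
        for parent in its_parents:
            kids[parent].append(child)
    return kids


def leaves(parents):
    """Check-ins with no children at all (branches are ignored here)."""
    kids = children_of(parents)
    return sorted(c for c in parents if not kids[c])


def fork_points(parents):
    """Check-ins with more than one child."""
    kids = children_of(parents)
    return sorted(c for c in parents if len(kids[c]) > 1)


print(leaves(PARENTS))       # [3, 4]
print(fork_points(PARENTS))  # [2]
```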

Fossil tries to prevent forks. Suppose two programmers named Alice and Bob are each editing check-in 2 separately. Alice finishes her edits first and commits her changes, resulting in check-in 3. Later, when Bob attempts to commit his changes, Fossil verifies that check-in 2 is still a leaf. Fossil sees that check-in 3 has occurred and aborts Bob's commit attempt with a message "would fork." This allows Bob to do a "fossil update" which pulls in Alice's changes, merging them into his own changes. After merging, Bob commits check-in 4 as a child of check-in 3. The result is a linear graph as shown in Figure 1. This is how CVS works. This is also how Fossil works in "autosync" mode.

But perhaps Bob is off-network when he does his commit, so he has no way of knowing that Alice has already committed her changes. Or, it could be that Bob has turned off "autosync" mode in Fossil. Or, maybe Bob just doesn't want to merge in Alice's changes before he has saved his own, so he forces the commit to occur using the "--allow-fork" option to the fossil commit command. For any of these reasons, two commits against check-in 2 have occurred and now the DAG has two leaves.
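
The commit-time check described above can be modelled with a toy repository class. This is a sketch of the idea only: `ToyRepo`, `WouldFork`, and the `allow_fork` parameter are hypothetical names standing in for Fossil's behavior with a shared, autosynced repository and for the "--allow-fork" option.

```python
class WouldFork(Exception):
    """Raised when a commit would give a check-in a second child."""


class ToyRepo:
    """A toy shared repository; not how Fossil stores anything."""

    def __init__(self):
        self.parent = {1: None}   # check-in -> parent
        self.next_id = 2

    def is_leaf(self, checkin):
        return checkin not in self.parent.values()

    def commit(self, parent, allow_fork=False):
        if not allow_fork and not self.is_leaf(parent):
            raise WouldFork(f"check-in {parent} already has a child")
        new = self.next_id
        self.next_id += 1
        self.parent[new] = parent
        return new


repo = ToyRepo()
base = repo.commit(1)             # check-in 2
alice = repo.commit(base)         # Alice commits check-in 3

try:
    repo.commit(base)             # Bob's first attempt against check-in 2
    refused = False
except WouldFork:
    refused = True

bob_linear = repo.commit(alice)   # after updating: check-in 4, as in Figure 1

forked = ToyRepo()
base2 = forked.commit(1)
forked.commit(base2)                               # Alice: check-in 3
bob_fork = forked.commit(base2, allow_fork=True)   # check-in 4, as in Figure 2

print(refused, repo.parent[bob_linear], forked.parent[bob_fork])  # True 3 2
```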

So which version of the project is the "latest" in the sense of having the most features and the most bug fixes? When there is more than one leaf in the graph, you don't really know, so we like to have check-in graphs with a single leaf.

Fossil resolves such problems using the check-in time on the leaves to decide which leaf to use as the parent of new leaves. When a branch is forked as in Figure 2, Fossil will choose check-in 4 as the parent for a later check-in 5, but only if it has sync'd that check-in down into the local repository. If autosync is disabled or the user is off-network when that fifth check-in occurs, so that check-in 3 is the latest on that branch at the time within that clone of the repository, Fossil will make check-in 3 the parent of check-in 5!

Fossil also uses a forked branch's leaf check-in timestamps when checking out that branch: it gives you the fork with the latest check-in, which in turn selects which parent your next check-in will be a child of. This situation means development on that branch can fork into two independent lines of development, based solely on which branch tip is newer at the time the next user starts his work on it. Because of this, we strongly recommend that you do not intentionally create forks on long-lived shared working branches with "--allow-fork". (Prime example: trunk.)
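
The timestamp rule can be sketched as follows. The function name and the numeric times are hypothetical; the point is only that the answer depends on which leaves the local clone has already received.

```python
def pick_tip(leaf_times, known):
    """Return the newest leaf among the check-ins this clone has synced."""
    local = {c: t for c, t in leaf_times.items() if c in known}
    return max(local, key=local.get)


# Leaves of the forked branch in Figure 2, with hypothetical check-in times.
# Bob committed check-in 4 after Alice committed check-in 3.
LEAF_TIMES = {3: 100, 4: 200}

print(pick_tip(LEAF_TIMES, known={1, 2, 3, 4}))  # 4
print(pick_tip(LEAF_TIMES, known={1, 2, 3}))     # 3
```

A clone that has not yet synced check-in 4 makes a different choice from one that has, which is how a single fork can turn into two ongoing lines of development.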

Let us return to Figure 2. To resolve such situations before they can become a real problem, Alice can use the fossil merge command to merge Bob's changes into her local copy of check-in 3. Then she can commit the results as check-in 5. This results in a DAG as shown in Figure 3.

Figure 3

Check-in 5 is a child of check-in 3 because it was created by editing check-in 3. But check-in 5 also inherits the changes from check-in 4 by virtue of the merge. So we say that check-in 5 is a merge child of check-in 4 and that it is a direct child of check-in 3. The graph is now back to a single leaf, check-in 5.
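
One way to picture the difference between a direct child and a merge child is to record the two kinds of parent separately. The layout below is a hypothetical model of Figure 3, not Fossil's storage format.

```python
# Figure 3: check-in -> (direct parent, list of merge parents).
CHECKINS = {
    1: (None, []),
    2: (1, []),
    3: (2, []),
    4: (2, []),
    5: (3, [4]),   # edited from check-in 3, with check-in 4 merged in
}


def direct_children(checkin):
    return sorted(c for c, (direct, _) in CHECKINS.items() if direct == checkin)


def merge_children(checkin):
    return sorted(c for c, (_, merges) in CHECKINS.items() if checkin in merges)


def leaves():
    """Check-ins with neither direct nor merge children."""
    return sorted(c for c in CHECKINS
                  if not direct_children(c) and not merge_children(c))


print(direct_children(3))  # [5]
print(merge_children(4))   # [5]
print(leaves())            # [5]
```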

We have already seen that if Fossil is in autosync mode then Bob would have been warned about the potential fork the first time he tried to commit check-in 4. If Bob had updated his local check-out to merge in Alice's check-in 3 changes, then committed, then the fork would have never occurred. The resulting graph would have been linear, as shown in Figure 1.

Realize that the graph of Figure 1 is a subset of Figure 3. Hold your hand over the check-in 4 circle of Figure 3 and then Figure 3 looks exactly like Figure 1, except that the leaf has a different check-in number, but that is just a notational difference — the two check-ins have exactly the same content. In other words, Figure 3 is really a superset of Figure 1. The check-in 4 of Figure 3 captures additional state which is omitted from Figure 1. Check-in 4 of Figure 3 holds a copy of Bob's local checkout before he merged in Alice's changes. That snapshot of Bob's changes, which is independent of Alice's changes, is omitted from Figure 1. Some people say that the approach taken in Figure 3 is better because it preserves this extra intermediate state. Others say that the approach taken in Figure 1 is better because it is much easier to visualize a linear line of development and because the merging happens automatically instead of as a separate manual step. We will not take sides in that debate. We will simply point out that Fossil enables you to do it either way.

The Alternative to Forking: Branching

Having more than one leaf in the check-in DAG is called a "fork." This is usually undesirable and either avoided entirely, as in Figure 1, or else quickly resolved as shown in Figure 3. But sometimes, one does want to have multiple leaves. For example, a project might have one leaf that is the latest version of the project under development and another leaf that is the latest version that has been tested. When multiple leaves are desirable, we call this branching instead of forking:

Key Distinction: A branch is a named, intentional fork.

Forks may be intentional, but most of the time, they're accidental.

Figure 4 shows an example of a project where there are two branches, one for development work and another for testing.

Figure 4

The hypothetical scenario of Figure 4 is this: The project starts and progresses to a point where (at check-in 2) it is ready to enter testing for its first release. In a real project, of course, there might be hundreds or thousands of check-ins before a project reaches this point, but for simplicity of presentation we will say that the project is ready after check-in 2. The project then splits into two branches that are used by separate teams. The testing team, using the blue branch, finds and fixes a few bugs. This is shown by check-ins 6 and 9. Meanwhile the development team, working on the top uncolored branch, is busy adding features for the second release. Of course, the development team would like to take advantage of the bug fixes implemented by the testing team. So periodically, the changes in the test branch are merged into the dev branch. This is shown by the dashed merge arrows between check-ins 6 and 7 and between check-ins 9 and 10.

In both Figures 2 and 4, check-in 2 has two children. In Figure 2, we call this a "fork." In Figure 4, we call it a "branch." What is the difference? As far as the internal Fossil data structures are concerned, there is no difference. The distinction is in the intent. In Figure 2, the fact that check-in 2 has multiple children is an accident that stems from concurrent development. In Figure 4, giving check-in 2 multiple children is a deliberate act. So, to a good approximation, we define forking to be by accident and branching to be by intent. Apart from that, they are the same.

Fossil offers two primary ways to create named, intentional forks, a.k.a. branches. First:

    $ fossil commit --branch my-new-branch-name

This is the method we recommend for most cases: it creates a branch as part of a checkin using the version in the current checkout directory as its basis. (This is normally the tip of the current branch, though it doesn't have to be. You can create a branch from an ancestor checkin on a branch as well.) After making this branch-creating checkin, your local working directory is switched to that branch, so that further checkins occur on that branch as well, as children of the tip checkin on that branch.

The second, more complicated option is:

    $ fossil branch new my-new-branch-name trunk
    $ fossil update my-new-branch-name
    $ fossil commit

Not only is this three commands instead of one, the first of which is longer than the entire simpler command above, but the order matters as well. You must give the second command before creating any checkins: until you do, your local working directory remains on the branch it was on when you issued the first command, so the commit would put the new material on the original branch instead of the new one.

In addition to those problems, the second method is a violation of the YAGNI Principle. We recommend that you wait until you actually need the branch and create it using the first command above.

(Keep in mind that trunk is just another branch in Fossil. It is simply the default branch name for the first checkin and every checkin made as one of its direct descendants. It is special only in that it is Fossil's default when it has no better idea of which branch you mean.)

Justifications For Forking

The primary cases where forking is justified over branching are all when it is done purely in software in order to avoid losing information:

By Fossil itself when two users check in children to the same leaf of a branch, as in Figure 2. If the fork occurs because autosync is disabled on one or both of the repositories or because the user doing the check-in has no network connection at the moment of the commit, Fossil has no way of knowing that it is creating a fork until the two repositories are later sync'd.

By Fossil when the cloning hierarchy is more than 2 levels deep.

Fossil's synchronization protocol is a two-party negotiation; syncs don't automatically propagate up the clone tree beyond that. Because of that, if you have a master repository and Alice clones it, then Bobby clones from Alice's repository, a check-in by Bobby that autosyncs with Alice's repo will not also autosync with the master repo. The master doesn't get a copy of Bobby's checkin until Alice separately syncs with the master. If Carol cloned from the master repo and checks something in that creates a fork relative to Bobby's check-in, the master repo won't know about that fork until Alice syncs her repo with the master. Even then, realize that Carol still won't know about the fork until she subsequently syncs with the master repo.

One way to deal with this is to just accept it as a fact of using a Distributed Version Control System like Fossil.

Another option, which we recommend you consider carefully, is to make it a local policy that checkins be made only against the master repo or one of its immediate child clones so that the autosync algorithm can do its job most effectively; any clones deeper than that should be treated as read-only and thus get a copy of the new state of the world only once these central repos have negotiated that new state. This policy avoids a class of inadvertent fork you might not need to tolerate. Since forks on long-lived shared working branches can end up dividing a team's development effort, a team may easily justify this restriction on distributed cloning.
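
To see why a deep clone hierarchy delays knowledge of a fork, it can help to model each clone as a set of check-ins and each sync as a two-party exchange. This is a deliberately simplified sketch; the `sync` helper and the check-in labels are hypothetical.

```python
def sync(a, b):
    """Two-party sync: afterwards both clones hold the union of check-ins."""
    a.update(b)
    b.update(a)


# Each clone is modelled as a set of check-in labels (all labels hypothetical).
master = {"c1"}
alice = set(master)    # Alice clones the master
bobby = set(alice)     # Bobby clones Alice's clone
carol = set(master)    # Carol clones the master

bobby.add("bobby-1")
sync(bobby, alice)     # Bobby's autosync reaches Alice only

carol.add("carol-1")
sync(carol, master)    # Carol's autosync reaches the master

print("bobby-1" in master)   # False: the master cannot see the fork yet

sync(alice, master)    # Alice syncs separately with the master
print(sorted(master))        # ['bobby-1', 'c1', 'carol-1']
print("bobby-1" in carol)    # False until Carol syncs again
```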

You've automated Fossil (e.g. with a shell script) and forking is a possibility, so you write fossil commit --allow-fork commands to prevent Fossil from refusing the check-in because it would create a fork. It's better to write such a script to detect this condition and cope with it (e.g. fossil update) but if the alternative is losing information, you may feel justified in creating forks that an interactive user must later clean up with fossil merge commands.

That leaves only one case where we can recommend use of "--allow-fork" by interactive users: when you're working on a personal branch so that creating a dual-tipped branch isn't going to cause any other user an inconvenience or risk forking the development. Only one developer is involved, and the fork may be short-lived, so there is no risk of inadvertently forking the overall development effort. This is a good alternative to branching when you just need to temporarily fork the branch's development. It avoids cluttering the global branch namespace with short-lived temporary named branches.

There's a common generalization of that case: you're a solo developer, so that the problems with branching vs forking simply don't matter. In that case, feel free to use "--allow-fork" as much as you like.

Fixing Forks

If your local checkout is on a forked branch, you can usually fix a fork automatically with:

    $ fossil merge

Normally you need to pass arguments to fossil merge to tell it what you want to merge into the current basis view of the repository, but without arguments, the command seeks out and fixes forks.

Tags And Properties

Tags and properties are used in Fossil to help express the intent, and thus to distinguish between forks and branches. Figure 5 shows the same scenario as Figure 4 but with tags and properties added:

Figure 5

A tag is a name that is attached to a check-in. A property is a name/value pair. Internally, Fossil implements tags as properties with a NULL value. So, tags and properties really are much the same thing, and henceforth we will use the word "tag" to mean either a tag or a property.

A tag can be a one-time tag, a propagating tag or a cancellation tag. A one-time tag only applies to the check-in to which it is attached. A propagating tag applies to the check-in to which it is attached and also to all direct descendants of that check-in. A direct descendant is a descendant through direct children. Tag propagation does not cross merges. Tag propagation also stops as soon as it encounters another check-in with the same tag. A cancellation tag is attached to a single check-in in order to either override a one-time tag that was previously placed on that same check-in, or to block tag propagation from an ancestor.

The initial check-in of every repository has two propagating tags. In Figure 5, that initial check-in is check-in 1. The branch tag tells (by its value) what branch the check-in is a member of. The default branch is called "trunk." All tags that begin with "sym-" are symbolic name tags. When a symbolic name tag is attached to a check-in, that allows you to refer to that check-in by its symbolic name rather than by its hexadecimal hash name. When a symbolic name tag propagates (as does the sym-trunk tag) then referring to that name is the same as referring to the most recent check-in with that name. Thus the two tags on check-in 1 cause all descendants to be in the "trunk" branch and to have the symbolic name "trunk."

Check-in 4 has a branch tag which changes the name of the branch to "test." The branch tag on check-in 4 propagates to check-ins 6 and 9. But because tag propagation does not follow merge links, the branch=test tag does not propagate to check-ins 7, 8, or 10. Note also that the branch tag on check-in 4 blocks the propagation of branch=trunk so that it cannot reach check-ins 6 or 9. This causes check-ins 4, 6, and 9 to be in the "test" branch and all others to be in the "trunk" branch.

Check-in 4 also has a sym-test tag, which gives the symbolic name "test" to check-ins 4, 6, and 9. Because tags do not propagate across merges, check-ins 7, 8, and 10 do not inherit the sym-test tag and are hence not known by the name "test." To prevent the sym-trunk tag from propagating from check-in 1 into check-ins 4, 6, and 9, there is a cancellation tag for sym-trunk on check-in 4. The net effect is that check-ins on the trunk go by the symbolic name of "trunk" and check-ins on the test branch go by the symbolic name "test."
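
The propagation rules can be checked against Figure 5 with a small model. Everything here is a sketch under stated assumptions: the trunk edges 3 to 5, 5 to 7, 7 to 8, and 8 to 10 are our reading of the figure, the tag kinds are spelled with hypothetical strings, and check-ins are processed in numeric order on the assumption that parents always have smaller numbers than their children.

```python
# Direct (non-merge) parent of each check-in. The merge links 6 -> 7 and
# 9 -> 10 are left out on purpose: propagation never follows them.
PRIMARY_PARENT = {1: None, 2: 1, 3: 2, 4: 2, 5: 3,
                  6: 4, 7: 5, 8: 7, 9: 6, 10: 8}

# (check-in, tag name) -> (kind, value)
TAGS = {
    (1, "branch"): ("propagating", "trunk"),
    (1, "sym-trunk"): ("propagating", None),
    (4, "branch"): ("propagating", "test"),
    (4, "sym-test"): ("propagating", None),
    (4, "sym-trunk"): ("cancel", None),
    (4, "bgcolor"): ("propagating", "blue"),
    (9, "sym-release-1.0"): ("one-time", None),
    (9, "closed"): ("one-time", None),
}


def effective_tags(parent_of, tags):
    """Map each check-in to the {tag name: value} pairs that apply to it."""
    result = {}      # tags visible on each check-in
    inherited = {}   # tags each check-in hands on to its direct children
    for checkin in sorted(parent_of):
        passing = dict(inherited.get(parent_of[checkin], {}))
        visible = dict(passing)
        for (where, name), (kind, value) in tags.items():
            if where != checkin:
                continue
            # Any tag set here stops the same-named tag arriving from above.
            passing.pop(name, None)
            visible.pop(name, None)
            if kind == "propagating":
                passing[name] = value
            if kind != "cancel":
                visible[name] = value
        result[checkin] = visible
        inherited[checkin] = passing
    return result


tags = effective_tags(PRIMARY_PARENT, TAGS)
print([c for c in tags if tags[c]["branch"] == "test"])  # [4, 6, 9]
print([c for c in tags if "sym-trunk" in tags[c]])       # [1, 2, 3, 5, 7, 8, 10]
print([c for c in tags if "closed" in tags[c]])          # [9]
```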

The bgcolor=blue tag on check-in 4 causes the background color of timelines to be blue for check-in 4 and its direct descendants.

Figure 5 also shows two one-time tags on check-in 9. (The diagram does not make a graphical distinction between one-time and propagating tags.) The sym-release-1.0 tag means that check-in 9 can be referred to using the more meaningful name "release-1.0." The closed tag means that check-in 9 is a "closed leaf." A closed leaf is a leaf that should never have direct children.

How Can Forks Divide Development Effort?

Above, we stated that forks carry a risk that development effort on a branch can be divided among the forks. It might not be immediately obvious why this is so. To see it, consider this swim lane diagram:

Figure 6

This is a happy, cooperating team. That is an important restriction on our example, because you must understand that this sort of problem can arise without any malice, selfishness, or willful ignorance in sight. All users on this diagram start out with the same view of the repository, cloned from the same master repo, and all of them are working toward their shared vision of a unified future.

All users, except possibly Alan, start out with the same two initial checkins in their local working clones, 1 & 2. It might be that Alan starts out with only check-in 1 in his local clone, but we'll deal with that detail later.

It doesn't matter which branch this happy team is working on, only that our example makes the most sense if you think of it as a long-lived shared working branch like trunk. Each user makes only one check-in, shaded light gray in the diagram.

Step 1: Alan

Alan sets the stage for this problem by creating a fork from check-in 1 as check-in 3. How and why Alan did this doesn't affect what happens next, though we will walk through the possible cases and attempt to assign blame in the post mortem. For now, you can assume that Alan did this out of unavoidable ignorance.

Step 2: Betty

Because Betty's local clone is autosyncing with the same upstream repository as Alan's clone, there are a number of ways she can end up seeing Alan's check-in 3 as the latest on that branch:

The working check-out directory she's using at the moment was on a different branch at the time Alan made check-in 3, so Fossil sees that as the tip at the time she switches her working directory to that branch with a fossil update $BRANCH command. (There is an implicit autosync in that command, if the option was enabled at the time of the update.)

The same thing, only in a fresh checkout directory with a fossil open $REPO $BRANCH command.

Alan makes his check-in 3 while Betty has check-in 1 or 2 as the tip in her local clone, but because she's working with an autosync'd connection to the same upstream repository as Alan, on attempting what will become check-in 4, she gets the "would fork" message from fossil ci, so she dutifully updates her clone and tries again, moving her work to be a child of the new tip, check-in 3. (If she doesn't update, she creates a second fork, which simply complicates matters beyond what we need here for our illustration.)

For our purposes here, it doesn't really matter which one happened. All that matters is that Alan's check-in 3 becomes the parent of Betty's check-in 4 because it was the newest tip of the working branch at the time Betty does her check-in.

Step 3: Charlie

Meanwhile, Charlie went offline after syncing his repo with check-in 2 as the latest on that branch. When he checks his changes in, it is as a child of 2, not of 4, because Charlie doesn't know about check-ins 3 & 4 yet. He does this at an absolute wall clock time after Alan and Betty made their check-ins, so when Charlie comes back online and pushes his check-in 5 to the master repository and learns about check-ins 3 and 4 during Fossil sync, Charlie inadvertently revives the other side of the fork.

Step 4: Darlene

Darlene sees all of this, because she joins in on the work on this branch after Alan, Betty, and Charlie made their check-ins and pushed them to the master repository. She takes one of the same three paths we outlined for Betty above. Regardless of her path to this view, it happens after Charlie pushed his check-in 5 to the master repo, so Darlene sees that as the latest on the branch. Her work is therefore saved as a child of check-in 5, not of check-in 4, as it would have been had Charlie not come back online and synced before Darlene started work on that branch.

Post Mortem

The end result of all of this is that even though everyone makes only one check-in and no one disables autosync without genuine need, half of the check-ins end up on one side of the fork and half on the other.

A future user — his mother calls him Edward, but please call him Eddie — can then join in on the work on this branch and end up on either side of the fork. If Eddie joins in with the state of the repository as drawn above, he'll end up on the top side of the fork, because check-in 6 is the latest, but if Alan or Betty makes a seventh check-in to that branch first, it will be as a child of check-in 4 since that's the version in their local check-out directories. Since that check-in 7 will then be the latest, Eddie will end up on the bottom side of the fork instead.
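
The split described above can be reproduced with a short simulation. It is a sketch only: the parent map follows the text's description of Figure 6, the helper names are hypothetical, and check-in numbers stand in for check-in times on the assumption that a higher number means a later commit.

```python
# Figure 6 as described in the text: check-in -> parent.
PARENT = {1: None, 2: 1, 3: 1, 4: 3, 5: 2, 6: 5}
AUTHOR = {3: "Alan", 4: "Betty", 5: "Charlie", 6: "Darlene"}


def side_of(checkin, parent, fork_point=1):
    """Return the child of the fork point that `checkin` descends from."""
    while parent[checkin] != fork_point:
        checkin = parent[checkin]
    return checkin


def newest_leaf(parent):
    """Pick the leaf with the highest number, standing in for the latest time."""
    return max(c for c in parent if c not in parent.values())


sides = {}
for checkin, who in AUTHOR.items():
    sides.setdefault(side_of(checkin, PARENT), []).append(who)
print(sides)   # {3: ['Alan', 'Betty'], 2: ['Charlie', 'Darlene']}

# Eddie joins now: the newest leaf is check-in 6, on the side that began with 2.
print(side_of(newest_leaf(PARENT), PARENT))   # 2

# If a check-in 7 lands on top of check-in 4 first, Eddie follows that side.
later = dict(PARENT)
later[7] = 4
print(side_of(newest_leaf(later), later))     # 3
```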

In all of this, realize that neither side of the fork is obviously "correct." Every participant was doing the right thing by their own lights at the time they made their lone check-in.

Who, then, is to blame?

We can only blame the consequences of creating the fork on Alan if he did so on purpose, as by passing "--allow-fork" when creating a check-in on a shared working branch. Alan might have created it inadvertently by going offline while check-in 1 was the tip of the branch in his local clone, so that by the time he made his check-in 3, check-in 2 had arrived at the shared parent repository from someone else. (Francine?) When Alan rejoins the network and does an autosync, he learns about check-in 2. Since his check-in 3 is already checked into his local clone because autosync was off or blocked, the sync creates an unavoidable fork. We can't blame either Alan or Francine here: they were both doing the right thing given their imperfect view of the state of the global situation.

The same is true of Betty, Charlie, and Darlene. None of them tried to create a fork, and none of them chose a side in this fork to participate in. They just took Fossil's default and assumed it was correct.

The only blame we can assign here is on any of these users who believed forks couldn't happen before this did occur, and we blame them only for their avoidable ignorance. (You, dear reader, have been ejected from that category by reading this very document.) Any time someone can work without getting full coordination from every other clone of the repo, forks are possible. Given enough time, they're all but inevitable. This is a general property of DVCSes, not just of Fossil.

This sort of consequence is why forks on shared working branches are bad, which is why Fossil tries so hard to avoid them, why it warns you about them when they do occur, and why it makes fixing them relatively quick and painless.

Review Of Terminology

Branch
A branch is a set of check-ins with the same value for their "branch" property.

Leaf
A leaf is a check-in with no children in the same branch.

Closed Leaf
A closed leaf is any leaf with the closed tag. These leaves are intended to never be extended with descendants and hence are omitted from lists of leaves in the command-line and web interface.

Open Leaf
An open leaf is a leaf that is not closed.

Fork
A fork is when a check-in has two or more direct (non-merge) children in the same branch.

Branch Point
A branch point occurs when a check-in has two or more direct (non-merge) children in different branches. A branch point is similar to a fork, except that the children are in different branches.

Check-in 4 of Figure 3 is not a leaf because it has a child (check-in 5) in the same branch. Check-in 9 of Figure 5 also has a child (check-in 10) but that child is in a different branch, so check-in 9 is a leaf. Because of the closed tag on check-in 9, it is a closed leaf.

Check-in 2 of Figure 3 is considered a "fork" because it has two children in the same branch. Check-in 2 of Figure 5 also has two children, but each child is in a different branch, hence in Figure 5, check-in 2 is considered a "branch point."
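
The definitions above translate fairly directly into code. The sketch below is one possible reading, not Fossil's implementation: the record layout and the `ci` and `classify` helpers are hypothetical, "fork" is read as two or more direct children that share a branch, and the trunk edges of Figure 5 (3 to 5, 5 to 7, 7 to 8, 8 to 10) are assumed.

```python
from collections import Counter


def ci(branch, parent, merges=(), closed=False):
    """Build one check-in record."""
    return {"branch": branch, "parent": parent,
            "merges": list(merges), "closed": closed}


FIGURE_3 = {
    1: ci("trunk", None), 2: ci("trunk", 1), 3: ci("trunk", 2),
    4: ci("trunk", 2), 5: ci("trunk", 3, merges=[4]),
}

FIGURE_5 = {
    1: ci("trunk", None), 2: ci("trunk", 1), 3: ci("trunk", 2),
    4: ci("test", 2), 5: ci("trunk", 3), 6: ci("test", 4),
    7: ci("trunk", 5, merges=[6]), 8: ci("trunk", 7),
    9: ci("test", 6, closed=True), 10: ci("trunk", 8, merges=[9]),
}


def classify(checkins):
    """Return check-in -> set of labels from the terminology list."""
    report = {}
    for cid, info in checkins.items():
        direct = [k for k, v in checkins.items() if v["parent"] == cid]
        merged = [k for k, v in checkins.items() if cid in v["merges"]]
        same_branch = [k for k in direct + merged
                       if checkins[k]["branch"] == info["branch"]]
        per_branch = Counter(checkins[k]["branch"] for k in direct)
        labels = set()
        if not same_branch:
            labels.add("closed leaf" if info["closed"] else "open leaf")
        if any(count >= 2 for count in per_branch.values()):
            labels.add("fork")
        if len(per_branch) >= 2:
            labels.add("branch point")
        report[cid] = labels
    return report


fig3 = classify(FIGURE_3)
fig5 = classify(FIGURE_5)
print(fig3[2], fig3[4])            # {'fork'} set()
print(fig5[2], fig5[9], fig5[10])  # {'branch point'} {'closed leaf'} {'open leaf'}
```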

Differences With Other DVCSes

Single DAG

Fossil keeps all check-ins on a single DAG. Branches are identified with tags. This means that check-ins can be freely moved between branches simply by altering their tags.

Most other DVCSes maintain a separate DAG for each branch.

Branch Names Need Not Be Unique

Unlike some VCSes, most notably Git, Fossil does not require that branch names be unique. Just as with unnamed branches (which we call forks), Fossil resolves such ambiguities using the timestamp of the latest checkin in each branch. If you have two branches named "foo" and you say fossil up foo, you get the tip of the "foo" branch with the most recent checkin.
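
A minimal sketch of that name resolution, assuming a list of branch tips with hypothetical hashes and check-in times:

```python
# (check-in id, branch name, check-in time) for each branch tip.
# All ids and times are made up for illustration.
TIPS = [
    ("a1f3", "foo", 100),
    ("9c2e", "foo", 250),
    ("77b0", "trunk", 300),
]


def resolve(name, tips=TIPS):
    """Return the tip with the most recent check-in among branches called `name`."""
    matches = [tip for tip in tips if tip[1] == name]
    if not matches:
        raise LookupError(f"no branch named {name!r}")
    return max(matches, key=lambda tip: tip[2])[0]


print(resolve("foo"))    # 9c2e
print(resolve("trunk"))  # 77b0
```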

This fact is helpful because it means you can reuse branch names, which is especially useful with utility branches. There are several of these in the SQLite and Fossil repositories: "broken-build," "declined," "mistake," etc. As you might guess from these names, such branch names are used in renaming the tip of one branch to shunt it off away from the mainline of that branch due to some human error. (See fossil amend and the Fossil UI checkin amendment features.) This is a workaround for Fossil's normal inability to forget history: we usually don't want to actually remove history, but would like to sometimes set some of it aside under a new label.

Because some VCSes can't cope with duplicate branch names, Fossil collapses such names down on export using the same timestamp based arbitration logic, so that only the branch with the newest checkin gets the branch name in the export.

All of the above is true of tags in general, not just branches.