Fossil

Artifact [dfd6e05a]
Login

Artifact [dfd6e05a]

Artifact dfd6e05ab2e0e3758938d42bc3e558ecbf2b06e5:

Instance of technote [be8f2f34] - Design Note: Scaling Fossil Up by drh 2010-10-09 10:26:44.
C <b>Design\sNote:</b>\sScaling\sFossil\sUp
D 2010-10-09T10:35:40
E 2010-10-09T10:26:44 be8f2f3447ef2ea3344f8058b6733aa08c08336f
P ba60f28251f20688010da11d1a58d47ba627c429
T +bgcolor * #c0ffc0
T +sym-blog *
U drh
W 9898
<title>Scaling Fossil Up</title>

<h2>Overview</h2>

Joerg Sonnenberger has done an import of the CVS repository for the
NetBSD pkgsrc into Fossil.  The result demonstrates
some scaling issues with Fossil.  (See ticket [6c5471445173a17fffa6aa] for
some additional information.)  This design note describes the issues and
suggests paths toward their solution.

Checkout and commit from pkgsrc are a little sluggish but are workable.
Most of the website pages are workable.  Where Fossil bogs down is on
the "fossil rebuild" command and on cloning.  Enhancements to fix those
issues, and to improve the speed of checkout and commit, are suggested.

<h2>Details</h2>

The size of the pkgsrc repository is about 550MB.  This is about 4 times
larger than anything we have seen before.  But the size of the repository
seems not to be a problem.  Some administrative operations scale linearly
with size, but it really is a linear scaling and while those commands take
a little longer than on other smaller repositories, the slowdown is not
unacceptable.

The pkgsrc repository has history for a little more than 13 years.  This
is not a problem.  The SQLite repository has 10.3 years of history and it
works just fine.

The total size of a checkout in pksrc is about 340MB.  This is not a problem
for Fossil either.   The Fossil repository for the 
[http://www.sqlite.org/slt/timeline | SLT (SQL Logic Test)] test
program for SQLite has a 1.3GB checkout (4x larger than pkgsrc) and the
SLT repository works fine.  Doing a commit can be a little sluggish on SLT
since Fossil verifies the checkin by computing an MD5 checksum over the
entire checkin content (1.3GB) and does so twice from different sources to
verify that everything is consistent.  This can take around 30 seconds or
so, which seems a little sluggish, but is manageable.  If desired, it would
be easy enough to disable the MD5 repository checksum steps as a "setting".
The respository MD5 checksum is part of Fossil repository 
[/doc/trunk/www/selfcheck.wiki | self-checks]
and it would be unfortunate to have to disable them, but it is easily doable
in a fully backwards and forwards compatible way.

The areas where pkgsrc is giving Fossil problems are these:  The pkgsrc
checkout has a huge number of small files (over 60,000) and there are
many, many checkins (190,000 over 13 years, or about 40 checkins per day).
Let's consider the latter issue first.

Fossil (like most VCSes) stores the most recent version of each file as
full text but then stores all prior versions as a delta from the next
version.  Storing deltas rather than full text is a storage space 
optimization.  Without the delta compression, a Fossil repository for
pkgsrc would be about 500GB rather than 500MB.  The deltas move backwards
from newer to older versions of each file, since the assumption is that most
accesses are for recent versions of the file and so you want to be able to
get to the most recent versions using the fewest number of deltas.

Storing backwards deltas works fine for all of the repositories we have seen
before.  In 10 years of history, there have been 8446 checkins to SQLite,
and so the longest delta chain is 8446 links long.  Fossil can burn through
that relatively quickly.  But with pkgsrc, the longest delta chain is 20
times longer - 190,283 links - which is beginning to get very sluggish.

One possible solution would be to have Fossil keep track of the delta chain
length and automatically insert a full-text copy of files after 500 or
1000 deltas, thus limiting the maximum delta chain length.  I read that
Mercurial does this.  It will make the repository a little larger, but
probably not that much.

The bigger problem is the huge number of files in the checkout: over 60K.
When I was designing Fossil, it looked at the Linux kernel and used it
as my upper-bound on the number of files in a project.  The Linux kernel
has about 20K files.  So pkgsrc is about 3x larger than the design load
for Fossil.  Other Fossil repositories typically have on the order of 1000
files per checkout.  SQLite has 873 in a checkout.  The SLT repository,
even though the total size of the checkout is 4 times larger than pkgsrc,
is only 645 files.  (SLT has a small number of very large files whereas
pkgsrc as a large number of very small files.)

The problem with having a huge number of files is that the 
[/doc/tip/www/fileformat.wiki#manifest | "manifest" file]
associated with each checkin becomes huge.
The manifest stores the name and SHA1 hash of every file in the checkout.
For a repository like SQLite, the total manifest size is about 60K.  But
for pkgsrc it is about 5MB, or almost 100 times larger.

Manifests, like everything else, are stored in the repository as a chain
of deltas.  Most checkins only change a handful of files, and so the deltas
are very small and the history of all manifests can be stored in very little
space, despite the large size of each separate manifest.  The problem is
that the manifests still have to be expanded to their full 5MB prior to
being used for anything and it now takes about 100 times longer to apply
each delta (because 100 times more text is being copied) and about 100 times
longer to parse each manifest file.

The proposed solution to large manifests is to create a new control file
type which I will call a sub-manifest and to add a new card type to the
manifest file (the S-card) that links a manifest to a sub-manifest.  A
sub-manifest contains F-cards and S-cards and maybe a Z-card both nothing 
else: no date or check-in comment or signature.  The idea is to split a
large manifest into tree, with the main manifest being the root of the
tree and sub-manifests being branches and leaves.  We might try to limit
the size of each manifest and sub-manifest to about 20 to 50K.  We could
have separate sub-manifest for each directory in the check-out, as Git
does, but there are over 14,000 separate subdirectories in pkgsrc, and
having to visit 14,000 artifacts in order to do the equivalent of 
"fossil list" seems excessive.  A better approach would be for Fossil
to automatically partition the repository into a few hundred sub-manifests
with less than one thousand entries each.

With sub-manifests, most checkins will only modify files in a single
sub-manifest.  If there is a single sublayer (the main manifest pointing
to about 200 sub-manifests of around 300 files each) then a typical
checkin would only involve changing the main manifest (about 10K in size)
and one sub-manifest (about 20K in size).   Thus the size of checkin
overhead drops from 5MB to 30K.

With sub-manifests as described above, it should be possible to apply
the delta chain to reach a desired checkin about 100 times faster (since
there is 100 times less deltaing to do on each step).  And when doing
a "fossil rebuild" there will be about 100 times less manifest parsing
work to do.  

To do a "fossil rebuild" on the current pkgsrc repository using the current
version of Fossil requires about 8 hours of CPU time on my 4-year-old 
workstation.  Most of this time is spent parsing the 190,000 manifests.
If manifest parsing time were reduced by 100 fold, the "fossil rebuild"
time could drop from 8 hours to about 5 minutes.

Adding sub-manifests and the S-card is an extension to Fossil.  It is
fully backwards compatible.  Newer versions of Fossil that use sub-manifests
would still be able to read and write older repositories that have only
flat manifests.  But older versions of Fossil would not be able to read
newer repositories that used sub-manifests.  

<h2>Cloning</h2>

The other main area of performance concern with large repositories is
cloning.  The sync protocol for Fossil is optimized for sharing a small
number of changes between two largely identical repositories.  The
sync protocol will work to synchronize two vastly different repositories,
but it is much less efficient for that case.

The problem with clone is that it is currently implemented to use the
non-optimal case of the sync protocol.  Clone works by first creating
a new empty repository, then syncing it to the clone source.  That
runs ok for smaller repositories such as SQLite.  But it falls down
badly on pkgsrc.

The solution here is to enhance the sync protocol to work better with
the cloning case.  If the client could tell the server "I have nothing,
send me everything", the server could simply start downloading compressed
and deltaed artifacts in linear order.  The client could then install the 
artifacts it its new repository, and then run rebuild at the end to
reconstruct the meta-data.  This process should be almost as fast
as FTP.  Of course, rebuild must still be run at the end, but with
the sub-manifest fix described above, that should only take about 5
minutes or so.

<h2>Summary</h2>

Fossil is able to handle checkouts of the size of pkgsrc without any
problems.  A checkout from the SLT repository is four times larger and it
works fine.  The problem with pkgsrc is the large number of files and
subdirectories in each individual checkout and the huge number of checkins
over the history of the project.  Proposed solutions for these are
described above.

There are other areas of concern that will need to be addressed.  
File browsing on the website is sluggish as each page renders in time
proportional to the total number of distinct filenames in the repository
(which is 100K for pkgsrc versus 2K for the next biggest repository).
And I have not yet tried to run "fossil annotate" on a file with
a really huge change history - there could be undiscovered issues there.
But for the most part, Fossil will be usable for pkgsrc-scale projects
if we can just address the rebuild and cloning issues.
Z dbc59c2a490deb7ac4b9d596cd7932e4