Fossil User Forum

Is it possible to reduce the size of a fossil repository?

(1) By anonymous on 2019-12-10 09:48:19 [link] [source]

I am using fossil-2.10 and I have just noticed that 'fossil new' creates a repository file that is 230KB in size. If my memory serves me well, a few years ago a new fossil repository used to be several times smaller, around 50KB. I understand that fossil keeps evolving and gaining new features that influence the size of the repository (the built-in forum, the switch to longer hashes, bundled web GUI themes, etc.).

But this is my use case: I prefer to keep per-application settings in their own fossil repositories, so I have many small ones. Most of them hold no more than 10-20 linear commits. Almost none of them have over 100 commits. Some of them have a couple of branches for different app versions or different hosts. I almost never use tickets/wiki/forum/web GUI theming etc.; I only interact with those repos from the command line, via "fossil commit --branch" to record a new revision and "fossil checkout revisionID" to retrieve a snapshot.

Are there any ways to minimize the size of such fossil repositories? Or is this just the price I have to pay for the option of being able to use the aforementioned additional features in the future, should I need them? Is my wish just an irrelevant end-user caprice in the age of terabyte hard drives? Thanks.

(2) By Stephan Beal (stephan) on 2019-12-10 10:14:40 in reply to 1 [link] [source]

You can try:

fossil sqlite3
sqlite> vacuum;<ENTER>
sqlite> .quit

or:

fossil rebuild --compress repo.fossil

(Noting that that will have no useful effect on a new repo.)

If those don't shrink a repo, then it's as small as it'll ever get. A new db on my system is 229k. The dbstat command says it's using 56 pages of 4k each (224k), with no free pages, meaning that 224k of that is unavoidably used for data storage and the other 5k are sqlite3-internal.
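A minimal sketch of how those page figures could be checked from the shell. It assumes a stock sqlite3 command-line client is on the PATH; the helper names `pages_to_kib` and `report` and the file name `repo.fossil` are illustrative only and not part of Fossil.

```shell
#!/bin/sh
# Convert a page count and a page size (in bytes) into KiB of page storage.
pages_to_kib() {
  echo $(( $1 * $2 / 1024 ))
}

# Print page statistics for one repository file.
# Assumption: the sqlite3 CLI is installed and prints one pragma value per line.
report() {
  repo=$1
  set -- $(sqlite3 "$repo" 'PRAGMA page_count; PRAGMA page_size; PRAGMA freelist_count;')
  echo "pages=$1 page_size=$2 free_pages=$3 kib=$(pages_to_kib "$1" "$2")"
}

# The figures quoted above: 56 pages of 4096 bytes each.
pages_to_kib 56 4096
```

Calling `report repo.fossil` on a real repository would print the live values instead of the quoted ones.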

Looking at the config table of the new db, it only contains small data, not a copy of the skin, so it seems highly unlikely that you'll be able to shave any more off of that 229k.

(3) By Stephan Beal (stephan) on 2019-12-10 10:21:58 in reply to 1 [link] [source]

PS:

Is my wish just an irrelevant end-user caprice in the age of terabyte hard drives?

Not caprice, really, but 220kb is practically microscopic by modern standards. That's smaller than some CSS files hosted by large websites. Just to randomly pick a site, imdb.com sends me 3.46MB of data when i open the page (1.3MB compressed). My most-visited site (BoardGameGeek) sends 3MB (1MB compressed), not including ads.

In any case, judging by the output of dbstat, you're not going to get a repo db smaller than about 230kb.

(4) By jvdh (veedeehjay) on 2019-12-10 10:29:21 in reply to 2 [link] [source]

still, the question of the OP seems valid: it is true that new empty repos used to be around 50kB. now they are 4-5 times larger. what parts of the schema are, exactly responsible for this increase?

it sure is not a really relevant issue but I sympathise with the OP here. I've done a bit of statistics on my repos: there are currently 58 of them around. median size 408 kB, i.e. 50% of the repos are smaller than that number (and thus are noticeably affected by the 230 kB "offset"). otoh, mean size is 13300 kB (demonstrating that there are some large repos...). so on average the overhead of possible excessive initial repo size is irrelevant. but for the use case of the OP it obviously matters. a bit ... ;)

(5) By Stephan Beal (stephan) on 2019-12-10 10:50:35 in reply to 4 [link] [source]

still, the question of the OP seems valid: it is true that new empty repos used to be around 50kB. now they are 4-5 times larger. what parts of the schema are, exactly responsible for this increase?

If we dump the schema to plain text, it's only 5-6kb:

echo .schema | fossil sqlite3 -R therepo.fossil > foo

However, fossil UI offers more insight:

fossil ui therepo.fossil

Then visit: http://localhost:8080/repo-tabsize

That will show exactly how much space is used by each table. On a new repo, the majority of tables each take up about 4-5% of that space and there's nothing in there which is specific to the forum features. The mlink table is the largest, taking up 9% (roughly 20kb), even though it currently holds no data other than its schema text and the (opaque) sqlite-related internals. By dropping all 4 mlink indexes and running vacuum i was able to shave off 5kb from that. It is conceivably possible to gain more space by dropping more indexes, but fossil may (or may not) re-add those at will.

... so on average the overhead of possible excessive initial repo size is irrelevant...

"Excessive" would be overstating it a bit. It's larger than it used to be (IIRC, sqlite3 used to have a smaller default block size?), but it's very far from excessive. We can think of that 230kb as the cost of entry for all of the core sqlite3-related infrastructure, not the least of which is its durability in the face of disasters like power outages in mid-commit.

(6.1) By Richard Hipp (drh) on 2019-12-10 11:09:51 edited from 6.0 in reply to 1 [link] [source]

Deleted

(7) By Richard Hipp (drh) on 2019-12-10 11:17:18 in reply to 1 [source]

The underlying repository is an SQLite database. Each table and each index in the database takes up a minimum of one page, even if it holds no content. When a fresh repository is first created, it contains 53 tables and indexes. The default page size is 4096 bytes.

If you want, you could try:

$ fossil sql new-repo.fossil
sqlite> PRAGMA page_size=512;
sqlite> VACUUM;
sqlite> .q

That seems to reduce the empty repository size from 230K down to 35K. But it might make routine repository operations a little slower. Though, if you only have tiny repos, the speed probably doesn't matter.
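As a rough back-of-the-envelope check of these numbers, the sketch below computes a lower bound on the size of an empty repository for several page sizes. It assumes that each of the 53 tables and indexes occupies exactly one page, which ignores the objects that need more than one; the function name `min_size` is illustrative only.

```shell
#!/bin/sh
# Lower bound on the size of an empty repository: one page per table or index.
# Assumption: each of the 53 objects occupies exactly one page.
objects=53

min_size() {
  # $1 = page size in bytes; prints the minimum file size in bytes
  echo $(( objects * $1 ))
}

for ps in 512 1024 4096 8192; do
  echo "page_size=$ps -> at least $(min_size "$ps") bytes"
done
```

At 4096 bytes per page the bound is 217088 bytes (212 KiB), close to the 230K observed. At 512 bytes it is 27136 bytes, somewhat below the 35K observed, which is consistent with some objects needing more than one small page.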

For my public-facing repos, I typically set the page size to 8192 as that typically gives a reduced size once the repository starts to collect a lot of content, and perhaps marginally better performance.

(8) By marc on 2019-12-10 16:24:19 in reply to 7 [link] [source]

Interesting.

Given that this seems to be useful to some (including Richard), perhaps the page size setting (and concomitant vacuum, run automatically) could be exposed as a setting?

(9) By anonymous on 2019-12-10 16:48:15 in reply to 7 [link] [source]

Thank you very much for the insight. As I understand, I can even use the sqlite client directly and in batch mode like this:

sqlite3 repo.fossil 'PRAGMA page_size=512; VACUUM;'

on both new/empty repositories and repositories that already have content, as explained here:

https://www.sqlite.org/pgszchng2016.html

And I can wrap this inside a directory-tree-walking routine when I need to mass-compress my bunch of tiny repositories.
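One way such a tree-walking routine might look is sketched below. It makes several assumptions: the repositories use a `.fossil` extension, the sqlite3 command-line client is on the PATH, and no Fossil process has the files open while it runs. The function name `shrink_repos` and the `DRY_RUN` variable are made up for this example.

```shell
#!/bin/sh
# Sketch: re-page every *.fossil file found under a directory tree.
# Assumptions: repositories end in .fossil, sqlite3 is on the PATH,
# and file names contain no newlines.
shrink_repos() {
  # $1 = root directory, $2 = page size in bytes (default 512)
  root=$1
  pagesize=${2:-512}
  find "$root" -type f -name '*.fossil' | while IFS= read -r repo; do
    if [ -n "$DRY_RUN" ]; then
      # Only report what would be done; touch nothing.
      echo "would shrink: $repo"
    else
      sqlite3 "$repo" "PRAGMA page_size=$pagesize; VACUUM;"
    fi
  done
}
```

Running `DRY_RUN=1 shrink_repos ~/repos` first lists the files that would be touched. Note that, per the SQLite documentation, VACUUM only changes the page size when the database is not in WAL journal mode.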

(10) By Richard Hipp (drh) on 2019-12-10 17:23:00 in reply to 8 [link] [source]

Not a setting, but you can do:

 fossil rebuild --pagesize 512

Or similar, for whatever pagesize you want.

(11) By marc on 2019-12-10 17:30:46 in reply to 10 [link] [source]

Oh, perfect. Thanks Richard.

(Didn't realise this existed as the above examples were using fossil sql.)

(12) By Warren Young (wyetr) on 2019-12-10 17:34:40 in reply to 9 [link] [source]

No need for a script:

    $ fossil all rebuild --pagesize 512

(13) By Warren Young (wyetr) on 2019-12-10 18:01:51 in reply to 4 [link] [source]

new empty repos used to be around 50kB

Curious, I decided to graph Fossil repo size vs Fossil major version, here.

It looks like there's only been one big jump, in 1.35, over three years ago now. It's not clear to me from the ChangeLog why there should be such a jump there.

Other than that, it's all incremental change.

I used this script to gather the data:

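# Build each tagged release and record the size of a freshly created repo.
for v in $(fossil tag list | grep version- | cut -f2 -d-)
do
  fossil clean            # delete unmanaged files left by the previous build
  fossil up version-$v    # switch the checkout to the tagged release
  ./configure
  # On a successful build, create an empty repo and append "version,bytes".
  # Note: stat --format is GNU coreutils syntax.
  make -j11 &&
    rm -f /tmp/x &&
    ./fossil init /tmp/x &&
    echo $v,$(stat --format='%s' /tmp/x) >> /tmp/fossil-sizes.csv
done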

I've done a bit of statistics on my repos

Keep in mind that the 229 kB number and my stats above are nearly 100% overhead, a degenerate case. When you start putting actual data into a repo, the percentage of overhead decreases.

Put another way, the nearly 4 kB of slack in those default-sized SQLite pages fills up eventually. When the second page is created, overhead is under 50% by definition. 33% with the third page, etc.

For substantial repos, page size overhead should be down in the single digits.
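The arithmetic behind those percentages can be sketched as follows. It uses the same simplification as the paragraph above, namely at most one page's worth of unused space spread over n pages; the function name `slack_percent` is illustrative only.

```shell
#!/bin/sh
# Worst-case slack as a whole-number percentage, assuming at most one
# page's worth of unused space among n pages (integer division).
slack_percent() {
  echo $(( 100 / $1 ))
}

for n in 1 2 3 10 100; do
  echo "$n pages -> up to $(slack_percent "$n")% slack"
done
```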

(14) By Warren Young (wyetr) on 2019-12-10 18:56:17 in reply to 13 [link] [source]

For substantial repos, page size overhead should be down in the single digits.

It turns out, we don't have to guess. There's a tool that comes with SQLite called sqlite3_analyzer. (Say "make sqlite3_analyzer" at the top level of a checkout of its Fossil repo.)

When run on a freshly-rebuilt and vacuumed clone of fossil-scm.org/fossil, we get this data:

Unused bytes on all pages......................... 3710881      6.0%  (BLOB)
Unused bytes on all pages......................... 20720        2.6%  (DELTA)
Unused bytes on all pages......................... 56139        1.8%  (EVENT)
Unused bytes on all pages......................... 62538        1.4%  (MLINK)
Unused bytes on all pages......................... 33295        4.2%  (PLINK)
Unused bytes on all pages......................... 37167        1.6%  (TAGXREF)
Unused bytes on all pages......................... 291939      15.8%  (TICKET)

That's the relevant rows only from the voluminous output, and it's pre-filtered to remove stats on tables amounting to less than 1% of the total DB size, since a 99% overhead in a nearly unused table taking a page or two in the DB is just noise.

I have no idea why the TICKET table is so big and also has so much overhead even after a rebuild and vacuum.

Other than that, it's as I said: for tables holding substantial amounts of data, overhead is in the single digits.

(15) By Martin Gagnon (mgagnon) on 2019-12-10 19:06:52 in reply to 13 [link] [source]

It's not clear to me from the ChangeLog why there should be such a jump there.

I think it’s because the default page size changed in the SQLite version built into that release. Fossil 1.35 shipped with SQLite 3.13, and the default page size changed from 1024 to 4096 in SQLite 3.12.

See: https://www.sqlite.org/releaselog/3_12_2.html

(16) By TomaszDrozdz on 2019-12-11 08:48:14 in reply to 1 [link] [source]

How about not having a separate Fossil repo for each application,
but one repo with a separate branch (branches that have no common ancestor) for each application?

(17) By Warren Young (wyoung) on 2019-12-11 17:30:31 in reply to 15 [link] [source]

There's an easy solution to the problem, then, isn't there? Have fossil init use a 512 byte page size, then send the pragma for 4k at the end of the init sequence so that all future pages get efficient packing.