Enormous repo size after conversion from svn
(1) By Lev Serebryakov (blacklion) on 2020-10-11 13:44:17 [source]
I've finished converting the FreeBSD svn repo to Fossil. The results are discouraging: the repo size is 295G (316494716928 bytes) after vacuum, and it took 3 days to create.
The original svn repo size is 4.2G.
The svn → git repo (not a full one, but close enough) size is 2.4G.
The svn → git (2.4G) → fossil repo size is 5.2G (but it took 6+ days to convert).
There are a lot of branches of different types; all of them were renamed to "branches/x-y-z", and the same was done with tags ("tags/x-y-z"). There are 1059 branches and tags in total.
What did I do wrong?
I don't think I could upload a 295G repo to a public place, but I can run any experiments, SQL queries, etc., on it to diagnose this situation.
(2) By Richard Hipp (drh) on 2020-10-11 16:25:00 in reply to 1 [link] [source]
Please try running:
fossil rebuild --compress-only
Let us know if that helps.
(3.1) By Lev Serebryakov (blacklion) on 2020-10-11 18:20:28 edited from 3.0 in reply to 2 [link] [source]
I think it will take several days (such a large file cannot be placed on an SSD, so I need to use spinning rust for it), but I'll try.
BTW, I've tried to compress this file: xz -9 shows a compression ratio of about 0.89 (I stopped the process after an hour, as it is useless with such low compression).
(4) By Andreas Kupries (aku) on 2020-10-11 18:22:45 in reply to 3.0 [link] [source]
Can you tell us the exact commands you used to convert from svn to fossil?
Right now my best idea of the problem is that the conversion somehow caused fossil to not perform history / delta compression. IOW normally fossil saves only changes of files from commit to commit. And here it seems that it has stored the full data for each unique version of a file.
Richard's advice to try fossil rebuild --compress-only is an attempt to get such delta compression into the repository after the fact, with fossil essentially searching the sea of blobs for pairs worthy of being delta-compressed.
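To make the difference between storing full text and storing deltas concrete, here is a minimal Python sketch. It is only an illustration: the copy/insert delta format, the 8-byte per-operation cost, and the synthetic 200-line file are assumptions made up for this example, and none of it is Fossil's actual delta encoding.

```python
import difflib
import hashlib
import zlib


def make_delta(old_lines, new_lines):
    """Encode new_lines as copy/insert ops against old_lines (toy format, not Fossil's)."""
    ops = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))              # reuse old_lines[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("insert", new_lines[j1:j2]))  # literal new text
        # "delete" emits nothing: the old lines are simply not copied
    return ops


def apply_delta(old_lines, ops):
    """Rebuild the new version from the old version plus the delta."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(old_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


def delta_size(ops):
    """Rough byte cost: literal text plus an assumed 8 bytes of bookkeeping per op."""
    literal = sum(len("".join(op[1]).encode()) for op in ops if op[0] == "insert")
    return literal + 8 * len(ops)


# Ten versions of a synthetic 200-line file; each version changes exactly one line.
base = [hashlib.sha256(str(i).encode()).hexdigest() + "\n" for i in range(200)]
versions = [base]
for n in range(1, 10):
    nxt = list(versions[-1])
    nxt[n * 7] = f"changed in version {n}\n"
    versions.append(nxt)

# Option A: every version stored as compressed full text.
full_total = sum(len(zlib.compress("".join(v).encode())) for v in versions)
# Option B: first version stored in full, every later version as a delta.
delta_total = len(zlib.compress("".join(base).encode())) + sum(
    delta_size(make_delta(a, b)) for a, b in zip(versions, versions[1:])
)
print(f"full text: {full_total} bytes, base + deltas: {delta_total} bytes")
```

If an import ends up in the situation of option A for most file versions, the repository grows with the full size of every revision rather than with the size of each change.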
(5.2) By Lev Serebryakov (blacklion) on 2020-10-11 18:54:30 edited from 5.1 in reply to 4 [link] [source]
Can you tell us the exact commands you used to convert from svn to fossil ?
# ~lev/bin/svnrebase `cat ~lev/fsl/svn2fsl.rename` < ~lev/fsl/svn.dump | fossil --svn --tags tags --branches branches --trunk head --no-vacuum /usr/home/fossil/freebsd-src.fossil
# echo "vacuum;" | SQLITE_TMPDIR=/usr/home/fossil sqlite3 /usr/home/fossil/freebsd-src.fossil
~lev/bin/svnrebase is this script.
~lev/fsl/svn2fsl.rename is the list of all branches and tags to rename into two flat namespaces (branches and tags). You can find it here.
I think I made mistakes with some branches in user, but I don't want to retry the import without them, as it would take a long time again…
You can see the structure of the repo on the web, but it doesn't contain many technical and historical branches.
EDIT: the svnweb interface contains all branches, but it is almost impossible to find them via the limited web interface if you don't know the exact revision ranges when these branches were alive.
(6) By Lev Serebryakov (blacklion) on 2020-10-12 14:17:53 in reply to 4 [link] [source]
It doesn't help at all.
(7) By Warren Young (wyoung) on 2020-10-12 15:20:59 in reply to 6 [link] [source]
Doesn't this result imply that the artifacts are improperly linked? How else can you explain such an utter failure of delta compression to take effect?
Lev, does the resulting repository actually work? I mean, if you check out the tip version of trunk, do you get something close to what's currently published via Subversion? If you then go back a few weeks (e.g., fossil up 2020-09-28) does that match up with what was published via Subversion on that date?
Does switching branches work?
Would you please post the output of fossil dbstat for that repo?
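Since a Fossil repository is an SQLite database, the same question can also be approached with a direct query. The sketch below is hedged: it assumes the repository has a blob table (with rid, uuid, size, and content columns) and a delta table (with rid and srcid columns), which matches my understanding of the Fossil schema but should be checked against the actual file. The function name and the demo rows are hypothetical.

```python
import sqlite3


def largest_full_text_blobs(conn, limit=10):
    """Return (rid, uuid, size, stored_bytes) for the largest blobs with no delta entry."""
    query = """
        SELECT b.rid, b.uuid, b.size, length(b.content) AS stored_bytes
          FROM blob AS b
         WHERE NOT EXISTS (SELECT 1 FROM delta AS d WHERE d.rid = b.rid)
         ORDER BY stored_bytes DESC
         LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


# Demo on an in-memory database that mimics the two assumed tables.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE blob(rid INTEGER PRIMARY KEY, uuid TEXT, size INTEGER, content BLOB);
    CREATE TABLE delta(rid INTEGER PRIMARY KEY, srcid INTEGER);
    INSERT INTO blob VALUES (1, 'aaa', 5000, zeroblob(4000));
    INSERT INTO blob VALUES (2, 'bbb', 5000, zeroblob(40));
    INSERT INTO blob VALUES (3, 'ccc', 9000, zeroblob(8000));
    INSERT INTO delta VALUES (2, 1);  -- blob 2 is stored as a delta against blob 1
""")
rows = largest_full_text_blobs(demo)
for rid, uuid, size, stored_bytes in rows:
    print(rid, uuid, size, stored_bytes)
```

Against the real repository one would pass a read-only connection, for example sqlite3.connect("file:/usr/home/fossil/freebsd-src.fossil?mode=ro", uri=True), and look at which artifacts dominate the list.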
(8) By Lev Serebryakov (blacklion) on 2020-10-13 15:37:57 in reply to 7 [link] [source]
Basic tests with tip, history, and the main branches are OK. It looks like some user/ and/or projects/ related branches are messed up, but it is hard to check all of them...
I'll try to convert without the user/ and projects/ parts of the repo. It is impossible to convert without the vendor/ part (which is problematic too), as there are many copies from the vendor/ namespace, and the conversion breaks without it.
Here are stats for this enormous database:
repository-size: 316,263,800,832 bytes
artifact-count: 1,794,583 (stored as 832,070 full text and 962,513 deltas)
artifact-sizes: 710,198 average, 441,681,550 max, 1,274,508,749,607 total
compression-ratio: 40:10
check-ins: 193,592
files: 8,785,006 across all branches
wiki-pages: 0 (0 changes)
tickets: 0 (0 changes)
events: 0
tag-changes: 273
latest-change: 2020-09-25 20:39:20 - about 17 days ago
project-age: 9,985 days or approximately 27.34 years.
project-id: b2822505ddaa434a93f33b577d79f84a1a675526
schema-version: 2015-01-24
fossil-version: 2020-08-20 13:27:04 [b98ce23d4f] [2.12.1] (clang-10.0.0 (git@github.com:llvm/llvm-project.git llvmorg-10.0.0-0-gd32170dbd5b))
sqlite-version: 2020-08-14 13:23:32 [fca8dc8b57] (3.33.0)
database-stats: 77,212,842 pages, 4096 bytes/pg, 0 free pages, UTF-8, delete mode
Here are stats for database converted via git:
> sudo fossil dbstat -R freebsd-src-fromgit.fossil
repository-size: 4,689,629,184 bytes
artifact-count: 2,195,815 (stored as 241,765 full text and 1,954,050 deltas)
artifact-sizes: 1,519,385 average, 32,512,964 max, 3,336,288,096,370 total
compression-ratio: 711:1
check-ins: 783,119
files: 828,119 across all branches
wiki-pages: 0 (0 changes)
tickets: 0 (0 changes)
events: 0
tag-changes: 0
latest-change: 2020-10-05 19:26:54 - about 7 days ago
project-age: 9,985 days or approximately 27.34 years.
project-id: 025ed3a97aa991bdd4c2bfb7ee7d691e8e119ca5
schema-version: 2015-01-24
fossil-version: 2020-08-20 13:27:04 [b98ce23d4f] [2.12.1] (clang-10.0.0 (git@github.com:llvm/llvm-project.git llvmorg-10.0.0-0-gd32170dbd5b))
sqlite-version: 2020-08-14 13:23:32 [fca8dc8b57] (3.33.0)
database-stats: 1,144,929 pages, 4096 bytes/pg, 0 free pages, UTF-8, delete mode
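The two reports can be compared with a little arithmetic. The sketch below assumes that the compression-ratio line is the total artifact size divided by the repository size; that is an inference from the numbers above, not something stated in the thread. The variable and function names are made up for this example.

```python
def compression_ratio(total_artifact_bytes, repository_bytes):
    """Uncompressed artifact bytes per byte of repository file (assumed definition)."""
    return total_artifact_bytes / repository_bytes


def full_text_share(full_text_count, artifact_count):
    """Fraction of artifacts stored as full text rather than as deltas."""
    return full_text_count / artifact_count


# Figures copied from the two dbstat reports.
svn_ratio = compression_ratio(1_274_508_749_607, 316_263_800_832)
git_ratio = compression_ratio(3_336_288_096_370, 4_689_629_184)
svn_full = full_text_share(832_070, 1_794_583)
git_full = full_text_share(241_765, 2_195_815)

print(f"direct svn import: ratio {svn_ratio:.2f}, full-text share {svn_full:.1%}")
print(f"import via git:    ratio {git_ratio:.2f}, full-text share {git_full:.1%}")
```

Under that assumption the direct import comes out at roughly 4:1 (reported as 40:10) with about 46% of artifacts stored as full text, while the via-git import comes out at roughly 711:1 with about 11% stored as full text.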
(9) By Warren Young (wyoung) on 2020-10-13 16:15:52 in reply to 8 [link] [source]
The 4x difference in commits, the vast difference in compression ratio, and the vast differences in full-text file artifacts vs delta artifacts tell me the second repo is far more likely to be complete and sane. Something's badly wrong with the first, if they're meant to represent the same history.
If your objection to the second one is its larger size (roughly 2x the Git repo), that is because Fossil stores more metadata than Git does, which lets it provide more features (e.g., prevention of detached-head state).
(10) By poetnerd on 2020-10-13 16:39:34 in reply to 9 [link] [source]
It does seem odd that the enormous repo says it has about 8.8 million files, while the via-git conversion says it has about 828 thousand files.
Why would one have an order of magnitude more files present?
(11) By Andreas Kupries (aku) on 2020-10-13 17:02:42 in reply to 10 [link] [source]
AFAIK svn handles branches via a virtual file system kind of thing where files and directories are virtually copied ?!
So what if there are 10 branches and the import somehow treats these all as distinct files (per their paths)? That alone would multiply the file count roughly tenfold.
Note: Just throwing out a (less than half-baked) idea in the hope it sparks recognition in somebody here.
I wonder if that also blows up the commit manifests ?
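A small Python sketch of that idea, counting the same file once per full svn path versus once per path relative to its branch. The layout names (head, branches, tags) are taken from Lev's import command; the function split_svn_path and the example paths are hypothetical, and this is not how the Fossil importer is actually implemented.

```python
def split_svn_path(path, trunk="head", branches="branches", tags="tags"):
    """Split an svn path into (branch name, path relative to that branch)."""
    parts = path.strip("/").split("/")
    if parts[0] == trunk:
        return "trunk", "/".join(parts[1:])
    if parts[0] in (branches, tags) and len(parts) >= 2:
        return f"{parts[0]}/{parts[1]}", "/".join(parts[2:])
    # Anything outside the known layout stays unclassified.
    return None, "/".join(parts)


# Hypothetical paths: one file seen on trunk, on a branch, and under a tag.
paths = [
    "head/sys/kern/kern_exec.c",
    "branches/stable-12/sys/kern/kern_exec.c",
    "tags/release-12.1/sys/kern/kern_exec.c",
]

# Treating every full path as a distinct file triples the count here...
by_full_path = set(paths)
# ...while mapping each path into its branch leaves a single file name.
by_relative_path = {split_svn_path(p)[1] for p in paths}

print(len(by_full_path), "files by full path")
print(len(by_relative_path), "file by branch-relative path")
```

If an importer fails to recognize some directories as branches, their contents land in the first bucket, and every check-in manifest has to list all of those paths.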
(12) By Scott Robison (sdr) on 2020-10-13 19:00:10 in reply to 11 [link] [source]
When I first tried to port an svn repo to fossil (knowing virtually nothing about fossil or DVCS, just wanting to try it), it did try to import every tag / branch / etc. in the virtual file system into every commit. Essentially, there is a mismatch between the svn idea of monolithic repos, where multiple subprojects, tags, and branches are stored in a namespaced file system, and the fossil / DVCS approach.
I wound up using svn tools to slice and dice my repo into separate pieces so that I could import the various trees as multiple fossils. My understanding is that some effort has been expended on improving this since I tried, but the recommendations by the svn project are not hard and fast rules, so unless one religiously follows the svn recommendations the importer might still have a hard time differentiating pieces.