View Ticket
Ticket Hash: 5199df97bc08fff1e5656d165738f0d87f2a3ad1
Title: Alternative to reconstruct for large imports
Status: Closed Type: Feature_Request
Severity: Important Priority:
Subsystem: Resolution: Open
Last Modified: 2010-12-21 13:52:10
Version Found In: aab38ef02f
The current "fossil reconstruct" interface works well for small-scale operation, but having one file per artifact (even spread across multiple directories) adds a lot of overhead.

Attached is a patch implementing a variant called reconstruct-sql, which reads the artifacts from a sqlite3 database. That is a lot easier to deal with, especially if the preprocessing is already done in a higher-level language. Compressing the artifacts also cuts the required disk space considerably.
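The patch itself is only attached, not inlined; as an illustration, an artifact database of the kind described might look like the sketch below. The single-table schema and the function names are hypothetical, not taken from the patch:

```python
import sqlite3
import zlib

def create_artifact_db(path):
    # Hypothetical single-table artifact store; the actual reconstruct-sql
    # patch is attached to the ticket and its schema may differ.
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS artifact("
               "id INTEGER PRIMARY KEY, data BLOB NOT NULL)")
    return db

def put_artifact(db, raw):
    # Store each artifact zlib-compressed to cut down disk space.
    db.execute("INSERT INTO artifact(data) VALUES (?)", (zlib.compress(raw),))

def get_artifact(db, artifact_id):
    row = db.execute("SELECT data FROM artifact WHERE id = ?",
                     (artifact_id,)).fetchone()
    return zlib.decompress(row[0])
```

A hypothetical reconstruct-sql would then walk the artifact table instead of a directory tree, decompressing each row as it goes.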

Related questions are whether explicitly tagging artifacts as manifests, or including the hash or temporal order, could be used to speed up the import. A lot of time is currently still spent in content_put. I haven't analyzed yet whether it is the work required for delta processing or simply the cost of having very large manifests (e.g. a working copy of pkgsrc is 60k files).

<hr /><i>drh added on 2010-10-06 00:06:41:</i><br />
If your import process is able to deal with zlib compression and sqlite, then
why not just create an empty repository (using the "fossil new" command), 
push the files directly into the BLOB table of the repository, then invoke
"fossil rebuild" to process them?
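A rough sketch of that route, under some assumptions: the column names below are recalled from fossil's repository schema of that era, and the content encoding (uncompressed size as a 4-byte big-endian prefix, followed by a zlib stream) matches my reading of fossil's blob_compress(); both should be verified against the fossil version in use before relying on them:

```python
import hashlib
import sqlite3
import struct
import subprocess
import zlib

def fossil_encode(raw):
    # ASSUMPTION: fossil stores blob.content as the uncompressed size
    # (4-byte big-endian) followed by a zlib stream, per blob_compress().
    return struct.pack(">I", len(raw)) + zlib.compress(raw)

def import_artifacts(repo, artifacts):
    # Push raw artifacts straight into the repository's BLOB table, then
    # let "fossil rebuild" recompute all derived tables.
    db = sqlite3.connect(repo)
    for raw in artifacts:
        db.execute(
            "INSERT INTO blob(rcvid, size, uuid, content)"
            " VALUES (1, ?, ?, ?)",
            (len(raw), hashlib.sha1(raw).hexdigest(), fossil_encode(raw)),
        )
    db.commit()
    db.close()
    subprocess.run(["fossil", "rebuild", repo], check=True)
```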

Or, how about an --incremental option to "fossil reconstruct".  With
--incremental, you don't have to populate the import directory with
all of your files all at once.  Start with some subset, import them,
clear out the directory and refill it with new files, import again,
repeat until done.

<hr /><i>anonymous added on 2010-10-06 09:40:24:</i><br />
Well, the rebuild part needs hours as well, so I was wondering if that can be optimized too. At the moment, reconstruct(-sql) is processing artifacts twice (content_put and during rebuild), which is very costly for the manifests. Putting the entries directly into the blob table would remove the first part, but the rebuild would still be crawling.

<hr /><i>anonymous claiming to be Joerg Sonnenberger added on 2010-11-05 14:09:37:</i><br />
I'm trying the new+import directly+rebuild road. One issue is that this triggers the verify loop again in all its slow glory. Can that be made optional and maybe available as a separate full integrity check?

<hr /><i>anonymous claiming to be Joerg Sonnenberger added on 2010-11-05 15:28:52:</i><br />
There is one other issue with using "fossil new". It is currently not possible to inhibit the creation of the "initial empty check-in".

<hr /><i>anonymous claiming to be Joerg Sonnenberger added on 2010-12-21 13:52:10:</i><br />
fossil new + using sqlite3 to override the project code + extracting the initial revision + rebuild --noverify does what I want.