Power loss after a week of conversion from git, how to resume?

(1) By anonymous on 2020-09-25 18:49:57 [link] [source]

Hello!

The FreeBSD project has decided to move from SVN to Git, although there are no concrete plans yet. I wanted to test whether Fossil would be a better fit (from the license and dependencies point of view it definitely is: a single file versus a hundred packages). As I am just a user and have no access to the original repo, I tried converting from Git.

So I did

git clone https://github.com/freebsd/freebsd.git

resulting in

1,6G ./.git

and then ran

git fast-export --all | fossil import --git freebsd.fossil

As it has no understandable progress bar (just a line with the date of an unknown branch) and was slowing down my desktop, I had to periodically pause and resume it with kill -STOP / kill -CONT. Finally, after more than a week, there was a power loss, leaving the files in this state:

-rw-r--r--    1 vadim  wheel   6.7G Sep 23 11:17 freebsd.fossil
-rw-r--r--    1 vadim  wheel   118K Sep 12 19:45 freebsd.fossil-journal

Now, how can I resume the conversion? Or, if I can't, are there options I should start the process with so that it can be resumed if a power loss occurs again? And are there options that give a better understanding of progress?

Please help, so that I can evaluate whether Fossil is suitable for a big open source project like FreeBSD.

WBR, nuclight.

(2) By Richard Hipp (drh) on 2020-09-25 19:57:22 in reply to 1 [source]

Regarding the power loss: I think you need to start over. The SQLite database file will automatically roll back when you reconnect, which means that since your transaction didn't commit, it will start over again at the beginning.
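
As an illustration of that rollback behaviour, here is a minimal Python sketch using the standard sqlite3 module. It is not Fossil's import code: the function name and the file name are made up, the `blob` table only mimics the idea of an artifact table, and closing the connection without committing stands in for the power loss (after a real crash, SQLite reaches the same state by replaying the leftover journal file on the next open).

```python
import os
import sqlite3
import tempfile


def rows_after_interrupted_import(n_rows: int) -> int:
    """Insert n_rows in one uncommitted transaction, drop the connection, then count."""
    path = os.path.join(tempfile.mkdtemp(), "demo.fossil")  # hypothetical file name
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE blob(rid INTEGER PRIMARY KEY, content TEXT)")
    conn.commit()  # the empty schema is the last committed state

    # The "import": many inserts inside a single transaction that never commits.
    conn.executemany(
        "INSERT INTO blob(content) VALUES (?)",
        ((f"artifact {i}",) for i in range(n_rows)),
    )
    conn.close()  # stand-in for the power loss: commit() was never called

    # Reconnecting shows only what had been committed.
    conn = sqlite3.connect(path)
    (count,) = conn.execute("SELECT count(*) FROM blob").fetchone()
    conn.close()
    return count
```

Calling `rows_after_interrupted_import(1000)` returns 0: none of the inserted rows survive, which is why the week of work could not be resumed.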

Fossil as a better SCM than Git for FreeBSD

That's a completely different question. Fossil was designed for SQLite, which is a rather smaller project than FreeBSD. Fossil works amazingly well with SQLite-sized projects. For FreeBSD, maybe not so much, at least not in its current form.

But, we've talked about an enhancement that would make Fossil work much better as a SCM for FreeBSD. The idea is to set up a "remote repository mode" for Fossil, so that it works more like SVN and less like Git. The large multi-gigabyte repository database still lives on the server. Users are not required to clone. Instead, you do a check-out by connecting to the remote repository just like you would with SVN.

Cloning would still be available to those who want a complete copy of the project history. So Fossil would still continue to work like Git in that respect. But it would add the new capability to just do a check-out (including a "slice" checkout of just part of the source tree) for those who just want a copy of the latest code, thus giving Fossil the ability to work without cloning, like SVN. Each user can choose for themselves how they would prefer to work.

But the Fossil developers have only talked about doing this. We believe it is feasible. But we've not undertaken the project because it does not solve a problem that we ourselves are having.

If some other important project (like FreeBSD for example) would like to work with us on this enhancement, to do some testing and help us develop requirements and help debug issues, then we'd probably be willing to undertake the effort to make it happen. If this is of interest to the core FreeBSD developers, have them contact me.

(3) By anonymous on 2020-09-25 20:26:21 in reply to 2 [link] [source]

will automatically roll back when you reconnect, which means that since your transaction didn't commit, it will start over again at the beginning.

Oh, wait, it doesn't commit transaction after every VCS commit?!..

OK, this time I had bad luck, but maybe there are some tweaks for next time? Are there options that would help to resume if a power loss occurs again? Or something to estimate how long it will take?

fossil help import shows some options, --incremental and something about marks, but I don't understand what they are or how they work.

Or maybe some workarounds are possible? Perhaps something like invoking it manually several times, branch by branch? (I would still need help with how to do that.)

If some other important project (like FreeBSD for example) would like to work with us on this enhancement, to do some testing and help us develop requirements and help debug issues, then we'd probably be willing to undertake the effort to make it happen. If this is of interest to the core FreeBSD developers

To be able to propose this to them, I wanted to run some tests first, to have something to show (provided the tests don't immediately reveal something unacceptable, e.g. being slower than SVN). For this I need to convert the repo to Fossil first.

Maybe you have some tips on how to speed up the process?

BTW, the version used is fossil version 2.11 [4df919803b] 2020-05-25

-- WBR, nuclight

(5) By Warren Young (wyoung) on 2020-09-25 20:47:29 in reply to 3 [link] [source]

Oh, wait, it doesn't commit transaction after every VCS commit?!..

For normal commits, yes, but a repo conversion isn't simply a replay of commits from one repo into another, in time-series order. Both Fossil and Git store artifacts as a pile of artifacts in no particular order, Git because it's reliant on filesystem entry ordering, Fossil because it's dependent on SQLite page allocation, Btrees and such.

Therefore, how would incremental conversion work? If you pull over 10% of the artifacts on the first pass, you might have a pile of artifacts that is so chewed up into a lace of disconnected commits and such that almost nothing hangs together.

However, I suspect there's a way around this: incremental deconstruct / reconstruct.

Fossil has these two commands that turn a SQLite DB full of Fossil artifacts into a pile of files and then reconstruct the repo DB from that pile of files. A potential solution then is to somehow create a "git deconstruct" — which may be simply a matter of unpacking Git's packfiles and converting them into Fossil artifact form — and then adding an --incremental option to fossil reconstruct so you can purposefully put off the rebuild step.

Thus:

   $ cd ~/tmp/working-space
   $ git-deconstruct /path/to/repo/.git     # fills ~/tmp/working-space
   $ fossil reconstruct --incremental ~/museum/freebsd.fossil 0*
   $ ... 1*
   $ ... 2*    # etc, thru f*
   $ fossil rebuild ~/museum/freebsd.fossil

That won't work as-is because the globs will exceed the system's command length limits, but a small adjustment with xargs or similar will fix that.
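
The xargs-style adjustment could look roughly like the following Python sketch. It shows only the batching idea: `batch_args` and its byte limit are made up for illustration, and both `git-deconstruct` and `fossil reconstruct --incremental` from the example above are proposals in this thread, not existing commands or options.

```python
from typing import Iterable, Iterator, List


def batch_args(paths: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Group paths into batches whose combined length stays within max_bytes.

    Each path is charged its encoded length plus one byte for the separator,
    which is roughly how xargs keeps a command line under the system limit.
    A single path longer than max_bytes still gets a batch of its own.
    """
    batch: List[str] = []
    used = 0
    for path in paths:
        cost = len(path.encode("utf-8")) + 1
        if batch and used + cost > max_bytes:
            yield batch
            batch, used = [], 0
        batch.append(path)
        used += cost
    if batch:
        yield batch
```

Each yielded batch would then become the argument list of one invocation, instead of a single glob that expands to every artifact file at once.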

The deconstruct and reconstruct steps should be a small multiple of the whole-repo reading and writing time.

You'll end up paying a huge amount of time for a rebuild over FreeBSD's ~25 year history, though. Maybe an incremental rebuild option would be helpful as well.

(6) By anonymous on 2020-09-25 21:15:03 in reply to 5 [link] [source]

Both Fossil and Git store artifacts as a pile of artifacts in no particular order

But why? Commits are not soup, they are a DAG: what if we take a commit with no parents, convert everything it refers to, commit the transaction, take its children, rinse and repeat?

Also, it seems that one big transaction in SQLite progressively slows itself down: over that week, the .fossil file grew by about 3 GB on the first day but only by 100-200 MB per day after a week.

A potential solution then is to somehow create a "git deconstruct"

Then how do the already existing --incremental and marks options of fossil import work? I don't know, so I tried to guess how they might be used branch by branch, and ran this against a fresh repo:

git fast-export --progress 1000 --export-marks=../git.marks master | fossil import --git ../freebsd.fossil

I don't know whether this is correct for the first run, or how marks should be handled on the Fossil side, especially in subsequent commands. Probably something like

for b in `git branch -a`; do ...

Did I miss something?

-- 
WBR, nuclight

(7) By Stephan Beal (stephan) on 2020-09-26 03:08:36 in reply to 6 [link] [source]

But why? Commits are not soup, they are a DAG: what if we take a commit with no parents, convert everything it refers to, commit the transaction, take its children, rinse and repeat?

Commits are, but that's not how the data are stored. They're a pile of completely opaque blobs with no specific ordering, and fossil does not know which (if any) are commits until it reads them. To order them in time sequence it would have to read every one of them in advance and record their order somewhere (ignoring for the moment that there may be no relationship between any given commits, as in the case of multi-root repos, and that timestamps can be modified by amendment records (which may appear before or after their target commit in that pile of opaque blobs)).

Its only feasible option is what it currently does: read and apply each blob in whatever order they are made available. Fossil does not "connect the dots" between versions until it has all of the blobs and can "rebuild" its other tables from that. That rebuild step will be quite painful (time-consuming) for you in such a large tree. As fossil currently works, it is simply not suitable/practical for such large trees.
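
The two-phase approach described here (store opaque blobs first, connect them afterwards) can be sketched in Python as follows. This is a toy model under loud assumptions: the `commit` / `parent <id>` text format and the function names are invented for the example and are not Fossil's artifact format or internals.

```python
import hashlib
from typing import Dict, List


def blob_id(data: bytes) -> str:
    """Content-derived name for a blob (SHA-1 chosen only for the example)."""
    return hashlib.sha1(data).hexdigest()


def ingest(store: Dict[str, bytes], data: bytes) -> str:
    """Phase 1: file the blob away; nothing is parsed or linked yet."""
    key = blob_id(data)
    store[key] = data
    return key


def rebuild(store: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Phase 2: with every blob present, find the commits and their parents."""
    links: Dict[str, List[str]] = {}
    for key, data in store.items():
        lines = data.decode("utf-8", "replace").splitlines()
        if lines and lines[0] == "commit":  # toy marker, not a real format
            links[key] = [
                line.split(" ", 1)[1] for line in lines[1:] if line.startswith("parent ")
            ]
    return links
```

Because `ingest` never looks inside a blob, a child commit can arrive before its parent without causing trouble; the price is that nothing is usable until `rebuild` has walked the whole store.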

(8) By Warren Young (wyoung) on 2020-09-26 23:36:19 in reply to 6 [link] [source]

...take a commit with no parents...

I take that to mean the root of the DAG, yes?

...convert everything it refers to, commit the transaction, take its children, rinse and repeat?

That's what you're missing: Git can't tell you that.

Ironically, Fossil can tell you that, but if you had a Fossil repo to query for this info, you wouldn't need the conversion.

I think you'd have better luck restarting from the FreeBSD Subversion repo, since then you can simply ask the repo for commit r1, then r2, etc. until you get to the last commit.

If you must have a method of converting from Git to Fossil in commit order, depth-first on each branch, then I think you'd have to crawl all of the Git branches from tip back to the DAG root, rebuilding the DAG in RAM, then perform a tree traversal algorithm on that.

...Which isn't a practical solution as-is, since this whole thread started when you had to reboot the system. On restart, you'll not only have to rebuild the RAM DAG, you'll have to reconstruct where in the DAG traversal you stopped.

At that point you want to construct a database of commits so you can record the operation order, where you left off, etc....which begins to look a lot like Fossil!
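
A rough Python sketch of that idea follows, assuming the parent links have already been collected into a dictionary. It swaps in Kahn's algorithm (a breadth-style topological sort) rather than the depth-first walk mentioned above, and the names `topo_order` and `remaining` are hypothetical. The `done` set plays the role of the database of finished commits; after a reboot it would have to be reloaded from disk.

```python
from collections import deque
from typing import Dict, List, Set


def topo_order(parents: Dict[str, List[str]]) -> List[str]:
    """Return commits so that every parent precedes its children.

    `parents` maps each commit id to its parent ids; every parent must
    itself be a key of the mapping.
    """
    children: Dict[str, List[str]] = {c: [] for c in parents}
    pending = {c: len(ps) for c, ps in parents.items()}
    for commit, ps in parents.items():
        for p in ps:
            children[p].append(commit)

    ready = deque(sorted(c for c, n in pending.items() if n == 0))  # the roots
    order: List[str] = []
    while ready:
        commit = ready.popleft()
        order.append(commit)
        for child in sorted(children[commit]):
            pending[child] -= 1
            if pending[child] == 0:  # all parents already emitted
                ready.append(child)
    return order


def remaining(parents: Dict[str, List[str]], done: Set[str]) -> List[str]:
    """Commits still to convert after a restart, in a parent-first order."""
    return [c for c in topo_order(parents) if c not in done]
```

Note that both the DAG and the `done` set have to be available again after a crash, which is exactly the bookkeeping burden described above.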

No, I'll repeat my prior suggestion: It's far simpler to have a mode for fossil reconstruct that lets you blindly shove potentially-disconnected artifacts into the repo, with the rebuild step put off until after all of the artifacts are finally inserted. It means you can't use the repo until the last step completes, but that's the state of the world today.

The biggest downside is that it will take a lot of disk space and be a slower conversion overall. That's the cost of incremental update.

(9) By Lev Serebryakov (blacklion) on 2020-09-28 10:33:09 in reply to 8 [link] [source]

I've tried to import the FreeBSD Subversion repo into Fossil (as I have a full mirror) and it looks impossible to achieve properly.

It looks like Fossil can import only one branches directory, one tags directory and one trunk directory, and if content is copied from outside these three directories, it complains and stops the conversion. This stops the conversion at Revision 3 (the very beginning of the import).

The FreeBSD repository has a very complex structure, not limited to trunk, one place for tags and one place for branches. First of all, we have TWO places for regular branches: releng/ and stable/. But that is the least of the problems: I could rename them to branches/releng/ and branches/stable/ with a simple svn-dump filter.

But it doesn't stop there, as historically many files were copied from some «strange» places like cvs2svn/ (a technical artifact of the CVS to SVN conversion) or vendor/ (the historical way of supporting 3rd party software in FreeBSD), which don't exist in the latest revisions. This is how the repository starts, and fossil import --svn complains "Copy from path outside the import paths" at Revision 3, because Revisions 1 and 2 add files to cvs2svn/ and Revision 3 copies them into head/ (the name for trunk, which is OK). I could rename cvs2svn/ to branches/cvs2svn/, but I'm afraid there will be other such artifacts later. And fossil doesn't report the «invalid» path which breaks the import, so it will be hard to understand what happened at revision, say, 123567 :-)

Another problem: we have users/ and projects/ in the repo, which are places for private branches. They could be renamed to something like branches/private/users/ and branches/private/projects/, but I'm afraid they could cause problems after several hours (days?) of import, too.
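
The renames discussed in this post could be sketched as a small filter over the path headers of an SVN dump stream. This is a minimal Python illustration, not a complete dump filter: `rewrite_header` and the `RENAMES` table are hypothetical, it handles one header line at a time (without its trailing newline), and a real filter would also have to leave file contents alone and deal with paths stored in properties such as svn:mergeinfo.

```python
from typing import Dict

# Header fields of an SVN dump node record that carry repository paths.
PATH_HEADERS = ("Node-path: ", "Node-copyfrom-path: ")

# Old top-level directory -> new location (illustrative mapping only).
RENAMES: Dict[str, str] = {
    "releng": "branches/releng",
    "stable": "branches/stable",
    "cvs2svn": "branches/cvs2svn",
    "users": "branches/private/users",
    "projects": "branches/private/projects",
}


def rewrite_header(line: str, renames: Dict[str, str]) -> str:
    """Rewrite the path in a Node-path / Node-copyfrom-path header line.

    Any other line is returned unchanged.
    """
    for header in PATH_HEADERS:
        if line.startswith(header):
            path = line[len(header):]
            for old, new in renames.items():
                # Match the directory itself or anything beneath it, but not
                # a sibling that merely shares the prefix (e.g. "stable2").
                if path == old or path.startswith(old + "/"):
                    return header + new + path[len(old):]
    return line
```

For example, `rewrite_header("Node-copyfrom-path: cvs2svn/branches/foo", RENAMES)` gives `"Node-copyfrom-path: branches/cvs2svn/branches/foo"`, so a copy source of the kind that stopped the import at Revision 3 would now sit under the branches directory.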

(10.1) By Warren Young (wyoung) on 2020-09-28 14:31:34 edited from 10.0 in reply to 9 [link] [source]

I've tried to import the FreeBSD Subversion repo

Are you "anonymous" above, or do we have a second party to the FreeBSD side of the discussion here now?

it looks impossible to achieve properly.

I'm not seeing impossible, I'm seeing incremental, time consuming, and annoying. Whether that's a distinction without a difference is up to you.

It looks like Fossil can import only one branches directory, one tags directory and one trunk directory

I deeply doubt that, in principle, at least.

Fossil isn't limited to trunk, tags, and branches. It will store as many disconnected DAGs as you like. It's highly unusual to have more than one DAG in a Fossil repo in a project started within Fossil, but even then, it's possible to achieve.

...Unlike Git, by the way, which may throw away disconnected pieces of the tree in a git gc pass. That's not a bug, that's a documented, purposeful behavior. Be sure to check that before completing the move to Git!

I could rename them to branches/releng/ and branches/stable/ with a simple svn-dump filter.

There's also fossil import --svn --rename-branch and friends.

Also, you could dump individual branches and then fossil import --svn --incremental --flat --rename-* them into place.

many files were copied from some «strange» places like cvs2svn/ (a technical artifact of the CVS to SVN conversion) or vendor/

My largest Fossil-based project had a CVS → SVN → Git → Fossil conversion path, and it was started not all that long after FreeBSD. It didn't have as many commits, but I also took advantage of freedoms Subversion gave me that Fossil doesn't directly support. It took a lot of massaging, but I got it done.

The linked mailing list posting gives the scripts for all of this, which you're welcome to study, hack, and repurpose.

Because of the way that web site handles email attachments, it presents the 3 separate scripts all concatenated together. The first one (with the #!/bin/bash shebang line) is the top-level svn2fossil script. The next one (with a #!/bin/sh shebang) is the git-fixups script called by the top-level, and the next one is the fakeauthor hook.

You should also have an authors.txt file, per the Fossil Cookbook section that my scripts are based on.

The advantage of going through Git as an intermediary is that it has more powerful tools to massage the repo.

My scripts are made to be run incrementally, avoiding rework wherever possible. It should be possible to iterate your way toward a complete conversion. You might be able to restart from the existing FreeBSD Git repo instead of re-converting, but that's pure speculation, since I haven't looked at either repo in any detail.

fossil doesn't report the «invalid» path which breaks the import

Easily patched. Fossil advances only to the extent that its users contribute their desired features and fixes.

users/ and projects/ in the repo, which are places for private branches

You don't want to use the term "private branch" here: that means something quite different to Git and Fossil than what you mean in this case.

Instead, simply say that they're "branches". Any nuance above that is outside Fossil or Git, at the human level.

I'm afraid they could cause problems after several hours (days?) of import, too.

I don't see how. This very software project, Fossil, has many personal branches in the public repo.

(11.1) By Lev Serebryakov (blacklion) on 2020-09-29 12:00:20 edited from 11.0 in reply to 10.1 [link] [source]

I'm a second person, helping "anonymous" above :-)

It looks like I managed to filter ("massage") the svn dump so that it no longer has copy sources that are not imported.

One thing bothers me: it is now about 1/3 done and the fossil file is about 190GB (yes, it is GB!). The svn repo is 4.2GB and the incomplete git conversion (as far as I know there is no complete git conversion; it is a work in progress) is 2.4GB. The text dump itself is 121GB.

I need to stop the process, as my SSD is almost full, and restart it with a (much slower) NAS as the target.

UPDATE:

Converting the FreeBSD repo to git is a very complex task, which dedicated project members have been working on for more than half a year now, and there is still no perfect ("final", "ultimate") conversion, only partial ones, which are OK for the main and recent history but not fully complete for the very beginning of the project.

(12) By Warren Young (wyoung) on 2020-09-29 23:38:49 in reply to 11.1 [link] [source]

the fossil file is about 190GB (yes, it is GB!). The svn repo is 4.2GB

This is another result of the fact that Fossil doesn't import from Git in any particular order: it can't efficiently do delta compression until all of the artifacts are present.

It could try to do delta compression, but it would either not give much advantage or you'd end up re-compressing the whole tree again and again, making it smaller at each step, at the cost of an O(n²) sort of algorithm.
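
The O(n²) cost mentioned here can be made concrete with a small counting model in Python. It is only arithmetic under a simplifying assumption (every re-compression pass visits each artifact imported so far exactly once), and `recompression_work` is a made-up name, not anything Fossil does.

```python
def recompression_work(n_artifacts: int, batch_size: int) -> int:
    """Count artifact visits if the whole store is re-compressed after every batch."""
    work = 0
    present = 0
    while present < n_artifacts:
        present = min(present + batch_size, n_artifacts)
        work += present  # one full pass over everything imported so far
    return work
```

With a batch size of 1, importing 100 artifacts costs 1 + 2 + ... + 100 = 5050 visits and importing 200 costs 20100, so doubling the input roughly quadruples the work. Compressing once at the end, `recompression_work(100, 100)`, costs only 100.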

(13) By Lev Serebryakov (blacklion) on 2020-09-30 01:05:15 in reply to 12 [link] [source]

But I'm importing from svn, where the order is determined! :) And I still need, what, 100 times more temporary space?

(14) By Lev Serebryakov (blacklion) on 2020-10-01 12:46:40 in reply to 12 [link] [source]

OK, the conversion consumed the 121GB dump, produced a 514GB (!) SQLite3 file and FAILED in the "Vacuuming" stage:

SQLITE_FULL(13): statement aborts at 9: [INSERT INTO vacuum_db.'blob' SELECT*FROM"repository".'blob'] database or disk is full
SQL: INSERT INTO vacuum_db.'blob' SELECT*FROM"repository".'blob'
SQL: SELECT'INSERT INTO vacuum_db.'||quote(name)||' SELECT*FROM"repository".'||quote(name)FROM vacuum_db.sqlite_schema WHERE type='table'AND coalesce(rootpage,1)>0
SQL: VACUUM
SQLITE_FULL(13): statement aborts at 1: [VACUUM] database or disk is full
SQL: VACUUM
Database error: database or disk is full: {VACUUM}

The disk has another 5TB (!) free. It is ZFS.

Also, removing the database was painful. It took 2 full days to produce it.

(4) By Dan Shearer (danshearer) on 2020-09-25 20:35:44 in reply to 2 [link] [source]

Richard Hipp (drh) said on 2020-09-25 19:57:22

Fossil works amazingly well with SQLite-sized projects. For FreeBSD, maybe not so much, at least not in its current form [...] The idea is to set up a "remote repository mode" for Fossil, so that it works more like SVN and less like Git. The large multi-gigabyte repository database still lives on the server.

I think this is well worth pursuing if a larger project commits to using and testing it.

It is a common misconception that git magically keeps scaling. It doesn't, even if you're willing to do truly giant clones. That is why GVFS was created, which brings exactly this architecture to git. It is still only available for Windows, and even that seemingly is not kept up to date in the current public repo.

If Fossil supports this architecture it will probably be going where some SCM needs to go anyway, and no maintained free SCM currently does.

Dan