Tracking files across renames

(1) By Richard Hipp (drh) on 2019-11-27 13:25:27

N.B.: Moving this discussion to its own thread. The discussion originated here.

Recap:

I've been pushing for nomenclature to help us talk about the best way to handle tracking files across renames. The Fossil blockchain already records all the necessary information needed to do that. The problem is how to present the information in a useful and intuitive way in the UI.

I have lamented the sometimes ambiguous use of the word "file". Possible meanings of "file" include:

1. A specific sequence of bytes with a unique hash. (An "artifact")
2. All artifacts having the same on-disk filename.
3. All artifacts that derive from a common ancestor or a common "fossil add".

The /finfo page uses definition (2) above, since this allows the file of interest to be uniquely identified by its on-disk filename. For definition (3) you either have to specify the hash of one of the members of the "clan" (one of the set of files that derive from a common ancestor) or you have to specify the filename together with a check-in.
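
To make the three definitions concrete, here is a minimal Python sketch over a toy history. The `Version` record, its fields, and the sample data are invented for illustration; this is not Fossil's schema or code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    """One appearance of a file in a check-in (hypothetical record)."""
    vid: int                # unique id of this appearance
    hash: str               # content hash, i.e. the artifact
    name: str               # on-disk filename in that check-in
    parent: Optional[int]   # vid of the prior version, None if freshly added


versions = [
    Version(1, "aaa", "old.txt", None),  # "fossil add old.txt"
    Version(2, "bbb", "old.txt", 1),     # edit
    Version(3, "bbb", "new.txt", 2),     # rename, content unchanged
    Version(4, "ccc", "old.txt", None),  # unrelated file reuses the name
]
by_vid = {v.vid: v for v in versions}


def root(v: Version) -> int:
    """Follow parent links back to the original add."""
    while v.parent is not None:
        v = by_vid[v.parent]
    return v.vid


by_hash = defaultdict(list)  # definition 1: same bytes
by_name = defaultdict(list)  # definition 2: same on-disk filename
by_clan = defaultdict(list)  # definition 3: same original add
for v in versions:
    by_hash[v.hash].append(v.vid)
    by_name[v.name].append(v.vid)
    by_clan[root(v)].append(v.vid)
```

In this toy data, grouping by name lumps the unrelated version 4 in with versions 1 and 2 and loses version 3 at the rename, while grouping by clan keeps 1, 2, and 3 together. That is the trade-off between definitions (2) and (3).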

I have argued that /finfo should continue to operate according to definition (2) and that for tracking files according to definition (3) a separate URI should be used. I proposed "/claninfo". But that is all open to debate.

An Issue Concerning Definition (3)

Perhaps "artifacts having a common ancestor" is an insufficient definition for "file". Consider the following check-in graph:


    1  ---------- 3 ----- 5 ---- 7 ------- 8 --- 9 --- 10
         \                           /
          `--- 2 ---- 4 ------ 6 ---'

Suppose a new "file" named "xyzzy.txt" is added on both check-ins 5 (the trunk) and 4 (on the branch). So the objects with an on-disk name of "xyzzy.txt" in check-ins 6 and 7 have a common descendant (the object named "xyzzy.txt" in check-in 8) but they do not have a common ancestor. Are they the same file?

If we ask Fossil to show the complete history of the file named "xyzzy.txt" in check-in 10, does it show us the "xyzzy.txt" in both check-ins 6 and 7? The current /finfo page does show both files, but the hypothetical /claninfo page, which uses the "all artifacts with a common ancestor" definition, would not, since the "xyzzy.txt" file in check-in 6 has no common ancestor with the "xyzzy.txt" file in check-in 7.
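
The corner case can be checked mechanically. The sketch below encodes the check-in graph drawn above as a plain parent map (my own encoding, not anything Fossil stores in this form) and confirms that the two copies of "xyzzy.txt" share descendants but no ancestor.

```python
# Parents of each check-in in the graph above; check-in 8 is the merge.
parents = {
    1: [], 2: [1], 3: [1], 4: [2], 5: [3],
    6: [4], 7: [5], 8: [7, 6], 9: [8], 10: [9],
}
# Check-ins that contain a file named xyzzy.txt (added at 4 and at 5).
has_file = {4, 5, 6, 7, 8, 9, 10}


def file_ancestors(checkin):
    """Check-ins at or before `checkin` that contain xyzzy.txt."""
    seen, stack = set(), [checkin]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen & has_file


def file_descendants(checkin):
    """Check-ins at or after `checkin` that contain xyzzy.txt."""
    return {c for c in has_file if checkin in file_ancestors(c)}


common_ancestors = file_ancestors(6) & file_ancestors(7)
common_descendants = file_descendants(6) & file_descendants(7)

print("common ancestors:  ", sorted(common_ancestors))
print("common descendants:", sorted(common_descendants))
```

A by-name listing (definition 2) would simply report every check-in in `has_file`, which is why /finfo shows both 6 and 7.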

Is this a sufficiently obscure corner case that it can be ignored? Or do we need to expand the third definition of "file" to encompass the case where the same "file" was added twice in separate branches which were later merged, yielding a revision graph that is not a tree?

(2) By Andy Bradford (andybradford) on 2019-11-27 16:46:04 in reply to 1

> For definition (3) you  either have to specify the hash  of one of the
> members of  the "clan"  (one of the  set of files  that derive  from a
> common ancestor) or  you have to specify the filename  together with a
> check-in.

Perhaps this should be called  "bloodline"? Then again, perhaps that too
closely mirrors reality.

Thanks,

Andy

(3) By Warren Young (wyoung) on 2019-11-27 19:53:17 in reply to 1

> 1. A specific sequence of bytes with a unique hash. (An "artifact")

That already has a name, as you say.

> 2. All artifacts having the same on-disk filename.

This is why I've been saying that you need to specify the starting point in the DAG, not look files up by name giving possibly many artifact hashes on possibly many branches, some of which might not even have connected lineages. Start with the DAG, not the index mapping file names to hashes.

If you don't specify a starting point in the finfo query, Fossil should use well-understood rules in deciding, much as it does already for other types of conflicts: prefer current branch, use "trunk" when no branch name is available, use latest when two equally good options are available, etc.

> Suppose a new "file" named "xyzzy.txt" is added on both check-ins 5 (the trunk) and 4

Then you have xyzzy.txt@4 and xyzzy.txt@5, and neither has any prior history.

> the object named "xyzzy.txt" in check-in 8

I think finfo should prefer tracing on the same branch when given a choice. If you ask for finfo of xyzzy.txt@10, it should give you xyzzy.txt@10, 9, 8, 7, and 5. It would be good if the finfo report told you that there's another branch with more info about xyzzy.txt between 8 and 7, much as it should report a rename at the point where it happens.

To learn about xyzzy.txt@4, you'd need to start the finfo query at 4 or 6.
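
A rough sketch of the same-branch tracing rule described above. It assumes the first listed parent of a check-in is its primary (same-branch) parent; the `trace` helper and the parent map are illustrative names, not Fossil code.

```python
# Parent map for the example graph; the first parent is the primary one.
parents = {
    1: [], 2: [1], 3: [1], 4: [2], 5: [3],
    6: [4], 7: [5], 8: [7, 6], 9: [8], 10: [9],
}
# Check-ins that contain xyzzy.txt.
has_file = {4, 5, 6, 7, 8, 9, 10}


def trace(start):
    """Walk primary parents from `start` while the file exists.

    Returns (history, notes). Each note is a (checkin, merge_parent) pair
    marking a place where another branch also carries the file, so a report
    could point the reader at it.
    """
    history, notes = [], []
    c = start
    while c is not None and c in has_file:
        history.append(c)
        ps = parents[c]
        for merged in ps[1:]:
            if merged in has_file:
                notes.append((c, merged))
        c = ps[0] if ps else None
    return history, notes


history, notes = trace(10)
print("xyzzy.txt@10 history:", history)
print("other branches seen: ", notes)
```

Starting at 10 gives the trunk lineage plus a note at the merge; starting at 6 gives only the branch lineage back to 4.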

I'm not much concerned with the terminology. Standard genealogical terms should suffice.

(4) By Richard Hipp (drh) on 2019-11-28 13:09:22 in reply to 1

Summary of the situation so far:

We still do not have good nomenclature to distinguish between "file" meaning a set of artifacts with the same on-disk filename and "file" meaning artifacts descending from a common source. A fourth possible definition of "file" is objects with the same on-disk content. It is possible (and not too unusual) to have two or more on-disk files that have the same sequence of bytes. These are represented in Fossil by a single entry in the BLOB table and a single hash, and thus, in some sense, are a single "file". At some point, we should add a documentation page talking about the various possible meanings of the word "file".

Let's ignore the obscure case of a common "file" that is added in two separate branches and then merged together. The merge algorithm does not know how to deal with that case either. The merge reports that the two files lack a common ancestor, so it is not able to resolve any differences between them, so it just imports the version of the file found in the primary parent.

I propose to identify common-ancestor files by adding a new MROOT column to the MLINK table. MLINK.MROOT will be the BLOB.RID value for the initial version of the file. In other words, MLINK.MROOT will be the root of the change graph for the file.

It is unclear at this time whether or not a new index on MLINK.MROOT will be desirable. We can figure that out later. One of the great "stoppable ideas" behind the SQL language is that we can defer that decision with no impact on the code.

In the current implementation, the MLINK.PID field holds a -1 if a file was added to the check-in via merge. That representation makes it difficult to trace the change graph for the file and ought to be changed. When a file is added by merge, the MLINK.PID field should be the BLOB.RID value of the file in the merge parent.

In my investigations of this matter, I find that there are a few other corner cases where the MLINK table is not being computed perfectly. The deficiencies appear to be harmless in the current implementation. Nevertheless, it would be good to clean that all up.

(5) By jvdh (veedeehjay) on 2019-11-28 14:01:39 in reply to 4

> We still do not have good nomenclature to distinguish between "file" meaning a set of artifacts with the same on-disk filename and "file" meaning artifacts descending from a common source. A fourth possible definition of "file" is objects with the same on-disk content. It is possible (and not too unusual) to have two or more on-disk files that have the same sequence of bytes. These are represented in Fossil by a single entry in the BLOB table and a single hash, and thus, in some sense, are a single "file". At some point, we should add a documentation page talking about the various possible meanings of the word "file".

considering the perspective of a potential average user that wants to use fossil just as a VCS and not to bother too much with technical details, it would be good to view the nomenclature issue from his perspective: so what would he easily understand, what not so much?

to me, it seems intuitive (well, I know...) to denote as "file" the entity having been added to the repo at some point in the past and experiencing modifications and possible renames afterwards in the future development -- so the whole finfo history across the renames, if you like. that is your 'meaning artifacts descending from a common source' AFAICS.

from the simple user perspective that is essentially all to know: the file is distinct from its current (or past) filename. the name is a tag attached to the file by which one refers to the file (under its current name).

"file" in the sense of 'a set of artifacts with the same on-disk filename' I do not even quite understand. do you mean "same on-disk filename in different commits"? this seems to be not needed for the backtracking finfo or does it? I also would describe this situation as "these are different files/entities whose name tags are not unambiguous across the full history of those files".

overall I would propose to only speak about files and (file)names... does that make sense?

(6) By Richard Hipp (drh) on 2019-11-28 22:21:50 in reply to 4

> I propose to identify common-ancestor files by adding a new MROOT column to the MLINK table. MLINK.MROOT will be the BLOB.RID value for the initial version of the file. In other words, MLINK.MROOT will be the root of the change graph for the file.

This idea does not work.

The problem is that the BLOB.RID for the initial check-in of the file does not uniquely identify the file. The initial check-in might have been a common template, that is also used as a template to seed many other files in the same project, and in that case, multiple logically independent files would have the same BLOB.RID value.

I think the solution is to expose the rowid of the MLINK table as a new column MLINK.MLINKID which is the INTEGER PRIMARY KEY. Then have the MLINK.MROOT contain the MLINK.MLINKID of the root of the change tree.
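
To illustrate why the BLOB.RID-based root collides and the MLINKID-based root does not, here is a small sqlite3 sketch. The table below is a deliberately simplified stand-in: only `mlinkid` and `mroot` correspond to the proposed columns, while `fname`, `fid`, `mroot_rid`, and the sample rows are assumptions made up for the example and do not match the real MLINK schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE mlink(
        mlinkid   INTEGER PRIMARY KEY, -- proposed: exposed rowid
        fname     TEXT,                -- simplified stand-in for the filename
        fid       INTEGER,             -- BLOB.RID of this version's content
        mroot_rid INTEGER,             -- first idea: RID of the initial version
        mroot     INTEGER              -- revised idea: MLINKID of the root row
    )""")
db.executemany("INSERT INTO mlink VALUES (?,?,?,?,?)", [
    (1, "a.c", 50, 50, 1),  # a.c added as a copy of a template (RID 50)
    (2, "b.c", 50, 50, 2),  # b.c added from the same template, same RID
    (3, "a.c", 51, 50, 1),  # a.c edited
    (4, "b.c", 52, 50, 2),  # b.c edited
])


def clans(column):
    """Group filenames by the given root column."""
    assert column in ("mroot_rid", "mroot")  # guard the interpolated name
    groups = {}
    query = f"SELECT {column}, fname FROM mlink ORDER BY mlinkid"
    for root, fname in db.execute(query):
        groups.setdefault(root, []).append(fname)
    return groups


by_rid = clans("mroot_rid")   # both files collapse into one clan
by_mlinkid = clans("mroot")   # each file keeps its own clan
print(by_rid)
print(by_mlinkid)
```

Because a.c and b.c start from identical content, they share one BLOB.RID, so a RID-based root cannot tell them apart. The row id of the original add is unique per add, which is what makes it usable as a root.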

(11) By anonymous on 2019-12-11 18:47:58 in reply to 6

> The initial check-in might have been a common template, that is also used as a template to seed many other files in the same project

> I think the solution is to expose the rowid of the MLINK table as a new column....

That would still require users to indicate that the new file was renamed from a template file.

On the other hand, nearly all my colleagues just make copies from template files and do not worry about tracing back to specific versions of the templates.

In my own experience, after many attempts to appease the subset of "process lawyers" who did want that traceability, the 2 most reliable ways to do that are to either embed the templates into tools that instantiate files from templates, or store the templates in a document management system that performs some kind of keyword substitution.

(Where I work, we use a "project setup" tool that instantiates a "proto working copy" with, among other things, a template folder containing templates of the various kinds of files we might need, then creates the project repository and commits the proto working copy to that. The templates include a header that identifies the template and its version, as well as places to insert file specific information that developers fill in when they make copies.)

(7) By Florian Balmer (florian.balmer) on 2019-12-01 15:35:46 in reply to 1

Maybe this is implicit, but would this also have diffs spanning across file renames show the changes line by line, instead of full file deletion and new file creation?

It seems that with git, line history can be preserved for combined and split files, but it looks tricky:

(8) By jshoyer on 2019-12-02 23:48:16 in reply to 7

Florian, you are asking about diffs between check-ins, correct? (E.g. generated via a /timeline page by clicking on one circle and then another, in which file renaming currently shows up as ‘full file deletion/new file creation’, usually.)

Or are you talking about a diff between two versions of a file? (Generated via the hypothetical new page analogous to /finfo but with tracking across renames.)

Thanks for those git links! I did not know it was possible to do those things in git and still preserve line history. I track a lot of prose with Fossil and often wish that I could combine two files into one (while preserving line history). Seems related to the corner case, noted above, of merging two versions of a file that do not share a common ancestor. Similarly, I'd like to be able to copy and split files -- I see that a `fossil cp` command was previously proposed (1, 2). Just throwing that out there in case it is relevant to possible changes in the MLINK table etc.

It would be nice if `fossil blame` and `annotate` could also track across file renames. (Perhaps that has been another implicit consideration in the discussion thus far.)

(9) By Florian Balmer (florian.balmer) on 2019-12-03 15:28:11 in reply to 8

> ... diffs between check-ins ... or ... diff between two versions of a file?

This may be handy for either variant -- but it won't work for diffs output in the classic patch format (through the "Patch" hyperlink).

(10) By juef on 2019-12-11 07:14:39 in reply to 1

FYI: similar git-users discussion - Keeping history of file after renaming.