Is it possible to see the entire history of a renamed file?

(1) By Marcos Cruz (programandala.net) on 2020-10-11 15:43:39 [link]

Is there any command to see the history of a file beyond renames, similar to `git log --follow <file>`?

I have made some tries, renaming files and commiting them with and without changes, just in case, but it seems no connection is preserved between the new files and the old ones.


```
fossil mv --hard <old_file> <new_file>
fossil ci -m "Rename ..."
```

I get the following:

`fossil timeline -v <check-in>` outputs "DELETED <old_file>", but does not mention the renaming.

`fossil finfo <old_file>` outputs "no history for file".

`fossil annotate <old_file>` outputs "unable to retrieve content of artifact #0".

`fossil finfo <new_file>` outputs the basic info of its single check-in, but no mention about the renaming.

`fossil annotate <new_file>` outputs the current contents of the file, as if it were created in the renaming check-in.

I have found no mention about this subject in the documentation or the forum.

Thank you in advance.

(2) By Richard Hipp (drh) on 2020-10-11 16:19:59 in reply to 1 [link]

The underlying data structures of Fossil allow for this, but there is no
such command.  Part of the reason why not is that it is difficult to know
how to identify the file you have in mind, if the file is changing names.
Other commands (or web pages) for looking at information about a file use
the filename to identify the file.  But if the filename is dynamic, which name
do you use?  What if multiple independent files were all called README.md
at various (distinct) times during the history of the project.  If you want
the history of README.md, how do identify which one.  It could be that multiple
independent (unrelated) files named README.md are current at this point in time,
though on different branches.  Again, how do you specify which one you mean.

You might be thinking "the file I am interested in has a name that was never
used by any other file so the ambiguity doesn't come up."  They might be true
in your case, or even in most cases, but the code that implements this still
has to deal with the case where there is ambiguity.

It seems to me that you have to identify an instance of the file you want
to track either by its SHA3 hash or by name+timestamp+branch or by
name+checkin.  It starts to get complicated.  And I've never had a pressing
need to do it myself (as I rarely rename files in a project) so I've never
taken the time to implement it.

Do you have patches?

(3) By jshoyer on 2020-10-11 22:45:53 in reply to 2 [link]

A little over a year ago
there were two discussion threads
on terminology
for tracking across file renames
([96cd01cef6 | 1],
 [10850fb10f | 2]),
and there was little agreement on what was best.


There was interest in drawing distinctions among
 1) a sequence of bytes with a unique hash
i.e. an artifact,
 2) the set of all artifacts in a repository
that have a given file-path in at least one check-in
(i.e. what a finfo page returns
 when you do do not specify a check-in to work backwards from)
and
  3) the set of all artifacts that share a common ancestor
(where that ancestor was an artifact checked in
 at a specific file-path at a specific time).

(4) By jvdh (veedeehjay) on 2020-10-12 08:41:46 in reply to 2 [link]

to answer the last question first: no, I don't :).

regarding the usefulness of the requested functionality, I would second that it really would be valuable. it seems one of the few features which I continue to miss in comparison to mercurial, say (the other one was grep for some years but that is now solved). renames do happen rarely but if they do (for whatever reason) the missing ability to cross the rename boundaries is annoying.

regarding ambiguities and corner cases I can imagine that they exist regarding how the user needs to specify his intention but the task itself is not ambiguous, right? I always wondered (given that the rename information actually is tracked by fossil) why timeline/finfo (or more important in my view: grep!) do not work across renames.

so naive question: would it not suffice (as a first step anyway) if the most obvious use case were handled: the specified file name refers to CURRENT version. full stop. that seems your "name+checkin" variant of making the specification unambiguous but silently using CURRENT as the checkin. I guess that this essentially would solve 99.9% of the users wishes in this respect.

it seems that in this case no new syntax/commands in the UI are needed and fossil "only" would have to secretly expand the given name to "name+CURRENT" and proceed accordingly. am I missing something obvious here?

(5) By Richard Hipp (drh) on 2020-10-12 13:42:00 in reply to 4 [link]

> would it not suffice ... if ... the specified file name refers to CURRENT version?

For the web interface (which is where this feature would be most useful)
there is not normally a "CURRENT" version.

(6) By jvdh (veedeehjay) on 2020-10-12 14:36:43 in reply to 5 [link]

personally, I miss that feature primarily in the CLI since that's my preferred (and, I believe, the "canonical") UI. so it sure is a matter of personal taste whether the feature is considered more useful in the CLI or the GUI.

and if it could more easily be done for the CLI and only for this special name+CURRENT case, it would help a lot I believe. not only for viewing the full history/timeline but more importantly (I think) in the context of "where did this strangely named variable creep into the code" or similar, i.e. when grepping for some patterns in the code.

(8) By Marcos Cruz (programandala.net) on 2020-10-12 18:36:17 in reply to 4 [link]

> would it not suffice (...) if the most obvious use case were handled: the specified file name refers to CURRENT version.

That is the only use case I sometimes missed.

(7) By Marcos Cruz (programandala.net) on 2020-10-12 18:29:17 in reply to 2 [link]

Thank you for the detailed explanation. I'm aware of the difficulty.

The reason I missed this feature is I'm working on the conversion of a website of mine from one static site generator to another, both written by me years ago. Since I'm changing the hierarchy of many of the hundred of pages, which partially depends on the file names, e.g. "language.section.year.month.day.fs", it's not uncommon that many source pages are renamed two or more times during the process of converting the contents from one system to the other.

Anyway, I note the history of renamings in the page sources, in case something gets wrong.

> What if multiple independent files were all called README.md at various
> (distinct) times during the history of the project. If you want the history of
> README.md, how do identify which one.

Of course, a file would have to be identified by a name and a checkin as you say.

(9) By anonymous on 2020-10-13 09:55:42 in reply to 2 [link]

Related to this post about renamed files, can Fossil show the equivalent of `SVN Blame` across file renames or copies?

Just yesterday I used Perforce's P4V Time-Lapse view of file that was refactored out of a single large file  
split into many smaller files (and the original large file removed). I.e. the ancestor of all newly added files  
is a single file, deleted in the same commit. To figure out when a bug was added (before or after the refactoring).  
And was pleasantly surprised to see all commits, even those from the original large file (or the one before, since  
the large file was *moved* earlier too, so that code had 3 different file names total in fact). All I had to do was  
a `p4 copy` when refactoring, instead of plain FS copies and SCM adds.

Does Fossil do that too? Just curiosity from a lurker. Thanks.

(10) By jshoyer on 2020-10-13 10:43:42 in reply to 9 [link]

No, the `fossil blame` command/page
annotates lines back to the check-in in which the file was renamed.
See for example [https://fossil-scm.org/home/blame?filename=www/permutedindex.html&checkin=855578b61091c49d24a891b8b2db4f92705842794e3e3292a1b0700efac2a21c]

If you check the 'log' box
you'll see the annotation goes back to check-in 63c2c732,
the same check-in you'll see at the very bottom
of [fossil:/finfo?name=www/permutedindex.html]

There is a `fossil mv` command
but currently no `fossil copy`.

(11) By anonymous on 2020-10-13 12:24:59 in reply to 10 [link]

Thanks. I appreciate the details.

Sounds like Fossil has ways to go still then :)

I'm someone which tends to move files and code around quite a bit.  
So the ability to track code through these moves and copies is pretty  
essential to how I work, which I get might be usual to some / most.

As Richard said, the underlying data-model probably supports such use cases,  
but the UI / *chrome* does not I guess. (modulo `move` vs `copy` perhaps?  
that might require a data-model change?).

Which is not surprising for a niche volunteer-based SCM.  

I wonder if GIT does better there now, given all the Fossil-vs-Git talk here.

(12) By Stephan Beal (stephan) on 2020-10-13 12:47:20 in reply to 11 [link]

> Sounds like Fossil has ways to go still then :)
>
> I'm someone which tends to move files and code around quite a bit.
So the ability to track code through these moves and copies is pretty
essential to how I work, which I get might be usual to some / most.

Fossil's low-level SCM metadata format is file-based, not chunk-of-code based. git is somehow able to say "this chunk of code was moved from file X to file Y," but fossil cannot. It would hypothetically be possible for fossil to attempt to calculate and track that when a change is made, but it would not be a simple thing to do. It would necessarily be stored outside of the SCM metadata, and need to be calculated when a commit is made or sync'd to a new repo, and recalculated each time a repo is rebuilt. The latter could add considerable time to a rebuild, in particular for repos with lots of files.

Patches to that effect would be thoughtfully considered.

> As Richard said, the underlying data-model probably supports such use cases,
but the UI / chrome does not I guess. (modulo move vs copy perhaps?
that might require a data-model change?).

The data model supports tracking the *file-level* renames (moves), but knows *nothing at all* about copying files, nor anything about individual chunks of files. In its view, any two artifacts with the same content hash are the same content, and that's the closest it can come to recognizing a "copy" operation (noting that copying a file and changing a single byte before committing it changes its hash, so it would not be a copy in fossil's eyes).

It does track file renames, and that information is locked forever in the SCM metadata, but automatically determining how to allow the user to follow the path back through renames is nowhere near as trivial as it may initially sound, as Richard detailed.

(13) By anonymous on 2020-10-13 13:56:56 in reply to 12 [link]

I think you're getting off-track. Perforce does not track *code chunks*.

No SCM does that, that I know of. But diff'ing two revisions of the same
file is no different conceptually from diff'ing two arbitrary blobs.

Modern SCMs store file contents as blobs, indexed by checksums, and delta-compress
related blobs to save space, as an optimization. But they store metadata about that
content separately, in folders/files/revisions/etc... Lets call them metadata nodes.

To track moves and copies, you only need to record a *directional* relationship between
two metadata nodes, and when displaying history, or computing *blame*, you use those links
to traverse the whole *history* for that content blob. Going in the other direction is not
possible, as a single metadata-node could be split into several new metadata-nodes (7 in my case!),
but when starting from the *leaf* node, it's a *linear* traversal (ignoring merges, that is).

So coming back on diff'ing blobs, across a rename or copy, the blobs are not arbitrary,
they is a tangible user-expressed relationship recorded in the metadata, and the *blame*
and *diff* commands can follow those (optionally), just like P4V Time-Lapse view does.
It's hard to convey how feature-rich that view is, when you've never seen it...

At least that's my mental model :). I know nothing of Fossil internals, obviously!
But I'm fairly sure Fossil could do the above, if it wanted to.

How could Fossil be less capable than a dinosaur like Perforce, after all! ;)

(14) By Richard Hipp (drh) on 2020-10-13 14:38:30 in reply to 11 [link]

> Sounds like Fossil has ways to go still then :)

Not funny.

If you want to advance the state of the world, then it falls to you to
help with that task.  Complaining to others that they have not done enough
does nothing to move us forward.  Be a problem solver, not a grumbler.

Here are three positive things that you could do to help begin moving us
forward:

  1.  Create a username for yourself so that you are no longer identified
      as "anonymous".  Discussions are more productive when all parties know
      with whom they are talking.

  2.  Come up with a word that describes the closure of all artifacts across
      both file renames and edits.  We cannot use "file" for this since
      the word "file" mean all artifacts that have the same filename.  So
      when you rename a file, it becomes a different file by definition of
      "file".  The word "file" is the closure of artifacts across edits.
      I need an alternative word that means almost the same thing
      as "file", but which is the closure across both edits and renames.
      One reason why Fossil does not have the greatest support for following
      changes across renames is that we do not have a good word for talking
      about the concept.  So you could help simply by coming up with a
      good term.

  3.  Suppose the word you choose to describe the concept in point 2 above
      is "x-file".  (Not a good choice, but I have nothing better at the
      moment.)  I need a way to uniquely identify x-files.  We currently have
      a way to uniquely identify artifacts (their SHA3 hash).  And we currently
      have a way to uniquely identify files (their pathname).  But there is
      no good way to uniquely identify an x-file.  One possible naming
      convention would be a filename and hash pair.  This filename/hash pair
      would mean the
      closure that is all edits and renames of that particular filename and
      SHA3 hash.  But such a name is not unique, because there are many possible
      filenames/hash pairs for the same x-file.  We need a name that is both
      unique and invariant over time. What do you propose?

(15) By jvdh (veedeehjay) on 2020-10-13 15:05:22 in reply to 14 [link]

regarding file vs. 'x-file' I personally do not believe that the semantic discrimination would be helpful to the user. if one insists, one might use some approximate synonym of "file" in the conventional sense, e.g. "document", for  denoting your "x-file". but I rather would prefer not to enforce this distinction since from the user side "the file" is what you call x-file here, mostly: the present document and all it's prior revisions/renames. 

regarding 'uniquely identify x-file': I believe one only needs an unambiguous identifier and "filename+hash" (especially "filename+CURRENT" or similar, i.e. the combination of a filename and a checkin hash) seems to achieve that. or what am I missing? the users intention always( or only usually?) will be to execute some action on the past history of the considered document/x-file (grep, blame, finfo etc.) as it presents in a certain revision (most of the time the CURRENT one).

looking at the mercurial UI, it seems to make nothing special of it and does what I believe is sensible: call the x-file/document simply "the file", e.g. in its "grep" help page: "use the --follow argument to also follow the specified
FILEs across renames and copies." seems sufficient to me...

(16) By Richard Hipp (drh) on 2020-10-13 15:23:23 in reply to 15 [link]

> regarding file vs. 'x-file' I personally do not believe that the semantic discrimination would be helpful to the user.

Whether or not the term is essential to the user is debatable, but also
irrelevant.  The distinction *is* essential to whatever poor soul has to 
implement and maintain the code.  So, until we come up with terminology
that does distinguish between file and x-file, the code won't be implemented
or maintained.

(17) By jvdh (veedeehjay) on 2020-10-13 15:40:19 in reply to 16 [link]

I am aware of the latter aspect and I don't take the asked for functionality being implemented for granted (just would be happy if fossil would acquire that capability).

regarding "debatable, but irrelevant": that seems not to be your typical attitude w.r.t. the fossil user base ;). it sure is not irrelevant for the average user (i.e. people who do not want to concern themselves with internals of the used VCS, but just want to *use* the tool for their work), whether they would be expected to make avoidable (not to say unnecessary) semantic distinctions.

so if you/the developers need an identifier for 'x-file' as a discriminator against what seemingly fossil calls a "file" internally, I would say just pick any (including "document"): it is just an internal -- ultimately arbitrary -- terminology, right?

so would "document equals file(name)+checkin hash" work? and could/should the help pages not better simply follow mercurial in talking of "file and its past renames" or similar? I reiterate that the users view/meaning of "file" probably is "file content plus attached filename(s) past and present" and that talking about and enforcing the "file vs. document" distinction might just cause avoidable confusion. at least that would be my guess.

(18) By Warren Young (wyoung) on 2020-10-13 15:47:14 in reply to 16 [link]

Instead of trying to rename the "file" concept, how about you take a design cue from Unix and associate an "inode" with it when extracting metadata from a manifest artifact?

Indeed, I think you already have that. Isn't that `repository.filename.fnid`?

You've given examples in the past that amount to "is *this* file `foo` at the tip of the branch the same as *that* file `foo` from 2018" but that's because it is trying to treat all files `foo` as the same. The actual fact is, we know at each commit step in the repo whether 2018's `foo` is the same as tip's `foo` because we can trace its history through time via the optional "oldname" field in the F cards. If the two don't share parentage, they aren't the same file.

Likewise simpler renames: we can reasonably link 2017's file `bar` to tip's `foo` through such a chain.

Reconstructing this DAG probably isn't trivial during rebuild. I think you'll have to add more columns to the `filename` table to be able to query the changes across time. Updating this metadata on each commit should be straightforward once you've worked out what you need during rebuild.

Having done that, the SQL query behind something like `fossil finfo foo` first does a lookup on `filename.name` for "foo" as of the current commit ID, then uses the resulting `fnid` to trace that file through history.

(19) By Warren Young (wyoung) on 2020-10-13 16:04:05 in reply to 18 [link]

Proposed addition to the rebuild algorithm:

1. After the commit DAG is available during rebuild, scan the tips of all branches for file artifacts, assigning each one an `fnid` arbitrarily. (Autoincrement integer.)

2. Recurse down these branches until you get to the root. At each step, record the artifact ID for each F card in a new table. ("fnhist"?) For F cards without an "oldname" value, also record the commit ID and the current `fnid` for that `name`. For renamed files, look up the `fnid` by the child's name and record that instead. The resulting tuple is `{fnid, commit, artifact}`.

Now you have a secondary table that ties tables `filename` and `artifact`, with which you can trace the history of a given file.

On commit, you simply update the 3-tuple above. For non-changed files, do nothing. For changed files that aren't renamed, look up the current `fnid` then record the new commit ID and file artifact ID for that `fnid`. For renamed files, look up the old `fnid` and associate that with the new commit ID and artifact ID.

This should solve the `fossil finfo foo` case, at the least.

(20) By Richard Hipp (drh) on 2020-10-13 16:11:02 in reply to 18 [link]

OK.  I will accept "inode" as word that describes for the closure of a file across
all edits and renames.  A Fossil-inode is not exactly the same as a
unix filesystem inode, but the concept is close enough, I think.  I will
take "inode" as a working name for the concept, pending better suggestions.

Note, however, that repository.file.fnid is *not* an inode, as that value
changes when the file name changes, and thus does not track files across
renames.  It is unclear, yet, whether or not
I need a separate "inode" table in the schema in order to track inodes,
or a new column in the "mlink" table to record the inode number, and so forth.
If such schema changes are needed, that will likely mean that a
"fossil rebuild" is required when upgrading, which in turn means
that nothing will land on trunk until after 2.13 is released and we start
the next development cycle.  It might be sufficient to compute the members
of an inode on an as-needed basis, by walking the existing mlink table,
so that no schema changes are required.

(21) By jshoyer on 2020-10-13 16:43:29 in reply to 20 [link]

I suggest ‘artifact lineage’,
to emphasize that such a set
sometimes spans multiple file-path names.
Referring to ‘ancestor/ancestral artifact lineages’
vs. ‘descendant artifact lineages’
might be useful
to distinguish the case
when a user is interested
in looking forward in time
vs. backwards in time.
Being able to easily find descendants of a check-in
is one of the great features of Fossil.

I am not familiar with the details
of how unix filesystem inodes work,
but I could get used to that word for this concept.
I [8d3effe5d7 | previously]
pointed out that the word ‘clade’
is a concise way
to refer to things that share a common ancestor.
That word may be too uncommon
to use in this file-o-genetic context though.

Thanks for your attention to precise terminology.

(22) By Warren Young (wyoung) on 2020-10-13 16:53:13 in reply to 21 [link]

> I suggest ‘artifact lineage’

There are more types of artifacts in Fossil than file artifacts, and I can't see any need to extend this concept to them as well.

However, "lineage" alone might be useful. Thus "`fossil finfo foo`" becomes "Look up the lineage of the file currently called 'foo' in the checked-out version, yielding a lineage ID, then trace that ID back through history."

> how unix filesystem inodes work

Assuming you do know how Fossil works at the commit DAG level, the third paragraph [here][1] should look familiar.

> I could get used to that word for this concept.

I'm not wild about it. I used it more as an example than a proposal. Indeed, most of what I've written up-thread is really about inode *numbers* rather than inodes proper.

I like "lineage" better.

> this file-o-genetic context

🤦‍♂️ (Yes, I see what you [perpetrated][2] there.)


[1]: https://en.wikipedia.org/wiki/Unix_filesystem#Principles
[2]: https://en.wikipedia.org/wiki/Phylogenetics

(26) By Marcos Cruz (programandala.net) on 2020-10-14 01:20:36 in reply to 22 [link]

> I like "lineage" better.

I think it's clearer than "inode", which already has a specific meaning.

(30) By Richard Hipp (drh) on 2020-10-14 12:01:29 in reply to 26 [link]

## Clan

One syllable, and thus easy to pronounce.  Four letters, takes up little
space in the source code text.  Means roughly "having a common ancestor" which
is a good definition for the closure of file artifacts across edits and renames.

(31.1) By Warren Young (wyoung) on 2020-10-14 12:52:50 edited from 31.0 in reply to 30 [link]

"Clan" is a collective noun for multiple distinct related individuals. It implies the named file and all of those in the same directory and subdirectories. (Uncles, cousins, grandparents...)

A "lineage" is a single individual's evolutionary history, making it a felicitously apt analogy for what we're speaking of here in the context of Fossil.

**EDIT:** If you want a four-letter word, then [line, sense 18](https://en.wiktionary.org/wiki/line).

(33) By Martin Gagnon (mgagnon) on 2020-10-14 13:22:20 in reply to 31.1 [link]

What about **trail** ? 5 letters, not bad.

As in file trail. Similar to path but without interfering with the other meaning.

(35.2) By jamsek on 2020-10-14 14:58:37 edited from 35.1 in reply to 31.1 [link]

I think *lineage* is clearly the most apt term; it's literally made for  
this feature.

If it has to be a four-letter, single syllable word, although *line* is,  
again, almost perfectly apt, it might cause confusion. In which case,  
*trac* (TRace Ancestral Check-ins) might be better.

ETA: can't count.

(36) By anonymous on 2020-10-14 15:10:18 in reply to 35.0 [link]

I like that one: file‘s **track**

(37) By jshoyer on 2020-10-14 19:54:21 in reply to 35.2 [link]

A ‘trac’ backronym
would invite confusion
with Richard's venerable CVSTrac system
(www.cvstrac.org)
and the similar Trac project
that it inspired.

‘Clan’,
when operationally defined,
sounds like a good word
to use in implementing the proposed feature.
The fact that is a collective noun
might be a feature rather than a defect.
It may be useful
that the word is distinct
from the concepts
of ancestral lineages
and descendent lineages
(which of course can be subsets of a clan).
The word ‘lineage’
could be used in some documentation
even if it is not in the source code.

I am not immediately finding any problematic connotations
associated with the word ‘clan’
(when spelled out),
though the Wikipedia page does note
that “conceptual confusion arises from colloquial usages of the term”
(and I am not an anthropologist).

I was curious
if the words ‘clan’ and ‘clade’
share a direct etymological root.
It appears not,
as the Oxford English Dictionary
says that clade came the Greek κλάδος (branch)
whereas clan came from the Latin <i>planta</i>
via Gaelic.
So both words ultimately relate to botanical metaphors.

(38) By Warren Young (wyoung) on 2020-10-14 20:27:00 in reply to 37 [link]

Clade would also work without causing conflicts with lay definitions or software terms of art.

(32) By G. David Butler (gdb) on 2020-10-14 13:20:47 in reply to 30 [link]

It always amazes me how useful an unambiguous, in the context of a domain, term helps focus discussion and free minds to solve a problem.

I like it!

(34) By anonymous on 2020-10-14 14:47:36 in reply to 30 [link]

Can we summarize why there's a need for a new concept?

Is it not just the file name history that is requested?

Currently, the history is basically the manifest's history, which operates on the level of the whole file-tree. Now we need to supplement this with history of file names that the tree comprises.

(24) By Kees Nuyt (knu) on 2020-10-13 23:28:09 in reply to 20 [link]

> I will
take "inode" as a working name for the concept, pending better suggestions.
>
> Note, however, that repository.file.fnid is *not* an inode, as that value
changes when the file name changes, and thus does not track files across
renames.  It is unclear, yet, whether or not I need a separate "inode" table in the schema in order to track inodes, or a new column in the "mlink" table to record the inode number, and so forth.

Suggestion: instead of inode, call it **track** or **trace**, because it identifies (is the identity of) the whole track/trace/path the 'x-file' walked in the DAG, irrespective of any renames or changes of content. Personally, I prefer **track** over **trace**.

We cannot use **path**, that would be confusing.

I thought of calling x-file an **object**, but that does not describe the fact that x-file has a history and (often) a future.

If I understand correctly, the concept of file in fossil tends to signify the state of an managed object in terms of contents and path/filename. It has a history (and a future), but it does not identify that path trough history.

Hence this suggestion.

~~~
-- 
Regards,
Kees Nuyt

(27) By John Rouillard (rouilj) on 2020-10-14 01:42:56 in reply to 24 [link]

A file **track** seems like a good name. If fossil also included
a copy operation, you could implement the following to refactor file1
into file1a and file1b:

~~~~ pikchr

file "file1" fit
  move
F2:    file "file1a" fit;
  down; move 0.5cm;
F3:    file "file1b" fit
    
 arrow from first file.e "copied" above to F2.w;
 arrow from first file.e "moved" below aligned to F3.w;
~~~~

Then the **history track** of file1a and file1b would include file1.
File1 itself would be removed as a result of the move.

This would allow lines in file1a or file1b to be tracked back through
file1 and any ancestors of file1.

(65) By anonymous on 2020-10-21 06:23:10 in reply to 14 [link]

I'd like to suggest "file anchor".  The term comes from the Anchor Modelling technique, where Entities whose attributes and relations change through time are tracked.  The identity of the Entities is called the "anchor".

(66) By Warren Young (wyoung) on 2020-10-21 06:29:30 in reply to 65 [link]

The issue's settled already: [the chosen term is "clade"](fossil:/info/8c598d7232c227e1).

(70) By sam atman (mnemnion) on 2020-10-27 19:07:04 in reply to 14 [link]

> Come up with a word that describes the closure of all artifacts across both file renames and edits. We cannot use "file" for this since the word "file" mean all artifacts that have the same filename. So when you rename a file, it becomes a different file by definition of "file". The word "file" is the closure of artifacts across edits. I need an alternative word that means almost the same thing as "file", but which is the closure across both edits and renames. One reason why Fossil does not have the greatest support for following changes across renames is that we do not have a good word for talking about the concept. So you could help simply by coming up with a good term.

I'd consider document a "good enough" name for this. If I move a file, or change its name, I still think of it as the same "document".  Not all such files are documents, of course, but enough are that talking about them shouldn't be confusing. 

For 3, the tuple (original name, first checkin hash) should be "good enough". The hash will be unique in the great majority of cases, since it's pretty typical to check in a file with some contents, while the combination should be always unique.  Although I can imagine a somewhat far-fetched circumstance where someone checks in, let's say, `README.txt`, as an empty file, then moves it to the documentation folder, and checks in another empty `README.txt`. 

In which case, the tuple (original name, first hash, first timestamp) is assuredly unique. So that might be the best choice here.

(23.1) By Richard Hipp (drh) on 2020-10-14 12:06:26 edited from 23.0 in reply to 1 [link]

## Filename vs. File Clan

Some definitions:

  *  **file artifact** &rarr;  A single artifact (identified by its
     unique hash) that represents the content of a file in one or
     more check-ins.

  *  **filename** &rarr;  The name of a file in one or more check-ins
     together with all file artifacts that can be called by that name
     in any check-in.

  *  **file clan** &rarr;  (Or simply a "clan".) The closure of all file
     artifacts that
     derive from one another, either by an edit or by a rename or by
     both.

Curiosities:

  *  The same file artifact can appear in two or more filenames, if
     a project happens to have two more more files in its source tree
     that have identical content.  For example, in the latest check-in of
     Fossil, the two files skins/bliz/details.txt and
     skins/blitz_no_logo/details.txt have identical content and thus
     are represented by the same file artifact
     ([1260bc47](fossil:/artifact/1260bc47)) but have different names.

  *  Sometimes a single filename can be composed of two or more
     clans.  An example of this in Fossil is the
     ["configure" source file](fossil:/finfo?name=configure).  That
     file is not currently part of the tree, but it has been part of
     the tree at three distinct times during the history of Fossil.
     Hence, the "configure" filename contains three different
     clans.  If you select to see a history graph of the
     filename "configure" you see all three clans as disjoint
     subgraphs.  (That is
     what happens now.)  If some future version of Fossil gives you the
     ability to see all the different file artifacts of a clan,
     then you'll only see one of those clans of "configure",
     depending on which one you ask for.

  *  Sometimes a single clan can be composed of two or more
     filenames.  An example of this in the Fossil repository is the
     two files [www/permutedindex.wiki](fossil:/finfo/www/permutedindex.wiki)
     and [www/permutedindex.html](fossil:/finfo/www/permutedindex.html).
     The file was originally called by the "*.wiki* name but was renamed
     to the "*.html" name at some point.  All of the various editions of
     these two filenames combine to form a single clan.

More complicated arrangements are certainly possible, but I don't have
any example of those at the moment.

## Pending Issues

I still do not have a good way to assign a canonical name to a
clan.  I can give them arbitrary numbers, but those numbers will
vary from one repository to another.  And arbitrary numbers would not be
meaningful to humans.

When requesting (for example) to see all of the file artifacts of a
clan, one could identify the clan by specifying one of its members.
That is, you could specify a filename and a specific check-in.  The combination
of filename and check-in does precisely identify a clan.  However, the name
is not unique, as there are many other filename/check-in combinations that
could describe the same clan.

Perhaps the canonical name of a clan could be the filename and first
check-in in which the clan appears?

*Edit: "inode" &rarr; "clan"*

(25.1) By Scott Robison (sdr) on 2020-10-14 00:24:35 edited from 25.0 in reply to 23.0 [link]

Since some concern has been expressed about "inode" due to its established meaning, perhaps a variation that is less entrenched but still embodies the concept would work. Instead of "index node" it could be a "history node" or a "path node" or a "trace node" or even a "fossil node" (hnode / pnode / tnode / fnode).

Or "nodule".

Just thoughts. Maybe worthless. You get what you pay for. Etc.

(29) By Warren Young (wyoung) on 2020-10-14 03:46:18 in reply to 23.0 [link]

> fossil inode 

To clarify, I'm advocating for "lineage" [in the evolution sense][1] rather than in the looser lay sense. It is a perfect match to "fossils" and "artifacts". We can then say things like "show me the lineage of the file currently called foo".

I believe jshoyer is using the term in this same sense, given the *nature* of his pun. (Ahahaha!)

> the "configure" source file. That file is not currently part of the tree

Sure it is. If it weren't, we'd need some sort of "bootstrap" script to fetch it before you could build from a fresh checkout.

Moreover, if `configure` were missing from tip of trunk, I'd expect the current implementation of "`fossil finfo configure`" to not give results, yet it does. In fact, it gives more detail than the equivalent `/finfo?name=configure` call.

I poked into the gaps shown in the `/finfo` timeline, and it appears that these *other* `configure` files are from Autoconf, so they represent different *lineages*. The improved `fossil finfo` we're discussing here wouldn't show them at all, since you're implicitly giving it a commit ID to work with. `/finfo` would need another parameter (probably "ci") to make the same distinction: `/finfo?name=configure&ci=trunk` should show only the lineage of `configure` as traced back from tip-of-trunk.

How does `fossil finfo` currently trace the history anyway? I ask not as a mere curiosity but because it seems to have bearing on this thread's core problem.

> Sometimes a single fossil inode can be composed of two or more filenames.

Yes, that's the rename case we're primarily concerned with in this case, not a "curiosity." The wished-for improvements to `finfo` would show that `www/permutedindex.html` was once called `www/permutedindex.wiki`. They're of the same lineage.

> I still do not have a good way to assign a canonical name to a fossil inode....you could specify a filename and a specific check-in....

Yes, precisely: `configure@9e816f0aa91...`, meaning the file called `configure` as of the commit ID currently at tip-of-trunk.

This scheme means the lineage potentially has *many* names, but they all resolve to the same lineage. The only case where a lineage has only one name is when it's been committed once and never modified.

> the name is not unique

I don't see see any reason that it has to be *globally* unique. That's why I proposed the local lineage ID idea: only within a given repo do you need to uniquely identify a lineage by a single scalar value, and then only as a local optimization for fast lookups.

They're analogous to RIDs in that way.

[1]: https://en.wikipedia.org/wiki/Lineage_(evolution)

(28) By anonymous on 2020-10-14 01:50:10 in reply to 1 [link]

...different anon

It seems there's some confusion about the 'file' notion here. It would help to state a few reasonable example use cases before jumping to introducing additional concepts.

I'd start with acknowledging that a repository is commonly treated by users as a dynamic collection of files/trees and their history.

That is, **file-path** (in Fossil it's a 'filename') is a first-class entity, unique per given check-in. Therefore, if I wanted to see history for a given file-path, it would have to exist at the given check-in.

Now, the way I'd see some use cases that involve some rename history:

1. List history of the current file-path back to the check-in at which the file was **first** added to the current ancestry line. This means a file might have gone through renames-from, possibly some deletions with subsequent re-additions or renames-to.

2. Follow through the past renames-to
to list the file-path's provenance, that is to end up with its firstmost ancestor file-path (and the check-in).

3. Combine the history of the given file-path across the whole repository, starting with the current point and possibly showing the file's history in other branches to end up at the firstmost point in time the file-path was added to the repo.

4. Follow the file-path provenance (renames-to) across the whole repo, starting with current time-point to end up at the check-in that the first-most check-in ancestor-path was added into the repo

Additionally, a Fossil-specific use-case is requesting all recorded history (unfiltered by ancestor/descendent) for the given file-path.

All of these cases operate with the same concept of Fossil 'file-name'.

There're likely more use-cases of practical interest, I hope those too could be listed along.

(39) By Richard Hipp (drh) on 2020-10-17 02:58:00 in reply to 28 [link]

## Progress Report

Given two inputs:

   *  A filename
   *  A check-in that contains a file with that name

Then the following SQL will generate a table that shows all edits of that
file, both forwards and backwards in time, and *across renames*.

All I have to do now is modify the SQL currently on the /finfo page to
use (optionally) SQL similar to that shown below (with some additional
result columns added), and then we can have a report that
shows the entire history of a file, across renames.

I bring this up because the SQL will become the single most complex query
in Fossil once I get it installed (and assuming it doesn't contain undiscovered
bugs).  I don't know how many thousands of lines of procedure code
this one query replaces, but I suspect it is a lot.  I'll need to add
copious comments to the SQL, I believe, to make it maintainable long-term.

Adjust the `$filename` and the `$checkin` to the particular file whose
history you want to see, and post the whole script into the input of the
"[fossil sql](/help?cmd=sql)" command:

-----

~~~
.param set $filename 'src/mkversion.c'
.param set $checkin '155c8b6'

WITH RECURSIVE fns(fnid) AS (
  SELECT fnid FROM filename WHERE name=$filename
  UNION
  SELECT pfnid FROM mlink, fns
   WHERE mlink.fnid=fns.fnid
     AND pfnid>0
), flink(fid,fnid,pfid,pfnid) AS (
  SELECT fid, fnid, pid,
    CASE WHEN pfnid>0 THEN pfnid ELSE fnid END
    FROM mlink
   WHERE NOT isaux AND fid>0 AND pid>0
     AND fnid IN fns
  UNION
  SELECT pid,
    CASE WHEN pfnid>0 THEN pfnid ELSE fnid END,
    fid, fnid
    FROM mlink
   WHERE NOT isaux AND pid>0 AND fid>0
     AND fnid IN fns
), clade(fid,fnid) AS (
  SELECT blob.rid, fn.fnid FROM blob, filename fn
   WHERE blob.uuid=(SELECT uuid FROM files_of_checkin($checkin)
                     WHERE filename=$filename)
     AND fn.name=$filename
   UNION
  SELECT flink.fid, flink.fnid
    FROM clade, flink
   WHERE clade.fid=flink.pfid AND clade.fnid=flink.pfnid
)
SELECT
  date(min(event.mtime)) AS date,
  (SELECT substr(uuid,1,10) FROM blob WHERE rid=mlink.fid) AS file,
  (SELECT substr(uuid,1,10) FROM blob WHERE rid=mlink.mid) AS checkin,
  mlink.fnid AS fnid,
  CASE WHEN mlink.pid>0 THEN mlink.pid END AS pid,
  CASE WHEN mlink.pfnid>0 THEN mlink.pfnid 
       WHEN mlink.pid>0 THEN mlink.fnid END AS pfnid
  FROM mlink, clade, event
 WHERE mlink.fid=clade.fid AND mlink.fnid=clade.fnid
   AND NOT mlink.isaux AND mlink.pid>=0
   AND event.objid=mlink.mid
 GROUP BY mlink.fid, mlink.fnid
 ORDER BY event.mtime DESC;
~~~

(40) By Stephan Beal (stephan) on 2020-10-17 14:18:46 in reply to 39 [link]

> I bring this up because the SQL will become the single most complex query in Fossil once I get it installed

Simply out of curiosity: can you estimate how long it took you to formulate it?

(41) By Richard Hipp (drh) on 2020-10-17 15:03:35 in reply to 40 [link]

A day.  But that is still less time that it would take to develop and debug
the procedural code to do the same.  And maintenance moving forward should
be easier as well.

I just discovered a bug.  The "fns" CTE table is following name changes
backwards in time but not forwards.  Hence, if you are looking at an
historical version of some file that changes names multiple times, you
see prior names of that file, but not future names.

The code as it stands is checked in.  I'm going running, and then I need
to work on some other things.  I'll fix the "fns" problem after that.  If
you want, you can try out my latest check-in and report any problems you see.

Test case:

   *  <https://fossil-scm.org/fossil/finfo?name=src/mkversion.c&ci=trunk>

The complete history of the src/mkversion.c file including the rename
(from win/version.c) that occurred on 2011-06-18.  If I ask to look at
a version of the file from *before* the rename

   *  <https://fossil-scm.org/fossil/finfo?name=win/version.c&ci=723f8c200f44c7fb&m=5cff6467abe31d9c>

Then the changes that occur after the rename are omitted.  That is the
"fns" bug.

(42) By anonymous on 2020-10-17 18:25:02 in reply to 41 [link]

Impressive AND truly informative!

It's very nice to be able to see the transition of a file through the names all the way to its origin. Perhaps that could be the name for such feature, something like **"Trace File Origin"**

However, once it's past the current file name, the File link is not pulling the previous name, instead it incorrectly keeps the current name; this results in 'file does not exist' error.

For example, on the **Added** entry for the `src/mkversion.c` example: [https://fossil-scm.org/fossil/file?name=src/mkversion.c&ci=f66f414fd3eb6457](https://fossil-scm.org/fossil/file?name=src/mkversion.c&ci=f66f414fd3eb6457)

(43) By Richard Hipp (drh) on 2020-10-17 18:49:24 in reply to 42 [link]

> the File link is not pulling the previous name

Corrections have been made for this which are now running on the website.
Does the latest code fix the problem?
What else have I missed?

(44) By anonymous on 2020-10-17 19:28:27 in reply to 43 [link]

Yep, this works ok for the `src/mkversion.c` case. Thanks for the quick fix.

However, there's some confusing history for [`Makefile.classic`](https://fossil-scm.org/fossil/finfo?name=Makefile.classic).

Indeed, it had a Renamed change on 2011-06-14, but shoudn't there be more of its history back when it was called `Makefile`, well, it should have an Added entry somewhere in the history too?

(45) By Richard Hipp (drh) on 2020-10-17 19:53:01 in reply to 44 [link]

Thanks for pointing out the problem with Makefile.classic.  I don't know
what is wrong there, yet.  I will get to that.  But for right now, I'm
going to focus on the "fns" problem, which (according to my current thinking)
is going to require enhancing SQLite to support a more general form of
recursive common table expressions.  Behold an example of the symbiosis
between Fossil and SQLite &rarr; Fossil is driving innovation in SQLite.
So I'll be working on SQLite for the next few days, trying to figure out
how to generalize CTEs. I will get back to working on the new Fossil capabilities,
just not right away.  At least, that is my current plan of attack.
I reserve the right to change my mind, since, as we all know, no battle
plan survives first contact with the enemy.

Meanwhile y'all can help out by finding and reporting additional areas
of breakage and/or goofiness in the
new /finfo track-changes-across-file-rename feature.

Note that I will also be needing to enhance /annotate to also work across
file renames.  That problem will be much easier after /finfo is completely
fixed, so I will hold off working on it until then.

(47) By Richard Hipp (drh) on 2020-10-19 18:33:45 in reply to 44 [link]

The latest trunk check-in seems to be working better.

  *  [All files named "Makefile.classic"][1]

  *  [History of "Makefile.classic" across renames][2]

[1]: fossil:/finfo?name=Makefile.classic
[2]: fossil:/finfo?name=Makefile.classic&ci=590cee0ec92d5392&m=bc3c18ed64dc40f8

Please continue to look for and report problems.

## Question:

Should the **Rename:** segments be moved to the front of the
description in the second graph above, or in graphs like this one:

  *  <https://pikchr.org/home/finfo?name=tests/test08.pikchr&m&ci=tip>

Rather than being buried deep down inside the description where they are hard to see?

(49) By graham on 2020-10-19 22:29:03 in reply to 47 [link]

Looks good. I'd probably say "yes" to moving notice of renaming to the front... for projects that don't rename files ever/much, it won't make any difference; for projects where renaming _does_ happen – and where not realizing a file's name has changed could get confusing – I believe it will be useful to have changes prominent.

(50) By Warren Young (wyoung) on 2020-10-19 23:37:50 in reply to 47 [link]

> Please continue to look for and report problems.

The timeline entries have hyperlinks for the old/new names on only one end of the graph. [Example](https://tangentsoft.com/pidp8i/finfo?name=bin/pidp8i). This is the "middle" name for a file that's had three names over its lifetime. In the top timeline box, you can follow the link to the current name, but in the last box, there is no link to follow to the prior name; you have to manually edit the URL to backtrace the naming.

In happier news, the latest improvements made this tracing *possible*, where it wasn't before. It could only trace one direction through the graph before, so it's getting better!

(52) By Warren Young (wyoung) on 2020-10-20 00:33:37 in reply to 47 [link]

> Please continue to look for and report problems.

While playing with this new feature, I experienced a segfault when loading `/finfo` on one of my files, but while I was typing up the report, drh committed [this](fossil:/info/ab71d95a9f326369), which fixed it. Was that supposed to fix a segfault in `graph_finish()`, called from `finfo_page()`? If not, then we may still have a problem.

The bisect I collected before the symptom went away was:

----

```
  1 BAD     2020-10-19 20:52:34 0fac549b5663fafc
  4 BAD     2020-10-19 20:09:25 a0bca48147b3ff1d
  5 BAD     2020-10-19 18:24:41 8c598d7232c227e1 CURRENT
  3 GOOD    2020-10-19 14:37:26 47bfea074b122616
  2 GOOD    2020-10-17 18:47:39 0af477224547fea8
```

----

Alas, I found this while trying the feature against a repo I can't share, but the above-linked commit's comment about large repos may be relevant: this is my biggest repo, bigger than either SQLite or Fossil.

(54) By Richard Hipp (drh) on 2020-10-20 01:09:29 in reply to 52 [link]

For this big repo, what is the result of:

~~~
   SELECT max(fnid) FROM filename;
   SELECT max(rid) FROM blob;
~~~

If the product will not fit in a 32-bit signed integer, then problems might
have resulted prior to my fix.  I'm not aware of any segfault possibilities - 
I would have thought you would just get a goofy graph.  But maybe I missed
something.

(56) By Warren Young (wyoung) on 2020-10-20 09:38:02 in reply to 54 [link]

Good test; I wouldn't have thought of it on my own.

Alas, no dice:

```
sqlite> SELECT max(fnid) FROM filename;
30489
sqlite> SELECT max(rid) FROM blob;
125704
```

Since running into the problem, I've built Fossil for debugging, and I get no useful diagnostics:

```
./configure --fossil-debug --with-sanitizer=enum,null,undefined
```

I also tried it under Valgrind, and it still didn't flag any serious problem.

(57) By anonymous on 2020-10-20 09:58:00 in reply to 56 [link]

Why *no dice*?

`125'704 * 30'489` **does** overflow a **signed** 32-bit integer, at `3'832'589'256`

(59) By Warren Young (wyoung) on 2020-10-20 10:33:28 in reply to 57 [link]

Sorry, I missed the word "product". I thought he was asking if either term was over 2.1 billion and change.

So good! It looks like I did inadvertently run into a problem we can reasonably expect the commit to fix.

(53) By John Rouillard (rouilj) on 2020-10-20 00:47:57 in reply to 47 [link]

Yes, I think making renaming more prominent is needed. Maybe adding an icon of some sort (css or svg based) would be helpful when scanning the timeline to identify renames.

(55) By jamsek on 2020-10-20 02:37:44 in reply to 47 [link]

Yes, the *renamed* notice should be foremost and emphasised to make more  
prominent.

This is truly outstanding work, Richard. To see this feature implemented  
from its inception via a synergistic development process with SQLite is  
remarkable. Not to mention the quasi-disparaging impetus that could have  
very easily received an entirely different response. Speaks volumes  
about you and the community you've fostered. Thank you.

(58) By anonymous on 2020-10-20 10:03:41 in reply to 55 [link]

I object to *disparaging* (*grumbler* here), for what it's worth.

I was stating facts: no history across renames, and copies.  
And thus that Fossil lacked features of old SCMs like Perforce.

Sure, my tongue-in-cheek comment did not go well with Richard.  
I apologize for that. And I'm glad renames are better supported now.

I chimed in about copies, since that's the use-case I just went through.  
And from what I read, it is not a use-case supported by Fossil at the moment.

(63) By jshoyer on 2020-10-20 22:17:04 in reply to 58 [link]

You essentially can copy files,
and now their divergent histories
show up on single file-info pages.
There's no `fossil cp` command,
but if you rename a file on a branch
and then merge that branch back in to another branch
you can manually stage things
to force “duplication” of the file.

<pre>
fossil new repo.fossil
mkdir workdir;  cd workdir
fossil open ../repo.fossil
echo 'line 1' > file-A.txt
fossil add file-A.txt
fossil ci -m 'Add one-line file A'
fossil mv -hard file-A.txt file-B.txt
fossil ci -m 'Rename file, A to B' --branch second-branch-B
echo 'line 2' >> file-B.txt
fossil ci -m 'Add a second line to the file (B)'
fossil co trunk
fossil merge second-branch-B
echo 'line 1' > file-A.txt    # recreate file with its original name
fossil add file-A.txt
fossil ci -m 'Merge in file B in a way that separates its history going forward from that of file A'
</pre>

(67) By anonymous on 2020-10-21 06:58:07 in reply to 63 [link]

Thanks for sharing this recipe to *emulate* `copy`.

It's obviously not user-friendly, and messes up the history with an  
artificial branch, i.e. far from ideal, but useful to know until proper  
support is added, hopefully.

(62) By jshoyer on 2020-10-20 22:15:59 in reply to 47 [link]

Amazing work.

One small appearance issue,
exemplified the file formerly named Makefile.
The page [fossil:/finfo?name=Makefile | /finfo?name=Makefile]
has a link at the top to [fossil:/file?name=Makefile | /file?name=Makefile],
which currently gives the equivalent of an error page:
“File 'Makefile' does not exist in this repository.”
I of course knows
that does not mean that there was never a file called Makefile
in any check-in in this repository,
but some could find it confusing.
Changing the message might help
(“No file 'Makefile' in the latest check-in in this repository”)
or you might add code
to check to make sure that such a file-path never existed.

A minor thing,
but I wanted to document it
in case it hasn't been noticed previously.

(69) By jshoyer on 2020-10-21 12:45:31 in reply to 62 [link]

Thank you
for resolving the issue with the /file page
for files that have been removed or renamed
in check-in [fossil:5cf5ad8491c97b09aa99].
Resolves the issue,
builds fine for me
on MacOS 10.14.6
(without OpenSSL).

Being able to see file-clade family trees
is already providing me valuable context
for the many scripts I have renamed in my repositories.

(64) By anonymous on 2020-10-21 01:54:35 in reply to 47 [link]

> All files named "Makefile.classic"

I'm not sure how to interpret this, but for [Makefile.classic](https://fossil-scm.org/fossil/finfo?name=Makefile.classic) (without ci query-param) the finfo history seems to stop on Renamed-to commit.

My understanding is that in the "all"-case the finfo history would still continue past the Renamed-to commit(s) all the way back to the Added commit when single-origin, or respective commits in many-origin case.

(46) By Warren Young (wyoung) on 2020-10-18 06:49:17 in reply to 39 [link]

> both forwards and backwards in time, and across renames.

I've only checked two cases, but even to the extent that it may be incomplete or imperfect so far, the feature is *solid* and *useful*.

It even has a feature I don't think anyone asked for explicitly and was about to add to it, when I found it in the code with a date of 2020-10-17: the link from `/file` to `/finfo`.

Thank you deeply, drh!

For what it's worth, I went looking into adding that feature because I expected it to be in the third-level button bar at the top of the page where the Annotate and Artifact buttons are, expecting to find an Info button. What I discovered only by reading the code is that it's a hyperlink on the file name in the header line immediately below. ("File $NAME from $CKIN_ID.")

(48) By Richard Hipp (drh) on 2020-10-19 18:58:30 in reply to 39 [link]

## Follow-up:

I enhanced SQLite so that [recursive common table expressions][1] are allowed
to hold two or more recursive SELECTs.  This makes the query above much
simpler and faster.  See the code below.  The original CTE design in SQLite
followed the PostgreSQL documentation, which requires that a CTE have only
a single recursive SELECT.  I looked around at other systems, and I found that
SQL Server allows multiple recursive queries, which seemed to nicely solve
this problem.

[1]: https://www.sqlite.org/draft/lang_with.html#recursivecte

Four items of note:

  1.  SQLite is now the only SQL database engine that I know of that allows
      multiple recursive SELECT statements in a single recursive CTE and
      allows the use of the UNION operator to connect those statements.
      (If you know of another, please tell me.) SQL Server allows
      multiple recursive SELECT statements but only allows UNION ALL operator.
      PostgreSQL allows UNION but only a single recursive SELECT statement.
      The SQL code shown below requires both.  The crux of the problem is
      to find all the nodes on a graph that communicate with a given node.
      Solving the problem efficiently in SQL requires both the ability to
      have multiple recursive SELECT statements in a CTE and the ability to
      use the UNION operator to connect those statements.

  2.  This episode is an example of how Fossil aids the development of
      SQLite not only by serving as the VCS for SQLite, but also by serving
      as a test platform for SQLite.  Fossil needed a new SQL capability,
      a capability described by SQL standards but implemented nowhere, and
      so SQLite was enhanced to provide that capability.  Not only is Fossil
      improved, but so is SQLite.  This is an example of the benefit of
      [dogfooding](wikipedia:/wiki/Eating_your_own_dog_food).

  3.  Close observers will notice my use of the CROSS JOIN operator to
      force a particular join order.  This was necessary for efficiency and
      points to a potential optimization opportunity in the SQLite query
      planner.  I'll be working on that directly.  At some point, the
      CROSS JOIN operator should become unnecessary.

  4.  Because Fossil is using the new SQLite capability, that means it won't
      work with an external SQLite DLL or shared library, as the version of
      SQLite that Fossil needs in order to work has not yet been released.

The new, simplified, faster SQL is shown below.

~~~
WITH RECURSIVE clade(fid,fnid) AS (
  SELECT blob.rid, 102 FROM blob
   WHERE blob.uuid=(SELECT uuid FROM files_of_checkin('tip') WHERE filename='src/mkversion.c')
   UNION
  SELECT mlink.fid, mlink.fnid
    FROM clade, mlink
   WHERE clade.fid=mlink.pid
     AND ((mlink.pfnid=0 AND mlink.fnid=clade.fnid)
          OR mlink.pfnid=clade.fnid)
     AND mlink.fid>0   UNION
  SELECT mlink.pid, CASE WHEN mlink.pfnid>0 THEN mlink.pfnid ELSE mlink.fnid END
    FROM clade, mlink
   WHERE mlink.pid>0
     AND mlink.fid=clade.fid
     AND mlink.fnid=clade.fnid
)
SELECT
  datetime(min(event.mtime),toLocal()),
  coalesce(event.ecomment, event.comment),
  coalesce(event.euser, event.user),
  mlink.pid,
  mlink.fid,
  (SELECT uuid FROM blob WHERE rid=mlink.pid),
  blob.uuid,
  (SELECT uuid FROM blob WHERE rid=mlink.mid),
  event.bgcolor,
  (SELECT value FROM tagxref WHERE tagid=8 AND tagtype>0 AND tagxref.rid=mlink.mid),
  mlink.mid,
  mlink.pfnid,
  blob.size,
  mlink.fnid,
  filename.name
FROM clade CROSS JOIN mlink, event, blob, filename
WHERE mlink.fnid=clade.fnid AND mlink.fid=clade.fid
  AND event.objid=mlink.mid
  AND blob.rid=clade.fid
  AND filename.fnid=clade.fnid
GROUP BY CASE WHEN mlink.fid>0 THEN mlink.fid ELSE mlink.pid+1000000000 END, mlink.fnid
ORDER BY event.mtime DESC
~~~

(51) By G. David Butler (gdb) on 2020-10-20 00:15:27 in reply to 48 [link]

This is a fantastic result! I use RECURSIVE all the time and ran into a similar problem. The code change in SQLite was minimal but the expressive power is awesome. The code reduction in one of my queries was unbelievable.

Thank you for the dog food!

(60) By anonymous on 2020-10-20 10:54:01 in reply to 48 [link]

> *recursive common table expressions* are allowed [...]

The railroad diagram shows only `UNION` and `UNION ALL` as compounds.  
But point #1 also mentions `INTERSECT` and `EXCEPT`. Which is correct?

(61) By Richard Hipp (drh) on 2020-10-20 17:26:18 in reply to 60

The individual SELECT statements of a compound SELECT can, in general, be
separated by any of the following operators:

   *  UNION
   *  UNION ALL
   *  INTERSECT
   *  EXCEPT

However, the recursive SELECT statements in a recursive common table
expression can only be use the first two:  UNION and UNION ALL.  And
in the case of multiple recursive SELECT statements, they all have to use
the same operator.  SQL Server takes this one step further and requires
that you use only UNION ALL.  To recap:

   *  PostgreSQL &rarr;  Allows both UNION and UNION ALL but only permits
      a single recursive SELECT.

   *  SQL Server &rarr;  Allows multiple recursive SELECTs but only permits
      UNION ALL.  UNION is not supported.

   *  SQLite &rarr;  Allows UNION and UNION ALL and allows multiple recursive
      SELECT statements.  The best of both!

Can somebody who knows complete this chart for other popular SQL database
engines?

(68) By Markus Winand (MarkusWinand) on 2020-10-21 12:42:36 in reply to 61 [link]

I have some charts for this on my website:

[https://modern-sql.com/caniuse/with_recursive_(non-linear)]

The current versions of Db2 (LUW), MariaDB, MySQL and SQL Server support non-linear recursive queries (that is: referring the query_name more than once in the query definition).

Regarding support of <code>UNION <nowiki>[DISTINCT]</nowiki></code> in recursive queries:

[https://modern-sql.com/caniuse/with_recursive_(union_distinct)]


The current versions of MariaDB, MySQL, PostgreSQL and—you know it—SQLite support <code>UNION</code> (w/o <code>DISTINCT</code>) or <code>UNION DISTINCT</code> in recursive queries.