Privacy and Fossil implementation changes

(1) By Dan Shearer (danshearer) on 2020-10-21 09:01:32

Privacy issues came up on the thread "Is Git Irreplacable?".

Anything that threatens the integrity of the Merkle tree needs careful thought. This reply covers the main issues and contains a document at the end with an implementation proposal for how Fossil can address privacy and at the same time make life easier for existing Fossil users in areas other than privacy.

Main points:

I don't think privacy issues are really a big threat to the Merkle tree
EU privacy law (which is driving this debate worldwide) is more opportunity than threat
There are practical, everyday benefits to Fossil users from implementing some privacy fixes in accordance with EU law

Warren Young (wyoung) on 2020-10-14 04:53:50 said:

Meanwhile Fossil takes a principled PII protection stance from go:

I don't think this is completely true despite intentions, but it could be. Fossil doesn't publish email addresses by default, and it has a "View PII" capability, and it generally ticks many other boxes. So we're off to a good start.

At the end of this response I have included a draft document for how Fossil can be modified to distinguish itself on privacy grounds, and how that might help Fossil generally. I also feel Fossil should be compared to Github, not git, and that with a bit of patching Fossil could provide privacy benefits that Github does not.

I agree with Warren that this needs to start from the EU's GDPR. Alternatives to the GDPR are based on the OECD privacy principles and guidelines. The GDPR starts with human rights and personal data, which is a superset of the OECD-ish laws seen elsewhere including the Californian one Warren referred to. The forthcoming EU ePrivacy Regulation (or "Child of the GDPR") mandates end-to-end encryption and has other interesting features, going even further than the GDPR, so now is a good time to review Fossil's already-pretty-good privacy situation. To take just one other example to illustrate that privacy is changing globally, Pakistan has a draft GDPR-ish (not OECD-ish) privacy law.

Git had architectural support for such global collaboration - in form of E-Mail address

Yes, and it's arguably a tarpit of a GDPR violation. There is no way to support the Europeans' notion of "right to be forgotten" within Git short of shooting the Merkle tree full of holes.

No, this is incorrect, but it needs context to show why, which is why I have put this in a separate thread. The good news is that the situation is favourable to Fossil. The circumstances under which a Merkle tree even potentially could get shot full of holes on GDPR grounds are limited, and not greatly made better or worse by Fossil's current design in any case.

Here is big-picture context:

There is no such thing as a right to be forgotten at EU level, although the famous Costeja case in Spain is commonly misquoted as establishing one. Google was required to remove search results because the data was "inadequate, irrelevant or no longer relevant". Even in this old case Fossil would not be required to blow a hole in a Merkle tree, and as proof here is a link to the original data which is required to be available. The GDPR did not apply because this was in 2014 under the then-current 1995 EU privacy law, and the GDPR did not become law until 2016.

Really, there is no EU right to be forgotten. In 2012 a draft version of the GDPR had one, but that was soon deleted in drafts and the GDPR was passed in 2016 with a much more limited Right to Erasure in Article 17. Article 17 has the alternative title of 'Right to be Forgotten' for reasons I find humorously recursive: draft EU privacy laws have input and review from thousands of people, and the catchy phrase "Right to be Forgotten" was so widely used it couldn't be forgotten.

The Right to Erasure Doesn't Bear Much on Fossil. Article 17 gives a list of the highly specific reasons which are possible grounds for a Controller to agree to an individual's request that their personal data be erased. A hosted instance of Fossil needs a perpetual lawful basis for processing, which we already have a very strong case for by implication. With tweaks roughly like those I propose in the document at the end of this response we can solidly establish a lawful basis. We can encourage all Fossil instances to have this by providing it as a default. (OECD-ish jurisdictions including California don't require a lawful basis for processing data btw.)

Back when all of this GDPR stuff was hot news, I recall GitHub, Inc. basically punting on this, saying it's the repo owner's problem. That's BS, as far as I'm concerned, but it'll probably take a lawsuit to settle it. Fortunately, no one's been crazy enough to bother.

Your recollection is correct and as of today Github definitely states repos are the repo owner's problem, but the analysis is factually incorrect. Github knows this statement is no kind of escape, because the GDPR still insists that Github delete data according to a lawful request regardless of whether they are a Processor or Controller. However once again that is beside the point because Article 17 so limits the scope of the potential problem for Fossil that the Merkle trees seem pretty safe. There have been binding legal decisions on this, but in brief the GDPR means what it says, and draws firm boundaries for the benefit of Fossil and Fossil users.

Meanwhile Fossil takes a principled PII protection stance from go: we reveal only user names, nothing else. If you're from the EU and you think even your user name should be removed from the history of a project, and the repo owners agree to accommodate you, it'll still shoot holes in the Merkle tree, but it's a lower risk than in Git repos.

It is commendable we don't keep or publish data we don't need. But apart from that, the paragraph quoted above isn't robust privacy analysis. The repo owners can agree to anything they like, but regardless of their views the GDPR applies with full force to the name we publish in Fossil and the SHA-256 GUID we associate it with. It would likely apply just as much if we also included a full name and even an IP address, because we associated the personal data with a GUID. It would take a lawyer to say whether git is higher or lower risk on these grounds, and I don't think the lack of an email address is as strong a point in favour of Fossil as it might seem. The GDPR's definition of Personal Data makes it clear there is a defined minimum, but there is also an ever-expanding cloud of indirect pointers in the future. A 48-bit bluetooth address without a name attached to it is still personal data according to the GDPR, because the kinds of wizards who are on this forum could do all sorts of devious things with it. We need to look elsewhere to find advantages in how Fossil implements privacy protection.

Note that this sort of thinking is spreading beyond the EU. California has a GDPR-like law on the books now, and all of this current kerfuffle in the US Congress, grilling the brass from the big tech companies, is likely some sort of prelude to passing similar laws at the federal level. If you're outside both the US and the EU, this still affects any organization with global reach, and it affects software products produced within their bounds, which outsiders then use.

To make sense of this it helps to turn the problem on its head. Always think about the personal data, which is always owned by the individual, who has a variety of human rights attached to it. The point has less to do with the global reach of an organisation than with who the individuals are. If an individual is an EU citizen, or is resident in the EU on some basis, or in certain other non-EU countries, then their data is covered by the EU no matter where in the world it is held. For enforceability then yes, it is first organisations with global reach who need to improve. A small Fossil hosting company in Kenya or the US could choose not to allow any EU citizens or residents to use their service, but they are better off assuming that any user could be one, and treating personal data accordingly.

Right now there are even bigger privacy problems than mere GDPR compliance for any Fossil hoster based in a country such as Australia, the US or China whose surveillance laws conflict with the human rights-based approach of the GDPR. That topic is definitely out of scope for this forum.

Following are proposals we can implement that would improve privacy in Fossil.

Draft Proposals for Privacy in Fossil

Any given Fossil installation is the responsibility of those who run it. But Fossil can make it easier to maintain good practice, and can provide good defaults.

Privacy Statement

Every Fossil instance should come with a boilerplate privacy statement, listing the personal data that Fossil holds in its logs, administrative data, cookies and elsewhere, and giving classes of use including replication, admin tasks, security checks etc. It's a GDPR requirement to state all personal data we keep/could keep, and what it is used for, and why. The first part of establishing a lawful basis for use is showing that we have a good reason to collect it in the first place, and that the user consented to their specific data being collected and used in their specific case.

This should not be a long or complicated statement. It should be linked from the default home page.
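
As a rough illustration of how short such a boilerplate statement could be, here is a minimal Python sketch that renders one from a small inventory. The `DataItem` type, the `render_statement` function and the three inventory entries are all hypothetical placeholders; a real instance would have to list what it actually stores and why.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataItem:
    """One kind of personal data and why it is kept (illustrative only)."""
    name: str
    held_in: str
    purpose: str


# Placeholder inventory, not a statement of what Fossil actually holds.
INVENTORY = [
    DataItem("user name", "commit records", "attributing changes; replication"),
    DataItem("IP address", "server logs", "security checks"),
    DataItem("login cookie", "browser", "admin tasks and session handling"),
]


def render_statement(site: str, items: list[DataItem]) -> str:
    """Return a short plain-text privacy statement, one line per data item."""
    lines = [f"Privacy statement for {site}", ""]
    for item in items:
        lines.append(f"- {item.name} (held in {item.held_in}): used for {item.purpose}")
    return "\n".join(lines)


print(render_statement("example.org", INVENTORY))
```

Keeping the inventory as data rather than prose would make it easier for a site admin to adjust the default without rewriting the statement.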

Records of Agreement

It is usual/almost mandatory for those who run a Fossil repository to have an agreement with committers addressing copyright, joint copyright, liability and so forth. Fossil's Contributor Agreement is an example.

Fossil should ask for and keep two extra records:

1. A documented link between every committer's name and their purpose for committing. This is part of establishing a perpetually lawful basis for processing the personal data associated with a given committer. A contributor's agreement is a good start to establishing that. The implementation of this is that we maintain online evidence associating a particular agreement with the user who has signed it on a particular time and date. The agreement could be a wiki page, and the version signed needs to be recorded as well. This would be information that is visible publicly by default (or at least to all who have the "Developer" capability?). Privacy rights compete with public interest on this matter, and it is likely that there is an overriding public interest in knowing "User X has signed document Y version Z on $date" given that the purpose is to create a public chain of trust in the right to use the source code in Fossil. The discussion is different for non-open source code stored in Fossil, but the facility is still needed. As a bonus, this is very relevant to establishing copyright status or non-status.
2. A documented link from every committer to the privacy agreement/release they have signed. This consent would say that the committer grants processing rights to their personal data associated with commits in perpetuity (or "for as long as the lawful basis remains", or other such lawyer words). This is somewhat different when Fossil is used for non-open source code, a case not addressed here. This also would by default need to have a pointer to a particular version of the wiki page.
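
To make "User X has signed document Y version Z on $date" concrete, here is a minimal Python sketch of what one such record might hold. Everything in it is an assumption for illustration: the `AgreementRecord` type, the `sign` and `describe` helpers, and the use of a SHA-256 digest of the page text as a stand-in for whatever version identifier the repository would really assign.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AgreementRecord:
    """Evidence that a user signed one exact version of an agreement page."""
    user: str
    page: str          # hypothetical wiki page name
    version_hash: str  # hash of the exact text that was signed
    signed_at: str     # UTC timestamp in ISO 8601 form


def sign(user: str, page: str, text: str, when: datetime) -> AgreementRecord:
    # SHA-256 of the text stands in for the repository's own version identifier.
    version = hashlib.sha256(text.encode("utf-8")).hexdigest()
    stamp = when.astimezone(timezone.utc).isoformat()
    return AgreementRecord(user, page, version, stamp)


def describe(rec: AgreementRecord) -> str:
    """Render the public one-line statement about a signature."""
    return (f"User {rec.user} has signed {rec.page} "
            f"version {rec.version_hash[:12]} on {rec.signed_at}")


rec = sign("alice", "ContributorAgreement", "I agree to the terms.\n",
           datetime(2020, 10, 21, 9, 1, 32, tzinfo=timezone.utc))
print(describe(rec))
```

Because the record points at a hash of the signed text, a later edit to the wiki page would not silently change what the user agreed to.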

Scope of Records of Agreement - code and docs

The above two documents (or a single mashup of them both) apply to two cases:

a. the source code commit case, and

b. the documentation commit case

Fossil treats these as separate authentication domains by default, and it is unusual for source code copyright licenses to also be used for documentation copyright licenses, on the assumption that the contributor agreement also specifies the copyright. Therefore I assume that most often there will be two sets of these two links/documents, i.e. four in total for every user who commits both code and documentation. They can be implemented as checkboxes, pretty much.
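
The "two sets of two" arithmetic can be sketched as a small checklist. This Python fragment is only an illustration; the scope and kind labels and the `missing_agreements` helper are made-up names, not anything Fossil provides.

```python
from itertools import product

SCOPES = ("code", "documentation")
KINDS = ("contributor agreement", "privacy consent")


def missing_agreements(signed: set[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the (scope, kind) checkboxes a user has not yet ticked."""
    return [pair for pair in product(SCOPES, KINDS) if pair not in signed]


# A user who has ticked both boxes for code but nothing for documentation.
signed = {("code", "contributor agreement"), ("code", "privacy consent")}
print(missing_agreements(signed))
```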

Scope of Records of Agreement - tickets

Ticket commits are somewhat different. A ticket submitter still needs to consent to their data being processed, even if we don't know who they are. Tickets may also be private by default, and some tickets may never become public. We are still handling their IP address at a minimum, and it is common for personal data to be pasted into a ticket including accidentally. Fossil has a way of almost-deleting tickets by default, and Fossil's Merkle tree hole-blowing devices such as "shun" also apply to tickets. More thought is needed about how to implement good privacy practice in the case of tickets.

Public-facing Source Trees

A line should be included by default with each commit (in a similar way that some boilerplate commit comment text is provided today) referencing the documented links (1) and (2) above. This can be a very terse few words accompanied by an object hash.
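
A terse line of this kind might look like the output of the following Python sketch. The line format, the `agreement_line` name and the 16-digit abbreviation are arbitrary choices made for illustration, and the check that a hash is 40 to 64 lowercase hex digits is an assumption about the hash format.

```python
import re

# Assumed shape of an object hash: 40 to 64 lowercase hex digits.
HASH_RE = re.compile(r"[0-9a-f]{40,64}")


def agreement_line(scope: str, agreement_hash: str, consent_hash: str) -> str:
    """Build a one-line reference to the two signed documents."""
    for h in (agreement_hash, consent_hash):
        if not HASH_RE.fullmatch(h):
            raise ValueError(f"not a hex object hash: {h!r}")
    # Abbreviate each hash to 16 digits to keep the line terse.
    return (f"Agreements({scope}): contrib={agreement_hash[:16]} "
            f"privacy={consent_hash[:16]}")


print(agreement_line("code", "a" * 64, "b" * 64))
```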

Public-facing Wiki Pages and Tickets

Every commit should also be accompanied by the same terse agreement line as for source code, only with the agreement text appropriate to documentation.

Caveats to This Document

The GDPR does not assume the people implementing it in code and practice are lawyers. However, a GDPR lawyer's review is definitely needed, including on points such as the exact wording of the default documents proposed here, and whether there is some sneaky clause that might endanger the Merkle tree I haven't thought of, and so on. Evidently IANAL, but there is nothing surprising or radical proposed above.
Anonymous contributions (such as when a Fossil repo chooses to accept anonymous tickets, and perhaps Wiki entries) may contain personal data. Therefore consent needs to be sought from the user. In addition an anonymous user still has their IP address handled by Fossil and perhaps other things, so that user still needs to grant consent. Such contributions may also contain illegal information or indeed anything else, and so this problem goes beyond privacy and copyright.
This document is not a privacy policy for Fossil nor is it comprehensive.
The GDPR specifies a particular kind of privacy impact assessment (PIA), and Article 25 of the GDPR, "Data Protection by Design and Default", bears on Fossil's implementation of privacy. It is quite woolly, but it would be good for someone familiar with PIAs to review Fossil.

(2) By John Rouillard (rouilj) on 2020-10-21 15:15:43 in reply to 1

For tickets is there a way to add a checkbox to the ticket form:

[ ] I have read and accept the <privacy document wiki link>

and reject the update if not checked? This would show up only if the user is using the anonymous or nobody role. It is not needed for actual users as presumably they would have agreed to the privacy document at the time of account creation.
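
The rule being asked for is small enough to sketch in a few lines of Python. This is not how Fossil's ticket code is written; `may_accept_ticket` and `ROLES_NEEDING_CONSENT` are hypothetical names used only to pin down the behaviour described.

```python
# Logins that would be shown the consent checkbox on the ticket form.
ROLES_NEEDING_CONSENT = {"anonymous", "nobody"}


def may_accept_ticket(login: str, consent_checked: bool) -> bool:
    """Accept a ticket update unless an unauthenticated submitter skipped the box."""
    if login in ROLES_NEEDING_CONSENT:
        return consent_checked
    # Named users are presumed to have agreed when their account was created.
    return True


print(may_accept_ticket("anonymous", False))  # rejected: box not ticked
print(may_accept_ticket("rouilj", False))     # accepted: named user
```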

(4) By Dan Shearer (danshearer) on 2020-10-23 19:14:06 in reply to 2

John Rouillard (rouilj) on 2020-10-21 15:15:43:

[ ] I have read and accept the <privacy document wiki link>

Yes that seems like a simple solution for tickets.

As to "...not needed for actual users as presumably they would have agreed to the privacy document..." that is one of the main points of my design proposal: to take out the "presumably" and instead have enduring evidence that a given user definitely did sign the privacy document, and that it was version XXX they signed on $date. (And we can cover copyright too in that process since the user is signing things anyway.)

Dan

(3) By sean (jungleboogie) on 2020-10-21 16:04:23 in reply to 1

Alternative ideas...

/setup_config could have a new entry for "privacy rights", and it would be up to the person running the Fossil site to update this and put whatever link they want in for it - or none at all (my preferred default).

If you really want to be annoying with your fossil instance, you can put some text in the /setup_adunit section and have this displayed on all pages of your fossil instance.

(5) By Dan Shearer (danshearer) on 2020-10-28 17:13:08 in reply to 1

Dan Shearer (danshearer) on 2020-10-21 09:01:32:

Fossil doesn't publish email addresses by default, and it has a "View PII" capability, and it generally ticks many other boxes.

One of those ticks being fossil scrub, a command I just had reason to use and I really have to commend it. From the help:

The command removes sensitive information (such as passwords) from a repository so that the repository can be sent to an untrusted reader ... if the --verily option is added, then private branches, concealed email addresses, IP addresses of correspondents, and similar privacy-sensitive fields are also purged ...This command permanently deletes the scrubbed information.

This attention to detail by default helps Fossil stand out from a privacy perspective. "And similar privacy-sensitive fields" should be enumerated, but that's about the only nitpick I have.
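
Since the help text says the deletion is permanent, one cautious way to use the command is to scrub a copy and keep the original untouched. The Python sketch below does that. The `--verily` option comes from the help text quoted above; passing the repository path as an argument is my assumption and should be checked against `fossil help scrub`, and the command may also ask for confirmation when run. The function names and file names are made up.

```python
import shutil
import subprocess
from pathlib import Path


def scrub_command(repo: Path) -> list[str]:
    # "--verily" is quoted from the help text; the positional repository
    # argument is an assumption to verify with `fossil help scrub`.
    return ["fossil", "scrub", "--verily", str(repo)]


def scrub_copy(original: Path, copy: Path, run=subprocess.run) -> Path:
    """Copy the repository file, then scrub only the copy."""
    shutil.copy2(original, copy)
    run(scrub_command(copy), check=True)
    return copy


# Example call (hypothetical file names):
# scrub_copy(Path("project.fossil"), Path("project-public.fossil"))
```

The `run` parameter exists only so the external call can be replaced in a dry run or a test.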

Dan Shearer

(6) By Richard Hipp (drh) on 2020-10-28 17:29:08 in reply to 5

An independent audit of the "fossil scrub" command, to make sure it isn't missing anything, would be a good idea. In particular, the scrub command was written many years ago, and we've added a lot of features to fossil since then, and perhaps those added features have also added sensitive information that might need to be removed by scrub but which was overlooked at the time.
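
One possible starting point for such an audit, sketched in Python: since a Fossil repository is an SQLite database file, a reviewer can list every table and column and compare that list against what scrub clears. The helper names and the keyword hints below are guesses for illustration, not a list of what Fossil considers sensitive, and a name-based scan is no substitute for reading the scrub code.

```python
import sqlite3


def list_columns(conn: sqlite3.Connection) -> dict[str, list[str]]:
    """Map each table name to its column names."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    schema = {}
    for table in tables:
        quoted = table.replace('"', '""')
        # PRAGMA table_info returns one row per column; index 1 is the name.
        cols = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
        schema[table] = [col[1] for col in cols]
    return schema


def flag_candidates(schema, hints=("pw", "ipaddr", "email", "cookie")):
    """Return (table, column) pairs whose column name contains a hint."""
    return [(table, col) for table, cols in schema.items() for col in cols
            if any(hint in col.lower() for hint in hints)]


# Example call (hypothetical file name):
# conn = sqlite3.connect("project-copy.fossil")
# for table, col in flag_candidates(list_columns(conn)):
#     print(table, col)
```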