Saving OpenDocument XML files in fossil without having to zip/unzip them

(1) By Stephan Beal (stephan) on 2019-12-12 18:23:34 [source]

The topic of storing ODT (OpenDocument) files in fossil has come up before in the mailing list/forum, and it's been pointed out that fossil cannot do a great job of producing small deltas because ODTs are binary ZIP files which contain a mix of compressed XML and possibly other binary data.

A few minutes ago, in the context of searching for a workaround for that very situation, i learned that ODT has a sub-format, called Flat XML ODF (.fodt), in which it stores the document in a single, uncompressed XML file.

To try this out, open an ODT with LibreOffice (or, presumably, OpenOffice or any similar software), choose File ==> Save As..., and select FODT as the format.

My test file is a 160-odd-page ODT, 277kb in plain ODT format (exported from Google Docs, so it likely has conversion-related formatting baggage which inflates its size). Exporting it to FODT results in a single 2MB XML file, which is something fossil can ostensibly use to create smaller deltas between versions, especially if the word processor keeps the "internal bookkeeping markup" more or less stable, so that the real changes are primarily in user-entered text.

Compressing that 2MB XML with the venerable Unix compress tool, which is roughly equivalent to fossil's own internal compression, results in a 395kb binary file (which roughly represents how much space that original version would take up when imported into fossil).
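For anyone wanting to repeat that kind of size estimate from a script, here is a minimal Python sketch. It assumes that zlib compression is a fair stand-in for what fossil does internally when storing a new artifact; the function name `estimated_stored_size` is made up for illustration, and neither repository overhead nor delta compression between later versions is modelled.

```python
import zlib


def estimated_stored_size(data: bytes, level: int = 9) -> int:
    """Return the zlib-compressed size of data, in bytes.

    Rough stand-in for what a first check-in of the file might cost.
    Assumption: zlib approximates fossil's internal compression.
    """
    return len(zlib.compress(data, level))


# In-memory sample standing in for the contents of a .fodt file.
sample = b"<text:p>Some paragraph text.</text:p>\n" * 1000
print(f"raw: {len(sample)} bytes, zlib: {estimated_stored_size(sample)} bytes")
```

For a real document, read the file in binary mode and pass its bytes to the function.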

There's one apparent catch with regard to using FODT in fossil: long lines. When writing a paragraph in a word processor, it gets exported as a single line (just like fossil's wiki, or even this forum post). Fossil has a heuristic which says "any file with lines longer than X is binary," and that may interfere with one's ability to produce visual diffs but should not inherently keep fossil from producing small deltas for each version.
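To see whether a given FODT file is likely to trip that heuristic, one can measure its longest line. The sketch below is only an illustration: `longest_line` and `exceeds_limit` are hypothetical helper names, and the limit has to be supplied by the caller because the actual threshold (the "X" above) is not stated here.

```python
def longest_line(data: bytes) -> int:
    """Return the length in bytes of the longest line, excluding line endings."""
    return max((len(line) for line in data.splitlines()), default=0)


def exceeds_limit(data: bytes, limit: int) -> bool:
    """Return True if any line is longer than limit bytes.

    limit is caller-supplied; it is NOT fossil's real threshold.
    """
    return longest_line(data) > limit
```

Reading the file in binary mode and passing its bytes to `longest_line` gives the number to compare against whatever limit fossil actually applies.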

FODT... Fossilizable ODT!

(2) By Warren Young (wyoung) on 2019-12-12 19:02:54 in reply to 1 [link] [source]

> fossil cannot do a great job of producing small deltas because ODTs are binary ZIP files

Yes, in the Fossil docs. Anyone storing binary data in Fossil should read that article.

> ODT has a sub-format, called Flat XML ODF (.fodt)

Yeah, that came up several months ago.

> especially if the word processor keeps the "internal bookkeeping markup" more or less stable

There are probably a few ways to modify my Jupyter script from the first link above to automate one of these document editing platforms to produce the equivalent data set. Instead of changing one pixel to a random color, you'd change a character in the document and save it back out. Everything else in the test would stay the same, so you'd get comparable results.

I'm tempted to do another editing pass on that doc, since the recent discussion about Fossil repo size after fossil init would let me improve my explanation of why my test script discards the first three data points: that's where Fossil is still filling the empty 4k initial pages.

> long lines

Post-process the XML with a pretty printer before checking it in.

One option not listed there is HTML Tidy, which despite the name also understands XML.

The Makefile processing method given at the end of the article linked above would help here.

(3) By Stephan Beal (stephan) on 2019-12-12 19:13:07 in reply to 2 [link] [source]

> Yeah, that came up several months ago.

Thank you for that link - it's buried deep inside the infamous File Locking Meltdown of 2019, a thread i bailed out of early on in its devolution ;).

i will certainly take a look at the pretty printer option. As of yet, i'm still struggling with a justification for migrating this doc from gdocs to FODT: gdocs is 100% convenient to use and allows me to edit directly from my tablet in bed, but it keeps randomly breaking intra-document links, which is infuriating to no end.

(4) By MG (mgr) on 2019-12-12 22:48:25 in reply to 1 [link] [source]

For most of the ZIP style formats like OpenDocument, Office Open XML (Microsoft Office), EPub, ... one could try something along these lines:

Recreate the ZIP archive

- without compression (STORE),
- in a canonical order of the internal files,
- keeping special internal files at the first position ('mimetype' for OpenDocument & EPub, '[Content_Types].xml' for MS),

before checkin.

OpenOffice, LibreOffice, MS Office seem to happily accept these uncompressed files directly (probably recompressing on save).

Thanks to the zipfile extension of sqlite, which is also available in fossil itself, you can even do it without external tools (this relies on the dirty trick of an inner ORDER BY feeding the zipfile aggregate):

select writefile('output.odt',zipfile(name,null,mtime,data,0))
from (
  select * from zipfile('input.odt')
  order by case when name in ('mimetype','[Content_Types].xml') then 0 else 1 end, name
);

or, using ORDER BY more legally(?), with an INSERT into a virtual table:

create virtual table temp.zip using zipfile('output.odt');
insert into temp.zip(name,mtime,data,method)
select name, mtime, data, 0 
from zipfile('input.odt') 
order by case when name in ('mimetype','[Content_Types].xml') then 0 else 1 end, name;

ymmv
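For comparison, the same recipe can be sketched outside of SQL with Python's standard `zipfile` module. This is an illustration under assumptions rather than a tested workflow: the function name `restore_uncompressed` is made up, and only the entry name, timestamp and external attributes are carried over, so other per-entry metadata (comments, extra fields) is dropped.

```python
import zipfile

# Entries that must come first in the archive, as listed above.
SPECIAL_FIRST = ("mimetype", "[Content_Types].xml")


def restore_uncompressed(src, dst) -> None:
    """Copy the ZIP at src to dst with every entry stored uncompressed.

    src and dst may be file names or binary file-like objects.
    Entries are written special-files-first, then sorted by name.
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_STORED) as zout:
        infos = sorted(
            zin.infolist(),
            key=lambda info: (info.filename not in SPECIAL_FIRST, info.filename),
        )
        for info in infos:
            # A fresh ZipInfo keeps name and timestamp but forces STORE.
            out = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            out.compress_type = zipfile.ZIP_STORED
            out.external_attr = info.external_attr
            zout.writestr(out, zin.read(info))
```

Calling `restore_uncompressed("input.odt", "output.odt")` would mirror the file names used in the SQL above.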

(5) By Stephan Beal (stephan) on 2019-12-12 23:50:02 in reply to 4 [link] [source]

i'm wanting (should i migrate, which is as-yet undecided) to avoid any build-process steps like rebuilding a zip file, but that is a terribly interesting feature which is new to me... and i have other uses for it. Thank you!

(6) By refaqtor on 2019-12-14 18:00:28 in reply to 5 [link] [source]

If it helps... I use the .fodt format of open/libre office docs to keep it more Fossil friendly.

(7) By Warren Young (wyoung) on 2019-12-14 18:38:55 in reply to 6 [link] [source]

The major open question we have on that is how OpenDocument Flat XML formats behave in the face of multiple editors, merges, etc.

Are you working solo, or do you have some experience to share on how it works in a collaboration, refaqtor?

Further, if there's a merge conflict, how bad are the consequences? When you cause a merge conflict because two people are editing the same point in a Markdown file, you can disentangle it manually, but if it happens to a Flat XML file, how does OpenOffice/LibreOffice react to the change markers and such that are inserted? How difficult is it to extract yourself from the situation?
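One part of that question can at least be checked mechanically: if a merge leaves textual conflict markers in a Flat XML file, the file will almost certainly stop being well-formed XML, since such markers conventionally start with a run of `<` characters. The sketch below uses a hypothetical helper, `is_well_formed`, and only tests well-formedness with Python's standard parser. It says nothing about how OpenOffice/LibreOffice itself reacts, and the marker text in the sample is illustrative rather than Fossil's exact output.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes: bytes) -> bool:
    """Return True if xml_bytes parses as well-formed XML."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True


clean = b"<doc><p>one</p><p>two</p></doc>"
# Illustrative conflict markers; the exact marker text varies by tool.
conflicted = (
    b"<doc>\n"
    b"<<<<<<< local\n"
    b"<p>one</p>\n"
    b"=======\n"
    b"<p>uno</p>\n"
    b">>>>>>> merged\n"
    b"</doc>"
)
print(is_well_formed(clean), is_well_formed(conflicted))
```

A check like this could run after a merge and before opening the file in the word processor, so that leftover markers are caught in a text editor first.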

(8.1) By refaqtor on 2019-12-15 05:00:47 edited from 8.0 in reply to 7 [link] [source]

ah, yes. I've always used it solo. I expect conflict resolution is sub-optimal. and, forgive me for not having read the whole thread.

(9) By anonymous on 2023-11-23 07:53:15 in reply to 7 [link] [source]

A very small anecdata point: I've just merged a .fodp file after forgetting to commit it in one checkout before starting to edit it in a different checkout.

My actual edits to the content were merged without a problem because I edited different slides, but I also got 7 merge conflicts elsewhere in the file, spanning a few lines of XML each. These arise from the fact that LibreOffice stores information like currently displayed slide, current zoom settings, editor version, printer settings and (?) pre-rendered field values for some of the fields in the master slide (which render differently in different versions of LibreOffice).

So, disaster averted, but this process isn't zero-friction. This trade-off is currently fine for me, but in many other cases it won't be.