pdf ".pdf" rendering searching unversioning downloading

(1) By m___ (mikemike) on 2022-11-14 12:35:51 [link] [source]

There is a setting called binary-glob, as in "recognize as a binary file". If "*.pdf" is considered a binary file, what are the consequences?

Environment: the login: "url/repo" as administrator (#fossil server ./repo)

The example pdf file sits in the root of the fossil server, say "/test.pdf". In settings, i added to the binary-glob list: "*.jpg,*.pdf"

The browser is Firefox, which renders pdf files through its built-in module. The same pdf file renders when served from a regular Apache server's DocumentRoot.

The web server built into fossil does not render the pdf (as it does text/plain and text/html files). Second, the file cannot be downloaded. It is unclear whether the binary-glob setting makes files with the pdf extension "unversioned". By the way, unversioned files do not have a setting like binary-glob. Are binary-glob files unversioned? The example in the documentation, "url/uv", is unclear to me; i am a user, not a programmer. So much for downloading, rendering, and knowing whether pdf files are unversioned. Pdf, a binary format, is not my favorite, but in our work spheres as academics it is an omnipresent nuisance (it requires additional code to be rendered and searched). I am most certainly not a fan.

A separate request: pdf files cannot be searched. But there are many use cases for fossil as a self-contained, attractive alternative for storing documentation (as in the fossil team's case), in many projects in science and elsewhere, and these require that pdf files be rendered, included in content searches, and downloadable as required. This, i think, is a "module" that would make a useful add-on inside the fossil binary. Xapian (not self-contained, much larger binary/library), browser applications, Recoll (desktop search engine) and other search engines handle searching and rendering of pdf files indiscriminately.

None of the above "alternatives" is as self-contained, or stays within user permissions on a 5 EUR a month virtual server where fossil is able to run as a cgi application. It is a need to be considered. It would be a first. (Tunney, redbean.com has a twist on this idea that goes even further, but there is no version control, hence no roll-back.)

In short: pdf binaries as unversioned (optional), included in full text searches, downloadable/rendered.

Thanks for considering the above. As a non-programmer i am impressed by the philosophy and design choices (C, self-contained, sqlite, sober web design, portability and more) of fossil/sqlar/sqlite and the full-text search module. In case i am completely mistaken, and all the above is indeed possible with the latest fossil binary to date, then so much the better.

(2) By Stephan Beal (stephan) on 2022-11-14 14:51:51 in reply to 1 [link] [source]

As in "recognize as a binary file". Being considered a binary file "*.pdf" what are the consequences?

Internally, binary and text files are equivalent. Certain operations cannot be performed on binary files, however, most notably merging and diff rendering. There are no file-type-agnostic merge and diff algorithms which work for arbitrary binary files. Fossil will also, with a few exceptions, not attempt to render the content of binary files in the web UI.
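To make the glob-list idea concrete, here is a minimal bash sketch of how a comma-separated pattern list such as "*.jpg,*.pdf" classifies file names. It is only an illustration of glob matching, not fossil's actual implementation, and the function name matches_glob_list is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of fossil): succeed if NAME matches any
# pattern in a comma- or space-separated glob LIST.
matches_glob_list() {
  local list=$1 name=$2 pat
  local -a pats
  # Split the list on commas/spaces; read -a does no pathname expansion.
  IFS=', ' read -ra pats <<< "$list"
  for pat in "${pats[@]}"; do
    # An unquoted pattern in `case` is matched as a shell glob.
    case "$name" in
      $pat) return 0 ;;
    esac
  done
  return 1
}

# Example: classify a file name against the list from the first post.
if matches_glob_list "*.jpg,*.pdf" "test.pdf"; then
  echo "test.pdf would be treated as binary"
fi
```

On a real repository the list itself is stored with the settings command, for example: fossil settings binary-glob '*.jpg,*.pdf'. Matching a pattern only marks a file as binary; the file remains an ordinary versioned file.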

A request apart, pdf files cannot be searched, but since there are many use cases for fossil as a self-contained attractive alternative to store documentation (as in the fossil team case), and this in many projects in science and others, this requires handling pdf files as to be rendered, included in searches for content and downloadable as required.

Such features would require fossil to have explicit PDF support, which is never going to happen.

Thanks for considering above, i am impressed as a non-programmer by the philosophy and design choices (C, self-contained,...

"Self-contained" is an important feature of fossil for us. Adding dependencies to 3rd-party libraries is something we do only for critical features like data compression and SSL.

(3) By John Rouillard (rouilj) on 2022-11-14 17:04:39 in reply to 2 [link] [source]

TL;DR use the Doc link to view the pdf in browser.

I agree fossil shouldn't interpret and display a PDF.

However, if you put a pdf at the root of a directory served by apache, I can click on the file and it will be opened and displayed in the browser.

The user doesn't want to save/download from fossil, but view the file using the browser's native PDF viewing capability. I think viewing the raw file with the proper mime type will work for him.

Surprisingly, using the fossil file browser to display a .png displays the png on the page via an embedded image tag. Also, there is an "image" link/button displayed on the top menu to display just the image.

Similarly, an HTML file is displayed in an iframe with a menu option to display as text. So expecting something similar for PDF isn't totally out of bounds.

I thought there was a raw view that might work, but not for pdf.

However for the OP, if you click on the Doc link just before the download, you will be able to view the pdf in the browser. Doc isn't the most obvious labeling for this function though. I have a feeling I am abusing it somehow.

If the file you are "doc"ing isn't a type the browser knows how to recognize, you will be prompted where to save it (on chrome at least).
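For reference, the Doc link is a URL of the form /doc/CHECKIN/FILENAME. A small bash sketch that builds such a URL follows; the server address is a made-up example and the helper name doc_url is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the /doc URL for a file in a given check-in.
#   base    - server URL (an assumption; use your own)
#   checkin - a check-in name such as "tip" or "trunk"
#   path    - file path inside the repository
doc_url() {
  local base=${1%/} checkin=$2 path=${3#/}
  printf '%s/doc/%s/%s\n' "$base" "$checkin" "$path"
}

doc_url "http://localhost:8080/" tip /test.pdf
```

To see which mime type the server sends for that URL, something like curl -s -o /dev/null -w '%{content_type}\n' "$(doc_url http://localhost:8080 tip test.pdf)" should do; if it reports application/pdf, a browser with a pdf viewer will normally display the file instead of offering to save it.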

(4.1) By m___ (mikemike) on 2022-11-14 20:30:18 edited from 4.0 in reply to 2 [link] [source]

Thanks for your ready reply.

To put this in smaller bites: pdf, a binary format, is used mostly (almost always) as the equivalent of "html/css", to view primarily text and images in a single interface. I am not a programmer, but many tools (scripts such as [perl? pdf2txt]) extract the text part of the pdf into a text file, which could then be searched by the full text search. As a commenter/poster rightfully remarked, *.jpg files for one are rendered inside the fossil/chat interface, thus internally (this is as far as my testing of other binaries went).

It is all a matter of terminology. Fossil is a text tool: for now it extracts meaning out of the meta data added on top of text files on a file system, and out of the content of the files (fts searches). It allows for content searches of its blob, not a minor thing. It is also a version control tool, a sub-category which fits into both of the former. When one looks into what pdf is, mostly a text format with mostly text content rendered for the human interface, the logical consequence is that the fossil interface should render it, and make pdfs searchable (attending not only to the human but also to the machine on the other side of the equation).

One approach would be to (optionally) store a copy of the pdf as text (which most languages/scripting can do (perl, java)) as a file in an extra table, to render that part in the common interface, and thus make the pdf content fully text searchable like ordinary files (txt/html). The same goes for html, which is not a text-only format either (inline images, in its most desirable earlier concept).

I am not a lover of pdf in any way, but it is a historical, important container of mostly text. All search engines i know of (again, mostly none of them self-contained, which by contrast is the big advantage and clean approach of sqlite/sqlar/fossil) do attend to that part of the equation: rendering (done by the browser anyway) and including the content in indexed searches.

I would like the opinion of the developers on how this is difficult/impossible, or goes against policy or philosophy. We are dealing with a browser interface, not a terminal. If fossil were used terminal-only, it would not make sense either (internally scripting pdf to text and searching (pdfgrep) is no new code). But as a browser interface is offered, one should(?) include a mostly text-centric container of content, as pdf is, in searches (fts) and rendering within the interface, or pass it on to the client browser.

The core as i see it is terminology: fossil is a tool to keep meta data extra/exo to the "file <FILE> output and some other command line tools". File-system data on reads/writes, content(searches) in a web interface utmost accessible, and even allowing another layer of additional meta markup (wiki/fora/chat/tags/tickets...) so the content becomes meaningful and lasting over long time-spans and over a broader user base.

Again, my excuses if this means bollocks; i am not a programmer or IT/network/kernel-developer/C-guru, but the thing has been done before (Xapian, Lucene, Recoll, desktop search engines), only not in a (mostly) self-contained package without splashing over the filesystem and requiring permissions outside the jails offered by virtual servers, or added layers of docker/etc. dependency complications. Why continue giving google an everlasting advantage as the uniting content search/list provider, poisoning the supply line of data with self-interested data of their own for gain? If someone is interested in your data: offer the possibility to "fossil clone" (the instructions) from within the interface, and done. The receiving end now has a readily accessible local blob of data, fully accessible in a streamlined local workflow. No need for any a priori coding.

I can see that the sqlite community has the right mentality to think about what is still possible autonomously and to power the quest for compact and rational workflows. Thanks beforehand for attending to the issue under debate.

signed: a dinosaur academic.

Post-script: ideally the fossil binary should ride inside the "repo" sqlite database itself. How many MB of extra code can it take to extract a pdf internally (an extra included script, added configuration option(s), some script/code to search over the added text version of the pdf and locate it)? All in all this would mean some hard work in fine-tuning for a while, but it would add disproportionate functionality without stepping outside the "organic" scope of such a tool (fossil). Putting meta data on top of the filesystem, which is what version control adds, means nothing if these data are not easily available and further processable.

(5) By John Rouillard (rouilj) on 2022-11-14 21:00:06 in reply to 4.0 [link] [source]

As a commenter/poster rightfully remarked, *.jpg files for one are rendered inside the fossil/chat interface, thus internally (this is as far as other binary testing done by me).

Well sort of. Under the hood, an img tag with a link to the raw file is generated by fossil. That happens to be a structure that HTML and browsers support. As a result you see the .jpg/.png rendered by the browser, not by fossil. If browsers didn't support the img tag you would see nothing.

the logical consequence is that the fossil interface should render it

It looks like a pdf rendered in an iframe should work. But fossil would need to be modified to present a pdf file in an iframe similar to how it handles HTML files. So in theory it could happen. However the difference between theory and practice is often large.

Note the Doc link on the pdf page displays the pdf in the browser. Adding an iframe just makes it a little more obvious. In neither case is fossil doing any interpretation of the pdf. That's all done by the browser not fossil. Exactly as though the pdf was served directly by apache.

Fossil is first and foremost a version control system. It does have other aspects needed to manage a software development effort: tickets, embedded docs, wiki, forum, chat, and search. However, search is limited to text-based (non-binary) items such as source code (in VCS), and the text components of tickets, wiki, and forum.

It allows for content searches of it's blob,

It doesn't try to index/parse any file formats other than text. So metadata is not extracted from .jpg, .mov, .mp3, .m4v, .ogg, MS Word, Libre/Open Office docs, or .pdf.

Under the hood, a pdf is enhanced postscript. However extracting the text from the file and not extracting the pdf directives requires interpreting the file which is very much not in scope for fossil. If the PDF has text rendered as bitmaps, that's a whole other kettle of fish, well, well outside of fossil's purpose.

include a mostly text centric container of content as pdf is, in searches(fts)

Since fossil uses SQLite's full text indexer, somebody (not me or the fossil developers, including Stephan who answered you originally) might be able to make search work by extracting the text from the pdf and then, with an understanding of how FTS5 text search works in fossil, adding those words and associating them with the PDF document. But this would be processing done externally to fossil and then made available for fossil to use.

This would be exactly like how a web crawler runs programs to extract text and then inserts it into a database that links a URL to search terms. Not sure if it could be done but....
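A rough bash sketch of that external approach, for anyone who wants to experiment. It assumes pdftotext (from poppler-utils) and an sqlite3 shell built with FTS5 and the readfile() function. The table name pdftext and the function names are made up, and the index lives in a separate side database: it does not touch fossil's own search tables, so fossil's web search will not see it.

```shell
#!/usr/bin/env bash
# Sketch only: a side database holding extracted PDF text in an FTS5 table.

# Double embedded single quotes so a value is safe inside an SQL string literal.
sql_quote() {
  local q="'"
  printf '%s' "${1//$q/$q$q}"
}

# index_pdf DB PDF: extract the text of PDF and (re)index it under its path.
index_pdf() {
  local db=$1 pdf=$2 txt rc
  txt=$(mktemp) || return 1
  pdftotext "$pdf" "$txt" || { rm -f "$txt"; return 1; }
  sqlite3 "$db" <<SQL
CREATE VIRTUAL TABLE IF NOT EXISTS pdftext USING fts5(path, body);
DELETE FROM pdftext WHERE path = '$(sql_quote "$pdf")';
INSERT INTO pdftext(path, body)
  VALUES ('$(sql_quote "$pdf")', CAST(readfile('$(sql_quote "$txt")') AS TEXT));
SQL
  rc=$?
  rm -f "$txt"
  return $rc
}

# search_pdfs DB TERM: list the indexed PDFs whose text matches TERM.
search_pdfs() {
  local db=$1 term=$2
  sqlite3 "$db" "SELECT path FROM pdftext WHERE pdftext MATCH '$(sql_quote "$term")';"
}
```

Usage would be along the lines of index_pdf pdfindex.db test.pdf followed by search_pdfs pdfindex.db fossil. PDFs whose text is stored as bitmaps would yield nothing to index, as noted above.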

(6) By m___ (mikemike) on 2022-11-15 12:25:16 in reply to 5 [link] [source]

Thanks for your "raw" in browser suggestion (you suggested this twice), this is an acceptable solution in all workflows. The files *.pdf are accessible, "opened" and can be read without further ado.

In all, this leaves the inclusion of versioned/unversioned pdfs contained in the repository in the sql full text search and the editing (optionally converting the edited text back to pdf but this is more complicated and un-necessary a feature, the text version of the pdf being editable transparently would do).

I see that the fossil binary already makes calls to some external libraries (most probably, or are the libraries within static binaries contained within the sqlite blob?) such as grep and dependencies, alias "fossil grep".

What is left is to include the pdfgrep binary, or "pdftotext | grep <string> <the outputfile of pdftotext>" and render this text in the browser. Optionally the output file can be stored and rendered in the repository and linked to the pdf file. These functionalities are already there and within the intended scope (adding meaningful meta data on top of the filesystem file headers) of version control tool-sets.

Again i am not a programmer, but i did this:

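#!/usr/bin/bash
# Arguments: the search string, then the path to the pdf
string=$1
pdf=$2
base=$(basename "$pdf")
txt="${base%.*}.txt"
# Write the extracted text to the current directory under an explicit name
pdftotext "$pdf" "$txt"
grep -- "$string" "$txt" | view -
#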

The above is nothing very different from "fossil grep" and its equivalent in the browser interface.

My suggestion, nothing more, and made in admiration of the work done on fossil, is that full text search is the "human" way to retrieve meaning in many cases, and the systemic, structural meta data of version control aspire to the same. The combination of both within the same interface is the task of "fossil ui/server/fossil.cgi". Content management systems, a simple text editor, scm, vcm, it all amounts to the same: "seamless" interaction within the same interface between man and his external brain, the machine. Alias the vision of Tim Berners-Lee, the power of goooogle, AI.

(7) By Stephan Beal (stephan) on 2022-11-15 12:46:15 in reply to 6 [link] [source]

I see that the fossil binary already makes calls to some external libraries (most probably or are the libraries within static binaries contained within the sqlite blob?) such as grep and dependencies alas "fossil grep".

To clarify...

Fossil does not require any external applications but can make use of a handful for specific optional uses:

- A browser for the UI.
- tcl for the "diff -tk" feature (which has since mostly been obsoleted by "diff -by" because one major OS vendor installs a broken tcl).
- An optional external diff UI.
- ssh connections

What is left is include the pdfgrep binary, or "pdftotext | grep <string> <the outputfile of pdftotext>" and render this text in the browser.

Fossil is designed to run in an empty chroot jail and never (ever) runs external binaries when running in server mode. It only does that from CLI mode.

Content management systems, a simple text editor, scm, vcm, it all amounts to the same: "seamless" interaction within the same interface between man and his external brain, the machine.

Alas, fossil cannot sensibly seamlessly interact with any and all data, and the interactions you describe for PDF are well outside of its scope as (first and foremost) a software development tool. It's gone 15 years without any sort of PDF-specific support and PDF is a rare developer-level documentation format for software.

i'm not denying that there might be 1-3 users who could get some benefit from tighter PDF integration, but i am arguing that that's not in scope for fossil.

Sidebar: rendering of PDFs in an iframe is unportable. Some browsers render them, some don't - they just leave a big empty iframe with no indication of what's supposed to be in it. Rendering of images, on the other hand, is portable and fossil outputs image binary files in a way which their browsers can reliably deal with.

(8.3) By m___ (mikemike) on 2022-11-16 15:39:25 edited from 8.2 in reply to 7 [link] [source]

Thanks for your fast reply, and respecting your concepts and designs,

"Fossil does not require any external applications but can make use of a handful of specific optional uses:"

That is what makes it less fragile, the fossil binary i mean. I think that is very much the sane approach. A mostly static binary as core, ready to be built as different binaries for different architectures.

"Fossil is designed to run in an empty chroot jail and never (ever) runs external binaries when running in server mode. It only does that from CLI mode."

That was the stupid part of my query for added functionality. What i see, in hindsight, as a sane suggestion: include the functionality of the unix binaries pdfgrep and pdftotext inside the core "fossil" binary. Then fts (full text searches) can be done over the text files that were output, and the "fossil grep" command (already present in the fossil binary blob) can directly search the output text files. Maybe add some glue to make it an option, so that whenever a "fossil add" is done, this can be automated.

The fossil binary would not need many modules of the C libraries already sourced in as extra cargo, and fossil would statically compile without bulging noticeably. The text copies of the *.pdf files would all be versioned and available in the web search interface as if by magic, without further ado.

This could be justified since there is no cli full text search command at all, apart from raw SQL through the sqlite3 command. Not very practical. Xapian and friends all include a cli A-N-D a web interface to full-text query their databases, alias repositories in the case of fossil. Fossil lags here, not allowing full text searches in the cli interface while promoting the same functionality in the browser environment. Not a logical difference.

"It's gone 15 years without any sort of PDF-specific support and PDF is a rare developer-level documentation format for software."

I beg to differ. From the fossil documentation online: "For example the administrator of a wiki-only Fossil repo for non-developers could treat the "developer" user category as if it were "author", and a forum only repo..."

Concerning non-developers: intellectual abstraction is not the sole privilege of software developers. The processes of querying large amounts of data for meaning, and being able through meta mark-up to revert errors, branch, and collaborate, fit other domains. There is abstraction in any intellectual effort, ...and sadly pdf (a binary format for text, thus raising the barrier for the machine to comprehend human speech), as caduque as it is, ...is "omnipresent" in the academic world, the world of science. And yes, the needs of developers (of any kind) of abstract concepts and logical output coincide in all sorts of ways. Your general documentation keeps this in mind, as the snippet above shows.

The above (adding functionality to the fossil core binary) removes the need for iframes and pdf portability. My suggestion was imprecise: it is the fossil binary blob that should carry the burden, sparing the user from scripting and manually manipulating the check-out.

Secondly, not making a "fossil query/fts/search" cli command available could be seen as a defect, since the browser interface testifies to the developers' choice that yes, full text searches are an acceptable functionality.

Again i am not a programmer, not a developer, just a consumer of yours and tool conscious in fields of mine. Take the above as a suggestion, no personal bone in this. The concepts of fossil/sqlite/sqlar/adding meta data to 'dumb' file headers are core to many cycles of absolving frustration, and querying efficiently leads to creation, also outside the field of developing software. Take my input for what it is worth to you.

(9) By Stephan Beal (stephan) on 2022-11-16 15:45:21 in reply to 8.0 [link] [source]

The concepts of fossil/sqlite/sqlar/adding meta data to a 'dumb' file headers are core to many cycles of frustration and querying efficiently leads to creation, also outside the field of developing software.

That is a core point in fossil's history and ongoing development: probably 98%+ of fossil's features were added by someone who got tired of tedious cycles and decided to add that capability or automate existing capabilities. The remaining features were introduced because a non-developer convinced at least one developer that it was something worth adding to fossil.

For the case of integrating PDF-based docs better into fossil, it will be a matter of either writing and submitting that code or convincing one of the developers that the added value is worth not only the implementation effort but also the obligatory ongoing maintenance¹. FWIW, i don't believe that feature to fall within fossil's scope as a software development tool², but will admit that it straddles the border of some of the existing features. Regardless of how i feel about it, though, working code written by someone who feels otherwise would be happily considered for inclusion.

¹ Non-trivial code is never 100% maintenance-free, and only people from the Marketing and Sales departments will claim otherwise (while simultaneously trying to sell a support contract).
² While acknowledging that it does have uses outside of software development, but they are incidental/coincidental, not specifically something we aim to support.

(10.3) By m___ (mikemike) on 2022-11-18 13:59:47 edited from 10.2 in reply to 9 [source]

Thanks for answering. I think preliminarily i partly grasped the concept of fossil [that specific intrinsic part] as the following.

The binaries, sqlite3, sqlar, fossil, "translate" the filesystem into a "table system". Keeping data in an sqlite database is the result. Filesystems are not database tables, which allow for adding meta-data, and some other manipulations on top of file headers to their advantage, hence version control being a fitting application. Compactness as an argument, less so, compactness [portability] as not readily available to the unix tools more so. The translation back into the filesystem format for now? has gate-keepers fossil/sqlar/sqlite3. That could be an argument for "security", and it certainly is an argument for discretion (another layer on top of the file system), Apple, Microsoft, the Corporate world as ready consumers is evidenced on the public sqlite pages proper.

For any application that operates on the file system (crawlers, search engines) to work with data inside sqlite databases (querying, searching, rendering), it needs a filter, alias a middle-man. I remember having read somewhere in the documentation that this is another asset of sqlite: search engines, at the time the remark was made, do not access sqlite databases.

Is this so? It is certainly ( mostly [an expiring time line]) an explicit choice.

Writing a filter for a search engine based on the file system (as one example again xapian [C-coded], which can already handle pdf and many other formats) would be quite possible for the developers. It would be a matter of an opposite design choice though. What amazes me is that i know of no efforts to write a filter to sqlite allowing full text search by an existing search engine. This is of course subject to my ignorance, i am not an insider. (There seems to be a cgi Perl module that translates to the xapian code.)

It seems to be a policy concept of the entity @sqlite to hold off such intents, not providing filters to outsiders. The alternative? Putting a search engine inside the static binary would make sense on the proprietary incline. That would satisfy the "security" and proprietary element and still cater to functionality.

My argument stays the same, full text searches of many existing mime-types would be a major asset to the functionality of the fossil combination of being an sqlite database - version control - web server - full text searches for .txt and .html confined apparatus.

Some-one who seems to think in that direction is Justine Tunney, and her redbean project [there are additional facets to her effort]. It's all about sockets.

The above are just haphazard thoughts of a non-insider. I do not want to ruffle any feathers; it is just a way of trying to comprehend what goes on, what keeps stable "open" projects from quickly advancing to become straightforward tools. It is then that the real work can start! I am not a programmer, software is not my domain, so give me some leeway for misinterpretation. Still i see the issue well and open-mindedly handled here by my answering party as a design question rather than a coding problem.

(11) By Stephan Beal (stephan) on 2022-11-18 15:00:21 in reply to 10.3 [link] [source]

The binaries, sqlite3, sqlar, fossil, "translate" the filesystem into a "table system". ... The translation back into the filesystem format for now? has gate-keepers fossil/sqlar/sqlite3. ... Is this so?

Neither the sqlite3 shell nor sqlar are involved in fossil. Fossil embeds the sqlite3 library and can emit sqlar files, but doesn't use any external applications to do so. In fact, fossil embeds its own copy of the sqlite3 shell application, accessible via the "fossil sql" command.

What amazes me is that i know of no efforts to write a filter to sqlite allowing full text search by an existing search engine.

The problem isn't as simple as that. Databases may require custom tokenizers and custom collation sequences specific to individual applications and requiring C code which the search engines don't have. If a search engine were to try to filter those using inappropriate tokenizers and collation sequences, they'd likely make a mess out of it.

The alternative? Putting a search engine inside the static binary would make sense on the proprietary incline.

That's the only approach which reliably and portably works for the reasons mentioned above.

(12) By m___ (mikemike) on 2022-11-25 09:51:13 in reply to 11 [link] [source]

Thanks for your points of view. I keep exploring the fossil cms.

More acutely: can the password of the default (setup) user be changed from the command line?

Does a sync alter configuration settings, including setting the default user?

Are the global configuration settings manually transferable from remote locations on the file system to local and back-forth?

Can sql into the sqlite database of the fossil alter password settings? Or is the web interface the only way to interact with the passwords table?

Where are global settings written in the case of a cgi fossil application?

All issues that are not directly addressed by the documentation, hence my impertinence, ...i did lock myself out.

(13) By mark on 2022-11-25 11:01:48 in reply to 12 [link] [source]

can the password of the default (setup) user be changed from the command line?

fossil user password <user> <passwd>

see fossil help user
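If you are locked out of a cgi-served repository, you are probably not inside a check-out, so the repository file has to be named explicitly. A small bash wrapper as a sketch: the -R option is taken from the "fossil help user" output as i remember it, so verify it on your version. The FOSSIL_BIN variable and the function name are made up and only serve for dry runs.

```shell
#!/usr/bin/env bash
# Sketch: set USER's password in the repository file REPO from the shell.
# FOSSIL_BIN is a hypothetical override, e.g. FOSSIL_BIN=echo for a dry run.
set_fossil_password() {
  local repo=$1 user=$2 pass=$3
  "${FOSSIL_BIN:-fossil}" user password "$user" "$pass" -R "$repo"
}

# Dry run: print the arguments that would be passed to fossil.
FOSSIL_BIN=echo set_fossil_password repo.fossil admin 's3cret'
```

Keep in mind that a password given on the command line ends up in the shell history and is visible in the process list while the command runs.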