Can I save fossil forum posts to a text file?

(1) By anonymous on 2025-03-23 02:39:10 [link] [source]

I was trying to scrape my personal fossil forum site to back up posts offline, but I ran into honeypot issues when I used selenium and beautiful soup to try to scrape my own site. Is there a way to a) backup forum posts to an offline file or b) turn off the honeypot features so I can scrape it using my own python script?

Thanks

(2) By Stephan Beal (stephan) on 2025-03-23 04:50:05 in reply to 1 [link] [source]

turn off the honeypot features so I can scrape it using my own python script?

If you will host the repo from a local clone using fossil server -localauth it will treat you as logged in and you shouldn't trigger the honeypot.

(3) By anonymous on 2025-03-23 05:30:29 in reply to 2 [link] [source]

Right, I figured out how to log in using the scraper script, but I still get a long list of lines like:

/timeline?ss=v&y=f&vfx&u=anonymous /forumpost/dfde8efafcb24051?raw /timeline?c=dfde8efafcb24051&y=a /info/dfde8efafcb24051 /timeline?c=d2b8f231441dfa91&y=a /info/d2b8f231441dfa91 /timeline?c=c91848a02c9ca030&y=a /info/c91848a02c9ca030 /timeline?c=ccaa2a651bc2a7d8&y=a /info/ccaa2a651bc2a7d8 /timeline?c=19e4056410ebc195&y=a /info/19e4056410ebc195 /timeline?c=c0e96665fa9c54cc&y=a /info/c0e96665fa9c54cc /timeline?c=6cdb286a41ad7f73&y=a /info/6cdb286a41ad7f73

and further down

/timeline?u=anonymous&c=2025-01-12+21%3A29%3A24&y=a /timeline?c=cbe54753187b19a4&y=a /info/cbe54753187b19a4 /timeline?u=anonymous&c=2025-01-12+02%3A02%3A08&y=a /timeline?c=cfe394c1fcfebaaa&y=a /info/cfe394c1fcfebaaa

Which I think means I am still triggering the security measures even though the selenium library is using a chrome browser instance to log in? I think it still ended up grabbing all the posts, but with thousands of those extra "timeline" lines included. Just wondering is there still something, like timing between requests, that I should change?

(5) By Stephan Beal (stephan) on 2025-03-23 11:11:11 in reply to 3 [link] [source]

Which I think means I am still triggering the security measures even though the selenium library is using a chrome browser instance to log in?

Those links are perfectly valid. Having a bot crawl a fossil repo will cause it to crawl countless links which a human never would, and many pages generate near-endless new combinations of links. For example, links from every forum post lead to /timeline?u=the-user, like you are seeing. Those lead to /info pages, which lead to diffs and more timeline links, ad infinitum.

From even a small repository you can crawl gigs of garbage, wasting tons of the bot's storage and the server's CPU. That's why fossil has to have bot-defense measures.

(7) By Mike Swanson (chungy) on 2025-03-23 11:55:37 in reply to 5 [link] [source]

From even a small repository you can crawl gigs of garbage, wasting tons of the bot's storage and the server's CPU. That's why fossil has to have bot-defense measures.

That got me curious to put the "grab-site" crawler against a small local repository. I discovered the this day in history feature from the URLs it was trawling, I believe a naïve crawler will never finish due to this reason. I can put that page on an ignore list, but I don't care to see what the next infinite loop is. :)

Perhaps more in-line with the present topic, it should be possible to construct SQL queries and run them through Python, or whatever your choice of programming language is, to reconstruct the forum posts from the local repository and save them into plain text files.

If the purpose is a mere backup though, a local clone is not only suffice, but by running fossil ui, you can replicate the exact web interface used to browse the forum.

(8) By Stephan Beal (stephan) on 2025-03-23 12:23:18 in reply to 7 [link] [source]

Perhaps more in-line with the present topic, it should be possible to construct SQL queries and run them through Python, or whatever your choice of programming language is, to reconstruct the forum posts from the local repository and save them into plain text files.

Kinda. You can get the full text of an artifact using the content() SQL function, e.g.:

$ f sqlite "select cast(content('rid:1') as text)" 
'C initial\sempty\sbaseline
D 2007-07-21T14:09:59
P
R d41d8cd98f00b204e9800998ecf8427e
U drh
Z c62e3619f7c41ee77b33f00df8ec26f4
'

but there is no(? AFAIK?) SQL which will (A) extract only the text of a forum post or wiki page and (B) resolve any markup formatting for it.

That said: forum posts derive from wiki pages, and it would not be difficult to add a CLI command to export forum posts, just like the wiki command does for wiki pages. That would need to be implemented by someone who's inspired to do so, though (not me) ;).

(9) By Richard Hipp (drh) on 2025-03-23 22:55:16 in reply to 8 [link] [source]

Perhaps we can add a new SQL function that takes the raw artifact text as input and returns a JSON decoding of the text. Given that, the built-in JSON capabilities of SQLite should be usable to extract whatever information the user needs.

(10) By Stephan Beal (stephan) on 2025-03-23 23:00:20 in reply to 9 [link] [source]

Perhaps we can add a new SQL function that takes the raw artifact text as input and returns a JSON decoding of the text.

We have a manifest-to-json routine somewhere in the json api, and that should be trivial to port out of that code. i'll look into that tomorrow.

(11) By Richard Hipp (drh) on 2025-03-23 23:45:42 in reply to 9 [link] [source]

Here is a crazy idea, that I write here as a record my thoughts, not necessarily as a recommendation that we do it: Suppose we create a new SQL function that takes a structural artifact as inputs and writes out its content into a series of SQL tables in a given schema. Then you would be able to run SQL queries on the result, enabling queries that would be difficult to achieve, and which would likely require supplemental procedural code, if the same data was available only in JSON

For an example of what I have in mind, suppose you wanted to decode the forum post that this post is replying to. You could run:

ATTACH ':memory:' AS decode;
SELECT artifact_decode(content('fc4dd726ec'), 'decode');

This would load up the new in-memory "decode" schema with a series of tables (the exact structure of which is TBD) containing all of the content of that message. You could repeat the artifact_decode() call for other artifacts, which would store more rows in the same set of tables.

(12.1) By Stephan Beal (stephan) on 2025-03-24 17:15:06 edited from 12.0 in reply to 9 [link] [source]

Perhaps we can add a new SQL function that takes the raw artifact text as input and returns a JSON decoding of the text.

src:/timeline?r=artifact-to-json

$ ./fossil sql "select json_pretty(artifact_to_json('tip')->'F')" | head
'[
    {
        "name": ".editorconfig",
        "uuid": "132c5a213aa3ce13dcc9c19f8a7ea306e3640bec4ae693378116cee339c34a1a",
        "perm": null,
        "oldName": null
    },
    {
        "name": ".fossil-settings/binary-glob",
        "uuid": "679cf0b793f345a23e6dab55174511758d5fa28dd7724f61bab41d2b75c288c2",
...

$ ./fossil test-artifact-to-json 78f0813458dc -pretty 
{
    "rid": 63175,
    "uuid": "78f0813458dcbfeb873c3c4462b8bd0c878e759844f30e30fcd94545e2ccf50a",
    "type": "forumpost",
    "D": 2460742.331518,
    "G": "de425201e8abc831e452de33a9ee293ab21ebe1c34537e007dd90cf6995636b5",
    "H": "Forum Title > & ' \" < Test",
    "N": "text/x-markdown",
    "U": "drh",
    "W": "This forum post exists for the purpose of verifying that\nampersands in the title of a forum post are rendered correctly.\nThe title should read:\n\n> ~~~\nForum Title > & ' \" < Test\n~~~\n\nSee also:\n\n\n  *  [](/doc/trunk/www/title-test.md)\n  *  [](/doc/trunk/www/title-test.wiki)\n  *  [](/wiki?name=Test+Wiki+>+%26+%22+%27+%3c+Title&p)"
}

Sidebar to clarify why we have entries with null values, where we could ostensibly simply elide those: all fields of multi-token card types (like F-cards, seen above) are made explicit so that there's no risk of users of downstream higher-level languages inheriting like-name properties from their language's base-most "Object" class. If, for example, T-cards (tags) elided "value" if it were null (which most are), it would be possible to accidentally use an inherited property (if any) named "value" in some programming languages.

(16) By MG (mgr) on 2025-03-26 12:13:34 in reply to 12.1 [source]

Thank you Stephan

This is a very usefull tool you added fossils toolbox. I noticed the following things - are there specific reasons behind these?

D cards as well as the time part of E cards are represented as julian dates instead of the ISO 8601-ish representation in the manifest itself
on technotes, wikis and forum posts, the P card is missing in the json (see eg. 45c52e84 or ab6acbd9 on fossil proper or ea6017bcab67f8de on this forum). These would be quite to track the ancestors ...

(17) By Stephan Beal (stephan) on 2025-03-26 12:22:48 in reply to 16 [link] [source]

D cards as well as the time part of E cards are represented as julian dates instead of the ISO 8601-ish representation in the manifest itself

That was per Richard's request, his justification being that the JSON form should be as close to the canonical form as it can be without going overboard. An example of where we went the other direction: T-cards and J-cards are slightly reformatted so that clients don't have to parse their leading prefix char (which is optional on J-card).

That is:

T *foo bar baz

Becomes {"type":"*", "name":"foo", "uuid":"bar", "value":"baz"}. And...

J +foo bar

Becomes {"append": true, "name": "foo", "value": "bar"} (noting that the "+" prefix is actually very rare on J-cards)

on technotes, wikis and forum posts, the P card is missing in the json

That's a bug and will be fixed momentarily.

(18) By Stephan Beal (stephan) on 2025-03-26 12:34:52 in reply to 16 [link] [source]

see eg. 45c52e84 ...

Aha...

That one is actually supposed to be a technote but is victim to an historical bug:

P 1517330221f2859af4770ae42be6f7a502b4acdebf925284793b8ae1c9e39b32
N text/x-markdown

The P and N cards are out of order. There's a story behind that, told in and around src:/info/5e67a7f4041a36ad.

In any case, the missing P-cards and sometimes-missing W-card are now fixed in trunk. Thank you for the report!

(19) By MG (mgr) on 2025-03-26 12:55:45 in reply to 18 [link] [source]

Interesting. Nice catch, I did not even notice the missing W, was just focussed on the Ps.

Thank you

(4) By spindrift on 2025-03-23 07:26:39 in reply to 2 [link] [source]

If you have a local clone, the posts are backed up anyway, of course.

(6) By Jörgen Kosche (jkosche) on 2025-03-23 11:50:53 in reply to 2 [link] [source]

Would be nice if viewing forum posts was possible via CLI. That would make the web scraping obsolete. Still need a local clone.

(13) By brickviking on 2025-03-25 07:54:56 in reply to 6 [link] [source]

I came up with this quick snippet that lists the last hundred messages (tunable, of course), filters the card names and filters the Z card out. Hope it's useful as a model to see how to do it from bash on a Linux/WSL2/FreeBSD machine. You'll have to massage this, as it had massaging for presentation's sake:

for t in $(fossil timeline -F "%H"  -n 100 | head -n -1 | tac)  # This bit spits out forum artifacts
 do
  fossil artifact $t   # for each forum artifact \
 | egrep -v -- "^G |^N |^W |^Z "   # Remove these cards from the output \
 | sed -e 's/^H /Topic: /' -e 's/^D /Date: /' -e 's/^U /User: /' -e 's#\\s# #g'  # Rename card names with their equivalent words and swap \s for spaces
; echo -ne "==== done ====\n\n\n";   # Inter-message spacer (not mandatory)
done | less -j1 # Feed the output into less, or you'll have headaches

Cheers, brickviking (Post 64)

(14) By Jörgen Kosche (jkosche) on 2025-03-25 12:50:19 in reply to 13 [link] [source]

Oh wow, this is cool. A bit involved, but it shows a way to get the data without scraping the web-interface (which is always a bit icky).

(15.1) By Stephan Beal (stephan) on 2025-03-25 12:58:11 edited from 15.0 in reply to 13 [link] [source]

I came up with this quick snippet...

As of yesterday, you can use something like...

.mode box -noquote
WITH artifact(j) AS (
  SELECT artifact_to_json(objid) j FROM
    (SELECT objid FROM event WHERE type='f' ORDER BY mtime DESC limit 10)
)
SELECT
datetime(j->>'D'), j->>'H'
FROM artifact;

Then feed that to fossil with something like:

$ fossil sql ".read that-file.sql"

To get something like:

┌─────────────────────┬──────────────────────────────────────────────────────────────┐
│  datetime(j->>'D')  │                           j->>'H'                            │
├─────────────────────┼──────────────────────────────────────────────────────────────┤
│ 2025-03-25 12:29:51 │ Missing help for settings "timeline-truncate-at-blank" and " │
│                     │ timeline-tslink-info"?                                       │
├─────────────────────┼──────────────────────────────────────────────────────────────┤
│ 2025-03-25 07:54:56 │                                                              │
├─────────────────────┼──────────────────────────────────────────────────────────────┤
│ 2025-03-24 23:34:33 │                                                              │
├─────────────────────┼──────────────────────────────────────────────────────────────┤
│ 2025-03-24 23:34:14 │                                                              │
├─────────────────────┼──────────────────────────────────────────────────────────────┤
│ 2025-03-24 23:16:53 │                                                              │
├─────────────────────┼──────────────────────────────────────────────────────────────┤
│ 2025-03-24 19:41:30 │                                                              │
├─────────────────────┼──────────────────────────────────────────────────────────────┤
│ 2025-03-24 19:26:33 │                                                              │
├─────────────────────┼──────────────────────────────────────────────────────────────┤
│ 2025-03-24 19:21:46 │                                                              │
├─────────────────────┼──────────────────────────────────────────────────────────────┤
│ 2025-03-24 19:09:29 │ Fossil-Pages                                                 │
├─────────────────────┼──────────────────────────────────────────────────────────────┤
│ 2025-03-24 19:03:45 │                                                              │
└─────────────────────┴──────────────────────────────────────────────────────────────┘

The blank subject lines are responses, not not-level posts, so they have no "H" card. That can certainly be solved via a Common Table Expression, an exercise left to the reader! (With this, we're probably just a single CTE away from having a usable, if basic, text-mode forum browser.)

(Edited to remove an extraneous CTE field.)