Helping friendly robots
(1) By Slatian (slatian) on 2024-08-12 11:30:19 [source]
Hello,
I'm here because I'm trying to build a web search index and found fossil to have … some problems that make well-intentioned bots behave in a not-so-great way (and trip the honeypot).
The relevant fossil instance is sqlite.org (I'm using SQLite, so I thought: why not test it by letting it crawl the SQLite documentation).
I've read in some of the threads here that less well-intentioned bots are a serious concern, and probably for good reason. Well-behaved crawlers seem to be welcome though, even if just for cache warming 😄.
If my crawler isn't welcome, please don't tell me here; it listens to robots.txt, and being opinionated software, there is no override flag.
Enough talking, here are the problems my (well-intentioned) crawler ran into:
- My crawler tripped the honeypot … I like the idea of the honeypot, but I'd also put it in the robots.txt file so well-intentioned crawlers don't have to disable it by local configuration.
- Despite best efforts to prevent this, my crawler found links like https://sqlite.org/src/timeline?from=version-3.45.2&to=version-3.45.3&to2=branch-3.45 or highlighted lines like https://sqlite.org/src/info/30475c820dc5ab8a8?ln=999,1026. These look kind of expensive on the backend and, on the other hand, aren't useful, as they are either subsets or duplicates of information found elsewhere, or information not useful in an index.
- Related: it indexed links like the `/src/annotate`, `/src/blame` and `/src/hexdump` endpoints, which take arguments.
- I manually blocked the `/forum/forum` path as it would have indexed each thread n+x times, where n is the number of replies.
My recommendations as the developer of a friendly crawler:
- Return a 401 or 403 code for captcha pages, so a well-behaved crawler knows that it has gone where it isn't supposed to be.
- Put pages not intended for crawling in /robots.txt, especially the honeypot (yes, I've read the part about bad bots abusing robots.txt to find entry points; take that with a grain of salt). sqlite.org already does that, but only for one section. A sketch of what I mean follows this list.
- Make use of the canonical link tag on pages with highlighted lines and on pages that really are just different views of some other data (i.e. individual messages in a forum thread). A good crawler can use this to find the actual, useful page and will note that the non-canonical page isn't interesting, so it won't crawl it again in the future (see the HTML sketch after this list).
- Make use of the `robots` meta tag with the `noindex` and `nofollow` options, in addition to or as an alternative to robots.txt. This won't prevent a page from being crawled, but a well-behaved crawler will not follow any of the links on the page and will know for future crawls that the page shouldn't be crawled.
- Set and evaluate the ETag header. This not only helps browser caches, but also saves your backend from generating a response when a crawler comes back to check whether content has changed (it'll see a 304 Not Modified and be done; an example exchange is sketched below).
- Set the Crawl-delay in robots.txt (also shown in the sketch below). When unset, my crawler uses a round-robin queue to maximize the delay between requests to any one origin while still going as fast as possible without any concurrency, but if that queue becomes very short it might get fast enough to cause unwelcome load on a fossil instance with its default delay.
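To make the robots.txt points concrete, here is a minimal sketch of what I have in mind (the honeypot path and the delay value are made up for illustration; the other paths are the endpoints mentioned above):

```
User-agent: *
# hypothetical honeypot path, for illustration only
Disallow: /honeypot
# expensive endpoints that aren't useful in an index
Disallow: /src/annotate
Disallow: /src/blame
Disallow: /src/hexdump
Crawl-delay: 10
```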
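And a sketch of the markup-level hints (URLs and values are placeholders):

```html
<!-- on a non-canonical view, e.g. a line-highlighted artifact page -->
<link rel="canonical" href="https://example.org/src/info/30475c820dc5ab8a8">
<!-- on pages that shouldn't be indexed or followed at all -->
<meta name="robots" content="noindex, nofollow">
```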
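The ETag recommendation boils down to a standard conditional request; roughly (header values invented):

```
GET /src/doc/trunk/README.md HTTP/1.1
Host: example.org
If-None-Match: "abc123"

HTTP/1.1 304 Not Modified
ETag: "abc123"
```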
Now that I've written it, most of it reads like generic advice that may or may not have already been considered; if any of this has been actively decided against, I'd be interested in the reasoning.
I'll update this post as I find more. If you are interested in the crawl database, I'd be happy to provide it in full or in part.
(2) By Stephan Beal (stephan) on 2024-08-12 11:50:44 in reply to 1 [link] [source]
The relevant fossil instance is sqlite.org (I'm using SQLite, so I thought: why not test it by letting it crawl the SQLite documentation).
Crawling the static sqlite.org site should pose no particular issues for you vis-a-vis fossil, but its web server (althttpd) does have some small amount of anti-spider protection. Its own robots.txt is very permissive. Crawling the /src or /docsrc sites, though, both of which are fossil instances, will prove difficult for any bot, well-behaved or not.
Note that fossil does not currently support a robots.txt. It relies on the underlying web server, if any, to host that.
Despite best efforts to prevent this, my crawler found links like ...
Unfortunately, the documentation and forum posts have double-digit-years' worth of links like that. The new redirect-to-login behavior, however, is only a few weeks old. Many such links cannot be reformulated in ways which will not trigger the login-redirect and no effort is currently being made (or planned) to weed out and rewrite those links which could be made bot-/guest-friendly.
My recommendations as the developer of a friendly crawler: ...
By and large, we don't want bots crawling fossil repositories.
To help put the "why?" of that into perspective: the canonical sqlite documentation repository is now roughly 46MB, but contains 616MB of individual artifacts, meaning that any robot which starts to crawl it will eventually suck down a bare minimum of 616MB of bandwidth, not counting links which generate dynamic content (namely diffs). For the corresponding source repo, those values are approximately 490MB and 11GB, respectively. For fossil's own repository those values are approximately 147MB and 7.1GB, respectively.
Given the number of bots scouring the network, that's an untenable load for the hoster.
A counter-recommendation: the anti-robot settings in fossil are opt-in at the repository level, meaning that you, the bot hoster, can clone the repository and then crawl that to your heart's content, using up your own bandwidth and CPU time instead of the upstream server's.
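For illustration (the clone URL is the one documented on sqlite.org; the local file name and port are arbitrary), that looks roughly like:

```
# clone once, using your own bandwidth for the initial transfer
fossil clone https://www.sqlite.org/src sqlite-src.fossil
# serve the clone locally and point your crawler at it
fossil server sqlite-src.fossil --port 8080
```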
There are countless bots on the net now (IIRC, Richard recently calculated that some 80%(?) of the sqlite.org traffic is bots), and it's the many ill-behaved ones which drive the need to limit access. (Trivia: that's also the reason we use a custom-written forum software, rather than a mailing list - bot/spam traffic became uncontrollable on the mailing list.)
(3) By Slatian (slatian) on 2024-08-12 16:53:28 in reply to 2 [link] [source]
Crawling the static sqlite.org site should pose no particular issues for you vis-a-vis fossil.
I just need some way to tell the two apart …
Note that fossil does not currently support a robots.txt. It relies on the underlying web server, if any, to host that.
Which makes sense if it's intended to only be part of a website.
Unfortunately, the documentation and forum posts have double-digit-years' worth of links like that.
That was exactly my point; besides, those links could also have been on an external site, like a blog explaining some code. (Apologies if I didn't manage to illustrate which problem led to which recommendation.)
By and large, we don't want bots crawling fossil repositories.
A counter-recommendation: the anti-robot settings in fossil are opt-in at the repository level, …
That's fine with me; if something doesn't want to be crawled, I most probably don't want it in my index anyway (I have limited server resources too).
But I'd prefer it if my crawler were automatically stopped by a `robots` meta tag before it runs into any spam and robot protection, and by an appropriate HTTP status after. The bot not realizing that it ran into such mechanisms can cause quite a bit of traffic too, and it takes the operator some time to notice (in this case I didn't notice at all until testing the search index).
Another thing that would help is some documentation on how to responsibly crawl a fossil instance (I didn't find one, but maybe I looked in the wrong place), even if it is as simple as "don't crawl anything with a query parameter, and avoid path prefixes x, y and z".
Including how to automatically and reliably recognize one (preferably from metadata, as that is easier to evaluate at the crawling stage).
That would help me and others as crawler developers and operators, especially when running into smaller fossil instances, of which there should be quite a few out there.
To help put the "why?" of that into perspective:
That makes sense, thank you!
There are countless bots on the net now …
Wow, that's a lot. Not exactly a fine way of saying »Thank you for making an awesome piece of software«.
(4) By brickviking on 2024-08-15 23:11:34 in reply to 3 [link] [source]
For a completely different take on things, why not create your own copy of the sqlite docs from the sqlite docsrc repository? I've had passable results with that, stuffing the rendered results into a fossil repository of their own that I can then self-host without any attempt to reach out to the canonical site except for docsrc updates. The instructions for rendering the documents can certainly be followed on a low-end Linux machine with the requirements installed. I simply render to a directory under fossil management, then update the local fossil to suit.
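Roughly, the workflow looks like this (local file and directory names are just examples; the exact build steps are described in docsrc's own README):

```
# fetch the documentation source
fossil clone https://sqlite.org/docsrc docsrc.fossil
mkdir docsrc && cd docsrc && fossil open ../docsrc.fossil
# build the documentation per the docsrc README, then commit the
# rendered output into a fossil repository of your own and serve that
```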
There is one disadvantage to this approach: it works for one specific site and one specific set of circumstances. For crawling, say, the TCL site, my suggestion may not work at all.
I hope this provides a slightly different perspective.
Regards, brickviking
(Post 33)
(5) By Slatian (slatian) on 2024-08-16 11:49:54 in reply to 4 [link] [source]
There is one disadvantage to this approach, it works for one specific site and one specific set of circumstances …
And solves the problem for me only.
If I were happy with the "me only" workaround solution, I wouldn't have started this thread.
Also, having looked at the source code: no, I'm not even going to try to implement that myself; there are too many mechanisms going on and the fossil code (and its quality) is no match for my limited C skills.
(6) By Stephan Beal (stephan) on 2024-08-16 13:56:04 in reply to 5 [link] [source]
Also, having looked at the source code: no, I'm not even going to try to implement that myself
No need: Richard added `/robots.txt` capability. Its utility is, however, necessarily limited to fossil instances which run as standalone servers (which is not a common configuration, and is definitely not the case for any of the sqlite.org-related projects). See the notes in that checkin for more details.
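For illustration, such a standalone instance is simply fossil acting as its own web server (repository path and port are placeholders), in which case fossil itself can answer /robots.txt:

```
fossil server /path/to/repo.fossil --port 8080
```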
(8) By Slatian (slatian) on 2024-08-16 14:43:03 in reply to 6 [link] [source]
That's useful information, thank you!
I'll probably combine that with an "if it has a query parameter, don't crawl it" policy for fossil and see how it turns out.
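For illustration, that policy would look roughly like this (a Python sketch, not my actual crawler code; the path prefixes are the ones mentioned earlier in the thread):

```python
from urllib.parse import urlsplit

# path prefixes taken from the endpoints discussed earlier in this thread
SKIP_PREFIXES = ("/src/annotate", "/src/blame", "/src/hexdump", "/forum/forum")

def should_crawl(url: str) -> bool:
    """Skip anything with a query string, plus a few known-expensive paths."""
    parts = urlsplit(url)
    if parts.query:                      # e.g. ?from=...&to=... timeline links
        return False
    return not parts.path.startswith(SKIP_PREFIXES)

# should_crawl("https://sqlite.org/src/timeline?from=a&to=b")   -> False
# should_crawl("https://sqlite.org/src/doc/trunk/README.md")    -> True
```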
@Daniel: 🙃
(7) By Daniel Dumitriu (danield) on 2024-08-16 14:18:14 in reply to 5 [link] [source]
fossil code (and its quality) is no match for my limited C skills.
I thought no Rust programmer fears C... 😁