Let's Review My Server Logs
µstack doesn’t have any analytics features built in. This is a design choice.
My goal is to write good blog posts, and µstack exists for the purpose of making it as easy as possible for me to write good blog posts. As far as I can tell, there’s no meaningful correlation between the quality of work published on a website, and the amount of engagement that website receives. I could track clicks or traffic or RSS feed queries, but that data would only tell me how many people are reading what I write - it wouldn’t tell me if my writing is any good.
Although I don’t collect analytics about this site, I do regularly check my server logs. Usually I’m just checking to make sure nobody’s up to any funny business, but today I decided to pull all my Nginx access logs to see what’s been going on over the last couple weeks. Here’s what I found:
Poorly-behaved RSS feed readers
Over the last 2 weeks, I received a total of 8190 requests to the /rss endpoint. I really wish I had logs going back to the beginning of the year - Rachel By The Bay has a significant effort going on to make popular feed readers less stupid, and I’d be curious to see how much it has affected my site. Regardless, here are some of the dumbest feed readers showing up in my logs:
Feedly made a whopping 2078 requests to my RSS feed in 2 weeks. Every single request was missing the appropriate If-Modified-Since header, so every response contained the entire feed. The user-agent strings suggest that several of you are using Feedly, and the requests came from a few different IP addresses. So, although Feedly made the largest number of wasteful requests, I’m not giving Feedly the grand prize for “dumbest feed reader.”
Instead, that prize goes to a single person using Tiny Tiny RSS. It made 1051 requests in 2 weeks, all from the same IP address, at predictable 20-minute intervals. Not only are they requesting the whole damn feed every 20 minutes, they also request my whole home page AND favicon! Not a single If-Modified-Since header to be found here. Every 20 minutes! WHY? There’s another IP address that’s also using Tiny Tiny RSS, but that IP is only pulling my feed once per day.
Tiny Tiny RSS bills itself as a “self-hosted” feed aggregator, which suggests that a single person is responsible for all these requests. I’ll address you directly: Because of the unconditional favicon requests, you used almost 3 times more data than all the Feedly users combined. Based on your user-agent string, you’re running a version of Tiny Tiny RSS that’s about 3 years old. There’s a good chance your feed reader will behave better if you update it. Also, thanks for reading my work! I appreciate you.
I have two honorable mentions:
- SpaceCowboys for Android. Lots of requests, all of them unconditional. Not as many as Feedly, though.
- News Explorer, whose behavior is totally erratic. Sometimes it doesn’t request anything for 6 hours straight; other times it makes 10+ unconditional requests in a single hour. Every once in a while, it makes 2-3 requests within a few seconds. When it does that, the first request is always unconditional, and then subsequent requests have a valid If-Modified-Since header. All of its requests come from the same IP address, which makes this behavior even more confusing.
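For contrast, being a polite feed reader isn’t hard: remember the Last-Modified value from your previous fetch, send it back as If-Modified-Since, and accept a 304 instead of re-downloading the whole feed. Here’s a rough sketch of that loop in Python (the feed URL is a stand-in, and a real reader would also handle ETag/If-None-Match):

import urllib.error
import urllib.request

FEED_URL = "https://example.com/rss"  # stand-in for whatever feed you're polling

def fetch_feed(last_modified=None):
    """Conditionally fetch the feed: send If-Modified-Since when we have a
    previous Last-Modified value, and treat 304 as 'nothing new'."""
    req = urllib.request.Request(FEED_URL)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(req) as resp:
            # 200: the server sent the full feed; remember its Last-Modified
            return resp.read(), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # 304 Not Modified: a tiny response, no feed body re-downloaded
            return None, last_modified
        raise

body, last_mod = fetch_feed()            # first poll is unconditional
body, last_mod = fetch_feed(last_mod)    # every later poll should be conditional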
Other RSS feed readers
Here are a few of the other RSS feed readers I saw:
Miniflux is very well-behaved. Despite making up only 0.2% of all RSS requests, Miniflux accounts for more than 6% of 304 “Not Modified” responses. Very efficient. Great job, Miniflux!
NextCloud, much like Tiny Tiny RSS, makes silly unconditional requests on a predictable time interval. Every once in a while it gets a 304 response back, but clearly something is wrong with its cache implementation. On the bright side, it isn’t polling my server nearly as often. It’s tolerable.
FreshRSS, Inoreader, and Akregator - same story as NextCloud. Inefficient since they don’t implement caching correctly, but at least they aren’t making requests every 20 minutes.
Newsboat appears to support partial feeds, based on the small response payloads the server is sending back to it. I didn’t notice another reader that supports partial feeds. Good job, Newsboat!
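For what it’s worth, stats like these fall out of a quick pass over the access log. Something along these lines works if your server writes Nginx’s combined log format (the file path below is a placeholder):

import collections
import re

# Per-user-agent tally of full responses (200) vs "Not Modified" (304) on the feed.
LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

hits = collections.defaultdict(collections.Counter)
with open("access.log", encoding="utf-8", errors="replace") as log:  # placeholder path
    for line in log:
        m = LINE.search(line)
        if m and m["path"].startswith("/rss"):
            hits[m["agent"]][m["status"]] += 1

for agent, statuses in sorted(hits.items(), key=lambda kv: -sum(kv[1].values())):
    total = sum(statuses.values())
    print(f'{total:5d} total, {statuses["304"]:4d} were 304s  {agent}')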
Bots, bots, bots
I found at least 50 different bots that identified themselves as such. There are definitely more. Robots.txt is supposed to make bots behave nicely, but if you’ve ever run a web server for more than a few hours, you’ll know that some bots just don’t behave.
I have three different approaches for dealing with bots:
- Allow the bot through, and let it do what it wants.
- Block the bot. Give it the silent treatment. Immediately close the network connection, returning no data whatsoever.
- Screw with the bot by returning bogus data.
As far as I can tell, nearly all bots fit into one of the following categories:
- Search engines.
- Security companies.
- SEO/marketing companies.
- AI training.
Each category gets slightly different treatment.
If a search engine appears to obey Robots.txt, then it’s free to index my site as it pleases (approach 1). If it doesn’t obey Robots.txt, then it gets blocked (approach 2).
All the security companies I noticed seem benign. I’m letting them through for now.
SEO/marketing companies get blocked unless they provide their data freely to the public. I didn’t find a single one that offers free access to their data lake, so all of them got blocked.
Bots used for AI training will get a mixed bag depending on my mood. Some AI training bots will get blocked. Others, however, get the “bogus data” treatment. The bogus data is a few AI-generated paragraphs of meaningless text that get returned to the bot when it tries to access any of my blog posts.
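I won’t publish my actual lists, but written out as code, the decision table would look roughly like this. The bot names below are illustrative examples, not my real lists, and the real enforcement lives in the web server rather than in a Python function:

import re

# Illustrative patterns only - not the actual lists this site uses.
SEARCH_ENGINES = re.compile(r"googlebot|bingbot|duckduckbot", re.I)
SEO_MARKETING = re.compile(r"ahrefsbot|semrushbot|mj12bot", re.I)
AI_TRAINING = re.compile(r"gptbot|ccbot|claudebot", re.I)

def bot_policy(user_agent: str, obeys_robots_txt: bool) -> str:
    """Return 'allow' (approach 1), 'drop' (approach 2: close the connection
    without sending anything), or 'bogus' (approach 3: serve filler text)."""
    if SEARCH_ENGINES.search(user_agent):
        return "allow" if obeys_robots_txt else "drop"
    if SEO_MARKETING.search(user_agent):
        return "drop"    # none of them share their data, so none of them get mine
    if AI_TRAINING.search(user_agent):
        return "bogus"   # or "drop", depending on mood
    return "allow"       # everything else, including the security scanners

print(bot_policy("Mozilla/5.0 (compatible; SemrushBot/7~bl)", obeys_robots_txt=True))  # drop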
Bot User Agents
Many bots, when identifying themselves, also provide a link to a website where you can read about the bot’s purpose. Here’s what that looks like for Bingbot (you’ll have to scroll a bit):
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/100.0.4896.127 Safari/537.36
The majority of bots I saw in my logs provided a link to a web page like this. Some of them just provided a link to the company’s home page. A couple provided an email address instead of a website.
One bot, for some bone-headed reason, actually uses its user-agent string to carry multiple complete English sentences:
Expanse, a Palo Alto Networks company, searches across the global
IPv4 space multiple times per day to identify customers' presences
on the Internet. If you would like to be excluded from our scans,
please send IP addresses/domains to: scaninfo@paloaltonetworks.com
My favorite, however, was SeoCherryBot. Instead of an informational website, contact info, or literally anything useful at all, SeoCherryBot provides a link to the Wikipedia article about web crawlers:
Mozilla/5.0 (compatible; SeoCherryBot/1.0; +https://en.wikipedia.org/wiki/Web_crawler)
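If you’re curious which bots are visiting your own server, that “+URL” convention makes their info links easy to harvest from the logs. A quick sketch (the log path is a placeholder):

import re

# Bots conventionally embed their info link as "+http(s)://..." inside the UA string.
INFO_LINK = re.compile(r'\+(https?://[^\s;)"]+)')

links = set()
with open("access.log", encoding="utf-8", errors="replace") as log:  # placeholder path
    for line in log:
        links.update(INFO_LINK.findall(line))

for link in sorted(links):
    print(link)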
Interesting Bots
Not all bots are bad, and some of them don’t fit into the categories above. My favorite of the bunch is TenMillionDomainsBot, which was created by Tony Wang as a way to see how many supposedly-active domains on the web actually lead to dead sites.
Another crawler I saw in my logs is from Fedistats, which appears to be somebody’s fediverse trend-finding thingy.
Hackers
About 8% of all requests made to my server include .php somewhere in the URL. Another 7% of requests were looking for a file named .env, 1.5% of requests were looking for a .aspx file, and another 0.6% were looking for a path ending in /.git/config. Fail2ban is constantly banning IP addresses that blast my server with hundreds of requests within a few seconds.
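Percentages like these don’t require anything fancy - a quick pass over the access log gets you there. Roughly (assuming Nginx’s combined log format; the path is a placeholder):

import re

# Pull the request path out of each combined-format log line and count the
# usual break-in probes. Crude, but plenty for ballpark percentages.
REQUEST = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+)')
probes = {".php": 0, ".env": 0, ".aspx": 0, "/.git/config": 0}
total = 0

with open("access.log", encoding="utf-8", errors="replace") as log:  # placeholder path
    for line in log:
        m = REQUEST.search(line)
        if not m:
            continue
        total += 1
        for needle in probes:
            if needle in m["path"]:
                probes[needle] += 1

for needle, count in probes.items():
    print(f"{needle:15s} {100 * count / max(total, 1):5.1f}%  ({count} requests)")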
The web is working as usual.
Obviously, all of these requests are trying to break into my server. I don’t use PHP, ASP.NET, or .env files. I do use git, but it isn’t accessible through µstack’s HTTP server, and µstack also has safeguards in place to prevent it from returning sensitive data. These are all pretty lazy attacks - shotgun attempts to hack unmaintained legacy systems. Most of these attackers don’t even bother masking their user agent. It’s always “curl” or “python-requests” or something. A surprising number of them have the user agent “Hello world.”
Here’s a weird one:
GET /?%3Cplay%3Ewithme%3C/%3E
If we decode that URL, it’s actually /?<play>withme</>. I’m not sure what the goal is. Maybe it’s trying to crash a server by jamming up its URL parser? Hmm.
Here’s one of my favorites. This seems to show up every few hours:
GET /index.php?lang=../../../../../../../../usr/local/lib/php/pearcmd&+config-create+/&/<?echo(md5(\x22hi\x22));?>+/tmp/index1.php
GET /index.php?lang=../../../../../../../../tmp/index1
I think it’s trying to take advantage of this CVE. It’s trying to write a magic number into a file at /tmp/index1, then read it back. I’d assume that, if it read that magic number back successfully, it’d proceed with pwning you. It makes me wonder what would happen if I intentionally set up that second URL to return the payload they’re expecting to see.
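Just for fun, here’s a toy version of that bait endpoint - a standalone sketch, not something actually wired into µstack or Nginx. The injected PHP would have printed md5("hi"), so the bait simply returns that value for the follow-up request:

import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

# What the injected <?echo(md5("hi"));?> would have printed, had it actually run.
BAIT = hashlib.md5(b"hi").hexdigest().encode()

class PearcmdBait(BaseHTTPRequestHandler):
    def do_GET(self):
        # The follow-up probe asks for /index.php?lang=../../../../../../../../tmp/index1
        if self.path.startswith("/index.php") and "tmp/index1" in self.path:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(BAIT)))
            self.end_headers()
            self.wfile.write(BAIT)
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), PearcmdBait).serve_forever()

If the scanner saw the hash it expected, the next request would presumably be the actual payload, which would at least make for an interesting log entry.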
Real, Actual Human Beings
If all of this seems profoundly boring, well, I agree. To me, the most surprising aspect of my server logs is that there wasn’t much to be surprised about. In fact, the only thing I didn’t expect was the volume of traffic coming from what appears to be real human beings. Maybe they’re just very convincing bots. I didn’t do anything to try to figure out how many people actually read this blog (again, it’s not in the spirit of what I do here), but the answer clearly ranges between “a few people” and “several people.” That’s pretty neat!