In the past, I have encountered this with some websites.
It is relatively rare, for sure, but it is nothing new.
In these rare cases, it is impossible to disable compression from the client side. The Accept-Encoding header is ignored. I frequently use clients that are not popular browsers nor even popular programs such as curl. I suspect that people using popular browsers would never even notice that varying the Accept-Encoding header had no effect, and that is why the CDNs believe that this is OK.
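A quick way to see it for yourself is to vary the header and watch whether the Content-Encoding of the response changes; a small Python sketch (example.com stands in for whatever host you want to test):

```python
import urllib.request

# Placeholder URL; substitute the host you want to test.
url = "https://example.com/robots.txt"

for accept in ("identity", "gzip"):
    req = urllib.request.Request(url, headers={"Accept-Encoding": accept})
    with urllib.request.urlopen(req) as resp:
        encoding = resp.headers.get("Content-Encoding", "identity")
        print(f"Accept-Encoding: {accept} -> Content-Encoding: {encoding}")
```

If both requests come back gzipped, the server is ignoring the header.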
> I suspect that people using popular browsers would never even notice that varying the Accept-Encoding header had no effect, and that is why the CDNs believe that this is OK.
I feel like this is the trend for pretty much everything on the internet these days; the 1% of anything doesn't matter to any service. Not using Google Chrome? "Please use a supported browser", aka Google Chrome. Using a VPN? CDN-level block, or "here's 1000 captchas because Google hates you". You want en-GB? Not from that IP; enjoy your crash course in the local language. You aren't in the USA? No content for you.
Excluding the 1% might make business sense as an abstract optimisation, but when that 1% is made up of different people each day, it eventually affects everyone, and you end up pissing off all your users.
Not only in tech, but also in business practices. I live in Germany and don't speak German. Spotify refuses to send me any communication in English, Amazon Prime refuses to serve me English-language content, and so on. When trying to get my language switched on both, I got the stock response of basically "we know this is a problem, but we're not going to fix it because it affects too few people".
It's far worse now. Back then stuff just rendered a bit weird and looked crappy, but for the most part it would still actually work (outside of corporate intranets, JScript and other M$-only tech wasn't that widespread on the open web). The difference is in the message: "best viewed" vs. "not compatible" vs. "infinite captcha loop with americanisms" vs. "not available in your country".
Now it's really common for stuff to just straight up not work, or for you to be actively blocked and treated as collateral damage in the protection against spam and botnets; the latter is an entirely new phenomenon.
It's a bad design, but historically I saw it done to reduce cardinality: with different clients there were dozens of different Accept-Encoding values floating around, which would make your cache hit rates dismal. Typically I avoided that by having Varnish squash the header to "gzip yes/no" before my CDN got smart enough to store the compressed version and encode however the client requested.
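Roughly the same idea sketched in Python rather than VCL (the real normalization lives in Varnish's vcl_recv; this just shows the squashing):

```python
def normalize_accept_encoding(headers: dict) -> dict:
    """Collapse the many possible Accept-Encoding values ("gzip, deflate, br",
    "gzip;q=1.0, identity;q=0.5", ...) into either "gzip" or nothing at all,
    so a cache keyed on this header only ever sees two variants."""
    if "gzip" in headers.get("Accept-Encoding", ""):
        headers["Accept-Encoding"] = "gzip"
    else:
        headers.pop("Accept-Encoding", None)
    return headers

# Both of these end up as the same cache variant.
print(normalize_accept_encoding({"Accept-Encoding": "gzip, deflate, br"}))
print(normalize_accept_encoding({"Accept-Encoding": "gzip;q=1.0, identity;q=0.5"}))
```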
> Luckily, the Dotbot clearly identifies itself in the User-Agent header, and they have a working support email, so after a five month communication in a ticket I discovered the reason.
I laughed at the "five month", then realized it is actually impressive that OP got any response at all. What a time to be alive.
If by offender you mean the bot, then I'm confused. The bot asked for robots.txt in a plaintext format. The bot was delivered some binary garbage that it couldn't parse. The bot continued to ask for an updated robots.txt, and continued to be told that the binary garbage was the intended content. What more was it supposed to do, exactly? The offending party here is the broken hosting platform.
I was thinking about that, but didn't find an easy way to ban a crawler. Google App Engine has a firewall, but it works based on IP addresses. Banning based on User-Agent would need to be done in the app code, and that essentially means handling the request anyway, even if in a cheaper way. I didn't want to touch the application at all, hoping to resolve this on the crawler side, whom I suspected of being an unintentional "offender".
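Roughly, the cheap in-app check I wanted to avoid would look something like this (a hypothetical WSGI sketch, not my actual code):

```python
BLOCKED_AGENTS = ("dotbot",)  # hypothetical block list

def block_crawlers(app):
    """WSGI middleware that answers blocked crawlers with a 403
    before the real application does any work."""
    def wrapper(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(bot in ua for bot in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return wrapper
```

Even this still means every crawler request reaches the app, which is what I was hoping to avoid.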
Speaking of the five months - that's fine. We were not communicating every day, of course. And it is indeed impressive that I had my case handled at all.
I had known for years that unwanted crawling by various crawlers was happening, and was reminded of it by the metrics from time to time. One day I was in the mood to dig deeper, found two crawlers in the access logs, studied their websites and emailed them.
One didn't respond at all. Moz.com created a ticket, a support engineer replied four days later, and I replied a week after that. We had some back and forth. I supposed they didn't recognize `User-agent: *` and needed `User-agent: Dotbot`. David - the support engineer - expressed several other hypotheses. There was a period of silence, then I raised my issue again; David had it reviewed by some other people at moz.com, and they pointed to the gzipped response.
So if we make an HTTP request, unless we explicitly specify `Accept-Encoding: identity` we'd better be prepared to inspect the Content-Encoding in the response and decompress data if necessary.
But since Google App Engine returns gzipped content even for requests with `Accept-Encoding: identity`, I accepted that the failure is on my side and went on with the config changes. Still, I left a recommendation for moz.com to support gzip on their end.
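The defensive handling on the client side is small; something like this Python sketch (example.com is a placeholder):

```python
import gzip
import urllib.request

def fetch_text(url):
    """Ask for an uncompressed response, but be ready to decompress anyway
    if the server insists on sending gzip."""
    req = urllib.request.Request(url, headers={"Accept-Encoding": "identity"})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    return body.decode("utf-8", errors="replace")

print(fetch_text("https://example.com/robots.txt"))  # placeholder URL
```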
# enwiki:
# Folks get annoyed when VfD discussions end up the number 1 google hit for
# their name. See T6776
Disallow: /wiki/Wikipedia:Articles_for_deletion/
Robots.txt isn't for hiding/suppressing information.
Oftentimes you can have whole URL structures that are redundant with other ones, mainly database-generated pages with all sorts of possible query parameters, often disguised as paths. Robots.txt is extremely useful for letting crawlers make life easier for themselves by limiting themselves to the "real" content, as opposed to the redundant stuff: crawling the 5,000 real pages, not the 500,000 additional URLs that return the same content.
Also for ignoring "interactive" pages like login pages that make zero sense to be crawled.
People "give a crap" about robots.txt because it's useful for that.
In light of the recent surge in scraped content trumping original content on Google search, this is so true... people who scrape your site do not care about your preferences in a txt file.
While robots.txt is not a good security measure, it's a good tool for preventing pollution of search results with pages like wp-login.php. It's more about steering people to the subset of pages that are useful to them, away from the pages that are only an implementation detail.
Also, some poorly rate-limited crawlers actually abide by robots.txt, so it's useful to prevent unnecessary load.
Honestly, that's what you get for using AppEngine, one of the most insane platforms I've ever had the displeasure of using. Hopefully this has changed, but you couldn't even make a raw HTTP request from an AppEngine application. You had to tunnel your HTTP requests through a proprietary "fetch" service. Now that Kubernetes exists, Google should really deprecate AE.
While AppEngine isn't... ideal, it would make more sense to fix the errors and shortcomings of AppEngine. Comparing the general concepts of AppEngine vs. Kubernetes, Kubernetes is a major step backwards. The promise of AppEngine is that I can basically just deploy my code and the system takes care of everything else. With Kubernetes I need to rebuild my containers constantly to pick up security updates, and I have to configure networking, storage and load balancers.
Yeah, AppEngine isn't what it could be, but deprecating it would be a step backwards.
Don't the Google Cloud docs (that you linked to in your post) explicitly say how to handle this issue? They say that you need to add a "Vary: Accept-Encoding" to your response headers if you don't want to cache the gzip content for clients that don't send an "Accept-Encoding: gzip" header. This is actually the exact example that was given in the docs. Maybe I'm misunderstanding something and you already tried this, but if you did it wasn't mentioned in your blog post.
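For illustration, setting that header in the application's own response could be as small as this (a hypothetical WSGI sketch, not the OP's code):

```python
def app(environ, start_response):
    """Mark the response as varying on Accept-Encoding, so a cache never
    hands a gzipped copy to a client that didn't ask for one."""
    body = b"User-agent: *\nDisallow:\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Vary", "Accept-Encoding"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```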
It's been a while since I've done web development, but IIRC some web servers (I seem to recall Apache doing this) do this implicitly, i.e. you don't need to add a Vary header for Accept-Encoding since the web server is smart enough to know that this is what you mean. And sending a "Vary: Accept-Encoding" response header likewise seems silly, since it's extra data in every response that ought to be implied. Nevertheless, I think that by a strict reading of RFC 2616 Google's behavior here is allowed, and that to be pedantic you should send a Vary: Accept-Encoding header in this case.
> If the Accept-Encoding field-value is empty, then only the "identity" encoding is acceptable.
> If an Accept-Encoding field is present in a request, and if the server cannot send a response which is acceptable according to the Accept-Encoding header, then the server SHOULD send an error response with the 406 (Not Acceptable) status code.
I agree with your strict reading of the spec, but I can't relate to your attitude towards that reading, nor to your attitude towards placing the burden on users to work around it. The spec's allowance here is very probably intended to produce graceful successes in likely usage scenarios, when the client might be able to handle a response it didn't request and the server logic is insufficiently robust.
It's (my speculation here) a transfer of the robustness principle to the client, on the basis that clients are fewer and better resourced to be robust than servers in this case. That reasoning, applied to Google as compared to its own customers, is untenable. There would be very little burden placed on Google, or anyone else for that matter, by expecting them to honor the intent of the spec and the intent of requests. Even the most naive solution would somewhat less than double their cache size (which for smaller orgs would be a real burden, but for Google that's laughable) and at worst would degrade to uncached performance for an initial cache miss.

Deferring to Google's documentation, however clear, relieves them of probably a single engineer's sprint time and some budgeting consideration… and costs M/N engineering hours of frustration and contribution to burnout while people think they're convenienced by offloading work to Google as a reliable vendor. Then, after shaving so many yaks, if they have the temerity and energy to post about their frustrating experience, they're criticized on HN for not reading the documentation which they referenced after struggling to find it.
Google is, from my understanding of the post, fully standards-compliant, and your reading is correct. But Technically Correct isn't the best kind of correct when a giant megacorporation gets megabucks to provide a service that behaves in a way even the spec says is not ideal, and their "solution" is that every single one of their customers must discover this fact individually and cater to them.
It's hard to read that second quote as allowing the server to send a response encoded with anything other than what the client specifically accepts. I see the SHOULD as merely allowing the server to send an empty response or some other error (or even just ignoring it, I suppose), but no way can I read it as allowing the server to send a response in some other encoding.
To your point about savings, it seems like it would be most cost-effective, over time, to simply send an empty 406 response, saving the machine power needed to read and return the cached copy and the network traffic of sending a file that can't be used anyway.
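A sketch of what that could look like (hypothetical WSGI code, assuming the cache only holds a gzipped copy):

```python
def serve_cached(environ, start_response, gzipped_body):
    """If the client won't accept gzip and we only have a gzipped copy,
    answer with an empty 406 instead of bytes it can't use."""
    accept = environ.get("HTTP_ACCEPT_ENCODING", "")
    if "gzip" not in accept:
        start_response("406 Not Acceptable", [("Content-Length", "0")])
        return [b""]
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Encoding", "gzip"),
        ("Vary", "Accept-Encoding"),
        ("Content-Length", str(len(gzipped_body))),
    ])
    return [gzipped_body]
```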
Some services go a step further and specifically ignore an 'accept-encoding: identity' header and return gzip instead, because they feel that asking for uncompressed content is broken.
As I'm reading it, GCP is ignoring accept-encoding as a caching variable. Certain services (stackexchange API calls for one, but I'm pretty sure it's come up elsewhere) just ignore it entirely from the application.
I thought "A robots.txt Problem" was going to showcase my hack-the-planet idea of destroying someone's online store by _only_ finding a way to edit their robots.txt file to say:
User-agent: *
Disallow: /
And then no one can find them via search and they go broke.
robots.txt prevents crawling, but not indexing and appearing in search results. Previously discussed, e.g. [0]. But of course shop updates won't be picked up easily anymore.
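To actually keep a page out of results it has to be crawlable and carry a noindex signal, for instance via the X-Robots-Tag response header (a hypothetical sketch):

```python
def app(environ, start_response):
    """Let search engines fetch the page but ask them not to index it.
    A robots.txt Disallow can't do this: a disallowed page is never fetched,
    so a noindex signal on it would never be seen."""
    body = b"<html><body>Not meant for search results</body></html>"
    start_response("200 OK", [
        ("Content-Type", "text/html"),
        ("X-Robots-Tag", "noindex"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```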
When would you ever want to serve gzipped content to a client who cannot accept it?
From what OP linked [0], it clearly is a problem Google is aware of and is trying to communicate a workaround for.
I suppose the answer is that it is hard to change a default behavior once it has been released, even if it is wrong.
[0] https://cloud.google.com/appengine/docs/legacy/standard/java...