A robots.txt Problem (avodonosov.blogspot.com)
158 points by avodonosov on Aug 15, 2022 | 50 comments


How can "Vary: Accept-Encoding" not be the default?

When would you ever want to serve gzipped content to a client who cannot accept it?

From what OP linked [0], it clearly is a problem Google is aware of and is trying to communicate a workaround for.

I suppose the answer is that it is hard to change a default behavior once it has been released, even if it is wrong.
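
For reference, the intended pattern is for a compressed response to carry both headers, so any cache along the path keeps separate entries per encoding. Roughly (illustrative headers only):

  HTTP/1.1 200 OK
  Content-Type: text/plain
  Content-Encoding: gzip
  Vary: Accept-Encoding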

[0] https://cloud.google.com/appengine/docs/legacy/standard/java...


In the past, I have encountered this with some websites.

It is relatively rare, for sure, but it is nothing new.

In these rare cases, it is impossible to disable compression from the client side. The Accept-Encoding header is ignored. I frequently use clients that are neither popular browsers nor popular programs such as curl. I suspect that people using popular browsers would never even notice that varying the Accept-Encoding header had no effect, and that is why the CDNs believe that this is OK.


> I suspect that people using popular browsers would never even notice that varying the Accept-Encoding header had no effect, and that is why the CDNs believe that this is OK.

I feel like this is a trend for pretty much everything on the internet these days: the 1% of anything doesn't matter to any service. Not using Google Chrome? "Please use a supported browser" - aka Google Chrome. Using a VPN? CDN-level block, or "here's 1000 captchas because Google hates you". You want en-GB? Not from that IP; enjoy your crash course in the native language. You aren't in the USA? No content for you.

Excluding the 1% might make business sense as an optimisation in the abstract, but when that 1% is made up of different people each day, it eventually affects everyone, and you end up pissing off all your users.


Not only in tech, but also in business practices. I live in Germany and don't speak German. Spotify refuses to send me any communication in English. Amazon Prime refuses to serve me English-language content, and so on. When trying to get my language switched on both, I got the stock response of basically "we know this is a problem, but we're not going to fix it because it affects too few people".


It's just a comeback of "Best Viewed With Internet Explorer"


It's far worse. Back then stuff just rendered a bit weird and looked crappy, but for the most part it would still actually work (outside of corporate intranets, JScript and other M$-only tech was not as widespread on the open web). The difference is in the message: "best viewed" vs "not compatible" vs "infinite captcha loop with americanisms" vs "not available in your country".

Now it's really common for stuff just to straight up not work or you are actively blocked and considered collateral in protection against spam and botnets, the latter is an entirely new phenomenon.


It’s a bad design, but historically I saw it done to reduce cardinality - with different clients there were dozens of different Accept-Encoding values around, which would make your hit rates dismal. Typically I avoided that by having Varnish squash the header to "gzip yes/no", before my CDN got smart enough to store content compressed and encode it however the client requested.
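
For the curious, the classic Varnish normalization looks roughly like this (sketched from memory, adjust to taste):

  sub vcl_recv {
    # Squash the many Accept-Encoding variants down to "gzip" or nothing,
    # so the cache holds at most two copies of each object.
    if (req.http.Accept-Encoding) {
      if (req.http.Accept-Encoding ~ "gzip") {
        set req.http.Accept-Encoding = "gzip";
      } else {
        # Unknown or identity-only encodings: serve uncompressed.
        unset req.http.Accept-Encoding;
      }
    }
  }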


> Luckily, the Dotbot clearly identifies itself in the User-Agent header, and they have a working support email, so after a five month communication in a ticket I discovered the reason.

I laughed at the "five month", then realized it is actually impressive that OP got any response at all. What a time to be alive.

Also, why not just ban the offender?


If by offender you mean the bot, then I'm confused. The bot asked for robots.txt in a plaintext format. The bot was delivered some binary garbage that it couldn't parse. The bot continued to ask for an updated robots.txt, and continued to be told that the binary garbage was the intended content. What more was it supposed to do, exactly? The offending party here is the broken hosting platform.


I don't think banning the offender is about punishing them for doing something wrong so much as manually ensuring it's not crawling your site.


> Also, why not just ban the offender?

I was thinking about that, but didn't find an easy way to ban a crawler. Google App Engine has a firewall, but it works based on IP addresses. Banning based on User-Agent would need to be done in the app code, and that essentially means handling the request, even if in a cheaper way. I didn't want to touch the application at all, hoping to resolve this on the crawler side, which I suspected of being an unintentional "offender".
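
For example, something like a servlet filter would be needed (a hypothetical sketch, not something I deployed; the class name and User-Agent match are made up for illustration):

  import java.io.IOException;
  import javax.servlet.*;
  import javax.servlet.http.*;

  // Hypothetical sketch: reject a crawler by User-Agent. Note that the
  // request still reaches the app and gets handled, just more cheaply.
  public class CrawlerBanFilter implements Filter {
      @Override
      public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
              throws IOException, ServletException {
          HttpServletRequest httpReq = (HttpServletRequest) req;
          HttpServletResponse httpRes = (HttpServletResponse) res;
          String ua = httpReq.getHeader("User-Agent");
          if (ua != null && ua.contains("Dotbot")) {
              httpRes.sendError(HttpServletResponse.SC_FORBIDDEN);
              return;
          }
          chain.doFilter(req, res);
      }

      @Override public void init(FilterConfig cfg) {}
      @Override public void destroy() {}
  }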

Speaking about the five months - that's fine. We were not communicating every day, of course. And indeed it's impressive that I had my case handled at all.

I had known for years that unwanted crawling by various bots happens, and was reminded of that by the metrics from time to time. One day I was in the mood to dig deeper, found two crawlers in the access logs, studied their websites and emailed them.

One didn't respond at all. Moz.com created a ticket; four days later a support engineer replied, and a week later I replied. We had some back and forth. I supposed they didn't recognize `User-agent: *` and needed `User-agent: Dotbot`. David - the support engineer - offered several other hypotheses. There was a period of silence, then I raised my issue again; David had it reviewed by some other people at moz.com and they pointed to the gzipped response.

BTW, what I learned is that "If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding." (https://www.rfc-editor.org/rfc/rfc9110.html#name-accept-enco...).

So if we make an HTTP request, unless we explicitly specify `Accept-Encoding: identity` we'd better be prepared to inspect the Content-Encoding in the response and decompress data if necessary.
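
A minimal sketch of that in Java (hypothetical URL, error handling omitted):

  import java.io.InputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;
  import java.util.zip.GZIPInputStream;

  public class RobotsFetch {
      public static void main(String[] args) throws Exception {
          // This client doesn't request a content coding, so per RFC 9110
          // the server may pick one on its own.
          URL url = new URL("https://example.com/robots.txt");
          HttpURLConnection conn = (HttpURLConnection) url.openConnection();
          InputStream body = conn.getInputStream();
          // Be prepared for a gzipped body even though we never asked for one.
          if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
              body = new GZIPInputStream(body);
          }
          System.out.println(new String(body.readAllBytes(), "UTF-8"));
          body.close();
      }
  }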

But since Google App Engine returns gzipped content even for requests with `Accept-Encoding: identity`, I accepted that the failure is on my side and went on with the config changes. Still, I left a recommendation for moz.com to support gzip on their end.


> Also, why not just ban the offender?

Surely the offender here is Google AppEngine.


Because what if other people use the same software?


This reads as more of a Google App Engine problem than a robots.txt problem.


I'm surprised anyone actually gives a crap about robots.txt.

Don't put it online if you don't want it crawled. Welcome to the internet.


HN does. https://news.ycombinator.com/robots.txt

So does Wikipedia. https://en.wikipedia.org/robots.txt . That points out:

  # enwiki:
  # Folks get annoyed when VfD discussions end up the number 1 google hit for
  # their name. See T6776
  Disallow: /wiki/Wikipedia:Articles_for_deletion/
As does GitHub - https://github.com/robots.txt

Even the Internet Archive, which doesn't honor directives in the robots.txt files, has one - http://archive.org/robots.txt .

Welcome to the internet.


Having one doesn't imply you care about it.

Also, cargo culting has never been a good reason to do anything.


What makes you think Wikipedia doesn't care about robots.txt?

Also, sloths can hold their breath underwater for up to 40 minutes.


Nothing, I don't make assumptions. That's something you do.


You assumed people don't give a crap about robots.txt, otherwise you wouldn't have been surprised.


You don't need to be snarky.

Robots.txt isn't for hiding/suppressing information.

Oftentimes you have whole URL structures that are redundant with other ones, mainly database-generated pages with all sorts of possible query parameters, often disguised as paths. Robots.txt is extremely useful in letting crawlers make life easier for themselves by limiting themselves to the "real" content, as opposed to the redundant stuff: crawling the 5,000 real pages, not the 500,000 additional URLs that return the same content.

Also for ignoring "interactive" pages like login pages that make zero sense to be crawled.
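
A made-up example of what that looks like in practice (paths are illustrative; wildcard support varies by crawler):

  User-agent: *
  Disallow: /login
  Disallow: /search
  Disallow: /*?sort=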

People "give a crap" about robots.txt because it's useful for that.


In light of the recent surge in scraped content trumping original content on Google search, this is so true... people who scrape your site do not care about your preferences in a txt file.


Yeah, I feel like it only discourages legitimate bots, but for malicious ones it's a big red sign saying "SENSITIVE CONTENT HERE, SCRAPE IT"


While robots.txt is not a good security measure, it's a good tool for preventing pollution of search results with pages like wp_login.php. It's more about steering users to the subset of pages that are useful to them, away from the pages which are only an implementation detail.

Also, some poorly rate-limited crawlers actually abide by robots.txt, so it's useful to prevent unnecessary load.


Honestly, that's what you get for using AppEngine, one of the most insane platforms I've ever had the displeasure of using. Hopefully this has changed, but you couldn't even make a raw HTTP request from an AppEngine application. You had to tunnel your HTTP requests through a proprietary "fetch" service. Now that Kubernetes exists, Google should really deprecate AE.


While AppEngine isn't... ideal, it would make more sense to fix the errors and shortcomings of AppEngine. Comparing the general concepts of AppEngine vs. Kubernetes, Kubernetes is a major step backwards. The promise of AppEngine is that I can basically just deploy my code, and the system takes care of everything else. With Kubernetes I need to rebuild my containers constantly to pick up security updates, and I have to configure networking, storage and load balancers.

Yeah, AppEngine isn't what it could be, but deprecating it would be a step backwards.


App Engine Flexible [0] (i.e. the new version) is a lot more permissive - it's basically a managed container service.

Also - Google is perfectly good enough at turning down services on their own, we don't need to give them any ideas!

[0] https://cloud.google.com/appengine/docs/flexible


Don't the Google Cloud docs (that you linked to in your post) explicitly say how to handle this issue? They say that you need to add a "Vary: Accept-Encoding" to your response headers if you don't want to cache the gzip content for clients that don't send an "Accept-Encoding: gzip" header. This is actually the exact example that was given in the docs. Maybe I'm misunderstanding something and you already tried this, but if you did it wasn't mentioned in your blog post.

It's been a while since I've done web development but IIRC some web servers (I seem to recall Apache doing this) do this implicitly, i.e. you don't need to add a Vary header for Accept-Encoding since the web server is smart enough to know that this is what you mean. And sending a "Vary: Accept-Encoding" response header likewise seems silly since it's extra data in every request that ought to be implied. Nevertheless I think that by a strict reading of RFC 2616 the behavior of Google here is allowed, and that to be pedantic you should send a Vary: Accept-Encoding header in this case.


> If the Accept-Encoding field-value is empty, then only the "identity" encoding is acceptable.

> If an Accept-Encoding field is present in a request, and if the server cannot send a response which is acceptable according to the Accept-Encoding header, then the server SHOULD send an error response with the 406 (Not Acceptable) status code.

I agree with your strict reading of the spec, but can’t relate to your attitude towards that reading, nor your attitude towards placing the burden on users to work around it. The spec’s allowance here is very probably intended to produce graceful successes in likely usage scenarios when the client might be able to handle a response it didn’t request and when the server logic is insufficiently robust.

It’s (my speculation here) a transfer of the robustness principle to the client on the basis that clients are fewer and better resourced to be robust than servers in this case. That reasoning applied to Google as compared to its own customers is untenable. There would be very little burden placed on Google, or anyone else for that matter, by expecting them to honor the intent of the spec and the intent of requests. Even the most naive solution would somewhat less than double their cache size (which for smaller orgs would be a real burden, but for Google that’s laughable) and at worst would degrade to uncached performance for an initial cache miss. Deferring to Google’s documentation, however clear, relieves them of probably a single engineer’s sprint time, some budgeting consideration… and costs M/N engineering hours of frustration and contribution to burnout while people think they’re convenienced by offloading work to Google as a reliable vendor. Then after shaving so many yaks, if they have the temerity and energy to post about their frustrating experience, they’re criticized on HN for not reading the documentation which they referenced after struggling to find it.

Google is, from my understanding of the post, fully standards-compliant and your reading is correct. But Technically Correct isn’t the best kind of correct when a giant megacorporation gets megabucks to provide a service which does something even the spec suggests is not ideal, and their “solution” is that every single one of their customers must discover this fact individually and cater to them.


It's hard to read that second quote as allowing the server to send a response encoded with anything other than what the client specifically accepts. I see the SHOULD as merely allowing the server to send an empty response or some other error (or even just ignoring it, I suppose), but no way can I read it as allowing the server to send a response in some other encoding.

To your point about savings, it seems like it would be most cost effective, over time, to simply send an empty 406 response, saving the machine power to read and return the cache and the network traffic of sending a file that can't be used anyway.


It was mentioned at the end of the blog-post as the solution the author used.

> Fixed the Gooble App Engine behaviour by adding an explicit configuration to the appengine-web.xml
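
Presumably something along these lines, attaching the header to the static file in appengine-web.xml (a sketch based on the static-files section of the appengine-web.xml reference; the exact path is an assumption):

  <appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
    <static-files>
      <include path="/robots.txt">
        <http-header name="Vary" value="Accept-Encoding" />
      </include>
    </static-files>
  </appengine-web-app>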


That is a rather concerning caching behavior. Does anyone know if other popular cloud provider services have this issue too?


Some services go a step further and specifically ignore an 'accept-encoding: identity' header and return gzip instead, because they feel that asking for uncompressed content is broken.


That is what GCP is doing according to TFA.


As I'm reading it, GCP is ignoring accept-encoding as a caching variable. Certain services (stackexchange API calls for one, but I'm pretty sure it's come up elsewhere) just ignore it entirely from the application.


Not to be that guy, but there's a typo in one of the request headers: "Accept-Endiging: gzip"


Just add it to the standard next to Referer.


Thank you, fixed.


Typo here, too: "the applciation".

And "Gooble App Engine" near the end of the article.


"Gooble App Engine" made me laugh.

I'm imagining some kind of knock-off brand cloud provider.


In Guadalajara there was a car mechanic's shop on my street with various car brands painted by hand on its facade, which is common in Mexico.

There was a "Telsa" misprint that always gave me a chuckle. It made me imagine "Telsa, by Eron Muks".


If Wish did cloud services.


:) Thank you, fixed.


I thought "A robots.txt Problem" was going to showcase my hack-the-planet idea of destroying someone's online store by _only_ finding a way to edit their robots.txt file to say:

  User-agent: *
  Disallow: /

And then no one can find them via search and they go broke.


No need to do that nowadays, the SEO manipulators will take care of it... you'll never find it on page 1 of google.


robots.txt prevents crawling, but not indexing and being returned in search results. Previously discussed e.g. [0]. But of course shop updates won't be seen easily anymore.

[0] https://news.ycombinator.com/item?id=31892299


Google App Engine has so many problems and is also underinvested in at Google in my opinion.

We used it earlier on in a startup we were working on and had so many issues that I would never recommend it to anyone.

You'll be much better off using GKE or some other Kubernetes variant.


I think robots.txt needs to work the other way. You put in the robots.txt ONLY if you want the search engines to crawl your page.

The current way is akin to saying ... "You do not have a lock on your door. So we are welcome to come in ..."


The web is a public space. There are no doors.

Robots.txt is a way of saying "these are not the pages you are looking for".


The best part of this post is the solution at the end. I'd love to see more posts giving a quick how-to-fix.



