I was thinking about that, but didn't find an easy way to ban a crawler. Google App Engine has a firewall, but it works on IP addresses. Banning by User-Agent would have to be done in the app code, and that essentially means handling the request anyway, even if more cheaply. I didn't want to touch the application at all, hoping to resolve this on the crawler side, which I suspected was an unintentional "offender".
Speaking of the five months - that's fine. We were not communicating every day, of course. And it is indeed impressive that my case was handled at all.
I had known for years that unwanted crawling by various crawlers happens, and was reminded of it in metrics from time to time. One day I was in the mood to dig deeper, found two crawlers in the access logs, studied their websites and emailed them.
One didn't respond at all. moz.com created a ticket; four days later a support engineer replied, and a week later I replied. We had some back and forth. I supposed they didn't recognize `User-agent: *` and needed `User-agent: Dotbot`. David - the support engineer - floated several other hypotheses. There was a period of silence, then I raised my issue again; David had it reviewed by some other people at moz.com, and they pointed to the gzipped response.
BTW, what I learned is that "If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding." (https://www.rfc-editor.org/rfc/rfc9110.html#name-accept-enco...)
So if we make an HTTP request, unless we explicitly send `Accept-Encoding: identity`, we'd better be prepared to inspect the `Content-Encoding` of the response and decompress the data if necessary.
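As a minimal sketch of what that looks like on the client side (in Python, with a hypothetical `decode_body` helper; the robots.txt content is made up for illustration):

```python
import gzip

def decode_body(headers: dict, body: bytes) -> str:
    """Decompress the response body if Content-Encoding says it's gzipped."""
    encoding = headers.get("Content-Encoding", "identity").lower()
    if encoding == "gzip":
        body = gzip.decompress(body)
    return body.decode("utf-8")

# A server that ignores (or never saw) Accept-Encoding: identity
# may still hand back gzipped bytes:
compressed = gzip.compress(b"User-agent: Dotbot\nDisallow: /\n")
print(decode_body({"Content-Encoding": "gzip"}, compressed))
```

A client that skips the `Content-Encoding` check and parses the raw bytes would see gzip's binary header instead of the rules, which matches the failure mode moz.com eventually identified.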
But since Google App Engine returns gzipped content even for requests with `Accept-Encoding: identity`, I accepted that the failure was on my side and went on with the config changes. Still, I left a recommendation for moz.com to support gzip on their end.