Not sure why everyone is going on about certificate transparency logs when the answer is right there in the user agent. The company is scanning the IPv4 space and came upon your IP and port.
Finding the IP does not mean finding the domain. When you make an HTTP request to an IP, you specify the domain you want to reach. For example, you can configure your /etc/hosts to have xxxnakedhamsters.google.com pointing to 8.8.8.8 and make the HTTP request; Google then receives a request for that domain (i.e. the header Host: xxxnakedhamsters.google.com) and will refuse it or try to redirect to HTTPS. Of course this only applies to plain HTTP, because HTTPS requires a certificate. That's why they're talking about certificates.
But there's no evidence in the OP's post that they have, in fact, discovered the domain. The only thing posted is that there is a GET request to a listening web server.
The OP and all the people talking about certificates are making the same assumption, namely that the scanning company discovered the DNS name for the server and tried to connect. In fact, they simply iterate through IP address blocks and make GET requests to any listening web servers they find.
No they didn't. They said "How did the internet find my subdomain?" They're assuming the internet found their subdomain. They don't provide any evidence that happened, just that they found their IP address.
Depending on the web server's configuration, you very much _can_ find the domain which is configured on an IP address, by attempting to connect to that IP address via HTTPS and seeing what certificate gets served. Here's an example:
> Web sites prove their identity via certificates. Firefox does not trust this site because it uses a certificate that is not valid for 138.68.161.203. The certificate is only valid for the following names: exhaust.lewiscollard.com, www.exhaust.lewiscollard.com
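For anyone who wants to reproduce that without a browser, here is a rough Python sketch of the same check. It assumes the server presents a publicly trusted certificate (otherwise the handshake fails and you would need a different approach), and the function names `names_from_cert` and `names_served_at` are just made up for illustration.

```python
import socket
import ssl


def names_from_cert(cert):
    """Collect DNS names from a dict shaped like ssl.SSLSocket.getpeercert()."""
    names = [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]
    # Fall back to the subject's commonName for older certs without SANs.
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName" and value not in names:
                names.append(value)
    return names


def names_served_at(ip, port=443, timeout=5.0):
    """Connect to a bare IP over TLS and list the names on the cert it serves."""
    ctx = ssl.create_default_context()
    # We connect by IP, so a hostname mismatch is expected; the chain is still verified.
    ctx.check_hostname = False
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        # No server_hostname is passed, so no SNI is sent and the server
        # answers with whatever certificate it treats as its default.
        with ctx.wrap_socket(sock) as tls:
            return names_from_cert(tls.getpeercert())
```

Calling `names_served_at("138.68.161.203")` would be the scripted equivalent of the browser warning quoted above, assuming that host still serves the same certificate.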
That doesn't really matter, though. While OP is using Cloudflare, the actual server behind it is still a publicly-accessible IP address that an IPv4 space scanner can easily stumble upon.
I misunderstood, I thought the subdomain was an R2 bucket. If it's just normal Cloudflare proxying to some backend this is probably the most likely answer.
That said, while I think it's not the case here, using Cloudflare doesn't mean the underlying host is accessible, as even on the free tier you can use Cloudflare Tunnels, which I often do.
Also a valid point. I guess without more details all we can really do is speculate about the exact setup. That said, I do now agree that the most likely answer is "the underlying host was accessible and caught by an IPv4 scanner" since well, that's pretty much what it says anyway.
> When doing HTTP request to IP you specify the domain you want to connect to
No, you make HTTP requests to an IP, not a domain. You convert the domain name to an IP in an earlier step (via a DNS query). You can connect to servers using their raw IPs and open ports all day if you like, which is what's happening here. Yes, servers will (likely) reject the requests after looking at the Host header, but they will still receive the request.
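To make the split between "where the TCP connection goes" and "what the Host header says" concrete, here is a minimal sketch using raw sockets. The helper names (`build_request`, `fetch_status_line`) and the example addresses are hypothetical; the point is only that the IP and the Host value are chosen independently.

```python
import socket


def build_request(host_header, path="/"):
    """Build a bare HTTP/1.1 GET; the Host value is whatever the client chooses."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def fetch_status_line(ip, host_header, port=80, timeout=5.0):
    """Connect to an IP directly (no DNS lookup) and return the response status line."""
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        sock.sendall(build_request(host_header))
        first_chunk = sock.recv(4096)
    # The server has received the request by now, whether or not it likes the Host.
    return first_chunk.split(b"\r\n", 1)[0].decode("latin-1")
```

A scanner that only knows the address would send something like `fetch_status_line("203.0.113.7", "203.0.113.7")` (a documentation-range IP used as a placeholder), and that would show up in the access log as a GET even though no domain name was involved.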
It's rather hilarious that nobody mentioned this in 7 hours. What am I missing?
~5 billion scans in a few hours is nothing for a company with decent resources. OP: in case you didn't follow, they're literally trying every possible IPv4 address and seeing if something exists on standard ports at that address.
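Back-of-the-envelope version of that claim, as a small sketch. The probe rate is purely an assumed figure for illustration, not a measurement of any particular scanner, and the calculation ignores reserved ranges, which would only make the job smaller.

```python
import ipaddress

# Every possible IPv4 address: 2**32, a bit under 4.3 billion.
TOTAL_IPV4 = ipaddress.ip_network("0.0.0.0/0").num_addresses


def hours_to_probe_all(probes_per_second, ports=1):
    """Hours needed to send one probe per address per port at a steady rate."""
    return TOTAL_IPV4 * ports / probes_per_second / 3600


if __name__ == "__main__":
    # Assumed rate: one million probes per second.
    print(f"{TOTAL_IPV4:,} addresses")
    print(f"{hours_to_probe_all(1_000_000):.2f} hours for one port")
    print(f"{hours_to_probe_all(1_000_000, ports=2):.2f} hours for two ports")
```

At the assumed rate, one port across the whole space comes out to a little over an hour, which is consistent with "a few hours" for a handful of standard ports.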
I believe it would be harder to find out your domain that way if you were using SNI and only forwarded/served requests that used the correct host. But if you aren't using SNI, your server is probably just responding to any TLS connect request with your subdomain's cert, which will reveal your hostname.
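A toy model of the difference, in case it helps. This is not how any particular web server implements it; `select_certificate` and the certificate strings are invented for the example, and real servers each have their own configuration for a default certificate.

```python
def select_certificate(sni, vhost_certs, default_cert=None):
    """Pick the certificate to present for a TLS handshake.

    sni          -- the name the client sent via SNI, or None if it sent none
    vhost_certs  -- mapping of configured hostnames to their certificates
    default_cert -- what to serve when the name is missing or unknown;
                    None models a strict server that refuses the handshake
    """
    if sni is not None and sni in vhost_certs:
        return vhost_certs[sni]
    return default_cert


vhosts = {"secret.example.com": "cert:secret.example.com"}

# Lax setup: the only cert doubles as the default, so a scanner connecting
# by bare IP (no SNI) is handed the hostname.
lax = select_certificate(None, vhosts, default_cert="cert:secret.example.com")

# Strict setup: no default, so the same scanner learns nothing from TLS.
strict = select_certificate(None, vhosts)
```

In the strict case the scanner still sees that something is listening on the port; it just doesn't get a certificate that names the host.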
I was referring more to the fact that the user agent explicitly contained the answer, rather than suggestions that it was IP scanning. But you're right I do see one comment that mentions that. And many more likely assumed the OP already figured that part out.
The user agent contains a partial answer. IP scanning doesn't give you the actual subdomain, so the question is slightly wrong or there are missing pieces.
Judging by the logs (user agents really) right now in the submission, it's hard to tell if the requests were actually for the domain (since the request headers aren't included) or just for the IP.
It's very common for people to read only up to the point they feel they can comment, then skip straight to commenting. So, basically, no one read it.
Just the default hostname. It won't reveal all of them or any of the IP addresses of that box. secret-freedom-fighter.ice-cream-shop.example.com could have the same IP as example.com and you'd only know example.com
If you've got one cert with a subject alt name for each host, they'd see them all. If you use SNI and they have different certificates, the domains might still be in Certificate Transparency logs. If a wildcard cert is used, that could help to conceal the exact subdomain.
There are a couple easy possibilities depending on server config.
1. Not using SNI, and all HTTPS requests just get the same cert in response. (For example, go to https://209.216.230.207/ and you'll get a certificate error. Go to the cert details and you'll see the common name is news.ycombinator.com.)
Could be a number of ways, for example a default TLS cert or a default vhost redirect.
I actually had a job once a few years ago where I was asked to hide a web service from crawlers and so I did some of these things to ensure no info leaked about the real vhost.
That perfectly fits the midwit meme. Lots of people are smart enough to know about transparency logs, but not smart enough to read the OP's post and understand the details.