
Victor Stinner (a Python core developer focused on performance) has a great talk about this question.

There are actually a lot of reasons:

- Performance is limited by CPython's old design, and any fork has to deal with all of that legacy code.

- CPython is limited to one thread executing Python bytecode at a time because of the GIL.

- Implementation details are baked in everywhere: specific memory allocators, C structures, reference counting, a specific garbage collector, etc.

You can find the talk here: https://youtu.be/TXRPCZ7Nmh4
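The GIL point is easy to demonstrate with a hypothetical micro-benchmark (not from the talk): on a standard CPython build, CPU-bound work gets no speedup from threads, because only one thread runs Python bytecode at a time.

```python
# Sketch: compare a CPU-bound countdown run sequentially vs. in two threads.
# Under the GIL, the threaded version takes about as long as the sequential
# one (often slightly longer, due to lock contention).
import threading
import time

def count(n):
    # Pure-Python CPU-bound loop; never releases the GIL voluntarily.
    while n > 0:
        n -= 1

N = 2_000_000

start = time.perf_counter()
count(N)
count(N)
sequential = time.perf_counter() - start

start = time.perf_counter()
t1 = threading.Thread(target=count, args=(N,))
t2 = threading.Thread(target=count, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, two threads: {threaded:.2f}s")
```

The same experiment with `multiprocessing` (separate interpreters, separate GILs) does scale across cores, which is the usual workaround.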


I can confirm that: scraping Google at any real speed takes huge effort and money. At our best we could scrape about 2,500 SERPs per IP.

But I must say that proxy services and the like did not help us much, because most of their IPs were already banned before we got to use them.


Yeah, and then you recycle the IP back into the pool for the next guy to work with. An operation I know of was fetching 6+ million SERPs a day; its proxy budget was hundreds of thousands of dollars a year.


How does it work out for IPv6?


IPv6 is just not widely used, so when you do use it, you stick out like a sore thumb. Think like a Bayesian: for Google, it's easy to just block whole /32s of IPv6 space.


At what point would you consider IPv6 "widely used"? It's currently 30% of traffic:

https://www.google.com/intl/en/ipv6/statistics.html


I bet you half of that is scrapers!


Complete /64s get blocked.
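The blocking granularity being discussed is cheap to compute on the blocker's side. A sketch using Python's stdlib `ipaddress` module (the address below is a documentation example, not a real target): given one abusive address, derive the enclosing /64 (typical per-subscriber allocation) or /32 (a whole provider allocation) and ban the entire prefix.

```python
# Sketch: compute the /64 and /32 prefixes enclosing a single IPv6 address.
import ipaddress

addr = ipaddress.ip_address("2001:db8:85a3:8d3:1319:8a2e:370:7348")  # example address

# strict=False lets us pass a host address and get back its containing network.
block_64 = ipaddress.ip_network(f"{addr}/64", strict=False)  # per-subscriber block
block_32 = ipaddress.ip_network(f"{addr}/32", strict=False)  # whole-allocation block

print(block_64)  # 2001:db8:85a3:8d3::/64
print(block_32)  # 2001:db8::/32
print(block_64.num_addresses)  # 18446744073709551616 (2**64 addresses in one /64)
```

This is why per-address bans are pointless for IPv6: a single subscriber can hop across 2^64 addresses inside their /64 for free, so blocks land on the prefix instead.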


Is it compatible with ClickHouse?

