
This all seems very reasonable. Looking forward to the post mortem.

Given their size on both the web in terms of employees this unusual for Wikimedia. They typically fly under the radar. How many times has Wikipedia ever been down?

I recall AWS, Google, and Microsoft having more outages -- mind you, they're probably considerably bigger, but still, Wikimedia is doing something right.



It's not hard to have high uptime for content that is largely static. It has less to do with size than with static versus dynamic, and with accurate (think bank transactions) versus fuzzy (search results).

edit: fixed a typo


I've not looked for a long time, but the last time I edited it, it was close to realtime*

*in the web sense - don’t want to offend any of the real real-time people :-)


Sure, but if the dynamic stuff was down, you just wouldn't be able to edit, but the static part would keep humming right along and never get counted as an outage.


There is some non-edit dynamic stuff too; some templates are written in Lua.

These templates can pull information in from outside the specific Wikipedia instance, like retrieving properties from Wikidata.

Edit: I guess I was wrong; it seems like Lua modules are only evaluated when there's a change to a page incorporating them:

>The programs are run only when the page is "parsed" (when it or a page it incorporates is changed or previewed), not every time you view the output.

https://en.wikipedia.org/wiki/Help:Lua_for_beginners#Input


Those Lua templates seem to be constantly having problems.

https://en.wiktionary.org/wiki/a

>Lua error: not enough memory

appears on the page 250 times.


You were kind of right - the page gets reparsed (and the frontend CDN cache purged, along with a backend cache) anytime someone edits a Wikidata entry that the Lua script uses.


> Sure, but if the dynamic stuff was down, you just wouldn't be able to edit, but the static part would keep humming right along and never get counted as an outage.

The frontend cache gets skipped if you have a session cookie (logged in, logged in recently, or made an edit while logged out). So if you edit something, subsequent views are not hitting the static site, and you would notice if it was down.


Yeah, that's how reddit works too, but there are multiple layers of caching behind the CDN that would still hide certain types of outages.


Wikipedia has a second layer of cache after that: the article HTML without the user interface, stored in memcached and the DB (in MediaWiki speak this is referred to as the "parser cache"). However, typically if the site is down, that layer would go down too, so only the Varnish servers really have the potential to hide outages.
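
A rough sketch of how a read request walks those layers (a toy model in Python, not MediaWiki's actual code; the cache names and the render function are made up for illustration):

    # Toy model of the layering described above: an edge cache (Varnish in
    # Wikipedia's case) that only serves cookie-less requests, and a "parser
    # cache" of rendered article HTML sitting in front of the expensive parse.
    edge_cache = {}    # url -> cached html, served only to anonymous users
    parser_cache = {}  # page title -> rendered article html (no user chrome)

    def render_from_wikitext(title):
        # Stand-in for the expensive wikitext parse/render step.
        return f"<article>{title}</article>"

    def handle_request(url, title, has_session_cookie):
        # Logged-in (or recently logged-in) users bypass the edge cache.
        if not has_session_cookie and url in edge_cache:
            return edge_cache[url]

        # Second layer: article HTML without the per-user interface.
        html = parser_cache.get(title)
        if html is None:
            html = render_from_wikitext(title)
            parser_cache[title] = html

        if not has_session_cookie:
            edge_cache[url] = html
        return html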


You can also think of it as reads vs. writes. When there are many reads but few writes it makes sense to have replicas, but if there are many writes and few reads you are better off with sharding. You can also think of RAID, where you have mirrors vs. stripes. Often the two concepts are combined, as in RAID 0+1 or RAID 1+0. Scaling many reads is much simpler than scaling both reads and writes, though. The holy grail of computing/databases is a database that can scale both reads and writes while keeping decent performance and latency.
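
To make that concrete, here is a toy sketch of the two approaches (Python; the primary/replicas/shards objects are placeholders assumed to expose get/set, not any real database driver):

    import random
    import hashlib

    # Read-heavy workload: fan reads out across replicas, send all writes to
    # a single primary (which then replicates asynchronously).
    class ReplicatedDB:
        def __init__(self, primary, replicas):
            self.primary = primary
            self.replicas = replicas

        def read(self, key):
            return random.choice(self.replicas).get(key)

        def write(self, key, value):
            self.primary.set(key, value)

    # Write-heavy workload: split the keyspace so each shard only sees a
    # fraction of the writes (and of the reads for its keys).
    class ShardedDB:
        def __init__(self, shards):
            self.shards = shards

        def _shard_for(self, key):
            h = int(hashlib.md5(key.encode()).hexdigest(), 16)
            return self.shards[h % len(self.shards)]

        def read(self, key):
            return self._shard_for(key).get(key)

        def write(self, key, value):
            self._shard_for(key).set(key, value)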


It isn't static - just wordy. You are confusing lots of text with staticness.


I explained what I meant by static here: https://news.ycombinator.com/item?id=24665764


SSR would be a more accurate description than static.


Wikipedia pages are very easy to cache, and caching them likely provides a massive benefit, so if we're talking about uptime, static is probably a better description of what is happening than SSR.

You are right that technically it's SSR, but that's not what's relevant here.


I don't think they are mutually exclusive. I have no idea how Wikipedia works, but I've run a lot of high-volume, relatively static sites. A simple thing that works very well is to do SSR, but serve it through a CDN like Akamai, then configure Akamai to serve a static/cached version if the backend is down. Assuming everything is working, you get a semi-dynamic SSR model, but if something goes down, the site is still served and you have no customer-facing downtime.
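
A minimal sketch of that serve-stale-when-the-origin-is-down behaviour, in Python rather than real Akamai or Varnish config (the TTL and the use of urllib are just for illustration; real CDNs express this with stale-if-error / grace settings):

    import time
    import urllib.request

    cache = {}  # url -> (body, fetched_at)
    TTL = 300   # serve from cache for 5 minutes under normal conditions

    def fetch(url):
        now = time.time()
        entry = cache.get(url)
        if entry and now - entry[1] < TTL:
            return entry[0]              # fresh enough, serve from cache
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                body = resp.read()
            cache[url] = (body, now)
            return body                  # fresh SSR output from the origin
        except Exception:
            if entry:
                return entry[0]          # origin is down: serve the stale copy
            raise                        # nothing cached, so it's a real outage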


Wikipedia basically uses Varnish as its own CDN (they have caching servers in San Francisco, Texas, Virginia, Singapore, and Amsterdam; backend servers are in Virginia with a hot backup in Texas).


Thanks for sharing. We did it that way for a site I worked on until a DDoS brought Varnish down. Then we put Akamai in front and never had a problem again. This was over a decade ago; it wasn't as easy back then to auto-scale a Varnish layer in the cloud.


Wikipedia generally tries to take the approach of doing what they can themselves and using open source wherever possible. Most of the setup is documented at https://wikitech.wikimedia.org and there is a public puppet repo with all the server configs: https://github.com/wikimedia/puppet

That said, I think they do now use Cloudflare's BGP-based Magic Transit DDoS protection product to help against DDoS attacks.


> it's not hard to have high uptime for content that is largely static

Largely static? There are edits happening all the time.


From [1]:

> Wikipedia develops at a rate of over 1.9 edits per second, performed by editors from all over the world. Currently, the English Wikipedia includes 6,167,378 articles and it averages 598 new articles per day.

Doesn't seem to be much, to be honest.

[1]: https://en.wikipedia.org/wiki/Wikipedia:Statistics


Note: edits sometimes affect multiple pages; in extreme cases, a single edit can affect millions of pages. The Lua module Module:arguments (which is a wiki page, editable like any other) is used on over 25 million pages.

There generally is a bit of a long-tail effect. Popular pages get edited a lot, but they also get viewed a lot. It can be expensive when everyone is viewing and editing the same page (Michael Jackson's death is a famous example that caused downtime, although changes were made to make things more robust so it wouldn't happen again).
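
The fan-out is essentially a reverse index from a template/module to the pages that transclude it; a hypothetical sketch of the bookkeeping (not MediaWiki's actual links tables):

    from collections import defaultdict

    # Hypothetical dependency tracking: which pages transclude which
    # template/module. Editing one widely used module fans out into purges
    # for every page that uses it (directly or via another template).
    pages_using = defaultdict(set)  # template/module title -> pages using it

    def register_transclusion(page, template):
        pages_using[template].add(page)

    def pages_to_purge(edited_title):
        to_purge, queue = set(), [edited_title]
        while queue:
            title = queue.pop()
            if title in to_purge:
                continue
            to_purge.add(title)
            queue.extend(pages_using.get(title, ()))
        return to_purge

    register_transclusion("Cat", "Module:arguments")
    register_transclusion("Dog", "Module:arguments")
    # Editing the module means purging it plus both pages that use it.
    print(pages_to_purge("Module:arguments"))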


The way they designed everything, this doesn't matter. It's still static, in that the content is not generated at access time, at least for logged-out users.

If the actual servers go down, all that means is that Wikipedia is read-only, and the caching reverse proxies (which also receive a push update during modifications) would just serve the last version of pages (except anybody with login cookies, valid or not, would get 500 responses).


But everyone sees the same edited version when not logged in, which is the vast majority of users, so you can just throw a huge cache in front, which is what they do. And most edits only touch one page, so the churn is tiny relative to the cache size.

This is a much easier service to reliably engineer than something like Twitter. For SRE purposes, Wikipedia is mostly static.


Compared to something like Slack or the Office 365 products, it might as well be carved in stone. I'm guessing 99+% of requests are unauthenticated, the data is easy to cache, and freshness (on the timescale of minutes or hours) is almost worthless.

Even something as simple as HN probably has a much, much lower value of "usefulness when served completely statically from caches", due to upvotes and comments. If the front page and comments stayed static both during breakfast and lunch, my WFH routine would sadly be impacted...


> freshness (on the timescale of minutes or hours) is almost worthless.

On the contrary, users get very angry if stuff isn't fresh.

Someone changes the Trump article to say he is a poopy head. If that gets fixed in 2 seconds, no big deal. If that gets cached, and the edit to fix it doesn't hit the caches for a couple of hours, Wikipedia is now the top story on CNN.

Generally, Wikipedia caches are expected to be updated within seconds, or minutes at most.


Okay, that's fair, my view was too simplistic. But still, Wikipedia could probably get away with days of 1-5 minute cache refreshes if required? Especially if some banner informed users about it or something.

I think my larger points still stand. In comparison, almost all other services at the scale of Wikipedia have critical almost-realtime components, and are almost useless without the ability to authenticate users (which can't really be cached).

Not saying that the people who manage to keep Wikipedia so stable are doing an easy task, just that it's very different from almost all other things on the web.


I've sometimes heard Wikipedia described as a "large-scale static site plus a medium-scale social network". The caching is a bit more complex than a naive static site due to churn rate and freshness requirements, but fundamentally you are right: without frontend Varnish caching, Wikipedia would be very different in terms of hosting requirements and scaling complexity.


I'm also wondering if the caching strategy they are using is a naive one (i.e. the cache is valid for a fixed duration, like 5 minutes) or a more active one (like Stack Overflow's), with cache invalidations each time a page is modified/commented on.


There is cache invalidation each time a page (or one of its dependencies; pages depend on lots of other pages) is modified.

Assuming things haven't changed, each Varnish server listens for purges via multicast UDP.
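
For the curious, a bare-bones listener along those lines looks roughly like this (Python; the multicast group and port are placeholders, and this is not Wikimedia's actual purge daemon):

    import socket
    import struct

    MCAST_GROUP = "239.128.0.112"  # placeholder multicast group
    MCAST_PORT = 4827              # placeholder port

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MCAST_PORT))

    # Join the multicast group on all interfaces.
    mreq = struct.pack("4sl", socket.inet_aton(MCAST_GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        data, addr = sock.recvfrom(65535)
        url = data.decode("utf-8", errors="replace").strip()
        print(f"purge request for {url} from {addr}")
        # a real daemon would now tell the local cache to drop that URL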


Purges have been migrated to Kafka as a means of transport, at long last. So now if a purging daemon crashes, purge requests are not lost.

You can see per-server stats on purges happening here:

https://grafana.wikimedia.org/d/RvscY1CZk/purged?orgId=1
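
A minimal consumer for a purge stream like that might look as follows (using the confluent-kafka Python client; the broker address and topic name are made up, not the real purged configuration):

    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",  # placeholder broker
        "group.id": "purge-consumer",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["resource-purge"])      # hypothetical topic name

    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None:
                continue
            if msg.error():
                continue
            url = msg.value().decode("utf-8")
            print("purging", url)
            # a real daemon would send a PURGE to the local cache layer;
            # because the Kafka log is durable, a crashed consumer resumes
            # from its last committed offset instead of losing purges
    finally:
        consumer.close()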


Right, but if the backend goes down, and you serve a stale cached version of the page that is missing the latest edit, it's fine and you have no downtime. That's what I mean by static.

The opposite of static would be an e-commerce site where you can't take transactions if the backend is down and you really don't want to oversell your inventory, so you need the inventory management system to be up for the site to "work".

Also, the average Wikipedia page probably isn't edited very often.


There are enough copies of Wikipedia that if it has ever been down you could just get the same content elsewhere, so people don't usually make the same fuss about it that they would about AWS/Google/Microsoft.

AWS/Google being down for even a minute or two is a big deal though.


How many people read Wikipedia other than from the main website? What's the source on that?


Prior to having a cellphone data plan (2013) I kept a copy of most Wikipedia articles on my laptop for use when traveling and being away from easy internet access.


Me? I've seen it down a couple times before, and when it was, I just Googled for another copy of the article instead of posting "Wikipedia is down! OMG the world is ending!" all over the internet as people do when AWS is down.


So that makes it anecdotal. I thought that's something others do as well. I personally don't even know where else to look for articles. I also don't trust other sources or possibly outdated mirrors.


Curious: why do you trust Wikipedia (in cases where trust is even required)?

I am under the impression that anyone can build credibility and write what they think is correct. Do you check the linked sources and verify them yourself?


If it's a controversial topic (e.g. history of Tibet), yes.

If it's just an article about the history of pianos or CPUs or something, the probability of misinformation is much lower, the consequences of being misinformed are much lower, and I don't usually bother. Many times I just browse Wikipedia because I want to learn about weird animals or off-the-beaten-path places on Earth or culture or something like that.

(By the way, primary sources sometimes have their drawbacks as well; they can be politically motivated, biased, and not tell you the full story, and Wikipedia is effectively peer-reviewed for a lot of articles.)


What do you mean by "Given their size on both the web in terms of employees this unusual for Wikimedia"?

(I know this is likely just a typo, but I genuinely couldn't figure out what you meant.)



