Why Is Twitter Using A Database In The First Place? (mooseyard.com)
21 points by mattjaynes on April 19, 2007 | hide | past | favorite | 23 comments


"(Or if you're a crazed wunderkind like LiveJournal founder Brad Fitzpatrick, you invent a memory-based distributed hashtable as a cache to put in front of the database.)"

The cool thing about Brad is that he released that creation as open source -- we can all benefit from his genius, like Facebook already has. Memcached is an amazingly effective way of getting the benefits of SQL storage in a simple, scalable, and reliable way. It's impossible to over-hype how much it kicks ass.
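The pattern memcached enables is cache-aside: check the cache first, fall back to SQL on a miss, and invalidate on writes. A minimal sketch of that pattern, using a plain dict as a stand-in for a real memcached client (in production you'd talk to the memcached daemon over the network):

```python
# Hedged sketch of the cache-aside pattern memcached is used for.
# A plain dict stands in for the memcached client; db_reads records
# which keys actually hit the (pretend) database.

cache = {}
db_reads = []

def query_db(key):
    """Pretend SQL lookup -- the slow path the cache avoids."""
    db_reads.append(key)
    return f"row-for-{key}"

def get(key):
    """Check the cache first; fall back to the database and populate."""
    value = cache.get(key)
    if value is None:
        value = query_db(key)
        cache[key] = value
    return value

def update(key, value):
    """On writes, update the database, then drop the stale cached copy."""
    # ... real code would issue an UPDATE against SQL here ...
    cache.pop(key, None)
```

The second read of a hot key never touches the database, which is where the scalability win comes from.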


Using MogileFS (also by Brad) might be a solution.


MogileFS looks interesting.

If a class of files requests N>1 copies, at what point after the HTTP PUT can the application be happy that N copies exist? It seems fine to think I've three copies of that file but what if machine failure occurs before MogileFS has created the other two?

Also, it's intended to operate on whole files at a time, although HTTP GET might be usable to fetch a run of bytes. If two web servers both try and write the same filename, doesn't the latest one win?

I can see it's great for certain things, e.g. storing the user's images, but not for the stuff traditionally in the database. Or have I missed something?

Cheers, Ralph.


SQL Databases are so astonishingly slow that I just switched my most intensive app to only use the database for backing up the data to disk and reloading from on startup.

This arrangement is so much faster, and so much easier to understand/measure/optimize for me, that I can't see myself going back.


Can you be more detailed about what you're doing and the application? Has your solution opened your app to corruption or data loss in the event of app or server crashes?


I have a separate thread that writes changes back to disk at its leisure. There is also a special shutdown/restart mode that I can trigger which stops accepting input from the user and flushes everything to the db.

This can be implemented very simply and has lots of advantages all around.
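The scheme described above can be sketched roughly as follows: changes go into an in-memory queue, a background thread drains them to storage at its leisure, and a shutdown call flushes everything before exit. All names here are illustrative, not the poster's actual code:

```python
import queue
import threading

# Hedged sketch of the write-behind scheme described above.
# `persisted` stands in for rows actually written to the database.

pending = queue.Queue()
persisted = []
_stop = threading.Event()

def _writer():
    """Background thread: drain pending changes to 'disk' at leisure."""
    while True:
        try:
            change = pending.get(timeout=0.1)
        except queue.Empty:
            if _stop.is_set():
                return          # queue drained and shutdown requested
            continue
        persisted.append(change)  # real code: INSERT/UPDATE here

def record_change(change):
    """Request path stays fast: no disk I/O, just an enqueue."""
    pending.put(change)

def shutdown():
    """Stop accepting input, then flush everything before exit."""
    _stop.set()
    _thread.join()

_thread = threading.Thread(target=_writer)
_thread.start()
```

Note this is exactly where the durability question below bites: anything still in `pending` when the machine dies is lost, even though the user was told it was accepted.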


You're fortunate your data fits in RAM. And it seems you don't have to worry about machine failure causing unwritten data to be lost despite having accepted it from the user?

Cheers, Ralph.


"your data fits in RAM"

Not so. It's not too hard to check if you have some data and then retrieve it from the db as necessary.


Oh, you said earlier "I just switched my most intensive app to only use the database for backing up the data to disk and reloading from on startup" so I thought you meant you only read data from the database on start-up, hence all data fitting in core.

If you're reading data from it whenever you find the data isn't in core then aren't you using the database?

As for the "thread that writes changes back to disk at its leisure", how do you guard against machine failure after accepting data for writing but before it's been written?

Cheers, Ralph.


Remember Paul Buchheit's advice at Startup School. "Maybe consider not using a database", or some similar statement.


"use in-memory hashmap for small data, Amazon S3 or filesystem for large data. treat disk as a sequential device."
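A minimal sketch of that advice, under my own assumptions about what it means: keep the working set in an in-memory dict, and treat the disk as a sequential device by appending every change to a log that is replayed on startup. The file format and names here are illustrative:

```python
import json
import os

# Hedged sketch: in-memory hashmap for the data, append-only log on disk.

def append(log_path, key, value):
    """Record a change sequentially -- no seeks, just appends."""
    with open(log_path, "a") as f:
        f.write(json.dumps({"k": key, "v": value}) + "\n")

def load(log_path):
    """Rebuild the in-memory hashmap by replaying the log on startup."""
    data = {}
    if os.path.exists(log_path):
        with open(log_path) as f:
            for line in f:
                rec = json.loads(line)
                data[rec["k"]] = rec["v"]  # later entries win
    return data
```

Large blobs would go to S3 or the filesystem by reference, with only the key kept in the log.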


Yeah there you go.


Twitter is using a database because the Twitter engineers chose not to prematurely optimize their system. They now fully understand their problem domain and thus now would be the appropriate time to make optimizations such as replacing the SQL back-end with faster, less flexible solutions.


The article does have a strong whiff of premature opt. However, using a SQL db in the first place usually means doing extra work to fit something unhelpful. It's often way less code to just skip it.


A non-SQL solution surely can work well and be a fine solution to some problems. Especially problems that are very well understood (like problems at the optimization stage of the project).

But can a minimal non-SQL solution provide the basic features that people want and expect from a persistent storage layer: transactions, allowing multiple process concurrent access to the data, relatively foolproof failure recovery procedures, etc?

A decent RDBMS gives you those features out of the box in addition to allowing you great data manipulation flexibility. It is this sweet spot of data manipulation flexibility and fault tolerance that makes SQL/RDBMS such a ubiquitous tool.

I suspect code you're "skipping" is the code that would give you those extra features that help with reliability and fault tolerance.

While it is possible to implement those features without using the RDBMS-crutch it takes real code and real engineering effort to do it. If you are writing "way less code" you are likely not providing replacement features.


Anyone know of detailed write-ups by people that didn't use a database. I too dislike the overhead of SQL as an interface. How did they cope with the problems that DBs solve? Did they ever have to scale to more than one "DB server"?

Cheers, Ralph.



Thanks, I've read that and will continue through the others.

One of the cases given could fit all its data in core; the other used BerkeleyDB for its smallish "database" data, cutting out SQL, and a GFS-like system for the large amount of BLOB archiving it had to do. It's the case where the data doesn't fit in core and is changing (not just being archived) where flat files seem harder to use, since the DB server is a convenient place for concurrency controls.

Cheers, Ralph.


'... up to 11,000 requests per second.' Jesus Christ, that's a lot. Where does that come from?

One of the bottlenecks is the continuous polling on the public timeline and its RSS file. It makes me wonder why they don't charge for the privilege.

'... polling for updates every 15 minutes, 24 hours a day, that's still only about 1,000 hits/sec ...'

Try every 0-10 seconds per client, for every person using such clients. [0] If this were happening on other sites it would be throttled.

Reference

[0] google search, 'twitter updates timeline every seconds'

http://www.google.com/search?q=twitter+updates+timeline+every+seconds


Why not simply render these things statically whenever something actually does change... huhu.
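The render-on-change idea is simple enough to sketch: regenerate the timeline page only when a new message arrives, so the thousands of polls per second hit a static file the web server can serve without touching any application code. Names and markup here are illustrative assumptions:

```python
# Hedged sketch of "static render on change": the expensive work happens
# once per update, not once per poll.

def render_timeline(messages):
    """Build the public-timeline HTML (trivial placeholder markup)."""
    return "<ul>" + "".join(f"<li>{m}</li>" for m in messages) + "</ul>"

def on_new_message(messages, out_path):
    """Called only when the data changes -- never on a poll."""
    html = render_timeline(messages)
    with open(out_path, "w") as f:
        f.write(html)  # the web server serves this file directly
```

Pollers then cost one static-file hit each, which Apache or any front-end server handles cheaply.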


As I posted over there:

Look at Prevayler and HAppS, two systems that don't use a database at all. In-memory persistence with write-ahead logging, and they handle give-or-take 1000 hits/s on a stock Xeon server.
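The idea behind both systems, as I understand it, is "system prevalence": all live state stays in memory, and every command is appended to a journal before it is applied, so a crash is recovered by replaying the journal. A rough sketch under those assumptions (this is my illustration, not Prevayler's actual API):

```python
import json
import os

# Hedged sketch of the prevalence idea: log the command first
# (write-ahead), apply it to in-memory state second, replay on startup.

class PrevalentCounter:
    def __init__(self, log_path):
        self.log_path = log_path
        self.value = 0
        if os.path.exists(log_path):
            with open(log_path) as f:
                for line in f:              # recovery: replay the journal
                    self._apply(json.loads(line))

    def _apply(self, cmd):
        self.value += cmd["delta"]

    def add(self, delta):
        cmd = {"delta": delta}
        with open(self.log_path, "a") as f:
            f.write(json.dumps(cmd) + "\n")  # journal before applying
            f.flush()
            os.fsync(f.fileno())             # durable before we ack
        self._apply(cmd)
```

Reads are pure in-memory operations, which is where the 1000 hits/s figure becomes plausible; the fsync on each write is the durability cost.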


Having looked into Prevayler, I think it buys you more problems than it solves. Thanks for the HAppS reference.

Terracotta may be a good solution for Java.


Yeah, Prevayler is handicapped by the fact that it's Java. HAppS is really cool; I haven't worked with Prevayler, but it's cited as an influence.



