"(Or if you're a crazed wunderkind like LiveJournal founder Brad Fitzpatrick, you invent a memory-based distributed hashtable as a cache to put in front of the database.)"
The cool thing about Brad is that he released that creation as open source -- we can all benefit from his genius, like Facebook already has. Memcached is an amazingly effective way of keeping the benefits of SQL storage while serving most reads straight from memory -- simple, scalable, and reliable. It's impossible to over-hype how much it kicks ass.
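The pattern memcached enables is usually called cache-aside: check the cache first, fall back to the SQL query on a miss, and populate the cache so the next read is served from memory. Here's a minimal sketch of that flow -- a plain dict stands in for a memcached client (a real client such as pymemcache exposes the same get/set shape), and the `DB` dict stands in for the expensive SQL lookup:

```python
CACHE = {}                     # stand-in for a memcached client
DB = {"user:1": "alice"}       # stand-in for the SQL database

def get_user(key):
    """Cache-aside read: try the cache, fall back to the DB, populate the cache."""
    value = CACHE.get(key)
    if value is not None:
        return value           # cache hit: no SQL query at all
    value = DB.get(key)        # cache miss: the expensive SQL query goes here
    if value is not None:
        CACHE[key] = value     # next read for this key is served from memory
    return value
```

The tricky part in a real deployment is invalidation -- you have to delete or update the cached entry whenever the underlying row changes, or readers see stale data.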
If a class of files requests N>1 copies, at what point after the HTTP PUT can the application be happy that N copies exist? It seems fine to think I have three copies of that file, but what if a machine fails before MogileFS has created the other two?
Also, it's intended to operate on whole files at a time, although an HTTP GET with a Range header might be usable to fetch a run of bytes. If two web servers both try to write the same filename, doesn't the latest one win?
I can see it's great for certain things, e.g. storing the user's images, but not for the stuff traditionally in the database. Or have I missed something?
SQL Databases are so astonishingly slow that I just switched my most intensive app to only use the database for backing up the data to disk and reloading from on startup.
This arrangement is so much faster and easier to understand/measure/optimize for me that I can't see myself going back.
Can you be more detailed about what you're doing and the application? Has your solution opened your app to corruption or data loss in the event of app or server crashes?
I have a separate thread that writes changes back to disk at its leisure. There is also a special shutdown/restart mode that I can trigger which stops accepting input from the user and flushes everything to the db.
This can be implemented very simply and has lots of advantages all around.
You're fortunate your data fits in RAM. And it seems you don't have to worry about machine failure causing unwritten data to be lost despite having accepted it from the user?
Oh, you said earlier "I just switched my most intensive app to only use the database for backing up the data to disk and reloading from on startup" so I thought you meant you only read data from the database on start-up, hence all data fitting in core.
If you're reading data from it whenever you find the data isn't in core then aren't you using the database?
As for the "thread that writes changes back to disk at its leisure", how do you guard against machine failure after accepting data for writing but before it's been written?
Twitter is using a database because the Twitter engineers chose not to prematurely optimize their system. They now fully understand their problem domain and thus now would be the appropriate time to make optimizations such as replacing the SQL back-end with faster, less flexible solutions.
The article does have a strong whiff of premature optimization. However, using a SQL db in the first place usually means extra work to accommodate something unhelpful. Usually it's way less code to just skip it.
A non-SQL solution surely can work well and be a fine solution to some problems. Especially problems that are very well understood (like problems at the optimization stage of the project).
But can a minimal non-SQL solution provide the basic features that people want and expect from a persistent storage layer: transactions, allowing multiple process concurrent access to the data, relatively foolproof failure recovery procedures, etc?
A decent RDBMS gives you those features out of the box in addition to allowing you great data manipulation flexibility. It is this sweet spot of data manipulation flexibility and fault tolerance that makes SQL/RDBMS such a ubiquitous tool.
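To make "out of the box" concrete: here's what a transaction buys you with zero extra engineering, using SQLite only because it's the RDBMS bundled with Python -- any decent database behaves the same way. If a transfer crashes halfway, the partial update is rolled back automatically:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

try:
    with conn:  # one transaction: both updates commit, or neither does
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        raise RuntimeError("crash mid-transfer")  # simulated failure
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'b'")
except RuntimeError:
    pass  # the 'with conn' block rolled the debit back for us

# Account 'a' still holds its full balance.
balance = conn.execute("SELECT balance FROM accounts WHERE name = 'a'").fetchone()[0]
```

Replicating this behavior -- plus concurrent access and crash recovery -- on top of flat files is the "real code and real engineering effort" the comment below is talking about.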
I suspect code you're "skipping" is the code that would give you those extra features that help with reliability and fault tolerance.
While it is possible to implement those features without using the RDBMS-crutch it takes real code and real engineering effort to do it. If you are writing "way less code" you are likely not providing replacement features.
Does anyone know of detailed write-ups by people who didn't use a database? I too dislike the overhead of SQL as an interface. How did they cope with the problems that DBs solve? Did they ever have to scale to more than one "DB server"?
Thanks, I've read that and will continue through the others.
One of the cases given could fit all the data in core; the other used BerkeleyDB for its smallish "database" data, cutting out SQL, and a GFS-like system for the large amount of BLOB archiving it had to do. It's the case where the data doesn't fit in core and is changing rather than being archived where flat files seem harder to use, since the DB server is a convenient place for concurrency controls.
Look at Prevayler and HAppS, two systems that don't use a database at all. In-memory persistence with write-ahead logging, and they handle give-or-take 1000 hits/s on a stock Xeon server.
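The core idea behind those systems is small enough to sketch: append every change to a log and fsync it before applying it in memory, then rebuild the state by replaying the log on startup. This is a toy in the spirit of Prevayler, not its actual API -- the class name and JSON-lines format are made up for the example:

```python
import json
import os

class LoggedDict:
    """In-memory dict whose changes survive a crash via a write-ahead log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}
        if os.path.exists(log_path):
            with open(log_path) as f:
                for line in f:              # replay the log on startup
                    entry = json.loads(line)
                    self.state[entry["k"]] = entry["v"]

    def set(self, key, value):
        # Durably log the change BEFORE applying it in memory, so an
        # acknowledged write is never lost to a crash.
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"k": key, "v": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.state[key] = value
```

A real system would also periodically snapshot the state and truncate the log so replay doesn't grow without bound, but this is the answer to the machine-failure question raised earlier: the log is on disk before the write is acknowledged.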