Riak is an open source, highly scalable, fault-tolerant distributed database.
Armon Dadgar and Mitchell Hashimoto discuss how and why they are using Riak in production at Kiip, the problems they were facing (increasing demands of 25M+ operations/day) and their solution: migrating from MongoDB to Riak. The talk was recorded at the San Francisco Riak Meetup in May 2012.
You can read the outline of the talk here.
Kiip is a mobile rewards (mobile ad) network: it connects brands and companies (advertisers) with consumers through virtual achievements in their apps.
Kiip started their infrastructure on MySQL, but switched to MongoDB before they had any real traffic. As their traffic and data grew, they experienced the following problems:
MongoDB's global write lock limited their incoming write throughput. Their solution was to cache and aggregate writes over a 10-second window; with these batched updates, they hit the lock far less frequently.
They were still limited to roughly 1,000 reads/sec, probably because of an I/O problem, but heavy caching solved that.
Slow analytics queries touched a lot of data and were I/O-bound. Using Bloom filters solved this issue for a while.
When they hit the global write lock again, they split the data into two clusters: analytics, with its heavy writes, in the first one, and everything else in the other.
As the data on a given machine grew, the index size hit memory limits. They introduced an ETL (Extract/Transform/Load) data-processing workflow to separate older data from the database, archiving it to S3.
When the ETL job lagged behind its daily schedule, they needed further database separation.
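The write-batching workaround mentioned above (caching and aggregating writes over 10 seconds) can be sketched roughly like this. The `flush_fn` callback and the counter-style workload are illustrative assumptions, not Kiip's actual code; the point is that many small increments collapse into one database write per key:

```python
import threading
from collections import defaultdict

class WriteBatcher:
    """Aggregate counter increments in memory and flush them as a single
    batched update per key, so the database's global write lock is hit
    once per flush interval instead of once per operation."""

    def __init__(self, flush_fn, interval=10.0):
        self.flush_fn = flush_fn      # hypothetical DB-update callback
        self.interval = interval      # seconds between flushes
        self.pending = defaultdict(int)
        self.lock = threading.Lock()

    def increment(self, key, amount=1):
        # Cheap in-memory update; no database call here.
        with self.lock:
            self.pending[key] += amount

    def flush(self):
        # Swap out the pending batch under the lock, then write it.
        with self.lock:
            batch, self.pending = dict(self.pending), defaultdict(int)
        for key, total in batch.items():
            self.flush_fn(key, total)  # one write instead of many

# Usage: 1000 increments become a single flushed write.
written = {}
batcher = WriteBatcher(lambda k, n: written.__setitem__(k, n))
for _ in range(1000):
    batcher.increment("session_count")
batcher.flush()
print(written)  # {'session_count': 1000}
```

In production the flush would run on a timer (every `interval` seconds); it is called explicitly here to keep the sketch deterministic.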
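The Bloom-filter technique helps because the filter answers membership queries entirely from memory: a "no" is definitive, so many analytics lookups never touch disk at all. A minimal sketch follows; the sizing and hashing choices are assumptions for illustration, not what Kiip used:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k derived hash positions.
    `might_contain` can return a false positive, but never a false
    negative, so a False result safely skips the disk lookup."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True: added items always match
print(bf.might_contain("user:999"))  # almost certainly False here;
                                     # false positives are possible
```

The trade-off is tuning `size` and `num_hashes` against the expected item count to keep the false-positive rate acceptable.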
Eventually, after finding that all their API response times were tied to MongoDB, they started to research other databases. Their findings and opinions were the following:
Fast-growing data was the natural candidate to move to Riak first: they started with session and device data, as both grew at an exponential rate.
Sessions have the advantage of being key-value by nature, a good fit for Riak. They updated their data access layer; beyond that, no application-level change was required. The Python client had some problems in the protobuf implementation and some errors that have since been fixed (e.g. the keepalive header).
The switch was simple and required no downtime.
The device data was a bit different; eventually they settled on a key generated by a hash function. Their other attempts (e.g. secondary indexes, map/reduce, or ID indirection) were too slow, with unusable latency (0.2-2 seconds).
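The hash-generated key approach can be illustrated as follows. The attribute names are hypothetical, not Kiip's actual schema; the idea is that hashing a record's natural attributes yields a stable, deterministic key, so every device lookup becomes a plain key-value get instead of a slow secondary-index or map/reduce query:

```python
import hashlib

def device_key(app_id, device_identifier):
    """Derive a deterministic key for a device record by hashing its
    natural attributes (illustrative names, not Kiip's real schema).
    The same inputs always produce the same key, so reads need no
    index: they are direct key-value gets."""
    raw = f"{app_id}:{device_identifier}".encode("utf-8")
    return hashlib.sha1(raw).hexdigest()

key = device_key("my-app", "abc-123-def")
# Deterministic: recomputing from the same attributes finds the record.
assert key == device_key("my-app", "abc-123-def")
print(key)  # a stable 40-character hex key
```

The hash also spreads keys evenly across the ring, which suits a consistent-hashing store like Riak.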
The Kiip team found Riak extremely solid, with only a few pain points, and they offered advice and tips that might help others:
Scale early. Once your system reaches the saturation point, it is really hard to get anything done under that load. Latency explodes under heavy I/O, and a new node will only increase it in the short term (while data is rebalanced onto it). Monitor latency.
Don't use secondary indexes (2i) in real-time queries, only in background jobs (a get takes ~5 ms, versus ~2,000 ms for 2i). LevelDB adds a disk seek at each new level, which adds to the latency.
Don't restart nodes in rapid succession. Bad automation might result in strange conditions, ruin performance, and sometimes cause inconsistencies.
But nothing is a magic bullet, and Riak is no exception. They still use MongoDB for geospatial data, though they plan to migrate it to PostGIS. Kiip uses PostgreSQL for non-key-value data that is not fast-growing.
Scaling is hard, but for horizontally scalable key-value data, Riak seems to be the right choice.