Putting Big Data to Work

If you've read my post on the value of good statistics-gathering, you may have guessed that SingleHop gathers a lot of statistical data for our clients. In fact, we have very extensive performance data on VMs and their nodes -- so extensive that the flood of gigabytes was overwhelming our ability to do anything useful with it. As a result, I've spent the last few weeks figuring out how to fix that.

SingleHop uses MySQL as the main persistent store for keeping track of our datacenters. That means information about dozens of different kinds of objects -- datacenters, cabs, switches, IPs, servers, VMs, clouds, inventory, VLANs, and more. We wanted to keep this database small and limited for a number of reasons: faster backup and recovery, faster and more predictable performance, and immunity from statistical DoS attacks. We decided to store our statistical data -- things like access logs, bandwidth stats, and of course virtualization stats -- in MongoDB, because we anticipated eventually having lots and lots of data, and MongoDB looked like it would be easy to scale.
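For concreteness, here is a minimal sketch of that split: the VM itself lives as a row in MySQL, while its high-volume samples are written to a dedicated MongoDB collection. The database, collection, and field names below are purely illustrative, not our actual schema.

```python
# Minimal sketch (illustrative names, not our real schema): high-volume
# per-VM samples go to MongoDB so the operational MySQL database stays small.
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
stats = client["metrics"]["vm_stats"]  # hypothetical database/collection names

stats.insert_one({
    "vm_id": 12345,                    # reference back to the VM's row in MySQL
    "ts": datetime.now(timezone.utc),  # sample timestamp
    "cpu_pct": 37.2,
    "net_in_bytes": 1_482_113,
    "net_out_bytes": 903_551,
})
```

Keeping the time-series firehose out of the relational store is what buys the smaller backups and more predictable query performance mentioned above.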

Of course, we made a number of rookie large-dataset mistakes. One of the biggest was that we were taken in by NoSQL's promise of schema-less design, so we never really planned our document schema. We ended up with several different formats of data stored in the same collection, and we had to cover for our past mistakes with extra logic in the code, which added complexity (the sketch below gives a flavor of it). Standardizing the schema is extremely expensive, because it requires rewriting all of the old stats into the new schema.
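To give a feel for what that extra logic looks like, here is a hypothetical normalizer of the kind that accumulates when several document shapes share one collection. The field names and formats are invented for illustration only.

```python
# Illustration of the compatibility shims needed when multiple historical
# document shapes live in the same collection. Hypothetical field names.
def normalize_stat(doc):
    """Map any of the historical document shapes onto one canonical shape."""
    if "cpu_pct" in doc:                                 # current format
        return {"ts": doc["ts"], "cpu_pct": doc["cpu_pct"]}
    if "cpu" in doc and isinstance(doc["cpu"], dict):    # older nested format
        return {"ts": doc["timestamp"], "cpu_pct": doc["cpu"]["percent"]}
    # oldest format stored CPU as a 0-1 fraction under a different key
    return {"ts": doc["time"], "cpu_pct": doc["cpu_load"] * 100}
```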

Recently, however, we were able to rectify this mistake, because we were forced to aggregate our data. By standardizing on a common schema and using it throughout the code, we were able to build automated tools that go back through the data and reduce its granularity: we now keep raw stats at roughly 5-minute increments for only the first week, then reduce them to hourly and eventually daily values. As a result, we improved statistics retrieval times by as much as 90%. These are real improvements you can see in LEAP if you look at the historic data for your objects!
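A roll-up like that can be sketched roughly as follows. This is an illustration of the idea rather than our production tooling: it assumes hypothetical vm_stats and vm_stats_hourly collections, a ts timestamp field, and a MongoDB recent enough to support the $dateTrunc and $merge stages.

```python
# Rough sketch of an hourly roll-up: average raw ~5-minute samples older
# than a week into one document per VM per hour, then drop the raw samples.
# Collection and field names are assumptions, not our production code.
# ($dateTrunc and $merge require a reasonably recent MongoDB server.)
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["metrics"]
cutoff = datetime.now(timezone.utc) - timedelta(days=7)

pipeline = [
    {"$match": {"ts": {"$lt": cutoff}}},
    {"$group": {
        "_id": {
            "vm_id": "$vm_id",
            # truncate each timestamp to the hour it falls in
            "hour": {"$dateTrunc": {"date": "$ts", "unit": "hour"}},
        },
        "cpu_pct": {"$avg": "$cpu_pct"},
        "samples": {"$sum": 1},   # how many raw samples fed this bucket
    }},
    {"$merge": {"into": "vm_stats_hourly"}},
]

db["vm_stats"].aggregate(pipeline)
db["vm_stats"].delete_many({"ts": {"$lt": cutoff}})  # prune the raw samples
```

Run periodically (for example from cron), a job like this keeps the raw collection bounded to about a week of data, which is where most of the retrieval-time improvement comes from.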

I'm probably describing a problem familiar to many people in today's software industry. We're all bombarded with massive amounts of data. The data is full of valuable knowledge, but the sheer volume makes it hard to learn anything from it. Because everyone runs into these problems, though, there's a lot of help on the internet for solving them. I particularly appreciated a pair of posts from the music service SoundCloud: this one, on their troubles with MongoDB (which we also use), and this one, about how they fulfilled their requirements using MySQL.

The moral of both our stories is clear: think about your data, understand your performance bottlenecks, and use the right tool for the job.