The Value of Virtualization

 

You’re probably familiar with many of the arguments for virtualizing your systems. Virtualization can make your systems more secure by reducing the number of applications and users on a single machine. It can make scaling easier, use your resources more efficiently, reduce costs, speed up setup, and shield you from hardware failures, giving you better uptime. VMs have another advantage over conventional servers, though, which is less commonly listed but still pretty important: they’re automatically instrumented.

Let me explain what I mean. Let’s say you’re having some problems running your website on a traditional server. Your traffic has gone up, and now you are having outages during peak times. You speak to your tech support staff, and the admins agree that there’s a problem, but they’re not exactly sure what the cause is.

Usually at this point, the admins will start ‘keeping an eye’ on the system in question. This often means staying logged in and running top or vmstat. If the problem recurs, hopefully the admins will catch it, and the output from those tools will give them hints as to what went wrong. If nobody is around when the problem happens, though, the admins might not get the data they need, and then the process will have to start all over again.

Another solution is to start monitoring the server using a monitoring program like Cacti or Ganglia. This is a little more reliable than manual monitoring because the software won’t get bored or distracted and miss the fault event. But monitoring software has its own problems. It is often a hassle to set up. It requires punching holes in your firewall, making your server less secure. It takes up resources on an already precarious machine, possibly making downtime more likely. And if the problem affects the network, the remote monitoring machine might not be able to communicate with the troubled server to get any useful data at the exact time when the data is needed.

This is where virtualization comes to the rescue. The hypervisor — the software which makes virtualization possible — already has a lot of statistics about the virtual machine. Our Cascade cloud platform automatically gathers such statistics for every VM, storing the data in approximately five-minute increments in our own internal logging database. The data gathering happens in the context of the node, not the VM, meaning that the VM will not see a performance impact from the monitoring. Also, the fact that every VM is already monitored means that if a fault occurs, you won’t have to wait for a second fault to figure out what went wrong. The data to analyze the original fault might already be there.

Let me give you a concrete example. Just today, we had a problem with a customer’s VM: his PHP site went offline, and the VM required a reboot to bring the site back online. The admins had a hunch that the VM was overloaded and couldn’t handle the traffic, but they didn’t know which resource was running low.

Here are some graphs generated by our internal system, Manage, which allowed our admins to get to the bottom of the problem. First, let’s start with a graph of network bandwidth for the VM:

 

This graph illustrates the problem precisely. Around 8:50 PM, the VM stopped serving requests, or at least the amount of data it served dropped precipitously. When admins logged in, they saw this in the kernel logs:

Oct 31 20:58:03 vm1 kernel: INFO: task php:41632 blocked for more than 120 seconds.
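If you want to scan your own logs for this kind of warning, a few lines of Python are enough. The following is only a minimal sketch: the regular expression is written to match the message format shown above, and the function name is made up for illustration.

```python
import re

# Pattern for the hung-task warning, written to match the log line above.
# The greedy \S+ lets task names that themselves contain colons still parse.
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def parse_hung_task(line):
    """Return (task_name, pid, seconds) for a hung-task warning, else None."""
    match = HUNG_TASK.search(line)
    if match is None:
        return None
    return match.group("name"), int(match.group("pid")), int(match.group("secs"))


line = ("Oct 31 20:58:03 vm1 kernel: INFO: task php:41632 "
        "blocked for more than 120 seconds.")
print(parse_hung_task(line))  # ('php', 41632, 120)
```

A warning like this only tells you that a process was stuck waiting, not what it was waiting for, which is why the graphs matter.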

But why did this happen? Maybe the CPU usage for the VM was too high? Well, we can answer this question using our CPU graphs, which display CPU usage data both for the VM as a whole on its node and for each virtual CPU inside the VM:

Sure enough, the VM is pretty busy. It is making good use of all of its virtual CPUs, and its overall load on its node is often over 100%. However, the VM has 4 VCPUs, which means that if CPU were the limiting resource, we would expect the load to approach 400%. Instead, it looks like each VCPU is only about 25% utilized on average. Also, the fault occurred at 8:50 PM, and we don’t see a CPU spike around that time. In fact, CPU usage for some of the virtual CPUs appears to drop around 8:50. VCPU 0, at least, had nothing much to do during the outage.
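The arithmetic behind that reasoning can be sketched in a few lines of Python. This is an illustration rather than anything from our tooling: the function names and the 90% cutoff are assumptions chosen for the example, and it uses the convention from the graphs that one fully busy VCPU counts as 100%.

```python
def per_vcpu_utilization(node_load_percent, vcpu_count):
    """Average utilization of each VCPU, in percent.

    node_load_percent follows the convention used above: one fully busy
    VCPU counts as 100%, so a 4-VCPU VM tops out at 400%.
    """
    if vcpu_count <= 0:
        raise ValueError("vcpu_count must be positive")
    return node_load_percent / vcpu_count


def is_cpu_bound(node_load_percent, vcpu_count, threshold=90.0):
    """True if the average VCPU is at or above `threshold` percent busy.

    The 90% default is an arbitrary cutoff picked for this sketch.
    """
    return per_vcpu_utilization(node_load_percent, vcpu_count) >= threshold


# A load of about 100% spread across 4 VCPUs, as in the example above.
print(per_vcpu_utilization(100, 4))  # 25.0
print(is_cpu_bound(100, 4))          # False
```

An average hides uneven load, of course, which is why the per-VCPU graphs are worth checking too.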

So what could be the problem? For an answer, let’s turn to yet another batch of data we are able to get from the hypervisor: disk statistics.

 

The VM is not a particularly heavy user of disk I/O: mostly steady writes consistent with logging activity, with a few read spikes which might indicate someone searching through the file system or perhaps a scheduled backup. But here’s something interesting: right around the time the VM experienced its failure, swap file usage skyrocketed. Now we know exactly why the VM failed: it ran out of memory, and swap was too slow to keep up with the heavy traffic the VM was handling.
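Spotting a jump like this does not have to be done by eye. Here is a rough sketch of how a spike could be flagged in a series of periodic samples. The sample values are invented for illustration (they are not the customer’s data), and the function name, the factor of 5 and the floor of 50 are all assumptions for the example.

```python
from statistics import median


def first_spike(samples, factor=5.0, floor=50.0, min_history=3):
    """Return the timestamp of the first sample that exceeds both `floor`
    and `factor` times the median of all earlier samples, or None.

    `samples` is a list of (timestamp, value) pairs in time order.
    """
    history = []
    for timestamp, value in samples:
        if len(history) >= min_history:
            # The floor keeps a near-zero baseline from flagging tiny values.
            threshold = max(factor * median(history), floor)
            if value > threshold:
                return timestamp
        history.append(value)
    return None


# Made-up swap usage in MB at five-minute intervals.
swap_mb = [
    ("20:30", 10), ("20:35", 12), ("20:40", 11),
    ("20:45", 14), ("20:50", 900), ("20:55", 1500),
]
print(first_spike(swap_mb))  # 20:50
```

Comparing against the median of earlier samples, rather than a fixed limit, lets the same check work for VMs with very different normal swap usage.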

Getting data like this on a traditional server would have required complex monitoring software, a steady stream of network traffic, a whole other monitoring server, and skilled labor to set the whole thing up. On a Cascade VM, you get this kind of data for free, automatically. You won’t see these exact graphs in LEAP, our customer portal, as they are generated for our internal interfaces only. But the data behind them is also available to LEAP, which will generate much prettier, more usable visualizations that let you easily drill down and explore what’s happening with your virtual machine.

The conclusion to this story is that we increased the amount of memory available to the VM from 4 GB to 8 GB. This required just a quick reboot of the VM, with none of the downtime or stress involved in pulling a physical server out of the racks and opening it up. This solved the customer’s problems, giving him better performance and no outages. So here is yet another way virtualization with Cascade and LEAP makes your life easier.