May 6, 2011

It has been more than three months since we launched Cascade, SingleHop’s IaaS cloud platform.
The reliability and durability of the platform have come a long way since then.
As a result, I’ve had more time recently to focus on optimizing the performance of the virtual machines (VMs) running in the cloud.

For most applications, performance depends on three variables:

- Raw CPU and memory

- Network throughput and latency

- Storage throughput and IOPS

In this first post of a three-part series, I’m going to cover the first of these — raw CPU and memory performance.
I’m going to discuss testing methodology and then show you some benchmarks.
In subsequent posts, I’m going to cover network and storage/disk performance.

How to Measure CPU Performance

CPU is probably the easiest system attribute to benchmark.
All you need is a simple, self-contained number-crunching task to keep the CPU busy.
For this test, I used the cpu benchmarking mode of SysBench.
This test computes prime numbers up to a specific size.
SysBench has the advantage of being in the Debian repositories, so you can easily reproduce it on your own Debian machine without downloading, compiling or installing any additional software.
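If you want to try it yourself, the whole setup is two commands. This assumes a Debian system and the sysbench 0.4 series that Debian shipped at the time:

# Install sysbench from the Debian repositories
apt-get install sysbench

# Run the CPU test with the default single thread
sysbench --test=cpu --cpu-max-prime=50000 run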

The main metric here is the time it takes to complete the test.
The less time it takes to compute the prime numbers, the better the performance of the CPU.

Comparing Node Performance to VM Performance

Our goal is to determine whether, and by how much, the hypervisor (KVM, in our case) cuts into the CPU performance of VMs.
To do this, I compared the CPU performance results directly on the node with the results from a VM running on the same node.

Setup

All of the tests for this and subsequent posts are done on a fairly standard, mid-range X3360 server.

  • Intel(R) Xeon(R) X3360 CPU, 4 cores @ 2.83GHz
  • Super Micro X7SBL-LN2 motherboard
  • 8192 MB DDR2 RAM
  • 64-bit Debian Linux running kernel 2.6.38

The VM used in this section has the following properties:

  • qemu-kvm-0.14.0
  • 8192 MB Assigned RAM, 4 vCPUs (-m 8192 -smp 4,sockets=1,cores=4,threads=1)
  • 64-bit Debian Linux running kernel 2.6.32-5
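Expanded into a full command line, that configuration looks roughly like the sketch below. The disk image path and the network options are placeholders for illustration, not our production settings:

# Rough sketch of the qemu-kvm invocation implied by the settings above;
# the disk and network arguments are hypothetical placeholders.
qemu-kvm -m 8192 -smp 4,sockets=1,cores=4,threads=1 \
    -drive file=/path/to/debian.img,if=virtio \
    -net nic,model=virtio -net user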

I ran all the tests using the following command line:

sysbench --test=cpu --cpu-max-prime=50000 --num-threads=N run

There are four cores on this CPU, which appear to Linux as four separate CPUs.
In Cascade, we always create VMs with as many CPUs as the node appears to have, so the VMs will also appear to have four separate CPUs.
By varying N, the number of threads started by sysbench, we can test the performance of all the cores in parallel.
By starting more than four threads, we can see the effects of CPU contention.
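To sweep all the thread counts in one sitting, you can wrap the command in a small loop and grep out the headline metric, which sysbench reports on a line starting with "total time:". A minimal sketch:

# Run the CPU test at each thread count and extract the total time
for N in 1 2 4 6; do
    echo "threads=$N"
    sysbench --test=cpu --cpu-max-prime=50000 --num-threads=$N run \
        | grep "total time:"
done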

Results

(Rel. is the baseline time divided by the measured time; the baseline is the node in single-user mode.)

Setup                    1 Thread          2 Threads         4 Threads         6 Threads
                         Time (s)   Rel.   Time (s)   Rel.   Time (s)   Rel.   Time (s)   Rel.
Node, single-user mode   85.3706    1.00   42.6902    1.00   21.3610    1.00   21.3474    1.00
Node, normal bootup      85.3440    1.00   42.6977    0.999  21.4187    0.997  21.3956    0.997
VM, single-user mode     85.6905    .996   42.8638    .996   21.6127    .988   21.5839    .989
VM, normal bootup        85.7055    .996   42.8538    .996   21.6071    .987   21.5850    .989

As you can see, KVM imposes virtually no overhead on purely CPU-bound tasks.

Impact of CPU Priority Setting

We provide a simple parameter to let you control the CPU priority of VMs.
This parameter is called “CPU Priority”, and it is, roughly, a proportional allocation setting.
A VM with a priority value twice as high as another VM's should receive twice as much CPU time.
I wanted to verify that this setting works as expected.
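For context, a proportional-share control like this is exactly what the Linux cgroup cpu.shares knob provides. As a generic illustration only (this is a sketch of the kernel mechanism, not necessarily how Cascade implements its setting), two VMs could be weighted 2:1 like so:

# Generic cgroup (v1) sketch of 2:1 proportional CPU sharing;
# not necessarily Cascade's actual implementation.
mkdir -p /cgroup
mount -t cgroup -o cpu cpu /cgroup
mkdir /cgroup/pr100 /cgroup/pr50
echo 100 > /cgroup/pr100/cpu.shares
echo 50  > /cgroup/pr50/cpu.shares
# $PID100 and $PID50 stand for the qemu-kvm PIDs of the two VMs
echo $PID100 > /cgroup/pr100/tasks
echo $PID50  > /cgroup/pr50/tasks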

Setup

I created two VMs: pr100.singlehop.net, with priority set to 100, and pr50.singlehop.net, with priority set to 50.
Each VM had 3 GB of RAM assigned.
On each VM, I created a file called ~/test.sh containing the following:

#!/bin/bash
# N is replaced with the actual thread count before each run
mkdir -p /root/tests
sysbench --test=cpu --cpu-max-prime=50000 --num-threads=N run >> /root/tests/priority.results 2>&1

I then started the CPU test on the two VMs simultaneously.
I did this by making sure the clocks of the VMs were synced up to the same NTP server, and then using the at daemon like so:

at 18:00 -f ~/test.sh

I varied N, the number of threads, from run to run to get the results below.
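Putting the per-VM procedure together, each run looked something like this. The NTP server below is just an example; any common server works, as long as both VMs use the same one:

# Sync the clock, then schedule the test for a common start time
ntpdate 0.debian.pool.ntp.org
at 18:00 -f ~/test.sh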

Results

Threads   pr100.singlehop.net Time (s)   pr50.singlehop.net Time (s)   Ratio (pr100/pr50)
1         85.9227                        85.7653                       1.00
4         32.9604                        43.5084                       .756
6         32.3740                        43.1772                       .750

As you can see, the priority setting works exactly as expected.
With a single thread in each VM, there is no CPU contention, and so both VMs finish at the same time.
With four threads in each VM, pr100 initially gets twice the CPU time of pr50, and so finishes first.
After pr100 is done, pr50 still has half of the test to go, but that remaining half runs twice as fast since there is no longer any contention.
Work it out: if an uncontended run takes time T, pr100 runs at two-thirds speed and finishes at 1.5T, at which point pr50 (running at one-third speed) has completed exactly half its work; the second half at full speed takes another 0.5T, for a total of 2T. The expected ratio is therefore 1.5T/2T = .75, which is just what the table shows.

With 6 threads, the situation is identical.
This further indicates that CPU contention inside the VM is handled efficiently by the hypervisor.

Memory Performance

The other core component of the system we might wish to test is the memory.
The question is, “Is there any overhead or performance penalty to using memory from inside a VM, compared to using it directly on the node?”
We can answer this question using sysbench’s memory testing mode, which simply reads or writes the specified amount of data to RAM.

Here, too, it is worth trying multiple threads: sysbench splits the total amount of data to be read or written among the threads and performs the operations in parallel.

Setup

The setup for this test was the same as for the CPU test: a single running VM with all of the node’s memory assigned.
I ran the test using the following command:

sysbench --test=memory --memory-total-size=150GB --memory-scope=local --num-threads=N --memory-oper=X run

The value for X selects the operation mode, read or write; I tried each of them.
I varied the number of threads, N, to see what effect this has.
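As with the CPU test, a small loop covers every combination. A sketch (note that sysbench takes the operation names in lower case):

# Sweep both operations and all three thread counts
for X in read write; do
    for N in 1 4 12; do
        echo "oper=$X threads=$N"
        sysbench --test=memory --memory-total-size=150GB \
            --memory-scope=local --num-threads=$N --memory-oper=$X run \
            | grep -E "transferred|total time:"
    done
done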

Results

READ (bandwidth in MB/sec; Rel. is relative to the node in single-user mode):

Setup                    1 Thread                    4 Threads                   12 Threads
                         Bandwidth  Time (s)   Rel.  Bandwidth  Time (s)   Rel.  Bandwidth  Time (s)   Rel.
Node, single-user mode   3107.59    49.4274    1.00  1497.71    102.5563   1.00  1341.90    114.4644   1.00
Node, normal bootup      3080.58    49.8607    .991  1418.44    108.2880   .947  1334.18    115.1265   .994
VM, normal bootup        1179.12    130.2671   .379  1258.69    122.0312   .840  1224.09    125.4806   .912

WRITE:

Setup                    1 Thread                    4 Threads                   12 Threads
                         Bandwidth  Time (s)   Rel.  Bandwidth  Time (s)   Rel.  Bandwidth  Time (s)   Rel.
Node, single-user mode   2321.50    66.1642    1.00  1412.96    108.7083   1.00  1331.96    115.3192   1.00
Node, normal bootup      2318.60    66.2469    .999  1406.69    109.1923   .996  1181.07    130.0510   .887
VM, normal bootup        1045.70    146.8874   .450  1233.83    124.4908   .873  1261.16    121.7928   .947

Here, we see the first signs that running a VM might impose some performance overhead.
The memory bandwidth available to a single-threaded task in the VM is only roughly on par with what a multi-threaded task achieves on the node.
This result is difficult to explain without going into a lot of detail about the Linux memory subsystem.

The comparison looks much better for multi-threaded workloads.
With 4 threads the VM pays only a penalty of roughly 10%, and at 12 threads it even appears to beat the node on writes.
By that point, though, measurement noise is beginning to overwhelm the resolution of the test.

Almost no real-world application has a single thread trying to monopolize memory access.
In a busy webserver or DB machine, hundreds of threads are competing for memory access at all times.
The performance overhead from KVM appears exactly like a few extra threads in the fray — significant if there is only one thread of real work, but irrelevant in real-world applications.

Conclusion

As the numbers above demonstrate, KVM imposes almost no overhead on raw CPU and memory performance.
Moreover, the mechanism we’ve built to apportion CPU time among VMs works as designed.

Things will become less clear-cut in the next two installments of this series, on network and disk performance.
Until then, rest assured that if you want your cloud to do a lot of number-crunching, you can’t do much better than Cascade.
