RAID Arrays and Fault Tolerance Demystified

RAID arrays are a powerful technology that can improve disk performance and keep a system online through a drive failure. However, they are often misused, usually because of confusion about what RAID is actually for. Let's first examine some common scenarios.

Home Server Scenario

Imagine you were convinced by a box store retailer that RAID1 was a backup solution. You purchase a PC with RAID1 on a couple of matching drives. You do not update the RAID firmware, ever. After a year of regular use the RAID array reports as Degraded. You do nothing, confident that degraded performance happens to any piece of hardware that has been used for any length of time. Lightning strikes the house and the PC is not on a surge protector. A TV in the same room now has a permanent watermark of a sailboat. But the PC boots, so you do not buy a surge protector because, what are the odds, right? The RAID error prompt is getting annoying every time you reboot, so you decide to swap a drive and assume all is well. Another year passes. Suddenly, the array is no longer detected. You are met with a lonely blinking cursor on boot.

Business Server Scenario

Your website is down and your customers are confused. You contact your hosting company. They report that your server crashed and your trusty RAID array is no longer detected. How could this have happened? Isn’t RAID supposed to guarantee uptime? You realize your last backup was taken months ago. What do you tell your customers?

Unfortunately, this is common because there is confusion regarding what RAID is and how to maintain an array.

In the beginning, there was tape.

In ’87, researchers at Berkeley realized that CPUs were going to keep getting faster and saw an I/O bottleneck on the horizon. The large disk drives then used for data storage were expensive, so they came up with a way to increase performance and power efficiency using arrays of smaller, cheaper ones. RAID took off during a redundancy demonstration: they yanked a drive from a live system and the thing kept running.

Redundant Array of Inexpensive Disks

Hardware RAID avoids a performance hit on the CPU and is easier to configure and maintain. SingleHop's new standard RAID card is the LSI 6Gb/s 9260-4i/8i. If you have enough drives to flood the card, you can reach ridiculous speeds of up to 2510 MB/s read and 3005 MB/s write (source). Here are some of the features built into this card:

Adaptive Read: This feature should be enabled regardless of your RAID type. It uses an algorithm to decide whether reading ahead to the next sector will increase performance.

Write Policy: When an OS sends a write command to a disk, it waits for the controller to confirm, “yes, I guarantee this has been written to the disk and I’m ready for new instructions”. This is called Write-Through. You can also take advantage of write caching (usually labeled Write-Back in controller settings). With write caching enabled, the RAID controller tells the OS, “I guarantee I will take care of this and I’m ready for new instructions”. The controller then performs the slow operation of writing to disk while accepting new instructions into the cache. This is risky if you lose power and don’t have a battery backup unit (BBU), because anything still in the cache is lost. With a BBU, the RAID controller retains the information to be written to disk until power is restored.
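
The difference between the two write policies can be sketched with a toy model. The class and method names below are hypothetical and do not correspond to any vendor's API. The sketch only assumes that a caching controller acknowledges a write once it sits in cache, and that a BBU preserves that cache across a power loss.

```python
class ToyController:
    """Toy model of a RAID controller's write policy (illustrative only)."""

    def __init__(self, write_cache, has_bbu):
        self.write_cache = write_cache  # False means Write-Through
        self.has_bbu = has_bbu
        self.disk = {}    # blocks that are safely on disk
        self.cache = {}   # blocks acknowledged but not yet on disk

    def write(self, block, data):
        if self.write_cache:
            # Acknowledge immediately; the slow disk write happens later.
            self.cache[block] = data
        else:
            # Write-Through: acknowledge only after the data is on disk.
            self.disk[block] = data
        return "ack"

    def power_loss(self):
        # Without a BBU, the contents of the cache are gone.
        if not self.has_bbu:
            self.cache.clear()

    def power_restored(self):
        # Flush whatever survived in the cache to disk.
        self.disk.update(self.cache)
        self.cache.clear()


def blocks_surviving_outage(write_cache, has_bbu):
    """Write three blocks, cut the power, and count what reaches the disk."""
    controller = ToyController(write_cache, has_bbu)
    for block in range(3):
        controller.write(block, b"data")
    controller.power_loss()
    controller.power_restored()
    return len(controller.disk)
```

In this model, Write-Through and cached writes with a BBU both keep all three blocks, while cached writes without a BBU lose every acknowledged block that had not yet been flushed.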

Write caching can be exploited even further by using a Solid State Drive (SSD). The RAID controller can dedicate the SSD to holding the most frequently used data, instead of going to the main disks for it. This is especially useful with larger arrays to maximize performance and extend the life of the main storage disks. This hybrid use of SSDs, called CacheCade, yields the best performance-to-cost ratio.

Direct I/O Read Cache Policy: Direct I/O avoids data corruption by reading from the disk every time. If you use Cache I/O instead, the controller reads from the disk the first time, but any subsequent requests for the same data come from the cache if it’s still there. We would rather have accurate data than the wrong data delivered slightly faster.

Software RAID emulates a dedicated RAID controller card. The controller instructions are performed by the CPU. High-performing CPUs have a few advantages over some lower-end RAID cards, such as threaded rebuilds. However, if you can afford a RAID card, your CPU will thank you.
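
To give a feel for the kind of work a CPU takes on with software RAID, here is a minimal sketch of XOR parity, the calculation behind parity-based RAID levels such as RAID 5. The article does not say which RAID level is in use, so treat this as an illustration and not as a description of any particular setup. The function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing block from the survivors plus the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three data blocks, as if striped across three drives.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)  # stored on a fourth drive

# Pretend the second drive died and rebuild its block from the rest.
recovered = rebuild_missing([data[0], data[2]], parity)
```

A rebuild repeats this calculation for every stripe on the array, which is why it competes with normal I/O for CPU time when no dedicated card is doing the work.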

VMware makes virtualization software that lets a hypervisor run virtual machines (VMs) across multiple physical servers, or hosts. VMware has a feature called Fault Tolerance (FT). It protects your virtual machine from hardware failure by mirroring everything that happens to a VM on a separate physical server. Using VMs does make it easier to take and restore backups. But remember that, just as with a RAID array, if you accidentally delete files or your application crashes, the change propagates to the secondary VM.
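
The reason mirroring is not a backup can be shown in a few lines. This is a toy model with hypothetical names, not VMware's or any RAID controller's actual behavior in detail. It only assumes that a mirror applies every operation to both copies, while a backup is a separate point-in-time copy.

```python
import copy


class MirroredStore:
    """Toy mirror: every operation is applied to both copies."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}

    def write(self, name, data):
        self.primary[name] = data
        self.secondary[name] = data

    def delete(self, name):
        # The mirror faithfully replicates the deletion too.
        self.primary.pop(name, None)
        self.secondary.pop(name, None)


def accidental_delete_demo():
    store = MirroredStore()
    store.write("orders.db", b"customer orders")

    # A backup is a separate point-in-time copy taken before the mistake.
    backup = copy.deepcopy(store.primary)

    store.delete("orders.db")  # the accidental deletion

    # Returns (is the file still on the mirror?, is it still in the backup?)
    return ("orders.db" in store.secondary, "orders.db" in backup)
```

Here the file is gone from the mirror but still present in the backup, which is the whole argument for keeping both.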

If you would like to learn more about how RAIDs can work for you, then contact a sales rep about it today.

I have assembled some facts as well as some solutions to common problems.

RAID Facts

RAID is not a backup solution. Yes, it may be possible to salvage data from a mirrored array, but it’s a time-consuming process and data integrity is never guaranteed.

RAID will not reduce file system checks or OS-related issues, and it will not protect you from file corruption, accidental deletion, or malicious software.

Each drive holds information about what type of RAID it belongs to on a special partition created during initialization.

If you use a BBU (battery backup unit), you can safely enable write cache, which will improve disk performance.

The RAID management software can send you an email if it detects a problem.

Real world speeds vary depending on the transaction behavior of the data involved, the quality of the hardware, and even the physical topology of the data on the disk (performance is best when handling sequential data located in the outer rim of the disks).

You will see a performance reduction during a RAID rebuild, because the controller is handling normal instructions as well as the rebuild instructions. If possible, conduct the rebuild during off-peak hours, but as soon as possible.

Updating your RAID firmware can increase performance, reliability and rebuild rates.

There are very specific pros and cons for each RAID type. Use the one that suits your needs.

Some common RAID problems and solutions

  1. Drive failure – Some drives fail immediately due to manufacturing flaws, others years down the road. Automate monitoring. Most RAID cards and software RAIDs support email notification when a problem is detected. Detecting the problem early allows you to swap a drive during off-peak hours (this will relieve performance headaches during the rebuild). Monitor the rebuild to completion.
  2. RAID fails to rebuild – Check the health of the drives. The source or target may be running into too many bad sectors. If the source is bad, you can attempt to clone the drive and initiate another rebuild. Cloning is a long shot. To increase your chances of success, use a bit-for-bit (1-to-1) cloning method and a target drive of the same brand and exact size. Even then, the clone may just copy the same corrupted image to the healthy disk. In that case you are left with creating a new RAID and restoring from backups.
  3. Device error – Update the RAID firmware and/or swap the drive.
  4. RAID array not detected, or missing RAID partition – Update the firmware. If the issue persists on a mirrored array, boot a single drive without RAID and copy the data to a new array. Otherwise, restore from backups.
  5. FSCK it – File system checks are common after rebuilds, clones, and other disk maintenance. Let FSCK or CHKDSK processes finish completely. Schedule quarterly FSCKs during off-peak hours to prevent downtime during peak hours.
  6. BBU failure – Set your BBU policy to a protection duration that makes sense. How long will this server be without power? You can extend the life of your battery (by years) by changing the interval from 72 hours to 24 or 12. Take advantage of data centers like SingleHop that have backup generators to limit BBU replacements to every 5 or 6 years (versus annually).
  7. Unable to update RAID firmware – Update the RAID management software and then reattempt the RAID firmware flash. I’m not kidding, this is a thing.
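
As a sketch of the "automate monitoring" advice in item 1, the following assumes a Linux software RAID (md) setup, where the kernel reports array status in /proc/mdstat. Hardware cards use their own management tools instead, so this does not apply to them. The function names and the sample text are illustrative. Hooking the result up to email, for example with Python's smtplib, is left out.

```python
import re

# An array header line looks like: "md0 : active raid1 sdb1[1] sda1[0]"
ARRAY_LINE = re.compile(r"^(md\S*)\s*:")
# The status line ends with "[total/active] [UU_]"; "_" marks a missing member.
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that are missing one or more members."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_LINE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS.search(line)
        if status and current:
            total, active = int(status.group(1)), int(status.group(2))
            if active < total or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


def check_system(path="/proc/mdstat"):
    """Read the live status file (Linux md only) and report degraded arrays."""
    with open(path) as handle:
        return degraded_arrays(handle.read())


# Made-up sample in the usual /proc/mdstat layout: md1 has lost a member.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdc1[0]
      488254464 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

sample_result = degraded_arrays(SAMPLE)
```

Run from cron or a similar scheduler, a check like this would have flagged the Degraded array in the home server scenario long before the second drive failed.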

RAID and Fault Tolerance are not backup solutions. Use cloud storage, R1soft, or a separate local disk for backups. Use VMware Fault Tolerance if your VM needs to stay online in the event of a host failure. Schedule backups at regular intervals and check your storage health quarterly. RAID can increase performance and keep your system online in the event of a drive failure. Know the facts about RAID, automate your monitoring, and be prepared to take action when the time comes.