Storage Functional Testing

I’ve spent a fair amount of time in the last couple of months testing storage. At $job, we have some very interesting datasets and workflows that result in large bursts of intense I/O, which can cause latency issues when multiple databases attempt to write to disk at once. In the physical server world, these workloads were isolated and didn’t impact each other. In the virtual server world, shared storage can become a bottleneck that impacts many systems at once. That’s why storage selection is critical in our environment.
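
To make that burst pattern concrete, here’s a minimal sketch of what the contention looks like: several writer processes hammering the same datastore at once while timing each synchronous write. The directory, sizes, and worker count are placeholders I made up for illustration; point it at a test disk, never at production data.

```python
# Minimal sketch of bursty concurrent writes, assuming a hypothetical
# test datastore mounted at TARGET_DIR. Not our actual test harness.
import os
import time
from multiprocessing import Process

TARGET_DIR = "/mnt/test-datastore"   # hypothetical mount point
WRITE_SIZE = 1024 * 1024             # 1 MiB per write
WRITES = 256                         # writes per worker

def writer(worker_id: int) -> None:
    """Issue synchronous writes and report the slowest one."""
    path = os.path.join(TARGET_DIR, f"burst-{worker_id}.bin")
    buf = os.urandom(WRITE_SIZE)
    worst = 0.0
    with open(path, "wb") as f:
        for _ in range(WRITES):
            start = time.perf_counter()
            f.write(buf)
            f.flush()
            os.fsync(f.fileno())     # force it to disk, not the page cache
            worst = max(worst, time.perf_counter() - start)
    print(f"worker {worker_id}: worst write latency {worst * 1000:.1f} ms")

if __name__ == "__main__":
    # One worker is usually fine; eight at once is where shared storage
    # starts to show the latency spikes described above.
    workers = [Process(target=writer, args=(i,)) for i in range(8)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```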

We have a mix of storage arrays, primarily iSCSI with some Fibre Channel and a small amount of NFS. At this point, my favorite storage arrays are the Nimble Storage arrays. We have four of the CS240G-X4 models, two in each of two data centers. The performance is stellar, the analytics are awesome, and the reliability on every front is unmatched in our environment. I’ve done firmware updates under load in our DR environment with absolutely zero downtime and no noticeable performance impact. They just work… <shameless plug over>

Now for the main topic… when testing storage, what should we do to validate a new array, particularly one from a new-to-us vendor, and one that isn’t as mature as EMC, NetApp or HP? I deal with a VMware-based environment, so that’s my focus, but most of this directly applies regardless of environment.

I’ll outline my steps…

  1. Manual Controller Failover Testing – both quiesced and under heavy load (the latency probe sketched after this list is what I leave running here)
  2. Controller Software Upgrades – this should be done under load
  3. Simulated Drive Failures – this applies to both spinning disk and flash
  4. Power Supply Removal – they say it is hot swap, so test it before you have to bet your data on it
  5. Forced Controller Failure – remove one and see what happens
  6. Network Connectivity – start pulling a cable or two while under load
  7. Multipath Testing – know how things work in the recommended configuration (MRU vs ALUA) and what happens when the opposite is used (the PSP sketch at the end of this post shows how I flip these)
  8. Performance – Iometer is very useful, not only to generate load for the aforementioned testing, but also to find the performance limits
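
For the failover, cable-pull, and forced-failure tests above, the useful thing is a continuous probe that makes any latency blip visible while you break things. Here’s a minimal sketch of the kind of probe I mean: one small synchronous write per second, flagging anything slow. The file path and threshold are assumptions for illustration; run it from a guest whose disk lives on the array under test.

```python
# A minimal latency probe to leave running during controller failovers,
# cable pulls, and firmware upgrades. PROBE_FILE and THRESHOLD_MS are
# placeholders; adjust for your environment.
import os
import time

PROBE_FILE = "/mnt/test-datastore/probe.bin"  # hypothetical test file
THRESHOLD_MS = 100.0                          # flag anything slower than this

def probe_forever() -> None:
    buf = os.urandom(4096)
    with open(PROBE_FILE, "wb") as f:
        while True:
            start = time.perf_counter()
            f.seek(0)
            f.write(buf)
            f.flush()
            os.fsync(f.fileno())              # make sure the array sees it
            elapsed_ms = (time.perf_counter() - start) * 1000
            stamp = time.strftime("%H:%M:%S")
            if elapsed_ms > THRESHOLD_MS:
                print(f"{stamp}  SLOW write: {elapsed_ms:.1f} ms")
            else:
                print(f"{stamp}  ok: {elapsed_ms:.1f} ms")
            time.sleep(1)

if __name__ == "__main__":
    probe_forever()
```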

Storage is physical and involves a lot of interconnects, so there are many places where issues can appear. Test them all before you put production-critical data on the array, and know the performance impact when something fails.

When testing, I use the I/O Analyzer fling from VMware. Deploy it, configure the IP, and start running tests. You can deploy a fairly large number of them on one host, or spread them across many hosts. The options are practically endless.

I’ll go into more detail on how I do performance testing at some point in the future. I know my process isn’t perfect, but I try to know what the limitations are and test the relevant settings (NMP, queue depth, etc.) to understand the impact.
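
As a taste of what I mean by testing NMP settings, here’s a hedged sketch of flipping the path selection policy on a single test LUN and re-running the same workload under each policy. The host name and naa device ID are placeholders, and it assumes SSH is enabled on the ESXi host; this covers only the PSP half, and queue depth tuning I’ll save for that future post.

```python
# Sketch: compare path selection policies on one test LUN by switching
# the PSP via esxcli over SSH. ESX_HOST and DEVICE are hypothetical.
import subprocess

ESX_HOST = "root@esx-test-01"     # hypothetical ESXi host
DEVICE = "naa.0000000000000000"   # hypothetical device ID of a test LUN

def esxcli(args: str) -> str:
    """Run an esxcli command on the host over SSH and return its output."""
    result = subprocess.run(
        ["ssh", ESX_HOST, f"esxcli {args}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def set_psp(psp: str) -> None:
    """Switch the path selection policy, e.g. VMW_PSP_MRU or VMW_PSP_RR."""
    esxcli(f"storage nmp device set --device {DEVICE} --psp {psp}")

if __name__ == "__main__":
    for psp in ("VMW_PSP_MRU", "VMW_PSP_RR"):
        set_psp(psp)
        print(esxcli(f"storage nmp device list --device {DEVICE}"))
        # ... kick off the same Iometer / I/O Analyzer run here and
        # record the results for this policy before moving on ...
```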

-David
