Tuning an HP Smart Array P400 with Linux – why tuning really matters…
Today I had to do some tuning on an HP Smart Array P400 controller with 8x 300GB 10K RPM SAS HDDs. It had already been determined that this controller was *really*, *really* bad at RAID5. The system needed decent performance, so we decided to use RAID10. We set the cache ratio to 50/50 and used 256k stripes. The controller already had its battery-backed write cache enabled.
This system was running Red Hat Enterprise Linux 5.3 with additional support for the XFS file system from the "extras" repository of CentOS. The XFS file system was set up to match the stripe of the RAID10, and then we mounted it with the following options:
Sometimes you have all the time in the world to test out every scenario, but this wasn’t one of those times and I had to turn this around fairly quickly. So, I decided just to run a select few test cases with the iozone benchmark (my preferred benchmark tool). These are the specific test cases I ran:
iozone -b results.xls -r 4m -s 8g -t 6 -i 0 -i 1 -i 2
For those not familiar with iozone, the "-i" flag selects the test, with the numbers meaning:
0 : sequential write/re-write test
1 : sequential read/re-read test
2 : random read/write test
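For anyone wanting to repeat the run with other sizes, the invocation above can be rebuilt from named settings. This is only a sketch: the variable and function names are made up for illustration, and the flag descriptions in the comments follow my reading of the iozone documentation (-b Excel output file, -r record size, -s file size, -t thread count).

```shell
#!/bin/sh
# Hypothetical wrapper around the iozone command used in this post.
# Variable names are illustrative; the flags are the ones shown above.
OUTPUT="results.xls"   # -b: Excel-compatible results file
RECORD_SIZE="4m"       # -r: record (I/O request) size
FILE_SIZE="8g"         # -s: test file size
THREADS=6              # -t: number of threads (throughput mode)

build_iozone_cmd() {
    # -i 0, -i 1, -i 2 select the write, read, and random tests
    echo "iozone -b $OUTPUT -r $RECORD_SIZE -s $FILE_SIZE -t $THREADS -i 0 -i 1 -i 2"
}

# Print the command for review instead of running the benchmark right away.
build_iozone_cmd
```

Printing the command first makes it easy to double-check the sizes before committing to a long benchmark run.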
So, the first order of business was to get a "baseline" run to see where we were at:
initial write: 299MB/s
read: 123MB/s
re-read: 125MB/s
random read: 108MB/s
random write: 306MB/s
What immediately concerned me here were the read speeds (both sequential and random); those are pretty bad numbers for 8x HDD in RAID10! In particular, this system needed both sequential and random reads to be fast.
Whenever working with Linux on RAID controllers, the first thing I like to try is changing the I/O scheduler to ‘noop’. Basically, Linux has a modular I/O scheduler architecture and you can choose from 4 different options. The default in RHEL5 is ‘cfq’, which is actually pretty good in many cases. But if you’re on a RAID controller, sometimes it’s better to let the hardware take care of the I/O intelligence, and that’s where ‘noop’ comes into play. You can change the I/O scheduler via the /sys filesystem:
echo noop > '/sys/block/cciss!c0d1/queue/scheduler'
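To confirm the change took effect, the same file can be read back: the kernel lists all available schedulers and marks the active one with square brackets. Below is a minimal sketch of that check. The helper name active_scheduler is made up for illustration, and the device path is simply the one from the command above, so adjust it to the logical drive you are tuning.

```shell
#!/bin/sh
# Extract the active scheduler (the entry in square brackets) from the
# contents of a queue/scheduler file, e.g. "noop anticipatory deadline [cfq]".
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Assumed device path; single quotes keep an interactive shell from
# treating the "!" as history expansion.
SCHED_FILE='/sys/block/cciss!c0d1/queue/scheduler'

if [ -r "$SCHED_FILE" ]; then
    echo "active scheduler: $(active_scheduler "$(cat "$SCHED_FILE")")"
else
    echo "no scheduler file at $SCHED_FILE (different device name?)"
fi
```

Note that a value written through /sys does not survive a reboot, so the check is worth repeating after the machine restarts.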
Re-running the iozone tests above with ‘noop’ yielded some good improvements in both sequential and random read speeds. There was also a small gain in initial write speed, but the other write tests actually lost a little speed. Still, the loss in some write tests seemed worth the trade-off considering the gains:
initial write: 338MB/s
read: 201MB/s (from 123MB/s !!!)
re-read: 200MB/s (from 125MB/s !!!)
random read: 206MB/s (from 108MB/s !!!)
random write: 255MB/s
Looks like we’re headed in the right direction and gaining back some read speeds, even at the slight cost of some write speed tests.
To focus on tuning for read speeds, the next thing to tune is Linux’s read-ahead cache. By default, this is set to 256k, but in my experience that is too little for every RAID controller I’ve worked with. Normally, you would incrementally increase this value and re-run the benchmark to see the gains (or losses). But I’ve been doing quite a bit of Linux tuning lately, and experience has told me that 8MB to 32MB is usually the range worth exploring. Also, I wasn’t given a lot of time to turn this around, so my first test was to increase the read-ahead cache to 8MB:
/sbin/blockdev --setra 8192 /dev/cciss/c0d0
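One detail worth double-checking when picking the number: according to the blockdev man page, --setra takes its value in 512-byte sectors, not in kilobytes. Under that reading, 8192 sectors works out to 4 MiB, and a full 8 MiB would need 16384. The two helpers below are a small sketch of the conversion; their names are made up for illustration.

```shell
#!/bin/sh
# Convert a read-ahead size in MiB to the 512-byte sector count
# that blockdev --setra expects.
mib_to_sectors() {
    echo $(( $1 * 1024 * 1024 / 512 ))
}

# Convert a sector count (as reported by blockdev --getra) to KiB.
sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

echo "8 MiB        = $(mib_to_sectors 8) sectors"
echo "8192 sectors = $(sectors_to_kib 8192) KiB"
```

The current setting can be read back with /sbin/blockdev --getra on the same device and fed through sectors_to_kib to see what is actually in effect.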
So, with ‘noop’ and the larger read-ahead cache, here are the results:
initial write: 335MB/s
read: 324MB/s (from 123MB/s !!! +163%)
re-read: 325MB/s (from 125MB/s !!! +160%)
random read: 417MB/s (from 108MB/s !!! +286%)
random write: 256MB/s
Pretty impressive gains on the read speeds. I also tried using 16MB of read ahead cache, but it didn’t really yield significant gains, so I decided that 8MB was enough.
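The relative gains can be recomputed from the throughput figures quoted above. Here is a minimal sketch using integer shell arithmetic; the function name pct_gain is made up for illustration. It reports the increase over the baseline, so a result that is about 2.6 times the baseline comes out as roughly +160%.

```shell
#!/bin/sh
# Percentage increase from a baseline to a new value, rounded to the
# nearest whole percent using integer arithmetic only.
# Usage: pct_gain BASELINE_MBPS NEW_MBPS
pct_gain() {
    echo $(( (($2 - $1) * 100 + $1 / 2) / $1 ))
}

echo "read:        +$(pct_gain 123 324)%"
echo "re-read:     +$(pct_gain 125 325)%"
echo "random read: +$(pct_gain 108 417)%"
```

The same helper works for the intermediate ‘noop’-only run, for example pct_gain 123 201 for sequential reads.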
Just two simple tuning parameters that can gain quite a bit! Here’s a chart of the results: