Spread it and forget

By Nigel Poulton | December 7, 2006

I'm really quite keen to find out what some of you guys out there are doing, seeing and even recommending.  As for me, I'm starting to see more and more storage setups where the design, if you can call it that, is simply to spread your data and workload over as many array resources as possible and hope for the best.

First, let me explain what I mean – let's assume you have a new storage array that you are going to run some Oracle databases on.  Instead of planning in some workload isolation, where some of your array resources – such as host ports, RAID controllers and even disks – are set aside for different workload types or performance requirements, I'm seeing more and more people opting to simply spread all of their data over as many array resources as possible and assume they will never have to think about it again.

Of course, the principle of load balancing across as many array resources as possible is a solid foundation for good performance.  However, it's not the be-all and end-all of a good storage design.  It certainly doesn't mean that you can remove your brain when managing your array, and that's what I'm seeing a lot of people do.

Let me give some examples of what I mean –

Probably one of the most common examples of storage-related load balancing these days is using a host-based volume manager to stripe a logical volume over multiple physical volumes – usually the more the merrier!  So we might stripe our database files over 4 disk groups, which in turn sit behind 4 RAID controllers – good design so far!  However, our storage array only has 4 RAID controllers, so we also stripe our log files over the same 4 disk groups behind the same 4 RAID controllers, and later on we might throw a few filesystems on top for good measure – still a good design???
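To make that concrete, here's a minimal sketch in Python – the volume names and the four-disk-group layout are made up for illustration – of how striping every host volume round-robin across the same disk groups quietly piles unrelated workloads onto the same RAID controllers:

```python
from collections import defaultdict

# Hypothetical illustration: every host volume is striped round-robin across
# the same four disk groups, each owned by one RAID controller.
DISK_GROUPS = ["DG1", "DG2", "DG3", "DG4"]   # one per RAID controller

def stripe(volume_name, num_stripes):
    """Map a volume's stripe columns onto disk groups, round-robin."""
    return {f"{volume_name}-stripe{i}": DISK_GROUPS[i % len(DISK_GROUPS)]
            for i in range(num_stripes)}

layout = {}
layout.update(stripe("oracle_datafiles", 8))     # mostly random I/O
layout.update(stripe("oracle_redologs", 4))      # mostly sequential I/O
layout.update(stripe("general_filesystems", 4))  # a bit of everything

# Every disk group (and therefore every controller) now services a blend of
# random and sequential work -- nothing is isolated.
per_group = defaultdict(list)
for stripe_col, dg in layout.items():
    per_group[dg].append(stripe_col)
for dg, cols in sorted(per_group.items()):
    print(dg, "->", cols)
```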

We know that disks are not at their happiest when you mix random workloads with sequential workloads, as this tends to cancel out a disk's natural ability to efficiently service nice long sequential I/O and use its buffers optimally.  And although tagged command queueing allows a disk to somewhat reorder its work for optimal head movement, it still introduces situations where the disk is not operating under ideal circumstances.

Most RAID controllers will also optimise resource usage depending on the type of workload, such as preloading cache with read-ahead data if they are working on sequential I/O.  So constantly asking them to switch between sequential and random workloads doesn't get the best out of them either.  Once you mix in enough random with your sequential it all starts to look random – and disks don't like random.
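As a rough illustration – this is a toy detector, not any vendor's actual algorithm – here's the kind of sequential-stream detection a controller might use to trigger read-ahead, and how interleaving random I/O on the same LUN stops it from ever firing:

```python
import random

# Toy sequential-stream detector (purely illustrative): trigger read-ahead
# once we see SEQ_THRESHOLD contiguous requests in a row.
SEQ_THRESHOLD = 4

def count_prefetch_triggers(lbas, io_size=8):
    triggers, run, last = 0, 0, None
    for lba in lbas:
        run = run + 1 if last is not None and lba == last + io_size else 1
        if run == SEQ_THRESHOLD:
            triggers += 1          # the controller would start prefetching here
        last = lba
    return triggers

random.seed(0)
sequential = [i * 8 for i in range(64)]            # one clean sequential stream
mixed = []
for lba in sequential:                             # same stream, but with a
    mixed += [lba, random.randrange(0, 10**6, 8)]  # random I/O between each request

print("clean stream prefetch triggers:", count_prefetch_triggers(sequential))
print("mixed stream prefetch triggers:", count_prefetch_triggers(mixed))
# The clean stream gets detected; the mixed stream just looks random.
```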

Similar things could be said about load balancing I/O across multiple HBAs in a server.  It might seem like a no-brainer to implement round-robin load balancing, or some of the flashier variations of round-robin.  But many modern arrays run effective learning algorithms that can detect different workload types – only if those I/O patterns aren't cannibalised before they arrive at the array, though.
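Here's a quick sketch of that effect, assuming a hypothetical server with two HBAs feeding two array ports: plain round-robin path selection chops one contiguous host stream into gappy, strided streams as seen by each port.

```python
# Hypothetical two-path round-robin: consecutive I/Os alternate HBAs,
# so each array port only ever sees every other block of the stream.
sequential_lbas = [i * 8 for i in range(12)]   # one contiguous host stream

paths = {"hba0/port0": [], "hba1/port1": []}
for i, lba in enumerate(sequential_lbas):
    port = "hba0/port0" if i % 2 == 0 else "hba1/port1"
    paths[port].append(lba)

for port, lbas in paths.items():
    gaps = {b - a for a, b in zip(lbas, lbas[1:])}
    print(port, lbas, "-> gaps between requests:", gaps)
# The host issued a perfectly contiguous stream, but each port sees a strided
# pattern, which can defeat the array's sequential detection.
```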

I also see setups where people have been told by their vendor/supplier that their performance will be fine as long as they don't attach more than 10 hosts per array port.  Then when I get brought in and look at the performance stats, I see that some ports are sitting almost idle while others are running flat out.

One last one – there are the natural side effects of the zoned data recording techniques used on fixed block architecture disks, where there are often up to twice as many sectors on the outer tracks of a disk as on the inner tracks.  Obviously data placed on the outer tracks can be accessed more quickly, especially in nice sequential accesses.  I don't see many places where people even consider this ancient but excellent optimization trick.
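To put rough numbers on it – these figures are purely illustrative, not taken from any particular drive spec – at a constant rotational speed, a track with twice as many sectors streams roughly twice the data per revolution:

```python
# Back-of-the-envelope illustration only (not a real drive spec).
rpm = 10_000
revs_per_sec = rpm / 60                      # ~166.7 revolutions per second
sector_bytes = 512

inner_sectors_per_track = 500
outer_sectors_per_track = 1000               # "up to twice as many" sectors

for zone, sectors in [("inner", inner_sectors_per_track),
                      ("outer", outer_sectors_per_track)]:
    mb_per_sec = sectors * sector_bytes * revs_per_sec / 1e6
    print(f"{zone} zone sequential rate: ~{mb_per_sec:.0f} MB/s")
# Same spindle speed, same heads -- but the outer zone moves ~2x the data past
# the head per revolution, which is why hot sequential data likes the outer tracks.
```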

Good storage design appears to be a dying art!

I'm not slamming this "spread it as wide as possible" approach – it's very popular these days and certainly has its merits.  I for one am quite a big fan of having my LUNs touch as many disks as possible.  I also see that this approach makes it much simpler to manage storage, especially in dynamic environments.  But how can you predict performance in this type of environment?  Surely many of the factors mentioned above must also be considered – I would never just take an array with, for argument's sake, 100 disks installed, spread all of my LUNs across all 100 disks, and assume that because I'm spreading over all disks I'm getting the best out of my array and don't have to think about anything else.

I must admit that in many of the environments I see, the storage staff are often Unix guys who know a bit about the array's tools and how to present a LUN, but lack important knowledge about the way their arrays work under the hood.

If you've made it this far, thanks and well done.  As always, thoughts are welcome – even ones slating my opinions.

Nigel

PS.  I know that truly sequential workloads are the stuff of fairy tales for most people.  And heck, why should I care anyway – I, and I'm sure "we", get paid to go in and fix bad designs 😉

5 thoughts on “Spread it and forget”

  1. JM

    I agree that ideally there’s a balance to be struck between spreading and isolating workload. However, I would argue that at some point, when you end up managing enough storage it becomes impractical (and expensive!) to play the isolation game. Everyone thinks their application is more critical than the next and should be isolated, DBAs want separate spindles for everything, etc. If you’re dealing with a single application or just a handful of applications and one or two storage arrays, fine tuning is probably worth your time, but when you move into larger environments with hundreds of applications the “remove your brain” method of managing storage becomes a lot more attractive.

    Array vendors also seem to be heading this direction. Look at 3Par who spreads LUNs of differing RAID protection across the same spindles. HP EVA spreads all LUNs of the same RAID level across as many disks as you ask it to. Pillar will not only spread your IO across as many disks as it can, but it’ll try to optimize your IO by placing high QoS LUNs on outer tracks of disks. They’re all trying to do this work for us, and I for one appreciate it. The geek in me wants to tune every app by hand, but it can be a monumental task. There will always be exceptions for highly critical or IO intensive apps, but in general I’m a proponent of spreading IO as far and wide as possible and letting your arrays do the best job they can.

  2. Liho

    It's possible to create disk groups in EVAs to spread data over all disks in a dedicated disk group. The minimum number of disks in such a group is 8. An HP engineer told me that customers typically use only one or two groups. So it's possible to isolate workloads, but nobody does it.

    I also like this idea of spreading similar workloads over common disks, but here is one more issue. The chunk size is 2MB in the case of EVAs. Thus such striping is good for Oracle and similar databases only. Remember Oracle's SAME strategy (Stripe And Mirror Everything), where the recommended chunk size was 1MB… I've heard that other applications became slower after migration to EVA.

    EVAs, like other modular storage arrays, don't offer many opportunities to isolate workloads: you can only use different controllers and different disk groups.

  3. Nigel Poulton

    Thanks for the great comments guys.

    JM, I agree that the isolation game is a non-starter in large environments. Many of the environments I've worked in recently are very large, and one of the commonalities in these environments is that nobody seems to have a clue what storage requirements are around the corner – you can turn up at work one day and find a request to add 10 new servers with 800GB each that needed to be done yesterday 😉 That's one of the things I was getting at when I said “this approach makes it much simpler to manage storage, especially in dynamic environments”.

    I've also worked quite a lot with EVAs in the past and initially did try to separate workloads onto different spindles – one big disk group for random and a few smaller 8-disk (single RSS) disk groups for sequential. But because of the limited architecture of the controllers, I soon stopped doing this and just went for the single large disk group approach.

    I remember first using a Compaq EVA and thinking that it was a breath of fresh air and ahead of its time with some of its technology.

  4. Jesse (SanGod)

    You know it's hilarious – I've heard this story time and again, and I think it all comes down to one basic premise.

    People just *LOVE* to overmanage their disks.

    In the EMC world, people talk about wanting to know what physical disk each hypervolume is on, but when you start working with 8- and 12-way metavolumes, you're talking 16-24 back-end disks for each LUN presented forward. (In a RAID-1 configuration, that is – actually more if you're talking RAID-5.)

    Either way, in a cached array the only times you're dependent on back-end performance are:

    A> Random reads – unfortunately there is no way around a random read. It's going to wait for the disk. But any intelligent array can pick and choose which disks are read from based on the activity level of the mirror pairs.

    B> Cache Saturation. Yes, it does happen, even in boxes with 48-64G of cache. (Though I don’t see it happen that often)

    In the EMC world, and I'm guessing this applies to all high-end arrays, the write has been acknowledged back to the host long before the physical drive enters the picture. *THIS* is what actually makes a cached array faster than JBOD. When write speeds are measured in nanoseconds and not milliseconds, all is good.

    Most arrays also have some form of predictive read-ahead, based on odds that once you’ve read X number of tracks, you’re more likely to read the next Y tracks in series.

    And yet (back to the point of the response), I still see people trying desperately to micro-manage their disks. Even in my company I have daily fights with our DBA because he wants to see two dozen 8G volumes presented instead of a 200G volume spread across many spindles on the back end, despite the fact that the 8G volumes will end up being slower because you can't use a ‘scatter-gather’ read on a single volume.

    In the EMC world the rule is to use sequential Symmetrix device IDs for volumes, which, by virtue of the way the Enginuity software lays out the disks by default, guarantees that you won't step on the same physicals.

  5. Cleanur

    On the EVA the chunk size is fixed at 128KB (256 blocks of 512 bytes each). This is often confused with the minimum PSEG size of 2MB. A PSEG is the minimum space the EVA can allocate internally when creating VDisks, each PSEG being equal to 16 chunks (16 x 128KB = 2MB).
