Of EMC, RAID 6, mud and Dragons

January 10, 2007

So, a while back I apparently showed my baboon ass while slinging mud at EMC over their lack of support for RAID 6.

Shortly after I threw the mud, Storagezilla got out a dirty old rag and attempted to wipe some of it away.  Can’t blame the guy for trying, but he only succeeded in smearing the mud – there was plenty of mud still there after his clean-up attempt.

My issue back then, and might I add the reason that I threw the mud in the first place, was that EMC weren’t giving their customers the choice.  Heck, I’m aware that opinion is divided over the need for and benefits of RAID 6, and I wasn’t attempting to preach and convert anyone – I know a lot of people can be religious about stuff like this.  People can make their own decisions – if they don’t have EMC kit that is 😉

Anyway, in all honesty I was interested in the post that Storagezilla wrote attempting to clean up the mud that I threw – especially the part where he (I’m assuming it’s a “he”) talked about “EMC being the only vendor spending a lot of time and effort working to mitigate the chances of double disk failure related data loss… regardless of RAID type”.  Of course EMC are not the only vendor spending a lot of time and effort on this front; I’m sure most vendors are.  Still, it’s good to hear that EMC are working hard at resolving the root causes of “the double-disk failure dragon” and not simply providing recovery mechanisms.

Soo… in a comment to Storagezilla’s post cleaning up my mud I asked if he/she/it could tell me more about what EMC are doing and what significant resources they are putting to the task.  But of course he/she/it wouldn’t say more, as that is for others at EMC to do… And I wasn’t about to hold my breath for any further clarification from the Beast of Hopkinton.

However, since then, Chuck Hollis, VP of technology alliances at EMC, has been to the local supermarket and purchased a bucket and a bunch of new cloths, rolled up his sleeves and had a go at removing some of my mud.  In seriousness though, it’s a good article and an excellent and rare insight into how a huge market-leading company goes about improving its core products and competencies.  Obviously it’s written to make EMC look good, but I wouldn’t expect anything else.

I just want to make one comment though – both Storagezilla and Chuck talk a lot about RAID 6 not solving problems, only providing fixes, essentially papering over the cracks, and say that EMC don’t do that.  Instead EMC invest their efforts in solving the underlying problems.  Well of course that’s great (honestly), but what about in the meantime?  As Storagezilla calls it the “double-disk failure dragon”, I will make an analogy around that.
(Warning! I’m about to get a little carried away here)

Imagine a village, let’s call it Hopkintonfieldvilleshire, that is terrorized by a fearsome dragon that steals people in the night who are never seen again.  The village folk meet to discuss how to stop the dragon coming into the village and stealing their loved ones.  After much deliberation they decide that the only way to keep the dragon out is to build a huge moat around the village.  But it will take a whole year to build the moat, even if everyone in the village helps.  There is also a quick-fix option: build a large bell that will frighten the dragon away when it attacks the village.  The bell will take a month to build but will delay the finishing of the moat by a month.  The village folk decide to have a vote:

  1. Should they put everyone in the village to task building the moat, all the while allowing the dragon to roam freely into the village and steal their loved ones for another whole year?
  2. Or should they take some people off the moat project and build a big bell to scare the dragon away and save the lives of the friends and families here and now, BUT delay the completion of the moat?

Not sure if you’re still with me after that random piece of fiction, but if I were in the village I would vote for option number two.

Chuck also mentions that we may very well be seeing a RAID 6 offering from EMC before long – although I think we all expected that anyway.

He also mentions that there is always more to do to improve, and hints at aiming for six nines.  Of course it’s always good to aim high – but why not go all out and aim for 100% 😉

All in all, good banter and very insightful.

Nigel 

AUTHOR’S NOTE: Something that Jesse mentioned a while ago in response to one of my other posts was that when he runs RAID 5 on his EMC Symms he often doesn’t bother to provision spare spindles.  Instead, when a disk fails he allows the RAID set to run degraded until an engineer arrives with a replacement.  His thinking behind this is that, especially with larger disks, the engineer often arrives on site with the replacement before the sparing process has completed, or at worst shortly thereafter.  If the engineer arrives before the sparing completes he has to wait before he can tell the box to spare back to the new disk.  All of this obviously consumes processor cycles.  And who am I to argue with Sangod!?

He also mentions that he knows of other people who do the same.  I have my reservations and would be interested to know if anybody else does this????

10 thoughts on “Of EMC, RAID 6, mud and Dragons”

  1. JM

    I don’t have much to say about RAID6, but as far as your author’s note at the end on running spares with Symmetrix… I’ve also heard of people running no spares until the engineer shows up. Unless you’ve got performance concerns and want to schedule your rebuild to happen off-hours, I don’t think I agree with the practice. I understand what the argument is, but for me it’s all about how much time I’m vulnerable to that second disk failure. I’ll grant you that second disk is more likely to fail during the rebuild rather than under normal conditions, but every minute past when that first disk drops out of the RAID set means another minute my data is unprotected. If a spare is configured and is invoked, that amount of time is minimized as best it can be. Personally, I don’t care if the engineer has to wait before he can tell the box to spare back to the new disk or not. He can sit there and twiddle his thumbs and wait or he can remotely dial in later and do it. We’ve paid for the support contract, his inconvenience is included.

  2. SanGod

    If it’s a disk FAILURE, you’re just as unprotected during the rebuild as you are waiting – more so if the engineer can’t replace the failed disk until after the rebuild completes, which is the case with EMC.

    If it’s a predictive failure, meaning that the hardware has experienced multiple single-bit failures during the scrubbing process, the spare invocation does protect you during the rebuild, as the failing disk can then be replaced without compromising redundancy.

    There are a million thoughts on this. I inherited the practice from an EMC Customer Engineer when I worked in California. A few friends of mine who are still architects have maintained it. I am however considering dropping them an email to see if it’s still practiced.

  3. JM

    That makes sense, but say your engineer takes 3 hours to get on site to replace the disk and your rebuild takes 2 hours. If the Symm fails a disk and you don’t have spares configured, your time unprotected is 5 hours. If you have spares, the rebuild kicks in automatically and your time unprotected is 2 hours. The amount of time unprotected is less with spares than without. Am I missing something?
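
    The arithmetic in this comment can be written out as a quick sketch. It assumes the simplest possible model, and the function and parameter names are made up for illustration: with a spare configured the rebuild starts the moment the disk fails, and without one nothing can start until the engineer has fitted the replacement. It ignores predictive failures and the later copy back to the new disk.

```python
def unprotected_hours(engineer_arrival_h, rebuild_h, hot_spare):
    """Hours a RAID 5 set runs without redundancy after one disk fails.

    Hypothetical helper: a deliberately simplified model of the scenario
    described in the comment above, not a vendor formula.
    """
    if hot_spare:
        # A spare is configured, so the rebuild starts immediately and the
        # set is protected again as soon as it finishes.
        return rebuild_h
    # No spare: the rebuild cannot start until the replacement disk arrives.
    return engineer_arrival_h + rebuild_h


# The figures used in the comment: engineer on site in 3 hours, 2-hour rebuild.
print(unprotected_hours(3, 2, hot_spare=False))  # 5
print(unprotected_hours(3, 2, hot_spare=True))   # 2
```

    Under this model the spare always wins by exactly the engineer's travel time, however long the rebuild takes.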

  4. Nigel

    Personally I see the benefits being for larger drives where rebuild times are huge, and let’s face it, disks are getting bigger by the day, as are rebuild times.

    Another issue that I would have, though, is that I have had several occasions where “there have been no spare disks in the country” and we have had to wait until the next day. This obviously is not acceptable.

  5. snig

    I guess Chuck isn’t going to post my last response. I posted it last Thursday and it hasn’t been accepted yet. I’ll have to remember what I said and post it here later.

  6. richard

    Mackem,

    Well… what contractual responsibility does EMC assume when their customer loses data… probably none.

  7. snig

    I don’t know about EMC, but I know HDS guarantees 100% no data loss for at least 1 year. Seems some companies are more confident with their hardware than others.

    I’m sure it’s not widely known that HDS does this but it wouldn’t hurt for every user to put this in their Master Purchase Agreements with all their disk vendors.

  8. richard

    I guess that HDS can do very little when it happens … unless they provide free remote replication for the first 12 months. This may not be a bad marketing concept.
