Check, check and triple check!

By | January 3, 2007

Well, a happy new year to everyone!

I remember when I first started doing storage that unallocating storage from a host was one of the most nerve-racking things I’d done professionally. I used to check, double check and then check again before clicking that Apply button, and would then sit at my desk for 10 minutes in case my phone rang. If it did ring, it would send my pulse racing in case it was one of the DBAs telling me that their production database had gone down complaining of disk errors. Considering that, it’s a wonder I’m still in the game!

Granted, these days I’m a little older and a lot less nervous. I’ve also unallocated a lot more storage than back in those first days. However, I’ve never totally lost that rush, if you can call it that, and to be honest I think it’s healthy to be cautious when playing with storage.

Anyway, I’m currently working on a project decommissioning two HDS USP 1100s, and as part of the project all hosts that are currently accessing disk on the USPs must be migrated to other storage arrays. As part of that, there is a lot of unallocating being done.

The way we are approaching the migration is as follows:
The hosts are currently accessing storage on one of two USP 1100s which are to be decommissioned. We are adding the new storage arrays to the correct zones so that the hosts can see and access both the storage on the USP and the storage on the new arrays. We are presenting identical LUNs from the new storage arrays and using Veritas Volume Manager on the hosts to create mirrors. Once the mirrors are created we are splitting off the plexes on the USP, performing some quorum swings and then removing visibility of the USPs.
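The order of those steps matters: the USP plexes should only be split off once the mirrors exist, and visibility of the USPs should be removed last. As a rough illustration only (this is not the tooling used on the project), the sequence can be written down as an ordered runbook that refuses to skip ahead. The step wording is paraphrased from the description above, and `STEPS` and `next_step` are hypothetical names.

```python
# Ordered runbook for one host, paraphrased from the migration description.
STEPS = [
    "add new arrays to the host's zones",
    "present identical LUNs from the new arrays",
    "create mirrors with the volume manager",
    "split off the plexes on the USP",
    "perform quorum swings",
    "remove visibility of the USP",
]


def next_step(completed):
    """Return the next step to perform, or None when the host is done.

    Raises ValueError if the completed steps are not an in-order
    prefix of STEPS, i.e. if something was skipped or reordered.
    """
    if completed != STEPS[:len(completed)]:
        raise ValueError("completed steps are out of order or unknown")
    if len(completed) == len(STEPS):
        return None
    return STEPS[len(completed)]
```

Keeping a record like this per host makes it harder to remove visibility of the USP for a host whose mirrors were never created.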

The funny thing is, I’m finding myself checking, double checking and triple checking everything that I’m doing. After all, some of the servers are live servers and the last thing I want to do is remove the wrong half of the mirror and then see everything fall over when I remove the USPs from the zones.

Here is a quick list of all of the double checks I’m having to do:
When creating the mirrors, making sure that one half is on the USP and the other half is not
Making sure I remove the correct half of the mirror
Making sure I unmap the correct storage ports from the Emulex cards
Making sure I remove the correct zone members when removing visibility of the USPs
Making sure I delete the correct Host Groups on the USPs
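The first two checks in that list lend themselves to a scripted sanity check. The following is a minimal sketch in Python, not what was actually run on this project. It assumes you have already worked out, by whatever means, which array backs each plex of a volume. The names `Plex` and `plex_to_remove` and the array labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plex:
    name: str   # plex name as reported by the volume manager
    array: str  # label of the array backing this plex (hypothetical labels)


def plex_to_remove(plexes, old_arrays):
    """Return the single plex that lives on an array being decommissioned.

    Raises ValueError unless exactly one plex is on an old array and at
    least one plex is on an array that is staying.
    """
    on_old = [p for p in plexes if p.array in old_arrays]
    on_new = [p for p in plexes if p.array not in old_arrays]
    if len(on_old) != 1:
        raise ValueError(
            f"expected exactly one plex on the old arrays, found {len(on_old)}"
        )
    if not on_new:
        raise ValueError("no plex on a surviving array; do not remove anything")
    return on_old[0]
```

For example, `plex_to_remove([Plex("vol01-01", "USP-1"), Plex("vol01-02", "NEW-1")], {"USP-1", "USP-2"})` returns the `vol01-01` plex, while a mirror with both halves on the USPs raises an error instead of naming a plex to remove.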

I’m sure the list could go on, and to be honest it wouldn’t bother me if it was just a handful of servers, but we are migrating everything off two pretty huge USP 1100s, so we are frantically performing multiple migrations every day. At the moment it’s taking an age because of how careful I’m having to be. Still, better safe than sorry!

Then of course there’s the eventual spinning down of the USPs! Thankfully it won’t be my responsibility to flick the “Off” switch; that’s a job guaranteed to get any responsible person’s heart rate going. However, I imagine I will still feel some responsibility in declaring the USPs clean and making sure we haven’t left any hosts accessing storage on them. Still, even with all that said, I’m sure I’ll sleep well the night before they flip that switch 😉

Nigel

14 thoughts on “Check, check and triple check!”

  1. JM

    Double-checking and triple-checking = job security for storage administrators! Storage administration must be one of the more stressful jobs in IT. A quick story: I was once a new storage administrator and extremely nervous removing storage from hosts, just like you were way back when. Then I got comfortable with it. Too comfortable. About a year ago I needed to unmap some devices from a Symmetrix. No problem, done it a hundred times. In order to unmap, the devices need to be write disabled, so I wrote myself a nice script to do this across all of them at once. I double-checked that the devices I had specified were correct and then I ran the script... against the wrong Symmetrix. Turns out neither Veritas Volume Manager nor Oracle likes devices that suddenly decide that writing is no longer allowed. What a sinking feeling it was when the Solaris admin poked his head into my office and said 4 different applications had just tipped over. Like you said, there’s always that 10 minute window where you really hate for the phone to ring. Luckily it only took me a couple of minutes to figure out what I had done, and it was a simple find & replace in my original script to change “write_disable” to “rw_enable”. Unfortunately, it took the Solaris guys over an hour to get the VxVM disk groups re-imported and the filesystems mounted. That’s one mistake I don’t plan to make twice!

  2. Storagezilla

    At least these days we can bloody well do it online. I remember performing a move back in ’98 of a huge amount of data which was set up in a fashion that left me no option but to perform a ufsdump/restore.

    The downside is that it took ages, the upside is that it gave us who knows how many hours to test that everything was working fine before the old devices were taken away and the array was powered down.

    Just a few weeks ago I sat down in front of PowerPath Migration Enabler and moved things from non-virtualized storage to virtualized with Invista using just a couple of commands. Clickety-click, Carriage Return, take it away boys.

    Though I did bug the hell out of the other guys for a good few minutes before that last carriage return. Nothing wrong with a second set of eyeballs no matter how easy things get. 😉

  3. Chris M Evans

    I agree this is a permanent headache. I’ve done plenty of moves from old to new equipment and despite my certainty that I’ve done the right thing, it doesn’t stop me double checking everything. Same with fabric refreshes (currently running at over 2000 zones): I do one fabric of two, then check servers to make sure they’re up!! Paranoia can be increased by the lack of management support when things go wrong. No harm whatsoever in a little CYA.

  4. Nigel

    Well, at least it looks like I’m not the only one 😉

    Maybe there is a need for a storage pros’ version of AA… “Hi, my name is Nigel and I’m paranoid because I work with storage”

  5. Jesse (SanGod)

    Yes, there is a reason that EMC requires that you write-disable a device before you can unmap it from a front-end port. That way, if there is any screaming to be done, it’s done before the data is dissolved.

    I’ve trained a few SAN administrators and the universal rule always applies.

    1. Check the volumes
    2. Write_disable the volumes
    3. Wait 1 hour, then unmap from the front-end directors
    4. Wait 24 hours and then dissolve the metavolumes.

  6. snig

    Yeah, HDS won’t let you remove a LUN from a port unless it is totally removed from the host.

  7. c2olen

    It’s like deja-vu all over.
    A few years back I was reassigning some LUNs from within our IBM Shark. I must add that the Java based Storage Specialist GUI from the IBM Shark is one of the worst I’ve ever worked with.
    While reassigning a couple of LUNs I goofed up and removed an Oracle production LUN. My heart started pounding in my throat. What a rush. I raced over to the DBAs and told them what I did. The database in question turned out to be some management information system which was pumped with data once a month. It became clear that no one knew exactly what sources were used to fill the database, since it had been running like this for years without need for intervention. Due to the read-only aspect, a regular backup wasn’t made, because all the data was fed from other sources. This concept was changed after the incident though.

    It was a good exercise afterwards, because the complete design was re-documented.
    It was my learning curve on doing multiple checks before removing/reassigning volumes.

    The IBM hardware requires you to additionally add the force option to the unassignment, when the LUN is still attached to a host. This kind of assumes the admin thinks before he acts, because the force is not done automatically. Although the force option could be scripted. We have made it key not to script these activities though.

  8. Nigel

    C2olen,
    A force option to force the removal of a LUN assigned to a host that is still accessing it is something that I’ve wanted for a long time on HDS enterprise kit.
    In the past, on almost every project I’ve worked on, I’ve had at least one LUN that I couldn’t remove because of stray I/O coming from the host. When I’ve dug into this along with HDS, we’ve often found that there is stray I/O that appears to be generated from HDLM (multipathing software), and the answer is usually to upgrade to the latest version of HDLM. Now I know that being up to date with this software is important, but that kind of change is usually outside of the scope of the project I was working on… blah blah blah. Anyway, a force option on the array would be nice!
    Not to be used by junior team members though 😉

  9. c2olen

    Being a USP fan, I have been getting more and more curious about what storage device was able to replace the USPs on your current project. What storage device are you moving the USP volumes to, if I may ask?

  10. Nigel (mackem)

    Haha – feel free to ask.

    It’s actually not very interesting… we are migrating the hosts onto two older 9980V subsystems.

    This is because the company I am doing this work for, call them Company-A, is contracted to provide IT services for Company-B. However, Company-C is coming in and taking over the running of Company-B’s IT services. But Company-A like the USPs so much that they want to take them with them when they leave and re-use them elsewhere. Company-C are ok with that.

    Hope that makes sense

  11. c2olen

    This makes perfect sense.
    No replacement then, just making sure they keep the good stuff.
    Smart move.

    Thanks.

  12. snig

    Nigel,

    You can actually block I/O on a per LUN basis now on the USPs. You can go into the VLL screen in storage navigator, click on LDEV Status, and then right click on the LDEV you want to remove and “Blockade” it. This will allow you to forcefully remove the LUN from the host.

    I did this once when a LUN wouldn’t remove from a host. The engineer swore that the device was totally removed from the host and that it wasn’t being used for anything so I forcefully removed it and the Oracle DB went down. Luckily I hadn’t reallocated the LDEV yet so I was able to just re-add it to the host and the DB came up fine.

    Now if the storage subsystem says that it still has host I/O then I don’t remove it unless the server is shut down.

  13. Vincent

    Hi Nigel,

    While searching on Google I came across your blog. Very interesting, you seem to be very experienced in SAN data migration. So I was hoping you might be able to point me in the right direction, as I'm currently absolutely lost and have been pushed to do the following:

    Goal – migrate data from old Clariion to new VMAX
    The Sparc Sun server is running Veritas Volume Manager with Veritas DMP.

    I've been told to move to EMC PowerPath from VxDMP so that the data can be migrated by use of EMC Migration Enabler.
    From your experience, is it really required to move to PowerPath in order for Migration Enabler to work? Or is it just easier to mirror the storage across frames with Veritas?

    Many thanks in advance!

    A Solaris Admin lost in the SAN jungle.

  14. Nigel Poulton Post author

    Hi Vincent,

    EMC can no doubt offer you some good migration methods from CLARiiON to VMAX.

    However, I would personally be loath to replace my multi-pathing solution solely on the grounds of a one-off migration. The market seems to be moving away from proprietary multi-path solutions (such as those from the storage vendors) and more towards open standards and MPIO as supplied by the OS…

    It's difficult to know without the specifics, but if there aren't any major reasons why you can't create mirrors with Veritas then I would stick with Veritas and create the mirrors to do your migration.

    Just my penny's worth.
