Sorry Nimble Storage – I Don’t Believe You!

By Nigel Poulton | November 24, 2014

So a few weeks ago I was at Nimble Storage HQ for a briefing with the Storage Field Day crew. Nice site, nice breakfast, and thanks for contributing towards my travel and hotel costs (that’s the disclaimer out of the way).


The Claim!

Anyway, as part of the presentation, Rod Bagg (VP of Customer Support) made the claim that ~60% of Nimble storage arrays get upgraded during business prime time hours. Seriously! During the middle of the business day! And Rod clarified that these are production systems, not just test systems in a lab.

Now….. call me old-fashioned, but I couldn’t swallow that!

The evidence is in the video, between roughly the 12:00 and 15:00 minute marks….

A Great Product or Bad Admin Practices?

The Nimble angle is that this is a huge testament to the quality of their product. To me….. it’s either incorrect data, or a testament to dangerous administration practices.

All I can say is…. with what I know about IT infrastructure, that would not happen on my shift!

Don’t get me wrong, I’m not saying that the Nimble product isn’t good – to be honest I actually like it. But to me, no matter how good a critical shared infrastructure product is, it’s still too risky to upgrade during business hours.

I know that the world is changing, but has it changed that much?!?!?!?

You see…. no matter how reliable a product is, it only takes one thing to go wrong to bring it to its knees.

Put another way… if I signed off on an upgrade to a shared Nimble storage array in the middle of the business day, and that upgrade *went south*, I’d expect to be marched off site and asked never to set foot in the building again.

My Experience With Upgrades

Now I know things can go wrong in the middle of the business day even when a system isn’t being upgraded. But in my experience, the risk is a lot higher when performing an upgrade. I’m used to doing things like –

  • making sure we have spare drives on site during upgrades
  • making sure we have the vendor support duty manager’s mobile phone number
  • making sure my technical staff don’t have any other plans for the day….

…all as preparation for the worst.

And I’ve had bad things happen during upgrades. And every time, I was damn grateful that we’d started our upgrades early on Saturday morning and had all day Saturday and Sunday to mop up if things went wrong.

I had one time where I was at the cinema watching the new Star Trek film when one of my team called me to tell me about a certain storage array that had gone down. That was at about 8pm on a Saturday evening and it only got fully back up and running at about 09:30am Monday morning – after core business hours had started! It wasn’t great.

Am I Wrong… Do I Need To Get My Butt Out Of the 1990s?

I’m not doubting the Nimble product here. But I am doubting the practice of upgrading core infrastructure in the middle of the business day.

Now….. am I living in the past? Nimble aside…. is it safe to be upgrading core infrastructure components in the middle of the business day?

 

30 thoughts on “Sorry Nimble Storage – I Don’t Believe You!”

  1. Zack

    I can believe it. I would bet 80% of their arrays are at SMBs with less strict and less critical maintenance procedures than enterprise companies.

  2. Michael Murphy

    Hi Nigel,

    Yes and no. Yes, you are right. No, we shouldn’t upgrade systems in the middle of a business day. Since 80% of system outages are the result of human error, we need to err on the side of caution. However, we NEED solutions out there that will change that reality. In America, businesses are pushing IT to do more with less, and people are overworked. What’s needed are systems that are self-healing, easier to manage, and that don’t contribute to downtime. I’m not saying that we should make changes during the day, but Nimble are basically saying that they don’t want to be part of any downtime.

    The other point I’d like to make is that people are making changes all the time, all day long, you just aren’t seeing it. Anytime someone adds a user to a group, assigns an IP address, hires a new employee, builds a server, etc., they are making changes. If companies don’t make changes at a rapid pace, they will not be able to keep up. If they make too many changes, they will fat-finger something, or roll something in untested. That’s the rub. So Nimble are just saying that their solution will be on the side of not causing any downtime. And that’s a good thing.

  3. Chris M Evans

    Nigel

    I think you’re not living in the past but have an enterprise focus, where upgrading during the day would be a career-limiting move. However, perhaps many of Nimble’s customers aren’t enterprise but sit more toward the SMB end of the scale.

    I had an interesting “discussion” last year with a customer who ran a public cloud infrastructure. They didn’t see a daytime vMotion as a change and so weren’t disclosing it to their customers. I pointed out that performing a vMotion to another server which had been incorrectly configured and thus caused the VM to be inaccessible would in fact be a failed change as the configuration was altered. They chose to disagree with me.

    I think there’s an interesting view on risk among people who are relatively new to the industry. They think redundancy of technology is going to save them should things go wrong. Sadly they are misguided and will likely get fired the first time their attitude results in an outage or data loss.

    Chris

  4. Nigel Poulton Post author

    @Michael Murphy

    I agree that systems that don’t contribute to downtime are needed….. I’m just struggling to see how a piece of core shared infra like a Nimble array can fit that. It’s dinner time with the kids here, so my mind may figure that out later tonight 😉

    And I get that we’re making changes all the time. But there are certain changes where that’s fine…. and others where it isn’t. To me, major code upgrades don’t fit – especially moving to something like a dot-zero release!

    @Chris Evans

    That’s a good point about SMB vs Enterprise. But racking my brains back to when I worked in the SMB space….. in the early days, change control hadn’t really been conceived. But even towards the end of my time in the SMB, when we had change control, I remember SAN upgrades and the like were weekend jobs – the overtime was good!

    In fact I remember when we used to delete LUNs on Compaq EVA arrays during the day. I always used to dread my phone ringing in case I’d deleted the wrong LUN. We soon moved to a 2-stage approach – unmap the LUN today, insert a two-day cool-off period, then delete the LUN.

    One point that was made at Storage Field Day was that maybe Nimble arrays in the SMB market were being managed by the Windows or Virtualisation teams…. and that they may have a different view on risk.

  5. John_H

    They update during the day, but do they apply those updates at that point? Language and marketing have always been fast friends. I do agree about the Nimble customer base – it is more toward the SMB space – and yes, lots of relatively new people in the industry seem to rely on the technology, and especially virtualization, to keep them out of trouble. In a shared storage environment, optimism is usually inversely proportional to experience. Hey, but don’t worry, if it goes wrong you still have a single controller… ughhhh. Losing a single controller out of a pair, even for a reboot to apply the update, should really discourage this taking place during the day.

  6. Nigel Poulton Post author

    @ John_H Made me laugh……. optimism is inversely proportional to experience 😉 and ….good point about single controller!

  7. PiroNet

    That practice is neither good nor bad as long as you’ve set the objectives, addressed the pre-reqs, evaluated the risks and planned a failback.

    If one of your objectives is to avoid users noticing it, whether perf degradation or loss of connectivity, then do it out of business hours.

    If you have to choose a moment when your storage is less used, there’s a big chance it is within business hours – you would be surprised, storage arrays tend to be more heavily used at night actually 🙂

    And if your business runs 24/7, well, you don’t have the choice. There will never be a good time to upgrade, right? Here I just hope the architects have addressed that constraint!

  8. Paul Hutchings

    I guess I’m curious if you see a distinction between arrays where you *have* to have the vendor come and do the firmware upgrade vs. those where, as the customer, the official upgrade path is that you hit their support site and download it, the upgrade is on you, and the assumption is the vendor only gets involved if it goes wrong?

    Chris also makes an interesting point about things like vMotion, where I’ll think nothing of migrating VMs during the day so I can update a host – not quite in the same league as a storage firmware upgrade, but it’s an interesting point on whether it should fall in the same risk category, as presumably your customers don’t care why they can’t access something, they only care that they can’t access it.

  9. Mirco

    Where is the difference?

    If you are working for an international corporation, the business runs 24/7, so there are no off-business hours. Keep in mind that there are countries that work on Saturday or Sunday as regular weekdays.

    In these cases it doesn’t make much sense to schedule your upgrade changes for the middle of the night on a weekend. Somebody will be working, and if something goes wrong, chances are good the one specialist you need is not at work and won’t pick up the phone.

    From my view it does make sense to move these changes to any time of day, if the product can be changed/upgraded in a predictable and reliable manner with no or very little impact on the frontend user.

  10. Pingback: Sorry Nimble Storage – I Don’t Believe You! - Tech Field Day

  11. John_H

    Many storage arrays can accomplish online upgrades, in some cases with minimal performance and availability degradation throughout the whole process. However, it’s extremely difficult and expensive to make this process completely transparent to the attached hosts; as such, many online updates rely on host-based multi-pathing to mask the upgrade from the O/S / applications perspective.
    The problem with this is that the storage vendor / administrator performing the upgrade has minimal information on the status of each host’s multi-path integrity. The more hosts attached and the more O/S versions supported, the bigger the problem. So this may well be feasible for smaller environments, where the risk of mis-configuration at the host is minimized simply by the numbers involved, and also in this space some level of unplanned downtime can probably be tolerated.
    If your firmware upgrade process is reliant on this then, outside of those very small environments, you really are playing Russian roulette and at some point you will come unstuck. Even the high-end solutions which don’t rely exclusively on multi-pathing to mask upgrades can’t fully guarantee the integrity of your host environment. So while it may well be feasible to do this during production hours, it’s probably still not a good idea.
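
    For what it’s worth, here’s a minimal sketch of the kind of host-side sanity check I mean – counting healthy dm-multipath paths on each attached Linux host over SSH before letting an upgrade proceed. The host list and the expected path count are illustrative assumptions, nothing vendor-specific:

        #!/usr/bin/env python3
        # Hypothetical pre-upgrade check: confirm every attached Linux host still reports
        # the expected number of healthy dm-multipath paths before touching a controller.
        # Host list, SSH access and the expected path count are assumptions for illustration.
        import subprocess
        import sys

        HOSTS = ["dbhost01", "apphost01", "filehost01"]   # illustrative host list
        EXPECTED_PATHS_PER_HOST = 4                       # depends on your fabric design

        def healthy_paths(host: str) -> int:
            """Count paths reported as 'active ready running' by multipath -ll on a host."""
            out = subprocess.run(
                ["ssh", host, "multipath -ll"],
                capture_output=True, text=True, check=True
            ).stdout
            return sum(1 for line in out.splitlines() if "active ready running" in line)

        def main() -> int:
            ok = True
            for host in HOSTS:
                paths = healthy_paths(host)
                print(f"{host}: {paths} healthy paths")
                if paths < EXPECTED_PATHS_PER_HOST:
                    print(f"  WARNING: expected {EXPECTED_PATHS_PER_HOST} - do not start the upgrade")
                    ok = False
            return 0 if ok else 1

        if __name__ == "__main__":
            sys.exit(main())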

  12. Pingback: Sorry Nimble Storage – I Don’t Believe You! | Storage CH Blog

  13. Ichdenke

    Approximately 60% of IT marketing naïve over-hype is spoken during normal business hours.

  14. Rob Lyle

    @Michael_Murphy Yes, people make changes all day long, but there is a world of difference in the risk profile of open heart surgery compared to sticking a bandaid on a skinned knee. It’s all about risk appetite. There is no right and wrong.

    Given the complexity of software these days, anyone who thinks the world is getting simpler needs to think again. Why don’t software companies offer warranties for their software products? Ever read an EULA from Microsoft or IBM that *doesn’t* say their software comes with NO WARRANTY for any purpose you might want to use it for?

    Until the vendors get more skin in the game and grow a pair, customers will remain risk-averse. If a vendor did engage with the USP such a warranty would create for them, product quality and security issues would be substantially smaller. Now, please pass me the unicorn tear cordial – I’m thirsty.

  15. Steve

    Nigel,
    Disclaimer: Nimble Storage Employee
    First, thanks for attending our Storage Field Day session.
    Please keep in mind that we’re not forcing customers to update firmware or scale their systems during business hours. We’re just providing the facts that, due to the way the architecture is built, we see a certain percentage of customers upgrade and scale during business hours. We’ve simplified a lot of tasks that not only used to require downtime but also, many times, required vendors to come onsite to assist in these processes.
    Every customer can decide when to upgrade or scale based upon their business, requirements, risk, etc…
    Our statistics are facts. That’s all.
    Thanks again for participating in Storage Field Day.
    Steve

  16. Nigel Poulton Post author

    Hi Steve.

    Appreciate your comments. Also….. fully appreciate prime-time upgrades aren’t mandatory 😉

    I think the facts are a testament to a few things. First, your customer base (predominantly SMB), but I also don’t doubt that in some cases it’s a reflection of the product quality and customer confidence.

  17. Zack

    So do you now believe Nimble, Nigel? Perhaps you should change the blog post to “Sorry Nimble Storage – I didn’t believe you!”

  18. Nigel Poulton Post author

    Hi Zack….. In a way I think I always believed them (I consider Nimble a serious storage company and one with good people working there).

    I kinda figured writing about it might generate some discussion and shed some light on the detail behind the headline if you know what I mean.

    However……… now that FC is supported, I wonder how that stat might change as and when bigger customers (previously out of their reach due to the lack of FC) start replacing VNX block with Nimble block…… Will be interesting.

  19. Brian

    I tend to only upgrade mine during off hours, more out of habit than anything else. However, I have done the upgrades during business hours. The only risk that you take by upgrading during business hours is that if something goes wrong during the upgrade, your standby controller is down and you don’t have fail-over capability if your active controller has an issue.
    Your system isn’t going to fail over to the newly updated controller until the upgrade is complete. There is no downtime involved. Once the upgrade is complete on one controller, it fails over and the other controller is upgraded.
    Chances are that you have been running on the same active controller for months. What are the chances that this controller fails right at the moment you are upgrading the standby?
    My last upgrade failed due to a bug where we had a couple of snapshots with identical serial numbers. Nimble support took over and found the problem and corrected it for me. Once the issue was found, my standby controller successfully upgraded and then failed over before upgrading the other controller.
    We were a very early Nimble adopter and during the first year I had a hardware (Memory) failure during a code upgrade. They determined that the controller was faulty and shipped me a new one overnight. While I was not in a High Availability status during this time, I never had any downtime.
    We have 2 locations. We installed an array in each location 3 years ago, added an array at each location last summer, and we have had 100% up-time on all 4 arrays the entire time.
    Nimble Support also offers a spare parts kit that you can purchase and keep on hand that includes a spare controller and drives. And from experience, if you call their support, you are connected to a Tier 3 support engineer immediately. That is all they have.
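
    To spell that sequence out, here’s a rough sketch of the order of operations. The Controller class and its methods are made up purely for illustration – this is not a real Nimble API – but it makes the ordering, the rollback option and the risk window explicit:

        # Illustrative sketch of the rolling controller-upgrade sequence described above.
        from dataclasses import dataclass

        @dataclass
        class Controller:
            name: str
            version: str
            role: str  # "active" or "standby"

            def upgrade(self, new_version: str) -> None:
                # While the standby reboots onto new code, the pair has no failover partner.
                print(f"{self.name}: rebooting onto {new_version}")
                self.version = new_version

            def healthy(self) -> bool:
                # Stand-in for the post-upgrade checks (controller up, ports and paths seen, etc.)
                return True

        def rolling_upgrade(active: Controller, standby: Controller, new_version: str) -> bool:
            standby.upgrade(new_version)
            if not standby.healthy():
                return False                      # back out: the active controller never changed
            active.role, standby.role = "standby", "active"   # failover: new code now serves I/O
            active.upgrade(new_version)           # the old active gets the same treatment
            return active.healthy()               # success: both controllers on the new release

        if __name__ == "__main__":
            a = Controller("A", "2.2.0.8", "active")
            b = Controller("B", "2.2.0.8", "standby")
            print("upgrade succeeded:", rolling_upgrade(a, b, "2.2.1.0"))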

  20. Jake

    I hope I can offer some admins on this thread some peace of mind.

    My datacenter is 24/7, 365 business hours. There is no “scheduled downtime window”, period. We run, and we run 24/7, no questions asked.

    Nimble was specifically chosen due to its “non-disruptive” updates to the controllers.

    Have we had Nimble issues on upgrades? Yes – they are mentioned here already.
    Did we suffer any production outage? NO – the biggest problem suffered so far was a 9-second outage on a redundant iSCSI mapping.

    So, I don’t want to get into a “best practices” discussion, because it’s a general rule of thumb, but by no means should it be a rigid core principle to live by. If you don’t entertain the thought of making your datacenter flexible and nimble (pun intended) you may be living in the datacenter of the past. Time to move to today’s modern datacenter practices and techniques.

    Keep in mind my mission is: there is no acceptable downtime, regardless, so plan accordingly.
    So I’m not an apples-to-apples comparison.

    Good luck folks!

  21. Dan

    As other people have stated, most environments expect 24/7 operation, so the premise is flawed somewhat. It makes more sense to upgrade during normal working hours to ensure the best possible support in the event of issues.

  22. Brent

    I agree with Jake, we upgrade whenever I feel like it. I do preliminary calls with support to ensure everything is in order. If any corrections need to be made, they make them, then we proceed.

    Our company used to have downtime windows, but with physicians wanting the flexibility to work 24/7 remotely and sites open on Saturday, our downtime windows have shrunk to the wee hours of Sunday morning. Of course I see people on even then. So as with Jake’s company, we are essentially a 24/7 shop and we need equipment that can run 24/7, and that means it must be able to be upgraded without downtime. With Nimble and VMware, we are able to do this.

    I really wish Microsoft would get on board with this. We need a 24/7 OS now. Clustering is great, but not everything can be clustered.

    Oh and yes, I have upgraded at noon on a Friday. Don’t get me wrong though, I don’t upgrade on Mondays due to high loads. But I have upgraded our entire VMware cluster in the middle of the day on a Wednesday before. To me, if the tech can’t handle it, then we don’t need it…. unfortunately, I cannot do that with everything…

  23. Carter Fields

    Ahh, nope! I did an upgrade of our CS500 last night (Sunday 3:30pm) going from 2.2.0.8 to 2.2.1.0. It started with the standby controller (B), upgraded it and sent it to restart. The B controller never came back. I called support, who wanted me to drive to the datacenter to manually reboot the B controller. Before I could leave, the A controller rebooted!!! My entire VMware server infra went down. Production down. Once the A controller came back up, VMware reconnected and I had to fix a few servers, but we were down hard for about 30 min on some servers. They still have no idea why the A controller rebooted before the B controller had come back online, become the active controller, and thus released A to proceed with its reboot. Epic failure.

  24. MRDC

    I’d like to add my experience.

    We run a critical / HA Datacentre which can have no downtime. Our entire VM estate runs on UCS / Nimble. It not only runs critical internal services but also external customer facing services.

    Last week we added another 30TB Nimble shelf. The shelf was installed and connected, then made available after peak hours.

    No disruption to services, no downtime, no issues.

    Products like Nimble have made tasks that would once have been very difficult to orchestrate very easy, and that leaves us time to concentrate on more critical projects.

  25. Dave K

    My company runs about 7K VMs and we use Nimble for 10% currently and are looking to increase that significantly.

    I load tested a CS300 with 22,000 IOPS from 12 different load-generating VMs during an upgrade and during an SP failover. The CS300 worked as advertised – roughly 5-10 seconds of pause. That’s about the same response time as any active/passive Tier 2 solution. The CS series aren’t VMAX or VSPs, but for the money they do an amazing job.

    I’m also part of the old guard – I installed my first SAN in 1999 and have worked in many 24/7/365 shops – and there is no way I’d upgrade any enterprise infrastructure during prime time; I enjoy full-time employment. But at the same time, assuming you’ve followed best practice with your OS timeout configuration and you have support validate your CS is healthy, I would consider a Nimble OS upgrade a non-disruptive change.
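
    On the timeout point, a quick way to check that your Linux hosts have headroom over a 5-10 second pause is to read the per-device SCSI command timeout. The 60-second floor below is purely illustrative – use whatever your storage vendor documents:

        #!/usr/bin/env python3
        # Check the Linux SCSI command timeout on each sd device; a controller failover that
        # pauses I/O for a few seconds should sit well inside this value.
        # The 60-second minimum is an illustrative assumption, not a vendor recommendation.
        import glob

        MINIMUM_TIMEOUT = 60  # seconds, illustrative

        for path in sorted(glob.glob("/sys/block/sd*/device/timeout")):
            dev = path.split("/")[3]
            with open(path) as f:
                timeout = int(f.read().strip())
            if timeout < MINIMUM_TIMEOUT:
                # e.g. raise it with: echo 60 > /sys/block/sdX/device/timeout (persist via udev)
                print(f"{dev}: timeout {timeout}s is below {MINIMUM_TIMEOUT}s - consider raising it")
            else:
                print(f"{dev}: timeout {timeout}s looks fine")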

  26. Vlad V

    We are doing a PoC with Nimble and we tested manual failover and appliance updates. From our testing we found out that in-guest stuns are between 20 and 30 seconds. Right now we are searching for the cause but haven’t found anything yet in our infrastructure.
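
    If anyone wants to put their own number on the stun, a crude way is to hammer a test file with small fsync’d writes from inside a guest and record the longest gap between completions while the failover runs. The file path and run time here are illustrative assumptions:

        #!/usr/bin/env python3
        # Rough in-guest stall probe: issue a small synchronous write every ~100ms and report
        # the longest gap between completed writes. Run it on a datastore backed by the array
        # under test while you trigger the controller failover. Path and duration are illustrative.
        import os
        import time

        TEST_FILE = "/mnt/test-datastore/stall-probe.bin"  # illustrative path
        RUN_SECONDS = 300                                   # keep running while you fail over

        worst_stall = 0.0
        deadline = time.monotonic() + RUN_SECONDS
        with open(TEST_FILE, "wb", buffering=0) as f:
            last = time.monotonic()
            while time.monotonic() < deadline:
                f.seek(0)
                f.write(b"x" * 4096)        # small write...
                os.fsync(f.fileno())        # ...forced to stable storage, so an I/O pause shows up as a stall
                now = time.monotonic()
                worst_stall = max(worst_stall, now - last)
                last = now
                time.sleep(0.1)

        os.remove(TEST_FILE)
        print(f"longest gap between completed writes: {worst_stall:.1f}s")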

  27. Greg

    I wouldn’t upgrade during business hours, but the upgrades I have done during off hours have gone flawlessly with no downtime. We had a guy change a failed drive during business hours, and it tanked the system. That sucked.

    I love Nimble though.

  28. PC

    We just bought Nimble storage and we are still implementing and testing. Maybe I’m still in the period of seeing everything great from them. The engineer from Nimble that came for the implementation actually previously worked for EMC on XtremIO, so he did lots of comparison with XtremIO. He even wrote docs for XtremIO.
    From past experience (not so far back, it happened 2 weeks ago): we shut down our PROD VMAX for a relocation, and it took the EMC lab 2 hours to shut down the VMAX safely. Our downtime was 7 hours, and our core banking with >1 million bank accounts resides on the VMAX.
    I understand the 60% upgrade figure comes from InfoSight, so it’s real-world figures, not just marketing figures… so I will tend to believe them…
    PC

  29. Because I Can

    Can you do upgrades during the day? Sure. Should you? Probably not. Having said that, I have updated Nimble and Pure arrays without shutting down hosts that use the array, even systems that boot from SAN. The system does work as they state. I think the big change is coming from the EMC/NetApp world, where it takes a ton of effort and deep production knowledge to do an upgrade.

    Having worked on all four (and some HP), I wouldn’t go back without being dragged there. The old ways are just that – OLD. I really enjoy upgrading a storage array without having to get the entire IT department involved.

  30. Jesus

    upgrade process:
    Passive controller gets upgraded
    Passive controller becomes the active controller
    Nimble code tests all connections and the controller
    The only two options are to go ahead and upgrade the second controller, or to go back to the old one

    This process is unheard of on any other storage I know. I have done three live upgrades and NO SYSTEM LOSES A HEARTBEAT. That is remarkable!

    Now I wouldn’t try the same with NetApp, EqualLogic, or any other one. Bad things have happened in my case while upgrading those arrays during maintenance hours….

    So don’t live in the 1990s (unless you work for Wall Street and have the money to run two identical computational systems, with the luxury of upgrading one, moving to it, and keeping the old one to go back to if anything fails)!!!
