VSP and VMAX Tier 1 Shenanigans

Hu Yoshida, CTO of Hitachi Data Systems and a long-time legend of the storage industry, is a person I respect a lot. Hu and I recently engaged in a discussion around the validity of architectures like VSP and their designation as Tier 1, which Hu summarised in a recent blog post. Hu has asked that I summarise my thoughts on the topic in a blog post of my own so that he can fully digest and potentially answer. Here goes.

NOTE: Although the discussion Hu and I had was over Twitter, I should point out that I have met Hu many times in person. He is an absolute legend and, despite being over 70 years old, would kick my rear in a fist fight!

Tier 1 Shmear 1 Zzzzzzzz

We've been here a million times before and I won't drag it out. Although there is no strict definition of what Tier 1 is, most people have a good idea... usually VMAX, VSP (and the P9500 OEM'd by HP) and occasionally the DS8x00 from IBM. Enough on that.

My point is that traditional definitions of Tier 1, and hence the platforms that cling to this accolade, are becoming less and less relevant by the day. And the reason is clear….. SSD!

SSD Has Changed Everything

Before the rise of SSD, it was accepted that Tier 1 storage sucked in performance compared to RAM in a server, often by several orders of magnitude. Those were hideous days!

Fortunately those days are drawing to an end. Application owners, and even some business folks, are becoming savvy to the delights of SSD. The cat is out of the bag, so to speak; the horse has bolted. And once the horse has bolted there is no reining it back in. Believe me, this is a good thing!

While I'm a fan of much of the architecture behind VMAX and VSP (especially VSP/P9500, as I'm a sucker for customised designs), there is no getting away from the fact that the way VSP/P9500 and VMAX implement SSD is a bit of a fudge. They have basically lashed 3.5-inch and 2.5-inch SSD drives onto their existing back ends. Back ends that were designed for rotating rust, not SSD.

And it's not just the back ends; these so-called intelligent controllers have been designed and honed over the years for spinning media. Again, not SSD. What am I saying...?

Well... take the spec sheet numbers for a ZeusIOPS SSD from STEC. We're talking 120,000 read IOPS and 75,000 write IOPS. Say what you will about vendor spec sheet numbers, these are crazy figures compared to a 15K FC/SAS spinning disk.

The question then begs... how many of these bad boys can a traditional controller architecture like VSP/P9500 or VMAX drive across its back end? Let me cut the suspense: not many! In fact, I might challenge whether a VSP/P9500 or VMAX can even drive a single one of these drives to its full potential across its back end. Correct me if I'm wrong, but I won't be holding my breath for a corrective response.
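
To put some rough numbers behind that claim, here's a back-of-envelope sketch. Only the ZeusIOPS figures come from the spec sheet above; the ~180 IOPS per 15K drive and the aggregate back-end ceiling are my own illustrative assumptions, not vendor-published numbers.

```python
# Back-of-envelope comparison: STEC ZeusIOPS SSD vs 15K FC/SAS spinning disk.
# The 180 IOPS per 15K drive and the 200,000 IOPS back-end ceiling are
# illustrative assumptions of mine, NOT vendor-published figures.

ZEUS_READ_IOPS = 120_000        # STEC spec sheet (reads)
ZEUS_WRITE_IOPS = 75_000        # STEC spec sheet (writes)
FC_15K_IOPS = 180               # rough rule of thumb for one 15K drive (assumed)
BACKEND_CEILING_IOPS = 200_000  # hypothetical aggregate back-end limit (assumed)

# Raw random-read IOPS: how many 15K drives does one ZeusIOPS SSD equal?
print(ZEUS_READ_IOPS / FC_15K_IOPS)           # ~667 drives

# And how many ZeusIOPS SSDs would saturate that assumed back end?
print(BACKEND_CEILING_IOPS / ZEUS_READ_IOPS)  # fewer than 2 drives
```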

And of course, none of the above stops any of the vendors from selling them to you, at a cracking price too.

Do They Need to be Driven?

I appreciate that there is some foundation for a counter-argument: traditional controllers do not necessarily need to drive SSD hard, thanks to auto-tiering technologies such as FAST VP and Hitachi Dynamic Provisioning/HP Smart Tiers. However, if that's the case, why not sell something more suited to the controller and back-end architecture? Say something more like the Mach16 or Mach8 IOPS, also from STEC. They seem cheaper and better suited to the specs of traditional T1 controllers.
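
For anyone who hasn't dug into sub-LUN auto-tiering, the general idea is simple: count I/O per extent over a monitoring window, then promote the hottest extents to SSD. The toy sketch below illustrates the concept only; it is emphatically not the actual FAST VP or HDP/Smart Tiers logic.

```python
# Toy sketch of sub-LUN auto-tiering: promote the hottest extents to SSD,
# leave the rest on spinning disk. Concept only -- NOT the real FAST VP /
# HDP / Smart Tiers algorithm.

def retier(extent_io_counts, ssd_capacity_extents):
    """extent_io_counts maps extent_id -> I/Os observed this monitoring window."""
    ranked = sorted(extent_io_counts, key=extent_io_counts.get, reverse=True)
    ssd_tier = ranked[:ssd_capacity_extents]   # hottest extents go to SSD
    hdd_tier = ranked[ssd_capacity_extents:]   # everything else stays on HDD
    return ssd_tier, hdd_tier

# Example: six extents, room for two of them on the SSD tier
counts = {"e0": 50, "e1": 9000, "e2": 120, "e3": 15, "e4": 7000, "e5": 300}
ssd, hdd = retier(counts, ssd_capacity_extents=2)
print(ssd)   # ['e1', 'e4']
print(hdd)   # ['e5', 'e2', 'e0', 'e3']
```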

So What's the Answer?

I'm certainly not a planet-sized-brain guy who works for a vendor, but I've seen some stuff that looks like it will fit the bill nicely...

Technologies such as SCSI Express and NVM Express look like cracking candidates: technologies/protocols designed especially for SSD.

Keep the SSD closer to the controller.

There is absolutely a place for SSD in the server, but also in the shared storage array... just not hidden away on the back-end loop of a shared storage array like some unwanted child.

Keep it closer to the controller, where it can be accessed faster and driven better. From here it can be used as a second-level cache (as in FAST Cache, only not starved on a back-end loop like existing FAST Cache implementations) or as a LUN or extent of a LUN, as in FAST VP or HDP/Smart Tiers.
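
To make the second-level-cache idea concrete, here's a minimal sketch of the concept: a small, fast SSD layer absorbing reads that miss the DRAM cache before they ever touch spinning disk. It's a conceptual toy under my own assumptions, not how FAST Cache (or any vendor feature) is actually implemented.

```python
# Minimal sketch of SSD acting as a second-level read cache in front of
# spinning disk. Concept only -- not how FAST Cache is actually built.
from collections import OrderedDict

class SSDReadCache:
    def __init__(self, capacity_blocks, backend_read):
        self.capacity = capacity_blocks
        self.cache = OrderedDict()        # block_id -> data, kept in LRU order
        self.backend_read = backend_read  # stand-in for a read from the HDD tier

    def read(self, block_id):
        if block_id in self.cache:              # SSD hit: fast path
            self.cache.move_to_end(block_id)
            return self.cache[block_id]
        data = self.backend_read(block_id)      # miss: go to spinning disk
        self.cache[block_id] = data             # populate the SSD cache
        if len(self.cache) > self.capacity:     # evict the least-recently-used block
            self.cache.popitem(last=False)
        return data

# Usage with a hypothetical backend-read function standing in for the HDD tier
cache = SSDReadCache(capacity_blocks=4, backend_read=lambda b: f"data-{b}")
print(cache.read(17))   # miss: fetched from "disk", now cached on "SSD"
print(cache.read(17))   # hit: served straight from the SSD cache
```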

Hell, we wouldn't dream of exiling VSP or VMAX DRAM to a back-end loop. Why do the same with SSD?!

Sure, I appreciate that it was done as a quick fix, "get SSD in there no matter how you do it", and that the disk-drive form factor on a back-end loop was the simplest option and the path of least resistance. But it's ugly!

Today's Gen1 Implementations Are Not All Bad

I admit that today's SSD implementations in VSP/P9500 and VMAX do serve some customer requirements; they just don't do it very efficiently. I also admit that the vendors have done some work to smooth the road to SSD in their existing T1 architectures, such as tuning caching algorithms and making sure that SSD access doesn't hog the loop (pun intended).

However, existing implementations are somewhat like driving a Ferrari around central London. Build a freeway! 

If things don't get better, the rising generation of arrays bearing the mark "Designed for SSD", such as WhipTail, Kaminario, Violin Memory etc., will start to crop up in traditional Tier 1 accounts. These guys are working hard at implementing many of the traditional Tier 1 features in their products, features like NDU code upgrades, replication... oh, and cloud integration.

Yes, Tier 1 is about robust replication, caching, N+1 or higher... but it's also about performance. High-performance storage is still a bit niche today, but it won't be tomorrow. Oh, and it's getting late!

UPDATE: (3rd April 2012) It seems I need to add a clarification to this post.

I am absolutely not saying that performance is the numero uno characteristic of Tier 1 storage systems. What I am saying is that it is one characteristic. No high performance, no Tier 1 badge.

My point is that the rise of SSD is massively redefining everybody's performance expectations, to the point where, if the traditional Tier 1 vendors don't keep pace, they will no longer be considered performant. No high performance, no Tier 1 badge.

Also, I'm not saying the current crop of SSD vendors are better than the current crop of Tier 1 vendors. I am saying that the current crop of SSD vendors are disrupting the industry, and that disruption will only increase as each day passes. The current crop of Tier 1 guys need to up the ante or risk becoming dinosaurs of the industry.

Oh, and I'm not bashing Hitachi or EMC. Hopefully that came across early on in the post when I stated that I'm a fan of the architectures, especially VSP/P9500.

10 comments for “VSP and VMAX Tier 1 Shenanigans”

  1. April 2, 2012 at 6:59 pm

    I think the fundamental issue that is driving the new technology is VDI's requirement for really fast random-access storage at 1/10 of the traditional price point.  Disk performance hasn't grown beyond 15k rpm in years, you only want to group so many drives together, and growing DRAM front-end cache is only cost effective to a point. Hence we are seeing VC capital pouring into startups like WhipTail, which are providing what clients want: high performance at a dramatically lower price point compared to the traditional vendors.

  2. Nathan
    April 2, 2012 at 7:43 pm

    Long-time listener, first-time caller...
    As with everything in IT, it depends... especially on your SLA.  From my standpoint you're missing a major reason why I still cling to Tier 1.  I don't view Tier 1 as a performance tier, I view it as an availability tier.  If I want to ensure continuous availability, I look at VMAX and VSP.  I don't have to coordinate outage windows to upgrade code, and if something breaks it adapts until it's fixed.  If either of those things happens on a Tier 2 array, I'll be taking a beating because I have to coordinate outages and have half the throughput available.

  3. Owen Hollands
    April 2, 2012 at 10:45 pm

    Why does everyone think in terms of one dimension (Tier 1, 2, 3, etc.)?  This is so pervasive in the industry that EMC even coined the term "Tier 0" to start pushing SSD drives.
    There are plenty of cases where it isn't simply a case of one configuration being "better but more expensive" than another.  How many core business applications are there that demand super-high availability, but when you get down to it don't really push that much IO?  This leaves your short-stroked 144GB 15k RPM 1+0 Tier 1 storage (or worse still, SSD) underutilized.
    On the other side there are many high-performance applications that will soak up as much IO and as many CPU cycles as you can give them, but if you actually sit down with the owner, they would gladly take a few extra outages a year if you can give them some more performance.  Of course, getting them to remember this during the outage might be difficult, but that is another problem.
    There are plenty of cases where the business needs both performance and availability, but if you only think in terms of Good but Expensive or Cheap and Bad then you will always be stuck, unable to deliver the "best" service you can, because you are not using the same version of "best" that your customer is.

  4. Donny Parrott
    April 3, 2012 at 12:47 pm

    Good opening. The storage space is preparing for a large shake up. A generational leap is nearing as the design of storage is transforming to meet business and technology changes.
    Local storage abstraction (aka Nutanix), scale-out storage (aka Isilon) and all-SSD storage (aka Pure) are all approaching new paradigms in storage which will press so-called "Tier 1" further to the edge.
    I believe the days of 10K and 15K disks are numbered. With solutions like Nimble, where there is SSD and SATA only, the once-opposing requirements of speed and capacity are now succinctly addressed. The differentiation will come in software with hot-spotting, workload distribution, and storage pool abstraction.
    Therefore, I believe the current Tier 1 arrays are dinosaurs headed toward extinction. Distributed arrays (centrally managed), deduplication, host-based storage, etc. are creating higher-performance and more cost-effective solutions.

  5. April 3, 2012 at 2:19 pm

    While performance is one characteristic of a storage platform you have to look at the other things that make up a Tier 1 array; availability and supportability.
    On the VSP and VMAX you have unparalleled availability that none of the "modular"/startup boxes and solutions can even think to match.  Not even the AMS or VNX from the same vendors can do it.  Saying that spinning disk's days are numbered is like saying that "tape is dead".  While the disks may decline in numbers, there will be a use case for them for a long time.
    On the supportability front, what about the mainframe?  You have to have these big boxes for mainframe environments.  The mainframe is still the biggest and baddest database server out there.  For a VSP you get 2 hour response time by default.  Do you get that with a "modular"/startup box or solution?  For most of these startups it's difficult to find a local depot for parts replacement let alone a global organization providing break fix.
    While SSD is cool and is great in certain use cases, it is not a one-size-fits-all solution.  Architected properly, I have seen a VSP/VMAX solution do hundreds of thousands of IOPS with 0.5 ms response time across all workloads.  And this was even using spinning disk and auto-tiering functionality in the array.
    I guess it's fun sometimes to bash the traditional vendors for what some see as archaic solutions when a new startup comes out and their marketing machine gets spun up. But we should remember that most startups fail, and then what do our customers do when they can't get support or even upgrade what they bought?

  6. April 3, 2012 at 5:38 pm

    Some great comments guys..

    @Nathan

    Thanks for reading and finally chiming in. I personally see Tier 1 as both availability and performance. My thoughts are that the performance aspect may be slipping out of their reach thanks to the SSD revolution.
    On the point of VMAX and VSP adapting to breakages: some of the new SSD arrays are getting good at that. I hope the guys at Kaminario don't mind me mentioning that they very quickly engineered extremely fast drive rebuilds for me when I asked. I've known USP Vs and Symms take days to rebuild failed drives and take moderate performance hits during that window.  I've also had cache failures that have inflicted significant throughput issues (cache write-through modes) on USP V and Symm arrays. By the very nature of SSD-based technology, the hit of going straight to disk (SSD) is nowhere near as big.  It's not all rosy, but there are great and exciting signs.

    @Owen

    Thanks for your comment. I'm seeing more and more business-critical/mission-critical apps implement their HA in the application stack and not use some of the HA features of traditional arrays, diluting the significance of these array-based HA features.
    On your point about many apps wanting perf and availability: I see the traditional Tier 1 vendors having to up their game on the perf side and the SSD array vendors having to up their game on the availability side.
    Being somewhat pedantic, if my SSDs sit there underutilised, at least they're sucking relatively little power and kicking out relatively little heat ;-)  I know I'm taking it a bit far there :-D

    @Donny
    Some good points about the general future of storage. Cloud, SSD, scalable. There is the potential for the incumbent Tier 1 arrays to become dinosaurs, but the companies and engineers behind them will not let that happen without one almighty fight. The storage industry just seems to get more and more interesting every day!

    @Ron

    Snig, I've added an update to the end of the post to clarify my thoughts around availability. However, I will say that I think the very nature of the designs behind some of the SSD arrays, and the inherent characteristics of SSD, allow for better resiliency and reduced impact when failures happen. See my comment above to Nathan.
    As for mainframe support: it's just an opinion, but I wouldn't be at all surprised if in 10 years' time the only vendor developing arrays with MF support is IBM. It's a shrinking market where IBM has a distinct advantage. I'm not a fan of the DS8K by a long shot, but I have it on good authority that it is the platform to beat when it comes to MF support.
    Also, I have to ask how much ($/£, Btu, kVA, floor tiles etc.) these well-architected VSPs and VMAXes that deliver hundreds of thousands of IOPS at sub-millisecond response times actually consume.

  7. Donny Parrott
    April 3, 2012 at 9:29 pm

    Thank you everyone for the discussion. Many good points and thoughts to be considered.
    I agree that the availability and support of the Tier 1 systems has been extremely high, but then again so have the costs. I would like to ask everyone their thoughts, in the realm of storage, on two phrases I hear a lot lately: "good enough" and "designed to fail".
    I am seeing a lot of activity around commoditization and requirement alignment. To this end, resources are selected based upon requirement satisfaction, not necessarily capability. Some have coined this "the Cisco effect". As an example, if I require high performance over availability, then I select the tools to do so. If I can manage a lower performance requirement but am never allowed to have downtime, I select the appropriate tool.
    Designed to fail, on the other hand, addresses failure mitigation by planning for the loss of resources and adequately distributing workloads to render the failure insignificant.
    Therefore, how do we architect a storage solution that is cost effective (buy what you need when you need it), is performant per requirement, is available per requirement, and is well supported? In the "classical" sense, we build a large system with numerous components and paths, and to this we deploy a complex and resilient software set. This is the "Tier 1" array. Now, a new breed of arrays addresses this in another way, through commodity assets and intelligent software. These scalable arrays allow for performance, capacity, availability, serviceability, and support in a flexible matrix.
    I do have to speak to my experience in this arena. While not a Tier 1 array by any means, LeftHand made me a believer in my first experience. With multiple nodes in operation and LUNs replicated across nodes, I shut down an active node (with multiple Iometer sessions running) by removing the power (hard failure). The I/O was redirected to other nodes and the only measurable impact was the indicator for a loss of node. The compute environment and performance monitors were unaffected. Upon restoration of the node, all data changes were updated and synchronization restored within 8 hours.
    So, give me a VSP/VMAX I can purchase in bite size chunks at a competitive price point.
    But back to topic, unless the VSP/VMAX engines are redesigned to utilize the bandwidth within the SSD disk group, it becomes another overpriced and underutilized resource.

  8. April 4, 2012 at 3:42 am

    Nigel, there is a fight going on between the software and hardware companies.  Most storage features are implemented in software (thin provisioning, replication, deduplication etc).  Traditionally these run on the storage controller, but you can build them into the OS or software stack.  Everyone is out for some of the same IT budget, and companies like Oracle and Microsoft would rather you spend your IT budget with them.  If they can write software that means you can achieve the same outcome without spending big on storage hardware, they can charge more and still claim to be cheaper.
    100% agree with you that the traditional "Tier 1" vendors are probably not the best for performance; they are primarily built for reliability.  Didn't HP do something several years ago where they fired a bullet through a running array?  Of course, the super-performance arrays are also looking to expand their market, but both sides need to be careful they don't end up in a middle ground with a not-that-fast, not-that-reliable product that no one buys.
    In the end, it's not the product that delivers value, but the way it gets used.  You don't buy a motorcycle to deliver furniture, and you don't buy a 10-ton truck to be a motorcycle courier.  There is room for different storage products, both hardware based and software based, to meet different organizations' needs.  The trick is not knowing what works best, but knowing what works best where.
     
      Owen.

  9. Catherine Campbell
    April 4, 2012 at 12:00 pm

    Interesting debate. Whilst out and out performance is rarely the defining characteristic of what any organisation considers to be "Tier 1" for them (which may of course be radically different from one organisation to another), it is generally there somewhere on the shopping list, and the possibility of radically lower latency opens up opportunities which business and application owners would be foolish to ignore.
    As you say, Nigel, there are several tradeoffs which will work themselves out over the next <guessing> 18-24 months. These include:
    functionality (replication, dedup, ThP, etc) in array vs. functionality in OS/hypervisor vs. functionality in application – which impacts what you need out of the array by way of reliability and serviceability
    SSD (in widest sense) deployed in "general purpose" array vs dedicated shared device vs inside server – and how, in the latter 2 cases, that storage is managed as part of the wider storage landscape.
    We live, as ever, in interesting times, and the ability of vendors to respond to these changes (which probably means backing more than one horse at this point) will, I believe, be reflected in their long term success (or otherwise).
