Tiering in All-flash Arrays

By Nigel Poulton | April 28, 2014

Last week, as part of a quick post I scribbled about Pure Storage, I asked the question – Are they still innovating?

Fair question I thought – after all, not a lot has changed in their product over the last ~2 years. But then I struggled to think of areas where they could realistically still innovate.

At first, I struggled….

Anyway, I was lying awake the other night thinking about this (as you do) and remembered a conversation I’d had with them a while back. They’d been talking about consumer-grade flash, and how their OS (Purity) was designed from the ground up to work with it in the best way possible blah blah blah…. Anyway, we’d also been bemoaning the crappiness of traditional storage arrays and their hideous flash + spinning rust tiering implementations blah blah blah. So I’d asked them “what about having different tiers of flash in an all-flash array?” Something like SLC, MLC and TLC. Or probably not even bothering with the SLC. And the more we discussed it, the more I thought “yeah, that could be rockin’!”

Tiered All-flash Array (TAFA)

But hang on a second…. if I think traditional storage arrays that tier across flash, 10K and 7.2K drives are a monstrosity, why would an all-flash array with two or three tiers of flash be any better? Well……

For starters – and probably the most important point – although TLC (Triple Level Cell) flash has the kind of write endurance that would barely satisfy a compliance archive, it’s actually pretty damn fast. Seriously… according to some of what I’ve read, the IOPS and latency difference between TLC and MLC is nowhere near the difference you’ll get between 7.2K NL-SAS and MLC. Nowhere near!

So in a traditional tiered storage implementation, the lowest tier is usually 7.2K NL-SAS drives. These drives give high capacity but perform like a two-legged blind dog that’s lost its sense of smell – especially the bigger drives (4TB and up). So on traditional arrays with a mix of rotating rust and well-greased flash, it’s a real trade-off between capacity and performance – and trade-offs suck!! But if you’re an all-flash array, and your lowest tier is TLC flash, then you get the high capacity but don’t have to sacrifice performance. There’s no more trade-off.

TLC Flash Performance

On the performance front, read and write performance for MLC and TLC don’t have to be a million miles apart. In fact, unless you’re doing something wrong, read performance of TLC should probably be about the same as read performance of MLC. Write performance…. Hmmmmmm…. yeah, probably not quite the same. But we shouldn’t be talking orders of magnitude different. Maybe TLC has 1/2 or 2/3 the write performance of MLC!? Not too shabby!
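
To put some rough shape on that claim, here’s a quick back-of-envelope sketch in Python. The latency figures are purely illustrative assumptions on my part – not vendor specs – but the relative gaps are the point: TLC to MLC is a small multiplier, NL-SAS to MLC is a different planet.

```python
# Illustrative read latencies only (assumed, not vendor specs). The point is the
# relative gap: TLC vs MLC is a small multiplier, NL-SAS vs MLC is another world.
assumed_read_latency_ms = {
    "MLC flash": 0.5,     # assumed random-read latency
    "TLC flash": 2.0,     # assumed (a few times slower than MLC)
    "7.2K NL-SAS": 10.0,  # assumed seek + rotational delay for a random read
}

mlc_ms = assumed_read_latency_ms["MLC flash"]
for media, ms in assumed_read_latency_ms.items():
    qd1_iops = 1000 / ms  # rough single-stream (queue depth 1) read IOPS
    print(f"{media:12s} ~{ms:4.1f} ms  ~{qd1_iops:6.0f} IOPS (QD1)  {ms / mlc_ms:4.1f}x MLC latency")
```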

Sure, you don’t want to be writing to it like an Etch A Sketch, as it’ll wear out faster than the brake pads on an F1 car. But on an array like a Pure Storage AFA, the array’s OS should be clever enough to keep writes to TLC to an absolute minimum! After all, the AFA guys are always banging on about how clever their software is, and how amazing it is at working properly with flash. So if they really are half decent, then it should be a walk in the park to make TLC a viable option as a so-called capacity tier.

Point being……. the actual IOPS and latency figures of TLC flash are much more in line with MLC flash – you just can’t write to it that often. But like we said….. if it’s your bottom tier, you won’t be writing to it much anyway! Plus, lash some half-decent array code on top of it and you’ll hardly ever be writing to it. Sounds like a winner to me!
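
For what it’s worth, here’s a minimal sketch (in Python) of the kind of write-heat policy I’m imagining. Everything in it – names, thresholds, the extent-based layout – is hypothetical and made up for illustration; it’s not how Purity or anyone else actually does it.

```python
# Hypothetical sketch of a write-heat-aware placement policy for a two-tier
# (MLC + TLC) all-flash array. All names and thresholds are made up for illustration.

HOT_WRITE_THRESHOLD = 4   # writes per scan interval that keep an extent on MLC (assumed)

class Extent:
    """A fixed-size chunk of the array's address space."""
    def __init__(self, extent_id, tier="MLC"):
        self.extent_id = extent_id
        self.tier = tier          # "MLC" (performance tier) or "TLC" (capacity tier)
        self.write_count = 0      # writes seen in the current scan interval

def record_write(extent):
    """All writes land on (or promote to) MLC, so TLC sees as few writes as possible."""
    extent.write_count += 1
    if extent.tier == "TLC":
        extent.tier = "MLC"       # promote on write

def rebalance(extents):
    """Periodic scan: demote extents that stayed cold, then reset the heat counters."""
    for ext in extents:
        if ext.tier == "MLC" and ext.write_count < HOT_WRITE_THRESHOLD:
            ext.tier = "TLC"      # cold data drifts down to the capacity tier
        ext.write_count = 0
```

The idea is simply that TLC only ever receives data that has already proven itself cold – which is exactly the write-minimisation the AFA software would need to make TLC endurance a non-issue.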

Another Reason Tiering Just Flash is Better than Tiering Flash and Disk!

Oh and another reason why a tiered AFA will be better than a traditional array with a combo of flash and spinning disk tiers….. those traditional arrays just weren’t designed to work well with flash! Yes, most of them have been updated or flash-optimised, but honestly, that’s a bit like trying to add an egg to a cake after it’s baked. A tiered AFA on the other hand has the flash intelligence added into the mix before the baking! Tastes a lot better and much less chance of making you sick.

Potential Challenges

After a quick bit of thinking, I’m not sure I came up with any major ones.

One thing that did come to mind was resource constraints on the existing crop of all-flash arrays. Adding additional workload to them could become an issue, and could further impact their ability to scale (up or down).

After all, most of the AFAs out there at the moment are fairly small – a few shelves of flash drives seems to be the limit right now. And that’s no doubt for a reason. If you’re deduping and compressing everything in and out of the array, you need CPU grunt to do that, and RAM/SLC flash to hold the mapping tables and the like. So throwing additional overhead into the mix might put even more strain on the modestly spec’d Intel-based AFAs already out there. Keeping track of how many writes a particular part of the address space is getting could be non-trivial – at least if you’re mapping at a small enough granularity….
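
To illustrate why granularity matters, here’s a rough sizing sketch. Both the 100TB address space and the 4-byte counter per extent are assumptions of mine, purely for illustration.

```python
# Back-of-envelope: RAM needed just to hold one write counter per extent,
# for a hypothetical 100TB address space at different mapping granularities.

ADDRESS_SPACE_TB = 100
BYTES_PER_COUNTER = 4                      # a 32-bit write counter per extent (assumed)

for extent_kb in (4, 64, 1024, 16384):     # 4KB pages up to 16MB extents
    extents = (ADDRESS_SPACE_TB * 2**40) // (extent_kb * 2**10)
    map_mb = extents * BYTES_PER_COUNTER / 2**20
    print(f"{extent_kb:>6} KB extents -> {extents:>15,} counters -> ~{map_mb:>10,.0f} MB of RAM")
```

Track heat at 4KB and the map alone runs to roughly 100GB of RAM; track it at multi-megabyte extents and it shrinks to a few tens of megabytes, but then hot and cold data get lumped together. That’s the trade-off the array software has to juggle.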

Conclusion

Like I say, I talked to the guys at Pure – and other companies as well – about this a year or so back. I’ve no idea if they’re working on it, but it’s dead centre of their raison d’être – flash at the price of disk. Well… with decent dedupe and compression algos, plus a nifty little tiering system like this, what’s stopping them giving us flash for the price of off-shore tech support… CHEAP!!
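
Just to show the sort of arithmetic I mean – every price and data-reduction ratio below is a made-up assumption for illustration, not a quote from Pure or anyone else:

```python
# Illustrative only: how dedupe/compression plus a TLC capacity tier could close the
# $/usable-GB gap with disk. All prices and ratios are assumptions, not real quotes.

assumed_raw_cost_per_gb = {"MLC flash": 0.70, "TLC flash": 0.40, "7.2K NL-SAS": 0.05}
assumed_data_reduction  = {"MLC flash": 5.0,  "TLC flash": 5.0,  "7.2K NL-SAS": 1.0}

for media in assumed_raw_cost_per_gb:
    usable = assumed_raw_cost_per_gb[media] / assumed_data_reduction[media]
    print(f"{media:12s}: ${assumed_raw_cost_per_gb[media]:.2f}/raw GB at "
          f"{assumed_data_reduction[media]:.0f}:1 reduction -> ~${usable:.2f}/usable GB")
```

With numbers like these a deduping TLC tier doesn’t undercut NL-SAS, but it gets close enough that the performance difference starts to decide the argument – which is really the whole point of the post.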

9 thoughts on “Tiering in All-flash Arrays”

  1. Big JM

    You lost me just too early in this post.

    Let’s assume we lived in your fantasy world of AFA’s at the cost of spinning media–or even lower. Awesome!! So what happens to the cost of spinning media? Doesn’t it also lower by extension? In other words, no matter how low the cost of SSD gets, spinning disk will always be lower.

    And if we’re tiering data based on the actual premise associated with the value of tiering–meaning stale data should live on cheap disk–then, financially speaking, won’t it ALWAYS make more sense to use SAS drives for your stale data?

    See, I’m not talking about a hybrid approach–like Nimble or XIV. I’m talking straight up traditional tiering leveraging SSD and a moderate density (600-900GB) SAS drive to get the best bang for your buck. There’s no way to get AFA’s to match that kind of $/GB.

    Also, I hate to go all “let’s have a tech fight here” since I respect your site a lot, but honestly this notion of “traditional arrays not being built for SSD” is just silly. This is more kool-aid drinking than actual fact. Let’s pick on your beloved Pure for a second. In order for Pure to accomplish anything they have to rely heavily on the capabilities of the SSD. Theirs isn’t an architecture that actually leverages SSD to maximize its capabilities. Quite the contrary. Theirs is an architecture that is heavily dependent upon SSD to deliver their software functionality. Near as I can tell, the only thing Pure does that’s innovative on the SSD level is their ability to monitor and protect from the individual cell failure vs the entire SSD. That is very cool stuff. Their inline dedupe is a performance inhibitor waiting to happen. First a cheap tag then a byte-to-byte comparison. Where’s the resources for that work going to come from? You guessed it, your CPU’s & SSD IOps. Some bloggers, like yourself, would have us believe that trading off from the IO at the SSD level is fine BUT I have to disagree. If, at any point in time, I have to rely on my SSD IO to cover some software functionality then I’m not purpose built for SSD, I’m dependent upon it and that’s not a good thing. Looking at their overall architecture objectively if you were to remove SSD from the equation you’d be left with a poor representation of what storage is supposed to look like. If that’s what you mean by “built for SSD” then you can have that.

    There are more problems with their inline dedupe. Overrun one of their active/passive controllers and dedupe goes post-process–and this is not an unlikely occurrence by any stretch. For a company that is selling on the value of compacted space that’s a pretty big, shiny, glaring hole to have, don’t you think?

    They’re not alone. Nimble should NEVER call themselves an AFA. They’re basically a NL array with an EMC Fast Cache rip off. Speaking of EMC, XtremIO is another solution that just has hole after hole after hole in it. SPOF anyone?

    I digress. I can bash all the vendors out there–from EMC to HP to Violin and beyond–but that wouldn’t change some realities. Kaminario for their part stepped up to the SPC-1 and delivered a respectable IO number. Unfortunately their latency was pathetic at 3.44ms. BUT at least they stepped up and wowed us with more than just kool-aid. And if we’re being fair, they stayed sub-ms up to about 60% load and then crept up over the millisecond mark.

    The interesting thing is that HP’s 3PAR array, which is actually an IO processing machine, is the only vendor to submit an SPC-1 benchmark from SSD and come in at sub-ms for their response time at 100% load. Isn’t that, by definition “built for SSD”? Of course it is. No inline dedupe for them so you choose to disqualify them I’m sure. EMC on the VNX gets substantial performance benefits from FAST Cache which, if we’re truly being technologically objective, is much more impressive than anything Pure or Kaminario may be doing.

    Sorry for the long rant and I normally love your site but this “not built for SSD/AFA” nonsense has to go away. Either somebody is funding your post or you’re so far deep into the kool-aid you’ve lost sight of the porous architectures that these AFA’s are built on. I’m sorry but if I have to rely on SSD to make up for the shortcomings of my architecture then I don’t have an AFA. I have a problem waiting to happen.

  2. Nigel Poulton Post author

    @Big JM

    Thanks for your comments. Really interesting!

    First I gotta say that nobody is funding my posts – fact. Second, I’m not deep in the kool-aid. I can’t say that’s a fact, but it’s my strong opinion 😀

    So my whole thought process re “flash at the cost of spinning media” is entirely based on dedupe and compression. No dedupe, no case! But assuming we have dedupe on SSD, throwing it onto TLC will just make that flash (and AFA) even cheaper. And… although I didn’t go into it, I think the price per IOP is going to get more and more important. Spinning media is going to get worse and worse at write performance – look at shingled magnetic recording (SMR) that’s on its way…. I’m not a fan of the price per IOP metric (yet) but as spinning media gets bigger and more awkward to work with, I think that metric will become important. All of the innovation around spinning media is in the capacity area, none in performance. So back in the day, for performance, I used to buy 18GB 15K spindles; tomorrow the best I’ll be able to get will be something like a ~1TB 10K drive, and more commonly multi-TB 7.2K and 5.4K. That’s not a great combo when it comes to enterprise workload performance. Add SMR drives into the mix, and update-in-place operations on spinning media will become a royal pain in the rear! HDDs’ days as performance media in the enterprise are numbered. Not over, but definitely numbered.

    On your point re the lowest tier always being on spinning media… I get your point. My line of thought was two-fold…. One, that TLC with dedupe could be close to the $/usable-TB of spinning disk but have performance benefits. I know that’s debatable, but that’s the thinking. Two, having spinning disk and flash in the same array with each as a persistent tier is relatively hard to do (to do well, IMO). You’re optimised for one but not the other. Though I suppose you wouldn’t have to be uber-optimised for spinning media if it was your lower tier. And I suppose you may be able to afford the dedupe penalty on spinning media if it’s your lowest tier. So I see that one’s still wide open. But I believe many traditional array architectures are very poor choices for flash.

    Re a “straight up traditional tiering leveraging SSD and modest density (600-900GB) SAS” as you put it. Well, I doubt an architecture like that will get any dedupe, so a deduping AFA may not be too far off the target $/TB. And I think some of the AFA vendors are betting their business on achieving that $/TB target. Surely all AFAs that aren’t gunning for Tier 0 uber IOPS and uber low latency are gunning for the same $/TB as a traditional tiered VNX/3PAR/CML/NTAP…? If they can’t achieve that, then “sayonara”.

    When I say “built for SSD/SS/flash” I mean exactly what you say – it relies on and works to the strengths of the media – take the media away and it falls apart. In the same way that many of the traditional arrays designed years ago around spinning disk struggle with solid state. The properties and behaviour characteristics of the two are worlds apart, and something designed for flash would necessarily rely on the properties of flash in order to work in an optimal way. I’m resisting a really cheesy analogy 😉 I have several experiences and thoughts around why traditional arrays aren’t great with solid state, even 3PAR in some respects. Though I will agree they put in a pretty good show with the all-flash 7450 – respect to them for that.

    BTW your comment about my “beloved Pure” honestly hurt! I’m no fan-boy. I like their approach and think I understand what their array is and isn’t, but honestly I’m no fan-boy. I thought I was originally calling them out for sitting around scratching their arses for the last two years – where’s the innovation!!!!??? But re Pure’s dedupe going post-process under heavy load – I would hope it would! As always, it’s not a magic bullet….. use it appropriately. I certainly wouldn’t dream of using it for everything. Though I do agree it’s a gaping hole in their “space efficiency” story, and one that should be more clearly stated, but I won’t be holding my breath.

    XtremIO is interesting IMHO, though yes, they have a hole or two to plug first before I’d put it in production. The double-drive failure in a single X-brick for starters….. But XtremIO is evidence from EMC that their existing line-up (VNX and VMAX) wasn’t up to the challenge of all-flash. Evidence for the need for purpose built AFAs, no?

    And on your final point where you say “if I have to rely on SSD to make up for the shortcomings of my architecture then I don’t have an AFA”. Hmmmmmm, like I said earlier, I don’t agree with that. Isn’t it like saying I don’t have a cache-centric architecture because I need a boat load of cache in my machine for it to work? Or maybe even saying I don’t have a spinning-disk-centric architecture because I need to front it with gigabytes and gigabytes of cache for it to perform anywhere near reasonably? Of course an AFA architecture relies on the performance and other characteristics of flash.

    Gotta say, I miss conversations like this on blog posts! Appreciate your input. And I’m open to you being right on all counts. I don’t think so, but I’m open to it 😀

  3. Big JM

    Okay so now we’re getting somewhere…

    I agree with your fundamental reasoning. IF we can use a model of $/IOp–or $/Usable IOps (even better)–then tiering in an AFA could make some logical sense. Although one could still make a counter argument should dedupe/compression–even post process dedupe–come to an array with both SSD & Spinning media.

    On your second point we have to agree to disagree or we can keep debating. It’s nice that you’re open to my being right–especially since you’re so blatantly wrong. 😉

    Every technology purchase is an investment. It’s an investment in innovation, engineering, mechanics, etc. Where is the engineering on these AFA’s? They’re not built to maximize the potential of the SSD but rather with a major dependency on the SSD itself. The fact is the only thing these arrays are actually doing is trading off the full potential of Flash for the extension of its life. There is nothing innovative in building an array whose sole purpose is based around the extension of the life of an SSD. That’s like building a house just so you can have a closet!!

    Since you and I agree on 3PAR’s benchmark results, let’s use them as an example. 3PAR’s design makes the disk–and connectivity for that matter–somewhat irrelevant. IO comes in, it’s cached actively across all the nodes. Cache hits a certain amount and the data is committed in full stripe writes to the back end. Since they have no raid groups or concatenated raid stripes or anything clunky like that, they basically just use the whole backend pretty much evenly. Plug in SSD, they cache a little less, flush a little more often. So the end result? Well give them spinning media and they had a 450K IO benchmark at 7ms latency (or slightly less), give them SSD and they push a sub-ms result. They basically invented Thin Provisioning and deduplicate zeroes inline–with no performance drop off or having it shut off ever.

    That to me is an engineered design. There’s no dependency on the media because they figured out how to process and then protect the IO. Granted, they have no dedupe/compression just yet. (I’m assuming they will have this at some point but only time will tell.) I think most people who are technically objective would prefer this design over any other.

    VNX leverages FAST Cache to offload cache misses and cover up the slowness of spinning media. They’re much more resilient than any AFA and, at the end of the day, FAST Cache delivers very, very good performance. (Granted lots of work that has to be done to make it happen by the admin.)

    Hitachi’s ability to cache incredible amounts of IO–of all sizes–and then slap that to SSD with very good performance is very impressive.

    Even NetApp’s implementation of VST has something to be said for it.

    The point I’m making is that the legacy arrays of old were purpose built around IO and at the end of the day IO is the most important thing. Processing it and then protecting it trumps salvaging the wear life of an SSD any day of the week & twice on Sunday. Not one of the AFA’s, IMO, are actually worried about IO. They’re just throwing it up in the air and saying “Don’t worry SSD will cover it. Here’s how we make your SSD last long.” and that is a very flawed statement.

  4. Nigel Poulton Post author

    @Big JM

    I agree that 3PAR has its merits, and those are borne out with the 7450 and the SPC-1 results. However, I think under the hood it’s not ideal for consumer-grade flash. It shipped with 50GB SLC drives for a lot of years while they got their story straight. But even now, 3PAR burns through SSD at quite a clip compared to at least some of the AFA systems out there. And it would be better if it did dedupe. Sure, I expect dedupe is just around the corner, but it remains to be seen how good it is.

    Hammering the AFA guys for wanting to prolong the life of flash (I get that you may be saying it’s their entire value proposition, and as such not a great value prop)…. but let’s be honest, 3PAR, CML and VNX will all be doing dedupe on flash as soon as they can. And by the time they get there, maybe some of the AFAs will be more up to the enterprise level of the traditionals. Though I’m not certain what they currently lack – unless you’re talking features like replication (incl SRM) and snaps….?

    As for VNX and FAST Cache. That’s a real band-aid in my opinion. I’m certain that a major reason 3PAR hasn’t done a FAST Cache of its own is because it doesn’t need it – at least nowhere near as much as VNX does. But VNX FAST Cache, as good as its performance can be, is an ugly bolt-on that covers the cracks in a creaking platform.

    I get your final point about the IO being important. Are you saying the AFA products on the market don’t process and protect IO in a suitable way??

  5. Big JM

    What I’m saying is they don’t process IO in a way that maximizes the IO/latency delivery of the SSD itself–which is the main reason we are supposed to be buying these Flash arrays in the first place.

    They care about dedupe. They care about write endurance/write amplification. They care about protection–to some degree. That’s where it ends. Inevitably we, the consumer, are left to debate “AFA” vs “Legacy Storage” based on these points alone. While those points are important it’s only half the discussion.

    We are moving to Flash to get latencies on par with memory. That’s the whole point. If Spinning Media could drive sub millisecond response times Flash wouldn’t exist and this conversation would be moot.

    That said, I want to know who is being innovative around IO processing. If someone is leaving it up to the SSD I want to know why. I don’t want to hear how they’re borrowing IO from the SSD because they have it in abundance. SSD may have IO in abundance BUT it certainly doesn’t have latency in abundance because only one vendor can publish an independent sub ms benchmark!!

    Which vendor is getting sub ms response times from SSD AND giving me a full software suite on top of that? Which vendor’s dedupe is REALLY inline and not impacting that response time? Which vendor is delivering 5 nines–or better–on top of that sub ms response time? That is what is truly important about Flash as far as I’m concerned.

    Whose protection is best is debatable IMO. Scale up or Scale out. Active/Standby at the controller level. No shelf level protection for any of the vendors. Is that a big deal? I would say it’s no worse than existing storage.

    By the way, thank you for indulging me in this conversation!!

  6. Nigel Poulton Post author

    I get that I/O latency is *one* of the reasons we’re using flash. But I’m not sure that’s the be-all-and-end-all of utilizing flash. If you’re Violin or Kaminario then maybe yes. But because of technology, architectures and cost, I think there’s a pretty small market for this at the moment. And a lot of these AFA guys aren’t looking to set world records with flash – they’re looking to build a better solution than what’s already out there for tier 1. Well…. actually they’re looking for a slice of the multi-billion dollar tier 1 market.

    I’d say a huge aim of many of the AFAs out there is to get rid of the wild differences in read and write latency when having to hit the backend. Raison d’être being to deliver much more of a balanced experience with more predictable I/O – getting rid of those nasty peaks and troughs in latency that are a nightmare for every application out there. Cache can only do so much for spinning media when it comes to avoiding peaks and troughs. I’ve got the scars and I’m sure you do too.

    Yes, I think the AFA guys want to be fast (low latency), but I think a huge part of what they’re aiming for is predictable, evened-out I/O. I honestly do. And as a former end-user and customer I sure as hell wanted that. And I was less interested in uber-low latency!

    For me, getting latencies in line with memory will come with technology other than flash. Maybe STT-RAM, PCM or something else, but not flash. To me flash won’t sit on the throne of T1 for long – pretty soon it’ll be relegated to the middle tier – in between spinning disk on the bottom and next-gen solid state on top. The role of flash is to even out response times. It’s not cut out for memory-level latencies and won’t be sitting on top of the performance hill for long.

  7. John Hayes

    Hi Nigel, TLC has great characteristics for colder storage and you identified the key software optimization for using it – identifying data that’s likely to live a long time. The media performance is a lot worse – writes 3x slower (up to 7 ms/page) and reads up to 4x slower (1-2 ms/page). It’s hard to care about these factors when you weren’t planning on writing that much and reading is still way better.

    As far as software problems go it’s a lot easier to predict write rate compared to read rate – all the problems are on the hardware side. First, there aren’t any credible TLC SSDs out there – and even the ones that existed aren’t less expensive than cMLC. Samsung juiced theirs with some SLC to front writes and that ate a lot of the cost benefit – it’s possible an SSD is too small a form factor to take advantage of TLC.

    If you go down to the chip level, the products aren’t there – you sign a dozen NDAs to find out about TLC, it’s the previous-generation technology, and their engineer has a look of despair whenever you ask about how bad it can get. It’s the moonshine of the NAND business – it goes into SD cards because no one cares if a bit flips in the middle of their 20 MB JPEG. A tweaked controller extends the life of a NAND media design and the fab.

    There’s no reason this has to be the case but the difficult transition to 3D NAND pretty much cancelled all plans to advance TLC through 2013 and 2014. MLC has resumed steadily dropping in price after a flat 2013.

    Big JM, the price of HDDs isn’t locked to continue dropping in proportion to NAND. To make an HDD you actually need raw materials – stuff dug out of the crust of the earth. Even as media gets denser, making servo motors, metal cases and magnets is not getting cheaper. This creates a minimum viable cost for an HDD that isn’t shrinking, so HDDs only deliver decreasing $/TB by getting larger. Gross margins for a chip are in excess of 90% – chromebooks and ultrabooks switched to SSDs because they were cheaper on an absolute measure. There’s no such thing as a 16GB HDD. The iPod switched to NAND from Microdrives because it’s more reliable, lighter and uses less power.

    I don’t know if you remember the first consumer digital cameras; Sony made one in the mid-90s that actually took a floppy disk. Not exactly a product that takes the world by storm, but that’s about our current level of sophistication when using NAND in an SSD. Over the next few years, film cameras collapsed; now Shutterfly owns Kodak.

    For SSD dependency I refer you to a great talk by Dave Wright:

    https://www.youtube.com/watch?v=AeaGCeJfNBg

    Nigel, the shorter answer to your question is “yes” 🙂

    John

  8. nate

    Hey Nigel –

    It may not be classified as a traditional AFA but the Dell Compellent platform has support for at least two different types of SSD tiers and the auto tiering is aware of them.

    http://www.dell.com/us/business/p/dell-compellent-flash-optimized/pd

    “With the ability to tier across write-intensive SLC SSDs and lower-cost, high-capacity, read-intensive MLC SSDs in a single solution, Dell Compellent Flash-Optimized solutions can reduce costs up to 80 percent compared to other Flash-Optimized Solutions.”

    I am a 3PAR person myself but have to admit it is a very innovative approach.

  9. Sam

    We currently operate an all-flash Compellent SC8000 array. I personally think it’s one of the best options out there. Tiering on flash makes a lot of sense… there’s no more big performance drop like you’d expect on tiered disk.

    Now if we could just enable dedupe…
