Sub-LUN Tiering Design Considerations

With two of the “big three” Enterprise array vendors already shipping kit with Sub-LUN Tiering capabilities, and the other no doubt hot on their heels, we are in the early stages of an era that has subtly different design and architectural considerations from what we have become versed in over the past years. It’s my opinion that designing for sub-LUN tiering will be an art of its own over the next few years…

I had a great discussion about design considerations for Sub-LUN tiering last week on Twitter, and realised that just about everyone involved in the conversation had different opinions.  So I thought I’d put my own personal thoughts together in a blog post. 

DISCLAIMER: As always, these are my own personal opinions and they will no doubt change over the coming months and years.

Choosing Your Tiers Wisely

Designing tiers is not as simple as it first appears. Putting your finger in the air and picking your tiers will no doubt end in a poor design and a potentially bad experience with Sub-LUN tiering.

First up, there needs to be enough performance difference between the tiers to make moving extents between them worth the effort. If the tiers perform too similarly, the extra load that migrations place on the backend may outweigh any benefit.

For example, in my opinion, the following would not be a great design -

  • Tier 1 – 400GB SSD in RAID6 (6+2)
  • Tier 2 – 600GB 10K SAS in RAID6 (6+2)
  • Tier 3 – 1TB 7.2K SATA in RAID6 (6+2)

I list RAID6 for all tiers in the example above because more and more folks seem to be deploying RAID6, due to its availability, seemingly without considering the flip-side of the coin: the poor performance. Not only is there the extra overhead of the parity calculation, there is additional work for the backend as more I/O has to happen for each write. And let’s not forget the elongated rebuilds, which in large enough arrays with thousands of disks can see the array almost always in degraded mode (rebuilding failed disks). And we all know that performance sucks during rebuilds.

NOTE: I am aware that not all RAID6 is created equal. But in all systems that I know of, dual-parity RAID performs worse than single parity once all of the above is taken into consideration.
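To put some rough numbers on that backend overhead, here is a minimal sketch using the classic rule-of-thumb write penalties (2 for RAID10, 4 for RAID5, 6 for RAID6) and a made-up 70/30 workload – illustrative figures only, not measurements from any particular array:

```python
# Rough sketch: backend disk IOPS implied by a host workload under
# different RAID schemes. Write penalties of 2/4/6 are the usual
# rules of thumb, and the workload mix is hypothetical.

def backend_iops(host_iops, read_ratio, write_penalty):
    """Backend IOPS = reads + (writes x write penalty)."""
    reads = host_iops * read_ratio
    writes = host_iops * (1 - read_ratio)
    return reads + writes * write_penalty

host_iops = 10_000   # assumed host workload
read_ratio = 0.7     # assumed 70/30 read/write mix

for name, penalty in [("RAID10", 2), ("RAID5", 4), ("RAID6", 6)]:
    print(f"{name}: {backend_iops(host_iops, read_ratio, penalty):,.0f} backend IOPS")

# RAID10 ~13,000, RAID5 ~19,000, RAID6 ~25,000 - the extra parity I/O is
# a big part of why a RAID6-everywhere design leans so hard on the backend.
```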

For me, a better solution may look something like below -

  • Tier 1 – 400GB SSD in RAID5 (7+1)
  • Tier 2 – 450GB 10K SAS in RAID10
  • Tier 3 – 1TB 7.2K SATA in RAID6 (14+2)

Of critical importance, the above provides larger performance deltas between the tiers.

Tier 3 is a non-performer and well suited to cold data (data that is rarely accessed). Tier 3 will also be pretty cheap. You would hope it would be the largest tier in your array.

Tier 2 will hugely outperform Tier 3 and is well worth the effort of moving extents up from Tier 3. Using 450GB 10K disks also ensures that access density isn’t a problem – I’m still personally of the opinion that 600GB SAS/FC disks don’t provide enough IOPS per GB to be a high-performance tier… Also, as it is configured as RAID10 it will provide excellent write performance.
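As a quick illustration of the access density point, here is a back-of-envelope comparison (the per-drive IOPS figures are common rules of thumb, not vendor specs):

```python
# Back-of-envelope access density (IOPS per GB) for a few drive types.
# The per-drive IOPS numbers are rough rules of thumb; substitute your
# own vendor figures.

drives = {
    "450GB 10K SAS": (140, 450),
    "600GB 10K SAS": (140, 600),
    "1TB 7.2K SATA": (80, 1000),
}

for name, (iops, capacity_gb) in drives.items():
    print(f"{name}: {iops / capacity_gb:.2f} IOPS per GB")

# The 600GB drive serves roughly the same ~140 IOPS across a third more
# capacity, which is exactly the concern with using it as a performance tier.
```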

Tier 1 is based on SSD so is ideal for those read misses, and as it will be primarily used for read-miss data, the RAID5 parity penalty shouldn’t be a problem. Writes are not SSD’s sweet spot, which is another reason I think a RAID10 scheme is ideal for Tier 2 above.

Don’t Forget About Availability

All of the above need to be underpinned by solid MTTDL calculations. 

While performance is hugely important when choosing tiers, availability must still be considered – but availability must not be the be-all and end-all.

NOTE: I am in no way saying that availability isn’t important, or less important than performance. But… if performance is poor enough then, technically, your data may be as good as unavailable. Also, I see the odd person sacrificing performance to the false god of RAID6. I always advise using RAID6 wisely!

In my above example, Tier 2 will not normally be a large tier. In fact, most tiering designs will resemble a pyramid, with the majority of capacity at the bottom and the smallest amount at the top. I wouldn’t be suggesting RAID10 if you are expecting to have hundreds or thousands of disks in this tier (even across multiple subsystems).

NOTE: If you have 10 subsystems and each has 128 drives in Tier 2, you should consider your environment as having 1,280 drives in Tier 2 and calculate your MTTDL accordingly.
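For what it’s worth, here is a very simplified sketch of the kind of MTTDL comparison I mean, using the standard textbook approximations (independent failures, constant failure rate, no unrecoverable-read-error term) and made-up MTTF/MTTR figures – good enough for comparing schemes against each other, not for absolute guarantees:

```python
# Simplified fleet-wide MTTDL comparison across RAID schemes.
# MTTF and MTTR are assumed values; the formulas are the standard
# textbook approximations and ignore unrecoverable read errors.

MTTF = 500_000   # hours per drive (assumed)
MTTR = 24        # hours to rebuild a failed drive (assumed)

def mttdl_raid5(n):     # n drives per group, single parity
    return MTTF**2 / (n * (n - 1) * MTTR)

def mttdl_raid6(n):     # n drives per group, dual parity
    return MTTF**3 / (n * (n - 1) * (n - 2) * MTTR**2)

def mttdl_raid10(n):    # n drives arranged as n/2 mirrored pairs
    return MTTF**2 / (n * MTTR)

total_drives = 1280     # e.g. 10 subsystems x 128 Tier 2 drives

for name, fn, group_size in [("RAID5 7+1", mttdl_raid5, 8),
                             ("RAID6 14+2", mttdl_raid6, 16),
                             ("RAID10", mttdl_raid10, 16)]:
    groups = total_drives // group_size
    fleet_mttdl_hours = fn(group_size) / groups   # time to first data-loss event anywhere
    print(f"{name}: ~{fleet_mttdl_hours / 8760:,.0f} years to first data loss "
          f"across {total_drives} drives")
```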

At the end of the day, whatever the RAID configuration, an unlucky sequence of events can result in data unavailability or data loss. Don’t get overly paranoid about it. Base your decisions on solid MTTDL calcs, and if you can fit some RAID10 and RAID5 within your MTTDL calcs then do it. If we all jump on the RAID6 bandwagon and sacrifice our performance on the altar of RAID6, we will have application vendors and administrators demanding DAS – and we don’t want that backward step!!

Is Two Tiers Enough?

When I first started talking about sub-LUN tiering around 18 months ago, with SSD in mind, my natural instinct was to contemplate designs with only SATA and SSD. However, my opinion now is that the performance delta between the two tiers might be too great – I know I’ve just been saying the delta has to be significant enough. Hear me out…

RAID6 seems to be the only option for large SATA drives, especially 2TB+ SATA drives.  So you have the slowest drives available with the worst performing RAID, meaning that performance will be dire.  On the other hand SSD goes like the wind.  However, consider that an I/O stream that includes extents in both tiers will deliver hugely different response times.  Not ideal. 
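A quick, purely illustrative example of that response-time spread (the latencies below are assumed ballpark figures, not measurements):

```python
# Illustration of the response-time spread in a two-tier SSD + SATA design.
# Latencies are assumed ballpark figures for servicing an I/O on each tier.

ssd_ms, sata_ms = 1.0, 15.0

for ssd_fraction in (0.9, 0.7, 0.5):
    avg = ssd_fraction * ssd_ms + (1 - ssd_fraction) * sata_ms
    print(f"{ssd_fraction:.0%} of I/O on SSD: average {avg:.1f} ms, "
          f"but individual I/Os still swing between {ssd_ms} ms and {sata_ms} ms")

# Even with 90% of the I/O landing on SSD, the average hides a ~15x swing
# between best and worst case within a single application's I/O stream.
```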

Also, SSD is not a perfect fit for all workload types. In fact, it’s not great for random writes.

UPDATE: I had a question on Twitter from Joe Kelly (@virtualtacit) re my comment above that SSD isn't great for random writes.  Hopefully the following helps clarify –

SSD (NAND flash) is programmed in pages but can only be erased in much larger blocks (examples being 128K, 256K and 512K), so updating data in a block that is already partly written generally means rewriting the whole block. Assume, for this simplified example, that a block has 10 addresses, 0-9. The block already has addresses 1, 3 and 7-9 written to. An incoming write wants to write to addresses 2 and 6. It is not, in general, possible to just slot in addresses 2 and 6. Instead, the entire block must be read into memory, the flash erased, the new contents calculated and then the entire block re-written. This is similar to the overhead of small-block writes in a RAID5 scenario. For this reason, SSD ain’t the best when it comes to small writes. There are of course techniques to mitigate this; however, over time, and when your SSD tier is highly utilised (as you need it to be at its current cost), they become less and less effective. May be the topic of a future post…
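To make the write amplification concrete, here is a toy model of that read/erase/rewrite cycle, assuming for illustration a 256K erase block made up of 4K pages:

```python
# Toy model of the read/erase/rewrite cycle described above.
# Block and page sizes are illustrative assumptions.

BLOCK_KB, PAGE_KB = 256, 4
PAGES_PER_BLOCK = BLOCK_KB // PAGE_KB   # 64 pages per erase block

def write_amplification(pages_updated):
    """Flash actually programmed vs. host data written for an in-place update."""
    host_kb = pages_updated * PAGE_KB
    flash_kb = BLOCK_KB                 # the whole block has to be reprogrammed
    return flash_kb / host_kb

for updated in (1, 2, 8, PAGES_PER_BLOCK):
    print(f"update {updated} page(s): ~{write_amplification(updated):.0f}x write amplification")

# Updating 1 page costs ~64x, 2 pages ~32x, and only a full-block rewrite
# gets back to ~1x - which is why controllers work so hard to redirect
# small writes to clean blocks instead of updating in place.
```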

With these two considerations in mind, I don’t think two-tier systems are realistic at the moment.

 

Conclusion

It’s early days in the sub-LUN tiering world, and best practices and opinions will evolve over time. However, designing for sub-LUN tiering will be an art of its own, for the time being at least.

I’d be interested to hear your thoughts.

You can also talk to me, and a bunch of folks smarter than me, on Twitter. I can be reached by sending tweets to @nigelpoulton.

19 thoughts on “Sub-LUN Tiering Design Considerations”

  1. Nigel,

    Are you sure about your statement on SSDs? Where did you get your information? As far as I know, SLC SSDs (the so-called enterprise grade) perform *BETTER* with small writes (with block sizes in the 2K-4K range), due to the fact that the NAND pages are 4K inside the SSD.
    I don't know if my knowledge is up to date, but I'm quite confident in what I recall.
    Just my two cents.
    Fabio

  2. Hi Fabio,

    Thanks for popping by and getting involved.

    What you say is correct…. but like all things in this industry, must be taken with the usual pinch of salt (content on this website probably requires more than a “pinch” of salt ;-)

    As you say, pages are normally 4K and then grouped into blocks, often sized between 128K and 512K. But… NAND flash can only be directly written to if empty, so the first time you write to an area you can program it directly and the write is very fast. However, if writing to areas that have already been written to, that is when you encounter the read/erase/calculate/write cycle that I mention in the article.

    Enterprise-class SSDs have a ton of spare capacity that is manipulated in various ways to optimise writes to hit clean areas wherever possible – kind of like WAFL if you know NetApp, or redirect-on-write…

    So… as far as I’m aware, you are correct that writing to SSD in small writes is fast, as long as there is sufficient free capacity and the controller can allocate free areas to write to… but once you start having to write to areas that are already written to, performance dives.

    These techniques work well, like WAFL, as long as there is free capacity for the controller to perform its trickery. But my point about SSD is that it is so expensive, and most arrays will be configured with comparatively small amounts of it, that you will want to push its utilisation as high as possible – thus negating much of the cleverness the controller implements to offset the poor write performance.

    Make sense? Might have to write a post on this in the future.

    Nigel

  3. Nigel,
    Thanks for the clarification – now I get your point, you're referring to the write amplification issue. As far as my experience goes with SSDs, the garbage collection algorithm usually works OK (as in no noticeable performance degradation) up to 95% full, but my experience is limited to STEC drives.
    Fabio

  4. Hi Nigel,
    Thanks for the article. Whenever I look at the prospect of sub-LUN tiering (I have yet to implement it), I always wonder about the best way to decide the ratio of your tiers, as most implementations will just spread the load across the available tiers with no indication of how much better performance could be if more SSD, say, was added in – or, conversely, how well utilised the SSD is and how much of it could quite comfortably live on a lower tier.

    Does anyone have any experience / input into this?
    Also, I read a good overview of the wear-levelling techniques used by SSD controllers on The Register.
    http://www.theregister.co.uk/2010/12/13/making_mlc_flash_last_longer/
    It is specific to MLC but there is some interesting stuff there (patrolling for redundant data so blocks can be pre-erased for faster writes in the future; the TRIM command in the OS*; the continued movement of long-term data to more frequently used cells; and the use of DSPs to extract that last gasp from lower-performing cells – seemingly in a similar way to the disk drive guys using them for PRML).
    * there goes the effectiveness of all my data recovery / unerase software…

  5. Hi Fabio,

    Yes, STEC Zeus IOPS drives are usually formatted so that only about half of the real capacity is exposed (e.g. a drive that has 400GB worth of cells will be formatted so that the controller only exposes 200GB). That way, with the garbage collection and the controller utilising the hidden capacity, writes can almost always be directed to clean/erased blocks…

    I suppose one needs to be careful how their particular array vendor formats their drives – I know that, at least in the past, vendor X would format the exact same physical drives to expose, for example, 300GB of usable capacity, whereas vendor Y would expose only 200GB – capacity vs performance…
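    To put rough numbers on that trade-off (capacities as per the hypothetical vendor X/Y example above):

    ```python
    # Rough overprovisioning comparison for the same physical drive
    # (400GB of raw NAND) formatted to expose different usable capacities.

    raw_gb = 400
    for exposed_gb in (200, 300):
        spare_gb = raw_gb - exposed_gb
        print(f"expose {exposed_gb}GB: {spare_gb}GB spare, "
              f"~{spare_gb / exposed_gb:.0%} overprovisioning")

    # ~100% overprovisioning at 200GB exposed vs ~33% at 300GB - more usable
    # capacity, but less headroom for the controller's write trickery.
    ```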

  6. Gordon,

    Thanks for your comments…

    All good array vendors should have tools that will take statistics from your array and model different scenarios if you were to add X tier or Y tier. Of course, depending on how mature their sub-LUN technology is, these tools will improve over time as more is learned from data gathered in the field… If your vendor can’t offer you any guidance then you may want to reconsider your vendor ;-)

    For shops that have very little understanding of their workload and are having to make wild guesses as to how much of each tier to install, I would recommend installing as little SSD as possible at first and then adding more as and when required. I certainly wouldn't be installing more than 5% SSD if you're not certain you need it.
     
    Also, it may be worth talking to your peers (folks who work for companies in the same line of business as you), and asking your vendor for guidance on what they are doing at companies in your line of business…

    The thing is, sub-LUN requires a huge rethink on so many fronts and we will all be learning over the next few years – including our esteemed vendors.

  7. There are a few things that are often overlooked when discussing SSD write performance:
    1) With drives like the STEC ZeusIOPS, there is a specific amount of space reserved for bad-block replacement and pre-erasing for writes, and this space is not available for other purposes. For example, the 400GB Zeus actually has 512GB of NAND Flash. So even at 100% full, writes will still land on pre-erased space, minimizing the overhead of writes
    2) Many drives (including the ZeusIOPS) do not maintain a strict mapping of external LBA to internal layout – instead, they virtualize the mapping inside the drive. Thus, an internal 128KB "page" can contain discontiguous LBA blocks; each of these LBA blocks is mapped to the appropriate external reference.
    3) The above allows the drive to collect "random writes" into contiguous pages (as Nigel says, analogous to WAFL) and then write them to flash, mapping the new blocks into the appropriate spaces and invalidating the older blocks (which will eventually be reclaimed by the garbage collection processes).
    4) An array that is aware of this (such as VMAX and CLARiiON) will try to cache as many random writes as possible, delaying physical writes to the SSD until there are sufficient blocks/tracks to fill up multiple pages on the SSD, thereby optimizing the write performance of the SSD.
    5) Done right, the delay between this destaging of pending writes actually allows the drive time to "clean up" by coalescing invalidated blocks and pre-erasing more pages for the next set of writes.
    The net effect is that with at least SOME SSDs, in at least SOME storage arrays (including both Symmetrix and CLARiiON), random (and sequential) write performance is markedly better than that of 15K spinning rust – often approaching the MB/s limits of the drive even with a front-end load that is all small-block I/O. Thus, unlike most commercial SATA SSDs, enterprise-class EFDs do actually benefit small-block random write performance – and sometimes significantly.

  8. Pingback: Juku
  9. Seagate are now doing an HDD with on-board NAND flash. Whilst I'm not sure how this is utilised at the moment, their firmware guys are smart enough to get it doing its own sub-drive tiering eventually, if it is not already fully doing so.
    I was thinking about what happens if these types of drives are then used in an array that has a mix of traditional HDDs and SSDs, along with sub-LUN tiering code – how would it all tie in together, and how would you put it into a performance model?
    And then my head started to really hurt   :-)

  10. Hi Gordon,
    I see some positives for using hybrid drives like these in arrays, but wonder if the drawbacks and challenges will make it not viable.  Architecturally it has its pros and cons/difficulties/overlaps with array firmware etc…. :-S
    Brain fry thinking of that ;-)
