With two of the “big three” enterprise array vendors already shipping kit with Sub-LUN Tiering capabilities, and the other no doubt hot on their heels, we are in the early stages of an era with subtly different design and architectural considerations from those we have become versed in over the past few years. It’s my opinion that designing for sub-LUN tiering will be an art of its own over the next few years…
I had a great discussion about design considerations for Sub-LUN tiering last week on Twitter, and realised that just about everyone involved in the conversation had different opinions. So I thought I’d put my own personal thoughts together in a blog post.
DISCLAIMER: As always, these are my own personal opinions and they will no doubt change over the coming months and years.
Choosing Your Tiers Wisely
Designing tiers is not as simple as it first appears. Sticking your finger in the air and picking your tiers will no doubt end in a poor design and a potentially bad experience with Sub-LUN tiering.
First up, there needs to be enough of a performance difference between the tiers to make moving extents between them worthwhile. If there isn’t, the load placed on the backend during migrations may not be worth the effort.
For example, in my opinion, the following would not be a great design -
- Tier 1 – 400GB SSD in RAID6 (6+2)
- Tier 2 – 600GB 10K SAS in RAID6 (6+2)
- Tier 3 – 1TB 7.2K SATA in RAID6 (6+2)
I list RAID6 for all tiers in the example above because more and more folks seem to be deploying RAID6 for its availability, seemingly without considering the flip-side of the coin: poor performance. Not only is there the extra overhead of parity calculation, there is additional work for the backend as more I/Os have to happen for each write. And let’s not forget the elongated rebuilds, which in large enough arrays with thousands of disks can leave the array almost permanently in degraded mode (rebuilding failed disks). And we all know that performance sucks during rebuilds.
NOTE: I am aware that not all RAID6 is created equal. But in all systems that I know of, dual-parity RAID performs worse than single parity once all of the above is taken into consideration.
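To put some rough numbers on that, here’s a quick back-of-envelope sketch of the backend load generated by the different RAID schemes. The write penalties (2 for RAID10, 4 for RAID5, 6 for RAID6) are the standard textbook values and the workload figures are purely illustrative – your array and workload will behave differently.

```python
# Back-of-envelope: backend disk IOPS generated by a front-end workload
# under different RAID schemes. Textbook write penalties:
#   RAID10 = 2, RAID5 = 4 (read data + parity, write data + parity), RAID6 = 6.

WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}

def backend_iops(frontend_iops, write_ratio, raid_level):
    """Backend IOPS = reads + (writes * write penalty)."""
    reads = frontend_iops * (1 - write_ratio)
    writes = frontend_iops * write_ratio
    return reads + writes * WRITE_PENALTY[raid_level]

# Illustrative example: 5,000 front-end IOPS with a 30% write mix
for level in ("RAID10", "RAID5", "RAID6"):
    print(f"{level}: {backend_iops(5000, 0.3, level):,.0f} backend IOPS")
# RAID10 -> 6,500   RAID5 -> 9,500   RAID6 -> 12,500
```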
For me, a better solution may look something like below -
- Tier 1 – 400GB SSD in RAID5 (7+1)
- Tier 2 – 450GB 10K SAS in RAID10
- Tier 3 – 1TB 7.2K SATA in RAID6 (14+2)
Of critical importance, the above provides larger performance deltas between the tiers.
Tier 3 is a non-performer and well suited to cold data (data that is rarely accessed). Tier 3 will also be pretty cheap. You would hope it would be the largest tier in your array.
Tier 2 will hugely outperform Tier 3 and is well worth the effort of moving extents up from Tier 3. Using 450GB 10K disks also ensures that access density isn’t a problem – I’m still personally of the opinion that 600GB SAS/FC disks don’t provide enough IOPS per GB to be a high-performance tier… Also, as it is configured as RAID10, it will provide excellent write performance.
Tier 1 is based on SSD, so it is ideal for those read misses, and as it will primarily hold read-miss data, the RAID5 parity penalty shouldn’t be a problem. Writes are not SSD’s sweet spot, which is another reason I think a RAID10 scheme is ideal for Tier 2 above.
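For anyone wanting to sanity-check the access density point, here’s a rough sketch. The per-drive IOPS figures are rule-of-thumb assumptions, not vendor specs, and RAID overheads are ignored.

```python
# Rough access density (IOPS per GB) per drive type, ignoring RAID overheads.
# Assumed rule-of-thumb figures: ~150 IOPS for a 10K SAS drive,
# ~75 IOPS for a 7.2K SATA drive.

drives = {
    "600GB 10K SAS": (600, 150),
    "450GB 10K SAS": (450, 150),
    "1TB 7.2K SATA": (1000, 75),
}

for name, (capacity_gb, iops) in drives.items():
    print(f"{name}: {iops / capacity_gb:.2f} IOPS per GB")

# 450GB 10K SAS -> 0.33 IOPS/GB vs 600GB 10K SAS -> 0.25 IOPS/GB:
# the smaller drive gives you a third more performance per usable GB.
```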
Don’t Forget About Availability
All of the above need to be underpinned by solid MTTDL calculations.
While considering performance when choosing tiers is hugely important, availability must still be considered – but availability must not be the be all and end all.
NOTE: I am in no way saying that availability isn’t important, or less important than performance. But… if performance is poor enough, then technically your data may be as good as unavailable. Also, I see the odd person sacrificing performance to the false god of RAID6. I always advise using RAID6 wisely!
In my above example, Tier 2 will not normally be a large tier. In fact, most tiering designs will resemble a pyramid, with the majority of capacity at the bottom and the smallest amount at the top. I wouldn’t be suggesting RAID10 if you were expecting to have hundreds or thousands of disks in this tier (even across multiple subsystems).
NOTE: If you have 10 subsystems and each has 128 drives in Tier 2, you should consider your environment as having 1,280 drives in Tier 2 and calculate your MTTDL accordingly
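As a rough illustration of what those calcs might look like, here is a minimal sketch using the classic textbook MTTDL approximations (independent failures, exponential distributions – real drives are messier). The MTBF and MTTR figures are placeholder assumptions; plug in your own vendor numbers and rebuild times.

```python
# Classic textbook MTTDL approximations for a single RAID group:
#   Single redundancy (mirror pair or RAID5): MTTDL ~= MTBF^2 / (N*(N-1)*MTTR)
#   Dual redundancy (RAID6):                  MTTDL ~= MTBF^3 / (N*(N-1)*(N-2)*MTTR^2)
# Fleet MTTDL is roughly the per-group MTTDL divided by the number of groups.

def mttdl_single_parity(n_drives, mtbf_hours, mttr_hours):
    return mtbf_hours**2 / (n_drives * (n_drives - 1) * mttr_hours)

HOURS_PER_YEAR = 24 * 365
mtbf = 1_000_000   # assumed drive MTBF in hours (placeholder)
mttr = 24          # assumed replace + rebuild time in hours (placeholder)

# 1,280 drives in Tier 2: mirrored pairs (N=2) vs 7+1 RAID5 groups (N=8)
mirror_fleet = mttdl_single_parity(2, mtbf, mttr) / (1280 // 2)
raid5_fleet = mttdl_single_parity(8, mtbf, mttr) / (1280 // 8)

print(f"Mirrored pairs, whole fleet: {mirror_fleet / HOURS_PER_YEAR:,.0f} years MTTDL")
print(f"RAID5 (7+1), whole fleet:    {raid5_fleet / HOURS_PER_YEAR:,.0f} years MTTDL")
```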
At the end of the day, whatever the RAID configuration, an unlucky sequence of events can result in data unavailability or data loss. Don’t get over-paranoid about it. Base your decisions on solid MTTDL calcs, and if you can fit some RAID10 and RAID5 within your MTTDL calcs, then do it. If we all jump on the RAID6 bandwagon and sacrifice our performance on the altar of RAID6, we will have application vendors and administrators demanding DAS – and we don’t want that backward step!
Is Two Tiers Enough?
When I first started talking about sub-LUN tiering about 18 months ago, with SSD in mind, my natural instinct was to contemplate designs with only SATA and SSD. However, my opinion now is that the performance delta between those two tiers might be too great – I know I’ve just been saying the delta has to be significant enough. Hear me out…
RAID6 seems to be the only option for large SATA drives, especially 2TB+ SATA drives. So you have the slowest drives available with the worst-performing RAID, meaning that performance will be dire. On the other hand, SSD goes like the wind. However, consider that an I/O stream that includes extents in both tiers will see hugely different response times. Not ideal.
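To illustrate that spread, here’s a quick sketch with purely hypothetical latency figures (roughly 0.5ms for an SSD hit versus 15ms for a read against busy SATA RAID6 – assumptions, not measurements).

```python
# Hypothetical illustration of the response-time spread in a two-tier
# (SSD + SATA RAID6) design. Latency figures are assumptions, not measurements.

ssd_ms, sata_ms = 0.5, 15.0

for ssd_fraction in (0.9, 0.5, 0.1):
    avg = ssd_fraction * ssd_ms + (1 - ssd_fraction) * sata_ms
    print(f"{ssd_fraction:.0%} of extents on SSD -> average {avg:.1f} ms "
          f"(individual I/Os range from {ssd_ms} ms to {sata_ms} ms)")

# Even when the average looks acceptable, I/Os within the same stream can
# differ by roughly 30x, which is the consistency problem described above.
```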
Also, SSD is not a perfect fit for all workload types. In fact, it’s not great for random writes.
UPDATE: I had a question on Twitter from Joe Kelly (@virtualtacit) re my comment above that SSD isn't great for random writes. Hopefully the following helps clarify –
SSD (NAND flash) can only be written to in units of blocks (examples being 128K, 256K and 512K). Assume, for this example, that a block has 10 addresses, 0-9. The block already has addresses 1, 3, and 7-9 written to. An incoming write wants to write to addresses 2 and 6. It is not possible to simply update addresses 2 and 6. Instead, the entire block must be read into memory, the block flash-erased, the new contents calculated, and then the entire block re-written. This is similar to the overhead of small-block writes in a RAID5 scenario. For this reason, SSD ain’t the best when it comes to small writes. There are of course techniques to mitigate this; however, over time, and when your SSD tier is highly utilised (as you need it to be at its current cost), they become less and less effective. Maybe the topic of a future post…
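Here’s a toy sketch of the read/erase/rewrite cycle described above, using the same 10-address block. Real controllers work at page/block granularity with wear-levelling and garbage collection, so this is only an illustration of why a 2-address update ends up touching the whole block.

```python
# Toy model of NAND flash erase-before-write: a block cannot be updated in place.

class FlashBlock:
    SIZE = 10

    def __init__(self):
        self.cells = [None] * self.SIZE   # None = erased / empty

    def write(self, updates):
        """Update a few addresses: read whole block, erase it, rewrite everything."""
        buffered = list(self.cells)            # 1. read the entire block into memory
        self.cells = [None] * self.SIZE        # 2. flash-erase the block
        for addr, value in updates.items():    # 3. merge the new data into the buffer
            buffered[addr] = value
        self.cells = buffered                  # 4. re-write the entire block
        return self.SIZE                       # cells physically rewritten

block = FlashBlock()
block.write({1: "a", 3: "b", 7: "c", 8: "d", 9: "e"})   # existing contents
touched = block.write({2: "x", 6: "y"})                  # update just 2 addresses
print(f"2 logical writes caused {touched} cells to be rewritten")
```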
With these two considerations in mind, I don’t think two-tier systems are realistic at the moment.
It’s early days in the sub-LUN tiering world and best practices and opinions will evolve over time. However, designing for sub-LUN tiering will be an art of its own, for the mean time at least.
I’d be interested to hear your thoughts.
You can also talk to me and a bunch of folks smarter than me on Twitter. I can be reached by sending tweets to @nigelpoulton