Dynamic Provisioning: The 42MB page unravelled

By | January 27, 2009

DISCLAIMER:  The opinions expressed in this post are just that, my opinions.  I do not work for a vendor or a partner and therefore do not speak authoritatively.  However, I do have fairly broad experience and knowledge of Hitachi storage.

NOTE:  I've put this together quickly as I'm going to be busy for the next few days and my wife is also about to give birth (a couple of days overdue already).  So apologies if it doesn't read as smoothly as Harry Potter.

Firstly, a paragraph for the benefit of those who don't already know – Dynamic Provisioning, as implemented by Hitachi (HDP on the HDS USP-V, ThP on the HP XP24000), is based around a basic allocation unit of 42MB.  HDS refer to this as a Page.  Essentially, any time a host writes to a previously unallocated area of a LUN, the array allocates a new 42MB page to that LUN.  So, as data is written to an HDP LUN, it grows in units of 42MB.  OK.
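
If it helps to see that behaviour spelled out, here's a quick toy sketch in Python (my own illustration, nothing to do with the actual microcode) of what page-granular allocation means for pool consumption –

    PAGE_MB = 42

    class ThinLun:
        """Toy model of a DP-VOL: space is only backed by real pool pages on first write."""
        def __init__(self):
            self.allocated_pages = set()   # indexes of 42MB pages already backed by the Pool

        def write(self, offset_mb, length_mb):
            first_page = offset_mb // PAGE_MB
            last_page = (offset_mb + length_mb - 1) // PAGE_MB
            for page in range(first_page, last_page + 1):
                # a write to a previously unallocated area pulls in a whole 42MB page
                self.allocated_pages.add(page)

        def consumed_mb(self):
            return len(self.allocated_pages) * PAGE_MB

    lun = ThinLun()
    lun.write(offset_mb=0, length_mb=1)   # a 1MB write to fresh space...
    print(lun.consumed_mb())              # ...still consumes 42MB from the Pool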

Now, unless it's changed, this 42MB is taken from contiguous stripes on a single LDEV in a single Array Group.  This basically means that each of these 42MB pages is allocated from a single Array Group (8 disks).  Subsequent pages allocated to the LUN will probably be allocated from another Array Group in the Pool… so over time your LUN should have its blocks nicely balanced across all spindles in the Pool.  A form of wide striping.

DISCLAIMER:  Actually, I should point out that I don't work for Hitachi, HDS, HP or Sun, so I'm not an authority on this topic or the USP/XP (see my previous post for where I'm coming from).  Despite this, it should be an interesting post.  Also, microcode changes happen all the time and the guys at the factory don't mail me about them, so tweaks to the algorithm will at some point, if they have not already, render some of the above information out of date.

However, for the purpose of this discussion, all we really need to know from the above paragraph is that space is allocated in units of 42MB called pages.  To date this has not changed.

So… the USP-V/XP24000 is pretty flexible and supports the following hardware RAID levels –

  • RAID10 2+2
  • RAID10 4+4
  • RAID5 3+1
  • RAID5 7+1
  • RAID6 6+2

In addition to the above, we must also consider the ill-named but quite impressive Concatenated Array Groups (they are actually stripes).  These allow you to join 2 or 4 RAID5 7+1 Array Groups to get wider back-end striping.

So if we include these then we have an additional two RAID levels to consider –

  • RAID5 14+2
  • RAID5 28+4

As we can see, the USP-V gives us all three of the popular RAID levels (1, 5 and 6) as well as scope for customisation of each, giving a total of 7 possible RAID configurations.

Now then, the track size for OPEN-V volumes on the USP-V is 256K.  However, the USP-V RAID controllers write two rows per spindle per stripe.  In other words, when writing a full stripe the USP-V will write two tracks per physical spindle before moving on to the next spindle in the RAID set (a diagram would help here, I know).  So you could say the effective chunk size per spindle is 512K.  With this in mind we can calculate the stripe size for all of the above RAID configurations using the following calculation –

x = a * 512

Where x is the stripe size in KB and a is the number of data spindles in the RAID set.

Based on the above calculation, the respective stripe sizes of each of the above RAID configurations is as follows –

  • RAID10 2+2 = 1024K
  • RAID10 4+4 = 2048K
  • RAID5 3+1 = 1536K
  • RAID5 7+1 = 3584K
  • RAID6 6+2 = 3072K
  • RAID5 14+2 = 7168K
  • RAID5 28+4 = 14336K

And…..

42MB = 43008K
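
If you fancy checking the arithmetic rather than taking my word for it, a few lines of Python reproduce the list above from the x = a * 512 formula and show that every stripe size fits a whole number of times into 43008K –

    CHUNK_KB = 512          # two 256K tracks per spindle per stripe
    PAGE_KB = 42 * 1024     # 42MB = 43008K

    raid_configs = {
        "RAID10 2+2": 2, "RAID10 4+4": 4,
        "RAID5 3+1": 3, "RAID5 7+1": 7, "RAID6 6+2": 6,
        "RAID5 14+2": 14, "RAID5 28+4": 28,
    }

    for name, data_spindles in raid_configs.items():
        stripe_kb = data_spindles * CHUNK_KB    # x = a * 512
        print(f"{name}: stripe {stripe_kb}K, "
              f"stripes per page {PAGE_KB // stripe_kb}, remainder {PAGE_KB % stripe_kb}")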

Still with me?  Fairly boring I know.  However…….

Now for the slightly more interesting part (I stress the slightly).  You will find that every one of the stripe sizes listed above divides perfectly into 42MB.  If you dig further, you will also find that 42MB is the lowest number that all of the above stripe sizes divide perfectly into without leaving a remainder.  I'll spare you the spreadsheet because it's too wide to fit comfortably on this screen, but feel free to check it out for yourself.
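
For the "lowest number" claim, the least common multiple of those seven stripe sizes does indeed come out at exactly 43008K, i.e. 42MB –

    from functools import reduce
    from math import gcd

    stripe_sizes_kb = [1024, 2048, 1536, 3584, 3072, 7168, 14336]

    lcm = reduce(lambda a, b: a * b // gcd(a, b), stripe_sizes_kb)
    print(lcm, lcm / 1024)   # 43008 42.0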

Then if we factor in other things, such as cache slot and segment size as well as pre-fetch (based on tracks (256K) and multiples thereof), they all also divide perfectly into 42MB.

Also, with the USP-V being track centric/cache slot centric (both 256K), it tends to internally map and manage things, such as external storage, in multiples of this track/slot size.  Again, this divides perfectly.

Interestingly, a 42MB page is also exactly 84 of those 512K chunks, and 84 divides evenly by 2, 3, 4, 6 and 7 – the numbers of data spindles in all of the basic supported RAID configurations (not including the previously mentioned concats).  To be honest, knowing a little about how clever, efficient and thorough the developers in Japan are, I expect that the 42MB page size maps to a lot more internally.  In fact I wouldn't be surprised if the number of screws used to build each internal disk chassis was divisible by 42….. 😉

Further to this internal mapping… let's not forget that HDP borrows a lot from the Hitachi implementation of Copy-On-Write (COW) technology.  Because of this, I will refer to the operation of allocating a new 42MB page to a LUN as a Borrow-On-Write, or BOW, operation.

Each time a BOW operation occurs, there is overhead along the lines of the following –

  1. Search the free page table for the next available free page (if there is additional logic on top of this to spread the load more evenly, the overhead will be greater)
  2. Update the Dynamic Mapping Table (DMT) and the free page table
  3. Map the page into the DP-VOL's allocated page table
  4. Make the blocks available for access

(some of the terms used above are probably my own and not official)

Maybe there's more to the BOW operation, maybe there's less.  But it won't be too far from what is described above.
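
To make the bookkeeping a little more concrete, here is a deliberately simplified sketch of a BOW operation using my own (unofficial) terms from the list above – the real microcode will obviously be far more sophisticated –

    class Pool:
        """Toy model of an HDP Pool – my own illustration, not Hitachi's design."""
        def __init__(self, total_pages):
            self.free_pages = list(range(total_pages))   # the "free page table"
            self.dmt = {}                                # (DP-VOL id, LUN page) -> Pool page

        def bow(self, vol_id, lun_page):
            """Borrow-On-Write: back a LUN page with a Pool page on first write."""
            if (vol_id, lun_page) in self.dmt:
                return self.dmt[(vol_id, lun_page)]      # already mapped – no BOW needed
            pool_page = self.free_pages.pop(0)           # 1. find the next free page
            self.dmt[(vol_id, lun_page)] = pool_page     # 2/3. update DMT and page tables
            return pool_page                             # 4. blocks now available for access

    pool = Pool(total_pages=45832)           # roughly one RAID5 7+1 group of 300GB spindles
    print(pool.bow(vol_id=0, lun_page=0))    # first write to this page triggers a BOW
    print(pool.bow(vol_id=0, lun_page=0))    # later writes just hit the existing mapping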

So… with the above in mind, the smaller the page size, the more often these BOW operations are required when growing a volume, each one incurring a small overhead.  So the less frequently they happen, the better.  Of course, these operations only occur when a new page is demanded.
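
A quick back-of-the-envelope comparison makes the point (the smaller page sizes here are purely hypothetical, just to show the effect) –

    growth_gb = 100
    for page_mb in (42, 8, 1):   # 42MB vs two made-up smaller page sizes
        bows = growth_gb * 1024 // page_mb
        print(f"{page_mb}MB pages: {bows:,} BOW operations to grow a LUN by {growth_gb}GB")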

Also, albeit probably hardly worth mentioning, the DMT (that's an official HDS term and not my own) is another layer that must be traversed in order to map LBAs in a LUN back to blocks within a Pool for normal read and write requests that don't require allocation of a new page.

Anyway, if the page size were smaller, the DMT would of necessity be larger, due to there being more pages per Pool.  As a result it would take longer to search and update.  And when you consider that each Pool can have millions of pages, the DMT could get quite large.  Take the following as an illustration –

A USP-V, in all its glory, can have over 140 internal Array Groups, and each Array Group can be comprised of 8 x 300GB spindles.  Each such Array Group formatted as RAID5 (7+1) yields around 1.8358TB usable (Base 2).  This gives each Array Group 45,832 x 42MB pages.  Multiply this by the possibility of, let's say, 140 Array Groups and you get 6,416,480 x 42MB pages that all need to be represented in the DMT.  And that's not to mention External Storage, which can also be pooled for HDP.  So a smaller Page size would significantly increase the size of the DMT and, as a result, reduce the efficiency of the DMT.
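
The page counts in that illustration fall straight out of the numbers –

    TIB_MB = 1024 * 1024                           # MB per TB (Base 2)
    pages_per_group = int(1.8358 * TIB_MB // 42)   # usable capacity of one RAID5 7+1 group
    print(pages_per_group)                         # -> 45832 pages per Array Group
    print(pages_per_group * 140)                   # -> 6416480 pages across 140 Array Groups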

For those of you still reading, thanks, and I'll leave the theory there for now.  In theory the ingredients are fine, but the proof is in the pudding –

At the end of the day, from experience in the field and from knowing a little of how well aligned it is to the internal structures and workings of the array, I think the 42MB page works very well. 

And even after all is said and done, I trust that the Hitachi guys in Japan know far more than me about how their kit works, and I'm sure they also know far more about their own kit than their competitors do.

Now, on a side note, there are people out there, their names tend to be "Barry", who like to point their fingers and laugh at Hitachi's supposedly large and chunky 42MB page.  One of these Barrys, when questioned on the extent/page size chosen by his own company's Dynamic Provisioning offering, was shamefully quiet.  I say shamefully considering his previously deafening criticism of others' choices, as well as a promise to tell us once his engineers had decided.  If I remember right, the most we ever got from him was some spiel about it being aligned with the internal workings of the Symm, their array.  Makes one wonder if he has something to hide or some backtracking to do?

It appears that the reason Hitachi chose such a large page size (and size is a matter of perspective) may be just how flexible the USP-V is and how many different configuration options there are, all of which need to be considered and mapped to.

It works and it works pretty well, that's all I can say for it.

Nigel

PS.  I'm open to anybody shedding any further light on the topic at hand, so feel free to comment.

14 thoughts on “Dynamic Provisioning: The 42MB page unravelled”

  1. Nickolay

    Hi, thanks for the post. I'm curious about the "RAID10 4+4" configuration. Does the USP-V support 4D+4D groups or do you mean 2*(2D+2D)?

  2. Nigel Poulton

    Yes I am saying that the USP-V (and USP) do true RAID10 (4D+4D) –
    1 = mirror.  So the USP will create four mirrored pairs
    0 = stripe.  It will then stripe over those four mirrored pairs.

    I know that Storage Navigator shows them as 2 x 2D+2D concatenations but it is wrong.  I guarantee.

    Storage Navigator is not the tool being pushed and as such doesn't appear to get a great deal of development.  Just take a look at RAID5 (7+1) concatenations (interleaving/striping).  That is not represented well in Storage Navigator either.

  3. Barry Whyte

    All very clever maths, but when it comes down to it, a 4K write still allocates 42MB of space. The reason they chose 42MB for thin provisioning was because it was the lazy – easy – option. That is, the chunk size on the USP was already 42MB – probably for all the clever divisional reasons you quote above – but simply implementing an algorithm that allocates dynamically what has, up until now, been allocated in a fixed way needs little dev effort or test effort.

    A database (even without a format) that starts creating tables and data entries at random LBAs (in the block world) will still allocate 42MB at a time, no matter what the write size is. It's still chubby provisioning when compared with the SVC or 3Par implementations, no matter how funky the maths.

    As for the unmentioned vendor, I agree, it's all done with smoke and mirrors, and if that doesn't work, hell, let's just roll in another symm, sorry, box…

  4. Nigel Poulton

    Thanks Barry, feels strangely good to be back.

    As for it being lazy – maybe it was lazy or rushed to release; obviously I'm not privy to why decisions were made.  However, the building blocks of HDP are similar to those implemented by Copy-On-Write on the USP.  And although I can't remember the allocation unit size for COW, I'm certain it was significantly smaller than 42MB.  So surely the lazy thing to do would have been to lift the structures from COW and fudge them into HDP?

    And of course, it's still a version 1 product that I honestly think works and performs well.  Maybe things will change in the future.  We all live and, hopefully, we all learn.  No shame in changing the page size in the future if it can be improved.

    Oh, and let's remember, Barry Burke (who interestingly has not commented, must have some real work to do), when backing down on his promise to reveal the EMC allocation unit for their offering, said that all we had to worry about was that it would be aligned to the internal structures of the Symm (or words to that effect).  So he seems to think that mapping to existing internal structures is all-important.  Of course I'm not holding Barry B up as all-knowing, heaven forbid, but I tend to think on this occasion he would agree with me.

    Oh, and surely when you say a DB that writes 4K IOs "will always allocate 42MB a time no matter what…", you mean only when it's used up its previous 42MB page?  What would you recommend as a better size?

    Oh, and I think the Hitachi guys knew that on an enterprise box like the USP-V, scratching and sniffing around for every available space saving would not be a top priority for customers.

  5. Misty

    Nigel – thanks for the post. Very informative. You actually directed me here from an HP forum. If you wouldn't mind dropping me a line at the email above? You seem to be more helpful than our HP reps, so I wouldn't mind asking you a couple of questions!

  6. Pingback: blogs.rupturedmonkey.com » Blog Archive » HDP – Response to Marc Farley

  7. Pingback: Hitachi Virtual Storage Platform – VSP – Technical Deep Dive

  8. CAMUS

    You are saying "So a smaller Page size would significantly increase the size and efficiency of the DMT". But I thought a smaller page size would reduce the performance of the DMT. Can you please clarify…

    And the other thing is, in the HDS user guide it says the following – do you know how the calculation is made?

    for each pool-VOL.
    The capacity of the pool (MB) = Total number of pages × 42 (4116 + 84 × Number of pool-VOLs)

  9. Nigel Poulton

    CAMUS,

    You are correct, that statement is a mistake.  It should read "So a smaller Page size would significantly increase the size of the DMT and as a result, reduce the efficiency of the DMT".  I have updated the text in the article, thanks for pointing it out.

    I'm not aware of the calc you mention – which manual did you get it from?

    Nigel

  10. James Wilson, HP Product Marketing

    Nigel, I think you have the whole story correct with respect as to why the 42MB page size was chosen. With respect to 42MB being allocated when a 4 KB write is made, this is actually an advantage from a performance point of view. True, the XP24000/P9500 box does have to do the 42 MB allocate for the first write, but then for all the rest of the writes up until that 42 MB is consumed, there is no additional allocate required. Thus the performance is optimized relative to allocating some extra space at each boundary. In an array that is measured in terms of 100's of TBs, temporarily allocating an extra 41+ MB to get the write going is a small matter. Our customers are finding that the overall performance of the Thin Provisioning implementations is pretty fast and very solid. As you note, the 42 MB page size works with all the RAID levels equally well, which was a design goal, not an accident or a rush to market, either. Conveniently, it also is an excellent foundation for Smart Tiers, the Auto tiering feature, that is now built on top of the Thin Provisioning tool. 42 MB is a large enough and mathematically convenient page size to move data efficiently as you note, but small enough that it is still pretty fast to use when migrating pages between levels of storage. I am of course biased, but I think in this case Smart is a good thing, relative to Fast or Easy.

  11. Pingback: ZFS Stammtisch - Seite 185
