When bigger isn’t better

By Mackem | November 2, 2006

I've recently been reviewing the design of an enterprise backup environment that is using LTO-3 as its tape technology.  When looking at the strategy for the internal database backup, approx 2GB, I noticed that they were backing it up daily to a dedicated non-appendable pool of tapes and having the tapes shipped offsite each day.  All of which is fairly standard practice.  The thing that bothered me was how long they were keeping these backups for – 1 month!  I'm still struggling to think of a scenario where you might want to recover your backup environment to a month ago!?!?  But the point I'd like to make is how much space this is wasting on a 400/800GB tape (in a 31 day month this will store roughly 62GB on 12,400GB worth of tape, and that's without compression).  Smaller capacity tapes would be ideal for this situation, but of course, as is the way with everything in the storage industry – everything is getting bigger!
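To put rough numbers on it, here's a minimal sketch using the figures above – a ~2GB daily backup, a 31-day month and LTO-3's 400GB native capacity. The sizes are approximations; the point is the ratio:

```python
# Rough utilisation for the scenario above: a ~2GB daily database backup
# written to its own non-appendable LTO-3 tape, kept for a 31-day month.
daily_backup_gb = 2        # approximate size of the internal database backup
retention_days = 31        # one month of daily tapes, one tape per day
tape_native_gb = 400       # LTO-3 native capacity (800GB assumes 2:1 compression)

data_stored_gb = daily_backup_gb * retention_days     # ~62GB of actual data
tape_capacity_gb = tape_native_gb * retention_days    # 12,400GB of native tape

print(f"{data_stored_gb}GB stored on {tape_capacity_gb}GB of tape "
      f"= {data_stored_gb / tape_capacity_gb:.1%} utilisation")
# -> 62GB stored on 12400GB of tape = 0.5% utilisation
```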

There is a similar, seemingly unstoppable, trend with disks too.  Now, to the uninitiated, bigger disks might seem all good.  However, to the more initiated, several issues become apparent, and I'd like to address one of these here.

I recently did some performance measuring on a storage array for a company that was being forced into using larger disk drives.  Although, on the spec sheet, the larger disks performed near enough the same as the existing smaller ones, the problem would arise from the potential to create more LUNs on these larger disks.  For example (WARNING: oversimplified example) –

Imagine this company was currently running with 50GB disks and a standard LUN size of 10GB.  The existing disks are therefore divided into 5 x 10GB LUNs.  At the moment the disk performance is fine.  However, despite the fact that the larger 100GB disks perform very similarly, each one will be divided not into 5, but into 10 x 10GB LUNs and therefore potentially receive twice as much “work”.  The matter is further complicated by the fact that as more and more LUNs are carved on a single disk it becomes more and more difficult to predict the type of workload that the underlying disks will be subjected to. 
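A back-of-an-envelope sketch of why that hurts – all the IOPS numbers below are made-up assumptions purely for illustration (real figures depend on the drives and the workload), but the ratio is what matters:

```python
# Hypothetical illustration: same per-LUN workload, bigger disks.
disk_iops_capability = 150   # assumed random IOPS a single spindle can sustain
iops_per_lun = 25            # assumed average workload per 10GB LUN
lun_size_gb = 10

for disk_gb in (50, 100):
    luns_per_disk = disk_gb // lun_size_gb
    offered_iops = luns_per_disk * iops_per_lun
    print(f"{disk_gb}GB disk: {luns_per_disk} LUNs, ~{offered_iops} IOPS offered "
          f"vs ~{disk_iops_capability} IOPS sustainable")
# 50GB disk: 5 LUNs, ~125 IOPS offered vs ~150 IOPS sustainable
# 100GB disk: 10 LUNs, ~250 IOPS offered vs ~150 IOPS sustainable
```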

And then consider this scenario… what do you do when your Oracle DBAs come to you asking for storage for the new database that's going in?  Of course they want their own dedicated spindles!  And as usual they know exactly what they want – 18GB 15K 😉 So for a start you have a hard enough time convincing the DBAs to go with larger disks, because the new array doesn't support 18GB disks; the smallest it will take is 73GB.  Then you find out they only want 60GB for their database files and 10GB for their log files – not both on the same set of disks, as performance is crucial and they can't afford to mix different workloads on the same spindles.  The next problem is that your array only lets you install disks in groups of 8, so… when it's all translated, what the DBAs are asking for is two groups of 8 disks (the arithmetic is sketched just after the list) –

  • DISK_GROUP1 for database = 8 x 73GB in RAID 10 = 292GB useable
  • DISK_GROUP2 for log files = 8 x 73GB in RAID 5 = 511GB useable 
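For anyone checking the arithmetic, here's where those useable figures come from (a quick sketch that ignores formatting overheads and hot spares):

```python
# Useable capacity of the two hypothetical 8 x 73GB disk groups above.
disks_per_group = 8
disk_gb = 73

raid10_useable = (disks_per_group // 2) * disk_gb   # mirrored pairs: half the raw capacity
raid5_useable = (disks_per_group - 1) * disk_gb     # one disk's worth lost to parity

print(f"DISK_GROUP1 (RAID 10): {raid10_useable}GB useable")   # 292GB
print(f"DISK_GROUP2 (RAID 5):  {raid5_useable}GB useable")    # 511GB
```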

All a bit overkill, don't you think?  So after convincing your DBAs that the bigger disks won't bring down the performance of their database, you now have to convince management to buy in the region of 800GB of useable space when you only need 70GB of it – good luck!

As a trade-off, maybe you could put some other sequential workloads on DISK_GROUP2, but the question then becomes "how many sequential workloads can a disk group take before the combined workload becomes more random than your random workloads?" 😆

This also reminds me of speaking with an EVA guy a while ago (I like the EVA) who was evangelising about how great the EVA was because the backend never needed to be a bottleneck anymore.  If you had a bottleneck on the backend you simply added more disks, and more disks, and more disks until the problem went away.  The problem with this approach, though, is that it's a two-edged sword – every time you add disks you add capacity, and the first law of storage dynamics states that "wherever free capacity exists it will be used" (BTW I just made that law up, so don't go quoting it in any meetings).

My theory behind it is this –  

more/bigger disks = more space = more LUNs = more hosts = more applications = more workloads = more contention = more random = DIRE PERFORMANCE 

To sum it up, disk manufacturers seem obsessed with larger capacities – a bit of "our disks are bigger than your disks".  To a point I understand why, given the current rates of data growth.  However, as we've seen, larger capacities bring their own problems, especially with the disk drive already being by far the slowest part of a computer system.

Mackem

22 thoughts on “When bigger isn’t better”

  1. JM

    Well put, if only the people who write the checks understood this concept. While I agree with what you’ve said here, I also think it’s worth mentioning that a lot of DBAs have no understanding of modern storage arrays and the effects of smart caching. They often request things which are absurd based on the actual workload they’ll be putting on the LUNs they’re given. Many of them have had the mantra of separate spindles for everything drilled into their heads since birth — everything is JBOD in their world. Having separate spindles for every application is great, but who can afford it? In my experience very few applications require what is taught in DBA school. In a lot of situations disks are underutilized and bigger is in fact better. If your disks are asleep running a boring database and nothing else, give them something to do — even if it’s adding three or four more boring databases on top.

  2. SanGod

    Easy – I work for a financial institution. One of our requirements, from a regulatory standpoint, is that we need to be able to restore our production financial databases to a specific date and time. To top that off, since we’re a lender, we have to keep all data pertaining to a loan for the life of a loan plus 7 years. So on a 20 year loan, we’re required to keep every scrap of data for 27 years.

    Fun stuff.

    We mostly accomplish that through our document imaging application, which allows us to keep files for 'x' days after they're deleted, guaranteeing that they're captured in at least one full backup. (We also don't give *ANYONE* delete access to the document imaging system.)

    Databases are backed up weekly, but only the first weekly backup of every month is tagged for infinite retention. We also capture all transaction logs between backups, so if it became necessary to restore to a point in time, all we need is the prior full backup and every transaction log tape in between.

    It's scary, with Sarbanes-Oxley and HIPAA requirements, how much backup data companies can generate. We're in our sixth month of business and already our full backups run to about 12 LTO-3 tapes + 6 SDLT1 tapes.

    Then again, we also have a 17TB dump area dedicated to Veritas for backup-to-disk, which enables us to get our snapshots copied within the 6-hour window that upper management has prescribed.

    (we just put it to tape during business hours when most backup hardware sits idle.)

  3. SanGod

    Oh, and when I was at Disney, the requirement was daily, full backups to tape for *ALL* financial data. This equated to about 400TB of tape per day.

    Of course there is no such thing as an incremental database backup using any kind of split-mirror technology (such as TimeFinder).

  4. richard

    Mackem,
    You say:

    “the new array doesn’t support 18GB disks” and “your array only lets you install disks in groups of 8” .

    This is interesting… Typical firmware should allow any practical number of disks and any disk capacity in a RAID group… Is this a 'marketing' feature on that particular array? Is this common?

  5. Mackem

    Thanks for the feedback gents

    JM – I agree that often bigger is better, just not always – and we don't seem to have much of a choice these days. It looks like my experience of DBAs is similar to yours; however, I must say that sometimes as storage people we don't help ourselves. I've seen some terribly designed "SANs" with databases and log files sharing the same 4 disks and giving dreadful performance – so we can't always blame the DBAs, sometimes we have scarred them 😉

    Aloha SanGod, with a reply like that I can't exactly dispute your alias 😉 I must confess to being more than a little slapdash in my opening comments re the tape backup environment. I just really threw it in there to explain why I started thinking about wasted capacity… The last tape backup environment I "managed" (also subject to SOX compliance) kept internal database backups for a long time, but only the weekly and monthly jobs, not the daily ones. Thanks for the detailed comments though – I may refer back to them next time I have to design/implement a tape backup environment! PS. I hope that you plan on running periodic tape copy jobs on that data you are planning on keeping for 27 years, otherwise I wouldn't like to be the guy trying to recover a file from a 20-odd-year-old LTO tape. Mind you, I doubt that you will still be around at the same company in 27 years to care – although of course you will be doing all you can to make the job easier for the poor guy who is around then needing to restore a 27-year-old file!

    Richard, re my comments about the subsystem that doesn't take 18GB disks and requires disks to be installed in groups of 8… although I did mention the HP EVA product later in the post, I was not referring to the EVA when I made those comments. So "if" you were referring to the EVA with your response, then you were spot on – you can mix and match spindle sizes and speeds in the same disk group, and if you felt like it you could also install just a single spindle into an existing disk group. However, I was actually thinking about the HP XP (Hitachi) range, where you MUST install disks in groups of 4 or 8 and all disks in the group MUST be the same size and speed. I will also say that for enterprise boxes it is quite common to install disks in groups of 4 etc. Your comments have sparked my interest though, and I will be talking to HDS and HP to see if you can install smaller disks than they advertise in the owner's guide. E.g. the owner's guide for the XP12K lists the smallest supported disk as 73GB 15K. I'm now kicking myself for taking the owner's guide at face value without doing my own digging. I will reply here if I find out it's wrong.

  6. richard

    Mackem,
    Considering the cost of small disks, it probably makes very little difference if the disks are 18 or 73GB… you get the extra capacity for free. In any case, very small disks are becoming obsolete.

    A lot has been said lately regarding the need for small fast disks. There is probably a good case for the re-emergence of small (16Gb) RAM-based SSDs for such IO/sec related applications. These should be plug & form-factor compatible with FC disks, which is not difficult to do.

    However, it will take a major player to make the first move in order to 'legitimize' this old concept. Do you see this happening?

  7. SanGod

    We bring the tapes back in 5 years to duplicate them, either to another LTO3 tape or whatever our current technology is. Means that in about five years we’ll start doing a monthly duplication of tapes in addition to our daily/weekly duplications.

    Not going to be fun – we're having enough difficulty keeping our duplications happening on time. (Thanks to our last fiasco – http://www.sangod.com/?p=34 – we're running about 3 days behind in duplications.)

    One of the big benefits to doing Disk –> Disk –> Tape backups is that when we're having tape difficulties, disk backups keep running; it's just a matter of catching up on the tape duplications.

    Costs a bit though, we have 17TB of Raid-5 storage dedicated to backups, and that allows us to keep about 2 weeks online.

  8. mackem

    Richard,

    I'm not sure if I see this happening??? Unfortunately a lot of good ideas either don't happen or fizzle out, and often this is because they don't get the backing from the major players that you are talking about.

    One thing I always wanted to see, and what I indirectly referred to in the post, is disks with multiple actuators. This has been done in the past but for one reason or another it never took off. Check out this article on storagereview.com – http://www.storagereview.com/guide2000/ref/hdd/op/actMultiple.html

    It seems to me that although the mechanical disk drive is the slowest component in a computer by a country mile, it's obviously not that much of an issue for companies, otherwise some of these ideas that we're talking about would be taken up by the disk manufacturers and storage vendors. Yes, there is a need, but not enough (yet) to warrant the investment required… that's my penny's worth at least.

    If, like you say, SSDs for I/O-intensive apps were to become popular, then I would have to counter that they should be designed and manufactured to be compatible with SAS rather than FC. I see SAS as the way ahead for the time being.

  9. SanGod

    EMC used to market something called "Perma-Cache". While I know it's still available as an option on the Symmetrix DMX-1 line (don't know about the DMX-3), I haven't seen it pushed in forever.

    Basically it allows you to copy a hypervolume (partition) into cache permanently, where it's kept and updated. The disk is updated as the Symmetrix sees idle cycles, but all reads and writes go directly to cache.

    Back when the 3000 series was at its peak (the Symm4), I saw these volumes used mostly for database indexes. The search times were phenomenal since they never waited for physical disks to spin. (Usually you'd allocate 2 or 4 gig hypers to PC, to conserve on cache utilization.)

    SSD with a twist maybe?

  10. mackem

    SanGod – I've seen the Hitachi equivalent, Flash Access, where you can pin a LUN into cache and get lightning responses (no pun intended). As with the setups you've seen, it was also very small LUNs due to the cost of cache in these boxes – probably the limiting factor as to why it's not used so often, along with the fact that there is a limit to the amount of cache that can go into these boxes.

    However, with cache hits being in the order of thousands of times faster than disk, what about some kind of a middle ground? Some solid state media that's only in the order of a few hundred times faster than disk? Heck, I'd take something that was "merely" a hundred times faster than disk. Writes could still come through expensive first-line cache and then be destaged to the slower SSD.

    Not really had time to think about this properly – just thinking out loud 😉

  11. SanGod

    You know, when I was at MTI we started looking at options to speed up the Gladiator series RAID controllers. Of course they looked at everything BUT the controllers themselves. One of the things we got was a prototype SSD drive. It was a 20GB 2.5″ laptop drive in a 3.5″ case, surrounded by RAM, with a small internal battery backup to allow the cache to de-stage to disk in the event of a loss of power.

    It worked well, but we found universally that the boot times were longer because that particular model (and forgive me, I can't remember who gave it to us to eval – that was a lot of years ago) didn't allow any direct disk access. (The idea being that they should have set it up so that until the cache was populated, the I/Os went directly to disk.)

    In the end MTI had to realize that it was their RAID controller that was causing the whopping 13MB/sec transfer rates.

    Lately they've given up and sold out to EMC. I can't imagine the problems the first sales force had going out and trying to tell all these customers, who they'd spent years bad-mouthing EMC to, that "well, it's all OK now."

  12. richard

    Mackem & SanGod,

    This is a good discussion…

    If some of the EMC & HDS customers have a need for small cache-based LUNs (via expensive centralized cache architecture), then there is a reasonable case emerging for plug-compatible, inexpensive SSDs, emulating regular disks.

    If EMC or HDS are not in a hurry, some of their more 'dynamic' competitors may do it, although it's difficult to see who that may be.

    Mackem, you ask for some kind of 'middle ground'…

    There is no reason why an SSD-based raidgroup could not be used as a secondary cache, much like the L2/L3 cache in a processor, a concept which requires some firmware work, but is well understood.

    SanGod,… It is probably not worth fooling around with internal disks or built-in batteries for backup. This costs space, money and reliability.

    My understanding is that these large systems are UPS & battery backed.
    The SSD backup could be to a striped partition on a regular disk backend, perhaps assisted by background destaging, backed by central battery power… much like the existing centralized caches are now.

  13. SanGod

    No, the internal drives we were evaling weren't big enough to be taken seriously.

    Then again, neither was MTI. 😉 However, these drives would have been great for hobbyists at home, but not something I would put a production application on. For the most part, I'll stick to my EMC arrays. They have enough cache in them for any five applications.

    Especially since I'm currently working in a mostly windoze shop, there is no chance any dozen Microsquish servers could ever push an EMC array to its limit.

  14. mackem

    An ex-colleague of mine, Joe, was recently doing some performance testing at an HP test facility in Holland (if memory serves me correctly). They were pushing an XP12K and an Oracle database to see if they could get a certain number of transactions per second. They were seeing good results on the XP until the cache write pending rate hit around the 40% mark.

    On Hitachi boxes the destaging process starts picking up the pace at around the 40% mark, basically shovelling data off to disk at a faster rate. Once this 40% mark was breached, the performance of the XP took an absolute nosedive. Now although this will no doubt have been partly to do with the caching and destaging algorithms, the backend disks were certainly starting to become a bottleneck. They were running RAID 10 on 15K disks with the spindles doing no other work. Maybe he could have benefited from SSD on the backend instead of spinning disks?

    With disk being so MANY times slower than cache, they often do become a performance limiting factor in a busy environment.

  15. richard

    Mackem,
    As you know, this situation would have been a lot worse if your friend had been running RAID 5… there are no magical 'firmware-level' solutions and everyone has the same problems, including EMC & HDS.

    In situations like this, it would be handy to be able to replace some of the magnetic disks with plug-compatible SSDs, build a new raidgroup, relocate hot-spots, etc …and try again.

    From my experience, it is very difficult to go back to a ‘slow’ solution, providing the benefit can be easily demonstrated and the cost is right. Historically, this was the case with some of the old I/O caching solutions, going back to PDP11 days.

  16. tim

    Let's try this again and hope your blog lets me post it. I can tell you that the Texas RamSan that was posted above this is what at least one of the major online vendors uses. I'm not at liberty to say who (as I can't find a press release), but believe me, it's definitely "mature" if these guys are using it. It's incredibly fast, and it's what they use for all of their database infrastructure. TMS FTW!

  17. Nigel (mackem)

    Thanks for the response Tim. If you could find the press release that would be great, as I'd be very keen to know who it is, and I know a lot of other people would too.
    I'm staggered that the other storage vendors are following Texas' lead with the SSD.

  18. tim

    I will ask our guy who works with that account all the time if I can spill the beans. I have a feeling the answer is no; they're generally fairly tight-lipped about their setup. Needless to say, they're about as big as you can get online outside of Google or MSFT, which is probably more than I should have said already :>

  19. tim

    Well, I found out why there was never a press release. I guess I was in fact wrong – they never pushed the RamSans beyond the eval stage. Last I had heard they were on their way into production; I guess they ran into something they didn't like, or something better came along (it would be nice to know what, if that were the case).

  20. Nigel (mackem)

    Thanks for the update Tim. I'd be curious to know what made them change their mind, but that information is unlikely to be forthcoming – shame. I'd also be curious whether anyone else has experience with the RamSan in a large, demanding environment. I'll be sure to post about it if I find out.

