Quick Question on VSAN vs XtremIO Architectures

By Nigel Poulton | February 5, 2015

This is just a really quick one…

I noticed that upgrading an existing VSAN to the newly announced VSAN 6.0 is a non-disruptive rolling upgrade.

Fair enough, that’s how it should be, right?

Well… part of the upgrade requires what sounds like a fairly significant change to the underlying on-disk layout. Basically, the upgrade to VSAN 6.0 moves the on-disk format to a new layout based on VirstoFS technology. Sounds major to me!

So with that in mind… I think an online rolling upgrade is fairly impressive. Well done to the VSAN team.

So If VSAN Can Do It, Why Can’t Others?

The question then begs… if VSAN can upgrade the underlying disk layout on the fly as part of an online upgrade, why can’t XtremIO (or any other modern storage technology, for that matter)?!

Is the VSAN architecture superior?

Now I get that there are different use cases for storage technologies, and that there’s no one-size-fits-all architecture. But on the topic of non-disruptive upgrades… is VSAN a superior architecture?

Am I Missing Something?

Maybe I’m missing something. Maybe the on-disk changes that VSAN is making are different and less fundamental than those recently made by XtremIO?

Just a curious guy asking a genuine question.

Comments welcome (please state your vendor affiliation).

10 thoughts on “Quick Question on VSAN vs XtremIO Architectures”

  1. Chris Evans

    Nigel, let’s look at some of the issues involved. If I understand it correctly, Virtual SAN uses 1MB blocks, which are read in their entirety, or, for writes, 4KB blocks which are coalesced in cache for flushing to disk (not sure how often). These blocks are simply mirrored between disks/servers. So, subject to having sufficient space, changing (say) the block size is fairly trivial. You just bung two 1MB blocks that are contiguous on disk together by building a new 2MB replica, then invalidate the old data and repeat. Voila!
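
    A minimal sketch of that merge idea, using a toy in-memory “disk”; nothing here is actual VSAN code, just an illustration of new-copy-first, delete-second:

    ```python
    # Toy in-memory "disk": address -> bytes. Purely illustrative.
    BLOCK = 1 << 20  # 1MB

    def merge_blocks(disk: dict, addr_a: int, addr_b: int, new_addr: int) -> None:
        """Build a new 2MB replica from two contiguous 1MB blocks,
        then invalidate the originals."""
        disk[new_addr] = disk[addr_a] + disk[addr_b]  # write the new replica first
        del disk[addr_a]                              # only now retire the
        del disk[addr_b]                              # old 1MB copies

    disk = {0: b"a" * BLOCK, 1: b"b" * BLOCK}
    merge_blocks(disk, 0, 1, 2)
    assert disk == {2: b"a" * BLOCK + b"b" * BLOCK}
    ```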

    Remember also that Virtual SAN has no compression, no dedupe and no RAID-5/6, so there’s no complex data rebuild or rehydration to do; there’s also no impact from many metadata references to the same block of data, as there would be with de-duplicated storage.

    Now, flip to XtremIO. The architecture of this platform uses the hash calculations done for dedupe to determine which node a 4KB block of data should sit on. If that data changes in any way (say, by combining two 4KB blocks), then the hash changes and the data may need to be relocated to fit the data dispersal algorithm. This is a big issue, because any lookup algorithm may not know which data has been converted and which has not; it relies on an algorithm rather than a lookup table. Thus the task of rebuilding XtremIO data is intensive, (a) because hashes have to be recalculated and (b) because data has to be redispersed.
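
    To make that concrete, here’s a toy hash-modulo placement scheme, standing in for XtremIO’s real (and non-public) dispersal algorithm: changing a block’s content changes its hash, so it may belong on a different node.

    ```python
    import hashlib

    def owner_node(block: bytes, num_nodes: int) -> int:
        """Content decides placement: hash the block, map it to a node."""
        h = int.from_bytes(hashlib.sha256(block).digest()[:8], "big")
        return h % num_nodes

    a = b"A" * 4096              # an existing 4KB block
    merged = a + b"B" * 4096     # the same data combined into one 8KB block
    print(owner_node(a, 4))      # node holding the original block
    print(owner_node(merged, 4)) # new content, new hash, possibly a new node
    ```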

    So, I think it’s a case of Virtual SAN having a lot less overhead, versus XtremIO’s use of algorithms for data placement/location. As far as I am aware, SolidFire refreshed their on-disk structures with the Carbon release; however, they may be using all metadata/lookup tables, so they don’t have the same issue.

  2. Nigel Poulton Post author

    @chris

    Great insight Chris… much appreciated!

    Begs the question… are simpler architectures the future? A lot of scale-out storage architectures opt for the simple mirroring approach rather than more complex RAID etc. Are we in an era of loosely coupled architectures that are extremely simple under the hood?

  3. Chris Evans

    Nigel, it’s that shared-nothing approach. SolidFire talk about it all the time: how it’s hard to do, but pays off in long-term benefits. The XtremIO design has massive dependencies between nodes. If a node goes down, you lose access to data, which is why nodes are dual-controller with battery backup, like mini-VNXs. It’s scale-out but not shared nothing, so ultimately flawed.

    Chris

  4. Stefano Panigada

    [EMC employee based on IP address]
    Chris is right in describing the high-level architecture of XtremIO, which has been designed to take advantage of “how flash drives work,” which is totally different from traditional drives.
    In any case, non-disruptive addition of X-Bricks is due very soon with release 4.0 in Q2…

  5. Aaron Delp

    Disclosure: SolidFire Here.

    Hey Nigel! The best explanation I have seen so far was in a blog post from our CEO, Dave Wright, regarding shared nothing. Not trying to plug SF; I just wanted to throw out a good comment on architecture that might help explain a bit more.

    From the blog: In a shared disk architecture, there is only a single “source of truth” for a piece of data or metadata. Changes that dramatically affect the format or layout of the data or metadata are inherently risky, since there is no “backup.” By comparison, shared nothing architectures have redundant copies of data distributed across multiple nodes in a cluster. This allows you to modify the format of data in one location (or completely migrate it off) while preserving known good copies elsewhere. Chad even notes that non-disruptive upgrades are common in Object storage systems, but completely misses the reason for it. It’s not because of the storage protocol (object vs. block), but because Object storage systems are almost universally shared nothing architectures.

    Source: http://www.solidfire.com/blog/the-advantages-of-a-shared-nothing-architecture-for-truly-non-disruptive-upgrades/
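
    A toy sketch of that idea: rewrite one redundant copy at a time, while the untouched copies remain the known-good source of truth (all names hypothetical):

    ```python
    def migrate_replicas(replicas: list, convert) -> list:
        """Convert each copy in turn; the copies not being touched keep
        serving reads and act as the 'backup' if a conversion fails."""
        assert len(replicas) > 1, "never migrate the last remaining copy"
        for i in range(len(replicas)):
            replicas[i] = convert(replicas[i])  # others still hold good data
        return replicas

    # e.g. moving every copy from a v1 layout to a v2 layout, one at a time
    copies = [{"fmt": 1, "data": b"x"} for _ in range(2)]
    migrate_replicas(copies, lambda r: {**r, "fmt": 2})
    ```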

    Cheers!

    -Aaron

  6. Chris Evans

    Stefano, I was careful not to make too many assumptions about the XtremIO design around the data placement algorithms, because I would expect EMC doesn’t rely entirely on lookup algorithms to locate data when reading. Each node keeps track of the placement of data within itself through metadata tables, for example, and I’d expect there to be a “global catalogue” of some sort.

    No doubt the subject of non-disruptive expansion has been in the works for some time. It will be interesting to see how it is implemented. For example, as a 4-node system is upgraded to 6 nodes, the placement algorithm for new data will change. Assuming a system has some free space, there’s probably no need to rebalance data for the most part, as new writes will cause an automatic dispersal. The tricky part seems to be how existing data will be accessed (my point above). If an algorithm is used to locate existing data then there’s a problem: how does the system know whether the data being requested was written before or after the expansion (when those scenarios would result in a different location for the same hash calculations)? If there’s a global metadata table (which I expect there is) then it’s a no-brainer, as the data is easy to find.
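
    A quick illustration of that ambiguity, again using a toy hash-modulo placement rather than XtremIO’s real algorithm:

    ```python
    import hashlib

    def placement(key: bytes, num_nodes: int) -> int:
        h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        return h % num_nodes

    key = b"some-block"
    print(placement(key, 4))  # node the data was written to pre-expansion
    print(placement(key, 6))  # node the algorithm checks post-expansion: may differ
    # With a global metadata table the lookup is just table[key] -> location,
    # so the node count is irrelevant when finding existing data.
    ```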

    This begs the question: why did EMC wait until now to release NDU? Is it simply a case of ensuring they have a pipeline of features?

    Chris

  7. Chuck Hollis

    VMware employee, intimately familiar with VSAN these days 🙂

    The rolling migration with VSAN is a disk group at a time, so it can potentially be a fairly hefty amount of data (e.g. multiple disks’ worth) in each go. Sufficient free space must be available, of course. We are in the midst of creating fairly detailed instructions (and a checklist!) for folks to do this when the time comes.
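
    A rough sketch of what such a rolling, per-disk-group loop might look like; the classes and method names here are hypothetical stand-ins, not the actual VSAN implementation:

    ```python
    class DiskGroup:
        def __init__(self, name): self.name, self.version = name, 1
        def reformat(self, version): self.version = version  # new on-disk layout

    class Cluster:
        def __init__(self, groups): self.disk_groups = groups
        def free_space_for(self, g): return True  # stub: capacity check
        def evacuate(self, g): pass               # stub: rebuild g's data elsewhere
        def rebalance(self): pass                 # stub: let data flow back

    def rolling_format_upgrade(cluster, new_version=2):
        """Migrate one disk group at a time so the cluster stays online."""
        for group in cluster.disk_groups:
            if not cluster.free_space_for(group):
                raise RuntimeError("not enough free space to evacuate disk group")
            cluster.evacuate(group)             # move its data elsewhere first
            group.reformat(new_version)         # the heavy on-disk layout change
            cluster.rebalance()                 # data can flow back afterwards

    rolling_format_upgrade(Cluster([DiskGroup("dg1"), DiskGroup("dg2")]))
    ```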

    It also should be pointed out that the format conversion with VSAN 6.0 can be deferred. You can run the new software bits against the older data format if you choose. You won’t get access to all the new features, but it’s fully supported, as are partial migrations.

    Either way, the migration is non-disruptive, and applications continue to run during the entire process.

    — Chuck

  8. Terafirma

    I don’t know if I agree about the complexity and risk of having to read data in and write it out again. Post-process dedupe has been around for a long time, and Pure use it to get additional savings after fixed-block dedupe. Then again, XtremIO doesn’t look to be running an on-disk format built for flash (from my research); just look at garbage collection.

    Then again, maybe it was entirely down to the $1,000,000 marketing campaign claiming they do no post-process dedupe, and running an NDU would have broken that claim.

  9. John Nicholson

    Chris, with VAIO (or whatever name marketing will give it; please, nothing as cringe-worthy as vRealize) I would expect VMware to add data reduction and other inline services to VSAN without mucking with their non-disruptive update routine (which does involve data movement, or depends on existing mirrors).

    vSphere 6 is a complete replacement of the underlying file system/structure (so actually more disruptive than XtremIO’s block change, really).

    Now I’m curious if Stefano can explain what “designed for how flash drives work” actually means. VSAN is designed for how flash drives work (in hybrid configs, via the caching system; in all-flash, the log structure means gentle treatment of cheap, high-capacity TLC SATA flash).

    VSAN will be able to support the DIMM-based flash devices that vSphere supports (0.05ms of latency!).

    If XtremIO was “designed for flash,” I would expect it to support something more exotic than SAS flash drives (NVMe, PCI-Express, DIMM). Scaling SAS drives takes a lot of CPU overhead (and adds latency over time). There’s a reason other vendors use custom form factors (HDS/Violin) or accept the network trade-off of latency for loosely coupled scale-out.

    I’d also expect it to support the low-endurance “cheap” SATA flash that VSAN (and Pure, and other vendors) support.
    From what I can tell, it’s not really designed for the highest performance (no exotics) or for cost scaling (cheap MLC/TLC SATA).

    Normally I’m opposed to lawn-chair-astronaut architects arguing about low-level design choices, but in the case of XtremIO there were some very curious ones chosen, and unless this is a prototype controller system for a radically different array (based on PCI-Express-networked flash), I’m not seeing the superiority of this system vs VSAN.
