Today Pure Storage announced a pretty cool new all-flash storage array, the Pure Storage FlashArray. An interesting startup playing in an interesting and massively disruptive area of technology. This post gets under the hood of the Pure Storage FlashArray and explains exactly why solid state storage technologies are literally changing the rules when it comes to storage array design.
Warning: This is a ~2,000 word deep dive article and not for technical lightweights. We get under the hood of the Pure Storage FlashArray, including technologies like inline deduplication and how flash storage is literally changing the rules of enterprise storage array design.
BTW: I recorded a Deep Dive podcast with Matt Kixmoeller, Rick Vanover and Stephen Foskett where we discuss the features of the Pure Storage FlashArray and get down in the weeds of some of the cool stuff that flash technology enables in storage arrays. You can listen, or download the MP3 from the link below –
DON'T MISS AN EPISODE – SUBSCRIBE FOR FREE!
…or click to download this podcast
FlashArray Architecture Primer
Still here…..good. Let’s start with just a few quick basics to set the scene, and then we’ll get knee deep in some pretty interesting stuff….
The Pure Storage FlashArray is a commodity-based, dual active/active, stateless controller architecture. There is nothing custom about the hardware; it’s all off-the-shelf stuff, with the secret sauce being in the software, which Pure are calling the Purity Operating Environment.
The two controllers are connected via dual redundant 40Gbps QDR Infiniband. Cutting straight to the chase, I have to say I am not a fan of dual controller architectures. However, I will say a few things about this –
- As we will find out throughout this post, solid state storage changes just about all of the rules. So I will hold out on giving them a hard time over this (for the time being).
- The low latency high speed interconnect (QDR Infiniband) lends itself to a more scale-out future. So I expect scale-out will come some time in the future.
- I should add that I’m a fan of stateless controller architectures. They make life so much easier when it comes to upgrading and replacing failed components…
The controllers contain a small amount of DRAM which is used primarily for caching the hottest metadata as well as being a working area for write data as it passes through Pure’s data reduction technologies.
All storage within the FlashArray forms a single large pool of storage and is comprised of consumer grade MLC flash memory from Samsung Electronics. All volumes created on the FlashArray are automatically presented to all front end ports and the intent is that hosts will round robin I/O across all front end ports (effectively wide striping on the front end).
Today’s controllers support 8Gbps Fibre Channel only, with other protocols, most notably 10Gbps iSCSI, on the horizon.
The dual controllers are connected to expansion shelves that house the solid state drives and the NVRAM (actually STEC ZeusIOPS SLC flash drives) via redundant 6Gbps SAS. Every expansion shelf houses 2 x NVRAM drives in the outer two drive bays. Housing NVRAM outside of the controllers contributes significantly to the stateless design of the controllers.
As well as NVRAM being outside of the controllers, system config data is also stored outside of the controllers, in the expansion shelves. System config data and other metadata is stored right alongside user data in the flash drives of the expansion shelves.
Stateless controllers (separating the configuration and user data from the compute power) is something I’m a fan of and this is a good move by Pure. For one thing, it makes swapping out the controllers for later generations a lot simpler.
The products announced today are the FA-310 single controller system (use at your own risk!) and the FA-320 dual controller HA system. Check the Pure website for more details as we won’t be talking marketing, we’re just going to talk tech!
That’s the high level primer done and dusted. Now let’s talk about the cool stuff….
Inline Deduplication of Primary Data
Pure are making a lot of noise about being able to dedupe primary data inline without taking a performance hit. If you’re like me, you’ll need a truck worth of salt to help you swallow a statement like that. Such things don’t seem possible, right!? Well bear with me because I think this is very interesting….
BTW, this is a cracking example of where solid state changes the rules.
Deduplication, a la Pure Storage FlashArray, is inline (post ACK to host) at a variable block size down to as small as 512 bytes. And it’s global across the entire namespace of the FlashArray. Good stuff.
As a write comes in to the FlashArray, a relatively weak hash is applied to the data to give the FlashArray a hint as to whether or not the data has been seen before. The hash is performed by built-in hashing functions available on the Intel chips running the FlashArray. There is no custom silicon here (no ASIC, no FPGA and not even any GPUs).
Of critical importance is the fact that utilising such simple built-in hashing functions carries very little overhead for the controllers.
That’s all well and good, but weak hashing functions will require byte-for-byte compares to make sure that we don’t make a mistake and think we have a match when we actually don’t. And byte-for-byte compares are expensive as hell when it comes to performance (you have to read data from disk). But therein lies the secret. The FlashArray has no disks in it, at least not the spinning kind. Reads from NAND flash are so fast they are practically free!
Soooooo….. using simple built-in hashing functions generates very low overhead for the controllers, and byte-for-byte compares from flash are lightning fast. Pretty cool stuff!
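To make the hint-then-verify idea concrete, here is a minimal Python sketch. It is not Pure's code: `DedupeStore` is a made-up name, `zlib.crc32` simply stands in for whatever weak hash Purity really uses (the details aren't public in this post), and a Python list plays the part of flash.

```python
import zlib


class DedupeStore:
    """Toy store: a weak hash gives a hint, a byte-for-byte compare gives proof."""

    def __init__(self):
        self.blocks = []    # unique blocks (stand-in for flash)
        self.by_hash = {}   # weak hash -> indexes of blocks sharing that hash

    def write(self, data: bytes) -> int:
        """Store data and return the index of the block that holds it."""
        hint = zlib.crc32(data)  # cheap hash; collisions are expected and tolerated
        for idx in self.by_hash.get(hint, []):
            # A hash match is only a hint, so read the candidate back and compare.
            if self.blocks[idx] == data:
                return idx       # true duplicate: point at the existing block
        self.blocks.append(data)
        self.by_hash.setdefault(hint, []).append(len(self.blocks) - 1)
        return len(self.blocks) - 1


store = DedupeStore()
a = store.write(b"x" * 512)
b = store.write(b"y" * 512)
c = store.write(b"x" * 512)   # same bytes as the first write
print(a, b, c, len(store.blocks))
```

The point to notice is that the compare only runs when the hash already matches, so the expensive-looking step is a read of one candidate block, which is exactly the operation flash is good at.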
Let’s quickly step through an incoming write to show how it fits together –
- A write enters the Pure FlashArray (where it is checksummed and copied between controllers over the IB backend)
- Basic pattern removal and a quick compression is performed on the write.
- The data is copied to the NVRAM in the expansion shelves. At this point the host receives an ACK, meaning all subsequent operations are asynchronous.
- Next the hash is applied to the data and a byte-for-byte compare is carried out to determine whether the data is unique or actually a duplicate of data already seen. Interestingly this manipulation of data is performed against a copy of the write in DRAM to make it even faster.
- Once the data has been deduplicated, all remaining data is unique and is run through a compression algorithm to squeeze it down even further. Remember, all of this is after the host has received an ACK but before the data is written to flash.
- From hereon in, writes are coalesced into structures called segments (usually 56MB), parity is calculated, the data is committed to flash and the copies in NVRAM and DRAM are released.
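The ordering in those steps is the interesting part, so here is a rough Python sketch of it. Everything in it is an assumption made for illustration: the `WritePath` class, the tiny `SEGMENT_BYTES` value and the use of `zlib` are mine, and the mirroring between controllers and the parity calculation are left out entirely.

```python
import zlib

SEGMENT_BYTES = 1024  # illustrative only; far smaller than a real segment


class WritePath:
    def __init__(self):
        self.nvram = []         # writes staged before the host ACK
        self.unique = set()     # data already stored (stand-in for a dedupe index)
        self.open_segment = []  # reduced chunks waiting to be committed
        self.flash = []         # committed segments, only ever appended to

    def host_write(self, data: bytes) -> str:
        checksum = zlib.crc32(data)          # checksum on arrival
        self.nvram.append((checksum, data))  # stage the write in NVRAM
        return "ACK"                         # the host sees the write as done here

    def drain(self) -> None:
        """Post-ACK work: dedupe, compress, coalesce, commit, release NVRAM."""
        for checksum, data in self.nvram:
            assert zlib.crc32(data) == checksum  # staged copy is still intact
            if data in self.unique:              # duplicate: nothing new to store
                continue
            self.unique.add(data)
            self.open_segment.append(zlib.compress(data))
            if sum(len(c) for c in self.open_segment) >= SEGMENT_BYTES:
                self._commit()
        if self.open_segment:   # flush the partial segment so nothing lives only in DRAM
            self._commit()
        self.nvram.clear()      # staged copies released only after the commit

    def _commit(self) -> None:
        self.flash.append(b"".join(self.open_segment))
        self.open_segment = []
```

Calling `host_write` three times and then `drain` once mimics a short burst: the host gets its ACKs straight away, and the data reduction happens afterwards against the staged copies.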
It is worth pointing out here that all writes to flash are append operations. At no point, under normal operating circumstances, does the FlashArray update data in-place. This avoids the write amplification that comes with in-place updates, which is good for both flash wear and tear and write performance.
Net net, all writes to flash are deduped, compressed and written in append mode. Perfect for writing to flash.
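If append-only is a new idea to you, this little Python sketch shows the general shape of it. It is a generic log-structured layout with made-up names (`AppendOnlyLog`, `garbage`), not a description of Purity's on-flash format.

```python
class AppendOnlyLog:
    def __init__(self):
        self.log = []    # flash stand-in: entries are only ever appended
        self.index = {}  # logical block address -> position of the newest version

    def write(self, lba: int, data: bytes) -> None:
        self.log.append(data)                # never overwrite an existing entry
        self.index[lba] = len(self.log) - 1  # repoint the address at the new copy

    def read(self, lba: int) -> bytes:
        return self.log[self.index[lba]]

    def garbage(self) -> int:
        """Entries nothing points at any more; a background process would reclaim them."""
        return len(self.log) - len(set(self.index.values()))


log = AppendOnlyLog()
log.write(7, b"old")
log.write(7, b"new")   # an "update" is just another append plus an index change
print(log.read(7), len(log.log), log.garbage())
```

The old copy is not touched, it simply becomes garbage to be cleaned up later, which is why an architecture like this needs background housekeeping.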
Pretty cool in my opinion, and something that just isn’t possible with architectures designed around spinning disk. Dedupe of primary data, inline, in spinning disk architectures just doesn’t exist. At least not as elegantly (I nearly said as pure) as this.
Actually, another reason why deduplication of primary data is rarely seen in spinning disk architectures is that it has the side effect of scattering your data all over the back-end, or more accurately leaving pointers that point to all sorts of random locations on the back end. And it’s spinning disk 101 that orderly (sequential) placement of data on the backend helps improve performance. If your data is laid out in nice contiguous chunks, reading it back will be a pleasant experience, but as soon as you start asking those read/write heads to jump here there and everywhere, all performance bets are off!
None of the above applies to flash. Flash is great at random reads, and the speed of random reads makes the reconstitution of deduplicated data fast! Another thing that spinning disk can struggle with, at least without the help of a large cache.
Interestingly, deduplication is not integral to the process of writing data to flash in the FlashArray. If the array is processing huge amounts of write data, the system will dynamically turn off deduplication so that write segments are flushed through the system even faster. Once the burst of high write activity is complete, deduplication is dynamically turned back on. However, users are not able to administratively control whether or not data is deduplicated. Interestingly and importantly though, data that was not deduplicated on ingestion will be deduplicated later (more on this shortly).
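Pure haven't said what triggers the switch, so treat the following Python as a guess at the general pattern rather than the real logic. The `DedupeGovernor` name, the idea of watching a pending-write count, and both thresholds are hypothetical.

```python
class DedupeGovernor:
    """Turns inline dedupe off under heavy write load and back on when it eases."""

    def __init__(self, high_water: int = 1000, low_water: int = 200):
        self.high_water = high_water  # hypothetical backlog that disables dedupe
        self.low_water = low_water    # hypothetical backlog that re-enables it
        self.inline_dedupe = True

    def update(self, pending_writes: int) -> bool:
        if self.inline_dedupe and pending_writes > self.high_water:
            self.inline_dedupe = False   # burst: skip dedupe, flush segments faster
        elif not self.inline_dedupe and pending_writes < self.low_water:
            self.inline_dedupe = True    # burst over: resume inline dedupe
        return self.inline_dedupe


gov = DedupeGovernor()
states = [gov.update(n) for n in (100, 1500, 500, 100)]
print(states)
```

Using two thresholds instead of one stops the setting from flapping on and off when the load hovers around a single trigger point.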
Also on the topic of compression, if the unique data being written to flash does not compress well, it will be laid out in its uncompressed form. This way there will be no decompress penalty (~50µs).
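Here is the compress-only-if-it-helps idea as a small Python sketch. The `encode` and `decode` helpers, the `"z"`/`"raw"` tags and the 10% `min_saving` cut-off are all my own inventions for illustration, and `zlib` is just a convenient stand-in for whatever compressor the array really uses.

```python
import os
import zlib


def encode(data: bytes, min_saving: float = 0.1):
    """Return (tag, payload); keep the compressed form only if it saves enough."""
    packed = zlib.compress(data)
    if len(packed) <= len(data) * (1 - min_saving):
        return ("z", packed)
    return ("raw", data)     # stored as-is, so reads skip decompression


def decode(tag: str, payload: bytes) -> bytes:
    return zlib.decompress(payload) if tag == "z" else payload


compressible = b"a" * 4096
incompressible = os.urandom(4096)   # random bytes barely compress at all
print(encode(compressible)[0], encode(incompressible)[0])
```

Either way the reader gets the original bytes back from `decode`; the only difference is whether the decompress step has to run.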
So, what kind of crazy vendor marketing dedupe figures can we expect to see?
Well….. according to Pure, the data back from the first 100 or so units shipped to customers suggests an average deduplication ratio of 5.8:1. Pretty good, and pretty believable. Oh and that’s with the array still operating at sub 1ms latency. As mentioned on the podcast, this will go a long way to making the purchase of an all flash array a reality. The guys at Pure are talking about $5-10/GB. Starts to make it compelling.
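To put that 5.8:1 figure into capacity terms, here is some back-of-the-envelope arithmetic in Python. The function names and the 10 TB and 100 TB inputs are just examples I picked; only the ratio comes from Pure, and it is an average, so your own data may reduce better or worse.

```python
def effective_capacity_tb(raw_tb: float, reduction_ratio: float) -> float:
    """Logical data that fits on raw_tb of flash at the given reduction ratio."""
    return raw_tb * reduction_ratio


def raw_needed_tb(logical_tb: float, reduction_ratio: float) -> float:
    """Raw flash needed to hold logical_tb of data at the given reduction ratio."""
    return logical_tb / reduction_ratio


RATIO = 5.8  # average ratio quoted by Pure
print(f"{effective_capacity_tb(10, RATIO):.1f} TB of data on 10 TB of raw flash")
print(f"{raw_needed_tb(100, RATIO):.1f} TB of raw flash for 100 TB of data")
```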
So with all of that in mind…..why aren’t all flash arrays doing dedupe?
Continuous Background Optimization
The Pure Storage FlashArray never rests, it is always trying to optimise the way that data is laid out on flash. There are background processes running (that Pure refer to as Continuous Background Optimization) that are constantly scouring the array looking for ways to keep it lean and improve the way data is protected and laid out.
As an example, the Continuous Background Optimization processes will try and group homogeneous data types such as –
- Segments of unique non-changing read heavy data. Such segments can be grouped together to minimise how often they are re-written, keeping flash wear and tear to a minimum.
- Segments of highly referenced dedupe data that can be made smaller and afforded extra protection.
Also of interest is that each time data is picked up and laid out elsewhere on flash, it is passed through the data reduction engine, meaning that it is checked again for deduplication opportunities. After all, more data may have been written to the array, adding potentially new dedupe opportunities. By continually examining the backend and re-laying it out via the data reduction engine –
- Data stored on flash is kept fresh. This is increasingly important as we move more and more toward consumer grade flash memory, and eventually TLC flash.
- There is no need for a dedicated rebalance operation to be used when new drives are added to the array. As a natural by-product of the data optimisation processes picking up and re-optimising the backend layout, data will eventually find itself evenly balanced across the larger back-end.
RAID-3D
RAID. Hmmmmmmm a technology from the 80s that is fundamental to protecting data in every data center across the globe. But also a technology that gets a lot of stick for being dated and unable to cope with today’s demands. Should be interesting…..
Well, for starters, RAID-3D is based on write segments, not disks, making it more of an object based RAID approach. This is a much more modern approach than creating RAID sets based around disks, and potentially works better with the type of failure modes seen in flash. Not only does it tend to make rebuilds faster and smarter (you don’t have to rebuild empty space that has no user data on it), it also lends itself to parallel RAID operations where multiple drives are involved in protecting and re-protecting data. Net net, fast rebuilds and re-protect operations.
It is also worth pointing out that during re-protect operations on the FlashArray, when new parity is calculated on segments that have reduced parity due to the failure, the data is again passed through the data reduction engine and therefore deduped again and re laid out.
When write data comes in to a FlashArray, write segments are filled up, and parity is calculated against the write segment (segments are actually complex structures made up of smaller structures with their own parity and checksums etc).
All data is stored with at least dual parity, with certain segments having a 3rd copy based on the importance of the data in that segment.
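For anyone who hasn't looked at parity in a while, here is the basic mechanism in Python. Big caveat: this is plain single XOR parity over the chunks of one segment, which is a deliberate simplification. RAID-3D uses at least dual parity and Pure's actual maths isn't covered here, so the function names and the toy chunks are purely illustrative.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity(chunks: list) -> bytes:
    """Parity chunk for one segment: the XOR of all its data chunks."""
    return reduce(xor_bytes, chunks)


def rebuild(surviving: list, parity_chunk: bytes) -> bytes:
    """Recover the single missing chunk from the survivors plus parity."""
    return reduce(xor_bytes, surviving, parity_chunk)


chunks = [b"AAAA", b"BBBB", b"CCCC"]   # one segment split across three drives
p = parity(chunks)
recovered = rebuild([chunks[0], chunks[2]], p)   # pretend the middle drive died
print(recovered)
```

Because parity is worked out per segment, only segments that actually exist need re-protecting after a failure, which is where the faster rebuilds come from.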
Data is then striped across the backend in a pseudo random fashion so that data is evenly spread across the backend, lending itself to nice self-balanced backend layouts.
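Pure don't say how the pseudo random placement is done, so the Python below swaps in a well known technique, rendezvous (highest random weight) hashing, just to show how random-looking placement ends up evenly spread. The `place` function, the drive names and the stripe width of 4 are all assumptions of mine.

```python
import hashlib
from collections import Counter


def place(segment_id: int, drives: list, width: int) -> list:
    """Pick `width` distinct drives for a segment, repeatably but pseudo-randomly."""
    def score(drive: str) -> int:
        digest = hashlib.sha256(f"{segment_id}:{drive}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    # The drives with the highest scores for this segment win.
    return sorted(drives, key=score, reverse=True)[:width]


drives = [f"ssd{i}" for i in range(12)]
usage = Counter(d for seg in range(6000) for d in place(seg, drives, 4))
print(min(usage.values()), max(usage.values()))
```

With enough segments, every drive ends up holding roughly the same number of chunks without anyone having to manage the layout by hand.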
In the event of flash failures, the FlashArray prioritises the re-protection of data rather than the rebuilding of the failed drive. In fact, there are no dedicated spare drives; instead, spare capacity is reserved within the pool on all drives in the pool. According to Pure, re-protect operations should take approximately 20 minutes. This is fast, and the fact that all data is at least dual parity protected should help owners of Pure Storage FlashArrays sleep well at night.
While on the topic of protecting data, it is also worth noting that all data is checksummed on arriving in the array. These checksums are checked every time data is read back from flash. This checksumming not only ensures that the data read back is error free (free from dreaded bit errors), it also ensures that the data being read back is the intended data.
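How can a checksum tell you that you got the intended data and not just intact data? One common trick is to fold the block address into the checksum. I don't know whether Pure do it this way, so the `seal` and `verify` helpers below, and the choice of CRC32, are assumptions for the sake of a Python example.

```python
import zlib


def seal(lba: int, data: bytes) -> int:
    """Checksum over the logical block address plus the data itself."""
    return zlib.crc32(lba.to_bytes(8, "big") + data)


def verify(lba: int, data: bytes, checksum: int) -> bool:
    """True only if the data is intact AND it belongs to this address."""
    return seal(lba, data) == checksum


stored = seal(42, b"payload")
print(verify(42, b"payload", stored))   # right data, right address
print(verify(42, b"paylOad", stored))   # corrupted data
print(verify(43, b"payload", stored))   # intact data returned for the wrong address
```

The third case is the sneaky one: the bytes are perfectly healthy, they just aren't the bytes that were asked for.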
Final Few Bits and Wrap-up
Full Array Encryption. Errr yeh…. not a great deal to say on this other than it is always on with zero key management. And apparently no overhead either. I don’t know so much about this at the moment.
Caching. The FlashArray reserves a small amount of DRAM as a read cache. This read cache will generally be populated with the most highly referenced dedupe data, as this is likely to be the most frequently accessed data. I personally like the logic behind this as it’s not the usual guesswork of read-ahead – instead it’s the guesswork that highly deduped data will be frequently accessed. And if it doesn’t prove overly useful for your I/O profile, not the end of the world as reads from flash are amazingly fast anyway.
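As a thought experiment, a cache that favours highly referenced blocks might look something like this in Python. The `RefCountCache` class and its eviction rule are entirely hypothetical; the only thing taken from Pure is the idea that dedupe reference counts are a decent predictor of what will be read.

```python
class RefCountCache:
    """Keeps the blocks with the highest dedupe reference counts."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = {}   # block id -> (refcount, data)

    def offer(self, block_id: str, refcount: int, data: bytes) -> bool:
        """Try to cache a block; returns True if it ends up in the cache."""
        if block_id in self.entries or len(self.entries) < self.capacity:
            self.entries[block_id] = (refcount, data)
            return True
        coldest = min(self.entries, key=lambda b: self.entries[b][0])
        if self.entries[coldest][0] >= refcount:
            return False               # newcomer is no more referenced: skip it
        del self.entries[coldest]      # evict the least referenced block
        self.entries[block_id] = (refcount, data)
        return True

    def get(self, block_id: str):
        entry = self.entries.get(block_id)
        return entry[1] if entry else None


cache = RefCountCache(capacity=2)
cache.offer("a", 50, b"A")
cache.offer("b", 3, b"B")
cache.offer("c", 10, b"C")   # pushes out "b", the least referenced block
print(cache.get("a"), cache.get("b"), cache.get("c"))
```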
All Flash Arrays: The market for all flash arrays is maturing at a rate of knots. There are already a ton of interesting startups with innovative technologies. But the big traditional vendors are now starting to wake up to this market with the likes of EMC acquiring XtremIO – a huge validation of the long term viability of the market itself. One has to wonder how long before more of the startups get acquired.
Pure Storage: The way I see it, Pure Storage are an interesting company with some disruptive technology playing in a potentially new and large market. They seem to have a cracking product underpinned with solid technology. They are aiming at the traditional Tier 1 storage market of FC attached Symmetrix VMAX, HDS VSP, HP 3PAR etc. This is a big market, tons bigger than the so called Tier 0 market. However, this large Tier 1 market is dominated by a big hairy gorilla called EMC. And they don’t like other people coming and setting up camp in their territory.
If you like listening to technical podcasts, or have a commute to work where you’d like to listen to engaging technical discussion, Rick Vanover and I recorded a Deep Dive podcast on the Pure Storage FlashArray and all of the cool technology it employs. The podcast can be listened to or downloaded in MP3 format from the links at the top of this post. It’s a cracking technical discussion.