SSD failures in RAID

SSDs seem like the answer to I/O bottlenecks - and rightly so.

As the initial concerns over the limited write endurance of the NAND that sits behind the storage are pushed aside in the clamor for raw speed, SSDs are finding their way into production environments.

SO - you are looking at the pricing and wondering why there is such a difference between the enterprise-grade hardware and the domestic grade.... a difference that is likely to rear its head some time later, it would appear.

Some tests I ran with Constellation NL-SAS drives on a PERC H310, versus SSDs on the same controller, showed a noticeable increase in speed: around 600MB/sec cache-dependent and dropping off, against 1100MB/sec sustained. And the latency figures sat at either end of the scale: from barely measurable to 5-10 milliseconds.

Sure - the enterprise-class PCIe SSDs that you get with the likes of the Dell Fluid Cache configurations will do 40MB/sec for 5 years... but the lower-end stuff?

On the whole we use the Crucial (Micron) MX100 in 512GB. These seem more substantial physically, and certainly faster, than the Lite-On 512s that come Dell-branded with servers such as the PE220 and above. We have yet to see any of these fail.

HOWEVER - and this is a big however - I have heard reports from other parties within the group that cheaper options have been failing. Let us not forget that SSDs tend to just stop - dead. If you are lucky you get a light; otherwise it is just dead - nothing - silence. Unlike mechanical disks, which will start to generate SMART errors before file-system errors, or make the distinctive sounds of a drive in pain, SSDs simply stop working. From personal experience this can happen either in use or on restart.

So let's assume we have two givens: first, that barring manufacturing defects there is a finite number of write operations in an SSD; second, that we are going to mitigate the threat of 'mechanical' failure through a RAID1 mirror. Great.

Sure. Right up to the point you realise that the two devices are now being taken for the same ride, burning through I/O operations at exactly the same rate.

So - they are likely to fail at around the same time, in my book. No?
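As a back-of-envelope sketch of why that matters - the endurance rating and daily write volume here are hypothetical figures, not measurements from our kit:

```python
# Rough wear-out estimate for a RAID1 pair. Every write hits both
# members of a mirror, so they burn their endurance budget in lockstep.

def days_until_worn(tbw_rating_tb, writes_per_day_gb):
    """Days until the rated terabytes-written (TBW) figure is exhausted."""
    return (tbw_rating_tb * 1024) / writes_per_day_gb

# Hypothetical figures: a 72TB-rated consumer drive seeing 50GB/day.
primary = days_until_worn(72, 50)
mirror = days_until_worn(72, 50)   # identical write stream, identical wear

print(round(primary))  # ~1475 days - and the mirror hits the limit the same day
```

Mirroring protects against independent failures; it does nothing for a failure mode the two drives share.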

Step up: plans are afoot to introduce a policy of cycling the slave drive in RAID1 arrays on a periodic basis.
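One way to sketch such a schedule - the deployment date, lifetime estimate and swap count below are all made-up illustrative values:

```python
# Sketch of a staggered swap schedule for the slave drive in a RAID1
# array. Replacing one member partway through its life means the pair
# no longer share an identical wear level, so they shouldn't die together.
from datetime import date, timedelta

def swap_dates(deployed, expected_life_days, swaps=3):
    """Swap the slave at even fractions of the expected drive lifetime."""
    step = expected_life_days // (swaps + 1)
    return [deployed + timedelta(days=step * i) for i in range(1, swaps + 1)]

# Hypothetical: array built 2015-01-01, drives expected to last ~1475 days.
for d in swap_dates(date(2015, 1, 1), 1475):
    print(d.isoformat())
```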

Better solutions in an ideal world:

- Software-level support for using SSDs as a cache for higher-capacity drives;

- Better still, introduction into the kernel of the ability to mix drive types in a form of hierarchy more akin to the likes of EMC solutions: NL-SAS for long-term storage, SAS for live data, and SSDs for hot cached items - with live migration and consistency between them.

- RAID controllers, or SSD SMART support, that will manage this kind of abrupt failure.
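On that last point, part of it can be approximated today by watching the normalised wear attributes that most SSDs do expose. A minimal sketch, assuming the drive reports one of the common vendor attribute names - the attribute list is an assumption, and the sample line is made up, though the column layout follows `smartctl -A` output:

```python
# Sketch: pull an SSD wear indicator out of `smartctl -A` text output.
# Attribute names vary by vendor, so we try a couple of common ones.
WEAR_ATTRS = ("Wear_Leveling_Count", "Media_Wearout_Indicator")

def wear_remaining(smartctl_output):
    """Normalised VALUE column (100 = new) of the first wear attribute found."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] in WEAR_ATTRS:
            return int(fields[3])
    return None  # drive exposes no recognised wear attribute

# One line of (made-up) smartctl output for a Samsung-style attribute:
sample = "177 Wear_Leveling_Count 0x0013 097 097 000 Pre-fail Always - 123"
print(wear_remaining(sample))  # 97
```

Fed into the monitoring system, that gives the early warning an SSD will never give you on its own.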
