easy way to tell if your SSD is failing

**cloudkat** · 12-22-2009, 07:49 AM

I work for a government contractor. We supply equipment to them that uses an SSD. This equipment runs 24 hours a day, 7 days a week processing information. It has Win XP Pro so it is not an SSD “aware” OS.

It is my understanding that the average read/write lifespan for each cell is approximately 100K.

My question is this:
Is there a way to see this drive failing BEFORE it actually fails?
An example of this would be since we use a 32GB SSD, when I go to the drives properties it shows up as 29.8GB in Windows. As this drive starts failing because of individual cells going down, will this number change to reflect the current USABLE maximum space?
This would be a very easy way to tell that our SSDs are starting to fail.

What do you think? Is there a way to know before a drive fails?

Thanks in advance for your help

**One_Hertz** · 12-22-2009, 08:53 AM

No, there is no way to really tell. It happens suddenly and usually with almost no hope for recovery.

If you are lucky, it will allow itself to be read still (like when my X25-E failed). A normal windows machine won't be able to boot with such a drive plugged in though because it just goes into busy until repower after a single write command. You need a hardware imaging tool to get data off. I guess a hardware write blocker could work as well, but I have not tried that.

**alfaunits** · 12-22-2009, 09:13 AM

Are you sure the NAND failed or was it the controller logic? It's logical that controller logic would make the drive incapable of doing anything really, but failed NAND should recover easy.

**One_Hertz** · 12-22-2009, 09:57 AM

Originally Posted by alfaunits

Are you sure the NAND failed or was it the controller logic? It's logical that controller logic would make the drive incapable of doing anything really, but failed NAND should recover easy.

I don't know what failed. All I know is that my particular drive would read every LBA properly, but a single write command would send it into permanent busy state. A normal machine can not deal with this. Mine wouldn't even boot into windows with it plugged in. Intel did a cross ship RMA for me and was very apologetic so

to them.

**Spoiler** · 12-22-2009, 10:10 AM

Originally Posted by cloudkat

I work for a government contractor. We supply equipment to them that uses an SSD. This equipment runs 24 hours a day, 7 days a week processing information. It has Win XP Pro so it is not an SSD “aware” OS.

It is my understanding that the average read/write lifespan for each cell is approximately 100K.

My question is this:
Is there a way to see this drive failing BEFORE it actually fails?
An example of this would be since we use a 32GB SSD, when I go to the drives properties it shows up as 29.8GB in Windows. As this drive starts failing because of individual cells going down, will this number change to reflect the current USABLE maximum space?
This would be a very easy way to tell that our SSDs are starting to fail.

What do you think? Is there a way to know before a drive fails?

Thanks in advance for your help

Not sure about anything that can report if an SSD will fail or is likely to fail, but you might be able to tell how much the drive has been used or written to. I don't have the link, but I thought I remember seeing a post on the OCZ forums about using CyrstalDiskinfo or HD Tune to see the read/write history of the SSD.

**Serra** · 12-22-2009, 10:44 AM

You should not see a decrease in available space before a drive dies because typically there is a bit of extra capacity built in that takes over for failed sectors (like in traditional drives). So you would be down to SMART monitoring, like with any drive I suspect.

**alfaunits** · 12-22-2009, 10:46 AM

Unfortunately, not all cells/SSDs have the same lifecycle

Some cell might fail after a few writes even.
If we could tell the amount of cells that are relocated vs. amount of extra storage cells that should tell enough?

OH's situation has me worried however

I thought in such a situation the SSD would return a write error to the faulty location and continue working - I'm inclined to think that it's a controller logic issue (i.e. controller hardware, not the NAND part or firmware)

**One_Hertz** · 12-22-2009, 11:06 AM

Originally Posted by alfaunits

Unfortunately, not all cells/SSDs have the same lifecycle

Some cell might fail after a few writes even.
If we could tell the amount of cells that are relocated vs. amount of extra storage cells that should tell enough?

OH's situation has me worried however

I thought in such a situation the SSD would return a write error to the faulty location and continue working - I'm inclined to think that it's a controller logic issue (i.e. controller hardware, not the NAND part or firmware)

Most (if not nearly all) failures right now are due to the controllers. I believe in my case the controller went into some sort of safety mode. Why it did that, I have no idea (maybe was running out of reserve space?). I hardly think the amount of wear this SSD was subjected to was anywhere near a server environment it was supposedly designed for.

**Marios** · 12-22-2009, 11:13 AM

Flash memory life is predictable. Though an SSD might fail without any warning signs for reasons other than flash wearing.

I use SMART from CyrstalDiskinfo.
We may check the health status or get more precise info depending on the disk we have. For example...

On Intel X25m we may check media wear out indicator.

On Mtron we may use Total erase count raw value

To calculate the remaining life from the Total Erase Count raw value (this is how many times we performed an erase block operation) we need to know a few more things which are:

The SSD size in GB.
The Block size in ΚΒ.
Months of SSD usage.
Flash memory specification. This is how many times we may erase each block. Typical values: MLC:1000-100000 SLC:100000-1000000.

The written percentage of the disk is also very important.
So in case we are not going to change that we may get an estimation about the remaining life in the following way.

**Marios** · 12-22-2009, 12:15 PM

For Intel SSDs I think "Host Writes" (the raw value) is the same like "Total erase count raw value".
So we might also be able to use this number to estimate the remaining life of our SSD.
I can't confirm this though.

E1 Host Writes: Is the raw value a measure of how many times we performed an erase block operation?
Any help is appreciated.

**alfaunits** · 12-22-2009, 12:19 PM

You presume that all flash has the same and maximum life span - it doesn;t. That will kill the flash faster than actual use.
We need something other than erase count - probably reserve space usage?

**Ao1** · 12-22-2009, 12:56 PM

Originally Posted by alfaunits

You presume that all flash has the same and maximum life span - it doesn;t. That will kill the flash faster than actual use.
We need something other than erase count - probably reserve space usage?

Intel drives
E8 - Available Reserved Space
This attribute reports the number of reserve blocks remaining. The attribute value begins at 100 (64h), which indicates that the reserved space is 100 percent available. The threshold value for this attribute is 10 percent availability, which indicates that the drive is close to its end of life. Use the Normalized value for this attribute.

Edit: Maybe academic anyway as all SSD failures I have heard about to date occurred without notice and before the max erase count was even close to an end date. There is not a SMART attribute to tell you when the controller is going to cr*p out.

**alfaunits** · 12-22-2009, 02:30 PM

You also have to worry about the RAID controllers cra*ping out, just as much as SSD controllers.
Hope at least the OP meant flash part only.

**Marios** · 12-22-2009, 02:48 PM

For Intel SSDs we may use 32 as block size and what I said before should work.
The actual block size is 128 but the "host writes" count is increased by 1 for every 65,536 sectors (512b) written by the host.

Originally Posted by alfaunits

You presume that all flash has the same and maximum life span - it doesn;t. That will kill the flash faster than actual use.
We need something other than erase count - probably reserve space usage?

I would say that most MLC chips have an average of 5000 erase eraseblock cycles. We may use this number and forget any deviation form a small percentage.
In case the wear leveling is not perfect we just get some eraseblocks out, the reallocated sector count increases until all the spare space is gone, then we start getting a smaller and smaller hard disk. The rest eraseblocks though have a longer lifespan. Anyway this is the best way to have a measure of the flash memory condition.
Though I think the wear leveling is almost perfect and the erase count differences between the eraseblocks are very very small.

**alfaunits** · 12-22-2009, 06:07 PM

No, I mean two flash cells do not have the same lifespan - one may have 4000 write cycles one 4001 or 5000 etc. The 5000 is just an average, and an estimate.
That's why we can't tell if a certain block will fail soon, even if we know exact erase count on it.

**Marios** · 12-23-2009, 08:23 PM

A very small percentage of cells deviates from the average. No problem.
On the other hand it is very good to know we have already consumed for example 4000 out of the 5000 average lifespan of our cells.
Don't you think?
Would you ever buy a used SSD with those statistics?

**Levish** · 12-27-2009, 08:04 AM

Originally Posted by Marios

A very small percentage of cells deviates from the average. No problem.
On the other hand it is very good to know we have already consumed for example 4000 out of the 5000 average lifespan of our cells.
Don't you think?
Would you ever buy a used SSD with those statistics?

Most ssd manuf create them with a projected lifespan of over 10 years. If you are at 4000 of 5000 writes left then its seen a very hard life, almost unrealistically so.

**Ao1** · 12-28-2009, 03:14 AM

Who is worried about the max erase count when the capacitors have a lifespan of 1,000 hours @ 70oc. Check out page 19 of this IDF 2009 presentation by Intel. I think that is an OCZ Summit but I could be wrong.

Check out page 27 to see my next ssd

**alfaunits** · 12-28-2009, 03:22 AM

If only they'd make those affordable to Average Joe

**Levish** · 12-28-2009, 06:35 AM

Originally Posted by audienceofone

Who is worried about the max erase count when the capacitors have a lifespan of 1,000 hours @ 70oc. Check out page 19 of this IDF 2009 presentation by Intel. I think that is an OCZ Summit but I could be wrong.

Check out page 27 to see my next ssd

but the drives are typically at 25c-30c

**Ao1** · 12-28-2009, 06:43 AM

Originally Posted by Levish

but the drives are typically at 25c-30c

Maybe the ambient air temp is, but the caps will get hotter. Either way the point is that the controllers seem to be the weak link not the nand.

EDIT: SSD running temps vary as can be seen here.

**halcyon** · 01-04-2010, 01:25 AM

Originally Posted by cloudkat

Is there a way to see this drive failing BEFORE it actually fails?

Well, there is no 100% accurate way, but statistically you can get an early warning in many cases:

http://www.hdsentinel.com/

My friend runs 12 NAS servers and several SSD systems and he swears by it (after having rebuilt/recovered several RAID configs).

YMMV, of course. I'm not affiliated with them, just passing along info.

Thread: easy way to tell if your SSD is failing

Thread Tools

Search Thread

Rate This Thread

Display

easy way to tell if your SSD is failing

Bookmarks

Bookmarks

Posting Permissions