
Thread: easy way to tell if your SSD is failing

  1. #1
    Registered User
    Join Date
    Mar 2008
    Location
    Knoxville, TN
    Posts
    13

    easy way to tell if your SSD is failing

    I work for a government contractor. We supply equipment to them that uses an SSD. This equipment runs 24 hours a day, 7 days a week processing information. It has Win XP Pro so it is not an SSD “aware” OS.

    It is my understanding that the average write/erase lifespan for each cell is approximately 100,000 cycles.

    My question is this:
    Is there a way to see this drive failing BEFORE it actually fails?
    For example: since we use a 32GB SSD, when I go to the drive's properties it shows up as 29.8GB in Windows. As the drive starts failing because individual cells go bad, will this number shrink to reflect the current USABLE maximum space?
    This would be a very easy way to tell that our SSDs are starting to fail.

    What do you think? Is there a way to know before a drive fails?

    Thanks in advance for your help

  2. #2
    SLC One_Hertz's Avatar
    Join Date
    Oct 2004
    Location
    Ottawa, Canada
    Posts
    2,953
    No, there is no way to really tell. It happens suddenly and usually with almost no hope for recovery.

    If you are lucky, the drive will still allow itself to be read (as when my X25-E failed). A normal Windows machine won't even boot with such a drive plugged in, though, because after a single write command the drive stays busy until it is power-cycled. You need a hardware imaging tool to get the data off. I guess a hardware write blocker might work as well, but I have not tried that.
    Last edited by One_Hertz; 12-22-2009 at 08:09 AM.

  3. #3
    Xtreme Addict
    Join Date
    Jun 2006
    Posts
    1,794
    Are you sure the NAND failed, and not the controller logic? It makes sense that failed controller logic would leave the drive unable to do anything at all, but failed NAND should be easy to recover from.
    P5E64_Evo/QX9650, 4x X25-E SSD - gimme speed..
    Quote Originally Posted by MR_SmartAss View Post
    Lately there has been a lot of BS(Dave_Graham where are you?)

  4. #4
    SLC One_Hertz's Avatar
    Join Date
    Oct 2004
    Location
    Ottawa, Canada
    Posts
    2,953
    Quote Originally Posted by alfaunits View Post
    Are you sure the NAND failed or was it the controller logic? It's logical that controller logic would make the drive incapable of doing anything really, but failed NAND should recover easy.
    I don't know what failed. All I know is that my particular drive would read every LBA properly, but a single write command would send it into a permanent busy state. A normal machine cannot deal with this; mine wouldn't even boot into Windows with it plugged in. Intel did a cross-ship RMA for me and was very apologetic, so kudos to them.

  5. #5
    Xtreme Member
    Join Date
    Jul 2008
    Location
    Michigan
    Posts
    300
    Quote Originally Posted by cloudkat View Post
    I work for a government contractor. We supply equipment to them that uses an SSD. This equipment runs 24 hours a day, 7 days a week processing information. It has Win XP Pro so it is not an SSD “aware” OS.

    It is my understanding that the average write/erase lifespan for each cell is approximately 100,000 cycles.

    My question is this:
    Is there a way to see this drive failing BEFORE it actually fails?
    For example: since we use a 32GB SSD, when I go to the drive's properties it shows up as 29.8GB in Windows. As the drive starts failing because individual cells go bad, will this number shrink to reflect the current USABLE maximum space?
    This would be a very easy way to tell that our SSDs are starting to fail.

    What do you think? Is there a way to know before a drive fails?

    Thanks in advance for your help
    Not sure about anything that can report whether an SSD will fail or is likely to fail, but you might be able to tell how much the drive has been used or written to. I don't have the link, but I think I remember seeing a post on the OCZ forums about using CrystalDiskInfo or HD Tune to see the read/write history of the SSD.
    MainGamer PC----Intel Core i7 - 6GB Corsair 1600 DDR3 - Foxconn Bloodrage - ATI 6950 Modded - Areca 1880ix-12 - 2 x 120GB G.Skill Phoenix SSD - 2 x 80GB Intel G2 - Lian LI PCA05 - Seasonic M12D 850W PSU
    MovieBox----Intel E8400 - 2x 4GB OCZ 800 DDR2 - Asus P5Q Deluxe - Nvidia GTS 250 - 2x30GB OCZ Vertex - 40GB Intel X25-V - 60GB OCZ Agility- Lian LI PCA05 - Corsair 620W PSU

  6. #6
    Xtreme CCIE Serra's Avatar
    Join Date
    Dec 2004
    Location
    St. Louis, MO
    Posts
    3,979
    You should not see a decrease in available space before a drive dies, because there is typically a bit of extra capacity built in that takes over for failed sectors (as in traditional drives). So you are down to SMART monitoring, as with any drive, I suspect.
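    Since SMART is about all we have, here is a minimal Python sketch of how a monitoring tool interprets SMART attributes: an attribute has tripped once its normalized value falls to or below the vendor-defined threshold. The attribute table below is made up for illustration; real values would come from the drive (e.g. via CrystalDiskInfo or smartctl).

```python
# Sketch: the SMART pass/fail convention used by monitoring tools.
# An attribute has "tripped" once its normalized value falls to or
# below the vendor-defined threshold.

def attribute_is_failing(normalized: int, threshold: int) -> bool:
    """SMART convention: tripped when normalized <= threshold."""
    return normalized <= threshold

# Hypothetical snapshot of an SSD's SMART table:
# (id, name, normalized value, threshold)
smart_table = [
    (0xE8, "Available_Reserved_Space", 99, 10),
    (0xE9, "Media_Wearout_Indicator", 97, 0),
    (0x05, "Reallocated_Sector_Count", 100, 0),
]

failing = [name for _id, name, value, thresh in smart_table
           if attribute_is_failing(value, thresh)]
print(failing)  # an empty list means no attribute has tripped yet
```

    This only warns about wear-type failures the drive tracks; as noted elsewhere in the thread, sudden controller failures give no such warning.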
    Dual CCIE (Route\Switch and Security) at your disposal. Have a Cisco-related or other network question? My PM box is always open.

    Xtreme Network:
    - Cisco 3560X-24P PoE Switch
    - Cisco ASA 5505 Firewall
    - Cisco 4402 Wireless LAN Controller
    - Cisco 3502i Access Point

  7. #7
    Xtreme Addict
    Join Date
    Jun 2006
    Posts
    1,794
    Unfortunately, not all cells/SSDs have the same life cycle. Some cells might even fail after only a few writes.
    If we could compare the number of relocated cells against the amount of spare cells remaining, that should tell us enough, no?

    OH's situation has me worried. However, I thought that in such a situation the SSD would return a write error for the faulty location and keep working - I'm inclined to think it was a controller logic issue (i.e. controller hardware, not the NAND or the firmware).

  8. #8
    SLC One_Hertz's Avatar
    Join Date
    Oct 2004
    Location
    Ottawa, Canada
    Posts
    2,953
    Quote Originally Posted by alfaunits View Post
    Unfortunately, not all cells/SSDs have the same life cycle. Some cells might even fail after only a few writes.
    If we could compare the number of relocated cells against the amount of spare cells remaining, that should tell us enough, no?

    OH's situation has me worried. However, I thought that in such a situation the SSD would return a write error for the faulty location and keep working - I'm inclined to think it was a controller logic issue (i.e. controller hardware, not the NAND or the firmware).
    Most (if not nearly all) failures right now are due to the controllers. I believe that in my case the controller went into some sort of safety mode. Why it did that, I have no idea (maybe it was running out of reserve space?). I hardly think the amount of wear this SSD was subjected to was anywhere near the server environment it was supposedly designed for.

  9. #9
    Xtreme Member Marios's Avatar
    Join Date
    Dec 2005
    Posts
    427
    Flash memory life is predictable, though an SSD might still fail without any warning signs for reasons other than flash wear.

    I use the SMART data from CrystalDiskInfo.
    We can check the health status, or get more precise info depending on the disk we have. For example...

    On the Intel X25-M we can check the Media Wearout Indicator.



    On Mtron drives we can use the Total Erase Count raw value.



    To calculate the remaining life from the Total Erase Count raw value (the number of erase-block operations performed), we need to know a few more things:

    The SSD size in GB.
    The block size in KB.
    Months of SSD usage.
    The flash memory specification, i.e. how many times each block can be erased. Typical values: MLC 1,000-100,000; SLC 100,000-1,000,000.

    The written percentage of the disk also matters.
    Assuming that stays roughly constant, we can get an estimate of the remaining life in the following way.
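    A sketch of that estimation in Python. All the numbers in the example are illustrative, not from a real drive, and it assumes near-perfect wear leveling (total erases spread evenly over all blocks):

```python
# Sketch: remaining-life estimate from the Total Erase Count raw value,
# assuming near-perfect wear leveling. All inputs are illustrative.

def remaining_life_months(total_erase_count: int,
                          ssd_size_gb: float,
                          block_size_kb: float,
                          months_used: float,
                          rated_cycles: int) -> float:
    """Months of life left, extrapolating the observed erase rate.

    blocks          = drive capacity / erase-block size
    avg_cycles_used = total erases spread evenly over all blocks
    """
    blocks = (ssd_size_gb * 1024 * 1024) / block_size_kb
    avg_cycles_used = total_erase_count / blocks
    cycles_per_month = avg_cycles_used / months_used
    return (rated_cycles - avg_cycles_used) / cycles_per_month

# Example: 32 GB SLC drive, 128 KB erase blocks, 6 months old,
# 13,107,200 total block erases, rated for 100,000 cycles per block.
months = remaining_life_months(13_107_200, 32, 128, 6, 100_000)
```

    At that (light) erase rate the projection comes out to centuries, which matches the thread's point that flash wear is rarely what kills a drive first.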


  10. #10
    Xtreme Member Marios's Avatar
    Join Date
    Dec 2005
    Posts
    427
    For Intel SSDs I think the "Host Writes" raw value is the same as the Total Erase Count raw value.
    So we might also be able to use this number to estimate the remaining life of our SSD.
    I can't confirm this, though.

    E1 Host Writes: is the raw value a measure of how many erase-block operations have been performed?
    Any help is appreciated.

  11. #11
    Xtreme Addict
    Join Date
    Jun 2006
    Posts
    1,794
    You presume that all flash has the same, maximum lifespan - it doesn't. Some cells will die well before the rated count, faster than actual use would suggest.
    We need something other than the erase count - perhaps reserve space usage?

  12. #12
    Xtreme Mentor Ao1's Avatar
    Join Date
    Feb 2009
    Posts
    2,597
    Quote Originally Posted by alfaunits View Post
    You presume that all flash has the same, maximum lifespan - it doesn't. Some cells will die well before the rated count, faster than actual use would suggest.
    We need something other than the erase count - perhaps reserve space usage?
    Intel drives
    E8 - Available Reserved Space
    This attribute reports the number of reserve blocks remaining. The attribute value begins at 100 (64h), which indicates that the reserved space is 100 percent available. The threshold value for this attribute is 10 percent availability, which indicates that the drive is close to its end of life. Use the Normalized value for this attribute.
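    As a sketch, the E8 interpretation described above boils down to the following (threshold and semantics as quoted; the function name is mine):

```python
# Sketch: interpreting Intel's E8 "Available Reserved Space" attribute.
# The normalized value starts at 100 (all reserve blocks available);
# the drive is considered near end of life at the threshold of 10.

E8_THRESHOLD = 10

def reserve_space_status(normalized_e8: int) -> str:
    """Turn the normalized E8 value into a plain-language verdict."""
    if normalized_e8 <= E8_THRESHOLD:
        return "replace drive: reserved space nearly exhausted"
    return f"ok: {normalized_e8}% of reserved space still available"

print(reserve_space_status(100))
print(reserve_space_status(10))
```

    A monitoring script could poll this normalized value and alert well before it ever reaches 10.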


    Edit: Maybe it's academic anyway, as all the SSD failures I have heard about to date occurred without notice and well before the max erase count was anywhere near an end date. There is no SMART attribute to tell you when the controller is going to cr*p out.
    Last edited by Ao1; 12-22-2009 at 12:04 PM.

  13. #13
    Xtreme Addict
    Join Date
    Jun 2006
    Posts
    1,794
    You also have to worry about RAID controllers cra*ping out, just as much as SSD controllers.
    Hopefully the OP was asking about the flash part only.

  14. #14
    Xtreme Member Marios's Avatar
    Join Date
    Dec 2005
    Posts
    427
    For Intel SSDs we can use 32 MB as the block size and what I said before should work.
    The actual erase block is 128 KB, but the "Host Writes" count is incremented by 1 for every 65,536 sectors (512 bytes each) written by the host - i.e. every 32 MiB.

    Quote Originally Posted by alfaunits View Post
    You presume that all flash has the same, maximum lifespan - it doesn't. Some cells will die well before the rated count, faster than actual use would suggest.
    We need something other than the erase count - perhaps reserve space usage?
    I would say that most MLC chips average about 5,000 erase cycles per block. We can use that number and ignore deviations of a small percentage.
    If the wear leveling is not perfect, we just lose some erase blocks: the reallocated sector count increases until all the spare space is gone, and then the disk starts getting smaller and smaller. The remaining erase blocks, though, have a longer lifespan. Either way, this is the best way to get a measure of the flash memory's condition.
    That said, I think the wear leveling is almost perfect and the erase-count differences between blocks are very, very small.
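    Putting those numbers together as a sketch: each raw-value unit of E1 "Host Writes" corresponds to 65,536 sectors of 512 bytes, i.e. 32 MiB written by the host.

```python
# Sketch: converting Intel's E1 "Host Writes" raw value into bytes,
# using the 65,536-sector increment described above.

SECTOR_BYTES = 512
SECTORS_PER_UNIT = 65_536                           # raw value +1 per 65,536 sectors
BYTES_PER_UNIT = SECTOR_BYTES * SECTORS_PER_UNIT    # = 32 MiB per unit

def host_writes_gib(raw_value: int) -> float:
    """Total GiB written by the host for a given E1 raw value."""
    return raw_value * BYTES_PER_UNIT / 1024**3

# Example: a raw value of 32,768 units -> 1 TiB written in total.
gib = host_writes_gib(32_768)
print(gib)  # 1024.0
```

    Dividing that total by the drive's capacity then gives the average number of full-drive write passes, which is the figure to compare against the rated cycle count.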

  15. #15
    Xtreme Addict
    Join Date
    Jun 2006
    Posts
    1,794
    No, I mean that two flash cells do not have the same lifespan - one may last 4,000 write cycles, another 4,001, or 5,000, etc. The 5,000 figure is just an average, and an estimate.
    That's why we can't tell whether a certain block will fail soon, even if we know its exact erase count.

  16. #16
    Xtreme Member Marios's Avatar
    Join Date
    Dec 2005
    Posts
    427
    A very small percentage of cells deviates from the average. No problem.
    On the other hand, it is very good to know that we have already consumed, for example, 4,000 of the 5,000-cycle average lifespan of our cells.
    Don't you think?
    Would you ever buy a used SSD with those statistics?

  17. #17
    Xtreme Addict
    Join Date
    Nov 2003
    Location
    NYC
    Posts
    1,589
    Quote Originally Posted by Marios View Post
    A very small percentage of cells deviates from the average. No problem.
    On the other hand it is very good to know we have already consumed for example 4000 out of the 5000 average lifespan of our cells.
    Don't you think?
    Would you ever buy a used SSD with those statistics?
    Most SSD manufacturers design them with a projected lifespan of over 10 years. If a drive has already consumed 4,000 of 5,000 cycles, it has seen a very hard life, almost unrealistically so.

  18. #18
    Xtreme Mentor Ao1's Avatar
    Join Date
    Feb 2009
    Posts
    2,597
    Who is worried about the max erase count when the capacitors have a lifespan of 1,000 hours @ 70°C? Check out page 19 of this IDF 2009 presentation by Intel. I think that is an OCZ Summit, but I could be wrong.

    Check out page 27 to see my next SSD

  19. #19
    Xtreme Addict
    Join Date
    Jun 2006
    Posts
    1,794
    If only they'd make those affordable for the average Joe

  20. #20
    Xtreme Addict
    Join Date
    Nov 2003
    Location
    NYC
    Posts
    1,589
    Quote Originally Posted by audienceofone View Post
    Who is worried about the max erase count when the capacitors have a lifespan of 1,000 hours @ 70°C? Check out page 19 of this IDF 2009 presentation by Intel. I think that is an OCZ Summit, but I could be wrong.

    Check out page 27 to see my next SSD
    but the drives typically run at 25-30°C

  21. #21
    Xtreme Mentor Ao1's Avatar
    Join Date
    Feb 2009
    Posts
    2,597
    Quote Originally Posted by Levish View Post
    but the drives typically run at 25-30°C
    Maybe the ambient air temp is, but the caps will get hotter. Either way, the point is that the controllers seem to be the weak link, not the NAND.

    EDIT: SSD running temps vary as can be seen here.
    Last edited by Ao1; 12-28-2009 at 06:08 AM.

  22. #22
    Xtreme Member
    Join Date
    Nov 2003
    Posts
    196
    Quote Originally Posted by cloudkat View Post
    Is there a way to see this drive failing BEFORE it actually fails?
    Well, there is no 100% accurate way, but statistically you can get an early warning in many cases:

    http://www.hdsentinel.com/

    My friend runs 12 NAS servers and several SSD systems and he swears by it (after having rebuilt/recovered several RAID configs).

    YMMV, of course. I'm not affiliated with them, just passing along info.
