    Have you checked your hard drives lately? SMARTCTL

    Do you check your hard drives using smartctl? You should, and here's why:

    I have a family server. With the exception of moving (three times in the last four years), Hurricane Matthew (power off for a couple of days), and one or two other power outages, the server has been up and running constantly for 8 years.

    The original drive configuration was 4 drives: 2X1TB and 2X2TB. At the time of the build, the 1TB drives were a year or so old and the 2TB drives were added, one at a time, to increase capacity - one in 2012 and the other in 2013. Later, wanting more capacity, I swapped out the 1TB drives for a single 6TB drive in 2014. Finally, wanting full mirrored capacity, in 2017 I added a second 6TB drive, reconfigured both pairs into RAID1 for data, and added a single small SSD as the boot drive. This left me with 8TB of redundant storage for data, which has been plenty.

    Here are the drive ages this morning in years and days. These are not derived from calendar days, but rather from the Power_On_Hours of the drives (a quick way to pull that number is sketched just after the list):
    sda (6TB) - 3 years, 247 days
    sdb (2TB) - 4 years, 146 days
    sdc (6TB) - 1 year, 42 days
    sdd (2TB) - 6 years, 80 days
    sde (64GB) - 164 days
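
    For the curious, pulling those numbers is a one-liner per drive. A minimal sketch, assuming each drive reports SMART attribute 9 (Power_On_Hours) with the raw value in plain hours:
    Code:
    #!/bin/bash
    # Rough age report from SMART attribute 9 (Power_On_Hours).
    # Assumes the raw value really is in hours (true for most drives).
    for d in /dev/sd?; do
        hours=$(sudo smartctl -A "$d" | awk '$1 == 9 {print $10}')
        [ -n "$hours" ] && printf '%s - %d years, %d days\n' \
            "$d" $((hours / 8760)) $((hours % 8760 / 24))
    done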

    Periodically, I check the drive status using smartctl; its companion daemon, smartd, runs in the background, basically keeping an eye on drive "health". I do this a few times a year. This week, I was a little startled to see the two 2TB drives are no longer as "young and spry" as they used to be (me neither). In the output from sudo smartctl -a on the oldest drive, I got this:

    Code:
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Conveyance offline  Completed: read failure       90%     54342         1652502969
    # 2  Extended offline    Completed: read failure       90%     54340         326819317
    # 3  Conveyance offline  Completed: read failure       90%     54175         326821768
    # 4  Extended offline    Completed: read failure       90%     54173         1652502969
    # 5  Conveyance offline  Completed: read failure       90%     54007         1652503001
    # 6  Extended offline    Completed: read failure       90%     54005         326821768
    # 7  Conveyance offline  Completed: read failure       90%     53839         326821768
    # 8  Extended offline    Completed: read failure       90%     53837         326819317
    # 9  Conveyance offline  Completed: read failure       90%     53671         1652502969
    #10  Extended offline    Completed: read failure       90%     53669         326819317
    #11  Conveyance offline  Completed: read failure       90%     53503         326821768
    #12  Extended offline    Completed: read failure       90%     53501         326819317
    #13  Conveyance offline  Completed: read failure       90%     53337         1652502960
    #14  Extended offline    Completed: read failure       90%     53335         326819317
    #15  Conveyance offline  Completed: read failure       90%     53169         326821768
    #16  Extended offline    Completed: read failure       90%     53167         326821768
    #17  Conveyance offline  Completed: read failure       90%     53070         326819317
    #18  Extended offline    Completed: read failure       90%     53068         1652502960
    #19  Conveyance offline  Completed: read failure       90%     52903         1652502960
    #20  Extended offline    Completed: read failure       90%     52901         326821768
    #21  Conveyance offline  Completed: read failure       90%     52735         326819317
    and a couple other not-so-nice indicators:
    Code:
    197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       6
    200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1
    but these were OK:
    Code:
    198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
    The second 2TB drive had only one poor indication - a single failing LBA showing up in its self-test log:
    Code:
    #16  Conveyance offline  Completed: read failure       90%     35957         5445952
    #17  Extended offline    Completed: read failure       90%     35955         5445952
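    If you've never run these yourself: the health summary, the self-tests, and the log above all come out of smartctl, and smartd can be told to run the tests on a schedule. A rough sketch (the device name and the schedule are just examples):
    Code:
    # One-off checks (substitute your own device):
    sudo smartctl -H /dev/sdd          # overall health verdict
    sudo smartctl -t short /dev/sdd    # queue a short self-test
    sudo smartctl -t long /dev/sdd     # queue an extended self-test
    sudo smartctl -a /dev/sdd          # attributes plus the self-test log shown above

    # /etc/smartd.conf - example: short test daily at 02:00, long test
    # Saturdays at 03:00, e-mail root when something looks wrong:
    DEVICESCAN -a -s (S/../.././02|L/../../6/03) -m root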
    Clearly, it's time to retire the eldest of the drives, and maybe its twin, before an actual failure occurs. So yesterday, I made full backups of all the data residing on the 2x2TB RAID1 file system. The next step will be to remove the oldest drive from the RAID by re-configuring the RAID into a single-drive file system (NOTE: All my file systems are BTRFS, so I can do ALL of this while still using the server and without powering down or rebooting. Love me some BTRFS!). Then I will remove the old drive and insert a new replacement.
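
    For anyone wanting to follow along, the online conversion and removal amounts to something like this (the mount point is a placeholder, and leaving metadata as DUP is just one reasonable choice):
    Code:
    # Convert data from RAID1 to single (and metadata to DUP) while mounted:
    sudo btrfs balance start -dconvert=single -mconvert=dup /mnt/data2tb

    # Then drop the old drive out of the file system:
    sudo btrfs device remove /dev/sdd /mnt/data2tb

    # Sanity checks:
    sudo btrfs balance status /mnt/data2tb
    sudo btrfs filesystem usage /mnt/data2tb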

    Since I will be buying a new drive, I went shopping. I prefer Western Digital drives and have had excellent results with them. The 2TB drives were purchased before WD released its "Red" drives designed for NAS and server systems. They are both "Black" drives - the performance version. They were not Enterprise-class drives, but I upgraded their firmware years ago to the enterprise level. They had a 3-year warranty and both have outlived it in both calendar years and power-on hours.

    After some research, I will be replacing the two 2TB drives with a single 10TB WD "Red Pro" drive. It has a couple of advantages over its smaller relatives, plus a longer warranty, that make it worth the extra cost.

    By going to a single drive this large, I will also be moving all the data to this single drive and re-configuring the 2X6TB array into stand-alone drives, removing RAID1 altogether. Instead of RAID, I will do automated backups using the two 6TB drives as backup storage. Since I keep all my data in 15 separate btrfs subvolumes, backups are easy to automate AND, if the drive fails, a simple change to the fstab mount will return the system to service. Mounting RAID1 in degraded mode and removing the bad drive is more work than a simple re-mount. Since this is not a work-critical environment, that seems sufficient to me.
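
    To give a flavor of what those automated backups will look like, here's a minimal sketch for one subvolume (the mount points and subvolume name are placeholders; a full send is shown, and incremental sends would add -p with an earlier snapshot pair):
    Code:
    #!/bin/bash
    # Hypothetical example, run as root: snapshot one subvolume on the
    # new 10TB drive and send it to one of the 6TB backup drives.
    SRC=/mnt/data10tb      # source file system (placeholder)
    DST=/mnt/backup6tb     # backup file system (placeholder)
    SUB=documents          # one of the 15 subvolumes (placeholder)
    STAMP=$(date +%Y-%m-%d)

    btrfs subvolume snapshot -r "$SRC/$SUB" "$SRC/${SUB}_$STAMP"
    btrfs send "$SRC/${SUB}_$STAMP" | btrfs receive "$DST"
    The fstab fall-back is equally simple: point the mount at the backup file system's UUID and remount.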
    Last edited by oshunluvr; Apr 22, 2018, 08:37 AM.

    Please Read Me

    #2
    Excellent post!
    I took the same route (RAID1 to SINGLE) for the same reasons with two 750GB drives, and added a 3rd 750GB as a second backup target for snapshots. And, as you said, I never powered down Neon, which is running on them. Why, oh why do people continue to use EXT4 when they have this power waiting for them?
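
    A quick way to convince yourself (or a skeptical EXT4 user) that the conversion really happened live - assuming / is on the btrfs pool:
    Code:
    sudo btrfs filesystem df /    # shows the current Data/Metadata profiles (single, DUP, RAID1, ...)
    uptime                        # proof the box never went down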
    "A nation that is afraid to let its people judge the truth and falsehood in an open market is a nation that is afraid of its people.”
    – John F. Kennedy, February 26, 1962.



      #3
      Originally posted by GreyGeek
      Excellent post!
      I took the same route (RAID1 to SINGLE) for the same reasons with two 750GB drives, and added a 3rd 750GB as a second backup target for snapshots. And, as you said, I never powered down Neon, which is running on them. Why, oh why do people continue to use EXT4 when they have this power waiting for them?
      Thanks.

      In answer to your question: I believe human nature dictates that familiarity equals safety. I suspect this is deeply ingrained from the caveman days. In short: people don't like change.

      Please Read Me



        #4

        As Alexander Pope said,
        "Be not the first by which the new is tried, or the last by which the old is cast aside."

        We've got the arrows in our backs. It's safe for them to cast aside EXT4 now.
        Last edited by GreyGeek; Apr 26, 2018, 06:05 AM.
        "A nation that is afraid to let its people judge the truth and falsehood in an open market is a nation that is afraid of its people.”
        – John F. Kennedy, February 26, 1962.

