(re)scanning the hard drive surface on linux


Profile TJM
Project administrator
Project developer
Project scientist
Joined: 25 Aug 07
Posts: 843
Credit: 267,994,998
RAC: 0
Message 1360 - Posted: 16 Nov 2009, 14:30:51 UTC
Last modified: 16 Nov 2009, 14:33:45 UTC

I guess there are some Linux power users browsing the forums from time to time, so I'll try asking here :-)

One of the server's hard drives suddenly got a few blocks marked as damaged today:



I'm not sure if it's really an HDD surface problem; I've seen errors like this once before, and they were caused by a faulty PSU. Since this server's PSU isn't very good, I suspect it may be happening again. Also, syslog says the drive went offline for a moment just before these errors popped up.

Before replacing the drive I'd like to verify that it's really damaged. What's the best tool to rescan the surface (and the sectors marked as damaged)? Is there a Linux tool I can trust, or should I just download and run the drive manufacturer's tools?

I have already backed up everything and scanned the drive with badblocks (a read-only scan). The drive seems to be in good condition - no weird noises during the scan or file operations, no spin-up problems, nothing suspicious - just these few blocks marked as bad.
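For reference, a read-only scan of that kind could be scripted roughly like this. This is only a sketch: the device and partition names are placeholders, and the script just prints the commands unless RUN=1, since they need a real (and, for e2fsck, unmounted) disk:

```shell
#!/bin/sh
# Non-destructive surface scan sketch. DEV/PART are assumptions --
# substitute your own device. Commands are printed, not executed,
# unless RUN=1, because they need a real disk.
DEV="${DEV:-/dev/sda}"
PART="${PART:-/dev/sda1}"

run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi
}

run badblocks -sv -o /tmp/badsectors.txt "$DEV"  # read-only scan, save the list
run e2fsck -f -l /tmp/badsectors.txt "$PART"     # mark those blocks bad in the fs
```

Feeding the badblocks list to e2fsck with -l makes the filesystem stop allocating those blocks, which is useful if you decide to keep the drive in service.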
M4 Project homepage
M4 Project wiki
quel

Joined: 19 May 09
Posts: 34
Credit: 32,923,471
RAC: 0
Message 1361 - Posted: 16 Nov 2009, 17:37:43 UTC - in response to Message 1360.  

Well, some bad sectors over the life of the drive are normal.

In some cases the sector remapping is automatic, and in other cases it isn't - as you noted, the drive went offline.

If you've already done a full badblocks scan, there isn't much new to learn from the manufacturer's tools. Make sure you do a forced full fsck after the badblocks scan if you haven't already.

If you haven't yet, install the smartmontools package and run `smartctl --all /dev/sda`. If SMART doesn't warn you about imminent drive death and the reallocated sector count isn't near its failure threshold, you're probably fine.
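To check the attributes mentioned above without eyeballing the whole report, something like this could work - a sketch that parses a captured report rather than calling smartctl directly (which needs root and a real disk); the two sample attribute lines below are made up:

```shell
#!/bin/sh
# Pull the reallocation-related attributes out of smartctl output.
# In real use you would pipe in `smartctl --all /dev/sda`; here two
# made-up sample lines keep the script self-contained.
report='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1'

# the raw value is the last whitespace-separated field
realloc=$(printf '%s\n' "$report" | awk '/Reallocated_Sector_Ct/ {print $NF}')
pending=$(printf '%s\n' "$report" | awk '/Current_Pending_Sector/ {print $NF}')

echo "reallocated=$realloc pending=$pending"   # prints: reallocated=0 pending=1
```

A nonzero Current_Pending_Sector with a zero Reallocated_Sector_Ct - exactly the pattern later in this thread - usually means the drive has found a sector it can't read but hasn't yet been able to remap it.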
Profile TJM
Project administrator
Project developer
Project scientist
Joined: 25 Aug 07
Posts: 843
Credit: 267,994,998
RAC: 0
Message 1362 - Posted: 16 Nov 2009, 19:17:52 UTC

SMART didn't show any errors until I ran the self-tests. Both the short and extended tests (smartctl -t short / smartctl -t long) stopped after a few seconds with the same error:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline   Completed: read failure       90%     16934         429195333
# 2  Short offline       Completed: read failure       90%     16934         429195333


I thought it might be a 'soft' error caused by a power failure, but it looks like this time it's genuine surface damage.
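Incidentally, the failing LBA from that log can be turned into a byte offset with simple arithmetic. This sketch assumes the usual 512-byte logical sectors on a drive of this generation, and a 4096-byte filesystem block size for the second number - both are assumptions, so check yours with fdisk and tune2fs:

```shell
#!/bin/sh
# Convert the failing LBA from the self-test log into a byte offset
# and a filesystem block number. 512-byte sectors and a 4 KiB fs
# block size are assumed, not taken from the thread.
LBA=429195333
SECTOR=512
FSBLOCK_SIZE=4096

OFFSET=$((LBA * SECTOR))            # byte offset on the raw device
FSBLOCK=$((OFFSET / FSBLOCK_SIZE))  # block number if the fs starts at offset 0

echo "byte offset: $OFFSET, fs block: $FSBLOCK"
```

With the block number in hand (adjusted for the partition's start offset), debugfs's icheck/ncheck commands can in principle identify which file, if any, occupies the bad block.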

quel

Joined: 19 May 09
Posts: 34
Credit: 32,923,471
RAC: 0
Message 1363 - Posted: 16 Nov 2009, 19:39:24 UTC - in response to Message 1362.  

Yes, if it fails even the short test then it's time to get a new drive.
quel

Joined: 19 May 09
Posts: 34
Credit: 32,923,471
RAC: 0
Message 1364 - Posted: 16 Nov 2009, 19:42:44 UTC - in response to Message 1362.  

Also, bigger drives seem to get a lot less testing at the factory. I wrote this up recently: http://insomnia.quelrod.net/docs/new_drive_testing.txt

You'd be amazed how many 1.0, 1.5, and 2.0 TB drives from any vendor don't even pass that simple test fresh out of the retail box (not OEM). There are quite a few barely-tested 750 GB drives out there, too. A full read/write test on a 1.5 TB drive can take a good 10 hours. (I'm a *nix admin by day and have a few hundred HDDs currently spinning.)
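The kind of burn-in described in the linked write-up boils down to something like the sketch below. Note that badblocks -w destroys all data on the drive, so /dev/sdX is deliberately a placeholder and the script only prints the commands unless RUN=1:

```shell
#!/bin/sh
# Burn-in sketch for a brand-new, EMPTY drive, in the spirit of the
# linked write-up. badblocks -w is DESTRUCTIVE. /dev/sdX is a
# placeholder; commands are only printed unless RUN=1.
DEV="${DEV:-/dev/sdX}"

maybe() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi
}

maybe badblocks -wsv "$DEV"        # four-pattern write+verify of every sector
maybe smartctl -t long "$DEV"      # follow with a long SMART self-test
maybe smartctl -l selftest "$DEV"  # inspect the result once it finishes
```

Running the write pass first and the SMART self-test second gives the firmware a chance to remap any weak sectors before the read test looks for them.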
Profile TJM
Project administrator
Project developer
Project scientist
Joined: 25 Aug 07
Posts: 843
Credit: 267,994,998
RAC: 0
Message 1365 - Posted: 16 Nov 2009, 20:07:53 UTC
Last modified: 16 Nov 2009, 20:11:12 UTC

Ravager:/tmp# smartctl -a /dev/sda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10 family
Device Model:     ST3320620AS
Serial Number:    [EDITED]
Firmware Version: 3.AAC
User Capacity:    320,072,933,376 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Nov 16 20:50:25 2009 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 115) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   086   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   091   090   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       608
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   079   051   030    Pre-fail  Always       -       90130145
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       16937
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       716
187 Reported_Uncorrect      0x0032   039   039   000    Old_age   Always       -       61
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   058   047   045    Old_age   Always       -       42 (Lifetime Min/Max 41/42)
194 Temperature_Celsius     0x0022   042   053   000    Old_age   Always       -       42 (0 16 0 0)
195 Hardware_ECC_Recovered  0x001a   101   060   000    Old_age   Always       -       1773457
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 62 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 62 occurred at disk power-on lifetime: 16930 hours (705 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 45 00 95 e0  Error: UNC at LBA = 0x00950045 = 9764933

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 3f 00 95 e0 00      00:21:35.564  READ DMA EXT
  ec 00 00 45 00 95 a0 00      00:21:33.663  IDENTIFY DEVICE
  25 00 08 3f 00 95 e0 00      00:21:33.662  READ DMA EXT
  ec 00 00 45 00 95 a0 00      00:21:31.761  IDENTIFY DEVICE
  25 00 08 3f 00 95 e0 00      00:21:31.760  READ DMA EXT

Error 61 occurred at disk power-on lifetime: 16930 hours (705 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 45 00 95 e0  Error: UNC at LBA = 0x00950045 = 9764933

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 3f 00 95 e0 00      00:21:24.092  READ DMA EXT
  ec 00 00 45 00 95 a0 00      00:21:33.663  IDENTIFY DEVICE
  25 00 08 3f 00 95 e0 00      00:21:33.662  READ DMA EXT
  ec 00 00 45 00 95 a0 00      00:21:31.761  IDENTIFY DEVICE
  25 00 08 3f 00 95 e0 00      00:21:31.760  READ DMA EXT

Error 60 occurred at disk power-on lifetime: 16930 hours (705 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 45 00 95 e0  Error: UNC at LBA = 0x00950045 = 9764933

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 3f 00 95 e0 00      00:21:24.092  READ DMA EXT
  ec 00 00 45 00 95 a0 00      00:21:22.191  IDENTIFY DEVICE
  25 00 08 3f 00 95 e0 00      00:21:22.190  READ DMA EXT
  ec 00 00 45 00 95 a0 00      00:21:31.761  IDENTIFY DEVICE
  25 00 08 3f 00 95 e0 00      00:21:31.760  READ DMA EXT

Error 59 occurred at disk power-on lifetime: 16930 hours (705 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 45 00 95 e0  Error: UNC at LBA = 0x00950045 = 9764933

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 3f 00 95 e0 00      00:21:24.092  READ DMA EXT
  ec 00 00 45 00 95 a0 00      00:21:22.191  IDENTIFY DEVICE
  25 00 08 3f 00 95 e0 00      00:21:22.190  READ DMA EXT
  ec 00 00 45 00 95 a0 00      00:21:20.289  IDENTIFY DEVICE
  25 00 08 3f 00 95 e0 00      00:21:20.288  READ DMA EXT

Error 58 occurred at disk power-on lifetime: 16930 hours (705 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 45 00 95 e0  Error: UNC at LBA = 0x00950045 = 9764933

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 3f 00 95 e0 00      00:21:24.092  READ DMA EXT
  ec 00 00 45 00 95 a0 00      00:21:22.191  IDENTIFY DEVICE
  25 00 08 3f 00 95 e0 00      00:21:22.190  READ DMA EXT
  25 00 08 07 c1 97 e0 00      00:21:20.289  READ DMA EXT
  ca 00 20 e7 54 00 e0 00      00:21:20.288  WRITE DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Selective offline   Completed: read failure       90%     16934         429195333
# 2  Selective offline   Completed: read failure       90%     16934         429195333
# 3  Selective offline   Completed: read failure       90%     16934         429195333
# 4  Short offline       Completed: read failure       90%     16934         429195333
# 5  Short offline       Completed: read failure       90%     16934         429195333

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0  9865000  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


At least this one worked for almost two years before failing. It's quite interesting that SMART shows 700+ power cycles. AFAIR this drive has been in the server since the day I bought it (I may have tested it in another machine for a day or two before installing it), so I see no reason why the power cycle count should be so high - unless there was a hardware problem, maybe PSU-related, that I didn't notice. I expected 50-60 at most.

I hope the guy who sold me the drive doesn't read this, because I think it's still under warranty, and he might not like the fact that it ran almost 24/7/365 in a quite heavily loaded server |-)
quel

Joined: 19 May 09
Posts: 34
Credit: 32,923,471
RAC: 0
Message 1366 - Posted: 16 Nov 2009, 20:32:18 UTC - in response to Message 1365.  

Heh. Well, I've RMAed many Seagate drives that came with a 5 year warranty. They make the process quite easy. You just need the model and serial number to check the warranty status. No need for receipts or any other fuss.

Well, read the MTBF ratings the manufacturers put on the drives and then ponder how to reconcile those numbers with reality ;)
Profile doublechaz

Joined: 5 Mar 09
Posts: 27
Credit: 1,517,764
RAC: 0
Message 1370 - Posted: 17 Nov 2009, 3:49:14 UTC

I was having trouble with SATA drives dropping out of my array for a while. It turned out to be the PSU. I knew it wasn't the drives, as I had spares I could swap in, and the drives always tested perfectly once out of that server.

So, if you suspect the PSU I would say get a new one in there with plenty of headroom.

Are you running some non-zero RAID level on the server in question? I hope so. ;)

Profile TJM
Project administrator
Project developer
Project scientist
Joined: 25 Aug 07
Posts: 843
Credit: 267,994,998
RAC: 0
Message 1372 - Posted: 17 Nov 2009, 10:23:37 UTC

Nope, no RAID of any kind here. Just multiple single HDDs with the database tables spread between them; that way performance is better than with a cheap RAID. Each large, frequently accessed table sits on its own physical drive, and each drive also holds a few less-used tables. For realtime backup I use a replication slave with a single large drive.
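The one-table-per-drive layout described above can be sketched with symlinks, assuming a storage engine like MyISAM where each table lives in its own files. The paths and table name below are made-up stand-ins (temp dirs instead of a real datadir and a second drive's mount point), and a live database should be stopped before moving its files:

```shell
#!/bin/sh
# Sketch of one-table-per-drive via symlinks. Temp dirs stand in for
# /var/lib/mysql/<db> and a mount point on another physical disk;
# the table/file names are hypothetical.
datadir=$(mktemp -d)   # stand-in for the database's data directory
drive2=$(mktemp -d)    # stand-in for a second drive's mount point

touch "$drive2/big_table.MYD"                            # table data on drive 2
ln -s "$drive2/big_table.MYD" "$datadir/big_table.MYD"   # datadir sees a symlink

readlink "$datadir/big_table.MYD"
```

The database keeps opening the file through its usual path while the actual I/O lands on the other spindle, which is the effect being described.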

Profile TJM
Project administrator
Project developer
Project scientist
Joined: 25 Aug 07
Posts: 843
Credit: 267,994,998
RAC: 0
Message 1373 - Posted: 18 Nov 2009, 0:40:51 UTC - in response to Message 1372.  

It took a few hours longer than I expected to fix everything. The database had one file completely damaged; I replaced it with a copy from backup, but every time I started the DB server, the table was marked as read-only. I was quite surprised to find that a simple `DROP TABLE` fixed the problem, so I could recreate the table structure (the data was not important).
Everything is up and running, but there won't be new work until tomorrow.


Copyright © 2024 TJM