ZFS errors vs smartctl: does the disk need to be replaced?

Hello,

I have ZFS running on top of LUKS encryption. All of a sudden, ZFS started showing errors:

$ zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using zpool clear or replace the device with zpool replace.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 10:13:00 with 0 errors on Fri Nov 17 01:43:01 2023
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            disk1.eli  ONLINE       0     0     0
            disk2.eli  ONLINE       0   184     4

errors: No known data errors

The errors also show up in dmesg:

[76443.247801] zio pool=tank vdev=/dev/mapper/disk1.eli error=5 type=2 offset=2103171743744 size=53248 flags=1074267264
[80861.452444] zio pool=tank vdev=/dev/mapper/disk1.eli error=5 type=2 offset=1841919950848 size=40960 flags=1074267264
[82374.040979] zio pool=tank vdev=/dev/mapper/disk1.eli error=5 type=2 offset=7028176924672 size=4096 flags=1572992
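
For anyone decoding these zio lines, here is a minimal Python sketch. It assumes the Linux errno numbering (where 5 is EIO) and the OpenZFS zio type numbering (1 = read, 2 = write); the helper name parse_zio_line and the ZIO_TYPES table are made up for illustration and are not part of any ZFS tool.

```python
import errno
import re

# Assumed OpenZFS zio type numbering; only the first few values are listed.
ZIO_TYPES = {0: "null", 1: "read", 2: "write", 3: "free", 4: "claim"}

# Matches the "zio pool=... vdev=... error=... type=..." lines shown above.
ZIO_RE = re.compile(
    r"zio pool=(?P<pool>\S+) vdev=(?P<vdev>\S+) error=(?P<error>\d+) "
    r"type=(?P<type>\d+) offset=(?P<offset>\d+) size=(?P<size>\d+)"
)


def parse_zio_line(line):
    """Hypothetical helper: turn one dmesg zio line into a dict, or None."""
    m = ZIO_RE.search(line)
    if m is None:
        return None
    err = int(m["error"])
    return {
        "pool": m["pool"],
        "vdev": m["vdev"],
        # Map the numeric errno to its symbolic name, e.g. 5 -> "EIO".
        "errno": errno.errorcode.get(err, str(err)),
        "io_type": ZIO_TYPES.get(int(m["type"]), "unknown"),
        "offset": int(m["offset"]),
        "size": int(m["size"]),
    }


line = (
    "[76443.247801] zio pool=tank vdev=/dev/mapper/disk1.eli error=5 "
    "type=2 offset=2103171743744 size=53248 flags=1074267264"
)
print(parse_zio_line(line))
```

Under those assumptions, the three lines above would all be write requests that came back with a generic I/O error, which fits the non-zero WRITE column in zpool status.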

However, the SMART 'long' test comes back OK, and there don't seem to be any sector errors:

# smartctl -a /dev/sdc
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.63-1-lts] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD80EFZX-68UW8N0
Firmware Version: 83.H0A83
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Nov 29 21:30:52 2023 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  101) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1100) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   132   132   054    Pre-fail  Offline      -       112
  3 Spin_Up_Time            0x0007   155   155   024    Pre-fail  Always       -       473 (Average 376)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       632
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   093   093   000    Old_age   Always       -       49433
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       609
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   007   007   000    Old_age   Always       -       112187
193 Load_Cycle_Count        0x0012   007   007   000    Old_age   Always       -       112187
194 Temperature_Celsius     0x0002   171   171   000    Old_age   Always       -       35 (Min/Max 21/49)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     49409         -
# 2  Extended offline    Completed without error       00%     49406         -
# 3  Extended offline    Completed without error       00%     49405         -
# 4  Short offline       Completed without error       00%     49405         -
# 5  Extended offline    Completed without error       00%     41597         -
# 6  Extended offline    Interrupted (host reset)      20%     41579         -
# 7  Short offline       Completed without error       00%     21843         -
# 8  Short offline       Completed without error       00%     10193         -
# 9  Extended offline    Completed without error       00%        26         -
#10  Extended offline    Aborted by host               40%         9         -
#11  Short offline       Completed without error       00%         0         -

Does this disk need to be replaced?

  • erm_what_@alien.topB

    I would expect it to be a cable issue. I had long aftermarket power cables on a consumer (Corsair) PSU, which caused a voltage drop of about 1V, and that was enough to cause errors.

    Dodgy SATA/SAS cables can also cause it.

    An overheating SAS controller can also do it. I got the same errors again when I didn’t have a fan on my LSI card.
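
    A quick way to cross-check the SMART output against these theories is to look at the raw values of a few attributes. This is a rough Python sketch, assuming the attribute table layout that smartctl -a printed in the question. The choice of attributes to watch (5, 197 and 198 for media problems, 199 for link CRC errors) is my assumption, and the helper name nonzero_watched is made up.

```python
# Assumed split: IDs 5/197/198 point at the platters, 199 at the SATA link.
WATCHED = {5: "media", 197: "media", 198: "media", 199: "link"}


def nonzero_watched(smart_text):
    """Hypothetical helper: return {attribute_name: raw_value} for watched
    attributes whose raw value is greater than zero."""
    hits = {}
    for row in smart_text.splitlines():
        fields = row.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        raw = fields[9]
        if attr_id in WATCHED and raw.isdigit() and int(raw) > 0:
            hits[fields[1]] = int(raw)
    return hits


# Rows copied from the smartctl output in the question.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
"""
print(nonzero_watched(sample))  # prints {}
```

    An empty result means the drive itself is not reporting media or link errors. That is consistent with the problem sitting somewhere outside the platters, but it does not confirm a cable, power, or controller fault on its own.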