gmirror slice insertion, "FAILURE - READ_DMA status=51<READY, DSC, ERROR>"


Carl Voth
I'm setting up a dual-disk server and am trying to bring it up with
gmirror and gjournal. One slice per disk, the goal being to create a
single mirror from said slices with some of the partitions journaled.
I installed FreeBSD 7.0-RELEASE to ad4, then used the technique from here to
create a single-disk mirror/gm0 on ad6:

   http://people.freebsd.org/~rse/mirror/

Modified ad4s1a /boot.config to pass control to boot stage 3 on ad6. So
far, so good. Began Ralf's procedure for inserting ad4s1 into
mirror/gm0. The synchronization began and reached 6% when this little
horror appeared:

ad6: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=134802751
GEOM_MIRROR: Request failed (error=5). ad6s1[READ(offset=69018976256, length=131072)]
GEOM_MIRROR: Synchronization request failed (error=5). mirror/gm0[READ(offset=69018976256, length=131072)]

After that, nothing. System unresponsive. Perhaps needless to say, the
system also becomes unbootable because the whole point here was to nuke
ad4 as part of inserting it into mirror/gm0.
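
For reference, the byte offset in the GEOM_MIRROR lines can be related to the LBA in the ad6 line with a little arithmetic. This is only a sketch: it assumes 512-byte sectors and that ad6s1 begins at sector 63 (the conventional MBR slice start, which is not stated anywhere in this thread). The function names are made up for illustration, and the progress figure uses the whole-disk capacity reported by smartctl as a stand-in for the slice size.

```python
SECTOR_SIZE = 512                 # bytes per sector (assumed)
SLICE_START_LBA = 63              # ASSUMPTION: ad6s1 starts at the conventional MBR offset
DISK_BYTES = 1_000_204_886_016    # user capacity of the ST31000340AS per smartctl


def request_lba_range(offset_bytes, length_bytes, slice_start=SLICE_START_LBA):
    """Return (first_lba, last_lba) on the raw disk for a read inside the slice."""
    first = slice_start + offset_bytes // SECTOR_SIZE
    sectors = length_bytes // SECTOR_SIZE
    return first, first + sectors - 1


def percent_synced(offset_bytes, total_bytes=DISK_BYTES):
    """Rough progress; whole-disk size is used in place of the slice size."""
    return int(100 * offset_bytes / total_bytes)


if __name__ == "__main__":
    first, last = request_lba_range(69018976256, 131072)
    print(first, last)                    # 134802751 134803006
    print(percent_synced(69018976256))    # 6
```

Under those assumptions the 128 KiB read starts at exactly the LBA the kernel printed, and the offset is consistent with the synchronization stopping at 6%.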

I reinstalled FB7 to ad4, redid the /boot.config modification to make
ad6/gm0 bootable again and retried the insertion of ad4 into gm0. Exact
same error messages at exactly the same point with same consequences.
Now, I see that other folks are having unexplained DMA problems too,
albeit in different contexts. What should I be concluding here? Those
other folks don't seem to be concluding it's bad drives. If there were
bad sectors, I'd get different error messages, yes?

FWIW, I'm using gjournal on 3 partitions in mirror/gm0.

Here's my server's parts list:
- Intel S3210SHLC Motherboard.
- Kingston KVR800D2E5/2GI 2GB DRAM (x2).
- Intel BX80570E3110 Dual-Core Xeon E3110, 3 GHz 6MB L2 Cache, LGA775.
- Seagate ST31000340AS Barracuda 7200.11, 1TB, SATA (x2).
- LG GH20NS10 Internal Super-Multi SecurDisc 20X SATA DVD Rewriter.
- Antec 4U22EPS650XR case (NeoPower 650W PSU).

Carl                                             / K0802647
_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "[hidden email]"

Re: gmirror slice insertion, "FAILURE - READ_DMA status=51<READY, DSC, ERROR>"

Jeremy Chadwick-3
On Mon, Oct 27, 2008 at 06:56:24PM -0700, Carl Voth wrote:

> I'm setting up a dual-disk server and am trying to bring it up with  
> gmirror and gjournal. One slice per disk, the goal being to create a  
> single mirror from said slices with some of the partitions journaled.  
> Installed FreeBSD-7.0RELEASE to ad4, then used technique from here to  
> create single-disk mirror/gm0 on ad6:
>
>   http://people.freebsd.org/~rse/mirror/
>
> Modified ad4s1a /boot.config to pass control to boot stage 3 on ad6. So  
> far, so good. Began Ralf's procedure for inserting ad4s1 into  
> mirror/gm0. The synchronization began and reached 6% when this little  
> horror appeared:
>
> ad6: FAILURE - READ_DMA status=51<READY,DSC,ERROR>  
> error=40<UNCORRECTABLE> LBA=134802751

Are you sure you don't have a bad hard disk?  This looks to be like a
classic block/sector failure.  This does not appear to be the infamous
"DMA timeout" problem, especially if this is the only error
you're getting.

> I reinstalled FB7 to ad4, redid the /boot.config modification to make  
> ad6/gm0 bootable again and retried the insertion of ad4 into gm0. Exact  
> same error messages at exactly the same point with same consequences.  

So you're saying that the *exact* same READ_DMA error, at the *exact*
same LBA, is reported on ad4?  If so, that's very bizarre.

> Now, I see that other folks are having unexplained DMA problems too,  
> albeit in different contexts. What should I be concluding here? Those  
> other folks don't seem to be concluding it's bad drives. If there were  
> bad sectors, I'd get different error messages, yes?

The "error=40<UNCORRECTABLE>" part of what you're seeing indicates that
the drive reported an uncorrectable read error.  What other people see
are DMA timeouts, with no actual sign of uncorrectable errors.
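
As an aside, the two hex values in the kernel message are bit fields, and the names in angle brackets are simply the bits that are set. The sketch below decodes them. READY, DSC, ERROR and UNCORRECTABLE are the labels printed in this thread; the labels for the other bits are conventional ATA names filled in as an assumption, and the helper names are hypothetical.

```python
# (mask, label) pairs, most significant bit first.
STATUS_BITS = [
    (0x80, "BUSY"),
    (0x40, "READY"),
    (0x20, "DEVICE_FAULT"),            # label assumed
    (0x10, "DSC"),
    (0x08, "DRQ"),                     # label assumed
    (0x04, "CORRECTED"),               # label assumed
    (0x02, "INDEX"),                   # label assumed
    (0x01, "ERROR"),
]
ERROR_BITS = [
    (0x80, "ICRC"),                    # label assumed
    (0x40, "UNCORRECTABLE"),
    (0x20, "MEDIA_CHANGED"),           # label assumed
    (0x10, "ID_NOT_FOUND"),            # label assumed
    (0x08, "MEDIA_CHANGE_REQUEST"),    # label assumed
    (0x04, "ABORTED"),                 # label assumed
    (0x02, "NO_MEDIA"),                # label assumed
    (0x01, "ADDRESS_MARK_NOT_FOUND"),  # label assumed
]


def decode(value, table):
    """Return the labels of all bits set in value."""
    return [name for mask, name in table if value & mask]


def describe(status_hex, error_hex):
    """Rebuild the status=...<...> error=...<...> part of the kernel message."""
    status = decode(int(status_hex, 16), STATUS_BITS)
    error = decode(int(error_hex, 16), ERROR_BITS)
    return "status=%s<%s> error=%s<%s>" % (
        status_hex, ",".join(status), error_hex, ",".join(error))


if __name__ == "__main__":
    print(describe("51", "40"))
    # status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
```

A DMA timeout would not set the UNCORRECTABLE bit, which is why the two cases can be told apart from this one line.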

The problem with the "DMA timeout" issue is that it manifests itself in
hundreds of different ways.  Each case so far has to be handled on an
individual basis.

> FWIW, I'm using gjournal on 3 partitions in mirror/gm0.
>
> Here's my server's parts list:
> - Seagate ST31000340AS Barracuda 7200.11, 1TB, SATA (x2).

Can you please provide the output from the following commands?

dmesg
vmstat -i
atacontrol list
atacontrol cap ad4
atacontrol cap ad6
smartctl -a /dev/ad4
smartctl -a /dev/ad6

Thanks.

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |


Re: gmirror slice insertion, "FAILURE - READ_DMA status=51<READY, DSC, ERROR>"

Wojciech Puchar-5
In reply to this post by Carl Voth
> so good. Began Ralf's procedure for inserting ad4s1 into mirror/gm0. The
> synchronization began and reached 6% when this little horror appeared:
>
> ad6: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE>
> LBA=134802751
> GEOM_MIRROR: Request failed (error=5). ad6s1[READ(offset=69018976256,
> length=131072)]
> GEOM_MIRROR: Synchronization request failed (error=5).
> mirror/gm0[READ(offset=69018976256, length=131072)]
>

Your disk failed (uncorrectable error).

Assuming you have eliminated other causes such as the drives overheating or a
cabling problem (I don't think so), etc.:


Boot from some kind of live CD, then make another mirror (single disk for now)
on the other drive, then do

dd if=/dev/ad6s1 of=/dev/mirror/newmirror bs=2k conv=noerror,sync

I intentionally used bs=2k instead of something larger, to minimize the amount
of lost data.
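
To illustrate the block-size point, here is a toy simulation, not real dd. It assumes that a read touching a bad sector fails as a whole and that the failed block is written out as zeros, which is how conv=noerror,sync is described above. `make_device` and `copy_noerror_sync` are hypothetical helpers, and the device is just a pattern generator with one bad 512-byte sector.

```python
SECTOR = 512


def make_device(bad_sectors):
    """Return a read function that raises EIO if a block touches a bad sector."""
    def read_block(offset, size):
        first = offset // SECTOR
        last = (offset + size - 1) // SECTOR
        if any(first <= s <= last for s in bad_sectors):
            raise OSError(5, "Input/output error")
        return b"\xaa" * size
    return read_block


def copy_noerror_sync(read_block, total_bytes, bs):
    """Copy in bs-sized blocks; failed blocks become NULs. Returns (image, bytes_lost)."""
    out = bytearray()
    lost = 0
    for offset in range(0, total_bytes, bs):
        size = min(bs, total_bytes - offset)
        try:
            data = read_block(offset, size)
        except OSError:
            data = b""                    # noerror: carry on after the failed read
            lost += size
        out += data.ljust(bs, b"\0")      # sync: pad the block with NULs up to bs
    return bytes(out), lost


if __name__ == "__main__":
    device = make_device({100})           # one bad sector
    for bs in (2048, 131072):
        _, lost = copy_noerror_sync(device, 1 << 20, bs)
        print(bs, lost)                   # 2048 2048, then 131072 131072
```

With one bad sector, the 2k copy gives up 2 KiB of data while a 128k copy would give up the full 128 KiB block around it.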

Then change your system to boot from newmirror, take out /dev/ad6 and have
it replaced under warranty (or buy a new one), put in the new ad6, and insert
it into the mirror.

Re: gmirror slice insertion, "FAILURE - READ_DMA status=51<READY, DSC, ERROR>"

Wojciech Puchar-5
In reply to this post by Jeremy Chadwick-3
>> error=40<UNCORRECTABLE> LBA=134802751
>
> Are you sure you don't have a bad hard disk?  This looks to be like a
> classic block/sector failure.  This does not appear to be the infamous
> "DMA timeout" problem, especially if this is the only error
> you're getting.

he can temporarily boot with hw.ata.ata_dma=0

but i think his drive failed.

>
>> I reinstalled FB7 to ad4, redid the /boot.config modification to make
>> ad6/gm0 bootable again and retried the insertion of ad4 into gm0. Exact
>> same error messages at exactly the same point with same consequences.


so it IS a failed drive!

Re: gmirror slice insertion, "FAILURE - READ_DMA status=51<READY, DSC, ERROR>"

Jeremy Chadwick-3
On Tue, Oct 28, 2008 at 12:04:49PM +0100, Wojciech Puchar wrote:
>>> error=40<UNCORRECTABLE> LBA=134802751
>>
>> Are you sure you don't have a bad hard disk?  This looks to be like a
>> classic block/sector failure.  This does not appear to be the infamous
>> "DMA timeout" problem, especially if this is the only error
>> you're getting.
>
> he can temporarily boot with hw.ata.ata_dma=0

They're SATA disks, so this won't do anything sadly.

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |


Re: gmirror slice insertion, "FAILURE - READ_DMA status=51<READY, DSC, ERROR>"

Carl-65
Jeremy Chadwick said:
>> ad6: FAILURE - READ_DMA status=51<READY,DSC,ERROR>  
>> error=40<UNCORRECTABLE> LBA=134802751
>
> Are you sure you don't have a bad hard disk?  This looks to be like a
> classic block/sector failure.

I hadn't realized that a bad block would manifest itself with a message
about DMA. Seems like such semantics would be a little obscure to most
users, apparently including me.

> So you're saying that the *exact* same READ_DMA error, at the *exact*
> same LBA, is reported on ad4?  If so, that's very bizarre.

No, perhaps I wasn't clear enough. Both instances were on ad6, so far.

> Can you please provide the output from the following commands?

See end of message. Let me know if you then want more (in- or out-of-band).

Having now installed smartmontools, you can see below that I ran it for
both ad4 and ad6. Sure enough, ad6 has logged 2 READ DMA errors - does
that make this a definitive bad disk then?

Should I not be worried about ad4 too? Those Raw_Read_Error_Rate and
Seek_Error_Rate numbers should be zero or very close to it, shouldn't
they? I don't know how to interpret what I'm seeing in that output, so
I'd appreciate any insight. Should I be returning both disks for
warranty claims (they're both very recently purchased)?

Wojciech Puchar said:

> boot from some kind of live CD, then make another mirror (single disk now)
> on other drive, then do
>
> dd if=/dev/ad6s1 of=/dev/mirror/newmirror bs=2k conv=noerror,sync
>
> i intentionally did bs=2k instead of larger, to minimize amount of lost
> data.
>
> then change your system to boot from newmirror, take out /dev/ad6 and have
> it replaced on warranty (or buy new), put new ad6, insert it to the
> mirror.

I think you're describing a method to help me save as much data from ad6
as possible. Fortunately, this is all about constructing a new system,
so there's no data yet to lose.

Is there anything I should know about this model of hard disk with
regards to being known for problems? Also, is there a good test I can
perform to hopefully flush out any problems before I put this thing into
service?

Carl                                             / K0802647

######## Additional Information ########

# vmstat -i
interrupt                          total       rate
irq1: atkbd0                           4          0
irq4: sio0                        125724         16
irq19: uhci3                           5          0
irq21: uhci1+                     478364         63
irq23: uhci2 ehci1                     1          0
cpu0: timer                     14517071       1923
irq256: em0                       109568         14
cpu1: timer                     14514956       1922
Total                           29745693       3940

# atacontrol list | grep -v "no device present"
ATA channel 0:
ATA channel 1:
ATA channel 2:
     Master:  ad4 <ST31000340AS/SD15> Serial ATA II
ATA channel 3:
     Master:  ad6 <ST31000340AS/SD15> Serial ATA II
ATA channel 4:
     Master: acd0 <HL-DT-ST DVDRAM GH20NS10/EL00> Serial ATA v1.0
ATA channel 5:
ATA channel 6:
ATA channel 7:

# atacontrol cap ad4

Protocol              Serial ATA II
device model          ST31000340AS
serial number         xxxxxxxH
firmware revision     SD15
cylinders             16383
heads                 16
sectors/track         63
lba supported         268435455 sectors
lba48 supported       1953525168 sectors
dma supported
overlap not supported

Feature                      Support  Enable    Value           Vendor
write cache                    yes      yes
read ahead                     yes      yes
Native Command Queuing (NCQ)   yes       -      31/0x1F
Tagged Command Queuing (TCQ)   no       no      31/0x1F
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      no       no      65278/0xFEFE
automatic acoustic management  no       no      0/0x00  254/0xFE

# atacontrol cap ad6

Protocol              Serial ATA II
device model          ST31000340AS
serial number         xxxxxxxA
firmware revision     SD15
cylinders             16383
heads                 16
sectors/track         63
lba supported         268435455 sectors
lba48 supported       1953525168 sectors
dma supported
overlap not supported

Feature                      Support  Enable    Value           Vendor
write cache                    yes      yes
read ahead                     yes      yes
Native Command Queuing (NCQ)   yes       -      31/0x1F
Tagged Command Queuing (TCQ)   no       no      31/0x1F
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      no       no      65278/0xFEFE
automatic acoustic management  no       no      0/0x00  254/0xFE

# smartctl -a /dev/ad4
smartctl version 5.38 [i386-portbld-freebsd7.0] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11
Device Model:     ST31000340AS
Serial Number:    xxxxxxxH
Firmware Version: SD15
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Oct 28 18:07:25 2008 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                         was completed without error.
                                         Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                         without error or no self-test has ever
                                         been run.
Total time to complete Offline
data collection:                 ( 650) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                         Auto Offline data collection on/off support.
                                         Suspend Offline collection upon new
                                         command.
                                         Offline surface scan supported.
                                         Self-test supported.
                                         Conveyance Self-test supported.
                                         Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                         power-saving mode.
                                         Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                         General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 230) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103b) SCT Status supported.
                                         SCT Feature Control supported.
                                         SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       158643744
  3 Spin_Up_Time            0x0003   092   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       108
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   064   060   030    Pre-fail  Always       -       2921473
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       499
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       108
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       65540
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   069   045    Old_age   Always       -       29 (Lifetime Min/Max 23/31)
194 Temperature_Celsius     0x0022   029   040   000    Old_age   Always       -       29 (0 20 0 0)
195 Hardware_ECC_Recovered  0x001a   039   019   000    Old_age   Always       -       158643744
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
     1        0        0  Not_testing
     2        0        0  Not_testing
     3        0        0  Not_testing
     4        0        0  Not_testing
     5        0        0  Not_testing
Selective self-test flags (0x0):
   After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

# smartctl -a /dev/ad6
smartctl version 5.38 [i386-portbld-freebsd7.0] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11
Device Model:     ST31000340AS
Serial Number:    xxxxxxxA
Firmware Version: SD15
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Oct 28 18:08:22 2008 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                         was completed without error.
                                         Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                         without error or no self-test has ever
                                         been run.
Total time to complete Offline
data collection:                 ( 642) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                         Auto Offline data collection on/off support.
                                         Suspend Offline collection upon new
                                         command.
                                         Offline surface scan supported.
                                         Self-test supported.
                                         Conveyance Self-test supported.
                                         Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                         power-saving mode.
                                         Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                         General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 227) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103b) SCT Status supported.
                                         SCT Feature Control supported.
                                         SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   100   006    Pre-fail  Always       -       106947042
  3 Spin_Up_Time            0x0003   092   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       108
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       2
  7 Seek_Error_Rate         0x000f   061   060   030    Pre-fail  Always       -       1376532
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       499
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       1
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       108
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   098   098   000    Old_age   Always       -       2
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   069   045    Old_age   Always       -       29 (Lifetime Min/Max 23/31)
194 Temperature_Celsius     0x0022   029   040   000    Old_age   Always       -       29 (0 19 0 0)
195 Hardware_ECC_Recovered  0x001a   038   018   000    Old_age   Always       -       106947042
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 2
         CR = Command Register [HEX]
         FR = Features Register [HEX]
         SC = Sector Count Register [HEX]
         SN = Sector Number Register [HEX]
         CL = Cylinder Low Register [HEX]
         CH = Cylinder High Register [HEX]
         DH = Device/Head Register [HEX]
         DC = Device Command Register [HEX]
         ER = Error register [HEX]
         ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 475 hours (19 days + 19 hours)
   When the command that caused the error occurred, the device was active or idle.

   After command completion occurred, registers were:
   ER ST SC SN CL CH DH
   -- -- -- -- -- -- --
   40 51 00 9d ed 08 08  Error: UNC at LBA = 0x0808ed9d = 134802845

   Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   c8 00 00 3f ed 08 48 00  13d+00:32:54.564  READ DMA
   c8 00 00 3f ec 08 48 00  13d+00:32:54.563  READ DMA
   c8 00 00 3f eb 08 48 00  13d+00:32:54.562  READ DMA
   c8 00 00 3f ea 08 48 00  13d+00:32:54.561  READ DMA
   c8 00 00 3f e9 08 48 00  13d+00:32:54.560  READ DMA

Error 1 occurred at disk power-on lifetime: 474 hours (19 days + 18 hours)
   When the command that caused the error occurred, the device was active or idle.

   After command completion occurred, registers were:
   ER ST SC SN CL CH DH
   -- -- -- -- -- -- --
   40 51 00 9d ed 08 08  Error: UNC at LBA = 0x0808ed9d = 134802845

   Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   c8 00 00 3f e9 08 48 00  12d+23:04:28.359  READ DMA
   c8 00 00 3f 53 06 48 00  12d+23:04:27.202  READ DMA
   c8 00 00 3f 52 06 48 00  12d+23:04:27.193  READ DMA
   c8 00 00 3f 51 06 48 00  12d+23:04:27.191  READ DMA
   c8 00 00 3f 50 06 48 00  12d+23:04:27.191  READ DMA

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
     1        0        0  Not_testing
     2        0        0  Not_testing
     3        0        0  Not_testing
     4        0        0  Not_testing
     5        0        0  Not_testing
Selective self-test flags (0x0):
   After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

######## END ########
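
The "Error: UNC at LBA" figure in the SMART error log for ad6 can be reproduced from the register bytes printed next to it. A minimal sketch, assuming the usual 28-bit LBA register layout (SN, CL, CH hold bits 0-23 and the low nibble of DH holds bits 24-27); the function name is hypothetical.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine the task-file registers into a 28-bit LBA."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


if __name__ == "__main__":
    # Registers after the failed command: SN=9d CL=ed CH=08 DH=08
    failed = lba28_from_registers(0x9D, 0xED, 0x08, 0x08)
    print(hex(failed), failed)      # 0x808ed9d 134802845

    # Most recent READ DMA listed under Error 2: SN=3f CL=ed CH=08 DH=48
    started = lba28_from_registers(0x3F, 0xED, 0x08, 0x48)
    print(started)                  # 134802751
    print(failed - started)         # 94
```

So the most recent READ DMA listed under Error 2 starts at LBA 134802751, the same LBA the kernel reported, and the sector the drive could not read lies 94 sectors into that request.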

Re: gmirror slice insertion, "FAILURE - READ_DMA status=51<READY, DSC, ERROR>"

Jeremy Chadwick-3
On Tue, Oct 28, 2008 at 08:41:31PM -0700, Carl wrote:

> Jeremy Chadwick said:
>>> ad6: FAILURE - READ_DMA status=51<READY,DSC,ERROR>  
>>> error=40<UNCORRECTABLE> LBA=134802751
>>
>> Are you sure you don't have a bad hard disk?  This looks to be like a
>> classic block/sector failure.
>
> I hadn't realized that a bad block would manifest itself with a message  
> about DMA. Seems like such semantics would be a little obscure to most  
> users, apparently including me.

Do not let the term "DMA" confuse you -- the operation was a read
operation, and DMA is used to do the transfer of data between
disk/controller/local memory.  You might see things like "READ_DMA48"
and "WRITE_DMA48", which just indicate that 48-bit LBA addressing mode
is in use when attempting the operation.
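
To put numbers on the 28-bit versus 48-bit distinction: 28-bit commands can only address the first 2^28 sectors, while these 1TB disks report 1953525168 sectors. The sketch below checks whether a request fits in the 28-bit range. It assumes 512-byte sectors, and the exact cut-off the ata(4) driver uses to switch to the 48-bit commands is not shown in this thread, so `fits_lba28` is an illustration rather than the driver's rule.

```python
SECTOR_SIZE = 512
LBA28_SECTORS = 1 << 28            # 28-bit commands address sectors 0 .. 2**28 - 1
DISK_SECTORS = 1953525168          # "lba48 supported" figure from atacontrol cap


def fits_lba28(first_lba, sector_count):
    """True if the whole request lies inside the 28-bit addressable range."""
    return first_lba + sector_count <= LBA28_SECTORS


if __name__ == "__main__":
    print(LBA28_SECTORS * SECTOR_SIZE)                # 137438953472 (128 GiB)
    print(DISK_SECTORS * SECTOR_SIZE)                 # 1000204886016
    print(fits_lba28(134802751, 256))                 # True
    print(fits_lba28(DISK_SECTORS - 256, 256))        # False
```

The failing read sits well inside the first 128 GiB of the disk, which is consistent with the message saying READ_DMA rather than READ_DMA48.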

For sake of comparison, you should see what Linux and Solaris do.  For
example, when a disk falls off the bus (silently) on a Linux machine
using ext3fs, all I've ever seen is continual spewing of "ext3fs journal
errors" on the console -- absolutely no indication that the disk itself
has actually fallen off the bus.  With SCSI disks under Solaris, the
level of detail you get is perfect -- it's very easy to determine what
happened.  But in the case of ATA disks, you get more or less something
that looks similar to FreeBSD.

If you have complaints about the formatting of the output, I would
recommend filing a PR for it, or bringing it up with Soren Schmidt
([hidden email]), author of the ata(4) layer.  I will agree with you
that some more coherent error messages would be useful.

>> So you're saying that the *exact* same READ_DMA error, at the *exact*
>> same LBA, is reported on ad4?  If so, that's very bizarre.
>
> No, perhaps I wasn't clear enough. Both instances were on ad6, so far.

Then that makes ad6, or something specific to ad6, the culprit.

>> Can you please provide the output from the following commands?
>
> See end of message. Let me know if you then want more (in- or out-of-band).
>
> Having now installed smartmontools, you can see below that I ran it for  
> both ad4 and ad6. Sure enough, ad6 has logged 2 READ DMA errors - does  
> that make this a definitive bad disk then?

I'll have to look at the output.  See below.

> Should I not be worried about ad4 too? Those Raw_Read_Error_Rate and  
> Seek_Error_Rate numbers should be zero or very close to it, shouldn't  
> they? I don't know how to interpret what I'm seeing in that output, so  
> I'd appreciate any insight. Should I be returning both disks for  
> warranty claims (they're both very recently purchased)?

As you've admitted, the problem is that most people don't know how to
interpret SMART data, and start "freaking out" over things which are
normal.  People focus on the RAW values, which for many attributes are
the wrong thing to look at.  For example, on Seagate disks, an insanely
high raw Raw_Read_Error_Rate or Seek_Error_Rate means absolutely nothing;
it's normal.  But with another vendor, it might actually be accurate.
Welcome to one of the problems with SMART: the specification does not
state what format the raw data must be in.

Seagate chooses to encode some raw data for some SMART attributes in a
custom format.  The format is not publicly documented.  This is why you
have to go off of the adjusted values shown in VALUE/WORST/THRESH.
"How am I supposed to know all of this?!"  You aren't -- it comes with
experience.
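
As a rough illustration of going by VALUE/WORST/THRESH rather than the raw numbers, the sketch below parses a few attribute rows (copied from the ad6 output earlier in this thread) and reports how far each normalized VALUE sits above its THRESH. It assumes the column layout printed by smartctl 5.38 as shown here; `parse_attributes` and `margin` are hypothetical helper names, and this only restates the comparison, it is not a health verdict.

```python
import re

# A few rows from the ad6 attribute table in this thread.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f   116   100   006    Pre-fail  Always       -       106947042
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       2
  7 Seek_Error_Rate         0x000f   061   060   030    Pre-fail  Always       -       1376532
187 Reported_Uncorrect      0x0032   098   098   000    Old_age   Always       -       2
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
"""

ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>.+)$"
)


def parse_attributes(text):
    """Return one dict per attribute row; non-matching lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            row = m.groupdict()
            for key in ("id", "value", "worst", "thresh"):
                row[key] = int(row[key])
            row["raw"] = row["raw"].strip()
            rows.append(row)
    return rows


def margin(row):
    """Normalized VALUE minus THRESH; zero or less means the attribute has tripped."""
    return row["value"] - row["thresh"]


if __name__ == "__main__":
    for row in parse_attributes(SAMPLE):
        state = "FAILING" if margin(row) <= 0 else "ok"
        print("%3d %-23s margin=%3d %s" % (row["id"], row["name"], margin(row), state))
```

For the rows above, the huge raw Raw_Read_Error_Rate still leaves a margin of 110, and Seek_Error_Rate a margin of 31, so neither is anywhere near its threshold.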

> Is there anything I should know about this model of hard disk with  
> regards to being known for problems? Also, is there a good test I can  
> perform to hopefully flush out any problems before I put this thing into  
> service?

I'm confused: what gives you the impression there's a problem with
*this model* of hard disk?  I've seen no evidence presented that
indicates such.  What makes you ask that question?

None of us here work at Seagate, so even if there was a known problem
with this specific model of disk, we wouldn't know.  For all we know,
there could be little 3mm tall terrorists dancing on the platters, ready
to leap out at any moment and stab us!  :-)

Please keep something in mind: just because you have brand new hard
disks *does not* guarantee they're free of errors.  I have seen hundreds
of "brand new" hard disks fail right out of the box, including SCSI
disks (which people, for some reason, think are "less likely to have
this problem" simply because they cost more money).  I deal with this
situation on a daily basis at work, believe it or not.

> # vmstat -i

Interrupts look fine; I was looking for anything that might indicate an
absurdly high rate.

atacontrol cap output looks fine too, nothing weird or out of the
ordinary (I wasn't expecting anything to show up here, but I did want to
get an idea if the disks were truly SATA300 or not).

Let's take a look at the SMART data.

> # smartctl -a /dev/ad4
>
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE UPDATED      WHEN_FAILED   RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail Always         -       158643744
>   3 Spin_Up_Time            0x0003   092   091   000    Pre-fail Always         -       0
>   4 Start_Stop_Count        0x0032   100   100   020    Old_age  Always         -       108
>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail Always         -       0
>   7 Seek_Error_Rate         0x000f   064   060   030    Pre-fail Always         -       2921473
>   9 Power_On_Hours          0x0032   100   100   000    Old_age  Always         -       499
>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail Always         -       0
>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age  Always         -       108
> 184 Unknown_Attribute       0x0032   100   100   099    Old_age  Always         -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age  Always         -       0
> 188 Unknown_Attribute       0x0032   100   099   000    Old_age  Always         -       65540
> 189 High_Fly_Writes         0x003a   100   100   000    Old_age  Always         -       0
> 190 Airflow_Temperature_Cel 0x0022   071   069   045    Old_age  Always         -       29 (Lifetime Min/Max 23/31)
> 194 Temperature_Celsius     0x0022   029   040   000    Old_age  Always         -       29 (0 20 0 0)
> 195 Hardware_ECC_Recovered  0x001a   039   019   000    Old_age  Always         -       158643744
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age  Always         -       0
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age  Offline        -       0
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age  Always         -       0

All of the attributes here look good.

To get an update on Attribute 198, you'd need to run a short offline
test ("smartctl -t short /dev/ad4").  You can safely do this while
the disk is in use; don't let the word "offline" make you think the
disk disappears.  You can watch the status using smartctl -a, and
once it's finished, you can compare the old value to the new.  I'm
willing to bet it remains zero.

The temperature also looks good (29C).  Additionally, the SMART error
log for this disk looks fine; no signs of errors.  

I would say ad4 is in perfect shape.

> # smartctl -a /dev/ad6
>
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE UPDATED      WHEN_FAILED   RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   116   100   006    Pre-fail Always         -       106947042
>   3 Spin_Up_Time            0x0003   092   091   000    Pre-fail Always         -       0
>   4 Start_Stop_Count        0x0032   100   100   020    Old_age  Always         -       108
>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail Always         -       2
>   7 Seek_Error_Rate         0x000f   061   060   030    Pre-fail Always         -       1376532
>   9 Power_On_Hours          0x0032   100   100   000    Old_age  Always         -       499
>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail Always         -       1
>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age  Always         -       108
> 184 Unknown_Attribute       0x0032   100   100   099    Old_age  Always         -       0
> 187 Reported_Uncorrect      0x0032   098   098   000    Old_age  Always         -       2
> 188 Unknown_Attribute       0x0032   100   100   000    Old_age  Always         -       0
> 189 High_Fly_Writes         0x003a   100   100   000    Old_age  Always         -       0
> 190 Airflow_Temperature_Cel 0x0022   071   069   045    Old_age  Always         -       29 (Lifetime Min/Max 23/31)
> 194 Temperature_Celsius     0x0022   029   040   000    Old_age  Always         -       29 (0 19 0 0)
> 195 Hardware_ECC_Recovered  0x001a   038   018   000    Old_age  Always         -       106947042
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age  Always         -       2
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age  Offline        -       2
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age  Always         -       0

And here we see the core of the problem.  :-)

Attribute 5 shows the disk has reallocated two sectors (meaning, it
detected two sectors were bad, and reallocated them).  This is hard
evidence of bad blocks on the disk.

Attribute 10 indicates that there was one incident of the disk failing
to spin up properly and having to retry the spin-up.  Why/how this
happened is unknown, but at least it's not a huge number.  One incident
is probably nothing to worry about.

I'm not completely sure what Attribute 187 represents, but it very
likely is directly related to Attribute 5.

Attributes 197 and 198 indicate a bigger problem: the two bad sectors
described earlier **have not** been corrected or remapped.  This is
bad.
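
A minimal sketch for pulling those counters out of the attribute table,
assuming the ten-column "smartctl -A" layout quoted above (attribute ID
in column 1, raw value in column 10).  The function name
smart_sector_flags is hypothetical and not part of smartmontools.

```sh
# smart_sector_flags: read "smartctl -A" style attribute rows on stdin and
# print the ID, name, and raw value of the sector-health attributes
# discussed above (5, 10, 187, 197, 198) whose raw count is nonzero.
smart_sector_flags() {
    awk '
        # Column 1 is the attribute ID, column 10 the start of RAW_VALUE.
        ($1 == 5 || $1 == 10 || $1 == 187 || $1 == 197 || $1 == 198) &&
        $10 + 0 > 0 {
            printf "%s %s raw=%s\n", $1, $2, $10
        }
    '
}
```

Typical use would be "smartctl -A /dev/ad6 | smart_sector_flags".  Fed
the ad6 rows quoted above (without the leading "> "), it would list
attributes 5, 10, 187, 197 and 198; for the ad4 rows it would print
nothing.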

I'll explain a bit how SATA disks deal with bad sectors.

First and foremost, straight out of the factory there's a pre-defined
list of physically bad sectors on the disk.  These sectors are never
accessed by the drive, and the manufacturer (Seagate) is the one who
creates that list.  It's 100% normal; SCSI disks have the same thing
(physical defect list vs. grown defect list).

SATA disks also have a certain amount of pre-allocated "spare sectors"
that the disk can use for transparent remapping.  When I say
transparent, I mean the OS never gets told of what's going on behind the
scenes.  Say the drive attempts to write some data, and the firmware on
the drive notices that one of the sectors has a problem.  The drive
will, unknown to the OS, say "okay, let's not use that one, mark it bad,
and instead use one from the spare pool".  But there are only so many
spares...  As far as I know, SMART **does not** log transparent sector
remaps.

When the OS starts seeing errors due to bad sectors, it means the
pre-allocated "spare sector" pool has been exhausted.  SMART also
reflects this condition.

What you see above is a classic example of a hard disk with a growing
number of bad sectors.  There are *definitely* other bad sectors on
the disk which the drive has remapped on its own, but things are
getting worse.

As for the SMART error log -- what you see there is a direct result of
the two bad sectors.  Remember, block != sector, which is why you see
two error entries for the same LBA (there are probably two sectors
next to one another which make up part of the block).
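
The block-versus-sector arithmetic can be checked against the kernel
messages at the top of this thread (offset=69018976256, length=131072
on ad6s1, LBA=134802751 from the drive).  This is only a sketch: it
assumes 512-byte sectors and a slice starting at sector 63, neither of
which is stated in the thread, although the numbers agree with both.
The function names are hypothetical.

```sh
# offset_to_lba BYTE_OFFSET SLICE_START_SECTOR [SECTOR_SIZE]
# Convert a byte offset inside a slice to an absolute sector number (LBA).
offset_to_lba() {
    sector_size=${3:-512}    # assumed 512-byte sectors unless told otherwise
    echo $(( $1 / sector_size + $2 ))
}

# sectors_in_request LENGTH [SECTOR_SIZE]
# Number of sectors covered by one I/O request of LENGTH bytes.
sectors_in_request() {
    sector_size=${2:-512}
    echo $(( $1 / sector_size ))
}
```

"offset_to_lba 69018976256 63" prints 134802751, matching the LBA in
the READ_DMA failure, and "sectors_in_request 131072" prints 256, so a
single failed GEOM request spans many sectors.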

Advice is simple: replace this hard disk.

I highly recommend you do an "advanced replacement" RMA, assuming
Seagate offers it, where the manufacturer sends you a new/refurbished
drive first.  They'll need a credit card number (if you don't ship them
the bad disk within 30 days, they charge you $$).

Hope this helps.

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

_______________________________________________
[hidden email] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "[hidden email]"

Re: gmirror slice insertion, "FAILURE - READ_DMA status=51<READY, DSC, ERROR>"

Carl-65
Jeremy Chadwick wrote:
> Seagate chooses to encode some raw data for some SMART attributes in a
> custom format.  The format is not publicly documented.  This is why you
> have to go off of the adjusted values shown in VALUE/WORST/THRESH.
> "How am I supposed to know all of this?!"  You aren't -- it comes with
> experience.

And yet my failing drive's VALUE numbers are still all above their
THRESH values, despite it being bad enough to cripple the system. One
might argue those threshold values leave something to be desired.

>> Is there anything I should know about this model of hard disk with  
>> regards to being known for problems? Also, is there a good test I can  
>> perform to hopefully flush out any problems before I put this thing into  
>> service?
>
> I'm confused: what gives you the impression there's a problem with
> *this model* of hard disk?  I've seen no evidence presented that
> indicates such.  What makes you ask that question?

I don't have such an impression, thus far. In fact, Seagate drives have
always been good to me prior to this. It's only a precautionary question
because it's better to ask now than after I've committed a lot of real
data and time to it and put it all into service.

> Let's take a look at the SMART data.
>
>> # smartctl -a /dev/ad4
>>
>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE UPDATED      WHEN_FAILED   RAW_VALUE
...
>> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age  Offline        -       0
...
>
> To get an update on Attribute 198, you'd need to run a short offline
> test ("smartctl -t short /dev/ad4").  You can safely do this while
> the disk is in use; don't let the word "offline" make you think the
> disk disappears.  You can watch the status using smartctl -a, and
> once it's finished, you can compare the old value to the new.  I'm
> willing to bet it remains zero.

I ran that test on both drives. ad6 failed immediately at 90% with a
"read failure" - not surprising. ad4 completed without error and no
change in its values, just as you predicted.

>> # smartctl -a /dev/ad6
>>
>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE UPDATED      WHEN_FAILED   RAW_VALUE
...
>>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail Always         -       2
...
>>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail Always         -       1
...
>> 187 Reported_Uncorrect      0x0032   098   098   000    Old_age  Always         -       2
...
>> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age  Always         -       2
>> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age  Offline        -       2
...
>
> And here we see the core of the problem.  :-)

> Advice is simple: replace this hard disk.

> Hope this helps.

It definitely did, Jeremy. Your explanations were most helpful. Thanks!

Carl                                             / K0802647


Re: gmirror slice insertion, "FAILURE - READ_DMA status=51<READY, DSC, ERROR>"

Jeremy Chadwick-3
On Wed, Oct 29, 2008 at 02:00:21AM -0700, Carl wrote:

> Jeremy Chadwick wrote:
>> Seagate chooses to encode some raw data for some SMART attributes in a
>> custom format.  The format is not publicly documented.  This is why you
>> have to go off of the adjusted values shown in VALUE/WORST/THRESH.
>> "How am I supposed to know all of this?!"  You aren't -- it comes with
>> experience.
>
> And yet my failing drive's VALUE numbers are still all above their  
> THRESH values, despite it being bad enough to cripple the system. One  
> might argue those threshold values leave something to be desired.

I'd urge you to file complaint(s) with drive manufacturers, as they're
the ones who decide the values.  Thresholds are not defined per the
ATA-ATAPI specification, so technically they can pick whatever value
they want.  This is exactly why you'll encounter people screaming "SMART
is worthless, the drive is already dead by the time the overall SMART
health check fails!"
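
To make that concrete with the columns from the smartctl tables earlier
in this thread, here is a rough sketch.  It assumes an attribute counts
as failed when its normalized VALUE is at or below a nonzero THRESH;
that comparison is my assumption about how the health check is
evaluated, and smart_below_thresh is a hypothetical name.

```sh
# smart_below_thresh: read "smartctl -A" style rows on stdin and print
# every attribute whose normalized VALUE (column 4) is at or below a
# nonzero THRESH (column 6).
smart_below_thresh() {
    awk '
        # Only consider rows that start with a numeric attribute ID.
        $1 ~ /^[0-9]+$/ && $6 + 0 > 0 && $4 + 0 <= $6 + 0 {
            printf "%s %s value=%d thresh=%d\n", $1, $2, $4, $6
        }
    '
}
```

Run against the ad6 table quoted earlier, this would print nothing:
every VALUE is still above its THRESH even though attributes 5, 197
and 198 carry nonzero raw counts.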

If you go this route, please CC me, as I'd be quite interested to see what
manufacturers have to say.

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |


Re: gmirror slice insertion, "FAILURE - READ_DMA status=51<READY, DSC, ERROR>"

Thomas Sparrevohn
On Wednesday 29 October 2008 10:04:39 Jeremy Chadwick wrote:

> On Wed, Oct 29, 2008 at 02:00:21AM -0700, Carl wrote:
> > Jeremy Chadwick wrote:
> >> Seagate chooses to encode some raw data for some SMART attributes in a
> >> custom format.  The format is not publicly documented.  This is why you
> >> have to go off of the adjusted values shown in VALUE/WORST/THRESH.
> >> "How am I supposed to know all of this?!"  You aren't -- it comes with
> >> experience.
> >
> > And yet my failing drive's VALUE numbers are still all above their  
> > THRESH values, despite it being bad enough to cripple the system. One  
> > might argue those threshold values leave something to be desired.
>
> I'd urge you to file complaint(s) with drive manufacturers, as they're
> the ones who decide the values.  Thresholds are not defined per the
> ATA-ATAPI specification, so technically they can pick whatever value
> they want.  This is exactly why you'll encounter people screaming "SMART
> is worthless, the drive is already dead by the time the overall SMART
> health check fails!"
>
> If you go this route, please CC me, as I'd be quite interested to see what
> manufacturers have to say.
>

Just a side note - I saw the same problem with a Hitachi disk - I ran a vendor diagnostics tool
that I found on their home page and it rebuilt the bad sector map and the problem went away.

The error occurred after I had the disk for a couple of days - What puzzled me was that the drive
did not do it automatically.

Re: gmirror slice insertion, "FAILURE - READ_DMA status=51<READY, DSC, ERROR>"

CyberLeo Kitsana
Thomas Sparrevohn wrote:
> The error occurred after I had the disk for a couple of days - What puzzled me was that the drive
> did not do it automatically

Hard disks will not remap uncorrectable bad sectors on read automatically,
as the drive no longer knows what the contents of that sector should be. In
this instance, the sector is usually remapped during a write.

Given the symptoms of the problem described above, it looks like this
uncorrectable sector is located in a portion of the disk that isn't
touched by FreeBSD's newfs or installation procedure, and would never
have a chance to be written to and corrected. Then, when the mirror sync
occurs (which copies every block verbatim, regardless of whether it's in
use or not) it's choking on that sector and locking up the disk, thus
freezing the OS.

One thing to try prior to RMAing the disk is to fill the entire disk
with zeroes (dd if=/dev/zero of=/dev/ad6 bs=131072 or similar) to give
its firmware a chance to remap all flakey sectors, and rewrite all ECC
information. I do this with every new or freshly acquired disk that's
guaranteed to be empty, to ensure that no surprise errors bite me later
on, as well as to make sure no previous data hangs around.
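
Since that dd destroys everything on the target, a guarded wrapper is
one way to avoid pointing it at the wrong disk.  This is only a sketch:
zero_fill and the CONFIRM variable are hypothetical names, and the dd
arguments are the ones given above.

```sh
# zero_fill DEVICE
# Prints the dd command by default; only overwrites DEVICE with zeroes
# when CONFIRM=yes is set in the environment.  DESTROYS ALL DATA on it.
zero_fill() {
    dev=$1
    if [ "${CONFIRM:-no}" != "yes" ]; then
        echo "would run: dd if=/dev/zero of=$dev bs=131072"
        return 0
    fi
    dd if=/dev/zero of="$dev" bs=131072
}
```

"zero_fill /dev/ad6" only shows the command; "CONFIRM=yes zero_fill
/dev/ad6" actually runs it.  Comparing attributes 5, 197 and 198 in
"smartctl -a /dev/ad6" before and after would show whether the
firmware remapped anything.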

--
Fuzzy love,
-CyberLeo
Technical Administrator
CyberLeo.Net Webhosting
http://www.CyberLeo.Net
<[hidden email]>

Furry Peace! - http://wwww.fur.com/peace/

wildcards don't work in sh shell for FAT32 filesystem

Carl-65
In reply to this post by Carl Voth
Why do pathnames containing a wildcard work in the tcsh shell regardless
of the target filesystem, but do not work in the sh shell if the target
filesystem is FAT32?

The following sequence begins in the tcsh shell by mounting a FAT32
partition from a USB thumb drive. /tmp is in a UFS2 partition. There are
no files with "fish" in their names in either location. This is
happening in FreeBSD 7.0-RELEASE. Why do the last four commands not have
the same result?

     tcsh# mount_msdosfs /dev/da0s1 /mnt
     tcsh# rm -f /tmp/fish*
     rm: No match.
     tcsh# rm -f /tmp/*fish
     rm: No match.
     tcsh# rm -f /mnt/fish*
     rm: No match.
     tcsh# rm -f /mnt/*fish
     rm: No match.
     tcsh# sh
     sh# rm -f /tmp/fish*
     sh# rm -f /tmp/*fish
     sh# rm -f /mnt/fish*
     rm: /mnt/fish*: Invalid argument
     sh# rm -f /mnt/*fish
     rm: /mnt/*fish: Invalid argument

FWIW, the context of this discovery was trying to use the grub-install
script from the GRUB port to install its boot loader on a FAT32 thumb
drive. The script aborts when it attempts something like "rm -f
/mnt/boot/grub/*stage1_5"
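
A possible explanation, offered as an assumption rather than anything
confirmed in this thread: tcsh refuses to run a command whose pattern
matches nothing (hence "No match"), while sh passes the unmatched
pattern through to rm as a literal string.  On UFS2, rm -f then
silently ignores the nonexistent name; on msdosfs, "*" is presumably
rejected as an illegal filename character ("Invalid argument"), which
rm -f does not suppress.  A minimal workaround sketch in sh, where
rm_existing is a hypothetical helper name:

```sh
# rm_existing PATH...
# Remove only those arguments that name something that exists.  An
# unmatched glob reaches the function as the literal pattern, names no
# file, and is skipped instead of being handed to rm.
rm_existing() {
    status=0
    for f in "$@"; do
        if [ -e "$f" ] || [ -L "$f" ]; then
            rm -f -- "$f" || status=1
        fi
    done
    return $status
}
```

With this, "rm_existing /mnt/boot/grub/*stage1_5" would return success
whether or not any such file exists on the FAT32 volume.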

Carl                                             / K0802647

FreeBSD 7.1 distribution DVD

Carl-65
I've downloaded 7.1-RELEASE-i386-dvd1.iso and created a bootable USB
thumb drive with it. It boots up and launches sysinstall as expected.
However, when I try to launch the liveFS from the Fixit menu, it will
only look on /dev/acd0. What do I need to modify so that it will look
for the liveFS on the thumb drive?

Alternatively, since I'm using Grub as a multiboot manager, how would I
alter the FreeBSD boot configuration such that it boots the liveFS
directly instead of the primitive sysinstall-only boot option?



Carl                                             / K0802647
