Tuesday, January 3, 2012

S.M.A.R.T. Monitoring Tools: Check the Health of Your Hard Disk

Yesterday I was handed a laptop with a hardware error and, suspecting the hard disk, I spent some time tinkering with S.M.A.R.T.

The smartmontools package, through its smartctl command, is used to monitor and/or control the state of the device.

Example:
# /usr/sbin/smartctl -a /dev/sda

The first part of the output shows information about the disk itself: model, firmware and capacity.

=== START OF INFORMATION SECTION ===
Model Family: Fujitsu MHY2 BH series
Device Model: FUJITSU MHY2120CH
Serial Number: K22LT7A2BCR4
Firmware Version: 0040020B
User Capacity: 320,034,123,776 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 3c
Local Time is: Tue Jan 3 19:08:05 2012 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Smartmontools has a database of disk types. If the disk is in the database, smartctl may be able to interpret the attribute values correctly.
The second part shows the result of the device's health self-assessment. This is a summary of the disk's health.
If the disk's health status is "FAILING", back up your data immediately. The rest of this section of the output describes the disk's capabilities and the estimated time needed to run the short and long self-tests.

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 487) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
[...]
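
As a rough illustration of how the health summary can be checked from a script, here is a minimal Python sketch. It assumes the output of smartctl has already been captured as text; the function name overall_health is made up for this example, and treating anything other than PASSED as a reason to back up is my own conservative assumption.

```python
import re

def overall_health(smartctl_output):
    """Return the overall-health verdict (e.g. 'PASSED'), or None if the line is absent."""
    match = re.search(
        r"SMART overall-health self-assessment test result:\s*(\S+)",
        smartctl_output,
    )
    return match.group(1) if match else None

# Sample text copied from the output shown above.
sample = """=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
"""

status = overall_health(sample)
# Assumption: any verdict other than PASSED is treated as a reason to back up now.
print(status, "- back up now!" if status != "PASSED" else "- looks fine")
```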


The third part lists the disk's attribute table, with up to 30 attributes (out of a maximum set of 255). Keep in mind that the attributes are not part of the ATA standard, although most manufacturers still support them. Although SFF-8035i does not define the meaning or interpretation of the attributes, many of them have a de facto standard interpretation.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 100 100 046 Pre-fail Always - 15122
2 Throughput_Performance 0x0005 100 100 030 Pre-fail Offline - 26476948
3 Spin_Up_Time 0x0003 100 100 025 Pre-fail Always - 1
4 Start_Stop_Count 0x0032 099 099 000 Old_age Always - 3490
5 Reallocated_Sector_Ct 0x0033 100 100 024 Pre-fail Always - 0 (2000, 0)
7 Seek_Error_Rate 0x000f 100 100 047 Pre-fail Always - 3563
8 Seek_Time_Performance 0x0005 100 100 019 Pre-fail Offline - 4
9 Power_On_Hours 0x0032 091 091 000 Old_age Always - 4832
10 Spin_Retry_Count 0x0013 100 100 020 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 2517
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 211
193 Load_Cycle_Count 0x0032 099 099 000 Old_age Always - 25520
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 35 (Lifetime Min/Max 12/55)
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 708
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0 (0, 6829)
197 Current_Pending_Sector 0x0012 100 091 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 094 094 000 Old_age Offline - 13
199 UDMA_CRC_Error_Count 0x003e 200 253 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x000f 100 100 060 Pre-fail Always - 1859
203 Run_Out_Cancel 0x0002 100 100 000 Old_age Always - 433761354638
240 Transfer_Error_Rate 0x003e 200 200 000 Old_age Always - 0


Each attribute has a six-byte raw value (RAW_VALUE) and a one-byte normalized value (VALUE). In the case below, the raw value stores three temperatures: the current disk temperature in degrees Celsius (35), plus its lifetime minimum (12) and lifetime maximum (55).

[...]
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 35 (Lifetime Min/Max 12/55)

[...]
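
To make the three temperatures concrete, the following Python sketch pulls them out of the RAW_VALUE text exactly as smartctl printed it above. The helper name parse_temperature_raw is hypothetical, and the pattern only covers this particular "Lifetime Min/Max" layout; other drives may format the raw value differently.

```python
import re

def parse_temperature_raw(raw):
    """Split '35 (Lifetime Min/Max 12/55)' into current, min and max temperatures."""
    match = re.fullmatch(r"(\d+) \(Lifetime Min/Max (\d+)/(\d+)\)", raw.strip())
    if match is None:
        raise ValueError("unexpected raw temperature format: %r" % raw)
    current, lowest, highest = (int(group) for group in match.groups())
    return {"current": current, "min": lowest, "max": highest}

# RAW_VALUE of attribute 194 from the output above.
temps = parse_temperature_raw("35 (Lifetime Min/Max 12/55)")
print(temps)
```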

The format of the raw data is vendor-specific and is not defined by any standard. To track the state of the disk, the disk's firmware converts each raw value into a normalized value between 1 and 253. If this normalized value is less than or equal to the threshold (THRESH), the attribute is said to have failed, which is reported in the WHEN_FAILED column. In the output above that column shows only "-", because none of these attributes has failed. WORST is the lowest normalized value reached since SMART was enabled on the disk. TYPE indicates what a failure of the attribute means: either the device has reached the end of its design life (Old_age) or a disk failure is imminent (Pre-fail). For example, Spin_Up_Time (ID# 3) is a Pre-fail attribute. If it (or any other Pre-fail attribute) fails, disk failure is predicted in less than 24 hours. (A list of the attributes and their meanings appears further down.)
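
The comparison between VALUE and THRESH can be sketched in a few lines of Python. This is only an illustration that works on rows pasted as text; the function names are invented for the example, and the second row is a hypothetical failing attribute that does not come from the disk above.

```python
def parse_attribute_row(row):
    """Split one row of the attribute table into named fields."""
    fields = row.split(None, 9)  # RAW_VALUE (the last field) may contain spaces
    if len(fields) != 10:
        raise ValueError("not an attribute row: %r" % row)
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "type": fields[6],
        "when_failed": fields[8],
        "raw": fields[9],
    }

def has_failed(attr):
    """An attribute has failed when its normalized value is at or below the threshold."""
    return attr["value"] <= attr["thresh"]

# Row copied from the output above.
healthy = parse_attribute_row(
    "5 Reallocated_Sector_Ct 0x0033 100 100 024 Pre-fail Always - 0 (2000, 0)"
)
# Hypothetical row, invented to show what a failure would look like.
failing = parse_attribute_row(
    "5 Reallocated_Sector_Ct 0x0033 020 020 024 Pre-fail Always FAILING_NOW 1200"
)

for attr in (healthy, failing):
    print(attr["name"], attr["type"], "FAILED" if has_failed(attr) else "ok")
```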

The next part of the smartctl -a output is a log of disk errors. On an error-free disk this log is empty; the disk in the example below, by contrast, has logged 4863 ATA errors. As a rule, you only need to worry if disk errors start to appear in large numbers. An occasional transient error that does not recur is usually benign. The smartmontools web page has a number of examples of smartctl -a output showing illustrative error log entries.

Example:

# smartctl -l error /dev/sda
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 4863 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 4863 occurred at disk power-on lifetime: 2137 hours (89 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 41 93 a2 6f 2f 40

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 08 30 98 f5 2f 40 00 00:04:21.763 WRITE FPDMA QUEUED
60 40 28 a2 80 3a 40 00 00:04:21.763 READ FPDMA QUEUED
60 40 20 a2 1b 3f 40 00 00:04:21.763 READ FPDMA QUEUED
60 08 18 90 95 92 40 00 00:04:21.763 READ FPDMA QUEUED
61 00 10 28 08 d6 40 00 00:04:21.763 WRITE FPDMA QUEUED
[...]
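
The time fields in this log are easy to misread, so here is a small Python sketch that decodes them. It converts a Powered_Up_Time stamp to milliseconds, checks the "2137 hours (89 days + 1 hours)" arithmetic, and shows that the 49.710-day wrap is consistent with a 32-bit millisecond counter (that explanation is my reading of the number, not something the output states). The function name powered_up_ms is made up for this example.

```python
import re

def powered_up_ms(stamp):
    """Convert 'DDd+hh:mm:SS.sss' (the day part is optional) to milliseconds."""
    match = re.fullmatch(r"(?:(\d+)d\+)?(\d+):(\d+):(\d+)\.(\d+)", stamp)
    if match is None:
        raise ValueError("unexpected timestamp: %r" % stamp)
    days = int(match.group(1) or 0)
    hours, minutes, seconds, millis = (int(g) for g in match.groups()[1:])
    total_minutes = (days * 24 + hours) * 60 + minutes
    return total_minutes * 60_000 + seconds * 1000 + millis

# Timestamp taken from the command list above.
print(powered_up_ms("00:04:21.763"))

# Power-on lifetime of the logged error: 2137 hours as days and hours.
print(divmod(2137, 24))

# A 32-bit millisecond counter overflows after this many days.
WRAP_DAYS = 2**32 / (24 * 60 * 60 * 1000)
print(WRAP_DAYS)
```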

The final part of the smartctl output is a report of the self-tests that have been run on the disk. There are two main kinds of self-test, short and long. They can be started with the commands smartctl -t short /dev/sda and smartctl -t long /dev/sda, and they do not alter the data on the disk. Short tests usually take only a minute or two to complete, and long tests take about an hour. These self-tests do not interfere with the normal operation of the disk, so the commands can be used on mounted disks in a running system.
If a self-test finds an error, the logical block address (LBA) shows where on the disk the error occurred. The Remaining column shows the percentage of the self-test that was still to run when the error was found. If you suspect that something is wrong with a disk, I recommend running a long self-test to look for problems.


# smartctl -l selftest /dev/sda
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description   Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Conveyance offline  Completed without error      00%      3730        -
# 2 Short            offline  Completed without error      00%      3729        -
# 3 Short            offline  Completed without error      00%      2893        -   
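
For scripted checks, rows of the self-test log can be picked apart with a regular expression. The Python sketch below is one possible way to do it and assumes the column layout shown above; the function name is invented, and the second sample row is the failing one from the repair walkthrough further down in this post.

```python
import re

ROW = re.compile(r"#\s*(\d+)\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")

def parse_selftest_row(line):
    """Parse one self-test log row; return None for lines that are not rows."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    num, description, remaining, hours, lba = match.groups()
    return {
        "num": int(num),
        "description": " ".join(description.split()),
        "remaining_percent": int(remaining),
        "lifetime_hours": int(hours),
        # "-" means no error; base 0 accepts both 0x... and plain decimal values
        "first_error_lba": None if lba == "-" else int(lba, 0),
    }

ok_row = parse_selftest_row(
    "# 1 Conveyance offline  Completed without error      00%      3730        -"
)
bad_row = parse_selftest_row(
    "# 1 Extended offline Completed: read failure 90% 217 0x016561e9"
)
print(ok_row)
print(bad_row)
```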


With all of this we now have enough information to interpret the results.

Example:

1. REPAIRING A FILE SYSTEM

In this example, the disk is failing self-tests at Logical Block Address LBA = 0x016561e9 = 23421417. The LBA counts sectors in units of 512 bytes, and starts at zero.
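
The hexadecimal-to-decimal conversion is easy to double-check in Python. This tiny sketch also derives the byte offset of the sector on the raw disk from the 512-byte sector size stated above; the variable names are only illustrative.

```python
SECTOR_SIZE = 512  # bytes per LBA, as stated in the text

lba = int("0x016561e9", 16)      # LBA reported by the failing self-test
byte_offset = lba * SECTOR_SIZE  # position of that sector on the whole disk

print(lba, byte_offset)
```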

root]# smartctl -l selftest /dev/hda

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 217 0x016561e9


Note that other signs that there is a bad sector on the disk can be found in the non-zero value of the Current Pending Sector count:
root]# smartctl -A /dev/hda
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 1


First Step: We need to locate the partition on which this sector of the disk lives:
root]# fdisk -lu /dev/hda

Disk /dev/hda: 123.5 GB, 123522416640 bytes
255 heads, 63 sectors/track, 15017 cylinders, total 241254720 sectors
Units = sectors of 1 * 512 = 512 bytes

Device Boot Start End Blocks Id System
/dev/hda1 * 63 4209029 2104483+ 83 Linux
/dev/hda2 4209030 5269319 530145 82 Linux swap
/dev/hda3 5269320 238227884 116479282+ 83 Linux
/dev/hda4 238227885 241248104 1510110 83 Linux


The partition /dev/hda3 starts at LBA 5269320 and extends past the 'problem' LBA. The 'problem' LBA is offset 23421417 - 5269320 = 18152097 sectors into the partition /dev/hda3.
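
The lookup described above can be written as a short Python sketch. The partition boundaries are copied by hand from the fdisk -lu listing; the function name locate is hypothetical, and a real script would have to obtain the table some other way.

```python
# (device, start sector, end sector) copied from the fdisk -lu output above
PARTITIONS = [
    ("/dev/hda1", 63, 4209029),
    ("/dev/hda2", 4209030, 5269319),
    ("/dev/hda3", 5269320, 238227884),
    ("/dev/hda4", 238227885, 241248104),
]

def locate(lba, partitions=PARTITIONS):
    """Return (device, sector offset inside the partition), or None if unpartitioned."""
    for device, start, end in partitions:
        if start <= lba <= end:
            return device, lba - start
    return None

print(locate(23421417))
```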

To verify the type of the file system and the mount point, look in /etc/fstab:
root]# grep hda3 /etc/fstab
/dev/hda3 /data ext2 defaults 1 2


You can see that this is an ext2 file system, mounted at /data.

Second Step: we need to find the block size of the file system (normally 4096 bytes for ext2):
root]# tune2fs -l /dev/hda3 | grep Block
Block count: 29119820
Block size: 4096


In this case the block size is 4096 bytes.

Third Step: we need to determine which File System Block contains this LBA. The formula is:
b = (int)((L-S)*512/B)
where:
b = File System block number
B = File system block size in bytes
L = LBA of bad sector
S = Starting sector of partition as shown by fdisk -lu
and (int) denotes the integer part.

In our example, L=23421417, S=5269320, and B=4096. Hence the 'problem' LBA is in block number
b = (int)(18152097*512/4096) = (int)2269012.125
so b=2269012.


Note: the fractional part of 0.125 indicates that this problem LBA is actually the second of the eight sectors that make up this file system block.
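
The same calculation can be sketched in Python with integer arithmetic, which avoids the fractional value altogether: divmod gives the block number and the leftover bytes, and the leftover tells us which 512-byte sector inside the block is affected. The function name fs_block is made up for this illustration.

```python
SECTOR_SIZE = 512

def fs_block(lba, part_start, block_size):
    """Return (file system block number, zero-based sector index inside that block)."""
    byte_offset = (lba - part_start) * SECTOR_SIZE
    block, leftover = divmod(byte_offset, block_size)
    return block, leftover // SECTOR_SIZE

# L=23421417, S=5269320, B=4096 from the example
block, sector_in_block = fs_block(23421417, 5269320, 4096)
print(block, sector_in_block)  # index 1 is the second of the eight sectors
```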

Fourth Step: we use debugfs to locate the inode stored in this block, and the file that contains that inode:
root]# debugfs
debugfs 1.32 (09-Nov-2002)
debugfs: open /dev/hda3
debugfs: testb 2269012
Block 2269012 not in use


If the block is not in use, as in the above example, then you can skip the rest of this step and go ahead to Step Five.

If, on the other hand, the block is in use, we want to identify the file that uses it:
debugfs: testb 2269012
Block 2269012 marked in use
debugfs: icheck 2269012
Block Inode number
2269012 41032
debugfs: ncheck 41032
Inode Pathname
41032 /S1/R/H/714197568-714203359/H-R-714202192-16.gwf

In this example, you can see that the problematic file (with the mount point included in the path) is: /data/S1/R/H/714197568-714203359/H-R-714202192-16.gwf

When we are working with an ext3 file system, it may happen that the affected file is the journal itself. Generally, if this is the case, the inode number will be very small. In any case, debugfs will not be able to get the file name:
debugfs: testb 2269012
Block 2269012 marked in use
debugfs: icheck 2269012
Block Inode number
2269012 8
debugfs: ncheck 8
Inode Pathname
debugfs:


To get around this situation, we can remove the journal altogether:
tune2fs -O ^has_journal /dev/hda3

and then start again with Step Four: we should see this time that the wrong block is not in use any more. If we removed the journal file, at the end of the whole procedure we should remember to rebuild it:
tune2fs -j /dev/hda3


Fifth Step. NOTE: This last step will permanently and irretrievably destroy the contents of the file system block that is damaged: if the block was allocated to a file, some of the data that is in this file is going to be overwritten with zeros. You will not be able to recover that data unless you can replace the file with a fresh or correct version.

To force the disk to reallocate this bad block we'll write zeros to the bad block, and sync the disk:
root]# dd if=/dev/zero of=/dev/hda3 bs=4096 count=1 seek=2269012
root]# sync
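
Because this dd command is destructive, it is worth checking the numbers before running it. The following Python sketch (a sanity check under the example's values, not part of the original procedure) lists the absolute LBAs that the write will zero and confirms that the bad sector is among them. The function name sectors_zeroed is hypothetical.

```python
def sectors_zeroed(block, block_size, part_start, sector_size=512):
    """Absolute LBAs overwritten when one file system block is zeroed."""
    per_block = block_size // sector_size
    first = part_start + block * per_block
    return range(first, first + per_block)

# Values from the example: seek=2269012, bs=4096, partition start 5269320
target = sectors_zeroed(2269012, 4096, 5269320)
print(list(target))
print(23421417 in target)  # the bad LBA must be inside the block
```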



Now everything is back to normal: the sector has been reallocated. Compare the output just below to similar output near the top of this article:
root]# smartctl -A /dev/hda
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 1
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 1
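
The before and after comparison can also be made mechanically. In this Python sketch the raw values are typed in by hand from the two smartctl -A listings (keyed by attribute ID); the function name raw_changes is invented for the example.

```python
def raw_changes(before, after):
    """Return {attribute id: (old raw value, new raw value)} for values that changed."""
    return {
        attr_id: (before[attr_id], after[attr_id])
        for attr_id in sorted(before)
        if attr_id in after and before[attr_id] != after[attr_id]
    }

# RAW_VALUE per attribute ID, before and after writing zeros to the block
before = {5: 0, 196: 0, 197: 1, 198: 1}
after = {5: 1, 196: 1, 197: 0, 198: 1}

changes = raw_changes(before, after)
print(changes)
```

One sector was reallocated (5 and 196 went up) and nothing is pending any more (197 dropped to zero), while 198 has not been refreshed yet.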


Note: for some disks it may be necessary to update the SMART Attribute values by using smartctl -t offline /dev/hda

We have corrected the first bad block. If more than one block had errors, we should repeat all the steps for each of the others. After we do that, the disk will pass its self-tests again:
root]# smartctl -t long /dev/hda [wait until test completes, then]
root]# smartctl -l selftest /dev/hda


SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 239 -
# 2 Extended offline Completed: read failure 90% 217 0x016561e9
# 3 Extended offline Completed: read failure 90% 212 0x016561e9
# 4 Extended offline Completed: read failure 90% 181 0x016561e9
# 5 Extended offline Completed without error 00% 14 -
# 6 Extended offline Completed without error 00% 4 -


and no longer shows any offline uncorrectable sectors:
root]# smartctl -A /dev/hda
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 1
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0




Table - Legacy Attribute IDs
Decimal  Hex   Name                              Description
0        00h   Invalid                           Invalid attribute identifier
1        01h   Raw read error rate               Frequency of errors while reading raw data from a disk
2        02h   Throughput performance            Average efficiency of a hard disk
3        03h   Spinup time                       Time needed to spin up
4        04h   Start/Stop count                  Number of spindle start/stop cycles
5        05h   Reallocated sector count          Quantity of remapped sectors
6        06h   Read channel margin               Reserve of channel while reading
7        07h   Seek error rate                   Frequency of errors while positioning
8        08h   Seek time performance             Average efficiency of operations while positioning
9        09h   Power-on hours count              Number of hours elapsed in the power-on state
10       0Ah   Spinup retry count                Number of retry attempts to spin up
11       0Bh   Calibration retry count           Number of attempts to calibrate the device
12       0Ch   Power cycle count                 Number of power-on events
13       0Dh   Soft read error rate              Frequency of 'program' errors while reading from a disk
187      BBh   vendor-specific                   vendor-specific
189      BDh   vendor-specific                   vendor-specific
190      BEh   vendor-specific                   vendor-specific
191      BFh   G-sense error rate                Frequency of mistakes as a result of impact loads
192      C0h   Power-off retract count           Number of power-off or emergency retract cycles
193      C1h   Load/Unload cycle count           Number of cycles into landing zone position
194      C2h   HDA temperature                   Temperature of a hard disk assembly
195      C3h   Hardware ECC recovered            Number of ECC on-the-fly errors
196      C4h   Reallocation count                Number of remapping operations
197      C5h   Current pending sector count      Number of unstable sectors (waiting for remapping)
198      C6h   Offline scan uncorrectable count  Number of uncorrected errors
199      C7h   UDMA CRC error rate               Number of CRC errors during UDMA mode
200      C8h   Write error rate                  Number of errors while writing to disk (or) multi-zone error rate (or) flying height
201      C9h   Soft read error rate              Number of off-track errors
202      CAh   Data Address Mark errors          Number of Data Address Mark (DAM) errors (or) vendor-specific
203      CBh   Run out cancel                    Number of ECC errors
204      CCh   Soft ECC correction               Number of errors corrected by software ECC
205      CDh   Thermal asperity rate (TAR)       Number of thermal asperity errors
206      CEh   Flying height                     Height of heads above the disk surface
207      CFh   Spin high current                 Amount of high current used to spin up the drive
208      D0h   Spin buzz                         Number of buzz routines to spin up the drive
209      D1h   Offline seek performance          Drive's seek performance during offline operations
220      DCh   Disk shift                        Shift of disk is possible as a result of strong shock loading in the store, as a result of falling (or) temperature
221      DDh   G-sense error rate                Number of errors as a result of impact loads as detected by a shock sensor
222      DEh   Loaded hours                      Number of hours in general operational state
223      DFh   Load/unload retry count           Loading on drive caused by numerous recurrences of operations, like reading, recording, positioning of heads, etc.
224      E0h   Load friction                     Load on drive caused by friction in mechanical parts of the store
225      E1h   Load/Unload cycle count           Total number of load cycles
226      E2h   Load-in time                      General time for loading in a drive
227      E3h   Torque amplification count        Quantity efforts of the rotating moment of a drive
228      E4h   Power-off retract count           Number of power-off retract events
230      E6h   GMR head amplitude                Amplitude of heads trembling (GMR-head) in running mode
231      E7h   Temperature                       Temperature of a drive
240      F0h   Head flying hours                 Time while head is positioning
250      FAh   Read error retry rate             Number of errors while reading from a disk

Attributes for:
>> Fujitsu Devices <<
>> Maxtor Devices << 
>> Western-Digital Devices <<

Sources:
Full PDF (in English).
http://www.ariolic.com/activesmart/smart-attributes/
http://smartmontools.sourceforge.net/badblockhowto.html
http://sourceforge.net/apps/trac/smartmontools/wiki/AttributesFujitsu
