Yesterday I was handed a laptop with a hardware error and, suspecting the hard disk, I spent some time tinkering with S.M.A.R.T.
smartctl, the command-line tool from smartmontools, is used to monitor and check the state of the device.
# /usr/sbin/smartctl -a /dev/sda
The first part of the output shows information about the disk: model, serial number, firmware version and capacity.
Model Family: Fujitsu MHY2 BH series
Device Model: FUJITSU MHY2120CH
Serial Number: K22LT7A2BCR4
Firmware Version: 0040020B
User Capacity: 320,034,123,776 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 3c
Local Time is: Tue Jan 3 19:08:05 2012 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Smartmontools has a database of disk types. If the disk is in the database, smartctl may be able to interpret the raw attribute values correctly.
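Since this first block is just a list of "key: value" lines, it is easy to pick apart in a script. Here is a minimal Python sketch, assuming the text has already been captured from smartctl -a. The name parse_info is made up for illustration, and values are kept in lists because a key such as "SMART support is" appears twice.

```python
def parse_info(text):
    """Parse 'Key: value' lines into a dict mapping each key to a list of values."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons (times, hints).
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line
        info.setdefault(key.strip(), []).append(value.strip())
    return info


sample = """\
Device Model: FUJITSU MHY2120CH
Firmware Version: 0040020B
Local Time is: Tue Jan 3 19:08:05 2012 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
"""

info = parse_info(sample)
print(info["Device Model"][0])
print(info["SMART support is"])
```

Splitting on the first colon only is what keeps the timestamp in "Local Time is" intact.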
The second part shows the result of the disk's health status inquiry, a one-line summary of the disk's health.
If the disk's health status is "FAILING", back up your data immediately. The rest of this section of the output describes the disk's capabilities and the estimated time needed to carry out short and long disk self-tests.
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 487) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
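If all you need from this section is the verdict, the health line can be pulled out with a regular expression. This is a rough Python sketch that assumes the wording shown above. health_status and needs_backup are hypothetical helpers, and treating a missing line as a reason to back up is my own cautious choice rather than anything smartctl prescribes.

```python
import re

# Matches the verdict word that follows the health summary label.
HEALTH_RE = re.compile(r"overall-health self-assessment test result:\s*(\S+)")


def health_status(report):
    """Return the health verdict (for example 'PASSED'), or None if the line is absent."""
    match = HEALTH_RE.search(report)
    return match.group(1) if match else None


def needs_backup(report):
    """Assumption: anything other than an explicit PASSED calls for a backup."""
    return health_status(report) != "PASSED"


report = "SMART overall-health self-assessment test result: PASSED"
print(health_status(report), needs_backup(report))
```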
The third part lists the disk's table of up to 30 attributes (out of a maximum set of 255). Remember that the attributes are not part of the ATA standard, but most manufacturers still support them. Although SFF-8035i does not define the meaning or interpretation of the attributes, many have a de facto standard interpretation.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 100 100 046 Pre-fail Always - 15122
2 Throughput_Performance 0x0005 100 100 030 Pre-fail Offline - 26476948
3 Spin_Up_Time 0x0003 100 100 025 Pre-fail Always - 1
4 Start_Stop_Count 0x0032 099 099 000 Old_age Always - 3490
5 Reallocated_Sector_Ct 0x0033 100 100 024 Pre-fail Always - 0 (2000, 0)
7 Seek_Error_Rate 0x000f 100 100 047 Pre-fail Always - 3563
8 Seek_Time_Performance 0x0005 100 100 019 Pre-fail Offline - 4
9 Power_On_Hours 0x0032 091 091 000 Old_age Always - 4832
10 Spin_Retry_Count 0x0013 100 100 020 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 2517
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 211
193 Load_Cycle_Count 0x0032 099 099 000 Old_age Always - 25520
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 35 (Lifetime Min/Max 12/55)
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 708
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0 (0, 6829)
197 Current_Pending_Sector 0x0012 100 091 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 094 094 000 Old_age Offline - 13
199 UDMA_CRC_Error_Count 0x003e 200 253 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x000f 100 100 060 Pre-fail Always - 1859
203 Run_Out_Cancel 0x0002 100 100 000 Old_age Always - 433761354638
240 Transfer_Error_Rate 0x003e 200 200 000 Old_age Always - 0
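Each attribute row splits cleanly on whitespace, which makes the table easy to load into a script. Here is a Python sketch that assumes smartctl's usual column order (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The Attribute tuple and parse_attribute are illustrative names, not part of smartmontools.

```python
from collections import namedtuple

Attribute = namedtuple(
    "Attribute",
    "id name flag value worst thresh type updated when_failed raw",
)


def parse_attribute(line):
    """Parse one attribute row; return None for lines that are not attribute rows."""
    # Split into at most 10 fields so the raw value keeps its embedded spaces.
    parts = line.split(None, 9)
    if len(parts) < 10 or not parts[0].isdigit():
        return None
    ident, name, flag, value, worst, thresh, kind, updated, failed, raw = parts
    return Attribute(
        int(ident), name, int(flag, 16), int(value), int(worst),
        int(thresh), kind, updated, failed, raw.strip(),
    )


row = "194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 35 (Lifetime Min/Max 12/55)"
attr = parse_attribute(row)
print(attr.name, attr.value, attr.thresh, attr.raw)
```

Limiting the split to ten fields is the design choice that matters here: raw values such as the temperature one contain spaces of their own.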
Each attribute has a six-byte raw value (RAW_VALUE) and a one-byte normalized value (VALUE). For the temperature attribute (ID 194), for example, the raw value stores three temperatures: the disk's current temperature in degrees Celsius (35), plus its lifetime minimum (12) and maximum (55).
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 35 (Lifetime Min/Max 12/55)
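To make the idea of three temperatures inside one six-byte value concrete, here is a small Python sketch. The byte layout is an assumption chosen purely for illustration (three 16-bit words holding current, minimum and maximum, lowest word first). The real layout is vendor-specific, so do not rely on this for any particular drive.

```python
def unpack_temperature(raw):
    """Split a 48-bit raw value into (current, lifetime_min, lifetime_max).

    Assumed layout, for illustration only: three 16-bit words, lowest word first.
    """
    current = raw & 0xFFFF
    lifetime_min = (raw >> 16) & 0xFFFF
    lifetime_max = (raw >> 32) & 0xFFFF
    return current, lifetime_min, lifetime_max


# Build a raw value from the figures in the row above, then take it apart again.
raw = 35 | (12 << 16) | (55 << 32)
print(unpack_temperature(raw))
```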
The format of the raw data is vendor-specific and is not defined by any standard. To track the health of the disk, the firmware converts the raw value into a normalized value between 1 and 253. If this normalized value is less than or equal to the threshold (THRESH), the attribute is said to have failed, and this is flagged in the WHEN_FAILED column. Here that column shows only a dash because none of these attributes has failed. WORST is the lowest normalized value reached since SMART was enabled on the disk. The attribute type (TYPE) indicates whether a failure of the attribute means that the device has reached the end of its design life (Old_age) or that disk failure is imminent (Pre-fail). For example, the disk's spin-up time (ID 3) is a pre-fail attribute. If this (or any other pre-fail attribute) fails, disk failure is predicted within 24 hours. (A list of the attributes and their meanings appears further down.)
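The failure rule described above fits in a few lines of Python. This sketch only restates that rule (normalized value less than or equal to the threshold means failed). The function name assess and the wording of the returned messages are my own.

```python
def assess(value, thresh, attr_type):
    """Apply the SMART failure rule to one attribute's normalized value."""
    if value > thresh:
        return "ok"
    if attr_type == "Pre-fail":
        return "failed: imminent disk failure predicted"
    return "failed: end of design life reached"


# Spin_Up_Time from the table above: value 100, threshold 25, type Pre-fail.
print(assess(100, 25, "Pre-fail"))
# A hypothetical degraded reading of the same attribute.
print(assess(20, 25, "Pre-fail"))
```

Note that an attribute with a threshold of 000 can never fail under this rule, because normalized values start at 1.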
The next part of the smartctl -a output is a log of disk errors. An error-free disk has an empty log. This particular disk does not: it reports 4863 ATA errors, and the device log keeps only the five most recent. As a rule, one should worry only if disk errors start to appear in large numbers. An occasional transient error that does not recur is usually benign. The smartmontools web page has a number of examples of smartctl -a output showing some illustrative error log entries.
# smartctl -l error /dev/sda
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
SMART Error Log Version: 1
ATA Error Count: 4863 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 4863 occurred at disk power-on lifetime: 2137 hours (89 days + 1 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
-- -- -- -- -- -- --
40 41 93 a2 6f 2f 40
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 08 30 98 f5 2f 40 00 00:04:21.763 WRITE FPDMA QUEUED
60 40 28 a2 80 3a 40 00 00:04:21.763 READ FPDMA QUEUED
60 40 20 a2 1b 3f 40 00 00:04:21.763 READ FPDMA QUEUED
60 08 18 90 95 92 40 00 00:04:21.763 READ FPDMA QUEUED
61 00 10 28 08 d6 40 00 00:04:21.763 WRITE FPDMA QUEUED
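The register dump and the timestamps in this log can be decoded by hand. The Python sketch below does so under two assumptions that the listing itself does not confirm: that the seven bytes after "registers were:" are in the order ER ST SC SN CL CH DH, and that the address is a 28-bit LBA (for queued 48-bit commands these registers may not hold the whole address). The function names are hypothetical.

```python
import re

# A 32-bit millisecond counter wraps after 2**32 ms, which is about 49.710 days.
WRAP_MS = 2 ** 32

STAMP_RE = re.compile(r"(?:(\d+)d\+)?(\d+):(\d+):(\d+)\.(\d{3})")


def lba28_from_registers(sn, cl, ch, dh):
    """Combine register bytes into a 28-bit LBA; the low nibble of DH is the top part."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def parse_powered_up_time(stamp):
    """Convert 'DDd+hh:mm:SS.sss' (day part optional) into milliseconds."""
    match = STAMP_RE.fullmatch(stamp)
    if match is None:
        raise ValueError(f"unrecognised timestamp: {stamp!r}")
    days, hours, minutes, seconds, millis = (int(g or 0) for g in match.groups())
    return (((days * 24 + hours) * 60 + minutes) * 60 + seconds) * 1000 + millis


# Register bytes from the error entry above, in the assumed order.
er, st, sc, sn, cl, ch, dh = (int(x, 16) for x in "40 41 93 a2 6f 2f 40".split())
lba = lba28_from_registers(sn, cl, ch, dh)
print(hex(lba))
print(parse_powered_up_time("00:04:21.763"))
print(divmod(2137, 24))  # power-on hours split into (days, hours)
```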
The final part of the smartctl output is a report of the self-tests run on the disk. There are two main types of self-test, short and long. They can be run with the commands smartctl -t short /dev/sda and smartctl -t long /dev/sda, and they do not alter the data on the disk. Short tests usually take only a minute or two to complete, and long tests take about an hour. These self-tests do not interfere with the normal operation of the disk, so the commands can be used on mounted disks in a running system.
If a self-test finds an error, the logical block address (LBA) shows where on the disk the error occurred. The Remaining column shows the percentage of the self-test that was still left when the error was found. If you suspect that something is wrong with a disk, I recommend running a long self-test to look for problems.
# smartctl -l selftest /dev/sda
smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Conveyance offline Completed without error 00% 3730 -
# 2 Short offline Completed without error 00% 3729 -
# 3 Short offline Completed without error 00% 2893 -
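Rows of the self-test log can be picked apart with a regular expression anchored on the right-hand columns (Remaining, LifeTime, LBA), which are the unambiguous ones. This Python sketch assumes rows shaped like the ones above. The description and status are kept together in one text field because single spaces do not separate them reliably. parse_selftest_row is an illustrative name.

```python
import re

ROW_RE = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<text>.+?)\s+(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def parse_selftest_row(line):
    """Return a dict for one self-test row, or None if the line is not a row."""
    m = ROW_RE.match(line.strip())
    if m is None:
        return None
    # "-" means no error; otherwise the LBA is given here in 0x... notation.
    lba = None if m.group("lba") == "-" else int(m.group("lba"), 0)
    return {
        "num": int(m.group("num")),
        "text": m.group("text"),
        "remaining": int(m.group("remaining")),
        "hours": int(m.group("hours")),
        "lba": lba,
    }


ok_row = parse_selftest_row("# 1 Conveyance offline Completed without error 00% 3730 -")
# A failing row, taken from the /dev/hda example later in this article.
bad_row = parse_selftest_row("# 1 Extended offline Completed: read failure 90% 217 0x016561e9")
print(ok_row["lba"], bad_row["lba"])
```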
With all this we now have enough information to interpret the results.
In the following example, taken from a different disk (/dev/hda), the disk is failing self-tests at Logical Block Address LBA = 0x016561e9 = 23421417. The LBA counts sectors in units of 512 bytes, and starts at zero.
root]# smartctl -l selftest /dev/hda
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 217 0x016561e9
Note that other signs that there is a bad sector on the disk can be found in the non-zero value of the Current Pending Sector count:
root]# smartctl -A /dev/hda
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 1
First Step: We need to locate the partition on which this sector of the disk lives:
root]# fdisk -lu /dev/hda
Disk /dev/hda: 123.5 GB, 123522416640 bytes
255 heads, 63 sectors/track, 15017 cylinders, total 241254720 sectors
Units = sectors of 1 * 512 = 512 bytes
Device Boot Start End Blocks Id System
/dev/hda1 * 63 4209029 2104483+ 83 Linux
/dev/hda2 4209030 5269319 530145 82 Linux swap
/dev/hda3 5269320 238227884 116479282+ 83 Linux
/dev/hda4 238227885 241248104 1510110 83 Linux
The partition /dev/hda3 starts at LBA 5269320 and extends past the 'problem' LBA. The 'problem' LBA is offset 23421417 - 5269320 = 18152097 sectors into the partition /dev/hda3.
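The lookup in this first step can be checked with a few lines of Python. The partition table below is copied by hand from the fdisk -lu output above, and locate is a hypothetical helper written for this example.

```python
# (device, start sector, end sector), copied from the fdisk -lu listing.
PARTITIONS = [
    ("/dev/hda1", 63, 4209029),
    ("/dev/hda2", 4209030, 5269319),
    ("/dev/hda3", 5269320, 238227884),
    ("/dev/hda4", 238227885, 241248104),
]


def locate(lba, partitions=PARTITIONS):
    """Return (device, sector offset inside the partition), or None if unpartitioned."""
    for device, start, end in partitions:
        if start <= lba <= end:
            return device, lba - start
    return None


print(locate(0x016561E9))
```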
To verify the type of the file system and the mount point, look in /etc/fstab:
root]# grep hda3 /etc/fstab
/dev/hda3 /data ext2 defaults 1 2
You can see that this is an ext2 file system, mounted at /data.
Second Step: we need to find the block size of the file system (normally 4096 bytes for ext2):
root]# tune2fs -l /dev/hda3 | grep Block
Block count: 29119820
Block size: 4096
In this case the block size is 4096 bytes.
Third Step: we need to determine which File System Block contains this LBA. The formula is:
b = (int)((L-S)*512/B)
where:
b = File System block number
B = File system block size in bytes
L = LBA of bad sector
S = Starting sector of partition as shown by fdisk -lu
and (int) denotes the integer part.
In our example, L=23421417, S=5269320, and B=4096. Hence the 'problem' LBA is in block number
b = (int)(18152097*512/4096) = (int)2269012.125
so b=2269012.
Note: the fractional part of 0.125 indicates that this problem LBA is actually the second of the eight sectors that make up this file system block.
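The same calculation in Python, using integer arithmetic so that the fractional part shows up as a sector index rather than as 0.125. This is only a sketch of the formula above, and fs_block is a made-up name.

```python
SECTOR_SIZE = 512  # bytes per LBA sector, as stated earlier


def fs_block(lba, part_start, block_size):
    """Return (file system block number, index of the sector inside that block)."""
    byte_offset = (lba - part_start) * SECTOR_SIZE
    block, remainder = divmod(byte_offset, block_size)
    return block, remainder // SECTOR_SIZE


# L = 23421417, S = 5269320, B = 4096, as in the example.
block, sector_index = fs_block(23421417, 5269320, 4096)
print(block, sector_index)
```

A sector index of 1 (counting from zero) is the second of the eight sectors, which agrees with the note above.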
Fourth Step: we use debugfs to locate the inode stored in this block, and the file that contains that inode:
root]# debugfs
debugfs 1.32 (09-Nov-2002)
debugfs: open /dev/hda3
debugfs: testb 2269012
Block 2269012 not in use
If the block is not in use, as in the above example, then you can skip the rest of this step and go ahead to Step Five.
If, on the other hand, the block is in use, we want to identify the file that uses it:
debugfs: testb 2269012
Block 2269012 marked in use
debugfs: icheck 2269012
Block Inode number
2269012 41032
debugfs: ncheck 41032
Inode Pathname
41032 /S1/R/H/714197568-714203359/H-R-714202192-16.gwf
In this example, you can see that the problematic file (with the mount point included in the path) is: /data/S1/R/H/714197568-714203359/H-R-714202192-16.gwf
When we are working with an ext3 file system, it may happen that the affected file is the journal itself. Generally, if this is the case, the inode number will be very small. In any case, debugfs will not be able to get the file name:
debugfs: testb 2269012
Block 2269012 marked in use
debugfs: icheck 2269012
Block Inode number
2269012 8
debugfs: ncheck 8
Inode Pathname
To get around this situation, we can remove the journal altogether:
tune2fs -O ^has_journal /dev/hda3
and then start again with Step Four: this time we should see that the bad block is no longer in use. If we removed the journal, at the end of the whole procedure we should remember to rebuild it:
tune2fs -j /dev/hda3
Fifth Step. NOTE: This last step will permanently and irretrievably destroy the contents of the file system block that is damaged: if the block was allocated to a file, some of the data that is in this file is going to be overwritten with zeros. You will not be able to recover that data unless you can replace the file with a fresh or correct version.
To force the disk to reallocate this bad block we'll write zeros to the bad block, and sync the disk:
root]# dd if=/dev/zero of=/dev/hda3 bs=4096 count=1 seek=2269012
root]# sync
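Because a wrong seek value would zero a healthy block, it is worth double-checking the numbers before running dd. The Python sketch below only builds the argument list and verifies that the target block really covers the bad sector. It never executes anything, and both helper names are hypothetical.

```python
def dd_zero_args(device, block, block_size):
    """Build (but do not run) the dd arguments that zero one file system block."""
    return [
        "dd", "if=/dev/zero", f"of={device}",
        f"bs={block_size}", "count=1", f"seek={block}",
    ]


def covers_sector(block, block_size, sector_offset, sector_size=512):
    """True if the byte range of the block contains the start of the given sector."""
    start = block * block_size
    return start <= sector_offset * sector_size < start + block_size


# Block 2269012 of /dev/hda3, and the bad sector at offset 18152097 in the partition.
print(" ".join(dd_zero_args("/dev/hda3", 2269012, 4096)))
print(covers_sector(2269012, 4096, 18152097))
```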
Now everything is back to normal: the sector has been reallocated. Compare the output just below to similar output near the top of this article:
root]# smartctl -A /dev/hda
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 1
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 1
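Comparing two snapshots by eye works for four rows, but a small diff helps when the full table is involved. This is a Python sketch with the raw values typed in by hand from the two listings (IDs 5, 196, 197 and 198). changed_raw_values is an illustrative name.

```python
def changed_raw_values(before, after):
    """Return {attribute id: (old, new)} for attributes whose raw value changed."""
    return {
        attr_id: (before[attr_id], after[attr_id])
        for attr_id in sorted(before.keys() & after.keys())
        if before[attr_id] != after[attr_id]
    }


before = {5: 0, 196: 0, 197: 1, 198: 1}  # raw values before writing zeros
after = {5: 1, 196: 1, 197: 0, 198: 1}   # raw values after dd and sync

changes = changed_raw_values(before, after)
print(changes)
```

The pending sector (197) drops to zero while the two reallocation counters (5 and 196) go up by one, which is the signature of a successful remap.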
Note: for some disks it may be necessary to update the SMART Attribute values by using smartctl -t offline /dev/hda
We have corrected the first bad block. If more than one block was bad, we should repeat all the steps for each of the others. After we do that, the disk will pass its self-tests again:
root]# smartctl -t long /dev/hda [wait until test completes, then]
root]# smartctl -l selftest /dev/hda
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 239 -
# 2 Extended offline Completed: read failure 90% 217 0x016561e9
# 3 Extended offline Completed: read failure 90% 212 0x016561e9
# 4 Extended offline Completed: read failure 90% 181 0x016561e9
# 5 Extended offline Completed without error 00% 14 -
# 6 Extended offline Completed without error 00% 4 -
and no longer shows any offline uncorrectable sectors:
root]# smartctl -A /dev/hda
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 1
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
Table - Legacy Attribute IDs
Decimal  Hex  Name                          Description
0        00h  Invalid                       Invalid attribute identifier
1        01h  Raw read error rate           Frequency of errors while reading raw data from a disk
2        02h  Throughput performance        Average efficiency of a hard disk
3        03h  Spinup time                   Time needed to spin up
4        04h  Start/Stop count              Number of spindle start/stop cycles
5        05h  Reallocated sector count      Quantity of remapped sectors
6        06h  Read channel margin           Reserve of channel while reading
7        07h  Seek error rate               Frequency of errors while positioning
8        08h  Seek timer performance        Average efficiency of operations while positioning
9        09h  Power-on hours count          Number of hours elapsed in the power-on state
10       0Ah  Spinup retry count            Number of retry attempts to spin up
11       0Bh  Calibration retry count       Number of attempts to calibrate the device
12       0Ch  Power cycle count             Number of power-on events
13       0Dh  Soft read error rate          Frequency of ‘program’ errors while reading from a disk
187      BBh  vendor-specific               vendor-specific
189      BDh  vendor-specific               vendor-specific
190      BEh  vendor-specific               vendor-specific
191      BFh  G-sense error rate            Frequency of mistakes as a result of impact loads
192      C0h  Power-off retract count       Number of power-off or emergency retract cycles
193      C1h  Load/Unload cycle count       Number of cycles into landing zone position
194      C2h  HDA temperature               Temperature of a hard disk assembly
195      C3h  Hardware ECC recovered        Number of ECC on-the-fly errors
196      C4h  Reallocation count            Number of remapping operations
197      C5h  Current pending sector count  Number of unstable sectors (waiting for remapping)
198      C6h  Offline scan uncorrectable    Number of uncorrected errors
199      C7h  UDMA CRC error rate           Number of CRC errors during UDMA mode
200      C8h  Write error rate              Number of errors while writing to disk (or) multi-zone error rate (or) flying height
201      C9h  Soft read error rate          Number of off-track errors
202      CAh  Data Address Mark errors      Number of Data Address Mark (DAM) errors (or) vendor-specific
203      CBh  Run out cancel                Number of ECC errors
204      CCh  Soft ECC correction           Number of errors corrected by software ECC
205      CDh  Thermal asperity rate         Number of thermal asperity errors
206      CEh  Flying height                 Height of heads above the disk surface
207      CFh  Spin high current             Amount of high current used to spin up the drive
208      D0h  Spin buzz                     Number of buzz routines to spin up the drive
209      D1h  Offline seek performance      Drive’s seek performance during offline operations
220      DCh                                Shift of disk is possible as a result of strong shock loading in the drive, as a result of falling (or) temperature
221      DDh  G-sense error rate            Number of errors as a result of impact loads as detected by a shock sensor
222      DEh  Loaded hours                  Number of hours in general operational state
223      DFh  Load/unload retry count       Loading on drive caused by numerous recurrences of operations, like reading, recording, positioning of heads, etc.
224      E0h  Load friction                 Load on drive caused by friction in mechanical parts of the drive
225      E1h  Load/Unload cycle count       Total number of load cycles
226      E2h  Load-in time                  General time for loading in a drive
227      E3h  Torque amplification count    Quantity of efforts of the rotating moment of a drive
228      E4h  Power-off retract count       Number of power-off retract events
230      E6h  GMR head amplitude            Amplitude of heads trembling (GMR-head) in running mode
231      E7h  Temperature                   Temperature of a drive
240      F0h  Head flying hours             Time while head is positioning
250      FAh  Read error retry rate         Number of errors while reading from a disk
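The Hex column is simply the decimal ID written as two hexadecimal digits followed by "h", so the two columns can be cross-checked mechanically. This is a tiny Python sketch, and hex_label is a name invented for this example.

```python
def hex_label(attribute_id):
    """Format an attribute ID (0 to 255) the way the Hex column does, e.g. 194 -> 'C2h'."""
    if not 0 <= attribute_id <= 255:
        raise ValueError("SMART attribute IDs fit in one byte")
    return f"{attribute_id:02X}h"


for attribute_id in (5, 194, 202, 250):
    print(attribute_id, hex_label(attribute_id))
```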
Attributes for:
>> Fujitsu Devices <<
>> Maxtor Devices <<
>> Western-Digital Devices <<
Full PDF in English.