I am encountering the following error in /var/log/messages:
Aug 15 03:55:42 hostname smartd: Device: /dev/sda, 1 Currently unreadable (pending) sectors
This has caused the / partition to be remounted read-only. The server is still reachable, but you cannot do much on it. Let's troubleshoot this.
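To see how often smartd has logged this warning, and for which devices, the log can be filtered. The following is a minimal sketch and was not part of the original troubleshooting: pending_sectors is a hypothetical helper name, and the pattern assumes the exact message format shown above (newer smartd versions may add a PID or a device type such as [SAT] to the line).

```shell
# Hypothetical helper: reads syslog lines on stdin and prints
# "<device> <count>" for every smartd pending-sector warning.
# Assumes the message format shown in the log line above.
pending_sectors() {
    sed -n 's/.*smartd.*Device: \([^,]*\), \([0-9][0-9]*\) Currently unreadable (pending) sectors.*/\1 \2/p'
}

# Usage (needs read access to the log):
#   pending_sectors < /var/log/messages
```

A count that grows over time is a stronger warning sign than a single sector that stays constant.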
Creating a test file in the /root directory confirms that the file system is mounted read-only:
$ touch /root/testfile
touch: cannot touch `/root/testfile': Read-only file system
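The same thing can be checked without attempting a write, by looking at the mount options the kernel reports. This is a small sketch under the assumption that the system exposes /proc/mounts in the usual six-column format; mount_mode is a hypothetical function name, not a standard tool.

```shell
# Hypothetical helper: print "ro" or "rw" for a mount point.
#   $1 = mount point (e.g. /)
#   $2 = mounts table to read (defaults to /proc/mounts)
# If the mount point is listed more than once, the last entry wins.
# Exits with status 1 if the mount point is not listed at all.
mount_mode() {
    awk -v mp="$1" '
        $2 == mp {
            n = split($4, opts, ",")
            mode = "unknown"
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro" || opts[i] == "rw") mode = opts[i]
        }
        END { if (mode == "") exit 1; print mode }
    ' "${2:-/proc/mounts}"
}

# Usage:
#   mount_mode /
```

On the affected server, this would be expected to print ro for / until the remount in the steps below succeeds.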
What is the SMART daemon (smartd)?
Self-Monitoring, Analysis and Reporting Technology (SMART) is a monitoring system built into many ATA-3 and later ATA, IDE and SCSI-3 hard drives. The purpose of SMART is to monitor the reliability of the hard drive, predict drive failures, and carry out different types of drive self-tests. smartd is the daemon from the smartmontools package that polls this data periodically and logs problems to syslog, which is where the message above came from. We will use the smartctl command from the same package to find out what is wrong with the disk.
Let's check the overall health of disk /dev/sda:
$ smartctl -H /dev/sda
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
It passed, but that is only a general verdict. We need to dig deeper by looking at the drive's self-test log and error log:
$ smartctl -q errorsonly -H -l selftest -l error /dev/sda
ATA Error Count: 2
Error 2 occurred at disk power-on lifetime: 36795 hours (1533 days + 3 hours)
Error 1 occurred at disk power-on lifetime: 31542 hours (1314 days + 6 hours)
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline     Completed: read failure  60%        39255            -
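The timestamps in this output are power-on hours, and the "days + hours" figures in parentheses are plain integer division by 24. The sketch below reproduces that arithmetic for the three hour values in the output above; hours_to_days is a hypothetical helper name.

```shell
# Hypothetical helper: convert power-on hours to "D days + H hours".
#   $1 = power-on hours as a non-negative integer
hours_to_days() {
    printf '%d days + %d hours\n' "$(( $1 / 24 ))" "$(( $1 % 24 ))"
}

# The two logged errors and the failed short self-test from the output above.
for h in 36795 31542 39255; do
    hours_to_days "$h"
done
# prints:
#   1533 days + 3 hours
#   1314 days + 6 hours
#   1635 days + 15 hours
```

Read this way, the failed short self-test (39255 hours) happened 2460 power-on hours, roughly 102 days, after the most recent logged ATA error (36795 hours), so the read problem is more recent than either entry in the error log.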
Searching for the errors above suggests that the hard disk may have a hardware problem. fsck alone might not help much, since it only fixes logical errors in the file system, not hardware errors.
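The smartd warning at the top of this post tracks SMART attribute 197 (Current_Pending_Sector), whose raw value is the number of sectors the drive could not read and has not yet remapped. As a rough sketch, that raw value can be pulled out of the attribute table printed by smartctl -A. This assumes the usual ten-column ATA attribute layout with the raw value in the last column; pending_raw is a hypothetical helper name.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints
# the raw value of the Current_Pending_Sector attribute (ID 197).
# Assumes the standard ATA attribute table, where column 2 is the
# attribute name and column 10 is RAW_VALUE.
pending_raw() {
    awk '$2 == "Current_Pending_Sector" { print $10 }'
}

# Usage (as root):
#   smartctl -A /dev/sda | pending_raw
```

A non-zero value here means the problem is on the disk surface itself, which is consistent with the point that a file system check alone cannot fix it.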
The errors reported by smartctl are timestamped with the drive's power-on lifetime, an attribute that is explained as follows (reference):
Count of hours in power-on state. The raw value of this attribute shows total count of hours (or minutes, or seconds, depending on manufacturer) in power-on state. A decrease of this attribute value to the critical level (threshold) indicates a decrease of the MTBF (Mean Time Between Failures).
However, in reality, even if the MTBF value falls to zero, it does not mean that the MTBF resource is completely exhausted and the drive will not function normally.
Since the file system is in read-only mode, we had better take a backup before proceeding with any repair. In this case, copying with scp to another server is a good idea because we cannot write to the local disk at the moment. For me, the “home” partition holds the most important data to save:
$ scp -r /home user1@remoteserver:/home/user1/home_backup
Problem Solving Process
1. Remount the / partition:
$ mount -n -o remount /
mount: block device /dev/sda2 is write-protected, mounting read-only
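2. Run e2fsck to check the ext3 file system while it is still mounted (this is only reasonable because it is mounted read-only; e2fsck is generally unsafe on a file system mounted read-write):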
$ e2fsck /dev/sda2
e2fsck 1.39 (29-May-2006)
/: recovering journal
Clearing orphaned inode 31672817 (uid=0, gid=0, mode=0100755, size=157913)
Clearing orphaned inode 31672803 (uid=0, gid=0, mode=0100755, size=3532999)
Clearing orphaned inode 31666625 (uid=0, gid=0, mode=0100755, size=150604)
Clearing orphaned inode 31666619 (uid=0, gid=0, mode=0100755, size=383872)
Clearing orphaned inode 27885882 (uid=0, gid=0, mode=0100755, size=1011760)
Clearing orphaned inode 31666617 (uid=0, gid=0, mode=0100755, size=1141532)
Clearing orphaned inode 31665420 (uid=0, gid=0, mode=0100755, size=398180)
Clearing orphaned inode 31665416 (uid=0, gid=0, mode=0100755, size=71852)
Clearing orphaned inode 31671503 (uid=0, gid=0, mode=0100755, size=1250176)
/: clean, 80179/38273024 files, 2990728/38258797 blocks
I tried remounting the partition again as in step 1, but the same error occurred, so I proceeded to the next step.
3. Run a full file system check with fsck from a rescue environment:
$ fsck -f -y /dev/sda2
Even though the box remounted correctly after that, the smartd warning kept haunting me. This forced me to make a final decision as my next step.
4. To avoid a sudden breakdown (the disk has already been running for more than 1000 days, and its short self-test ended in a read failure), I decided to replace the hard disk and reinstall the box. It is better for me to do this as planned maintenance, so I do not have to worry about ‘urgent’ maintenance when the disk breaks down over a weekend or in the middle of the night!