[RAID FAIL] How to swap a broken drive with a new one
Posted: Fri Dec 12, 2025 3:14 pm
Scenario:
In cat /proc/mdstat you found:
Code: Select all
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 sda1[1](F) sdb1[0] <---------- here we can see 'sda1' in failed (F) state
3906885440 blocks super 1.2 [2/1] [U_] <---------- [U_] means that one partition is not synchronized
bitmap: 4/4 pages [16KB], 65536KB chunk
OR
smartctl -a /dev/sda said:
Code: Select all
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
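...
If smartctl is an unknown command on your Debian/Ubuntu system, install it first by running: apt update; apt -y install smartmontools
OR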
HDSentinel said:
Code: Select all
HDD Device 2: /dev/sda
HDD Model ID : ST4000NM0245-1Z2107
HDD Serial No: ZC139HWL
HDD Revision : SS05
HDD Size : 3815448 MB
Interface : S-ATA Gen3, 6 Gbps
Temperature : 32 °C
Highest Temp.: 55 °C
Health : 0 %
Performance : 100 %
Power on time: 1917 days, 22 hours
Est. lifetime: 0 days
Failure Predicted - Attribute: 5 Reallocated Sectors Count, Count of sectors moved to the spare area. Indicate problem with the disk surface or the read/write heads.
1777 errors occurred during data transfer.
In case of sudden system crash, reboot, blue-screen-of-death, inaccessible file(s)/folder(s), it is recommended to verify data and power cables, connections - and if possible try different cables to prevent further problems.
More information: https://www.hdsentinel.com/hard_disk_case_communication_error.php
Replace hard disk immediately.
It is recommended to backup immediately to prevent data loss.
If HDSentinel is not installed, you can install it by running:
Code: Select all
wget https://www.hdsentinel.com/hdslin/hdsentinel-020c-x64.zip; unzip hdsentinel-020c-x64.zip; chmod u+x HDSentinel; ./HDSentinel
Here is how to swap a broken drive with a new one
First, let's see which RAID 1 array it's in.
Run:
Code: Select all
cat /proc/mdstat
Output would be something like:
Code: Select all
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 sda1[1](F) sdb1[0] <---------- here we can see 'sda1' in failed (F) state, in 'md2' RAID1 array
3906885440 blocks super 1.2 [2/1] [U_] <---------- [U_] means that one partition is not synchronized
bitmap: 13/30 pages [52KB], 65536KB chunk
md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
999547200 blocks super 1.2 [2/2] [UU]
bitmap: 8/8 pages [32KB], 65536KB chunk
md0 : active raid1 nvme0n1p1[0] nvme1n1p1[1]
523712 blocks super 1.2 [2/2] [UU]
unused devices: <none>
For better visualization, you can run lsblk to see a map of all drives:
Code: Select all
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda             8:0    0   3.7T  0 disk
└─sda1          8:1    0   3.7T  0 part  <------ here is our broken buddy
  └─md2         9:127  0   3.7T  0 raid1 /hdd
sdb             8:16   0   3.7T  0 disk
└─sdb1          8:17   0   3.7T  0 part
  └─md2         9:127  0   3.7T  0 raid1 /hdd
nvme0n1       259:0    0 953.9G  0 disk
├─nvme0n1p1   259:1    0   512M  0 part
│ └─md0         9:0    0 511.4M  0 raid1 /boot
└─nvme0n1p2   259:2    0 953.4G  0 part
  └─md1         9:1    0 953.2G  0 raid1 /
nvme1n1       259:3    0 953.9G  0 disk
├─nvme1n1p1   259:4    0   512M  0 part
│ └─md0         9:0    0 511.4M  0 raid1 /boot
└─nvme1n1p2   259:5    0 953.4G  0 part
  └─md1         9:1    0 953.2G  0 raid1 /
So, let's conclude:
- sda is broken
- the sda1 partition must be removed from the RAID1 array, then re-added and synchronized once the broken drive has been swapped
- sda1 is in the md2 RAID1 array
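If you have several arrays, a small helper can pull the failed members out of the mdstat text for you. This is only a sketch: failed_members is a name I made up, and it assumes the usual mdstat layout where a failed member is printed as name[N](F) on the array line.

```shell
# Print "<array> <member>" for every RAID member flagged (F) in mdstat-style text.
# Reads the file given as $1, or /proc/mdstat when no argument is passed.
# NOTE: failed_members is a hypothetical helper, not part of mdadm.
failed_members() {
    awk '
        $2 == ":" {                                       # array line, e.g. "md2 : active raid1 sda1[1](F) sdb1[0]"
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\[[0-9]+\]\(F\)$/) {            # member marked as failed
                    member = $i
                    sub(/\[[0-9]+\]\(F\)$/, "", member)   # strip the "[1](F)" suffix
                    print $1, member
                }
            }
        }
    ' "${1:-/proc/mdstat}"
}
```

Called without arguments it reads the live /proc/mdstat; for the array line shown in this scenario it would report md2 and sda1.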
In case you are using NVMe SSD drives, the tutorial would be the same, just:
- instead of sda it would be nvme0n1
- instead of sda1 it would be nvme0n1p1
- instead of sdb it would be nvme1n1
- instead of sdb1 it would be nvme1n1p1
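The naming difference follows one rule: when the disk name ends in a digit, the kernel puts a 'p' before the partition number. If you script any of the steps, a tiny helper like the one below avoids typos. part_name is a hypothetical name, used only for illustration.

```shell
# Print the partition device name for a disk ($1) and a partition number ($2).
# Disks whose name ends in a digit (nvme0n1, mmcblk0) get a "p" separator.
# NOTE: part_name is a hypothetical helper for this tutorial.
part_name() {
    case "$1" in
        *[0-9]) printf '%sp%s\n' "$1" "$2" ;;   # nvme0n1 + 1 -> nvme0n1p1
        *)      printf '%s%s\n'  "$1" "$2" ;;   # sda + 1     -> sda1
    esac
}
```

For example, part_name /dev/nvme1n1 1 gives the NVMe counterpart of /dev/sdb1 from the list above.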
Okay, now we need to remove the broken drive from the RAID1 array.
Mark it failed:
Code: Select all
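mdadm --manage /dev/md2 --fail /dev/sda1
Remove sda1 from md2 array:
Code: Select all
mdadm --manage /dev/md2 --remove /dev/sda1
Here, in case 'smartmontools' and 'HDSentinel' didn't show that the disk is dying, we could re-add the same partition back, and our story would end here:
Code: Select all
mdadm --manage /dev/md2 --add /dev/sda1
So, sometimes the drive isn't so bad, and you can give it another chance.
In that case, we are done here.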
With cat /proc/mdstat you can monitor the re-syncing process, and once it is done, and you see [UU] for md2, you are fine.
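If you don't want to watch the output by hand, the [UU] check can be scripted. The function below is a rough sketch under the assumption that the status brackets appear on the line right after the array line, as in the outputs in this post. md_synced is a name I invented.

```shell
# Exit 0 when array $1 shows no "_" in its status brackets (e.g. [UU]), else exit 1.
# Reads mdstat-style text from the file in $2, or /proc/mdstat by default.
# NOTE: md_synced is a hypothetical helper, not an mdadm command.
md_synced() {
    awk -v md="$1" '
        $2 == ":" { inarr = ($1 == md); next }        # array header line
        inarr && match($0, /\[[U_]+\]/) {             # status brackets of the wanted array
            status = substr($0, RSTART, RLENGTH)
            exit
        }
        END { exit ((status != "" && status !~ /_/) ? 0 : 1) }
    ' "${2:-/proc/mdstat}"
}
```

A simple way to use it is a loop such as: until md_synced md2; do sleep 60; done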
But... in this tutorial, our drive is really bad, and we need to swap it with a healthy one.
So, we are going further.
Now, to know which drive you need to remove from the computer/server, you need the broken drive's serial number (or the healthy drive's if we cannot retrieve it from the broken one).
Code: Select all
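smartctl -i /dev/sda | grep 'Serial Number'
Output would be something like:
Code: Select all
Serial Number: ZC139HWL
Now you can turn off your computer/server (if it does not support hot-swapping), and remove the drive with 'ZC139HWL' serial number (or, if it is a remote server, send a ticket to your data center (request to swap a drive) with the serial number of the broken drive).
Okay, now, after the broken drive is removed and a new one is inserted, running lsblk should show that 'sda' is empty and without a partition:
Code: Select all
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda             8:0    0   3.7T  0 disk  <--------- we can see it is without partitions
sdb             8:16   0   3.7T  0 disk
└─sdb1          8:17   0   3.7T  0 part
  └─md2         9:127  0   3.7T  0 raid1 /hdd
...
Beware here: after the reboot, Linux can easily 'rotate' the drives, so the old healthy drive becomes 'sda' (before the reboot it was 'sdb') and the new drive becomes 'sdb'.
That is why we ran 'lsblk' - to make sure that the old healthy drive is still 'sdb'.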
If they are rotated, then swap 'sda' and 'sdb' in all commands from here to the rest of this tutorial.
In this tutorial, lsblk shows that they are not rotated - 'sda' is the new one and has no partitions, 'sdb' is the old healthy drive.
So, once again, the facts are:
- sda is the new (empty) drive without partitions
- sdb is the old healthy drive
- md2 is the RAID1 array to which we need to add sda1 back
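To double-check which disk is the empty one without reading the tree by eye, you can filter the lsblk listing. The sketch below assumes the input comes from lsblk -ln -o NAME,TYPE,PKNAME and that your RAID members are partitions (as in this tutorial), not whole disks. empty_disks is a made-up name.

```shell
# Print every whole disk that has no partitions, one per line.
# Input on stdin: output of `lsblk -ln -o NAME,TYPE,PKNAME`.
# NOTE: empty_disks is a hypothetical helper; it only looks at "part" children,
# so a whole disk used directly as a RAID member would also be listed.
empty_disks() {
    awk '
        $2 == "disk" { disks[$1] = 1 }      # remember every whole disk
        $2 == "part" { used[$3] = 1 }       # PKNAME of a partition is its parent disk
        END { for (d in disks) if (!(d in used)) print d }
    ' | sort
}
```

Usage would be: lsblk -ln -o NAME,TYPE,PKNAME | empty_disks. If the only name printed is the one you expect to be the new drive, the device names were not rotated.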
Now is the time for the most dangerous part - we should copy the 'partition table' from the healthy drive (sdb) to the new (empty) drive (sda).
I said dangerous because if you accidentally swap their names, you will copy the empty partition table from the empty drive to the old, healthy drive, and end up with both drives empty!
So, be really, really careful here.
My advice is to use 'rsync' before this part, to transfer all important files to some remote server, just in case.
To copy the partitions from 'sdb' (old healthy) to 'sda' (new empty) drive, first we need to find what type of partition table is on 'sdb' (old healthy).
It can be 'GPT' or 'MBR'.
We will use the 'parted' tool for this task, so make sure you have installed it:
Code: Select all
apt update; apt -y install parted
We will run:
Code: Select all
parted -l | grep 'Partition Table:' -B 2
The output is:
Code: Select all
Error: /dev/sda: unrecognised disk label <-------- this is OK, since the new drive is empty
Disk /dev/sda: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: unknown
--
Disk /dev/sdb: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt <-------- here is what we are looking for
--
This means that the old, healthy 'sdb' uses the GPT partition table type.
In case it is an MBR type, we would see:
Code: Select all
Disk /dev/sdb: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos <-------- this is an MBR type
If you see 'msdos', it means that 'sdb' uses an MBR partition type.
Now we have two different tools for copying the partition table - one for GPT, one for MBR type.
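The choice between the two tools can also be made by a script instead of by eye. What follows is a minimal sketch, assuming parted -l prints a "Disk /dev/...:" line followed by a "Partition Table:" line for each drive, as in the outputs above. Both function names are hypothetical.

```shell
# Print the partition table type (gpt, msdos, unknown, ...) that parted reports for disk $1.
# Input on stdin: output of `parted -l`.
# NOTE: table_type and copy_tool are hypothetical helpers for this tutorial.
table_type() {
    awk -v disk="$1" '
        $1 == "Disk" && $2 ~ /^\// { found = ($2 == (disk ":")); next }   # "Disk /dev/sdb: 4001GB"
        found && /^Partition Table:/ { print $3; exit }
    '
}

# Map a table type to the tool used in this tutorial to copy it.
copy_tool() {
    case "$1" in
        gpt)   echo sgdisk ;;
        msdos) echo sfdisk ;;
        *)     echo "unsupported partition table: $1" >&2; return 1 ;;
    esac
}
```

For example: copy_tool "$(parted -l 2>/dev/null | table_type /dev/sdb)". An 'unknown' result on the source drive makes copy_tool fail, which is a useful hint that the source and target names may be swapped.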
In case it is a GPT type (which is our case now), we will run:
Code: Select all
# dump partition table from old healthy 'sdb'
sgdisk --backup=table.dmp /dev/sdb
# import dump to new empty 'sda'
sgdisk --load-backup=table.dmp /dev/sda
# Assign random UUID to new 'sda' drive:
sgdisk -G /dev/sda
In case it is an MBR type (which is NOT our case right now), we will run:
Code: Select all
# dump partition table from old healthy 'sdb'
sfdisk --dump /dev/sdb > table.dmp
# import dump to new empty 'sda'
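sfdisk /dev/sda < table.dmp
Once again, in previous steps, be really careful not to copy the empty partition table to the healthy old drive.
Now, let's check if 'sda' has a partition now:
Code: Select all
# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda             8:0    0   3.7T  0 disk
└─sda1          8:1    0   3.7T  0 part  <---- yup, it is here now
sdb             8:16   0   3.7T  0 disk
└─sdb1          8:17   0   3.7T  0 part
  └─md2         9:127  0   3.7T  0 raid1 /hdd
Now the only step left is to bring back 'sda1' to the 'md2' RAID1 array:
Code: Select all
mdadm --manage /dev/md2 --add /dev/sda1
And we can check it has started to sync:
Code: Select all
cat /proc/mdstat
Output would be:
Code: Select all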
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 sda1[2] sdb1[1]
3906885440 blocks super 1.2 [2/1] [_U]
[>....................] recovery = 0.0% (1680128/3906885440) finish=309.9min speed=210016K/sec
bitmap: 22/30 pages [88KB], 65536KB chunk
md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
999547200 blocks super 1.2 [2/2] [UU]
bitmap: 8/8 pages [32KB], 65536KB chunk
md0 : active raid1 nvme0n1p1[0] nvme1n1p1[1]
523712 blocks super 1.2 [2/2] [UU]
unused devices: <none>
And lsblk will confirm that 'sda1' is again part of the 'md2' RAID1 array:
Code: Select all
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda             8:0    0   3.7T  0 disk
└─sda1          8:1    0   3.7T  0 part
  └─md2         9:127  0   3.7T  0 raid1 /hdd  <------- Wohoooo !
sdb             8:16   0   3.7T  0 disk
└─sdb1          8:17   0   3.7T  0 part
  └─md2         9:127  0   3.7T  0 raid1 /hdd
...
That's all, folks