
[RAID FAIL] How to swap a broken drive with a new one

Posted: Fri Dec 12, 2025 3:14 pm
by myVesta
Scenario:

In the output of cat /proc/mdstat you found:

Code: Select all

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 sda1[1](F) sdb1[0]           <---------- here we can see 'sda1' in failed (F) state
      3906885440 blocks super 1.2 [2/1] [U_]    <---------- [U_] means that one partition is not synchronized 
      bitmap: 4/4 pages [16KB], 65536KB chunk
OR

smartctl -a /dev/sda said:

Code: Select all

...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
...
If smartctl is not installed on your Debian/Ubuntu system, install it first by running: apt update; apt -y install smartmontools

OR

HDSentinel said:

Code: Select all

HDD Device  2: /dev/sda
HDD Model ID : ST4000NM0245-1Z2107
HDD Serial No: ZC139HWL
HDD Revision : SS05
HDD Size     : 3815448 MB
Interface    : S-ATA Gen3, 6 Gbps
Temperature  : 32 °C
Highest Temp.: 55 °C
Health       : 0 %
Performance  : 100 %
Power on time: 1917 days, 22 hours
Est. lifetime: 0 days
  Failure Predicted - Attribute: 5 Reallocated Sectors Count, Count of sectors moved to the spare area. Indicate problem with the disk surface or the read/write heads.
  1777 errors occurred during data transfer.
  In case of sudden system crash, reboot, blue-screen-of-death, inaccessible file(s)/folder(s), it is recommended to verify data and power cables, connections - and if possible try different cables to prevent further problems.
  More information: https://www.hdsentinel.com/hard_disk_case_communication_error.php
  Replace hard disk immediately.
    It is recommended to backup immediately to prevent data loss.
If HDSentinel is not installed, you can install it by running:

Code: Select all

wget https://www.hdsentinel.com/hdslin/hdsentinel-020c-x64.zip; unzip hdsentinel-020c-x64.zip; chmod u+x HDSentinel; ./HDSentinel

Here is how to swap a broken drive with a new one

First, let's see which RAID 1 array it's in.
Run:

Code: Select all

cat /proc/mdstat
Output would be something like:

Code: Select all

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md2 : active raid1 sda1[1](F) sdb1[0]           <---------- here we can see 'sda1' in failed (F) state, in 'md2' RAID1 array
      3906885440 blocks super 1.2 [2/1] [U_]    <---------- [U_] means that one partition is not synchronized
      bitmap: 13/30 pages [52KB], 65536KB chunk

md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      999547200 blocks super 1.2 [2/2] [UU]
      bitmap: 8/8 pages [32KB], 65536KB chunk

md0 : active raid1 nvme0n1p1[0] nvme1n1p1[1]
      523712 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>
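If you do not want to scan the output by eye, the same check can be scripted. This is only a rough sketch: the function name degraded_arrays is made up for this example, and it assumes the usual /proc/mdstat layout shown above, where the status field like [U_] is the last thing on the 'blocks' line.

```shell
# List md arrays whose status field (e.g. [U_] or [_U]) contains an underscore,
# i.e. arrays with at least one missing or failed member.
# Reads /proc/mdstat-formatted text from stdin.
# NOTE: degraded_arrays is a hypothetical helper name, not part of mdadm.
degraded_arrays() {
    awk '
        # remember the name of the array we are currently in
        /^md[^ ]* :/ { array = $1 }
        # the "blocks" line ends with the status field; an underscore means degraded
        /blocks/ && $NF ~ /^\[[U_]*_[U_]*\]$/ { print array }
    '
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
```

With the output above, it should print only md2, since md0 and md1 both show [UU].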
For better visualization, you can run lsblk to see a map of all drives:

Code: Select all

NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda           8:0    0   3.7T  0 disk  
└─sda1        8:1    0   3.7T  0 part              <------ here is our broken buddy
  └─md2     9:127  0   3.7T  0 raid1 /hdd        
sdb           8:16   0   3.7T  0 disk  
└─sdb1        8:17   0   3.7T  0 part  
  └─md2     9:127  0   3.7T  0 raid1 /hdd
nvme0n1     259:0    0 953.9G  0 disk  
├─nvme0n1p1 259:1    0   512M  0 part  
│ └─md0       9:0    0 511.4M  0 raid1 /boot
└─nvme0n1p2 259:2    0 953.4G  0 part  
  └─md1       9:1    0 953.2G  0 raid1 /
nvme1n1     259:3    0 953.9G  0 disk  
├─nvme1n1p1 259:4    0   512M  0 part  
│ └─md0       9:0    0 511.4M  0 raid1 /boot
└─nvme1n1p2 259:5    0 953.4G  0 part  
  └─md1       9:1    0 953.2G  0 raid1 /
So, let's conclude:
  • sda is broken
  • the sda1 partition should be removed from the RAID1 array, then re-added and synchronized after we swap the broken drive
  • it is in the md2 RAID1 array

In case you are using NVMe SSD drives, the tutorial would be the same, just:
  • instead of sda it would be nvme0n1
  • instead of sda1 it would be nvme0n1p1
  • instead of sdb it would be nvme1n1
  • instead of sdb1 it would be nvme1n1p1
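The naming difference in the list above follows one simple rule: when the disk name ends in a digit (like nvme0n1), Linux puts a 'p' before the partition number. Here is a small sketch of that rule; partition_name is a made-up helper name for illustration only.

```shell
# Print the partition device name for a disk name and a partition number.
# Disk names ending in a digit (e.g. nvme0n1) get a "p" separator.
# NOTE: partition_name is a hypothetical helper, used here for illustration.
partition_name() {
    disk=$1
    number=$2
    case $disk in
        *[0-9]) printf '%sp%s\n' "$disk" "$number" ;;   # nvme0n1 + 1 -> nvme0n1p1
        *)      printf '%s%s\n'  "$disk" "$number" ;;   # sda + 1     -> sda1
    esac
}

# Usage:
#   partition_name sda 1
#   partition_name nvme0n1 1
```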


Okay, now we need to remove the broken drive from the RAID1 array.

Mark it failed:

Code: Select all

mdadm --manage /dev/md2 --fail /dev/sda1
Remove sda1 from md2 array:

Code: Select all

mdadm --manage /dev/md2 --remove /dev/sda1
At this point, if 'smartmontools' and 'HDSentinel' did not show that the disk is dying, we could re-add the same partition, and our story would end here:

Code: Select all

mdadm --manage /dev/md2 --add /dev/sda1
So, sometimes the drive isn't so bad, and you can give it another chance.
In that case, we are done here.
With cat /proc/mdstat you can monitor the re-syncing process, and once it is done, and you see [UU] for md2, you are fine.

But... in this tutorial, our drive is really bad, and we need to swap it with a healthy one.
So, we are going further.


Now, to know which drive you need to remove from the computer/server, you need the broken drive's serial number (or the healthy drive's if we cannot retrieve it from the broken one).

Code: Select all

smartctl -i /dev/sda | grep 'Serial Number'
Output would be something like:

Code: Select all

Serial Number:    ZC139HWL
Now you can turn off your computer/server (if it does not support hot-swapping) and remove the drive with the serial number 'ZC139HWL'. If it is a remote server, send a ticket to your data center requesting a drive swap, and include the serial number of the broken drive.

Okay, now, after the broken drive is removed and a new one is inserted, running lsblk should show that 'sda' is empty and without a partition:

Code: Select all

NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda           8:0    0   3.7T  0 disk   <--------- we can see it is without partitions
sdb           8:16   0   3.7T  0 disk  
└─sdb1        8:17   0   3.7T  0 part  
  └─md2     9:127  0   3.7T  0 raid1 /hdd
...
Beware here - because, after the reboot, Linux can easily 'rotate' the drives - so the old healthy drive becomes 'sda' (before the reboot it was 'sdb'), and the new drive can become 'sdb'.
That is why we ran 'lsblk' - to make sure that the old healthy drive is still 'sdb'.
If they are rotated, then swap 'sda' and 'sdb' in all commands from here to the rest of this tutorial.
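Since the names can rotate, the serial number is a safer way to tell the drives apart than 'sda'/'sdb'. Below is a minimal sketch that looks up the current device name for a given serial. The function name device_for_serial and the HEALTHY_SERIAL placeholder are assumptions for this example; write down the real serial of your healthy drive before the swap.

```shell
# Print the kernel name of the drive that has the given serial number.
# Reads "NAME SERIAL" pairs on stdin, as printed by: lsblk -d -n -o NAME,SERIAL
# NOTE: device_for_serial is a hypothetical helper name.
device_for_serial() {
    # appending "" forces a string comparison, even for all-digit serials
    awk -v serial="$1" '($2 "") == (serial "") { print $1 }'
}

# Usage on a live system (HEALTHY_SERIAL is a placeholder, not a real serial):
#   lsblk -d -n -o NAME,SERIAL | device_for_serial 'HEALTHY_SERIAL'
```

If this prints 'sdb' for the healthy drive's serial, the drives are not rotated.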

In this tutorial, lsblk shows that they are not rotated - 'sda' is the new one and has no partitions, 'sdb' is the old healthy drive.

So, once again, the facts are:
  • sda is the new (empty) drive without partitions
  • sdb is the old healthy drive
  • md2 is the RAID1 array to which we need to re-add sda1
Note that sda1 does not exist yet on the new (empty) drive.

Now is the time for the most dangerous part - we should copy the 'partition table' from the healthy drive (sdb) to the new (empty) drive (sda).
I said dangerous because if you accidentally swap their names, you will copy the empty partition table from the empty drive to the old, healthy drive, and end up with both drives empty!

So, be really, really careful here.
My advice is to use 'rsync' before this part, to transfer all important files to some remote server, just in case.
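As one more safety net against swapping the names, you could check that the target drive really has no partitions before writing anything to it. This is only a sketch under the assumption that the new drive is completely blank; is_empty_disk is a made-up helper name.

```shell
# Succeed only if the lsblk output on stdin describes a disk with no partitions.
# Expects the TYPE column only, as printed by: lsblk -n -o TYPE /dev/sdX
# NOTE: is_empty_disk is a hypothetical helper name.
is_empty_disk() {
    awk '
        $1 == "disk" { disk = 1 }   # the device itself was listed
        $1 == "part" { part = 1 }   # at least one partition exists
        END { exit ((disk && !part) ? 0 : 1) }
    '
}

# Usage on a live system:
#   lsblk -n -o TYPE /dev/sda | is_empty_disk && echo "sda has no partitions"
```

If the check fails for the drive you are about to write to, stop and look at lsblk again.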

To copy the partitions from 'sdb' (old healthy) to 'sda' (new empty) drive, first we need to find what type of partition table is on 'sdb' (old healthy).
It can be 'GPT' or 'MBR'.

We will use the 'parted' tool for this task, so make sure you have installed it:

Code: Select all

apt update; apt -y install parted
We will run:

Code: Select all

parted -l | grep 'Partition Table:' -B 2
The output is:

Code: Select all

Error: /dev/sda: unrecognised disk label    <-------- this is OK, since the new drive is empty
Disk /dev/sda: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: unknown
--
Disk /dev/sdb: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt    <-------- here is what we are looking for 
--
This means that the old, healthy 'sdb' uses a GPT partition table.
If it were an MBR type, we would see:

Code: Select all

Disk /dev/sdb: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos    <-------- this is an MBR type 
If you see 'msdos', it means that 'sdb' uses an MBR partition table.
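If you prefer to check one drive at a time, lsblk can also report the type in its PTTYPE column. As far as I know it prints 'dos' for MBR where parted prints 'msdos', so the sketch below accepts both and maps the type to the copy tool used in the next steps. copy_tool_for is a made-up helper name.

```shell
# Map a partition table type to the tool this tutorial uses to copy it.
# Accepts "gpt", plus "msdos" (parted) or "dos" (lsblk PTTYPE) for MBR.
# NOTE: copy_tool_for is a hypothetical helper name.
copy_tool_for() {
    case $1 in
        gpt)       echo sgdisk ;;
        msdos|dos) echo sfdisk ;;
        *)         echo "unknown partition table type: $1" >&2; return 1 ;;
    esac
}

# Usage on a live system:
#   copy_tool_for "$(lsblk -d -n -o PTTYPE /dev/sdb)"
```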

Now we have two different tools for copying the partition table - one for GPT, one for MBR type.

In case it is a GPT type (which is our case now), we will run:

Code: Select all

# dump partition table from old healthy 'sdb'
sgdisk --backup=table.dmp /dev/sdb

# import dump to new empty 'sda'
sgdisk --load-backup=table.dmp /dev/sda

# randomize the disk and partition GUIDs on the new 'sda' drive, so they do not duplicate those of 'sdb'
sgdisk -G /dev/sda
In case it is an MBR type (which is NOT our case right now), we will run:

Code: Select all

# dump partition table from old healthy 'sdb'
sfdisk --dump /dev/sdb > table.dmp

# import dump to new empty 'sda'
sfdisk /dev/sda < table.dmp
Once again, in previous steps, be really careful not to copy the empty partition table to the healthy old drive.

Now, let's check if 'sda' has a partition now:

Code: Select all

# lsblk

NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda           8:0    0   3.7T  0 disk  
└─sda1        8:1    0   3.7T  0 part    <---- yup, it is here now
sdb           8:16   0   3.7T  0 disk  
└─sdb1        8:17   0   3.7T  0 part  
  └─md2     9:127  0   3.7T  0 raid1 /hdd
Now the only step left is to bring back 'sda1' to the 'md2' RAID1 array:

Code: Select all

mdadm --manage /dev/md2 --add /dev/sda1
And we can check it has started to sync:

Code: Select all

cat /proc/mdstat
Output would be:

Code: Select all

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md2 : active raid1 sda1[2] sdb1[1]
      3906885440 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.0% (1680128/3906885440) finish=309.9min speed=210016K/sec
      bitmap: 22/30 pages [88KB], 65536KB chunk

md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      999547200 blocks super 1.2 [2/2] [UU]
      bitmap: 8/8 pages [32KB], 65536KB chunk

md0 : active raid1 nvme0n1p1[0] nvme1n1p1[1]
      523712 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>
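To follow the sync without reading the whole output each time, the percentage can be pulled out with a few lines of awk. This is a rough sketch that assumes the 'recovery = N%' line format shown above; sync_progress is a made-up helper name.

```shell
# Print the recovery/resync percentage of one array from mdstat text on stdin.
# Prints nothing if that array is not currently syncing.
# NOTE: sync_progress is a hypothetical helper name.
sync_progress() {
    awk -v array="$1" '
        # track which array the following lines belong to
        /^md[^ ]* :/ { current = $1 }
        # on the progress line of the requested array, print the field ending in %
        current == array && /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { print $i; exit }
        }
    '
}

# Usage on a live system:
#   sync_progress md2 < /proc/mdstat
```

If you just want to block until the sync is finished, mdadm --wait /dev/md2 should do that as well.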
And lsblk will confirm that 'sda1' is again part of the 'md2' RAID1 array:

Code: Select all

NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda           8:0    0   3.7T  0 disk  
└─sda1        8:1    0   3.7T  0 part  
  └─md2     9:127  0   3.7T  0 raid1 /hdd    <------- Wohoooo !
sdb           8:16   0   3.7T  0 disk  
└─sdb1        8:17   0   3.7T  0 part  
  └─md2     9:127  0   3.7T  0 raid1 /hdd
...
That's all, folks :D