Replacing OSD disks
This guide shows how to recreate a Ceph OSD disk within a Charmed Ceph deployment.
The applicable use cases are defined by a combination of the following three factors:
- the disk is associated with a bcache device created by MAAS
- encryption:
  - the disk is not encrypted
  - the disk is LUKS encrypted without Vault
  - the disk is LUKS encrypted with Vault
- disk replacement:
  - the existing disk is re-initialised
  - the disk is physically replaced with a new one
For example, one permutation of the above that is covered by the procedural steps is:
- the disk is associated with a bcache device created by MAAS
- the disk is LUKS encrypted with Vault
- the existing disk is re-initialised
Since Ceph Luminous, BlueStore is the default storage backend for Ceph. BlueStore is assumed in these instructions.
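To confirm the storage backend of a given OSD, its metadata can be queried on any ceph-mon unit (an optional check; osd.4 is just an example ID):
sudo ceph osd metadata osd.4 | grep osd_objectstore
The output should report bluestore.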
Environment
These instructions use, as a reference, a Ceph cluster that has three ceph-osd units. Each of the unit machines has the following storage specifications:
- Four SATA disks. One for the root filesystem and three as bcache backing devices:
/dev/sdd (root)
/dev/sda
/dev/sdb
/dev/sdc
- Three bcache devices used as OSD disks:
/dev/bcache0
/dev/bcache1
/dev/bcache2
- One NVMe disk partition used as the cache for all the bcache devices:
/dev/nvme0n1p1
Disk identification and properties
Identify the storage node and the target disk. This section shows commands, run on a ceph-osd unit or a ceph-mon unit, that will help gather this information.
Map OSDs to machine hosts and ceph-osd units
The below commands can be run on any ceph-mon unit.
To map OSDs to machine hosts:
sudo ceph osd tree
Sample output:
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 8.18729 root default
-3 2.72910 host node-gadomski
1 ssd 0.90970 osd.1 up 1.00000 1.00000
4 ssd 0.90970 osd.4 up 1.00000 1.00000
6 ssd 0.90970 osd.6 up 1.00000 1.00000
-7 2.72910 host node-lepaute
2 ssd 0.90970 osd.2 up 1.00000 1.00000
5 ssd 0.90970 osd.5 up 1.00000 1.00000
8 ssd 0.90970 osd.8 up 1.00000 1.00000
-5 2.72910 host node-pytheas
0 ssd 0.90970 osd.0 up 1.00000 1.00000
3 ssd 0.90970 osd.3 up 1.00000 1.00000
7 ssd 0.90970 osd.7 up 1.00000 1.00000
To query an individual OSD:
sudo ceph osd find osd.4
To map OSDs to ceph-osd units:
juju run -a ceph-osd mount | grep ceph
Sample output:
tmpfs on /var/lib/ceph/osd/ceph-1 type tmpfs (rw,relatime)
tmpfs on /var/lib/ceph/osd/ceph-4 type tmpfs (rw,relatime)
tmpfs on /var/lib/ceph/osd/ceph-6 type tmpfs (rw,relatime)
UnitId: ceph-osd/0
tmpfs on /var/lib/ceph/osd/ceph-0 type tmpfs (rw,relatime)
tmpfs on /var/lib/ceph/osd/ceph-3 type tmpfs (rw,relatime)
tmpfs on /var/lib/ceph/osd/ceph-7 type tmpfs (rw,relatime)
UnitId: ceph-osd/1
tmpfs on /var/lib/ceph/osd/ceph-2 type tmpfs (rw,relatime)
tmpfs on /var/lib/ceph/osd/ceph-5 type tmpfs (rw,relatime)
tmpfs on /var/lib/ceph/osd/ceph-8 type tmpfs (rw,relatime)
UnitId: ceph-osd/2
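As an additional cross-check, Juju's status output shows which machine hosts each ceph-osd unit (optional; output format varies by Juju version):
juju status ceph-osd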
Get an overview of how devices are used
Use the lsblk and pvs commands on a ceph-osd unit. We’ll consider three scenarios:
- Unencrypted OSD
- Encrypted OSD without Vault
- Encrypted OSD with Vault
Vault is software for secrets management (see the charm for details).
Unencrypted OSD
lsblk -i -o NAME,FSTYPE
Sample (partial) output:
sda bcache
`-bcache0 LVM2_member
`-ceph--4397bb54--XXXX-osd--block--4397bb54--XXXX--8fc537870c13 ceph_bluestore
sdb bcache
`-bcache2 LVM2_member
`-ceph--0a7a51ae--XXXX-osd--block--0a7a51ae--XXXX--96435dcaa476 ceph_bluestore
sdc bcache
`-bcache1 LVM2_member
`-ceph--a9c36911--XXXX-osd--block--a9c36911--XXXX--1808c100300a ceph_bluestore
The ‘bcache’ values in this output are the actual kernel device names. For instance, for the sda stanza it is bcache0. Each such device is associated with a “by-dname” value used by the ceph-osd charm (created via udev rules), which ensures persistent naming across reboots.
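To inspect the by-dname symbolic links and the udev rules that create them, run the following on a ceph-osd unit (an optional check; rule file names vary per deployment):
ls -la /dev/disk/by-dname/
cat /etc/udev/rules.d/bcache*.rules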
sudo pvs
Sample (partial) output:
PV VG
/dev/bcache0 ceph-4397bb54-2b58-4b81-b681-8fc537870c13
/dev/bcache1 ceph-a9c36911-2a6b-4643-b4d8-1808c100300a
/dev/bcache2 ceph-0a7a51ae-7709-4f8f-9b06-96435dcaa476
Encrypted OSD without Vault
lsblk -i -o NAME,FSTYPE
Sample (partial) output:
sda bcache
`-bcache0 LVM2_member
`-ceph--85727873--XXXX-osd--block--85727873--XXXX--dec2c9c25a83 crypto_LUKS
`-jcnsgd-LdFy-nnle-DYXt-fdu1-cmYg-eKnutV ceph_bluestore
sdb bcache
`-bcache2 LVM2_member
`-ceph--e273e62f--XXXX-osd--block--e273e62f--XXXX--7e4f54a6a50a crypto_LUKS
`-JCAi7T-Q0lh-chEP-rehy-R843-FYPb-O5cRWO ceph_bluestore
sdc bcache
`-bcache1 LVM2_member
`-ceph--7a4dc4df--XXXX-osd--block--7a4dc4df--XXXX--5d1e3df749e1 crypto_LUKS
`-guhvFj-S5mX-oncK-WLmA-LN3A-B9gd-ky3jbV ceph_bluestore
sudo pvs
Sample (partial) output:
PV VG
/dev/bcache0 ceph-85727873-f06f-45b6-9aee-dec2c9c25a83
/dev/bcache1 ceph-7a4dc4df-4199-4d72-8f96-5d1e3df749e1
/dev/bcache2 ceph-e273e62f-c05b-4f54-9be2-7e4f54a6a50a
The encrypted device is the entry just below the one labelled crypto_LUKS. For example, for the sdb stanza, assign the encrypted device to a variable:
OSD_CRYPT=JCAi7T-Q0lh-chEP-rehy-R843-FYPb-O5cRWO
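Optionally, confirm that this is an active dm-crypt mapping (the mapping name will differ in your environment):
sudo cryptsetup status $OSD_CRYPT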
Encrypted OSD with Vault
lsblk -i -o NAME,FSTYPE
Sample (partial) output:
sda bcache
`-bcache0 crypto_LUKS
`-crypt-868854eb-dd1a-47a2-9bce-dc503b2f0fd4 LVM2_member
`-ceph--868854eb--XXXX-osd--block--868854eb--XXXX--dc503b2f0fd4 ceph_bluestore
sdb bcache
`-bcache2 crypto_LUKS
`-crypt-c4897473-ac04-4b86-a143-12e2322c6eb5 LVM2_member
`-ceph--c4897473--XXXX-osd--block--c4897473--XXXX--12e2322c6eb5 ceph_bluestore
sdc bcache
`-bcache1 crypto_LUKS
`-crypt-13d2f2a3-2e20-40e2-901a-385b12e372a2 LVM2_member
`-ceph--13d2f2a3--XXXX-osd--block--13d2f2a3--XXXX--385b12e372a2 ceph_bluestore
sudo pvs
Sample (partial) output:
PV VG
/dev/mapper/crypt-13d2f2a3-2e20-40e2-901a-385b12e372a2 ceph-13d2f2a3-2e20-40e2-901a-385b12e372a2
/dev/mapper/crypt-868854eb-dd1a-47a2-9bce-dc503b2f0fd4 ceph-868854eb-dd1a-47a2-9bce-dc503b2f0fd4
/dev/mapper/crypt-c4897473-ac04-4b86-a143-12e2322c6eb5 ceph-c4897473-ac04-4b86-a143-12e2322c6eb5
The encrypted device is the entry just below the one labelled crypto_LUKS. For example, for the sdc stanza, assign the encrypted device to a variable:
OSD_CRYPT=crypt-13d2f2a3-2e20-40e2-901a-385b12e372a2
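Optionally, confirm that a vaultlocker systemd unit is responsible for unlocking this device at boot (the instance name is the OSD UUID, as used later in these instructions):
systemctl status vaultlocker-decrypt@13d2f2a3-2e20-40e2-901a-385b12e372a2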
Map ceph-osd unit, OSD UUID, PV, and VG
Here, a specific OSD (osd.4) is being queried on a ceph-osd unit:
sudo ceph-volume lvm list | grep -A 12 "= osd.4 =" | grep 'osd fsid'
Sample output:
osd fsid 13d2f2a3-2e20-40e2-901a-385b12e372a2
This UUID, in combination with the lsblk and pvs outputs, allows us to determine the PV and VG that correspond to the OSD disk. This is because the name of the VG is based on the OSD UUID.
For example, for the Encrypted OSD with Vault scenario:
OSD_UUID=13d2f2a3-2e20-40e2-901a-385b12e372a2
OSD_PV=/dev/mapper/crypt-13d2f2a3-2e20-40e2-901a-385b12e372a2
OSD_VG=ceph-13d2f2a3-2e20-40e2-901a-385b12e372a2
And since the OSD is encrypted:
OSD_CRYPT=crypt-13d2f2a3-2e20-40e2-901a-385b12e372a2
Also make note of the ceph-osd unit being queried (here: ceph-osd/0):
OSD_UNIT=ceph-osd/0
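As a quick cross-check, the PV and VG can be located by filtering the pvs output with the OSD UUID (optional; works for any of the three scenarios):
sudo pvs | grep $OSD_UUID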
Discover the disk by-dname entry
The by-dname device name is a symbolic link to the actual kernel device used by the disk. For example, for the Encrypted OSD with Vault scenario, with an OSD_UUID of 13d2f2a3-2e20-40e2-901a-385b12e372a2, the actual device is bcache1. We can use that in the below command to find the by-dname entry.
On a ceph-osd unit:
ls -la /dev/disk/by-dname/bcache* | egrep "bcache1$"
Sample output:
lrwxrwxrwx 1 root root 13 Oct 1 04:28 /dev/disk/by-dname/bcache3 -> ../../bcache1
Assign the by-dname entry to a variable:
OSD_DNAME=/dev/disk/by-dname/bcache3
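To double-check that the entry resolves to the expected kernel device (optional):
readlink -f $OSD_DNAME
The output should be /dev/bcache1 in this example.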
Pre-replace procedure
The Encrypted OSD with Vault example scenario will be used in these steps. The target disk has been identified according to these properties:
OSD_UNIT=ceph-osd/0
OSD=osd.4
OSD_ID=4
OSD_UUID=13d2f2a3-2e20-40e2-901a-385b12e372a2
OSD_PV=/dev/mapper/crypt-13d2f2a3-2e20-40e2-901a-385b12e372a2
OSD_DNAME=/dev/disk/by-dname/bcache3
OSD_VG=ceph-13d2f2a3-2e20-40e2-901a-385b12e372a2
OSD_CRYPT=crypt-13d2f2a3-2e20-40e2-901a-385b12e372a2
The procedural steps will refer to these variables.
1. Mark the OSD as ‘out’.
Ensure cluster health is good (HEALTH_OK) before and after. On any ceph-mon unit do:
sudo ceph -s
sudo ceph osd out $OSD
sudo ceph -s
2. Stop the OSD daemon.
Reweight the OSD volume to zero (to prevent data rebalancing from occurring) and then stop the OSD daemon. From the Juju client do:
juju run-action --wait ceph-mon/leader change-osd-weight osd=$OSD_ID weight=0
juju run-action --wait $OSD_UNIT stop osds=$OSD_ID
3. Confirm that the OSD is ‘down’.
On any ceph-mon unit do:
sudo ceph osd tree down
Sample output:
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 8.18729 root default
-3 2.72910 host node-gadomski
4 ssd 0.90970 osd.4 down 0 1.00000
4. Clean up the OSD’s resources.
On unit OSD_UNIT perform cleanup actions based on the given scenario.
If not encrypted:
sudo vgremove -y $OSD_VG
sudo pvremove -y $OSD_PV
If encrypted without Vault:
sudo cryptsetup close /dev/mapper/$OSD_CRYPT
sudo vgremove -y $OSD_VG
sudo pvremove -y $OSD_PV
If encrypted with Vault:
sudo vgremove -y $OSD_VG
sudo cryptsetup close /dev/mapper/$OSD_CRYPT
sudo systemctl disable vaultlocker-decrypt@$OSD_UUID
5. Wipe the OSD disk of data.
On unit OSD_UNIT do:
sudo ceph-volume lvm zap --destroy $OSD_DNAME
Ensure the OSD_DNAME entry exists for when the charm action zap-disk is eventually run:
sudo udevadm trigger
ls -la /dev/disk/by-dname/*
Note: The zap-disk action will be needed in order to remove the OSD device from the ceph-osd charm’s internal database.
6. Purge the Ceph cluster of all traces of the OSD.
From the Juju client do:
juju run-action --wait ceph-mon/leader purge-osd osd=$OSD_ID i-really-mean-it=yes
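Optionally, confirm on any ceph-mon unit that the OSD is no longer present in the CRUSH map and that the cluster is healthy (osd.4 is the example ID; the grep should return no output):
sudo ceph osd tree | grep "osd.4"
sudo ceph -s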
Replace procedure
Skip this entire section if the disk does not need to be physically replaced. Continue to the Post-replace procedure.
The values used in these steps originate from the Encrypted OSD with Vault scenario described in the Pre-replace procedure section.
All the commands in this section are invoked on unit OSD_UNIT.
1. Confirm the bcache device and its backing device.
The disk being replaced is the bcache backing device, but in order to replace it we must also take into account its associated bcache device. Confirm the device names based on the lsblk output:
BCACHE=bcache1
BACKING=/dev/sdc
BACKING_SIMPLE=sdc
Warning: Failure to correctly assign the above variables may result in data loss. Due to the way subsequent commands will be used, do not add ‘/dev/’ to the BCACHE variable.
2. Determine the cache set UUID.
Display the underlying devices of the bcache block:
ls -l /sys/block/$BCACHE/slaves
Sample output:
lrwxrwxrwx 1 root root 0 Oct 1 04:09 nvme0n1p1 -> ../../../../pci0000:80/0000:80:03.0/0000:82:00.0/nvme/nvme0/nvme0n1/nvme0n1p1
lrwxrwxrwx 1 root root 0 Oct 1 04:09 sdc -> ../../../../pci0000:00/0000:00:01.0/0000:02:00.0/host0/target0:0:2/0:0:2:0/block/sdc
This shows that the devices partaking in this bcache block are /dev/nvme0n1p1 and /dev/sdc. One of these should correspond to device BACKING from step #1.
Now use the backing device to obtain the cache set UUID:
sudo bcache-super-show $BACKING | grep cset.uuid
Output:
cset.uuid 7ee80bd1-97e3-464e-9b28-f26733c6dc2c
Assign it to a variable:
CSET_UUID=7ee80bd1-97e3-464e-9b28-f26733c6dc2c
3. Save the current cache mode.
Show the current cache mode:
sudo cat /sys/block/$BCACHE/bcache/cache_mode
Output:
writethrough [writeback] writearound none
The above output shows a mode of ‘writeback’. Assign it to a variable:
CACHE_MODE=writeback
4. Stop the bcache device.
echo 1 | sudo tee /sys/block/$BCACHE/bcache/stop
Check that the bcache device no longer exists:
ls -l /dev/$BCACHE
Ensure that the backing device is not associated with any bcache:
lsblk $BACKING
5. Clear the disk’s data blocks where bcache metadata is stored.
Warning: Incorrect device names will most likely result in data loss.
sudo wipefs -a $BACKING
sudo dd if=/dev/zero of=$BACKING bs=512 count=8
Info: If you are dealing with a faulty disk, the above commands may fail. You can ignore any error messages.
6. Replace the disk.
Follow the instructions from your hardware vendor to properly remove the disk and add a new one.
Important: For the remainder of this procedure it is assumed that the name of the newly added disk is identical to that of the old one. Re-assign the variable BACKING if this is not the case.
7. Create a new bcache device.
Create the new bcache device using the new disk as its backing device:
sudo make-bcache -B $BACKING
Sample output:
UUID: b3bc521e-ca04-458b-a106-db874e4f8c57
Set UUID: f60b6bc6-a623-42f8-8643-64a5cc8f98c6
version: 1
block_size: 1
data_offset: 16
UUID is the dev.uuid of the newly created bcache device. Assign it to a variable:
DEV_UUID=b3bc521e-ca04-458b-a106-db874e4f8c57
8. Determine the new bcache device.
Get the new bcache device name (the asterisk in the command is intended):
ls -d /sys/block/bcache*/slaves/$BACKING_SIMPLE
Sample output:
/sys/block/bcache1/slaves/sdc
From this output, it can be derived that the new bcache device is bcache1. Assign it to a variable:
NEW_BCACHE=bcache1
9. Add the new bcache device to the cache set.
Link the new bcache device to the bcache caching device:
echo $CSET_UUID | sudo tee /sys/block/$NEW_BCACHE/bcache/attach
Confirm the operation:
sudo bcache-super-show $BACKING | grep cset.uuid
The output must match the value of CSET_UUID.
10. Set the caching mode of the new bcache block.
Set the caching mode to the original value:
echo $CACHE_MODE | sudo tee /sys/block/$NEW_BCACHE/bcache/cache_mode
Confirm the operation:
sudo cat /sys/block/$NEW_BCACHE/bcache/cache_mode
The output must match the value of CACHE_MODE.
11. Confirm the status of the newly created bcache block.
To show the current status of the newly created bcache block:
cat /sys/block/$NEW_BCACHE/bcache/state
ls /sys/block/$NEW_BCACHE/slaves
The cache state should be ‘dirty’ or ‘clean’, but not ‘inconsistent’ or ‘no cache’. Both slaves should be correctly listed.
12. Modify the udev rule that creates the by-dname entry.
Edit the udev rule associated with the current by-dname entry so as to match DEV_UUID.
From the Pre-replace procedure:
OSD_DNAME=/dev/disk/by-dname/bcache3
The file to edit is therefore:
/etc/udev/rules.d/bcache3.rules
Replace the existing value of ENV{CACHED_UUID} with DEV_UUID, as in the example below:
# Written by curtin
SUBSYSTEM=="block", ACTION=="add|change", ENV{CACHED_UUID}=="b3bc521e-ca04-458b-a106-db874e4f8c57", SYMLINK+="disk/by-dname/bcache3"
Note: The SUBSYSTEM rule must be kept on a single line in the file.
Save and close the file.
13. Regenerate the udev rules.
Regenerate the udev rules to create the by-dname symbolic link:
sudo udevadm trigger
Confirm that the desired by-dname entry is now in place by recalling the original name of the bcache device (BCACHE=bcache1):
ls -la /dev/disk/by-dname/bcache* | egrep "bcache1$"
Output:
lrwxrwxrwx 1 root root 13 Oct 1 19:15 /dev/disk/by-dname/bcache3 -> ../../bcache1
The output must correspond to the original by-dname entry (OSD_DNAME=/dev/disk/by-dname/bcache3).
The new bcache device is now ready.
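As an optional final check, confirm that the new bcache device is visible and carries no leftover filesystem or LVM signatures before moving on:
lsblk -i -o NAME,FSTYPE /dev/$NEW_BCACHE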
Post-replace procedure
The values used in these steps originate from the Encrypted OSD with Vault scenario described in the Pre-replace procedure section.
1. Remove the OSD device from the charm’s database.
Even though the disk has already been wiped of data (lvm zap), the OSD device still needs to be removed from the ceph-osd charm’s internal database. From the Juju client do:
juju run-action --wait $OSD_UNIT zap-disk devices=$OSD_DNAME i-really-mean-it=yes
If bcache is not being used then refer to the underlying device (e.g. devices=/dev/sdc).
2. Ensure that the by-dname entries are up-to-date.
On unit OSD_UNIT do:
sudo udevadm trigger
Again confirm that the by-dname entries are visible, our OSD_DNAME in particular:
ls -la /dev/disk/by-dname/*
3. Recreate the OSD.
From the Juju client do:
juju run-action --wait $OSD_UNIT add-disk osd-devices=$OSD_DNAME
If bcache is not being used then refer to the underlying device (e.g. osd-devices=/dev/sdc).
4. Verify the newly added OSD.
On any ceph-mon unit do:
sudo ceph osd tree up
sudo ceph -s
The replaced OSD should be ‘up’ and ‘in’, and the cluster should be in the process of rebalancing data.
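To follow the recovery until the cluster returns to HEALTH_OK, the cluster activity can be watched from any ceph-mon unit (optional):
sudo ceph -w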