Replacing OSD disks

The procedural steps in this guide show how to recreate a Ceph OSD disk within a Charmed Ceph deployment.

Applicable use cases are defined by a combination of the following three factors:

  • the disk is associated with a bcache device created by MAAS
  • encryption:
    • the disk is not encrypted
    • the disk is LUKS encrypted without Vault
    • the disk is LUKS encrypted with Vault
  • disk replacement:
    • the existing disk is re-initialised
    • the disk is physically replaced with a new one

For example, a permutation of the above covered by the procedural steps would be:

  1. the disk is associated with a bcache device created by MAAS
  2. the disk is LUKS encrypted with Vault
  3. the existing disk is re-initialised

Since Ceph Luminous, BlueStore is the default storage backend for Ceph. BlueStore is assumed in these instructions.
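
To confirm the backend of a particular OSD, its metadata can be inspected on any ceph-mon unit (the OSD ID ‘4’ below is only an example):

sudo ceph osd metadata 4 | grep osd_objectstore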

Environment

These instructions use as a reference a Ceph cluster that has three ceph-osd units. Each of the unit machines has the following storage specifications:

  • Four SATA disks, one for root and three for bcache backing devices:

    • /dev/sdd (root)
    • /dev/sda
    • /dev/sdb
    • /dev/sdc
  • Three bcache devices are used as OSD disks:

    • /dev/bcache0
    • /dev/bcache1
    • /dev/bcache2
  • One NVMe disk partition for the cache used for all the bcache devices:

    • /dev/nvme0n1p1
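
On any of these unit machines, the layout can be confirmed at a glance with lsblk (device names are specific to this reference environment):

lsblk -o NAME,SIZE,TYPE,MOUNTPOINT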

Disk identification and properties

Identify the storage node and the target disk. This section shows commands, run on a ceph-osd unit or a ceph-mon unit, that will help gather this information.

Map OSDs to machine hosts and ceph-osd units

The below commands can be run on any ceph-mon unit.

To map OSDs to machine hosts:

sudo ceph osd tree

Sample output:

ID  CLASS  WEIGHT   TYPE NAME               STATUS  REWEIGHT  PRI-AFF
-1         8.18729  root default                                     
-3         2.72910      host node-gadomski                           
 1    ssd  0.90970          osd.1               up   1.00000  1.00000
 4    ssd  0.90970          osd.4               up   1.00000  1.00000
 6    ssd  0.90970          osd.6               up   1.00000  1.00000
-7         2.72910      host node-lepaute                            
 2    ssd  0.90970          osd.2               up   1.00000  1.00000
 5    ssd  0.90970          osd.5               up   1.00000  1.00000
 8    ssd  0.90970          osd.8               up   1.00000  1.00000
-5         2.72910      host node-pytheas                            
 0    ssd  0.90970          osd.0               up   1.00000  1.00000
 3    ssd  0.90970          osd.3               up   1.00000  1.00000
 7    ssd  0.90970          osd.7               up   1.00000  1.00000

To query an individual OSD:

sudo ceph osd find osd.4
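
The output is JSON. If only the host name is of interest, it can be filtered, assuming jq is available (field names may vary slightly between Ceph releases):

sudo ceph osd find osd.4 | jq -r '.crush_location.host'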

To map OSDs to ceph-osd units:

juju run -a ceph-osd mount | grep ceph

Sample output:

    tmpfs on /var/lib/ceph/osd/ceph-1 type tmpfs (rw,relatime)
    tmpfs on /var/lib/ceph/osd/ceph-4 type tmpfs (rw,relatime)
    tmpfs on /var/lib/ceph/osd/ceph-6 type tmpfs (rw,relatime)
  UnitId: ceph-osd/0
    tmpfs on /var/lib/ceph/osd/ceph-0 type tmpfs (rw,relatime)
    tmpfs on /var/lib/ceph/osd/ceph-3 type tmpfs (rw,relatime)
    tmpfs on /var/lib/ceph/osd/ceph-7 type tmpfs (rw,relatime)
  UnitId: ceph-osd/1
    tmpfs on /var/lib/ceph/osd/ceph-2 type tmpfs (rw,relatime)
    tmpfs on /var/lib/ceph/osd/ceph-5 type tmpfs (rw,relatime)
    tmpfs on /var/lib/ceph/osd/ceph-8 type tmpfs (rw,relatime)
  UnitId: ceph-osd/2
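
Alternatively, the OSD directories present on a single unit can be listed directly (ceph-osd/0 is used here as an example):

juju ssh ceph-osd/0 'sudo ls /var/lib/ceph/osd/'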

Get an overview of how devices are used

Use the lsblk and pvs commands on a ceph-osd unit. We’ll consider three scenarios:

  1. Unencrypted OSD
  2. Encrypted OSD without Vault
  3. Encrypted OSD with Vault

Vault is software for secrets management (see the charm for details).

Unencrypted OSD

lsblk -i -o NAME,FSTYPE

Sample (partial) output:

sda                                                                   bcache    
`-bcache0                                                             LVM2_member
  `-ceph--4397bb54--XXXX-osd--block--4397bb54--XXXX--8fc537870c13     ceph_bluestore
sdb                                                                   bcache    
`-bcache2                                                             LVM2_member
  `-ceph--0a7a51ae--XXXX-osd--block--0a7a51ae--XXXX--96435dcaa476     ceph_bluestore
sdc                                                                   bcache    
`-bcache1                                                             LVM2_member
  `-ceph--a9c36911--XXXX-osd--block--a9c36911--XXXX--1808c100300a     ceph_bluestore

The ‘bcache’ values in this output are the actual kernel device names. For instance, for the sda stanza it is bcache0. Each such device is associated with a “by-dname” value used by the ceph-osd charm and created by udev rules; this ensures persistent naming across reboots.

sudo pvs

Sample (partial) output:

PV           VG
/dev/bcache0 ceph-4397bb54-2b58-4b81-b681-8fc537870c13
/dev/bcache1 ceph-a9c36911-2a6b-4643-b4d8-1808c100300a 
/dev/bcache2 ceph-0a7a51ae-7709-4f8f-9b06-96435dcaa476

Encrypted OSD without Vault

lsblk -i -o NAME,FSTYPE

Sample (partial) output:

sda                                                                   bcache    
`-bcache0                                                             LVM2_member
  `-ceph--85727873--XXXX-osd--block--85727873--XXXX--dec2c9c25a83     crypto_LUKS
    `-jcnsgd-LdFy-nnle-DYXt-fdu1-cmYg-eKnutV                          ceph_bluestore
sdb                                                                   bcache    
`-bcache2                                                             LVM2_member
  `-ceph--e273e62f--XXXX-osd--block--e273e62f--XXXX--7e4f54a6a50a     crypto_LUKS
    `-JCAi7T-Q0lh-chEP-rehy-R843-FYPb-O5cRWO                          ceph_bluestore
sdc                                                                   bcache    
`-bcache1                                                             LVM2_member
  `-ceph--7a4dc4df--XXXX-osd--block--7a4dc4df--XXXX--5d1e3df749e1     crypto_LUKS
    `-guhvFj-S5mX-oncK-WLmA-LN3A-B9gd-ky3jbV                          ceph_bluestore

sudo pvs

Sample (partial) output:

PV           VG
/dev/bcache0 ceph-85727873-f06f-45b6-9aee-dec2c9c25a83
/dev/bcache1 ceph-7a4dc4df-4199-4d72-8f96-5d1e3df749e1
/dev/bcache2 ceph-e273e62f-c05b-4f54-9be2-7e4f54a6a50a

The encrypted device is the entry just below the one labelled crypto_LUKS. For example, for the sdb stanza, assign the encrypted device to a variable:

OSD_CRYPT=JCAi7T-Q0lh-chEP-rehy-R843-FYPb-O5cRWO
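
Optionally, confirm that this is an active dm-crypt mapping before going further:

sudo cryptsetup status $OSD_CRYPT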

Encrypted OSD with Vault

lsblk -i -o NAME,FSTYPE

Sample (partial) output:

sda                                                                   bcache    
`-bcache0                                                             crypto_LUKS
  `-crypt-868854eb-dd1a-47a2-9bce-dc503b2f0fd4                        LVM2_member
    `-ceph--868854eb--XXXX-osd--block--868854eb--XXXX--dc503b2f0fd4   ceph_bluestore
sdb                                                                   bcache    
`-bcache2                                                             crypto_LUKS
  `-crypt-c4897473-ac04-4b86-a143-12e2322c6eb5                        LVM2_member
    `-ceph--c4897473--XXXX-osd--block--c4897473--XXXX--12e2322c6eb5   ceph_bluestore
sdc                                                                   bcache    
`-bcache1                                                             crypto_LUKS
  `-crypt-13d2f2a3-2e20-40e2-901a-385b12e372a2                        LVM2_member
    `-ceph--13d2f2a3--XXXX-osd--block--13d2f2a3--XXXX--385b12e372a2   ceph_bluestore

sudo pvs

Sample (partial) output:

PV                                                     VG
/dev/mapper/crypt-13d2f2a3-2e20-40e2-901a-385b12e372a2 ceph-13d2f2a3-2e20-40e2-901a-385b12e372a2
/dev/mapper/crypt-868854eb-dd1a-47a2-9bce-dc503b2f0fd4 ceph-868854eb-dd1a-47a2-9bce-dc503b2f0fd4
/dev/mapper/crypt-c4897473-ac04-4b86-a143-12e2322c6eb5 ceph-c4897473-ac04-4b86-a143-12e2322c6eb5

The encrypted device is the entry just below the one labelled crypto_LUKS. For example, for the sdc stanza, assign the encrypted device to a variable:

OSD_CRYPT=crypt-13d2f2a3-2e20-40e2-901a-385b12e372a2

Map ceph-osd unit, OSD UUID, PV, and VG

Here, a specific OSD (osd.4) is being queried on a ceph-osd unit:

sudo ceph-volume lvm list | grep -A 12 "= osd.4 =" | grep 'osd fsid'

Sample output:

osd fsid                  13d2f2a3-2e20-40e2-901a-385b12e372a2

This UUID in combination with the lsblk and pvs outputs allows us to determine the PV and VG that correspond to the OSD disk. This is because the name of the VG is based on the OSD_UUID.

For example, for the Encrypted OSD with Vault scenario:

OSD_UUID=13d2f2a3-2e20-40e2-901a-385b12e372a2
OSD_PV=/dev/mapper/crypt-13d2f2a3-2e20-40e2-901a-385b12e372a2
OSD_VG=ceph-13d2f2a3-2e20-40e2-901a-385b12e372a2
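
These two values can be cross-checked in one step with pvs (an optional convenience):

sudo pvs -o pv_name,vg_name | grep $OSD_UUID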

And since the OSD is encrypted:

OSD_CRYPT=crypt-13d2f2a3-2e20-40e2-901a-385b12e372a2

Also make note of the ceph-osd unit being queried (here: ceph-osd/0):

OSD_UNIT=ceph-osd/0

Discover the disk by-dname entry

The by-dname device name is a symbolic link to the actual kernel device used by the disk. For example, for the Encrypted OSD with Vault scenario, with an OSD_UUID of 13d2f2a3-2e20-40e2-901a-385b12e372a2, the actual device is bcache1. We can use that in the below command to find the by-dname entry.

On a ceph-osd unit:

ls -la /dev/disk/by-dname/bcache* | egrep "bcache1$"

Sample output:

lrwxrwxrwx 1 root root 13 Oct  1 04:28 /dev/disk/by-dname/bcache3 -> ../../bcache1

Assign the by-dname entry to a variable:

OSD_DNAME=/dev/disk/by-dname/bcache3
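
A quick sanity check is to resolve the link back to the kernel device; in this example the result should be /dev/bcache1:

readlink -f $OSD_DNAME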

Pre-replace procedure

The Encrypted OSD with Vault example scenario will be used in these steps. The target disk has been identified according to the following properties:

OSD_UNIT=ceph-osd/0
OSD=osd.4
OSD_ID=4
OSD_UUID=13d2f2a3-2e20-40e2-901a-385b12e372a2
OSD_PV=/dev/mapper/crypt-13d2f2a3-2e20-40e2-901a-385b12e372a2
OSD_DNAME=/dev/disk/by-dname/bcache3
OSD_VG=ceph-13d2f2a3-2e20-40e2-901a-385b12e372a2
OSD_CRYPT=crypt-13d2f2a3-2e20-40e2-901a-385b12e372a2

The procedural steps will refer to these variables.

  1. Mark the OSD as ‘out’.

    Ensure cluster health is good (HEALTH_OK) before and after.

    On any ceph-mon unit do:

    sudo ceph -s
    sudo ceph osd out $OSD
    sudo ceph -s
    
  2. Stop the OSD daemon.

    Reweight the OSD volume to zero (to prevent data rebalancing from occurring) and then stop the OSD daemon. From the Juju client do:

    juju run-action --wait ceph-mon/leader change-osd-weight osd=$OSD_ID weight=0
    juju run-action --wait $OSD_UNIT stop osds=$OSD_ID
    
  3. Confirm that the OSD is ‘down’.

    On any ceph-mon unit do:

    sudo ceph osd tree down
    

    Sample output:

    ID  CLASS  WEIGHT   TYPE NAME               STATUS  REWEIGHT  PRI-AFF
    -1         8.18729  root default                                     
    -3         2.72910      host node-gadomski                           
     4    ssd  0.90970          osd.4             down         0  1.00000
    
  4. Clean up the OSD’s resources.

    On unit OSD_UNIT perform cleanup actions based on the given scenario:

    If not encrypted:

    sudo vgremove -y $OSD_VG
    sudo pvremove -y $OSD_PV
    

    If encrypted without Vault:

    sudo cryptsetup close /dev/mapper/$OSD_CRYPT
    sudo vgremove -y $OSD_VG
    sudo pvremove -y $OSD_PV
    

    If encrypted with Vault:

    sudo vgremove -y $OSD_VG
    sudo cryptsetup close /dev/mapper/$OSD_CRYPT
    sudo systemctl disable vaultlocker-decrypt@$OSD_UUID
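
    Whichever scenario applies, lsblk can be used afterwards to confirm that the crypt and LVM layers are gone and only the bare bcache device remains (a quick optional check):

    lsblk $OSD_DNAME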
    
  5. Wipe the OSD disk of data.

    On unit OSD_UNIT do:

    sudo ceph-volume lvm zap --destroy $OSD_DNAME
    

    Ensure the OSD_DNAME entry exists for when the charm action zap-disk is eventually run:

    sudo udevadm trigger
    ls -la /dev/disk/by-dname/*
    

    Note: The zap-disk action will be needed in order to remove the OSD device from the ceph-osd charm’s internal database.

  6. Purge the Ceph cluster of all traces of the OSD.

    From the Juju client do:

    juju run-action --wait ceph-mon/leader purge-osd osd=$OSD_ID i-really-mean-it=yes
    

Replace procedure

Skip this entire section if the disk does not need to be physically replaced. Continue to the Post-replace procedure.

The values used in these steps originate with the Encrypted OSD with Vault scenario described in the Pre-replace procedure section.

All the commands in this section are invoked on unit OSD_UNIT.

  1. Confirm the bcache device and its backing device.

    The disk being replaced is the bcache backing device but in order to replace it we must also take into account its associated bcache device. Confirm the device names based on the lsblk output:

    BCACHE=bcache1
    BACKING=/dev/sdc
    BACKING_SIMPLE=sdc
    

    Warning: Failure to correctly assign the above variables may result in data loss. Due to the way subsequent commands will be used, do not add ‘/dev/’ to the BCACHE variable.
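
    One way to double-check the pairing before proceeding is to list the bcache device together with the disks beneath it; both the backing disk and the cache partition should appear:

    lsblk --inverse /dev/$BCACHE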

  2. Determine the cache set UUID.

    Display the underlying devices of the bcache block:

    ls -l /sys/block/$BCACHE/slaves
    

    Sample output:

    lrwxrwxrwx 1 root root 0 Oct  1 04:09 nvme0n1p1 -> 
    ../../../../pci0000:80/0000:80:03.0/0000:82:00.0/nvme/nvme0/nvme0n1/nvme0n1p1
    lrwxrwxrwx 1 root root 0 Oct  1 04:09 sdc -> 
    ../../../../pci0000:00/0000:00:01.0/0000:02:00.0/host0/target0:0:2/0:0:2:0/block/sdc
    

    This shows that the devices making up this bcache device are /dev/nvme0n1p1 and /dev/sdc. One of them should correspond to device BACKING from step #1.

    Now use the backing device to obtain the cache set UUID:

    sudo bcache-super-show $BACKING | grep cset.uuid
    

    Output:

    cset.uuid               7ee80bd1-97e3-464e-9b28-f26733c6dc2c
    

    Assign it to a variable:

    CSET_UUID=7ee80bd1-97e3-464e-9b28-f26733c6dc2c
    
  3. Save the current cache mode.

    Show the current cache mode:

    sudo cat /sys/block/$BCACHE/bcache/cache_mode
    

    Output:

    writethrough [writeback] writearound none
    

    The above output shows a mode of ‘writeback’. Assign it to a variable:

    CACHE_MODE=writeback
    
  4. Stop the bcache device.

    echo 1 | sudo tee /sys/block/$BCACHE/bcache/stop
    

    Check that the bcache device no longer exists:

    ls -l /dev/$BCACHE
    

    Ensure that the backing device is not associated with any bcache:

    lsblk $BACKING
    
  5. Clear the disk’s data blocks where bcache metadata is stored.

    Warning: Incorrect device names will most likely result in data loss.

    sudo wipefs -a $BACKING
    sudo dd if=/dev/zero of=$BACKING bs=512 count=8
    

    Info: If you are dealing with a faulty disk, the above commands may fail. You can ignore any error messages.

  6. Replace the disk.

    Follow the instructions from your hardware vendor to properly remove the disk and add a new one.

    Important: For the remainder of this procedure it is assumed that the name of the newly added disk is identical to that of the old one. Re-assign the variable BACKING if this is not the case.

  7. Create a new bcache device.

    Create the new bcache device using the new disk as its backing device:

    sudo make-bcache -B $BACKING
    

    Sample output:

    UUID:                   b3bc521e-ca04-458b-a106-db874e4f8c57
    Set UUID:               f60b6bc6-a623-42f8-8643-64a5cc8f98c6
    version:                1
    block_size:             1
    data_offset:            16
    

    UUID is the dev.uuid of the newly created bcache device. Assign it to a variable:

    DEV_UUID=b3bc521e-ca04-458b-a106-db874e4f8c57
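
    If the new bcache device does not appear under /sys/block automatically, it can usually be registered by hand (normally udev does this on its own):

    echo $BACKING | sudo tee /sys/fs/bcache/register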
    
  8. Determine the new bcache device.

    Get the new bcache device name (the asterisk in the command is intentional):

    ls -d /sys/block/bcache*/slaves/$BACKING_SIMPLE
    

    Sample output:

    /sys/block/bcache1/slaves/sdc
    

    From this output, it can be derived that the new bcache device is bcache1. Assign it to a variable:

    NEW_BCACHE=bcache1
    
  9. Add the new bcache device to the cache set.

    Link the new bcache device to the bcache caching device:

    echo $CSET_UUID | sudo tee /sys/block/$NEW_BCACHE/bcache/attach
    

    Confirm the operation:

    sudo bcache-super-show $BACKING | grep cset.uuid
    

    The output must match the value of CSET_UUID.

  10. Set the caching mode of the new bcache block.

    Set the caching mode to the original value:

    echo $CACHE_MODE | sudo tee /sys/block/$NEW_BCACHE/bcache/cache_mode
    

    Confirm the operation:

    sudo cat /sys/block/$NEW_BCACHE/bcache/cache_mode
    

    The output must match the value of CACHE_MODE.

  11. Confirm the status of the newly created bcache block.

    To show the current status of the newly created bcache block:

    cat /sys/block/$NEW_BCACHE/bcache/state
    ls /sys/block/$NEW_BCACHE/slaves
    

    The cache state should be ‘dirty’ or ‘clean’, but not ‘inconsistent’ or ‘no cache’. Both slaves should be correctly listed.

  12. Modify the udev rule that creates the by-dname entry.

    Edit the udev rule associated with the current by-dname entry so as to match DEV_UUID.

    From the Pre-replace procedure:

    OSD_DNAME=/dev/disk/by-dname/bcache3
    

    The file to edit is therefore:

    /etc/udev/rules.d/bcache3.rules

    Replace the existing value of ENV{CACHED_UUID} with DEV_UUID, as in the example below:

    # Written by curtin
    SUBSYSTEM=="block", ACTION=="add|change",
    ENV{CACHED_UUID}=="b3bc521e-ca04-458b-a106-db874e4f8c57",
    SYMLINK+="disk/by-dname/bcache3"
    

    Note: The values should be on a single line. The above content has been formatted to improve the legibility of these instructions.

    Save and close the file.
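
    If preferred, the substitution can be scripted; the example below keeps a backup of the rule file and assumes the file name and values shown above, so review the result before continuing:

    sudo sed -i.bak "s|ENV{CACHED_UUID}==\"[^\"]*\"|ENV{CACHED_UUID}==\"$DEV_UUID\"|" /etc/udev/rules.d/bcache3.rules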

  13. Regenerate the udev rules.

    Regenerate the udev rules to create the by-dname symbolic link:

    sudo udevadm trigger
    

    Confirm that the desired by-dname entry is now in place by recalling the original name of the bcache device (BCACHE=bcache1):

    ls -la /dev/disk/by-dname/bcache* | egrep "bcache1$"
    

    Output:

    lrwxrwxrwx 1 root root 13 Oct  1 19:15 /dev/disk/by-dname/bcache3 -> ../../bcache1
    

    The output must correspond to the original by-dname entry (OSD_DNAME=/dev/disk/by-dname/bcache3).

    The new bcache device is now ready.

Post-replace procedure

The values used in these steps originate with the Encrypted OSD with Vault scenario described in the Pre-replace procedure section.

  1. Remove the OSD device from the charm’s database.

    Even though the disk has already been wiped of data (lvm zap), the OSD device still needs to be removed from the ceph-osd charm’s internal database.

    From the Juju client do:

    juju run-action --wait $OSD_UNIT zap-disk devices=$OSD_DNAME i-really-mean-it=yes
    

    If bcache is not being used then refer to the underlying device (e.g. devices=/dev/sdc).

  2. Ensure that the by-dname entries are up-to-date.

    On unit OSD_UNIT do:

    sudo udevadm trigger
    

    Again confirm that the by-dname entries are visible, our OSD_DNAME in particular:

    ls -la /dev/disk/by-dname/*
    
  3. Recreate the OSD.

    From the Juju client do:

    juju run-action --wait $OSD_UNIT add-disk osd-devices=$OSD_DNAME
    

    If bcache is not being used then refer to the underlying device (e.g. osd-devices=/dev/sdc).

  4. Verify the newly added OSD.

    On any ceph-mon unit do:

    sudo ceph osd tree up
    sudo ceph -s
    

    The replaced OSD should be ‘up’ and the cluster should be in the process of rebalancing.
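
    Rebalancing progress can be followed interactively if desired:

    sudo ceph -w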
