Also available in this series:
Part 1: Introduction
Part 2: Architecture for Dummies
Part 3: Design the nodes
Part 4: Deploy the nodes in the lab
Part 5: Install Ceph in the lab
Part 6: Mount Ceph as a block device on linux machines
Part 8: Veeam clustered repository
Part 9: Failover scenarios during Veeam backups
Part 10: Upgrade the cluster
At the end of Part 6, we finally mounted our Ceph cluster as a block device on a Linux server and started to use it. I described how to create an RBD device and how this thin-provisioned volume can be expanded. There is however a moment when all the available space of the existing cluster is consumed, and the only way to further increase its size is to add another node. This is where Ceph shows its scale-out capabilities: in this part you will see how quickly and, above all, how transparently you can add an additional node and rebalance the resources of the expanded cluster.
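Before deciding to expand, it's worth checking how much free space is actually left in the cluster. As a minimal check, you can run these two commands from any node holding the admin keyring:

# global and per-pool usage of the cluster
ceph df
# one-line summary, including used/available space and health
ceph status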
Prepare the new node
Obviously, we first need a new server to be added as an additional OSD node. You can follow Part 4 to learn how to properly create, install and configure an OSD node based on CentOS 7 (a quick recap of the admin-side preparation follows the list below). In my lab, I'm going to add a new virtual machine with these parameters:
osd4.skunkworks.local, frontend network 10.2.50.204, replication network 10.2.0.204
CentOS 7.0, 2 vCPU, 2 GB RAM, and this disk layout:
disk0 (boot) = 20 GB
disk1 (journal) = 30 GB (running over SSD)
disk2 (data) = 100 GB
disk3 (data) = 100 GB
disk4 (data) = 100 GB
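As a quick reminder of the admin-side preparation described in Part 4, here is a minimal sketch for registering the new node on the admin machine; the hostname, IP and the "ceph" deployment user are the ones used in my lab, so adapt them to your environment:

# on the admin node: make the new node resolvable by name
echo "10.2.50.204 osd4.skunkworks.local osd4" | sudo tee -a /etc/hosts

# copy the SSH key of the deployment user to allow passwordless logins from ceph-deploy
ssh-copy-id ceph@osd4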
Once CentOS 7 has been installed and configured as described in Part 4, we are ready to deploy Ceph on it. As before, from the admin console, you need to run:
ceph-deploy install osd4
As explained in Part 5, this command will update the repositories if necessary (which will probably happen every time on a clean machine, since the main repository will be epel…) and install Ceph and all its dependencies. If during the installation you encounter an error like this:
Error: Package: 1:python-rados-0.80.7-0.4.el7.x86_64 (epel)
       Requires: librados2 = 1:0.80.7
       Installed: 1:librados2-0.87-0.el7.centos.x86_64 (@Ceph)
           librados2 = 1:0.87-0.el7.centos
       Available: 1:librados2-0.86-0.el7.centos.x86_64 (Ceph)
           librados2 = 1:0.86-0.el7.centos
It's because, at some point in early 2015, yum changed its behavior and stopped honoring the priorities set on custom repositories. The quick fix is to run this command on the osd4 machine before ceph-deploy:
echo "check_obsoletes=1" >> /etc/yum/pluginconf.d/priorities.conf
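If you want to double-check that the workaround is in place before retrying ceph-deploy, a quick look at the plugin configuration is enough (this is just my own sanity check, not a required step):

# on osd4: the option should now appear at the end of the priorities plugin config
grep check_obsoletes /etc/yum/pluginconf.d/priorities.conf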
Once the error is fixed, let the installation run; at the end of the process, you should see lines like these:
[osd4][DEBUG ] Complete!
[osd4][INFO  ] Running command: sudo ceph --version
[osd4][DEBUG ] ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
This will give you the confirmation Ceph is installed correctly on the new node. Let’s check first that ceph-deploy is able to see the disks of the new node:
ceph-deploy disk list osd4
The output is what we are expecting:
[osd4][DEBUG ] /dev/sda :
[osd4][DEBUG ]  /dev/sda1 other, xfs, mounted on /boot
[osd4][DEBUG ]  /dev/sda2 other, LVM2_member
[osd4][DEBUG ] /dev/sdb :
[osd4][DEBUG ]  /dev/sdb1 other
[osd4][DEBUG ]  /dev/sdb2 other
[osd4][DEBUG ]  /dev/sdb3 other
[osd4][DEBUG ] /dev/sdc :
[osd4][DEBUG ]  /dev/sdc1 other
[osd4][DEBUG ] /dev/sdd :
[osd4][DEBUG ]  /dev/sdd1 other
[osd4][DEBUG ] /dev/sde :
[osd4][DEBUG ]  /dev/sde1 other
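Note that /dev/sdb already shows the three journal partitions, prepared during the node setup as described in Part 4. If you still need to create them, a minimal sketch with parted could look like this (the three equal slices are just the layout used in my lab, adjust the sizes to your needs):

# on osd4: create a GPT label and three journal partitions on the SSD
sudo parted -s /dev/sdb mklabel gpt
sudo parted -s /dev/sdb mkpart journal1 0% 33%
sudo parted -s /dev/sdb mkpart journal2 33% 66%
sudo parted -s /dev/sdb mkpart journal3 66% 100%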
Time to create the new OSDs and their journals. As we did with the previous nodes, here are the commands (read Part 5 again to learn the details about them):
ceph-deploy disk zap osd4:sdc osd4:sdd osd4:sde
ceph-deploy osd create osd4:sdc:/dev/sdb1 osd4:sdd:/dev/sdb2 osd4:sde:/dev/sdb3
After the preparation, you should receive an output like this:
[osd4][INFO  ] checking OSD status...
[osd4][INFO  ] Running command: sudo ceph --cluster=ceph osd stat --format=json
[ceph_deploy.osd][DEBUG ] Host osd4 is now ready for osd use.
The new OSDs on server osd4 are ready to be used. The last step is to copy the administration keys to the node, so it can be managed locally (otherwise you would have to run every command from the admin node):
ceph-deploy admin osd4
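A quick way to verify the keys are in place is to run any cluster command directly on osd4; this is just a sanity check, not a required step:

# on osd4: the admin keyring pushed by ceph-deploy is readable by root only
sudo ceph health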
Add the new node to the cluster
Well, in reality, there is nothing more to do on the cluster, since the previous procedure has already added osd4 to the running cluster! Just as a reminder, this was the situation before the addition of the 4th OSD node (use the command “ceph status”):
cluster aa4d1282-c606-4d8d-8f69-009761b63e8f
 health HEALTH_OK
 monmap e1: 3 mons at {mon1=10.2.50.211:6789/0,mon2=10.2.50.212:6789/0,mon3=10.2.50.213:6789/0}, election epoch 6, quorum 0,1,2 mon1,mon2,mon3
 osdmap e47: 9 osds: 9 up, 9 in
  pgmap v115: 256 pgs, 1 pools, 0 bytes data, 0 objects
        310 MB used, 899 GB / 899 GB avail
             256 active+clean
But if you run the same command after the preparation of osd4, this is the new output:
cluster aa4d1282-c606-4d8d-8f69-009761b63e8f
 health HEALTH_OK
 monmap e1: 3 mons at {mon1=10.2.50.211:6789/0,mon2=10.2.50.212:6789/0,mon3=10.2.50.213:6789/0}, election epoch 8, quorum 0,1,2 mon1,mon2,mon3
 osdmap e93: 12 osds: 12 up, 12 in
  pgmap v7717: 256 pgs, 1 pools, 15136 kB data, 47 objects
        487 MB used, 1198 GB / 1199 GB avail
             256 active+clean
As you can see, the 3 new OSDs were added to the cluster, which has grown from 9 to 12 OSDs, and the total available space is now roughly 1200 GB, up from 900; the difference is exactly the 300 GB available on osd4. Super easy!
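As soon as the new OSDs join, CRUSH starts remapping placement groups onto them, so for a while you may see data moving around the cluster. If you want to follow the rebalance in real time, one option is:

# continuously stream cluster events and recovery/backfill progress (Ctrl+C to exit)
ceph -w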
Remove a node
A Ceph cluster can dynamically grow, but also shrink. Think carefully about disk utilization before ANY operation involving the decommission of a node: if the data stored in the cluster is more than the surviving nodes can hold, the cluster will end up in a degraded state because there will not be enough space to recreate the replicated copies of all objects, or even worse there will not be enough space on the surviving OSDs to hold the volumes themselves. So, be careful when dismissing a node.
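A simple way to verify this in advance is to compare the overall usage of the cluster with the capacity you are about to remove. As a rough check (the mount points below follow the default naming, so adapt them if your cluster name is not "ceph"):

# from any node with the admin keyring: global used/available space
ceph df

# on the node to be removed: how much data its OSDs are currently holding
df -h /var/lib/ceph/osd/ceph-*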
Apart from these considerations, right now there is no ceph-deploy command to decommission a node (ceph-deploy destroy is in the works as of April 2015, when I'm writing this article), but you can reach the same result with a combination of commands. First, identify the OSDs running on a given node:
[ceph@ceph-admin ceph-deploy]$ ceph osd tree
# id    weight  type name       up/down reweight
-1      1.2     root default
-2      0.3             host osd1
0       0.09999                 osd.0   up      1
1       0.09999                 osd.1   up      1
2       0.09999                 osd.2   up      1
-3      0.3             host osd2
3       0.09999                 osd.3   up      1
4       0.09999                 osd.4   up      1
5       0.09999                 osd.5   up      1
-4      0.3             host osd3
6       0.09999                 osd.6   up      1
7       0.09999                 osd.7   up      1
8       0.09999                 osd.8   up      1
-5      0.3             host osd4
9       0.09999                 osd.9   up      1
10      0.09999                 osd.10  up      1
11      0.09999                 osd.11  up      1
Say we want to remove the node we just added, osd4. Its OSDs are osd.9, osd.10 and osd.11. For each of them, the commands to run directly on the OSD node are:
ceph osd out <osd.id>
sudo service ceph stop <osd.id>
sudo umount /var/lib/ceph/osd/<cluster>-<id>
ceph osd crush remove <osd.id>
ceph auth del <osd.id>
ceph osd rm <osd.id>
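For example, for the first OSD of osd4 (osd.9), and assuming the default cluster name "ceph", the sequence would look like this:

ceph osd out osd.9                       # mark the OSD out so its data is remapped elsewhere
sudo service ceph stop osd.9             # stop the OSD daemon
sudo umount /var/lib/ceph/osd/ceph-9     # unmount its data partition
ceph osd crush remove osd.9              # remove it from the CRUSH map
ceph auth del osd.9                      # delete its authentication key
ceph osd rm osd.9                        # finally remove the OSD from the cluster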
The OSDs are removed from the cluster, and CRUSH will immediately rebalance data among the surviving OSDs to guarantee the replication rules are still satisfied. After this, you can remove the node itself from the cluster using ceph-deploy:
ceph-deploy purge <node>
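One leftover you may still see in ceph osd tree is the now empty host bucket for osd4; if you want to clean it up as well, it can be removed from the CRUSH map (run this from the admin node):

# remove the empty host bucket from the CRUSH map
ceph osd crush remove osd4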
In the future, there will be a simple command to remove an OSD (still in the works):
ceph-deploy osd destroy {host-name}:{path-to-disk}[:{path/to/journal}]
Maintenance mode
Finally, a quick tip on how to properly manage a maintenance situation. Whenever a node is unavailable in a Ceph cluster, the CRUSH algorithm starts to rebalance the objects among the available nodes to guarantee consistency and availability. However, if you are planning maintenance activities on one of the OSD nodes and you know the node will come back later, there is no point in spending a lot of I/O and network bandwidth, and thus reducing the performance of the cluster, to rebalance the cluster itself; also, especially on large nodes holding many TBs of data, even a simple rebalance is a heavy operation.
Before working on a node, you simply run:
ceph osd set noout
This command doesn't actually put a node in maintenance mode. What it does is prevent any OSD from being marked out of the cluster. Because of this, the PG replica count can't be properly honored while an OSD is down, so the cluster will report a degraded state, but recovery will not start. To stop an OSD, the command is:
sudo service ceph stop osd.N
(remember you can get the list of OSDs using ceph osd tree). Once all the OSDs on a node are stopped (or you can even disable the entire ceph service if you are planning multiple reboots…), you are free to work on the stopped node; replication will not happen for the OSDs involved in the maintenance, while all other objects will still be replicated. Once the maintenance is over, you can restart the OSD services:
sudo service ceph start osd.N
and finally remove the noout option:
ceph osd unset noout
After a while, the status of the cluster should be back to normal.
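To recap, here is the whole maintenance sequence for osd4 in my lab (its OSDs are osd.9, osd.10 and osd.11); adapt the OSD IDs to your own node:

# before the maintenance: prevent OSDs from being marked out, then stop the node's OSDs
ceph osd set noout
sudo service ceph stop osd.9
sudo service ceph stop osd.10
sudo service ceph stop osd.11

# ...do the maintenance work, rebooting the node if needed...

# after the maintenance: restart the OSDs and restore the normal behavior
sudo service ceph start osd.9
sudo service ceph start osd.10
sudo service ceph start osd.11
ceph osd unset noout

# verify the cluster goes back to HEALTH_OK
ceph status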