In the previous episode of this series, I described how I’m going to design and configure my nodes for Ceph Storage. Finally, it’s time to hit the lab and deploy those servers.
Also available in this series:
Part 1: Introduction
Part 2: Architecture for Dummies
Part 3: Design the nodes
Part 5: Install Ceph in the lab
Part 6: Mount Ceph as a block device on Linux machines
Part 7: Add a node and expand the cluster storage
Part 8: Veeam clustered repository
Part 9: Failover scenarios during Veeam backups
Part 10: Upgrade the cluster
A recap of my design
This is my Ceph lab; it's always a good idea to print out all the configurations before starting to create the cluster.
3 * OSD servers:
osd1.skunkworks.local, frontend 10.2.50.201, replication 10.2.0.201
osd2.skunkworks.local, frontend 10.2.50.202, replication 10.2.0.202
osd3.skunkworks.local, frontend 10.2.50.203, replication 10.2.0.203
Each server is a virtual machine running CentOS 7.0, with 2 vCPUs, 2 GB of RAM, and this disk layout:
disk0 (boot) = 20 GB
disk1 (journal) = 30 GB (running over SSD)
disk2 (data) = 100 GB
disk3 (data) = 100 GB
disk4 (data) = 100 GB
3 * MON servers:
mon1.skunkworks.local, 10.2.50.211
mon2.skunkworks.local, 10.2.50.212
mon3.skunkworks.local, 10.2.50.213
Each server is a virtual machine running CentOS 7.0, with 2 vCPUs, 2 GB of RAM, and a single 20 GB disk. There will also be a seventh machine, a simple Linux VM used as the admin node:
ceph-admin.skunkworks.local, 10.2.50.125
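All the commands in the rest of this series use the short host names (osd1, mon1, and so on). This is not part of the design itself, but if you don't have DNS records for the lab, a simple way to make those names resolvable is to add the frontend addresses to /etc/hosts on every machine:
10.2.50.201 osd1.skunkworks.local osd1
10.2.50.202 osd2.skunkworks.local osd2
10.2.50.203 osd3.skunkworks.local osd3
10.2.50.211 mon1.skunkworks.local mon1
10.2.50.212 mon2.skunkworks.local mon2
10.2.50.213 mon3.skunkworks.local mon3
10.2.50.125 ceph-admin.skunkworks.local ceph-admin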
Why CentOS?
First of all, Ceph runs on Linux. If you read around, many docs and posts explain how to configure and use Ceph on Ubuntu Linux. The reason is that Ceph requires a fairly recent kernel and other up-to-date libraries to run properly, and Ubuntu is usually one of the most frequently and quickly updated Linux distributions, always shipping really recent components. Before CentOS 7.0, I would have gone for Ubuntu myself, as CentOS 6.6 (the latest 6.x at the time of this blog post) uses kernel 2.6.32. Ceph technically also supports CentOS 6.x, as you can check in the OS Recommendations page, but there are some limitations if you use an old kernel. With CentOS 7, the kernel in use is at least 3.10, so I can finally use my preferred distribution for Ceph too.
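If you want to double-check this on your freshly installed nodes, a quick look at the running kernel is enough (on CentOS 7.0 you should see a 3.10.0 release):
uname -r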
For my installation, I created the different virtual machines, configured the network, and installed CentOS 7.0 using the Minimal install. I like this option because it reduces the disk footprint to a minimum, without installing any unneeded components. Once installation and network configuration are complete, patch the machines so they are fully up to date.
Prepare the Linux machine
Before starting to deploy Ceph, there are some activities that need to be completed on every Linux machine. Some of them can be done directly in the new CentOS 7 installer. If you plan to use a different distribution, or you are not using the installer, configure these options manually. The configuration screen has almost everything we need to set up the server:
Here you configure networking, the hostname, and the timezone. In “Installation Destination” you select the first disk we created, the 20 GB one:
The only other option to configure is users. You have to set up a password for root, and you can create an additional user. The “ceph-deploy” utility, which I will explain later, needs to log in to each server with a user that can do password-less sudo. You could use root directly, but as usual this is not recommended; for this reason I created a user called “cephuser”. DO NOT create a user called “ceph”: starting from version v9.2 (Infernalis), Ceph automatically creates a user named exactly ceph.
That’s it. The installer will complete in a few minutes, and after a reboot the server will be reachable on the network directly via SSH. If you are using a template that does not have the cephuser user, creating it afterwards is quite simple:
useradd -d /home/cephuser -m cephuser
passwd cephuser
echo "cephuser ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/cephuser
chmod 0440 /etc/sudoers.d/cephuser
There are a few additional steps to complete:
– time sync is incredibly important in Ceph. Being a scale-out system with synchronous replication, nodes need to have exactly the same time, otherwise bad things can happen, especially on the monitor nodes. Note that by default the maximum allowed clock drift between nodes is 0.05 seconds! So, let's make sure time is in sync:
yum install -y ntp ntpdate ntp-doc
The default NTP configuration should already contain NTP servers, but just in case, check the /etc/ntp.conf file to verify that these lines are there (or change them to whatever you prefer):
server 0.centos.pool.ntp.org iburst
server 1.centos.pool.ntp.org iburst
server 2.centos.pool.ntp.org iburst
server 3.centos.pool.ntp.org iburst
Finally, fix any drift in the clock and make sure NTP runs automatically at boot:
ntpdate 0.us.pool.ntp.org
hwclock --systohc
systemctl enable ntpd.service
systemctl start ntpd.service
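Once ntpd is running, you can also check that it is actually talking to its time sources; this is not strictly required, but it's a quick way to spot problems early:
ntpq -p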
– if you are running the nodes as VMware virtual machines, install Open Virtual Machine Tools. This is the open-source version of VMware Tools, and VMware itself recommends installing this package on CentOS 7:
yum install -y open-vm-tools
After the tools are installed, you will see them as “3rd-party/Independent” in vCenter. Don’t worry, it’s correct.
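If you also want to confirm that the tools daemon is running inside the guest (the service shipped by the package is called vmtoolsd), you can check it with systemctl:
systemctl status vmtoolsd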
– disable the firewall. Remember, I'm working in a lab; in production environments you should leave the firewall enabled and create dedicated rules (Ceph monitors communicate on port 6789 by default, and Ceph OSDs communicate in the port range 6800:7300 by default):
systemctl disable firewalld
systemctl stop firewalld
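For reference, if you wanted to keep firewalld enabled instead, the rules would look more or less like this; it's only a sketch based on the default ports mentioned above, so adjust it to your environment:
firewall-cmd --zone=public --add-port=6789/tcp --permanent
firewall-cmd --zone=public --add-port=6800-7300/tcp --permanent
firewall-cmd --reload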
Let's also disable SELinux; this is a lab environment, so let's not complicate things by creating SELinux rules:
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
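The change in the config file only takes effect after a reboot (which we will do anyway at the end of the preparation); if you want SELinux out of the way immediately, you can also put it in permissive mode for the current session:
setenforce 0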
Edit the sudoers file using visudo and comment out the line “Defaults requiretty”, or use this command:
sed -i 's/Defaults requiretty/#Defaults requiretty/g' /etc/sudoers
Usually visudo is the accepted method for editing the sudoers file, because it performs sanity checks before saving changes. But if you feel confident that you wrote the line correctly, go for it; I have never had issues editing sudoers with sed.
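If you do go the sed route, you can still ask visudo to validate the resulting file without opening it, just to be on the safe side:
visudo -c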
Finally, update the system:
yum -y update
and reboot the server to apply all the changes.
A couple of additional operations need to be done on the admin machine. First, we need to configure password-less SSH for the “cephuser” user: on the administration machine where you will run ceph-deploy, create the same “cephuser” user. After logging in to the machine as this user, run ssh-keygen to create its SSH keys, using a blank passphrase. Finally, copy the SSH key to each Ceph node with:
ssh-copy-id cephuser@osd1 (repeat the command for each Ceph node, like osd2, mon1 and so on…)
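Since the same command has to be repeated for every node, a small shell loop can save some typing (assuming the six node names used in this lab):
for node in osd1 osd2 osd3 mon1 mon2 mon3; do ssh-copy-id cephuser@$node; done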
Then, again on the admin machine, modify the ~/.ssh/config file of your ceph-deploy admin node so that ceph-deploy can log in to the Ceph nodes as the user you created, without requiring you to specify --username {username} each time you execute ceph-deploy. This has the added benefit of streamlining ssh and scp usage:
Host osd1
  Hostname osd1
  User cephuser
Host osd2
  Hostname osd2
  User cephuser
Host osd3
  Hostname osd3
  User cephuser
Host mon1
  Hostname mon1
  User cephuser
Host mon2
  Hostname mon2
  User cephuser
Host mon3
  Hostname mon3
  User cephuser
After editing the file, set the correct permissions:
chmod 440 ~/.ssh/config
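At this point a quick end-to-end test is useful: connecting to a node and running a command through sudo should not ask for any password, and should come back as root. Something like this will do:
ssh osd1 sudo whoami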
XFS or BTRFS?
Once all seven machines are prepared like this, there are some additional activities that need to be done on the OSD nodes only: the preparation of the data disks. If you read around, you will find many articles about which file system should be used for Ceph. Ceph also supports ext4, but in the end the two usual options are XFS or BTRFS. I'm not going to explain the differences between the two filesystems in detail; you will find plenty of information all over the Internet, also in regard to Ceph. In a nutshell, BTRFS is more advanced than XFS (which has been around for more than 20 years now, even if widely improved since its first release in 1993...) but it still suffers from some “problems of youth”. Some Ceph users report successful use in production environments, others complain about bad issues they have faced. Personally, I'd like to move to BTRFS, but that will probably happen in a few years at the earliest, and for now I will stick with XFS. XFS has just become the default file system for Red Hat Enterprise Linux 7 and its “free version” CentOS 7, and to me this is a great statement about its stability.
Ok, time to prepare the disks. Remember from the beginning of this post that I created the machines with multiple disks. sda is already used by the Linux installation, so let's check the current situation with the handy lsblk command.
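With the disk sizes used in this lab, the output looks roughly like this (illustrative; the exact sda layout depends on the choices made during installation):
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0   20G  0 disk
├─sda1   8:1    0  500M  0 part /boot
└─sda2   8:2    0 19.5G  0 part /
sdb      8:16   0   30G  0 disk
sdc      8:32   0  100G  0 disk
sdd      8:48   0  100G  0 disk
sde      8:64   0  100G  0 disk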
sdb will be the journal disk, and sdc, sdd, and sde the three data disks. We need to format these drives with XFS and mount them properly. Let's go. First, we create GPT partition tables, repeating the commands for each data disk (NOT sdb):
# parted /dev/sd<c/d/e>
(parted) mklabel gpt
(parted) mkpart primary xfs 0% 100%
(parted) quit
After partitions are prepared, format them with XFS:
mkfs.xfs /dev/sd<x>1
For the journal, we are going to use a raw/unformatted volume, so we will not format it with XFS and we will not mark it as XFS in parted. However, a dedicated journal partition is needed for each OSD, so we need to create three different partitions. In a production environment, you can decide either to dedicate a disk (probably an SSD) to each journal, or, like me, to share the same SSD among the different journals. In both cases, the commands in parted for the journal disk will be:
# parted /dev/sdb
(parted) mklabel gpt
(parted) mkpart primary 0% 33%
(parted) mkpart primary 34% 66%
(parted) mkpart primary 67% 100%
(parted) quit
Or, if you like, you can script the entire process of disk preparation with something like this:
parted -s /dev/sdc mklabel gpt mkpart primary xfs 0% 100%
mkfs.xfs -f /dev/sdc1
parted -s /dev/sdd mklabel gpt mkpart primary xfs 0% 100%
mkfs.xfs -f /dev/sdd1
parted -s /dev/sde mklabel gpt mkpart primary xfs 0% 100%
mkfs.xfs -f /dev/sde1
parted -s /dev/sdb mklabel gpt mkpart primary 0% 33% mkpart primary 34% 66% mkpart primary 67% 100%
The final result can be checked by running lsblk again.
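The output below is illustrative (the exact partition sizes will differ slightly from the percentages given to parted), but you should see three journal partitions on sdb and one XFS partition on each data disk:
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0   20G  0 disk
├─sda1   8:1    0  500M  0 part /boot
└─sda2   8:2    0 19.5G  0 part /
sdb      8:16   0   30G  0 disk
├─sdb1   8:17   0  9.9G  0 part
├─sdb2   8:18   0  9.6G  0 part
└─sdb3   8:19   0  9.9G  0 part
sdc      8:32   0  100G  0 disk
└─sdc1   8:33   0  100G  0 part
sdd      8:48   0  100G  0 disk
└─sdd1   8:49   0  100G  0 part
sde      8:64   0  100G  0 disk
└─sde1   8:65   0  100G  0 part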
Once these activities are completed on all three OSD servers, we are finally ready to deploy Ceph. We will do this in the next post!