When I started this series of posts, I didn’t realize how many posts it would take just to get to the actual Ceph installation. I could have written a quick and dirty guide with step-by-step instructions, but then you would have been stuck with my personal design choices. Instead, I preferred to start from the very beginning, explaining in detail what Ceph is, how it works, and how I prepared my lab to use it. This took me 4 blog posts.
However, I know this is the post you were really waiting for. With the whole lab correctly configured and ready, it’s now time to finally deploy Ceph!
Also available in this series:
Part 1: Introduction
Part 2: Architecture for Dummies
Part 3: Design the nodes
Part 4: Deploy the nodes in the Lab
Part 6: Mount Ceph as a block device on Linux machines
Part 7: Add a node and expand the cluster storage
Part 8: Veeam clustered repository
Part 9: Failover scenarios during Veeam backups
Part 10: Upgrade the cluster
Install ceph-deploy
In addition to the 6 dedicated Ceph machines I’ve created, there’s another Linux VM that I use as an administration console. As I explained in Part 4, this machine can log in via SSH to any Ceph node without a password, and from there use the “cephuser” user to elevate its rights to root. This machine will run every command against the Ceph cluster itself. It’s not a mandatory choice: you can also have one of the Ceph nodes act as the management node; it’s up to you.
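As a reference, the passwordless access from the admin node can be expressed as a couple of lines in ~/.ssh/config; this is only a sketch, assuming the key-based cephuser setup described in Part 4:

Host mon1 mon2 mon3 osd1 osd2 osd3
    User cephuser

With something like this in place, ceph-deploy connects to every node as cephuser without you having to specify the username each time.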
The Ceph administration node is mainly used to run ceph-deploy: this tool is specifically designed to provision Ceph clusters with ease. It’s not the only way to create a Ceph cluster, just the simplest.
Once all the nodes have been configured with the password-less, sudo-capable cephuser user, you need to verify that the administration node is able to reach every node by its hostname, since these will be the names of the nodes registered in Ceph. This means a command like “ping mon1” should succeed for each of the nodes. If not, check your DNS servers again and/or modify the hosts file.
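A quick way to test all of them at once from the admin node is a small loop; a minimal sketch, assuming the hostnames used throughout this series:

for node in mon1 mon2 mon3 osd1 osd2 osd3; do
  ping -c 1 "$node" > /dev/null && echo "$node reachable" || echo "$node NOT reachable"
done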
With all the networking verified, install ceph-deploy using the cephuser user. On a CentOS 7 machine like mine, you first need to add the Ceph repository. If you are running Ubuntu, check the Ceph pre-flight page.
sudo vi /etc/yum.repos.d/ceph.repo
This will be a new, empty file. Paste this text into it and replace the values in curly brackets:
[ceph-noarch]
name=Ceph noarch packages
baseurl=http://ceph.com/rpm-{ceph-release}/{distro}/noarch
enabled=1
gpgcheck=1
type=rpm-md
gpgkey=https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
In my example, the Ceph release will be “giant” and the distro will be el7, so the baseurl becomes http://ceph.com/rpm-giant/el7/noarch. Again, check the preflight page if you are using a different distribution. Then, update your repositories and install ceph-deploy:
sudo yum update && sudo yum install ceph-deploy
Finally, I prefer to create a dedicated directory on the admin node to collect all output files and logs while I use ceph-deploy. Simply run:
mkdir ceph-deploy
cd ceph-deploy
Remember: EVERY TIME you log into the admin node to work on Ceph, you first need to move into this folder, since the configuration files and logs of the ceph-deploy commands are saved here. Ceph-deploy is ready! Time to install Ceph on our nodes.
Set up the cluster
The first operation is to set up the monitor nodes. In my case, they will be the three MON servers. So, my command will be:
ceph-deploy new mon1 mon2 mon3
After a few seconds, if there are no errors, you should see the command end successfully with lines like these:
[ceph_deploy.new][DEBUG ] Monitor initial members are ['mon1', 'mon2', 'mon3']
[ceph_deploy.new][DEBUG ] Monitor addrs are ['10.2.50.211', '10.2.50.212', '10.2.50.213']
[ceph_deploy.new][DEBUG ] Creating a random mon key...
[ceph_deploy.new][DEBUG ] Writing monitor keyring to ceph.mon.keyring...
[ceph_deploy.new][DEBUG ] Writing initial config to ceph.conf...
The initial ceph.conf configuration file has been created. For now, we will simply add a few lines to reflect our public and cluster networks, as explained in Part 3, plus some other parameters to start with (all to be placed under the [global] section of the configuration file):
public network = 10.2.50.0/24
cluster network = 10.2.0.0/24
# Choose reasonable numbers for the number of replicas and placement groups
osd pool default size = 2      # Write an object 2 times
osd pool default min size = 1  # Allow writing 1 copy in a degraded state
osd pool default pg num = 256
osd pool default pgp num = 256
# Choose a reasonable crush leaf type:
# 0 for a 1-node cluster
# 1 for a multi-node cluster in a single rack
# 2 for a multi-node, multi-chassis cluster with multiple hosts in a chassis
# 3 for a multi-node cluster with hosts across racks, etc.
osd crush chooseleaf type = 1
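To put these lines in context, the [global] section of the ceph.conf in my working directory ends up looking roughly like this; the fsid and monitor entries are the ones generated above by ceph-deploy new, and the exact auto-generated lines may differ slightly between versions:

[global]
fsid = aa4d1282-c606-4d8d-8f69-009761b63e8f
mon_initial_members = mon1, mon2, mon3
mon_host = 10.2.50.211,10.2.50.212,10.2.50.213
# ...any other settings written by ceph-deploy new stay as they are...
public network = 10.2.50.0/24
cluster network = 10.2.0.0/24
osd pool default size = 2
osd pool default min size = 1
osd pool default pg num = 256
osd pool default pgp num = 256
osd crush chooseleaf type = 1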
I’m not going to explain here what a placement group is and how it should be configured. It is mandatory to choose the value of pg_num because it cannot be calculated automatically. For more information, read here.
Finally, as a quick visual reminder, this is what we are trying to achieve with the double network:
Then, install Ceph on all the nodes in the cluster and on the admin node:
ceph-deploy install ceph-admin mon1 mon2 mon3 osd1 osd2 osd3
The command will run for a while, and on each node it will update the repositories if necessary (this will probably happen every time on a clean machine, since the main repository will be epel…) and install Ceph with all its dependencies. If you want to follow the process, just look for these lines at the end of each node’s installation:
[mon1][DEBUG ] Complete!
[mon1][INFO ] Running command: sudo ceph --version
[mon1][DEBUG ] ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
This gives you confirmation that Ceph is installed correctly. Once all the nodes are installed, create the initial monitors and gather the keys:
ceph-deploy mon create-initial
(note: if for any reason the command fails at some point, you will need to run it again, this time writing it as ceph-deploy --overwrite-conf mon create-initial)
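If the command completes successfully, ceph-deploy gathers the cluster keys into the working directory on the admin node; listing it should show, among the other files, something along these lines (exact file names can vary slightly between versions):

ceph.conf
ceph.mon.keyring
ceph.client.admin.keyring
ceph.bootstrap-osd.keyring
ceph.bootstrap-mds.keyring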
Prepare OSDs and OSD Daemons
So far, we have installed Ceph on all the cluster nodes, but we are still missing the most important part of a storage cluster like Ceph: the storage space itself! So, in this chapter we will configure it by preparing the OSDs and OSD daemons.
Remember? We set up the OSD nodes with 4 disks: one for the journal and 3 for data. Let’s first check that ceph-deploy is able to see these disks:
ceph-deploy disk list osd1
As you can see from the output, ceph-deploy always works the same way: it connects remotely to the given node and, using sudo, runs a Ceph command locally on it, in this case /usr/sbin/ceph-disk list. The output is what we expect:
[osd1][DEBUG ] /dev/sda :
[osd1][DEBUG ]  /dev/sda1 other, xfs, mounted on /boot
[osd1][DEBUG ]  /dev/sda2 other, LVM2_member
[osd1][DEBUG ] /dev/sdb :
[osd1][DEBUG ]  /dev/sdb1 other
[osd1][DEBUG ]  /dev/sdb2 other
[osd1][DEBUG ]  /dev/sdb3 other
[osd1][DEBUG ] /dev/sdc :
[osd1][DEBUG ]  /dev/sdc1 other
[osd1][DEBUG ] /dev/sdd :
[osd1][DEBUG ]  /dev/sdd1 other
[osd1][DEBUG ] /dev/sde :
[osd1][DEBUG ]  /dev/sde1 other
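As a side note, you could get the same information by hand from the admin node with a plain SSH command; a sketch, assuming the passwordless, sudo-capable cephuser access configured in Part 4:

ssh cephuser@osd1 sudo /usr/sbin/ceph-disk list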
An OSD can be created with these two commands, run one after the other:
ceph-deploy osd prepare {node-name}:{data-disk}[:{journal-disk}]
ceph-deploy osd activate {node-name}:{data-disk-partition}[:{journal-disk-partition}]
Or the combined command:
ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}]
In any case, you can see there is a 1:1 relationship between an OSD and its journal. So, regardless of the fact that in our case the sdb device will be shared between all OSDs, we have to define it as the journal for each OSD by specifying the single partition we created inside sdb. In my case, the commands will be:
ceph-deploy disk zap osd1:sdc osd1:sdd osd1:sde
ceph-deploy osd create osd1:sdc:/dev/sdb1 osd1:sdd:/dev/sdb2 osd1:sde:/dev/sdb3
After repeating the same commands for all the other nodes, our OSD daemons are ready to be used. Note: the first command, zap, cleans everything that might already be on the disk; since it erases any data, be sure you are firing it against the correct disk!
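For completeness, assuming osd2 and osd3 use exactly the same disk layout as osd1 (sdb split into three journal partitions, sdc/sdd/sde for data), the commands for the remaining nodes would be:

ceph-deploy disk zap osd2:sdc osd2:sdd osd2:sde
ceph-deploy osd create osd2:sdc:/dev/sdb1 osd2:sdd:/dev/sdb2 osd2:sde:/dev/sdb3
ceph-deploy disk zap osd3:sdc osd3:sdd osd3:sde
ceph-deploy osd create osd3:sdc:/dev/sdb1 osd3:sdd:/dev/sdb2 osd3:sde:/dev/sdb3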
Finalizing
To have a functioning cluster, we just need to copy the different keys and configuration files from the admin node (ceph-admin) to all the nodes:
ceph-deploy admin ceph-admin mon1 mon2 mon3 osd1 osd2 osd3
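One small note: if the ceph commands below complain that they cannot read /etc/ceph/ceph.client.admin.keyring, the Ceph quick start suggests making the keyring readable on the admin node; an optional step, only needed if you hit that error:

sudo chmod +r /etc/ceph/ceph.client.admin.keyring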
The cluster is ready! You can check it from the admin-node using these commands:
ceph health
HEALTH_OK
ceph status
    cluster aa4d1282-c606-4d8d-8f69-009761b63e8f
     health HEALTH_OK
     monmap e1: 3 mons at {mon1=10.2.50.211:6789/0,mon2=10.2.50.212:6789/0,mon3=10.2.50.213:6789/0}, election epoch 6, quorum 0,1,2 mon1,mon2,mon3
     osdmap e47: 9 osds: 9 up, 9 in
      pgmap v115: 256 pgs, 1 pools, 0 bytes data, 0 objects
            310 MB used, 899 GB / 899 GB avail
                 256 active+clean
Here you can see the 3 monitors all participating in the quorum, the 9 OSDs we created (it’s important they are all in status UP and IN), the 256 placement groups grouped in 1 pool, and the roughly 900 GB we have available (3 × 100 GB disks per node × 3 nodes).
The most common warning you could see at this point, especially in labs where PG calculations are overlooked, is:
health HEALTH_WARN too few pgs per osd (7 < min 20)
And, for example, a count of 64 total PGs. Honestly, placement group calculation is something that still does not totally convince me: I don’t get why it should be left to the Ceph admin to configure manually, only for Ceph to then complain that the value is wrong. Anyway, as long as it cannot be configured automatically, the rule of thumb I’ve found to get rid of the error is that Ceph seems to expect between 20 and 32 PGs per OSD. A value below 20 gives you this warning, and a value above 32 gives another error:
Error E2BIG: specified pg_num 512 is too large (creating 448 new PGs on ~9 OSDs exceeds per-OSD max of 32)
So, since in my case there are 9 OSDs, the minimum value would be 9*20=180 and the maximum 9*32=288. I chose 256 and configured it dynamically:
ceph osd lspools                  # list the existing pools; the default pool created is named "rbd"
ceph osd pool get rbd pg_num      # verify the actual value is 64
ceph osd pool set rbd pg_num 256
ceph osd pool set rbd pgp_num 256
That’s it! The cluster is up and running, and you can also try to reboot some of the OSD servers and watch in real time with ceph -w how the overall cluster keeps running and dynamically adjusts its status:
2015-01-06 09:06:45.843622 mon.0 [INF] pgmap v115: 256 pgs: 256 active+clean; 0 bytes data, 310 MB used, 899 GB / 899 GB avail
2015-01-06 09:16:08.573068 mon.0 [INF] osd.0 marked itself down
2015-01-06 09:16:08.573200 mon.0 [INF] osd.1 marked itself down
2015-01-06 09:16:08.573481 mon.0 [INF] osd.2 marked itself down
2015-01-06 09:16:08.635138 mon.0 [INF] osdmap e48: 9 osds: 6 up, 9 in
2015-01-06 09:16:08.649844 mon.0 [INF] pgmap v116: 256 pgs: 78 stale+active+clean, 178 active+clean; 0 bytes data, 310 MB used, 899 GB / 899 GB avail
2015-01-06 09:16:09.649700 mon.0 [INF] osdmap e49: 9 osds: 6 up, 9 in
2015-01-06 09:16:09.662046 mon.0 [INF] pgmap v117: 256 pgs: 78 stale+active+clean, 178 active+clean; 0 bytes data, 310 MB used, 899 GB / 899 GB avail
2015-01-06 09:16:10.672190 mon.0 [INF] osdmap e50: 9 osds: 6 up, 9 in
2015-01-06 09:16:10.675173 mon.0 [INF] pgmap v118: 256 pgs: 78 stale+active+clean, 178 active+clean; 0 bytes data, 310 MB used, 899 GB / 899 GB avail
2015-01-06 09:16:13.811669 mon.0 [INF] pgmap v119: 256 pgs: 18 active+undersized+degraded, 70 stale+active+clean, 168 active+clean; 0 bytes data, 310 MB used, 899 GB / 899 GB avail
2015-01-06 09:16:14.830343 mon.0 [INF] pgmap v120: 256 pgs: 79 active+undersized+degraded, 37 stale+active+clean, 140 active+clean; 0 bytes data, 310 MB used, 899 GB / 899 GB avail
2015-01-06 09:16:15.853743 mon.0 [INF] pgmap v121: 256 pgs: 137 active+undersized+degraded, 13 stale+active+clean, 106 active+clean; 0 bytes data, 311 MB used, 899 GB / 899 GB avail
2015-01-06 09:16:16.865934 mon.0 [INF] pgmap v122: 256 pgs: 166 active+undersized+degraded, 90 active+clean; 0 bytes data, 311 MB used, 899 GB / 899 GB avail
2015-01-06 09:16:37.392171 mon.2 [INF] from='client.? 10.2.50.201:0/1001322' entity='osd.1' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=osd1", "root=default"], "id": 1, "weight": 0.1}]: dispatch
2015-01-06 09:16:37.393895 mon.0 [INF] from='client.4196 :/0' entity='forwarded-request' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=osd1", "root=default"], "id": 1, "weight": 0.1}]: dispatch
2015-01-06 09:16:38.966524 mon.0 [INF] osd.1 10.2.50.201:6800/1715 boot
2015-01-06 09:16:38.968029 mon.0 [INF] osdmap e51: 9 osds: 7 up, 9 in
2015-01-06 09:16:38.972170 mon.0 [INF] pgmap v123: 256 pgs: 166 active+undersized+degraded, 90 active+clean; 0 bytes data, 311 MB used, 899 GB / 899 GB avail
2015-01-06 09:16:39.459148 mon.2 [INF] from='client.? 10.2.50.201:0/1002486' entity='osd.2' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=osd1", "root=default"], "id": 2, "weight": 0.1}]: dispatch
2015-01-06 09:16:39.460702 mon.0 [INF] from='client.4202 :/0' entity='forwarded-request' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=osd1", "root=default"], "id": 2, "weight": 0.1}]: dispatch
2015-01-06 09:16:39.982552 mon.0 [INF] osdmap e52: 9 osds: 7 up, 9 in
2015-01-06 09:16:39.986404 mon.0 [INF] pgmap v124: 256 pgs: 166 active+undersized+degraded, 90 active+clean; 0 bytes data, 311 MB used, 899 GB / 899 GB avail
2015-01-06 09:16:40.985897 mon.2 [INF] from='client.? 10.2.50.201:0/1002822' entity='osd.0' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=osd1", "root=default"], "id": 0, "weight": 0.1}]: dispatch
2015-01-06 09:16:40.996677 mon.0 [INF] from='client.4205 :/0' entity='forwarded-request' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=osd1", "root=default"], "id": 0, "weight": 0.1}]: dispatch
2015-01-06 09:16:41.001278 mon.0 [INF] osd.2 10.2.50.201:6803/2536 boot
2015-01-06 09:16:41.003106 mon.0 [INF] osdmap e53: 9 osds: 8 up, 9 in
2015-01-06 09:16:41.008089 mon.0 [INF] pgmap v125: 256 pgs: 166 active+undersized+degraded, 90 active+clean; 0 bytes data, 311 MB used, 899 GB / 899 GB avail
2015-01-06 09:16:42.006215 mon.0 [INF] osd.0 10.2.50.201:6806/2917 boot
2015-01-06 09:16:42.008021 mon.0 [INF] osdmap e54: 9 osds: 9 up, 9 in
2015-01-06 09:16:42.011302 mon.0 [INF] pgmap v126: 256 pgs: 166 active+undersized+degraded, 90 active+clean; 0 bytes data, 311 MB used, 899 GB / 899 GB avail
2015-01-06 09:16:43.048187 mon.0 [INF] osdmap e55: 9 osds: 9 up, 9 in
2015-01-06 09:16:43.053407 mon.0 [INF] pgmap v127: 256 pgs: 166 active+undersized+degraded, 90 active+clean; 0 bytes data, 311 MB used, 899 GB / 899 GB avail
2015-01-06 09:16:45.059178 mon.0 [INF] pgmap v128: 256 pgs: 141 active+undersized+degraded, 115 active+clean; 0 bytes data, 310 MB used, 899 GB / 899 GB avail
2015-01-06 09:16:47.077584 mon.0 [INF] pgmap v129: 256 pgs: 92 active+undersized+degraded, 164 active+clean; 0 bytes data, 311 MB used, 899 GB / 899 GB avail
2015-01-06 09:16:48.093766 mon.0 [INF] pgmap v130: 256 pgs: 256 active+clean; 0 bytes data, 311 MB used, 899 GB / 899 GB avail
I rebooted OSD1: the three OSDs it contains went down, and the pgmap started updating itself to reflect the new condition, where some PGs were in degraded mode. When the server came back up, the 166 degraded PGs immediately started to resync, and in a few seconds the state returned to all 256 PGs active+clean. But the important thing to notice is that, for the entire duration of the reboot of one node, the overall size of the cluster remained 899 GB.
Next time, we will create an RBD volume and connect it to a Linux machine as a local device!