Also available in this series:
Part 1: Introduction
Part 2: Architecture for Dummies
Part 3: Design the nodes
Part 4: Deploy the nodes in the lab
Part 5: Install Ceph in the lab
Part 6: Mount Ceph as a block device on Linux machines
Part 7: Add a node and expand the cluster storage
Part 9: Failover scenarios during Veeam backups
Part 10: Upgrade the cluster
So far, we have created a redundant, scale-out storage system for our data, and we mounted the Ceph cluster as a local device on a Veeam Linux repository. Ceph itself is fully protected against failures of any of its components, but if you look at the overall design of the solution, the Veeam repository is still a single machine, and as such a single point of failure. It makes little sense to have a solution like Ceph in the back-end if the front-end cannot be protected at almost the same level. This is exactly the topic of this part: how to create a redundant front-end for our storage solution.
The overall design
What we want to achieve is the creation of a clustered front-end for our Veeam Linux repository. The overall design is like this:
You already know the back-end by now: we have 3 Ceph monitors managing the cluster, and 4 OSD nodes holding data and replicating to each other over a dedicated network. On the front-end, we will deploy a second Linux node in addition to the one we created in Part 6, and we will change some configurations to make them work (almost) as a cluster.
I say almost because the Veeam Linux datamover cannot be clustered: it is a binary component that is dynamically deployed onto the Linux machine every time Veeam connects to it. This can sound strange at first, but I (and some other colleagues) really like it, for several reasons. First, there is not even a service/daemon to be maintained: everything is deployed fresh at each connection, so there is no risk of a daemon crashing or having problems. Second, whenever there is a new version, no update is required; at the next connection the new version of the datamover is simply copied and executed. No binaries stored on the machine, no init scripts (or systemd units to fight against…), no daemons, no configuration files.
But because there is no permanent daemon, it is not possible to cluster it. What you can create, however, is an active-passive configuration that gives you high availability nonetheless. Here’s how.
Build the second node
First, you need a second Veeam Linux repository. As in Part 6, follow the instructions and create a machine with these parameters:
repo2.skunkworks.local
CentOS 7 minimal
10.2.50.162
2 vCPU, 2 GB RAM, 20 GB disk
Once the machine is ready, configure and connect the Ceph block device “veeamrepo” as I explained in Part 6.
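As a quick reminder of what that involves, the steps boil down to mapping the RBD image and mounting it. Take this only as a sketch: the pool name, the Ceph user and the keyring path are assumptions here, so use the values you configured in Part 6.
# map the RBD image (pool name, user and keyring are assumptions, see Part 6)
rbd map rbd/veeamrepo --id admin --keyring /etc/ceph/ceph.client.admin.keyring
# the image was already formatted with XFS (label "veeamrepo") on the first node,
# so here we only create the mount point and mount it
mkdir -p /mnt/veeamrepo
mount -L veeamrepo /mnt/veeamrepo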
Next, let’s deal with access to the block device from both nodes. XFS is the filesystem we chose to format it, but it is not a cluster-aware filesystem. If, right after mounting the block device on node2, you try to write a file like this:
[root@repo2 ~]# echo test > /mnt/veeamrepo/test.txt
And you then list the content on repo1:
[root@repo1 ~]# ll /mnt/veeamrepo/
total 3
drwxrwxr-x. 2 root root 6 Feb 16 14:10 backups
The file does not appear. And you will see even more worrying problems if you try to write to the shared volume from both nodes at the same time.
There are two ways to deal with this limitation. The first would be to use a cluster-aware filesystem like OCFS2 or GFS2. These filesystems allow multiple nodes to access, read and write data on the shared block device at the same time, avoiding locking and corruption issues. But these filesystems are not commonly used on Ceph block devices, so I’m not confident enough to use them; also, their performance has proved to be below that of XFS, and I don’t want to sacrifice it. Last but not least, if you are converting an existing single-node repository to a cluster, the RBD device is already formatted with XFS, and you probably cannot afford to reformat it with OCFS2, for example, because you would have to move all the backups out and back in again.
For all these reasons, I’ve done something much simpler: since only one node at a time will use the RBD device, there is no real need for a cluster-aware filesystem. What we need to be sure of is that only one node at a time mounts the device. So, first of all, we remove the automount line from the file /etc/systemd/system/rbd-veeamrepo.service by commenting it out or deleting it:
# ExecStart=/bin/mount -L veeamrepo /mnt/veeamrepo
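For reference, after the change the unit file could look roughly like this. It is only a sketch: the exact map command, paths and dependencies are the ones you set up in Part 6 and may differ from what I show here.
[Unit]
Description=Map the Ceph RBD device for the Veeam repository
After=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
# map the RBD image at boot (image and user names are assumptions, see Part 6)
ExecStart=/usr/bin/rbd map rbd/veeamrepo --id admin
# the automount line is commented out: only Keepalived will mount the device
# ExecStart=/bin/mount -L veeamrepo /mnt/veeamrepo

[Install]
WantedBy=multi-user.target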
In this way, the rbd kernel module is loaded and the device is connected on each node (you can still see the /dev/rbd0 device on both), BUT the device is not mounted. Only one node at a time will mount it; the “failover” procedure will look like this:
– node2 has the RBD device unmounted, node1 has it mounted
– node1 writes a new block
– node1 unmounts the device
– node2 mounts the device
– node2 can see the change
Keepalived will do these operations for us.
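If you want to verify this sequence by hand before configuring Keepalived, it boils down to standard mount/umount commands (the test file name is just for illustration):
# on node1, the currently active node: write something, then release the device
echo "new block" > /mnt/veeamrepo/failover-test.txt
umount /mnt/veeamrepo

# on node2: take over the device and verify the change is visible
mount -L veeamrepo /mnt/veeamrepo
ls -l /mnt/veeamrepo/failover-test.txt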
Keepalived
I talked about this software some years ago when I used it to build a load balancer. Keepalived is a free software for Linux, designed to (as per its website) “provide simple and robust facilities for loadbalancing and high-availability to Linux system and Linux based infrastructures. Loadbalancing framework relies on well-known and widely used Linux Virtual Server (IPVS) kernel module providing Layer4 loadbalancing. Keepalived implements a set of checkers to dynamically and adaptively maintain and manage loadbalanced server pool according their health. On the other hand high-availability is achieved by VRRP protocol. VRRP is a fundamental brick for router failover. In addition, Keepalived implements a set of hooks to the VRRP finite state machine providing low-level and high-speed protocol interactions. Keepalived frameworks can be used independently or all together to provide resilient infrastructures.”
By using Keepalived, we will share a virtual IP address between the two Linux repositories.
First we need to activate the Extra Packages for Enterprise Linux (EPEL) repository on both machines. This software repository hosts the packages of the Keepalived software:
yum -y install epel-release
Then we install Keepalived:
yum -y install keepalived
On CentOS 7, the killall command is not installed by default. Systemd has its own methods to kill processes, but the Keepalived check script we will use relies on this command, so before proceeding:
yum -y install psmisc
The psmisc package contains killall, among other process utilities.
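The check script we will configure below uses killall -0, which sends no signal at all: it simply returns success if a process with that name exists. You can try it yourself:
# prints "sshd is running" as long as an sshd process exists
killall -0 sshd && echo "sshd is running"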
Then, edit the Keepalived configuration file (vi /etc/keepalived/keepalived.conf) and make it look like this:
vrrp_script chk_sshd {
    script "killall -0 sshd"
    interval 2
    weight -4
}

vrrp_instance REPO_CLUSTER {
    state BACKUP
    nopreempt
    interface ens160
    virtual_router_id 1
    priority 101
    notify /usr/local/bin/keepalivednotify.sh
    advert_int 1
    track_interface {
        ens160
    }
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        10.2.50.160/24 dev ens160
    }
    track_script {
        chk_sshd
    }
}

virtual_server 10.2.50.160 22 {
    delay_loop 30
    lb_algo wrr
    lb_kind NAT
    persistence_timeout 50
    protocol TCP

    real_server 10.2.50.161 22 {
        weight 1
        TCP_CHECK {
            connect_port 22
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 1
        }
    }
}
In the configuration file you can spot a script that is invoked: /usr/local/bin/keepalivednotify.sh. Its content will be:
#!/bin/bash

TYPE=$1
NAME=$2
STATE=$3

case $STATE in
    "MASTER")
        /bin/mount -L veeamrepo /mnt/veeamrepo
        exit 0
        ;;
    "BACKUP")
        /bin/umount /mnt/veeamrepo
        exit 0
        ;;
    "FAULT")
        /bin/umount /mnt/veeamrepo
        exit 0
        ;;
    *)
        echo "unknown state"
        exit 1
        ;;
esac
PS: remember to set it to executable after saving it!
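Something like this will do:
chmod +x /usr/local/bin/keepalivednotify.sh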
So, what is happening with this configuration? Keepalived publishes the virtual IP 10.2.50.160, and during failovers the IP moves from one node to the other. Also, when a node changes its state to MASTER (the active node), the Ceph RBD block device is mounted. When the node becomes BACKUP (the passive node), the same block device is unmounted; the same happens if a node goes into the FAULT state. In this way, only the active node has the partition mounted, as planned. On a crash, there is no surviving node locking the partition, so the new master can mount it without issues.
When registering the new repository in Veeam, we will create a DNS entry like this:
repo-cluster.skunkworks.local 10.2.50.160
and by registering this hostname (or the virtual IP directly) in Veeam, we will always connect to the active node via this IP, instead of the physical IP of each node. Some additional notes on the most important parameters in the configuration file:
– state BACKUP and nopreempt: usually, VRRP (Virtual Router Redundancy Protocol, the protocol used by Keepalived to assign virtual IPs) preempts a lower-priority machine when a higher-priority machine comes online. “nopreempt” allows the lower-priority machine to keep the master role even when a higher-priority machine comes back online. For this to work, both nodes must have BACKUP as their initial state. Since the Veeam binaries run on the active node, we do not want a failback to crash them: the failed-over node stays active until a new failover happens.
– track_interface: the Linux repositories can work properly only if they can connect to the back-end Ceph cluster and expose their services to Veeam. For this reason, we monitor the state of the connection, and if anything happens we fail over to the other node.
– virtual_server: this is the service published via the virtual IP. Since all connections to a Veeam Linux repository are initiated via SSH, we publish the SSH service through Keepalived. The interval between checks is 30 seconds (delay_loop); it could be lowered, but that would be pointless in our scenario, since the connection timeouts of the Veeam datamover are higher than this.
– lb_kind NAT: using NAT, all replies from the real SSH server will be natted behind Keepalived. In this way, Veeam components will always see packets coming back from the virtual IP instead of the real IP of the active node.
– lb_algo: I’m listing this part just to explain how I’m using Keepalived here. Usually, Keepalived balances a virtual IP between multiple real servers, and you will find several configuration examples on the Internet where multiple “real_server” sections are listed in the configuration file. In our case, I defined the virtual server and the real server only to NAT the physical IP behind the virtual IP; the other load-balancing parameters are not really used, since there is only one service to publish.
If you want to learn more about all the options of Keepalived, read its User Guide.
Once you have configured Keepalived on both nodes with this configuration file, go to the second node and change the priority from 101 to 100, and the IP of the real_server to 10.2.50.162 (its own IP address). This means that when both nodes start, the node with the higher priority will be the master, until a failover happens and the “nopreempt” parameter keeps the master role on the secondary node. Before starting the service, you need to allow the kernel to bind non-local IPs on the host and apply the change:
echo "net.ipv4.ip_nonlocal_bind = 1" >> /etc/sysctl.conf sysctl -p
Finally, you can start Keepalived on both nodes and set it to start automatically at every reboot:
systemctl start keepalived.service
systemctl enable keepalived.service
You can check on both servers that only one of them lists the virtual IP and has the XFS device mounted. This is node1 in the MASTER (active) state:
[root@repo1 ~]# ip addr show ens160
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:50:56:b9:06:08 brd ff:ff:ff:ff:ff:ff
    inet 10.2.50.161/24 brd 10.2.50.255 scope global ens160
       valid_lft forever preferred_lft forever
    inet 10.2.50.160/24 scope global secondary ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:feb9:608/64 scope link
       valid_lft forever preferred_lft forever
[root@repo1 ~]# mount -l | grep rbd
/dev/rbd0 on /mnt/veeamrepo type xfs (rw,relatime,seclabel,attr2,inode64,sunit=8192,swidth=8192,noquota) [veeamrepo]
And this is node2 in the BACKUP state:
[root@repo2 ~]# ip addr show ens160
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:50:56:b9:38:a4 brd ff:ff:ff:ff:ff:ff
    inet 10.2.50.162/24 brd 10.2.50.255 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:feb9:38a4/64 scope link
       valid_lft forever preferred_lft forever
[root@repo2 ~]# mount -l | grep rbd
[root@repo2 ~]#
To test the failover, you can disable the interface of node1 or reboot it completely, forcing a transition to the FAULT state, and you will see Keepalived doing its job on both nodes:
Feb 21 18:34:07 repo1.skunkworks.local Keepalived_healthcheckers[7736]: Netlink reflector reports IP fe80::70ba:3fff:fe30:d40a removed
Feb 21 18:34:07 repo1.skunkworks.local Keepalived_vrrp[7737]: Netlink reflector reports IP fe80::70ba:3fff:fe30:d40a removed
Feb 21 18:34:08 repo1.skunkworks.local Keepalived_vrrp[7737]: Kernel is reporting: interface ens160 DOWN
Feb 21 18:34:08 repo1.skunkworks.local Keepalived_vrrp[7737]: VRRP_Instance(REPO_CLUSTER) Entering FAULT STATE
Feb 21 18:34:08 repo1.skunkworks.local Keepalived_vrrp[7737]: VRRP_Instance(REPO_CLUSTER) removing protocol VIPs.
Feb 21 18:34:08 repo1.skunkworks.local Keepalived_vrrp[7737]: Opening script file /usr/local/bin/keepalivednotify.sh
Feb 21 18:34:08 repo1.skunkworks.local Keepalived_vrrp[7737]: VRRP_Instance(REPO_CLUSTER) Now in FAULT state
Feb 21 18:34:08 repo1.skunkworks.local Keepalived_healthcheckers[7736]: Netlink reflector reports IP 10.2.50.160 removed
Feb 21 18:34:08 repo1.skunkworks.local avahi-daemon[603]: Withdrawing address record for 10.2.50.160 on ens160.
Feb 21 18:34:08 repo2.skunkworks.local Keepalived_vrrp[12136]: VRRP_Instance(REPO_CLUSTER) Transition to MASTER STATE
Feb 21 18:34:09 repo2.skunkworks.local Keepalived_vrrp[12136]: VRRP_Instance(REPO_CLUSTER) Entering MASTER STATE
Feb 21 18:34:09 repo2.skunkworks.local Keepalived_vrrp[12136]: VRRP_Instance(REPO_CLUSTER) setting protocol VIPs.
Feb 21 18:34:09 repo2.skunkworks.local Keepalived_vrrp[12136]: VRRP_Instance(REPO_CLUSTER) Sending gratuitous ARPs on ens160 for 10.2.50.160
Feb 21 18:34:09 repo2.skunkworks.local Keepalived_vrrp[12136]: Opening script file /usr/local/bin/keepalivednotify.sh
Feb 21 18:34:09 repo2.skunkworks.local avahi-daemon[606]: Registering new address record for 10.2.50.160 on ens160.IPv4.
Feb 21 18:34:09 repo2.skunkworks.local Keepalived_healthcheckers[12135]: Netlink reflector reports IP 10.2.50.160 added
Feb 21 18:34:09 repo2.skunkworks.local kernel: XFS (rbd0): Mounting Filesystem
Feb 21 18:34:10 repo2.skunkworks.local kernel: XFS (rbd0): Ending clean mount
Feb 21 18:34:10 repo2.skunkworks.local kernel: SELinux: initialized (dev rbd0, type xfs), uses xattr
As you can see in these logs, repo1 goes into the FAULT state and XFS is unmounted by the notify script; repo2 becomes the new MASTER and does two things: it publishes the virtual IP 10.2.50.160 and mounts the XFS partition from the rbd0 device (our Ceph block device).
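If you want to reproduce this test, a simple way to trigger the FAULT transition is to take the tracked interface down from the VM console (the node becomes unreachable over that interface, so don’t do it from an SSH session):
ip link set dev ens160 down
# bring it back up once the test is over
ip link set dev ens160 up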
Nice, failover is working!
SSH Keys and Veeam repository
Before we can safely use the clustered repository, there’s still one last step we need to do.
When sshd runs for the first time during the first boot of a Linux machine, a passwordless public/private key pair is generated on the host to let clients identify it. When a client connects to sshd for the first time, the server’s public key fingerprint is stored on the client’s machine in ~/.ssh/known_hosts; on subsequent attempts, the presented fingerprint is compared with the one stored in the same file. If the fingerprints do not match, ssh asks for confirmation, warning you about potential “man in the middle” attacks. This behavior can cause trouble in our scenario: each of the two Linux nodes has its own SSH key pair, generated at first boot, so at each failover there will be a key mismatch when Veeam connects to the other node, because it expects to always connect to the same “virtual” SSH service.
The solution is simple. The SSH server host keys are stored in /etc/ssh:
[root@repo1 ssh]# ll *key *pub
-rw-r-----. 1 root ssh_keys  227 Feb 20 10:38 ssh_host_ecdsa_key
-rw-r--r--. 1 root root      162 Feb 20 10:38 ssh_host_ecdsa_key.pub
-rw-r-----. 1 root ssh_keys 1679 Feb 20 10:38 ssh_host_rsa_key
-rw-r--r--. 1 root root      382 Feb 20 10:38 ssh_host_rsa_key.pub
Four files, two key pairs: one ECDSA and one RSA. Copy these files from the first node to the second, replacing the existing ones, and restart the sshd service. This way both nodes will have the same SSH host keys, and since the SSH security check matches the IP against the corresponding host key (via the known_hosts file), the check will always pass because we are matching against the virtual IP. Basically, Veeam will always believe it is connecting to the same node.
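A quick way to do it, sketched here assuming you copy from repo1 (adjust if your installation has additional host key types):
# on repo1: copy the host keys to repo2, overwriting its own
scp /etc/ssh/ssh_host_ecdsa_key /etc/ssh/ssh_host_ecdsa_key.pub \
    /etc/ssh/ssh_host_rsa_key /etc/ssh/ssh_host_rsa_key.pub \
    root@repo2.skunkworks.local:/etc/ssh/

# on repo2: restart sshd so it uses the copied keys
systemctl restart sshd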
Finally, the clustered repository is ready to be connected and used by Veeam. In the next part, we will connect the clustered repository to Veeam, and I’ll show you two different failover scenarios, one Ceph OSD node stopped and a failover of the front-end, and what happens to running Veeam jobs.