During my tests with keepalived as a load balancer for a Linux cluster, I was looking for a way to quickly simulate a node failure and check that keepalived was correctly failing over to the other node. Here is a quick and smart way to do it!
Dummy!
Keepalived can track a service or a network connection, and when one of these resources fails, it starts the failover. The problem during a test phase is quite obvious: you do not really want to crash a service on purpose or disconnect a network connection just to test the failover. You want to keep, for example, the ssh connection open to monitor both nodes, and still see the failover happening.
Keepalived does not have a “manual” failover command, but I’ve found a way to do it. Kudos to my friend PJ Spagnolatti: one of his posts on the keepalived mailing list (back in 2001!!!) was a great help in achieving this, plus a couple of emails I exchanged with him. The “trick” is really nice: we will load a fake network interface, and by taking it down, we will trigger the failover. Linux has a network interface type called exactly “dummy”, designed for such needs! How cool!
First, you need to load dummy in the kernel:
echo "modprobe dummy" >/etc/sysconfig/modules/rcsysinit.modules
chmod +x /etc/sysconfig/modules/rcsysinit.modules
modprobe -a dummy
Then, you configure dummy0 to be up at boot:
vi /etc/sysconfig/network-scripts/ifcfg-dummy0

DEVICE=dummy0
BOOTPROTO=none
IPV6INIT=no
NAME="dummy0"
ONBOOT=yes
TYPE=Ethernet
USERCTL=no
NM_CONTROLLED="no"
Once the device is “up and running” on both keepalived nodes, you add the network interface as a resource to be monitored. I’m posting here my complete keepalived.conf configuration:
vrrp_script chk_sshd {
    script "killall -0 sshd"
    interval 2
    weight -4
}

vrrp_instance REPO_CLUSTER {
    state BACKUP
    nopreempt
    interface ens160
    virtual_router_id 1
    priority 101
    notify /usr/local/bin/keepalivednotify.sh
    advert_int 1
    track_interface {
        ens160
        dummy0
    }
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        10.2.50.160/24 dev ens160
    }
    track_script {
        chk_sshd
    }
}

virtual_server 10.2.50.160 22 {
    delay_loop 30
    lb_algo wrr
    lb_kind NAT
    persistence_timeout 50
    protocol TCP
    real_server 10.2.50.161 22 {
        weight 1
        TCP_CHECK {
            connect_port 22
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 1
        }
    }
}
As you can read in this configuration file, the “real” monitoring happens against the sshd service and the ens160 interface (this is the new way systemd in CentOS 7 names what once was eth0 when it’s a VMware virtual interface…). When anything happens to one of these two resources, the virtual IP 10.2.50.160 is no longer published on this node, and the failover happens towards the other node (10.2.50.162 is the real IP of the second node; the rest of the configuration file is exactly the same).
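A side note on the notify line in that configuration: I am not posting my keepalivednotify.sh here, so what follows is only a hypothetical stand-in to show the idea. It relies on the fact that keepalived calls a notify script with the type (INSTANCE or GROUP), the name, and the new state as the first three arguments; the function name format_transition and the log tag are made up for this sketch.

```shell
#!/bin/sh
# Hypothetical stand-in for /usr/local/bin/keepalivednotify.sh
# (the real script is not shown in this post).
# keepalived passes: $1 = INSTANCE or GROUP, $2 = name, $3 = new state.

format_transition() {
    # Build a one-line description of the state change.
    kind=$1
    name=$2
    state=$3
    case "$state" in
        MASTER) detail="this node now holds the virtual IP" ;;
        BACKUP) detail="this node is standing by" ;;
        FAULT)  detail="a tracked resource is down" ;;
        *)      detail="unrecognised state" ;;
    esac
    printf '%s %s -> %s (%s)\n' "$kind" "$name" "$state" "$detail"
}

# Only log when called with the arguments keepalived provides.
if [ "$#" -ge 3 ]; then
    logger -t keepalived-notify "$(format_transition "$1" "$2" "$3")"
fi
```

With something like this in place, every failover test should leave a line in syslog on both nodes, which is handy when you are watching two ssh sessions at once.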
With dummy0 added to the track_interface section, a manual failover is as simple as running this on the command line:
ifdown dummy0
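If you run this test often, the down/wait/up sequence can be wrapped in a small function. This is just a sketch under a few assumptions: it uses `ip link set` instead of ifdown/ifup (which should look the same to keepalived, since it only watches the link flags), and both the name manual_failover and the RUN variable are my own inventions for a dry-run mode.

```shell
manual_failover() {
    # $1 = tracked dummy interface (default dummy0)
    # $2 = seconds to keep it down (default 5)
    # RUN is a hypothetical hook: set RUN=echo to print the
    # commands instead of executing them.
    iface=${1:-dummy0}
    hold=${2:-5}
    ${RUN:-} ip link set dev "$iface" down
    ${RUN:-} sleep "$hold"
    ${RUN:-} ip link set dev "$iface" up
}
```

Run `RUN=echo manual_failover dummy0 5` first to see what it would do, then `manual_failover dummy0 5` as root on the node that currently holds the virtual IP.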
dummy0 is usually in state UNKNOWN:
# ip addr show dummy0
3: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether ae:18:a4:0a:17:ab brd ff:ff:ff:ff:ff:ff
    inet6 fe80::ac18:a4ff:fe0a:17ab/64 scope link
       valid_lft forever preferred_lft forever
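If you would rather have a script read that state than eyeball it, something like the following should work. It is a minimal sketch: link_state is a name I made up, and it only assumes that the output contains the word "state" followed by the value, as in the output above.

```shell
link_state() {
    # Read `ip addr show <iface>` output on stdin and print the
    # word that follows "state" (e.g. UNKNOWN, DOWN, UP).
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "state") { print $(i + 1); exit }
    }'
}
```

Usage would be `ip addr show dummy0 | link_state`, which makes it easy to compare the interface state before and after a test.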
When we take down the dummy interface, it goes into state DOWN:
# ip addr show dummy0
3: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noqueue state DOWN
    link/ether ae:18:a4:0a:17:ab brd ff:ff:ff:ff:ff:ff
and the failover starts. Remember to bring the interface back to its initial state after the failover. What happens next depends on the keepalived configuration: in my case I configured “nopreempt”, which disables the failback to the master node, so even if I bring dummy0 back online on the master node, the virtual IP stays on the secondary node.
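To confirm from the shell which node is holding the virtual IP after a test, a check along these lines can help. Again, this is only a sketch: holds_vip is a hypothetical helper name, and it assumes the one-line-per-address format printed by `ip -o -4 addr show`.

```shell
holds_vip() {
    # $1 = virtual IP to look for.
    # stdin = output of `ip -o -4 addr show dev <iface>`.
    # Succeeds only if the address appears as "inet <vip>/<prefix>";
    # the trailing slash keeps 10.2.50.16 from matching 10.2.50.160.
    grep -qF " inet $1/"
}
```

On each node, `ip -o -4 addr show dev ens160 | holds_vip 10.2.50.160 && echo "VIP is here"` should print the message on exactly one of them.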
Once you’ve finished your tests, you can either remove dummy0 from the keepalived configuration, or keep it and use it as a way to run manual failovers when needed!