Linux HA 集群原理和配置-01

CentOS7上可通过三大模块corosync（心跳管理），pacemaker（资源管理），pcs（配置工具）来实现最基本的高可用性集群功能，本文将介绍这些工具的工作原理和配置过程。

1. 实验环境

虚拟化软件：
- Virtual Box 5.6.2
- Vagrant 2.0.2
实验虚机：
- iscsi-disks: 192.168.56.20，通过iSCSI协议提供共享存储，默认配置1个cpu，1G内存。
- ha-host1: 192.168.56.31，默认配置1个cpu，1G内存。
- ha-host2: 192.168.56.32，默认配置1个cpu，1G内存。
- ha-host3: 192.168.56.33，默认配置1个cpu，1G内存。
安装和管理网络：192.168.56.0/24，该网络为VirtualBox的Host-Only网络，支持物理机和VirtualBox虚机间的互相访问。

2. 克隆项目并启动上述虚拟机

$ git clone https://github.com/lprincewhn/iscsi.git
$ cd iscsi
$ vagrant up iscsi-disks
$ cd ..
$ git clone https://github.com/lprincewhn/linuxha.git
$ cd LinuxHA
$ vagrant up

虚拟机启动完毕后可使用以下用户登陆：

root/vagrant
vagrant/vagrant

3. 创建pcs集群

Step 1 安装软件包

在3台集群主机上安装corosync，pacemaker，pcs包, 启动pcsd服务。

# yum -y install corosync pacemaker pcs
# systemctl start pcsd && systemctl enable pcsd

Step 2 修改hacluster用户密码

在3台集群主机上修改用户hacluster的密码，3台主机的密码保持一致。这个用户仅用于集群主机间通信，无法登陆系统。因此该密码仅在主机集群认证时一次性使用。

[root@ha-host1 ~]# passwd hacluster
Changing password for user hacluster.
New password:
BAD PASSWORD: The password is shorter than 8 characters
Retype new password:
passwd: all authentication tokens updated successfully.

Step 3 互相认证集群主机

[root@ha-host1 ~]# pcs cluster auth ha-host1 ha-host2 ha-host3
Username: hacluster
Password:
ha-host1: Authorized
ha-host2: Authorized
ha-host3: Authorized
[root@ha-host1 ~]#

Step 4 创建集群

[root@ha-host1 ~]# pcs cluster setup --name linuxha ha-host1 ha-host2 ha-host3
Destroying cluster on nodes: ha-host1, ha-host2, ha-host3...
ha-host3: Stopping Cluster (pacemaker)...
ha-host1: Stopping Cluster (pacemaker)...
ha-host2: Stopping Cluster (pacemaker)...
ha-host2: Successfully destroyed cluster
ha-host1: Successfully destroyed cluster
ha-host3: Successfully destroyed cluster

Sending 'pacemaker_remote authkey' to 'ha-host1', 'ha-host2', 'ha-host3'
ha-host2: successful distribution of the file 'pacemaker_remote authkey'
ha-host1: successful distribution of the file 'pacemaker_remote authkey'
ha-host3: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
ha-host1: Succeeded
ha-host2: Succeeded
ha-host3: Succeeded

Synchronizing pcsd certificates on nodes ha-host1, ha-host2, ha-host3...
ha-host1: Success
ha-host2: Success
ha-host3: Success
Restarting pcsd on the nodes in order to reload the certificates...
ha-host1: Success
ha-host2: Success
ha-host3: Success

Step 5 启动集群并设置自动启动

[root@ha-host1 ~]# pcs cluster start --all
ha-host1: Starting Cluster...
ha-host2: Starting Cluster...
ha-host3: Starting Cluster...
[root@ha-host1 ~]# pcs cluster enable --all
ha-host1: Cluster Enabled
ha-host2: Cluster Enabled
ha-host3: Cluster Enabled

Step 6 查看集群状态

[root@ha-host1 ~]# pcs status
Cluster name: linuxha
WARNING: no stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: ha-host3 (version 1.1.16-12.el7_4.8-94ff4df) - partition with quorum
Last updated: Mon Apr 23 03:39:50 2018
Last change: Mon Apr 23 03:38:08 2018 by hacluster via crmd on ha-host3

3 nodes configured
0 resources configured

Online: [ ha-host1 ha-host2 ha-host3 ]

No resources


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

pcs status的结果显示:

当前的仲裁节点(Current DC, DC即为Designated Co-ordinator）为ha-host3，这个节点负责向集群中的节点发出一系列指令，使各个资源按照定义（存储在cib数据库中）启动或停止。
3台集群主机ha-host1，ha-host2，ha-host3都已经online，但是目前没有任何资源。

有一个关于stonith设备的Warning:

WARNING: no stonith devices and stonith-enabled is not false

stonith设备在后面将会介绍，目前因为还没有创建，因此先向stonith-enabled属性设为false。关闭后该WARNING消失。

 [root@ha-host1 ~]# pcs property set stonith-enabled=false
 [root@ha-host1 ~]# pcs status
 Cluster name: linuxha
 Stack: corosync
 Current DC: ha-host3 (version 1.1.16-12.el7_4.8-94ff4df) - partition with quorum
 Last updated: Mon Apr 23 03:40:17 2018
 Last change: Mon Apr 23 03:40:15 2018 by root via cibadmin on ha-host1

 3 nodes configured
 0 resources configured

 Online: [ ha-host1 ha-host2 ha-host3 ]

 No resources


 Daemon Status:
   corosync: active/enabled
   pacemaker: active/enabled
   pcsd: active/enabled

4. 创建资源

创建一个最简单的IP资源：

[root@ha-host1 ~]# pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.56.24 cidr_netmask=24 op monitor interval=30s

查看pcs的状态：

[root@ha-host1 ~]# pcs status
Cluster name: linuxha
Stack: corosync
Current DC: ha-host3 (version 1.1.16-12.el7_4.8-94ff4df) - partition with quorum
Last updated: Mon Apr 23 03:41:46 2018
Last change: Mon Apr 23 03:41:40 2018 by root via cibadmin on ha-host1

3 nodes configured
1 resource configured

Online: [ ha-host1 ha-host2 ha-host3 ]

Full list of resources:

 vip    (ocf::heartbeat:IPaddr2):       Started ha-host1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

上面显示新创建的资源vip，这个IP被分配在ha-host1上。

检查ha-host1的网口，可以看到ha-host1的eth1网口上分配了新的IP 192.168.56.24

[root@ha-host1 ~]# ip a
...
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:55:a1:19 brd ff:ff:ff:ff:ff:ff
    inet 192.168.56.21/24 brd 192.168.56.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet 192.168.56.24/24 brd 192.168.56.255 scope global secondary eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe55:a119/64 scope link
       valid_lft forever preferred_lft forever
[root@ha-host1 ~]# ssh 192.168.56.24
Last login: Wed Apr 18 05:38:11 2018 from 192.168.56.21
[root@ha-host1 ~]# exit
logout
Connection to 192.168.56.24 closed.

5. 触发切换

将ha-host1的eth1网口停掉，可发现192.168.56.24这个IP被切换到了ha-host2的eth1端口。

[root@ha-host1 ~]# ifconfig eth1 down

[root@ha-host2 ~]# pcs status
Cluster name: linuxha
Stack: corosync
Current DC: ha-host3 (version 1.1.16-12.el7_4.8-94ff4df) - partition with quorum
Last updated: Mon Apr 23 03:43:36 2018
Last change: Mon Apr 23 03:41:39 2018 by root via cibadmin on ha-host1

3 nodes configured
1 resource configured

Online: [ ha-host2 ha-host3 ]
OFFLINE: [ ha-host1 ]

Full list of resources:

 vip    (ocf::heartbeat:IPaddr2):       Started ha-host2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@ha-host2 ~]# ip a
...
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:bc:94:42 brd ff:ff:ff:ff:ff:ff
    inet 192.168.56.22/24 brd 192.168.56.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet 192.168.56.24/24 brd 192.168.56.255 scope global secondary eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:febc:9442/64 scope link
       valid_lft forever preferred_lft forever

上面pcs status的结果显示了ha-host1变为了OFFLINE状态，这是因为eth1也是ha-host1的集群间的通信网口。在实际部署中，虚拟IP资源一般用于承载业务，应该和集群通信用的网络分开。

6. Pacemaker中的资源定义

Pacemaker中的资源类型使用standard, provider（仅当standard为ocf使用）和agent来进行标识，格式如下： <standard>:[provider]:<agent>

可用pcs resources list指令列出当前支持的资源类型：

[root@ha-host1 ~]# pcs resource list
...
ocf:heartbeat:iface-vlan - Manages VLAN network interfaces.
ocf:heartbeat:IPaddr - Manages virtual IPv4 and IPv6 addresses (Linux specific
                       version)
ocf:heartbeat:IPaddr2 - Manages virtual IPv4 and IPv6 addresses (Linux specific
                        version)
ocf:heartbeat:IPsrcaddr - Manages the preferred source address for outgoing IP
                          packets
ocf:heartbeat:iSCSILogicalUnit - Manages iSCSI Logical Units (LUs)
ocf:heartbeat:iSCSITarget - iSCSI target export agent
ocf:heartbeat:LVM - Controls the availability of an LVM Volume Group
ocf:heartbeat:MailTo - Notifies recipients by email in the event of resource
                       takeover
ocf:heartbeat:mysql - Manages a MySQL database instance
ocf:heartbeat:nagios - Nagios resource agent
...

以上输出中可以找到之前创建的vip资源类型 ocf:heartbeat:IPaddr2。

然后使用pcs resource describe指令查看该类型资源所需参数

[root@ha-host1 ~]# pcs resource describe ocf:heartbeat:IPaddr2  
ocf:heartbeat:IPaddr2 - Manages virtual IPv4 and IPv6 addresses (Linux specific
                        version)

This Linux-specific resource manages IP alias IP addresses.
It can add an IP alias, or remove one.
In addition, it can implement Cluster Alias IP functionality
if invoked as a clone resource.

If used as a clone, you should explicitly set clone-node-max >= 2,
and/or clone-max < number of nodes. In case of node failure,
clone instances need to be re-allocated on surviving nodes.
This would not be possible if there is already an instance on those nodes,
and clone-node-max=1 (which is the default).

Resource options:
  ip (required): The IPv4 (dotted quad notation) or IPv6 address (colon
                 hexadecimal notation) example IPv4 "192.168.1.1". example IPv6
                 "2001:db8:DC28:0:0:FC57:D4C8:1FFF".
  nic: The base network interface on which the IP address will be brought
       online. If left empty, the script will try and determine this from the
       routing table. Do NOT specify an alias interface in the form eth0:1 or
       anything here; rather, specify the base interface only. If you want a
       label, see the iflabel parameter. Prerequisite: There must be at least
       one static IP address, which is not managed by the cluster, assigned to
       the network interface. If you can not assign any static IP address on the
       interface, modify this kernel parameter: sysctl -w
       net.ipv4.conf.all.promote_secondaries=1 # (or per device)
  cidr_netmask: The netmask for the interface in CIDR format (e.g., 24 and not
                255.255.255.0) If unspecified, the script will also try to
                determine this from the routing table.
  broadcast: Broadcast address associated with the IP. If left empty, the script
             will determine this from the netmask.
  iflabel: You can specify an additional label for your IP address here. This
           label is appended to your interface name. A label can be specified in
           nic parameter but it is deprecated. If a label is specified in nic
           name, this parameter has no effect.
  lvs_support: Enable support for LVS Direct Routing configurations. In case a
               IP address is stopped, only move it to the loopback device to
               allow the local node to continue to service requests, but no
               longer advertise it on the network. Notes for IPv6: It is not
               necessary to enable this option on IPv6. Instead, enable
               'lvs_ipv6_addrlabel' option for LVS-DR usage on IPv6.
  lvs_ipv6_addrlabel: Enable adding IPv6 address label so IPv6 traffic
                      originating from the address's interface does not use this
                      address as the source. This is necessary for LVS-DR health
                      checks to realservers to work. Without it, the most
                      recently added IPv6 address (probably the address added by
                      IPaddr2) will be used as the source address for IPv6
                      traffic from that interface and since that address exists
                      on loopback on the realservers, the realserver response to
                      pings/connections will never leave its loopback. See
                      RFC3484 for the detail of the source address selection.
                      See also 'lvs_ipv6_addrlabel_value' parameter.
  lvs_ipv6_addrlabel_value: Specify IPv6 address label value used when
                            'lvs_ipv6_addrlabel' is enabled. The value should be
                            an unused label in the policy table which is shown
                            by 'ip addrlabel list' command. You would rarely
                            need to change this parameter.
  mac: Set the interface MAC address explicitly. Currently only used in case of
       the Cluster IP Alias. Leave empty to chose automatically.
  clusterip_hash: Specify the hashing algorithm used for the Cluster IP
                  functionality.
  unique_clone_address: If true, add the clone ID to the supplied value of IP to
                        create a unique address to manage
  arp_interval: Specify the interval between unsolicited ARP packets in
                milliseconds.
  arp_count: Number of unsolicited ARP packets to send at resource
             initialization.
  arp_count_refresh: Number of unsolicited ARP packets to send during resource
                     monitoring. Doing so helps mitigate issues of stuck ARP
                     caches resulting from split-brain situations.
  arp_bg: Whether or not to send the ARP packets in the background.
  arp_mac: MAC address to send the ARP packets to. You really shouldn't be
           touching this.
  arp_sender: The program to send ARP packets with on start. For infiniband
              interfaces, default is ipoibarping. If ipoibarping is not
              available, set this to send_arp.
  flush_routes: Flush the routing table on stop. This is for applications which
                use the cluster IP address and which run on the same physical
                host that the IP address lives on. The Linux kernel may force
                that application to take a shortcut to the local loopback
                interface, instead of the interface the address is really bound
                to. Under those circumstances, an application may, somewhat
                unexpectedly, continue to use connections for some time even
                after the IP address is deconfigured. Set this parameter in
                order to immediately disable said shortcut when the IP address
                goes away.
  run_arping: Whether or not to run arping for IPv4 collision detection check.
  preferred_lft: For IPv6, set the preferred lifetime of the IP address. This
                 can be used to ensure that the created IP address will not be
                 used as a source address for routing. Expects a value as
                 specified in section 5.5.4 of RFC 4862.

Default operations:
  start: interval=0s timeout=20s
  stop: interval=0s timeout=20s
  monitor: interval=10s timeout=20s

之前创建vip资源时使用的资源参数（ip，cidr_netmask）和操作参数（monitor:interval）都可在以上输出中找到。