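Linux HA Cluster Principles and Configuration - 03

This article introduces the stonith module in a Linux HA cluster.

Stonith, short for Shoot The Other Node In The Head, protects a cluster against split-brain. Put simply, once the nodes in a cluster lose communication with each other and can no longer tell what state the others are in, each node tries to fence (isolate, or "shoot") the nodes it has lost contact with, making sure they can no longer compete for resources. Only then does it go on to start service resources and serve clients.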

1. Installing Stonith and an overview of its agents

Install the fence-agents package on all three cluster hosts.

# yum -y install fence-agents

Once the installation is complete, list the stonith device types the system supports:

[root@ha-host1 ~]# pcs stonith list
fence_apc - Fence agent for APC over telnet/ssh
fence_apc_snmp - Fence agent for APC, Tripplite PDU over SNMP
fence_bladecenter - Fence agent for IBM BladeCenter
fence_brocade - Fence agent for HP Brocade over telnet/ssh
fence_cisco_mds - Fence agent for Cisco MDS
fence_cisco_ucs - Fence agent for Cisco UCS
fence_compute - Fence agent for the automatic resurrection of OpenStack compute
                instances
fence_drac5 - Fence agent for Dell DRAC CMC/5
fence_eaton_snmp - Fence agent for Eaton over SNMP
fence_emerson - Fence agent for Emerson over SNMP
fence_eps - Fence agent for ePowerSwitch
fence_evacuate - Fence agent for the automatic resurrection of OpenStack compute
                 instances
fence_hpblade - Fence agent for HP BladeSystem
fence_ibmblade - Fence agent for IBM BladeCenter over SNMP
fence_idrac - Fence agent for IPMI
fence_ifmib - Fence agent for IF MIB
fence_ilo - Fence agent for HP iLO
fence_ilo2 - Fence agent for HP iLO
fence_ilo3 - Fence agent for IPMI
fence_ilo3_ssh - Fence agent for HP iLO over SSH
fence_ilo4 - Fence agent for IPMI
fence_ilo4_ssh - Fence agent for HP iLO over SSH
fence_ilo_moonshot - Fence agent for HP Moonshot iLO
fence_ilo_mp - Fence agent for HP iLO MP
fence_ilo_ssh - Fence agent for HP iLO over SSH
fence_imm - Fence agent for IPMI
fence_intelmodular - Fence agent for Intel Modular
fence_ipdu - Fence agent for iPDU over SNMP
fence_ipmilan - Fence agent for IPMI
fence_kdump - Fence agent for use with kdump
fence_mpath - Fence agent for multipath persistent reservation
fence_rhevm - Fence agent for RHEV-M REST API
fence_rsa - Fence agent for IBM RSA
fence_rsb - I/O Fencing agent for Fujitsu-Siemens RSB
fence_sbd - Fence agent for sbd
fence_scsi - Fence agent for SCSI persistent reservation
fence_virt - Fence agent for virtual machines
fence_vmware_soap - Fence agent for VMWare over SOAP API
fence_wti - Fence agent for WTI
fence_xvm - Fence agent for virtual machines

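Each fence agent in the output above is one kind of Stonith device. As the suffixes of their names suggest, the agents fall into the following categories:

Agents that power off the node being fenced through the server's management port, such as ilo, ipmi and drac. The vast majority of agents belong to this category, and they are used to control physical server nodes.
Agents that shut down the node being fenced through the hypervisor layer or the cloud platform, such as virt, vmware, xvm and compute. These are used to control virtual machine nodes.
Agents that keep the node being fenced from starting its services by denying it access to a specific resource, such as scsi, mpath and brocade.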

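The first two categories are power-type Stonith devices, while the third has nothing to do with power. The distinction matters because:

With a non-power Stonith device, the fenced node is not powered off; its services are merely prevented from starting. Before the node is restarted it must be unfenced so that it can come back up normally. When creating a Stonith device of this type you therefore need to specify the parameter meta provides=unfencing.
With a power-type stonith device this is not needed: the fenced node has already been powered off, and starting it up again is itself the unfence operation.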
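
The rule above can be sketched as a small shell helper. This is only an illustration: the function name needs_unfencing is hypothetical, the list of non-power agents simply follows the three examples named in this article (scsi, mpath, brocade), and treating every other agent as power-type is a simplifying assumption.

```shell
#!/bin/sh
# Hypothetical helper: decide whether a fence agent needs
# "meta provides=unfencing", following this article's classification.
needs_unfencing() {
    case "$1" in
        # Non-power agents: they block access to a resource instead of
        # cutting power, so the node has to be unfenced explicitly.
        fence_scsi|fence_mpath|fence_brocade) echo yes ;;
        # Assumption: everything else is treated as a power-type agent.
        *) echo no ;;
    esac
}

for agent in fence_scsi fence_ipmilan fence_xvm; do
    printf '%s: %s\n' "$agent" "$(needs_unfencing "$agent")"
done
```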

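2. Creating a stonith device

The following experiment uses fence_scsi as the example.

2.1 Creating shared storage

Following the method in the article "Configuring iSCSI on CentOS 7" (《在CentOS7上配置iSCSI》), use a dedicated storage node, ha-disks, to provide shared storage to the three cluster hosts (that is, create iSCSI disks on ha-disks and map them to the three cluster hosts).

On ha-disks, create three 100 MB disks, fen1, fen2 and fen3. Once attached to the hosts they appear as sdb, sdc and sdd: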

[root@ha-host1 ~]# fdisk -l | grep dev
Disk /dev/sda: 42.9 GB, 42949672960 bytes, 83886080 sectors
/dev/sda1            2048        4095        1024   83  Linux
/dev/sda2   *        4096     2101247     1048576   83  Linux
/dev/sda3         2101248    83886079    40892416   8e  Linux LVM
Disk /dev/mapper/VolGroup00-LogVol00: 40.2 GB, 40231763968 bytes, 78577664 sectors
Disk /dev/mapper/VolGroup00-LogVol01: 1610 MB, 1610612736 bytes, 3145728 sectors
Disk /dev/sdb: 104 MB, 104857600 bytes, 204800 sectors
Disk /dev/sdc: 104 MB, 104857600 bytes, 204800 sectors
Disk /dev/sdd: 104 MB, 104857600 bytes, 204800 sectors

[root@ha-host2 ~]# fdisk -l | grep dev
Disk /dev/sda: 42.9 GB, 42949672960 bytes, 83886080 sectors
/dev/sda1            2048        4095        1024   83  Linux
/dev/sda2   *        4096     2101247     1048576   83  Linux
/dev/sda3         2101248    83886079    40892416   8e  Linux LVM
Disk /dev/mapper/VolGroup00-LogVol00: 40.2 GB, 40231763968 bytes, 78577664 sectors
Disk /dev/mapper/VolGroup00-LogVol01: 1610 MB, 1610612736 bytes, 3145728 sectors
Disk /dev/sdb: 104 MB, 104857600 bytes, 204800 sectors
Disk /dev/sdc: 104 MB, 104857600 bytes, 204800 sectors
Disk /dev/sdd: 104 MB, 104857600 bytes, 204800 sectors

[root@ha-host3 ~]# fdisk -l | grep dev
Disk /dev/sda: 42.9 GB, 42949672960 bytes, 83886080 sectors
/dev/sda1            2048        4095        1024   83  Linux
/dev/sda2   *        4096     2101247     1048576   83  Linux
/dev/sda3         2101248    83886079    40892416   8e  Linux LVM
Disk /dev/mapper/VolGroup00-LogVol00: 40.2 GB, 40231763968 bytes, 78577664 sectors
Disk /dev/mapper/VolGroup00-LogVol01: 1610 MB, 1610612736 bytes, 3145728 sectors
Disk /dev/sdb: 104 MB, 104857600 bytes, 204800 sectors
Disk /dev/sdc: 104 MB, 104857600 bytes, 204800 sectors
Disk /dev/sdd: 104 MB, 104857600 bytes, 204800 sectors

Check whether these disks support PR (persistent reservation) keys:

[root@ha-host1 ~]# sg_persist /dev/sdc
>> No service action given; assume Persistent Reserve In command
>> with Read Keys service action
  LIO-ORG   fen2              4.0
  Peripheral device type: disk
  PR generation=0x5, there are NO registered reservation keys

2.2 Creating the stonith device

Start the experiment with a single fence disk, /dev/sdb:

[root@ha-host1 ~]# pcs stonith create scsi-shooter fence_scsi pcmk_host_list="ha-host1 ha-host2 ha-host3" devices=/dev/sdb meta provides=unfencing
[root@ha-host1 ~]# pcs status
Cluster name: linuxha
Stack: corosync
Current DC: ha-host2 (version 1.1.16-12.el7_4.8-94ff4df) - partition with quorum
Last updated: Fri May  4 07:10:33 2018
Last change: Fri May  4 07:07:14 2018 by root via cibadmin on ha-host1

3 nodes configured
2 resources configured

Online: [ ha-host1 ha-host2 ha-host3 ]

Full list of resources:

 vip    (ocf::heartbeat:IPaddr2):       Started ha-host1
 scsi-shooter   (stonith:fence_scsi):   Started ha-host2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
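
The experiment here uses a single fence disk. As a sketch of how the same pcs stonith create call might be extended to several shared disks, the helper below only prints the command and does not run it. The function name build_stonith_cmd is hypothetical, and the sketch assumes that the devices parameter of fence_scsi accepts a comma-separated list, as the agent's metadata describes.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print (do not execute) a
# "pcs stonith create" command for fence_scsi.
# Usage: build_stonith_cmd NAME "HOST1 HOST2 ..." DEVICE [DEVICE ...]
build_stonith_cmd() {
    name=$1
    hosts=$2
    shift 2
    # Join the remaining arguments (the devices) with commas.
    devices=$(IFS=,; printf '%s' "$*")
    printf 'pcs stonith create %s fence_scsi pcmk_host_list="%s" devices=%s meta provides=unfencing\n' \
        "$name" "$hosts" "$devices"
}

build_stonith_cmd scsi-shooter "ha-host1 ha-host2 ha-host3" /dev/sdb /dev/sdc /dev/sdd
```

Printing the command first makes it easy to review the host list and device paths before running it on a cluster node.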

Use the -s option of sg_persist to read the full status of /dev/sdb:

[root@ha-host1 ~]# sg_persist -s /dev/sdb
  LIO-ORG   fen1              4.0
  Peripheral device type: disk
  PR generation=0xa
    Key=0x35fc0000
      All target ports bit clear
      Relative port address: 0x1
      << Reservation holder >>
      scope: LU_SCOPE,  type: Write Exclusive, registrants only
      Transport Id of initiator:
        iSCSI name and session id: iqn.2016-06.com.ha-host1:iscsi-host1
    Key=0x35fc0001
      All target ports bit clear
      Relative port address: 0x1
      not reservation holder
      Transport Id of initiator:
        iSCSI name and session id: iqn.2016-06.com.ha-host2:iscsi-host2
    Key=0x35fc0002
      All target ports bit clear
      Relative port address: 0x1
      not reservation holder
      Transport Id of initiator:
        iSCSI name and session id: iqn.2016-06.com.ha-host3:iscsi-host3

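As shown, the three nodes have each registered on this disk with a different PR key, and ha-host1 holds the reservation, of type "Write Exclusive, registrants only". With this reservation type only registered nodes may write to the disk, so a node whose key is removed loses write access.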
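
The registrations in the sg_persist -s output can be condensed to one line per key with a short awk filter. This is a minimal sketch: the function name pr_summary is hypothetical, and it assumes the output layout shown above (a Key= line, an optional reservation holder marker, then the iSCSI initiator name).

```shell
#!/bin/sh
# Hypothetical helper: condense `sg_persist -s` output (read from stdin)
# into one "key initiator role" line per registration.
pr_summary() {
    awk '
        # A "Key=0x..." line starts a new registration.
        /Key=/ { key = $1; sub(/^Key=/, "", key); role = "registrant" }
        # This marker appears only under the key that holds the reservation.
        /<< Reservation holder >>/ { role = "holder" }
        # The initiator name is the last field of the transport ID line.
        /iSCSI name and session id:/ { print key, $NF, role }
    '
}

# Sample input copied from the output above (first two registrations).
pr_summary <<'EOF'
  PR generation=0xa
    Key=0x35fc0000
      All target ports bit clear
      Relative port address: 0x1
      << Reservation holder >>
      scope: LU_SCOPE,  type: Write Exclusive, registrants only
      Transport Id of initiator:
        iSCSI name and session id: iqn.2016-06.com.ha-host1:iscsi-host1
    Key=0x35fc0001
      All target ports bit clear
      Relative port address: 0x1
      not reservation holder
      Transport Id of initiator:
        iSCSI name and session id: iqn.2016-06.com.ha-host2:iscsi-host2
EOF
```

On a cluster node the same filter could be fed directly, for example with sg_persist -s /dev/sdb | pr_summary.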

If the link between two of the nodes is now cut, for example ha-host1 and ha-host3:

[root@ha-host1 ~]# pcs status
Cluster name: linuxha
Stack: corosync
Current DC: ha-host2 (version 1.1.16-12.el7_4.8-94ff4df) - partition with quorum
Last updated: Fri May  4 07:30:53 2018
Last change: Fri May  4 07:07:13 2018 by root via cibadmin on ha-host1

3 nodes configured
2 resources configured

Node ha-host3: UNCLEAN (offline)
Online: [ ha-host1 ha-host2 ]

Full list of resources:

 vip    (ocf::heartbeat:IPaddr2):       Started ha-host1
 scsi-shooter   (stonith:fence_scsi):   Started ha-host2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@ha-host1 ~]# sg_persist -s /dev/sdb
  LIO-ORG   fen1              4.0
  Peripheral device type: disk
  PR generation=0xb
    Key=0x35fc0000
      All target ports bit clear
      Relative port address: 0x1
      << Reservation holder >>
      scope: LU_SCOPE,  type: Write Exclusive, registrants only
      Transport Id of initiator:
        iSCSI name and session id: iqn.2016-06.com.ha-host1:iscsi-host1
    Key=0x35fc0001
      All target ports bit clear
      Relative port address: 0x1
      not reservation holder
      Transport Id of initiator:
        iSCSI name and session id: iqn.2016-06.com.ha-host2:iscsi-host2

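As shown, after negotiation ha-host3 has left the cluster and its registration on the fencing disk has been removed. Because the stonith resource runs on ha-host2, the log on ha-host2 shows how ha-host3 was fenced: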

[root@ha-host2 ~]# tail -1000 /var/log/cluster/corosync.log | grep ha-host3
May 04 07:30:51 [1437] ha-host2    pengine:   notice: LogNodeActions:    * Fence (reboot) ha-host3 'peer is no longer part of the cluster'
May 04 07:30:51 [1438] ha-host2       crmd:   notice: te_fence_node:    Requesting fencing (reboot) of node ha-host3 | action=1 timeout=60000
May 04 07:30:51 [1434] ha-host2 stonith-ng:   notice: handle_request:   Client crmd.1438.0cea319b wants to fence (reboot) 'ha-host3' with device '(any)'
May 04 07:30:51 [1434] ha-host2 stonith-ng:   notice: initiate_remote_stonith_op:       Requesting peer fencing (reboot) of ha-host3 | id=0cf426c7-666f-4299-8285-fa500fa5ac09 state=0
May 04 07:30:52 [1434] ha-host2 stonith-ng:   notice: can_fence_host_with_device:       scsi-shooter can fence (reboot) ha-host3: static-list
May 04 07:30:52 [1434] ha-host2 stonith-ng:     info: process_remote_stonith_query:     Query result 1 of 2 from ha-host2 for ha-host3/reboot (1 devices) 0cf426c7-666f-4299-8285-fa500fa5ac09
May 04 07:30:52 [1434] ha-host2 stonith-ng:     info: call_remote_stonith:     Total timeout set to 60 for peer's fencing of ha-host3 for crmd.1438|id=0cf426c7-666f-4299-8285-fa500fa5ac09
May 04 07:30:52 [1434] ha-host2 stonith-ng:     info: call_remote_stonith:     Requesting that 'ha-host2' perform op 'ha-host3 reboot' for crmd.1438 (72s, 0s)
May 04 07:30:52 [1434] ha-host2 stonith-ng:     info: process_remote_stonith_query:     Query result 2 of 2 from ha-host1 for ha-host3/reboot (1 devices) 0cf426c7-666f-4299-8285-fa500fa5ac09
May 04 07:30:52 [1434] ha-host2 stonith-ng:   notice: can_fence_host_with_device:       scsi-shooter can fence (reboot) ha-host3: static-list
May 04 07:30:52 [1434] ha-host2 stonith-ng:     info: stonith_fence_get_devices_cb:     Found 1 matching devices for 'ha-host3'
May 04 07:30:53 [1434] ha-host2 stonith-ng:  warning: log_action:       fence_scsi[2603] stderr: [ WARNING:root:Parse error: Ignoring unknown option 'port=ha-host3' ]
May 04 07:30:53 [1434] ha-host2 stonith-ng:   notice: log_operation:    Operation 'reboot' [2603] (call 6 from crmd.1438) for host 'ha-host3' with device 'scsi-shooter' returned: 0 (OK)
May 04 07:30:53 [1434] ha-host2 stonith-ng:   notice: remote_op_done:   Operation reboot of ha-host3 by ha-host2 for crmd.1438@ha-host2.0cf426c7: OK
May 04 07:30:53 [1438] ha-host2       crmd:     info: tengine_stonith_callback:Stonith operation 6 for ha-host3 passed
May 04 07:30:53 [1438] ha-host2       crmd:     info: crm_update_peer_expected:crmd_peer_down: Node ha-host3[3] - expected state is now down (was member)
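
To pull just the outcome out of a log like this, one option is to filter on the remote_op_done lines. The sketch below is illustrative only: the function name fence_results is hypothetical, and it assumes the message wording shown in this log, which may differ in other Pacemaker versions.

```shell
#!/bin/sh
# Hypothetical helper: print "target action by executor result" for each
# completed fencing operation found in a Pacemaker log read from stdin.
fence_results() {
    awk '
        /remote_op_done:/ {
            for (i = 1; i <= NF; i++)
                if ($i == "Operation") {
                    # Fields: Operation <action> of <target> by <executor> ... <result>
                    print $(i + 3), $(i + 1), "by", $(i + 5), $NF
                    break
                }
        }
    '
}

# Two lines copied from the log above; only the second one is a final result.
fence_results <<'EOF'
May 04 07:30:53 [1434] ha-host2 stonith-ng:   notice: log_operation:    Operation 'reboot' [2603] (call 6 from crmd.1438) for host 'ha-host3' with device 'scsi-shooter' returned: 0 (OK)
May 04 07:30:53 [1434] ha-host2 stonith-ng:   notice: remote_op_done:   Operation reboot of ha-host3 by ha-host2 for crmd.1438@ha-host2.0cf426c7: OK
EOF
```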

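After ha-host3 has been fenced it must be rebooted before it can register its PR key again; otherwise, even once the network recovers, it cannot run resources that depend on stonith.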

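Question: quorum already guarantees that only a partition holding more than half of the nodes can start resources, so why is a stonith device still needed?

When the cluster has only two nodes, a partition containing a single node must be allowed to start resources, and in that case a Stonith device is mandatory.
Quorum relies on communication between the cluster daemons (corosync and pacemaker) on each host. It is possible for those daemons to have stopped while the resource processes they managed keep running (the resources are left unmanaged); the node has then left the cluster but still competes for resources.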
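
The majority rule behind the first answer can be checked with a line of shell arithmetic. This is a minimal sketch of the rule as stated above (strictly more than half of the nodes), not of how corosync itself computes quorum; the function name has_quorum is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: does a partition with PRESENT nodes, out of TOTAL
# configured nodes, hold a strict majority?
has_quorum() {
    total=$1
    present=$2
    # Strictly more than half: present * 2 > total.
    if [ $((present * 2)) -gt "$total" ]; then
        echo yes
    else
        echo no
    fi
}

echo "3 nodes, 2 present: $(has_quorum 3 2)"
echo "3 nodes, 1 present: $(has_quorum 3 1)"
echo "2 nodes, 1 present: $(has_quorum 2 1)"
```

In a two-node cluster a single surviving node can never reach a majority, which is why such clusters have to relax the rule and depend on stonith instead.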