Wednesday, June 19, 2019

Reference Architecture - Windows Failover Cluster - Two-Node Multi-Site


Introduction

A two-node multi-site Windows failover cluster is typically used in non-production environments. The key benefits of this approach are that it provides a “cut down” replica of the standard four-node failover cluster used in production and that it simplifies failover of services to the disaster recovery site.
In this architecture a single Windows server is deployed in each of the primary and secondary sites, and the two are clustered using Windows failover clustering. A quorum file server is deployed in the primary site (if one does not already exist) to provide a third “tie-breaker” vote when the cluster determines which node is active.

 


Limitations

This approach provides a solid disaster recovery failover solution; however, it has the following limitations:
i) No high availability within the primary or secondary site. There is no option to fail over to a secondary node within the primary data centre.
ii) Disaster recovery is a manual process. The node at the secondary site will only take over automatically if node A fails while node B still has access to the quorum file server; if the entire primary site is lost, quorum must be forced manually on node B.
iii) This guide is aimed at Windows Server 2016 and Windows Server 2019. Settings will need to be reviewed for older versions of Windows.

Out of Scope

The following items are out of scope for this document. They may be incorporated into future revisions.
i) SQL Server Always On. This can be deployed on top of the Windows failover cluster and should be treated as a separate work item.
ii) Shared storage. Not covered in this document; it may be added in future revisions.
iii) Application resources and roles. Not covered in this document; common roles may be added in future revisions.





Pre-Build Checklist

Before building the Windows failover cluster, confirm the following details for the cluster:

Cluster Details

Cluster Name

Cluster Primary Site IP Address
The IP address for the cluster at the primary site. This resolves to the Clustered Server Name

Cluster Secondary Site IP Address
The IP address for the cluster at the secondary site.

Quorum File Share
Does a quorum file server already exist or does one need to be created?


Server Node A Details

Server Node A Name
The name of the server. This should follow the same format as the cluster name, with A, B, C, or D appended depending on which node this is in the cluster.

Server Node A Site
Enter the data centre code here

Server Node A Network
The network the server will be deployed into, e.g. 10.254.34.65/27

Server Node A IP Address
IP address of the server.

Server Node A vCPU
Number of vCPUs allocated to this server. This should generally be 4 or fewer and should not exceed 8 without management approval.

Server Node A Memory
Server memory in GB.

Server Node A – Disk C
Typically the standard 40GB unless otherwise required.

Server Node A – Disk E
Applications and data drive

Server Node A – Disk F
F drive, typically reserved for SQL Server files, e.g. FILESTREAM data and replication extracts

Server Node A – Disk L
L drive typically reserved for SQL Server Transaction Logs (LDF)

Server Node A – Disk T
T drive typically reserved for SQL Server tempdb (MDF and LDF)


Server Node B Details

Server Node B Name
The name of the server. This should follow the same format as the cluster name, with A, B, C, or D appended depending on which node this is in the cluster.

Server Node B Site
Enter the data centre code here

Server Node B Network
The network the server will be deployed into, e.g. 10.254.34.65/27

Server Node B IP Address
IP address of the server.

Server Node B vCPU
Number of vCPUs allocated to this server. This should generally be 4 or fewer and should not exceed 8 without management approval.

Server Node B Memory
Server memory in GB.

Server Node B – Disk C
Typically the standard 40GB unless otherwise required.

Server Node B – Disk E
Applications and data drive

Server Node B – Disk F
F drive, typically reserved for SQL Server files, e.g. FILESTREAM data and replication extracts

Server Node B – Disk L
L drive typically reserved for SQL Server Transaction Logs (LDF)

Server Node B – Disk T
T drive typically reserved for SQL Server tempdb (MDF and LDF)




Cluster Quorum Server Build Process

If a quorum file server does not already exist, one will need to be created. If one does exist, skip this section and create the file share witness on the existing server.

Quorum File Server Build

·        Deploy a Windows server as per the standard server build process. Resource requirements are low: 2 vCPUs and 2 GB of RAM are sufficient. A small secondary drive of 5 GB is enough to host multiple quorum file shares.
·        Create a quorums folder on the E drive of the server. (Note: the screenshot shows the C drive; this should be E.)
·        Share this folder out. The Windows share permissions are set to Everyone: Full Control. Note that access to the share is restricted by NTFS permissions.
·        On the NTFS Security tab, add the group ROL SEC Quorum Clients with Full Control.
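The same folder, share, and NTFS permissions can be applied with PowerShell. The sketch below is illustrative only; the E:\Quorums path, Quorums share name, and CONTOSO domain are assumptions to adjust for the environment.

# Create the quorums folder on the secondary (E) drive
New-Item -Path 'E:\Quorums' -ItemType Directory -Force | Out-Null

# Share the folder with Everyone / Full Control; access is restricted via NTFS
New-SmbShare -Name 'Quorums' -Path 'E:\Quorums' -FullAccess 'Everyone'

# Grant the quorum clients group Full Control at the NTFS level
$acl  = Get-Acl -Path 'E:\Quorums'
$rule = New-Object -TypeName System.Security.AccessControl.FileSystemAccessRule `
    -ArgumentList 'CONTOSO\ROL SEC Quorum Clients', 'FullControl', 'ContainerInherit,ObjectInherit', 'None', 'Allow'
$acl.AddAccessRule($rule)
Set-Acl -Path 'E:\Quorums' -AclObject $acl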

Configure Active Directory Groups

In Active Directory, if the ROL SEC Quorum Clients group does not exist, create it. This group will be used by the cluster and node computer accounts to access the quorum file share.

In the group membership, add the computer accounts for the cluster and its nodes to the group.
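A rough PowerShell equivalent is sketched below. It assumes the ActiveDirectory module is available; the CLUSTER01 names are hypothetical placeholders for the cluster and node computer accounts.

Import-Module ActiveDirectory

# Create the group if it does not already exist
if (-not (Get-ADGroup -Filter "Name -eq 'ROL SEC Quorum Clients'")) {
    New-ADGroup -Name 'ROL SEC Quorum Clients' -GroupScope Global -GroupCategory Security
}

# Add the cluster and node computer accounts (the cluster account will only exist once the cluster has been created)
$accounts = 'CLUSTER01', 'CLUSTER01A', 'CLUSTER01B' | ForEach-Object { Get-ADComputer $_ }
Add-ADGroupMember -Identity 'ROL SEC Quorum Clients' -Members $accounts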


Cluster Build Process

Server Builds

·        Deploy Server Node A using the standard server build process
·        Install the Failover Clustering feature on Server Node A as shown in figure 1 below.
·        Reboot the server
·        Deploy Server Node B using the standard server build process
·        Install the Failover Clustering feature on Server Node B as shown in figure 1 below.
·        Reboot the server
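For reference, the feature installation and reboot steps above can also be scripted. This is a sketch only; the node names are placeholders.

foreach ($node in 'CLUSTER01A', 'CLUSTER01B') {
    # Install the Failover Clustering feature together with the management tools
    Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools -ComputerName $node
    # Reboot the node and wait for it to come back online
    Restart-Computer -ComputerName $node -Wait -For PowerShell -Force
}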



Networking

Test Inter-Node RPC Connectivity

The Windows failover clustering service on each node must have full communication with every other node in the cluster over RPC. This occurs on TCP 135 but may also require high (ephemeral) ports. Note that this may already be enabled on a per-zone basis depending on the network implementation. To test connectivity between nodes:
From Server Node A run:
telnet <server node B> 135
This should open a connection to Server Node B.

From Server Node B run:
telnet <server node A> 135
This should open a connection to Server Node A.

If either of these tests fails, troubleshoot connectivity between the two servers.

Test Quorum File Share Connectivity


From Server Node A and Server Node B run:
telnet <quorum file server> 445
This should open a connection to the quorum file server.
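If telnet is not installed on the servers, Test-NetConnection provides the same checks. The names below are placeholders; in each case TcpTestSucceeded should report True.

Test-NetConnection -ComputerName 'CLUSTER01B' -Port 135    # run from node A: RPC endpoint mapper
Test-NetConnection -ComputerName 'CLUSTER01A' -Port 135    # run from node B: RPC endpoint mapper
Test-NetConnection -ComputerName 'QUORUMFS01' -Port 445    # run from both nodes: SMB to the quorum file server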


VMware Configuration

Anti-affinity rules are configured in VMware to ensure that two server nodes of a cluster do not reside on the same physical host. If this were to occur, an issue with the physical host would impact more than one node in the cluster and cause an extended outage.
As this is a two-node cluster with one server node per data centre, this is not an issue and anti-affinity rules do not need to be configured.

Configure Cluster

On the first Windows failover cluster server node open Failover Cluster Manager:


Right click on Failover Cluster Manager and click on Create Cluster…
This starts the wizard; click Next.

Next add the server nodes deployed in the Server Builds section to the cluster.

Now run the validation tests. These tests are run to ensure that the hardware, operating system, and software are all compatible with Windows Failover Clustering.

Select Run all tests (Recommended)

Click Next to continue.

Once complete, the interface should display The test passed. Passing is mandatory for some applications, such as SQL Server Always On, which will refuse to install unless the cluster validation checks pass.
As shown in the screenshot below, the Validate Network Communication test will report a warning. This test will only pass if there are at least two network interfaces on each node and the inter-node connectivity checks pass. The warning is acceptable here because high availability is provided by the underlying network rather than at the operating system level. That underlying redundancy is not exposed to the operating system, so Windows cannot confirm it and reports the warning.

It should also be noted that a pass for Validate Network Communication does not always meet Microsoft requirements. If a secondary network adapter in Windows shares common infrastructure with the primary network adapter and that underlying infrastructure is not highly available, the configuration is technically not valid.

Once the nodes have been added and the validation checks performed, the cluster can be created. Enter the cluster name and the IP addresses for the cluster itself. Note that the screenshot below shows only one cluster IP address. In a multi-site configuration there will be one cluster IP address per site.

Click Next to continue.

Click Finish to complete creation of the cluster.
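The validation and creation steps above can also be performed from PowerShell. This is a sketch only; the cluster name, node names, and per-site IP addresses are placeholders.

# Run the full validation report (equivalent to Run all tests)
Test-Cluster -Node 'CLUSTER01A', 'CLUSTER01B'

# Create the cluster with one static address per site and no shared storage
New-Cluster -Name 'CLUSTER01' -Node 'CLUSTER01A', 'CLUSTER01B' `
    -StaticAddress '10.254.34.70', '10.254.66.70' -NoStorage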

Failover Cluster Manager will now show the newly created cluster. On the main dashboard we expect to see the cluster server name and the cluster IP addresses online, as shown below.

Under Nodes we’re expecting to see each cluster node. Each node should be online with Status Up.

Under Networks the default setup should be correct. As we are using a single network, both cluster communications and client traffic should use the same network.
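The same state can be confirmed from PowerShell on either node:

Get-Cluster                 # the cluster object should be returned
Get-ClusterNode             # each node should report State : Up
Get-ClusterNetwork          # a single network with Role ClusterAndClient is expected here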


Disk I/O Timeout Configuration

VMware recommends increasing the disk I/O timeout for Windows clusters. This makes the clustering less sensitive to vMotion events.
On each cluster node, set the disk I/O timeout to 60 seconds by modifying the registry value HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk\TimeOutValue.
Note that the system might reset this I/O timeout value if you re-create a cluster.
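A minimal PowerShell equivalent, run on each cluster node (or wrapped in Invoke-Command to run remotely):

# Set the disk I/O timeout to 60 seconds (REG_DWORD, value in seconds)
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Disk' `
    -Name 'TimeOutValue' -Value 60 -Type DWord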


Cluster Heartbeat Timeout

No modifications to the cluster heartbeat settings are required for Windows Server 2016 and above. The VMware best practice recommendations match the default values of Server 2016.
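If required, the current values can be confirmed against the defaults with:

# View the heartbeat delay and threshold settings (no changes are made)
Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold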

Cluster Service Accounts

No service accounts are created. Instead, give the cluster computer account permissions to read, write, and create objects in the parent OU.
In ADUC, ensure Advanced Features is enabled under the View menu.

Right-click the parent OU for the cluster and node objects and select Properties.


On the Security tab, add the cluster computer account and ensure the Read, Write, and Create all child objects permissions are granted.

In the advanced security properties for the account, ensure the Create Computer objects permission is granted.
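As an alternative to the GUI steps, the create-computer-objects right can be granted with dsacls. This is a sketch only; the OU distinguished name, domain, and cluster account are placeholders, and the resulting permissions should be verified against the steps above.

# Grant the cluster computer account the right to create computer objects in the parent OU
$ou = 'OU=Clusters,OU=Servers,DC=contoso,DC=com'
dsacls $ou /G 'CONTOSO\CLUSTER01$:CC;computer'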

Cluster Quorum Configuration

Ensure the cluster and server computer accounts have been added to the ROL SEC Quorum Clients group.
On the quorum file server, add a folder to the quorums share for this cluster. Note that this folder will inherit the permissions from the parent folder.
From the cluster dashboard click on More Actions → Configure Cluster Quorum Settings…



Select the option Select the quorum witness.


Select Configure a file share witness.


Add the full path of the quorum file share as shown below. Make sure that the file share path terminates in a dedicated folder for the cluster.


Click Next and Finish to complete the quorum configuration.
Once the quorum configuration has finished, check the file share. Two folders like those in the screenshot below should be present, indicating that the quorum file share is being used by the cluster.
In the cluster dashboard the File Share Witness should now be displayed, as shown below.
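The witness can also be configured and checked from PowerShell. The server name and share path below are placeholders matching the structure described above.

# Point the cluster at the dedicated folder on the quorum file share
Set-ClusterQuorum -NodeAndFileShareMajority '\\QUORUMFS01\Quorums\CLUSTER01'

# Confirm the witness resource is online
Get-ClusterQuorum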






Appendix A – References

Windows Failover Clustering Requirements

VMware Windows Clustering

Cluster Heartbeat Settings




Appendix B – Design Considerations

Cluster Networking

Traditional cluster networking involved two discrete networks on each cluster node: a private network for cluster communications and a public network for application traffic. Since Windows Server 2008 there is no longer such a thing as a dedicated cluster network. Instead, Windows probes the networking state to determine the optimal connection for heartbeat traffic, meaning that heartbeat traffic can go over any network adapter available to Windows.
The recommendation is to eliminate single points of failure in the network by either providing multiple redundant networks to the Windows Server or by providing a single network connection with full redundancy built in.
From the Microsoft documentation:
In the network infrastructure that connects your cluster nodes, avoid having single points of failure. There are multiple ways of accomplishing this. You can connect your cluster nodes by multiple, distinct networks. Alternatively, you can connect your cluster nodes with one network that is constructed with teamed network adapters, redundant switches, redundant routers, or similar hardware that removes single points of failure.
To maintain simplicity in the environment, the decision has been made to use only a single server network and not implement a dedicated cluster network. As a dedicated cluster network would use the same physical interface on the host and the same network fabric, there is no advantage in maintaining one.

Cluster Heartbeat

For a Windows Server 2016 and above failover cluster, no changes to the heartbeat settings are recommended. Server 2016 implemented default settings in line with VMware recommendations supporting vMotion and DRS.

