
Cluster Concepts

Posted by Nageswari Vijayakumar | Posted on 1:46 AM

When the physical disks are not powering up or spinning, Cluster service cannot initialize any quorum resources.
Cause: Cables are not correctly connected, or the physical disks are not configured to spin when they receive power.
Solution: After checking that the cables are correctly connected, check that the physical disks are configured to spin when they receive power.
The Cluster service fails to start and generates an Event ID 1034 in the Event log after you replace a failed hard disk, or change drives for the quorum resource.
Cause: If a hard disk is replaced, or the bus is reenumerated, the Cluster service may not find the expected disk signatures, and consequently may fail to mount the disk.
Solution: Write down the expected signature from the Description section of the Event ID 1034 error message. Then follow these steps:
1. Back up the server cluster.
2. Set the Cluster service to start manually on all nodes, and then turn off all but one node.
3. If necessary, partition the new disk and assign a drive letter.
4. Use the confdisk.exe tool (available in the Microsoft Windows Server 2003 Resource Kit) to write that signature to the disk.
5. Start the Cluster service and bring the disk online.
6. If necessary, restore the cluster configuration information.
7. Turn on each node, one at a time.
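Steps 2 and 5 above can be sketched from a command prompt. The sc.exe, net.exe, and cluster.exe tools all ship with Windows Server 2003; the service short name ClusSvc and the resource name "Disk Q:" are assumptions for illustration:

```shell
:: Step 2: set the Cluster service to start manually (run on each node)
sc config ClusSvc start= demand

:: ... replace the disk and write the expected signature with confdisk.exe ...

:: Step 5: start the Cluster service on the remaining node
net start ClusSvc

:: Bring the disk resource online (resource name "Disk Q:" is an example)
cluster res "Disk Q:" /online
```

Note the required space after `start=` in the sc syntax.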
For information on replacing disks in a server cluster, see Knowledge Base article Q305793, "How to Replace a Disk with Windows 2000 or Windows Server 2003 family Clusters" in the Microsoft Knowledge Base.
Drive on the shared storage bus is not recognized.
Cause: Scanning for storage devices is not disabled on each controller on the shared storage bus.
Solution: Verify that scanning for storage devices is disabled on each controller on the shared storage bus.
Many times, the second computer you turn on does not recognize the shared storage bus during the BIOS scan if the first computer is running. This situation can manifest itself in a "Device not ready" error being generated by the controller or in substantial delays during startup.
To correct this, disable the option to scan for devices on the shared controller.
Note
• This symptom can manifest itself as one of several errors, depending on the attached controller. It is normally accompanied with a one- to two-minute start delay and an error indicating the failure of some device.
Configuration cannot be accessed through Disk Management.
Under normal cluster operations, the node that owns a quorum resource locks the drive storing the quorum resource, preventing the other nodes from using the device. If you find that the cluster node that owns a quorum resource cannot access configuration information through Disk Management, the source of the problem and the solution might be one of the following:
Cause: A device does not have physical connectivity and power.
Solution: Reseat controller cards, reseat cables, and make sure the drive spins up when you start.
Cause: You attached the cluster storage device to all nodes and started all the nodes before installing the Cluster service on any node.
Solution: After you attach all servers to the cluster drives, you must install the Cluster service on one node before starting all the nodes. Attaching the drive to all the nodes before you have the cluster installed can corrupt the file system on the disk resources on the shared storage bus.
SCSI or Fibre Channel storage devices do not respond.
Cause: The SCSI bus is not properly terminated.
Solution: Make sure that the SCSI bus is not terminated early and that it is terminated at both ends.
Cause: The SCSI or Fibre Channel cable is longer than the specification allows.
Solution: Make sure that the SCSI or Fibre Channel cable is no longer than the cable specification allows.
Cause: The SCSI or Fibre Channel cable is damaged.
Solution: Make sure that the SCSI or Fibre Channel cable is not damaged. (For example, check for bent pins and loose connectors on the cable, and replace it if necessary.)
Disk groups do not move, or stay in the Online Pending state after a move.
Cause: Cables are damaged or not properly installed.
Solution: Check for bent pins on cables and make sure that all cables are firmly anchored to the chassis of the server and drive cabinet.
Disks do not come online or Cluster service does not start when a node is turned off.
Cause: If the quorum log is corrupted, the Cluster service cannot start.
Solution: If you suspect the quorum resource is corrupted, see the information on the problem "Quorum log becomes corrupted" in Node-to-node connectivity problems.
Drives do not fail over or come online.
Cause: The drive is not on a shared storage bus.
Solution: If drives on the shared storage bus do not fail over or come online, make sure the disk is on a shared storage bus, not on a nonsystem bus.
Cause: If you have more than one local storage bus, some drives listed in Shared cluster disks will not be on the shared storage bus.
Solution: Remove these drives from Shared cluster disks. If you do not remove them, the drives will not fail over, even though you can configure them as resources.
(Shared cluster disks is a page in the Cluster Application Wizard.)
Mounted drives disappear, do not fail over, or do not come online.
Cause: The clustered mounted drive was not configured correctly.
Solution: Look at the Cluster service errors in the Event Log (ClusSvc under the Source column). You need to recreate or reconfigure the clustered mounted drive if the description of any Cluster service error is similar to the following:
Cluster disk resource "disk resource": Mount point "mount drive" for target volume "target volume" is not acceptable for a clustered disk because reason. This mount point will not be maintained by the disk resource.
When recreating or reconfiguring the mounted drive(s), follow these guidelines:
• Make sure that you create unique mounted drives so that they do not conflict with existing local drives on any node in the cluster.
• Do not create mounted drives between disks on the cluster storage device (cluster disks) and local disks.
• Do not create a mounted drive from a clustered disk to the cluster disk that contains the quorum resource (the quorum disk). You can, however, create a mounted drive from the quorum disk to a clustered disk.
• Mounted drives from one cluster disk to another must be in the same cluster resource group, and must be dependent on the root disk.
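Before recreating mounted drives against the guidelines above, it helps to see what currently exists on a node. The built-in mountvol tool lists volume names and the folders they are mounted on; the folder path and volume GUID below are placeholders to be replaced with values from the listing:

```shell
:: List all volumes and the folders they are mounted on
mountvol

:: Example: mount a clustered volume under a folder on another cluster disk
:: (the volume GUID shown is a placeholder - copy the real one from the listing)
mountvol F:\Data\ \\?\Volume{00000000-0000-0000-0000-000000000000}\
```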
Basic Troubleshooting Steps
When working with SQL Server failover clustering, remember that the server cluster consists of a failover cluster instance that runs under Microsoft Cluster Services (MSCS). The instance of SQL Server is hosted by the MSCS-based nodes that make up the server cluster.
If problems exist on the nodes that host the server cluster, those problems may manifest themselves as issues with your failover cluster instance. To investigate and resolve these issues, troubleshoot a SQL Server failover cluster in the following order:
1. Hardware: Review Microsoft Windows system event logs.
2. Operating system: Review Windows system and application event logs.
3. Network: Review Windows system and application event logs. Verify the current configuration against the Knowledge Base article, Recommended Private "Heartbeat" Configuration on a Cluster Server.
4. Security: Review Windows application and security event logs.
5. MSCS: Review Windows system and application event logs, and the cluster log.
6. SQL Server: Troubleshoot as normal after the hardware, operating system, network, security, and MSCS foundations are verified to be problem-free.
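On newer versions of Windows, the event-log reviews in steps 1 through 5 can be scripted with the built-in wevtutil tool rather than browsed in Event Viewer; a minimal sketch:

```shell
:: Most recent 20 System log events, newest first (hardware, OS, MSCS layers)
wevtutil qe System /c:20 /rd:true /f:text

:: Most recent 20 Application log events (SQL Server, MS DTC)
wevtutil qe Application /c:20 /rd:true /f:text

:: Security log usually requires an elevated command prompt
wevtutil qe Security /c:20 /rd:true /f:text
```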
Recovering from Failover Cluster Failure
Usually, failover cluster failure is the result of one of two causes:
• Hardware failure in one node of a two-node cluster. This hardware failure could be caused by a failure in the SCSI card or in the operating system.
To recover from this failure, remove the failed node from the failover cluster using the SQL Server Setup program, address the hardware failure with the computer offline, bring the machine back up, and then add the repaired node back to the failover cluster instance.
For more information, see How to: Create a New SQL Server Failover Cluster (Setup) and How to: Recover from Failover Cluster Failure in Scenario 1.
• Operating system failure. In this case, the node is offline, but is not irretrievably broken.
To recover from an operating system failure, recover the node and test failover. If the SQL Server instance does not fail over properly, you must use the SQL Server Setup program to remove SQL Server from the failover cluster, make necessary repairs, bring the computer back up, and then add the repaired node back to the failover cluster instance.
Recovering from operating system failure this way can take time. If the operating system failure can be recovered easily, avoid using this technique.
For more information, see How to: Create a New SQL Server Failover Cluster (Setup) and How to: Recover from Failover Cluster Failure in Scenario 2.
Resolving Common Problems
Problem: The Network Name is offline and you cannot connect to SQL Server using TCP/IP
Issue 1: DNS is failing with cluster resource set to require DNS.
Resolution 1: Correct the DNS problems.
Issue 2: A duplicate name is on the network.
Resolution 2: Use NBTSTAT to find the duplicate name and then correct the issue.
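A minimal NBTSTAT check might look like the following; the -n switch lists the local NetBIOS name table, where a duplicate name shows up with a Conflict status, and the IP address used with -A is only an example:

```shell
:: Show the local NetBIOS name table; look for names in the "Conflict" state
nbtstat -n

:: Query the name table of a remote machine by IP address (example address)
nbtstat -A 192.168.1.10
```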
Issue 3: SQL Server is not connecting using Named Pipes.
Resolution 3: To connect using Named Pipes, create an alias using the SQL Server Configuration Manager to connect to the appropriate computer. For example, if you have a cluster with two nodes (Node A and Node B), and a failover cluster instance (Virtsql) with a default instance, you can connect to the server that has the Network Name resource offline using the following steps:
1. Determine on which node the group containing the instance of SQL Server is running by using the Cluster Administrator. For this example, it is Node A.
2. Start the SQL Server service on that computer using net start. For more information about using net start, see Starting SQL Server Manually.
3. Start SQL Server Configuration Manager on Node A and view the pipe name on which the server is listening. It should be similar to \\.\pipe\$$\VIRTSQL\sql\query.
4. On the client computer, start the SQL Server Configuration Manager.
5. Create an alias SQLTEST1 to connect through Named Pipes to this pipe name. To do this, enter Node A as the server name and edit the pipe name to be \\.\pipe\$$\VIRTSQL\sql\query.
6. Connect to this instance using the alias SQLTEST1 as the server name.
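For legacy MDAC-based clients, the alias created in step 5 corresponds to a value under the ConnectTo registry key; a sketch of setting it with reg.exe is below. The key path and the DBNMPNTW named-pipes library token apply to the older client stack, so verify them against your client version before relying on this:

```shell
:: Create the SQLTEST1 alias pointing at the instance's named pipe
:: (legacy MDAC client alias; SQL Native Client stores aliases elsewhere)
reg add "HKLM\SOFTWARE\Microsoft\MSSQLServer\Client\ConnectTo" /v SQLTEST1 /t REG_SZ /d "DBNMPNTW,\\.\pipe\$$\VIRTSQL\sql\query"
```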
Problem: SQL Server Setup fails on a cluster with error 11001
Issue: An orphan registry key in [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.X\Cluster]
Resolution: Make sure the MSSQL.X registry hive is not currently in use, and then delete the cluster key.
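Once the hive is confirmed unused, the orphan key can be removed with the built-in reg.exe. MSSQL.X below is the placeholder from the error message and must be replaced with the actual instance ID; exporting the key first gives you a backup:

```shell
:: Export the key first as a backup (path uses the MSSQL.X placeholder)
reg export "HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.X\Cluster" cluster-key-backup.reg

:: Delete the orphan Cluster key
reg delete "HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.X\Cluster" /f
```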
Problem: Cluster Setup Error: "The installer has insufficient privileges to access this directory: \Microsoft SQL Server. The installation cannot continue. Log on as an administrator or contact your system administrator"
Issue: This error is caused by a SCSI shared drive that is not partitioned properly.
Resolution: Re-create a single partition on the shared disk using the following steps:
1. Delete the disk resource from the cluster.
2. Delete all partitions on the disk.
3. Verify in the disk properties that the disk is a basic disk.
4. Create one partition on the shared disk, format the disk, and assign a drive letter to the disk.
5. Add the disk to the cluster using Cluster Administrator (cluadmin).
6. Run SQL Server Setup.
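Steps 2 through 4 above can also be done with the built-in diskpart tool. The disk number 1 and drive letter S are examples; identify the shared disk with `list disk` first, and note that the in-diskpart `format` command is only available on newer Windows versions (use format.com separately on older ones):

```shell
:: Run diskpart, then at its prompt enter:
::   list disk                 - identify the shared disk (disk 1 assumed here)
::   select disk 1
::   clean                     - removes all partitions (step 2)
::   convert basic             - ensure the disk is a basic disk (step 3)
::   create partition primary  - one partition spanning the disk (step 4)
::   format fs=ntfs quick      - newer Windows only; else use format.com
::   assign letter=S
diskpart
```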
Problem: Applications fail to enlist SQL Server resources in a distributed transaction
Issue: Because the Microsoft Distributed Transaction Coordinator (MS DTC) is not completely configured in Windows, applications may fail to enlist SQL Server resources in a distributed transaction. This problem can affect linked servers, distributed queries, and remote stored procedures that use distributed transactions. For more information about how to configure MS DTC, see Before Installing Failover Clustering.
Resolution: To prevent such problems, you must fully enable MS DTC services on the servers where SQL Server is installed and MS DTC is configured.
To fully enable MS DTC, use the following steps:
1. In Control Panel, open Administrative Tools, and then open Computer Management.
2. In the left pane of Computer Management, expand Services and Applications, and then click Services.
3. In the right pane of Computer Management, right-click Distributed Transaction Coordinator, and select Properties.
4. In the Distributed Transaction Coordinator window, click the General tab, and then click Stop to stop the service.
5. In the Distributed Transaction Coordinator window, click the Logon tab, and set the logon account to NT AUTHORITY\NetworkService.
6. Click Apply and OK to close the Distributed Transaction Coordinator window, then close the Computer Management window and Administrative Tools.
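The same stop and logon-account change can be sketched from a command prompt; the service short name MSDTC is standard, and note the required space after `obj=` and `password=` in the sc syntax:

```shell
:: Stop the Distributed Transaction Coordinator service (step 4)
net stop msdtc

:: Set the logon account to NT AUTHORITY\NetworkService (step 5)
:: (an empty password is passed because this is a built-in account)
sc config msdtc obj= "NT AUTHORITY\NetworkService" password= ""

:: Restart the service
net start msdtc
```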
