Channel: SQL Server Cluster Archives - SQL Authority with Pinal Dave

SQL SERVER – Clustered SQL Resource Not Coming Online

When I was on-site for a performance tuning workshop, a few DBAs suddenly got a call about unexpected SQL Server downtime. According to them, after some scheduled maintenance activities, the SQL Server resource failed to come online on both nodes. When they tried to bring it online, it remained in the online pending state for some time before eventually failing. Here are the things I tried. Let us look at an error related to a clustered SQL resource not coming online.

  • We tried starting it from services.msc, and the SQL Server service started successfully.
  • We checked the SQL Server ERRORLOG and found no errors from the attempt to bring it online in the cluster.
  • As a last resort, I generated the cluster log (see SQL SERVER – Steps to Generate Windows Cluster Log?) and found the following.
  • The cluster service was not able to connect to SQL Server.
  • The error was: Login failed for user ‘PRODUCTION\SQLNODE1$’
  • This means the account ‘PRODUCTION\SQLNODE1$’ did not have login permission on SQL Server. This is actually the machine account of the local node (SQLNODE1 was the machine name).
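
A login name that ends with $ is a Windows computer (machine) account rather than a regular user. As a minimal sketch in Python (split_account is a hypothetical helper name, used only for illustration), this is one way to take the account name from the error apart and confirm which machine it refers to:

```python
def split_account(account):
    """Split 'DOMAIN\\name' into (domain, name, is_machine_account)."""
    domain, _, name = account.rpartition("\\")
    # Machine accounts carry a trailing '$'; strip it to get the host name.
    return domain, name.rstrip("$"), name.endswith("$")


if __name__ == "__main__":
    # Account name taken from the login failure quoted above.
    print(split_account("PRODUCTION\\SQLNODE1$"))
    # A built-in account for comparison.
    print(split_account("NT AUTHORITY\\SYSTEM"))
```

The third element of the result tells you whether the failing login is a machine account, which points at the node itself rather than at a service account.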

WORKAROUND/SOLUTION

We started SQL Server from the services console and added the LocalSystem (NT AUTHORITY\SYSTEM) account to the sysadmin server role. Here is the command:

ALTER SERVER ROLE [sysadmin] ADD MEMBER [NT AUTHORITY\SYSTEM]

After this, we stopped SQL Server from the services console and tried bringing it online from the cluster administrator, and it succeeded.

As a safety measure, we tested a failover and failback between SQLNODE1 and SQLNODE2, and it worked perfectly.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

First appeared on SQL SERVER – Clustered SQL Resource Not Coming Online


SQL SERVER – sp_server_diagnostics – The User Does Not Have Permission to Perform this Action. (297)

From SQL Server 2012 onwards, the cluster health detection logic has been enhanced. Instead of the traditional pull mechanism of the cluster (IsAlive and LooksAlive), SQL Server 2012 and later use a push mechanism to detect the health of the SQL instance. This is done by a special stored procedure called sp_server_diagnostics. We should remember that the failover mechanism for an AlwaysOn FCI and for AlwaysOn Availability Groups is the same.

While troubleshooting, it is very important to know which log to look at, along with the basics of the feature. Recently, one of my clients was having trouble bringing a SQL Server AlwaysOn availability group resource online.

Whenever I am stuck with a cluster-related issue, I always look at the cluster log. You can refer to my previous blog about how to generate cluster logs: SQL SERVER – Steps to Generate Windows Cluster Log?

Now let us look at the error logs. If we inspect them closely, we find the following error.

The User Does Not Have Permission to Perform this Action. (297)

I have removed the date-time column from the output for clarity. If we look at the series of messages, we notice that the cluster has made a connection to SQL Server. After this, it executes the statement below.

exec sp_server_diagnostics 10

As we can see in the next line, this execution failed with the error below.

The user does not have permission to perform this action

Due to the above error, the diagnostic health check failed and SQL Server was not able to come online in the cluster. The same error can also appear for an AlwaysOn availability group.

Now the real question is: which user? And how do we fix this issue?

WORKAROUND/SOLUTION

The account used to connect to SQL Server from the cluster is the Local System account. My client informed me that, due to hardening, they had modified the default permissions in SQL Server.

To fix this issue, we can grant the VIEW SERVER STATE permission to the SYSTEM account.

use [master]
GO
GRANT VIEW SERVER STATE TO [NT AUTHORITY\SYSTEM]
GO

Once done, the issue was resolved and SQL Server came online in the cluster.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

First appeared on SQL SERVER – sp_server_diagnostics – The User Does Not Have Permission to Perform this Action. (297)

SQL SERVER – Installation Error – Status Code: 183 Description: Cannot Create a File When that File Already Exists

One of my clients was unable to install SQL Server 2014 on a Windows Server 2012 cluster. They were using mount points with the structure below. Let us learn about this installation error.

  • M:\Data_MP
  • M:\Log_MP
  • M:\System_MP
  • M:\Temp_MP
  • M:\Backup_MP

Here, M was a 10 GB local drive and Data_MP is a SAN disk mounted in the root of that drive. As we can see, they had configured disks from the SAN as mount points. Cluster validation succeeded, but the SQL installation failed with the error below.

Error description: The resource ‘Cluster Disk 1’ could not be moved from cluster group ‘Available Storage’ to cluster group ‘SQL Server (MSQLSERVER)’. Error: There was a failure to call cluster code from a provider. Exception message: Generic failure . Status code: 183. Description: Cannot create a file when that file already exists

When I looked at the Detail.txt file for more details, the message was the same.

WORKAROUND/SOLUTION

There are two workarounds which I found for them:

  1. Pre-create a Windows cluster group/role in the cluster and add the disks to that group before starting setup. In the setup wizard, choose this group for the SQL Server installation.
  2. Mount the mount points in a folder instead of the root of the drive, like below:
    • M:\UserDB_Data\Data_MP
    • M:\UserDB_Log\Log_MP
    • M:\SystemDB\System_MP
    • M:\TempDB\Temp_MP
    • M:\Backup\Backup_MP

Here, M was a 10 GB local drive, UserDB_Data is a folder inside M, and Data_MP is a SAN disk mounted inside that folder.
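
The only difference between the failing layout and the working layout is how deep the mount point sits under the drive. Here is a minimal sketch in Python, assuming the mount point paths are available as plain strings (is_mounted_in_root is a hypothetical helper, not part of any SQL Server or Windows tool), that flags mount points sitting directly in the root:

```python
from pathlib import PureWindowsPath


def is_mounted_in_root(mount_path):
    """Return True when the mount point sits directly under the drive root."""
    parts = PureWindowsPath(mount_path).parts
    # For a root-level mount point, parts is (drive root, mount point name).
    return len(parts) == 2


if __name__ == "__main__":
    # Paths taken from the two layouts described in this post.
    for path in [r"M:\Data_MP", r"M:\UserDB_Data\Data_MP"]:
        label = "root-level" if is_mounted_in_root(path) else "inside a folder"
        print(path, "is", label)
```

PureWindowsPath only parses the string, so the check can run on any machine without touching the actual disks.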

Hope this helps someone in the world!

Reference: Pinal Dave (http://blog.SQLAuthority.com)

First appeared on SQL SERVER – Installation Error – Status Code: 183 Description: Cannot Create a File When that File Already Exists

SQL SERVER – Cluster Patching: The RPC Server is Too Busy to Complete This Operation

If you have ever contacted me via email, you would know that I am very active in replying. Many of the emails ask for suggestions, and I don’t get much time to help everyone, but I do reply to let them know the alternatives. If you follow my blog, you would know that I provide “On Demand” services to help with critical issues. This blog is the outcome of one such short engagement. This is how it started, and it is about the RPC server.

Received an email
Need your urgent help “On Demand”!

We are trying to install SP2 on the passive node of a SQL Server 2016 cluster. This patch has already worked on one server, but now we’re getting RPC too busy errors.

We are under strict timelines to finish this activity. Can you please help us quickly?

Without spending much time, I asked them to join GoToMeeting and got started. When they shared the screen, SQL Server setup was showing the error “Failed to retrieve data for this request”.

I asked them to cancel the setup and share the setup logs to see the exact error.

Microsoft.SqlServer.Management.Sdk.Sfc.EnumeratorException: Failed to retrieve data for this request. ---> Microsoft.SqlServer.Configuration.Sco.SqlRegistryException: The RPC server is too busy to complete this operation.

WORKAROUND/SOLUTION

As we can see above, SQL setup is failing to perform an activity on the remote node. I immediately recalled a similar blog which I wrote earlier with the same remote node symptoms, but a different error message.

SQL SERVER – Microsoft.SqlServer.Management.Sdk.Sfc.EnumeratorException: Failed to Retrieve Data for This Request

We checked the “Remote Registry” service on the remote node and, sure enough, it was in the “stopped” state. As soon as we started it, we were able to move forward and finish the activity in less than the scheduled time.

If you have an issue that needs a quick resolution, you can also avail of the same kind of service.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Cluster Patching: The RPC Server is Too Busy to Complete This Operation

SQL SERVER – Unable to Get Listener Properties Using PowerShell – An Error Occurred Opening Resource

I was engaged with a client for an AlwaysOn project and they had some follow-up questions. I took some time to find the answers and encountered an interesting error. I am sharing it here so that others can benefit. They informed me that they were not able to see or modify the listener properties. Let us learn about this error related to opening a resource.

Initially, I shared a script to get the properties of the listener via T-SQL. As you can see below, we can use catalog views.

SELECT grp.name AS [AG Name],
lis.dns_name AS [Listener DNS Name],
lis.port AS [Listener Port]
FROM sys.availability_group_listeners lis
INNER JOIN sys.availability_groups grp
ON lis.group_id = grp.group_id
ORDER BY grp.name, lis.dns_name

Here is the output.

[Screenshot: query output with the AG name, listener DNS name, and listener port]

My client came back and told me that the networking team had asked them to change the RegisterAllProvidersIP setting. They were not able to use PowerShell and were getting the error “An error occurred opening resource”. They were not sure what was wrong with the listener in the cluster.

PS C:\> Get-ClusterResource AGListener | Get-ClusterParameter
Get-ClusterResource : An error occurred opening resource 'AGListener'.
At line:1 char:1
+ Get-ClusterResource AGListener | Get-ClusterParameter
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 + CategoryInfo : ObjectNotFound: (:) [Get-ClusterResource], ClusterCmdletException
 + FullyQualifiedErrorId : ClusterObjectNotFound,Microsoft.FailoverClusters.PowerShell.GetResourceCommand

If we look at the cluster, the value “AGListener” we are using seems correct, but PowerShell still thinks it is incorrect. Here is the screenshot from Failover Cluster Manager.

[Screenshot: the listener as displayed in Failover Cluster Manager]

I did some more searching and found that when we create a listener through SSMS, the cluster resource is named using the convention AGNAME_ListenerName. This is why Get-ClusterResource cannot find the resource when we pass only the listener name. Here are the properties of the listener resource (right-click the resource to open them).

[Screenshot: properties of the listener resource]

WORKAROUND/SOLUTION

Based on the above explanation, we need to use the “Name” shown in the properties. With that name, the command worked as expected.


Get-ClusterResource BO1AG_AGListener | Get-ClusterParameter
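
To make the naming convention concrete, here is a small Python sketch that builds the cluster resource name from the AG name and the listener name, and then the PowerShell command to run. It assumes the AGNAME_ListenerName convention described in this post holds in your environment; both function names are hypothetical, and names containing spaces would need quoting in PowerShell.

```python
def listener_resource_name(ag_name, listener_name):
    """Build the cluster resource name using the AGNAME_ListenerName convention."""
    return f"{ag_name}_{listener_name}"


def get_parameter_command(ag_name, listener_name):
    """Return the PowerShell one-liner from this post for the given names."""
    resource = listener_resource_name(ag_name, listener_name)
    return f"Get-ClusterResource {resource} | Get-ClusterParameter"


if __name__ == "__main__":
    # AG name and listener name as used in this post.
    print(get_parameter_command("BO1AG", "AGListener"))
```

When in doubt, the resource properties in Failover Cluster Manager remain the authoritative place to read the actual name.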

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Unable to Get Listener Properties Using PowerShell – An Error Occurred Opening Resource

SQL SERVER – How to Move SQL Server Cluster to Different Domain?

One of my blog readers sent an email asking for my quick opinion about a SQL Server cluster.

Hi,
We are having SQL Server 2014 clustered instance running on top of Windows Server 2012. As a part of the acquisition, we need to move the domain servers. I could see a wiki content available
https://social.technet.microsoft.com/wiki/contents/articles/24960.migrating-sql-server-to-new-domain.aspx

Will you be able to give quick guidance before we hire you?
Thanks,
<Name Removed>

To put it in simple words, they wanted to move a SQL cluster from SQLAuthority.com to PinalDave.com.

SOLUTION / WORKAROUND

I looked into the wiki article, but found nothing about clusters. While searching for cluster guidance, I found a knowledge base article from Microsoft, which says:

Because of an increased dependence on Active Directory Domain Services, Microsoft does not support moving an already installed and configured Windows Server 2008, Windows Server 2008 R2 and Windows Server 2012 failover cluster from one domain to another. The following steps are not for Windows Server 2008, Windows Server 2008 R2 and Windows Server 2012. You must create a new cluster. Additionally, you must re-cluster highly available applications.

So, my client needs to create a new Windows cluster in the new domain and then use traditional backup/restore to move the databases to the new servers. He also needs to script out all the system objects from the old server and recreate them on the new server.

Later, the team hired me for a short-term SQL Server performance tuning project.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – How to Move SQL Server Cluster to Different Domain?

SQL SERVER – The Cluster Resource ‘SQL Server’ Could Not be Brought Online Due to an Error Bringing the Dependency Resource

In my lab setup, I already have a 2-node Windows Server 2012 R2 cluster with one SQL Server 2012 instance working fine. However, when I tried to install an additional SQL Server 2012 instance, the installation reached the last phase and failed with errors related to a cluster resource.

Here is the text of the error message.

The following error has occurred:
The cluster resource ‘SQL Server’ could not be brought online due to an error bringing the dependency resource ‘SQL Network Name(SAPSQL) ‘ online. Refer to the Cluster Events in the Failover Cluster Manager for more information.
Click ‘Retry’ to retry the failed action, or click ‘Cancel’ to cancel this action and continue setup.

When we looked at the event log, we saw the message below (event ID 1194).

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 20/06/2017 19:55:45
Event ID: 1194
Task Category: Network Name Resource
Level: Error
Keywords:
User: SYSTEM
Computer: NODENAME1.internal.sqlauthority.lab
Description:
Cluster network name resource ‘SQL Network Name (SAPSQL)’ failed to create its associated computer object in domain ‘internal.sqlauthority.com’ during: Resource online.

WORKAROUND/SOLUTION

To solve this problem, we logged into the domain controller and created the computer account SAPSQL (known as a VCO, Virtual Computer Object). We then gave the cluster name object (CNO), WINCLUSTER$, Full Control on that computer object. If we read the error message carefully, the solution is already hinted at there. Then we clicked the Retry option in setup, and setup continued and completed successfully.

Here are the detailed steps (generally done on a domain controller by a domain admin):

  1. Start > Run > dsa.msc. This will bring up the Active Directory Users and Computers UI.
  2. Under the View menu, choose Advanced Features.
  3. If the SQL virtual server name already exists, search for it; otherwise go to the appropriate OU and create the new computer object (VCO) under it.
  4. Right-click the new object and click Properties.
  5. On the Security tab, click Add. Click Object Types and make sure that Computers is selected, then click OK.
  6. Type the name of the CNO and click OK. Select the CNO and, under Permissions, click Allow for Full Control.
  7. Disable the VCO by right-clicking it.

This is also known as pre-staging of the VCO.

Hope this helps someone save time and resolve the issue without waiting for someone else’s assistance. Do let me know if you have ever encountered the same.

I used the following two articles to find a solution for this cluster resource error.

  1. Failover Cluster Step-by-Step Guide: Configuring Accounts in Active Directory
  2. Event ID 1194 — Active Directory Permissions for Cluster Accounts

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – The Cluster Resource ‘SQL Server’ Could Not be Brought Online Due to an Error Bringing the Dependency Resource

SQL SERVER – Unable to Bring SQL Cluster Online – Reason: The Account is Disabled

This was an interesting issue reported by one of my clients. They informed me that after some hardening by the security team, they were not able to bring the SQL resource online in the cluster. In this blog, we will learn how to fix the issue where a SQL cluster does not come online.

In my client’s environment, there were two nodes (Node1 and Node2). They noticed in “Failover Cluster Manager” that the SQL resource was in the “Online Pending” state.

[Screenshot: SQL resource in the “Online Pending” state]

After 3 minutes, it went to the “Failed” state. This happens when the cluster is not able to connect to SQL Server: the resource shows as online pending even though the service is running. The pending timeout in the cluster was set to 3 minutes (the default value), which means we see “Online Pending” for 3 minutes before the resource goes to the failed state. After this, the SQL Server service is stopped automatically.

I checked the ERRORLOG and found the messages below.

2018-04-28 03:05:47.43 spid5s Recovery is complete. This is an informational message only. No user action is required.
2018-04-28 03:05:50.07 Logon Error: 18470, Severity: 14, State: 1.
2018-04-28 03:05:50.07 Logon Login failed for user ‘Domain\Node1$’. Reason: The account is disabled. [CLIENT: 128.xxx.xxx.xxx]

We also checked the cluster log and found the same error there as well.
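
When the ERRORLOG is long, it helps to pull out only the failed logins and their reasons. The following is a minimal Python sketch under a few assumptions: the log has already been read into a string, the line format matches the excerpt quoted above, and failed_logins is a hypothetical helper name. The pattern accepts both straight and curly quotes, since blog text often converts them.

```python
import re

# Pattern based on the "Login failed" line quoted in this post.
LOGIN_FAILED = re.compile(
    r"Login failed for user ['‘’](?P<user>[^'‘’]+)['‘’]\.\s*"
    r"Reason:\s*(?P<reason>[^\[]+)"
)


def failed_logins(errorlog_text):
    """Return (user, reason) pairs for every 'Login failed' line."""
    results = []
    for line in errorlog_text.splitlines():
        match = LOGIN_FAILED.search(line)
        if match:
            results.append((match.group("user"), match.group("reason").strip()))
    return results


# Sample lines copied from the ERRORLOG excerpt above.
SAMPLE = (
    "2018-04-28 03:05:50.07 Logon Error: 18470, Severity: 14, State: 1.\n"
    "2018-04-28 03:05:50.07 Logon Login failed for user 'Domain\\Node1$'. "
    "Reason: The account is disabled. [CLIENT: 128.xxx.xxx.xxx]\n"
)

if __name__ == "__main__":
    for user, reason in failed_logins(SAMPLE):
        print(user, "->", reason)
```

The same function can be pointed at cluster log text, as long as the login failure is recorded there in the same wording.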

SOLUTION/WORKAROUND

Now we know that there is no issue with SQL Server startup; it is the SYSTEM account which is not able to connect to SQL Server, so the cluster cannot show the resource as online.

To fix the issue, we followed the steps outlined below.

  1. Started SQL Server via the command line.
NET START MSSQLSERVER
  2. Connected to SQL Server via SSMS.
  3. Found that the “NT AUTHORITY\SYSTEM” login was disabled. We enabled it using the T-SQL below.
ALTER LOGIN [NT AUTHORITY\SYSTEM] ENABLE

Note: you can also do it via SSMS by going to Login > Properties > Status tab.
[Screenshot: Status tab of the login properties]

Enable the login and hit OK.

  4. Stopped SQL Server via the command line.
NET STOP MSSQLSERVER

After following the above steps, we attempted to bring SQL Server online in the cluster and it came online. We also tested failover to the other node and it worked like a charm.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Unable to Bring SQL Cluster Online – Reason: The Account is Disabled


SQL SERVER – Install Error: Microsoft Cluster Service (MSCS) Cluster Verification Errors – Part 1

In the recent past, I was assisting a client in installing a SQL Server clustered instance in a Windows cluster. We encountered many errors, and I learned a lot from this experience. In this blog, we will learn about Microsoft Cluster Service (MSCS) cluster verification errors which might appear during installation or AddNode.

We got the error below and, as we can see, it was a setup “rule” failure rather than an installation error.

[Screenshot: setup wizard showing the failed rule]

The setup wizard failed on the rule “Microsoft Cluster Service (MSCS) cluster verification errors”. Once we hit this error, we cannot proceed unless we fix it or follow the KB to skip this rule.

From the Detail.txt we found the below error:

<DateTime>Slp: Rule evaluation message: The cluster either has not been verified or there are errors or failures in the verification report. Refer to KB953748 or SQL Server Books Online for more information.

Now, everyone might think we already have KB article 953748. It has a workaround, and almost all the clients with whom I have interacted just use that method. But this client wanted to know why the error occurs.

SOLUTION/WORKAROUND

From what I have seen, this error can occur due to one of the following reasons:

  • Cluster Validation was never run on the cluster
  • Cluster Validation report contains errors
  • Unable to access Cluster Validation report due to Admin$ share access blocked
  • Cluster Validation report not present in the cluster nodes
  • Unable to find the Cluster Validation report.

It is very important to understand why these errors were generated and to try to resolve them first, before going to the workaround given in the above-mentioned article, especially when you have found errors/warnings in the cluster validation report. If you have not run the report, I highly recommend running it before any cluster-related installation.

Cluster Validation was never run on the cluster

Run Windows cluster validation using Failover Cluster Manager.

Cluster Validation report contains errors

Fix the errors found in the report before you begin installing SQL Server.

Unable to access Cluster Validation report due to Admin$ share access blocked

Unblock the Admin$ share before you begin installing SQL Server. Below is an example of what the setup log shows in this case.

Slp: Init rule target object: Microsoft.SqlServer.Configuration.Cluster.Rules.ClusterServiceFacet
Slp: Validation Report not found on Node1.DomainName.com
Slp: Validation Report not found on Node2.DomainName.com

I will write another blog in the same series, as I found multiple reasons for this behavior. Stay tuned! Please comment and let me know if you have found other causes and ways to fix it.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Install Error: Microsoft Cluster Service (MSCS) Cluster Verification Errors – Part 1

SQL SERVER – Install Error: Microsoft Cluster Service (MSCS) Cluster Verification Errors – Part 2

In the recent past, I was assisting a client in installing a SQL Server clustered instance in a Windows cluster. We encountered many errors, and I learned a lot from this experience. In this blog, we will learn about Microsoft Cluster Service (MSCS) cluster verification errors which might appear during installation or AddNode.

We got the error below and, as we can see, it was a setup “rule” failure rather than an installation error.

[Screenshot: setup wizard showing the failed rule]

The setup wizard failed on the rule “Microsoft Cluster Service (MSCS) cluster verification errors”. Once we hit this error, we cannot proceed unless we fix it or follow the KB to skip this rule.

From the Detail.txt we found the below error:

Slp: Initializing rule : Microsoft Cluster Service (MSCS) cluster verification errors
Slp: Rule applied features : ALL
Slp: Rule is will be executed : True
Slp: Init rule target object: Microsoft.SqlServer.Configuration.Cluster.Rules.ClusterServiceFacet
Slp: Validation Report not found on Node1.
Slp: Validation Report not found on Node2.DomainName.com.
Slp: Rule ‘Cluster_VerifyForErrors’ detection result: Is Cluster Online Results = True; Is Cluster Verfication complete = False; Verfication Has Warnings = False; Verification Has Errors = False; on Machine Node1
Slp: Evaluating rule : Cluster_VerifyForErrors
Slp: Rule running on machine: Node1
Slp: Rule evaluation done : Failed
Slp: Rule evaluation message: The cluster either has not been verified or there are errors or failures in the verification report. Refer to KB953748 or SQL Server Books Online for more information.
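
The detection result line packs several flags into one string, and it is easy to miss which one is False. As a rough sketch in Python (parse_detection_result is a hypothetical helper; the key names, including their spelling, are taken as-is from the log line above), the flags can be split into a dictionary:

```python
def parse_detection_result(line):
    """Parse the 'key = value' pairs after 'detection result:' into a dict."""
    _, _, tail = line.partition("detection result:")
    result = {}
    for field in tail.split(";"):
        key, sep, value = field.partition("=")
        if sep:  # skip fragments without '=', such as 'on Machine Node1'
            result[key.strip()] = value.strip() == "True"
    return result


# Line copied from the Detail.txt excerpt above (spelling as in the log).
SAMPLE_LINE = (
    "Slp: Rule ‘Cluster_VerifyForErrors’ detection result: "
    "Is Cluster Online Results = True; "
    "Is Cluster Verfication complete = False; "
    "Verfication Has Warnings = False; "
    "Verification Has Errors = False; on Machine Node1"
)

if __name__ == "__main__":
    for name, flag in parse_detection_result(SAMPLE_LINE).items():
        print(f"{name}: {flag}")
```

In this excerpt the cluster is online and no errors were recorded, but verification is not marked complete, which fits a missing validation report rather than a failed one.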

Now, everyone might think we already have KB article 953748. It has a workaround, and almost all the clients with whom I have interacted just use that method. But this client wanted to know why the error occurs. We discussed a few causes in Part 1 of this blog, which you can find here: SQL SERVER – Install Error: Microsoft Cluster Service (MSCS) Cluster Verification Errors – Part 1.

In this blog, we will discuss another scenario which produces the same error, but the solution is different.

Cluster Validation report not present in the cluster nodes

If the report file is not found in the expected location, this is clearly mentioned in the Detail.txt file, as shown below:

<DateTime>Slp: Validation Report not found on Node1.DomainName.com
<DateTime>Slp: Validation Report not found on Node2.DomainName.com

If you use one of my favorite tools, procmon, and capture a trace, you will notice entries like the ones below. The user might have deleted or moved the file. The report is usually located under C:\Windows\Cluster\Reports.

Date & Time: <DateTime>
Event Class: File System
Operation: CreateFile
Result: NAME NOT FOUND
Path: \\Node1\admin$\Cluster\Reports\Validation Data For Node Set 83ADA46DDCA6617CB8C1C06A947DF511909FFD34.xml

Date & Time: <DateTime>
Event Class: File System
Operation: CreateFile
Result: NAME NOT FOUND
Path: \\Node2.DomainName.com\admin$\Cluster\Reports\Validation Data For Node Set 83ADA46DDCA6617CB8C1C06A947DF511909FFD34.xml

SOLUTION/WORKAROUND

If you have moved this file, please move it back to its original location. Otherwise, re-run Windows cluster validation to generate a new report; setup will use this new report the next time it runs.
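
Before re-running setup, it can be worth confirming that a report file is actually present on each node. Below is a minimal Python sketch, assuming the file name pattern seen in the procmon trace above (find_validation_reports is a hypothetical helper). The demonstration uses a temporary folder as a stand-in for the Reports folder so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def find_validation_reports(reports_dir):
    """Return validation data files found in a cluster Reports folder."""
    pattern = "Validation Data For Node Set *.xml"
    return sorted(Path(reports_dir).glob(pattern))


if __name__ == "__main__":
    # Temporary folder standing in for the Reports folder on a node.
    with tempfile.TemporaryDirectory() as demo_dir:
        print("before:", len(find_validation_reports(demo_dir)), "report(s)")
        # EXAMPLE is a placeholder, not a real node set identifier.
        sample = Path(demo_dir) / "Validation Data For Node Set EXAMPLE.xml"
        sample.write_text("<report/>")
        print("after:", len(find_validation_reports(demo_dir)), "report(s)")
```

On a real node, the folder to pass in would be the Reports location mentioned above, or the admin$ path to it on the remote node.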

I will write another blog in the same series, as I found multiple reasons for this behavior. Stay tuned! Please comment and let me know if you have found other causes and ways to fix it.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Install Error: Microsoft Cluster Service (MSCS) Cluster Verification Errors – Part 2

SQL SERVER – Unable to Failover SQL Server Instance – Error: Registry Information is Corrupt or Missing

All RDBMS products are good, but I love SQL Server most because it’s fun to troubleshoot. Meaningful error messages make life easier and help us fix issues faster. Another reason I love SQL Server is that it is my source of income and supports my family. In this blog, I am going to share my experience of fixing an issue where my client was unable to fail over SQL Server to another node.

If you search the internet, you will find many probable causes of failover issues. As I mentioned earlier, the exact cause can be determined from the error message and logical reasoning. To hunt for the right error message, I started looking at various logs.

SQL Server ERRORLOG:

You can refer to the blog below, which talks about the ERRORLOG.

SQL SERVER – Where is ERRORLOG? Various Ways to Find ERRORLOG Location

<DateTime> spid20s Service Broker manager has shut down.
<DateTime> spid11s SQL Server is terminating in response to a ‘stop’ request from Service Control Manager. This is an informational message only. No user action is required.
<DateTime> spid11s SQL Trace was stopped due to server shutdown. Trace ID = ‘1’. This is an informational message only; no user action is required.

From the above, we can see that it was not a SQL Server crash or failure. SQL Server received a stop request from the Service Control Manager, due to which it went down. This means SQL Server had started successfully but, due to some issue, was asked to shut down. Now, remember this is a cluster. There are 2 stages involved in getting SQL Server successfully started in a cluster.

  • Get SQL Server service locally started.
  • Get SQL Server resource online in failover cluster manager.

If either of the 2 stages fails, SQL Server will go to a failed state or fail over to the other node. In the current situation, it looked like it failed in the 2nd stage, so it made sense to look at the cluster log to check why it failed.

Cluster.log

This file needs to be generated manually. Refer to my earlier blog for the steps: SQL SERVER – Steps to Generate Windows Cluster Log?

Let me simplify by showing only the main errors of interest. Generally, you need to focus on lines having “ERR” in them.

  1. [Microsoft][SQL Server Native Client 10.0]Registry information is corrupt or missing. Make sure the provider installed and registered correctly.
  2. [Microsoft][SQL Server Native Client 10.0]Client unable to establish a connection
  3. [Microsoft][SQL Server Native Client 10.0]A network-related or instance-specific error has occurred while establishing a connection to SQL Server. Server is not found or not accessible. Check if instance name is correct and if SQL Server is configured to allow remote connections. For more information see SQL Server Books Online.
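
Picking out the “ERR” lines by eye gets tedious in a large cluster log. Here is a minimal Python sketch of that filtering step (error_lines is a hypothetical helper, and the sample lines are abbreviated stand-ins, not real cluster log output). It matches ERR as a whole token so that words such as ERRORLOG do not produce false hits.

```python
def error_lines(cluster_log_text):
    """Return lines whose whitespace-separated tokens include 'ERR'."""
    return [
        line
        for line in cluster_log_text.splitlines()
        if "ERR" in line.split()
    ]


# Abbreviated, hypothetical lines; real entries carry more columns.
SAMPLE_LOG = (
    "INFO  [RES] SQL Server: connecting\n"
    "ERR   [RES] SQL Server: Registry information is corrupt or missing.\n"
    "WARN  [RES] SQL Server: see ERRORLOG for details\n"
    "ERR   [RES] SQL Server: Client unable to establish a connection\n"
)

if __name__ == "__main__":
    for line in error_lines(SAMPLE_LOG):
        print(line)
```

This assumes the log level appears as a separate token on each line; if your log format differs, the token check would need adjusting.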

The errors talk about registry information being missing or corrupt. Luckily, the client had another SQL Server instance in the same cluster, and they were OK with testing a quick failover. As I suspected, that SQL instance also failed to come online, and the errors were the same. So, this was an issue related ONLY to Node2. Then, when I compared the settings between both nodes, I noticed that the 64-bit client protocols were missing on Node2.

SOLUTION/WORKAROUND

Based on my search on the internet, these protocols are represented by the registry key:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSSQLServer

As expected, the above key was not present on Node2 and was present on Node1. Now I knew why we did not have any issues on Node1. Without making the solution too complicated, we exported the above-mentioned registry key from Node1 and imported it on Node2. After that, we were able to successfully fail over the SQL Server instance. I must add that registry modification needs to be done with caution because no undo is possible without a backup. That’s why I always back up the registry before taking any such action.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Unable to Failover SQL Server Instance – Error: Registry Information is Corrupt or Missing

SQL SERVER – Slow SQL Server 2016 Installation in Cluster: RunRemoteDiscoveryAction

As a part of my AlwaysOn-related consultancy, one of my clients was having challenges installing SQL Server 2016 in a clustered environment. In this blog, we will learn about the cause of a slow SQL Server 2016 installation in a cluster.

When I started SQL Server setup, it hung on this screen.

[Screenshot: setup screen where the installation hung]

In the Detail.txt, we saw the info below:

(01) 2018-03-30 09:25:34 Slp: Running Action: RunRemoteDiscoveryAction
(08) 2018-03-30 09:25:34 Slp: Discovered update on path C:\SQL2016\SQLServer2016SP1 \PCUSOURCE; Update: Microsoft SQL Server 2016 with SP1, Type: PCU, KB: 3182545, Baseline: 13.0.1601, Version: 13.1.4001
(01) 2018-03-30 09:25:34 Slp: Running discovery on remote machine: NODE2
(01) 2018-03-30 09:25:34 Slp: Running discovery on local machine
(08) 2018-03-30 09:25:34 Slp: Using service ID ‘3da21691-e39d-4da6-8a4b-b43877bcb1b7’ to search product updates.
(10) 2018-03-30 09:25:34 Slp: Searching updates on server: ‘3da21691-e39d-4da6-8a4b-b43877bcb1b7’

Based on the above snippet of the log, we can see that the action setup was running is: Running Action: RunRemoteDiscoveryAction
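
When setup appears hung, the last “Running Action” entry in Detail.txt tells you where it is stuck. As a small Python sketch (last_running_action is a hypothetical helper, and it assumes the log text has already been read into a string), that lookup could be done like this:

```python
import re

# Matches entries such as "Slp: Running Action: RunRemoteDiscoveryAction".
ACTION = re.compile(r"Slp: Running Action: (\S+)")


def last_running_action(detail_text):
    """Return the name of the last 'Running Action' entry, or None."""
    actions = ACTION.findall(detail_text)
    return actions[-1] if actions else None


# Lines copied from the Detail.txt excerpt above.
SAMPLE_DETAIL = (
    "(01) 2018-03-30 09:25:34 Slp: Running Action: RunRemoteDiscoveryAction\n"
    "(01) 2018-03-30 09:25:34 Slp: Running discovery on remote machine: NODE2\n"
    "(01) 2018-03-30 09:25:34 Slp: Running discovery on local machine\n"
)

if __name__ == "__main__":
    print(last_running_action(SAMPLE_DETAIL))
```

If the same action is still the last entry several minutes later, that is a good hint about which step setup is waiting on.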

What came to my mind was: what does setup need in order to connect to Node2? That is where it was trying to perform the remote discovery. Based on my previous experience fixing such issues, I could think of the following causes:

  • Remote registry service in a stopped state
  • Remote Registry connectivity is disabled
  • Admin$ shares are disabled.

SOLUTION/WORKAROUND

In this case, we saw that the Admin$ shares were disabled. This can be easily tested by typing the below command at a CMD prompt or in any File Explorer window.

C:\>\\NODE1\c$

As soon as we hit the enter key, we got the message.

[Screenshot: hung-2k16-install-02]

Here are the steps to get Admin$ share back (Reference)

  • Open a registry editor, start > Run > Regedit.exe.
  • Navigate to: HKLM\System\CurrentControlSet\Services\LanmanServer\Parameters
  • In the right pane, locate and double-click AutoShareServer.
  • Change the value from 0 to 1.
  • Close the registry editor and restart the “Server” service for the change to take effect.

After allowing Admin$ share access, SQL setup did not face any further challenges and completed successfully. This change needs to be made on all the nodes participating in the cluster. A reboot of the node is also required, although on the latest operating systems (like Windows Server 2016) a reboot may not be needed.

If the above steps solve your installation issue, please let me know via comments.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Slow SQL Server 2016 Installation in Cluster: RunRemoteDiscoveryAction

SQL SERVER – Install Error: Microsoft Cluster Service (MSCS) Cluster Verification Errors – Part 3

In the recent past, I was assisting a client in installing a SQL Server clustered instance in a Windows cluster. We encountered many errors and I learned a lot from the experience. In this blog, we will learn about Microsoft Cluster Service (MSCS) cluster verification errors, which might appear during installation or AddNode.

We got the below error and as we can see, it was a setup “rule” failure rather than an installation error.

[Screenshot: clus-facet-p3-01-800x581]

The setup wizard failed on the rule “Microsoft Cluster Service (MSCS) cluster verification errors”. Once we hit this error, we cannot proceed unless we fix it or follow the KB to skip this rule.

From the Detail.txt we found the below error:

<DateTime>Slp: Initializing rule : Microsoft Cluster Service (MSCS) cluster verification errors
<DateTime>Slp: Rule applied features : ALL
<DateTime>Slp: Rule is will be executed : True
<DateTime>Slp: Init rule target object: Microsoft.SqlServer.Configuration.Cluster.Rules.ClusterServiceFacet
<DateTime>Slp: Validation Report not found on Node1.
<DateTime>Slp: Validation Report not found on Node2.DomainName.com.
<DateTime>Slp: Rule ‘Cluster_VerifyForErrors’ detection result: Is Cluster Online Results = True; Is Cluster Verfication complete = False; Verfication Has Warnings = False; Verification Has Errors = False; on Machine Node1
<DateTime>Slp: Evaluating rule : Cluster_VerifyForErrors
<DateTime>Slp: Rule running on machine: Node1
<DateTime>Slp: Rule evaluation done : Failed
<DateTime>Slp: Rule evaluation message: The cluster either has not been verified or there are errors or failures in the verification report. Refer to KB953748 or SQL Server Books Online for more information.

<DateTime>Slp: Send result to channel : RulesEngineNotificationChannel
<DateTime>Slp: Initializing rule : Microsoft Cluster Service (MSCS) cluster verification warnings
<DateTime>Slp: Rule applied features : ALL
<DateTime>Slp: Rule is will be executed : True
<DateTime>Slp: Init rule target object: Microsoft.SqlServer.Configuration.Cluster.Rules.ClusterServiceFacet
<DateTime>Slp: Validation Report not found on Node1.
<DateTime>Slp: Validation Report not found on Node2.DomainName.com.
<DateTime>Slp: Rule ‘Cluster_VerifyForWarnings’ detection result: Is Cluster Online Results = True; Is Cluster Verfication complete = False; Verfication Has Warnings = False; Verification Has Errors = False; on Machine Node1
<DateTime>Slp: Evaluating rule : Cluster_VerifyForWarnings
<DateTime>Slp: Rule running on machine: Node1
<DateTime>Slp: Rule evaluation done : Warning
<DateTime>Slp: Rule evaluation message: The MSCS cluster has been validated but there are warnings in the MSCS cluster validation report, or some tests were skipped while running the validatation. To continue, run validation from the Windows Cluster Administration tool to ensure that the MSCS cluster validation has been run and that the MSCS cluster validation report does not contain errors.

We discussed a few other causes of this error in Part 1 and Part 2 of this series.

Today we will discuss another scenario which produces the same error, but the solution is different.
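The "detection result" line in the Detail.txt excerpt above already hints at the cause: verification is reported as not complete, while no errors or warnings are flagged. Here is a small Python sketch that splits that line into flags. The function name is hypothetical, and the key spellings (including "Verfication") are copied as they appear in the log.

```python
def parse_detection_result(line):
    """Turn the 'detection result:' part of a rule line into a dict of booleans."""
    marker = "detection result:"
    if marker not in line:
        return {}
    tail = line.split(marker, 1)[1]
    flags = {}
    for part in tail.split(";"):
        # The trailing "on Machine Node1" has no "=" and is skipped.
        if "=" in part:
            key, value = part.split("=", 1)
            flags[key.strip()] = value.strip() == "True"
    return flags


# Sample line copied from the Detail.txt excerpt above (timestamp omitted).
line = (
    "Slp: Rule 'Cluster_VerifyForErrors' detection result: "
    "Is Cluster Online Results = True; Is Cluster Verfication complete = False; "
    "Verfication Has Warnings = False; Verification Has Errors = False; on Machine Node1"
)
flags = parse_detection_result(line)
print(flags["Is Cluster Verfication complete"])  # False
print(flags["Verification Has Errors"])          # False
```

A cluster that is online, with verification not complete and no errors, suggests that setup never read a validation report at all.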

Unable to find the Cluster Validation report:

If the validation report is not found in the expected location, Detail.txt says so clearly, as shown below:

<DateTime>Slp: Validation Report not found on Node1.DomainName.com
<DateTime>Slp: Validation Report not found on Node2.DomainName.com

If you capture a Process Monitor (procmon) trace, one of my favorite tools, you will notice entries like the one below. Someone might have deleted or moved the file. The report is usually located under C:\Windows\Cluster\Reports.

<DateTime>
setup1xx.exe
7212 CreateFile
\\Node1.Domain.com\admin$\Cluster\Reports\Validation Data For Node Set 9431D731EF6810180D4000D550DDB48DD960F349.xml
NAME NOT FOUND
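To see quickly which nodes setup could not find a report on, the "Validation Report not found" lines can be pulled out of Detail.txt. This is only a sketch under the assumption that the lines look like the ones quoted above; the function name is mine.

```python
import re

# Matches: "Slp: Validation Report not found on Node2.DomainName.com."
MISSING_RE = re.compile(r"Validation Report not found on (\S+)")


def nodes_missing_report(lines):
    """Return node names (in first-seen order) for which setup found no validation report."""
    nodes = []
    for line in lines:
        match = MISSING_RE.search(line)
        if match:
            node = match.group(1).rstrip(".")  # drop the sentence-ending period
            if node not in nodes:
                nodes.append(node)
    return nodes


# Sample lines copied from the Detail.txt excerpt (timestamps omitted).
sample = [
    "Slp: Validation Report not found on Node1.",
    "Slp: Validation Report not found on Node2.DomainName.com.",
    "Slp: Rule running on machine: Node1",
    "Slp: Validation Report not found on Node1.",
]
print(nodes_missing_report(sample))  # ['Node1', 'Node2.DomainName.com']
```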

SOLUTION/WORKAROUND

Open the cluster validation report to check the results. If no errors are found, just rename the file to the name seen in the procmon trace (Validation Data For Node Set 9431D731EF6810180D4000D550DDB48DD960F349.xml). Otherwise, re-run Windows cluster validation to generate a new report; setup will use this new report the next time it runs.

I have written three blogs in the same series as I found multiple reasons for such behavior. Please comment and let me know if you found some more causes and ways to fix it.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Install Error: Microsoft Cluster Service (MSCS) Cluster Verification Errors – Part 3

SQL SERVER – Unable to Start SQL Resource in Cluster – HUGE Master Database!

One of my clients contacted me and informed me that SQL Server in their cluster had failed over from Node 01 to Node 02. It failed to come online on Node 02, failed back to Node 01, and failed to come online there as well. They wanted to know why they were not able to start SQL Server in the cluster. Let us learn how to start the SQL resource in a cluster when there is a huge master database.

They expected me to explain why this was happening and what they should do to bring SQL Server online.

To fix such issues, I always start with the ERRORLOG file. If you are new to SQL Server, refer to the blog below.

SQL SERVER – Where is ERRORLOG? Various Ways to Find ERRORLOG Location

As per the ERRORLOG, I could see that recovery of the master database had been running for a long time:

2018-06-22 20:00:01.390 spid6s       Recovery of database ‘master’ (1) is 3% complete (approximately 2001 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.

2018-06-22 20:00:06.260 spid6s       Recovery of database ‘master’ (1) is 3% complete (approximately 1995 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.

It was clear that the cluster service was shutting down SQL Server while master was still recovering. When we looked at the physical files on the operating system, we could see that the master MDF and LDF files were HUGE.
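The recovery messages carry their own estimate, so it is easy to turn them into minutes and compare that with how long the cluster is willing to wait. Below is a rough Python sketch that parses one such ERRORLOG line. The function name is hypothetical, and the pattern assumes the message format quoted above (it accepts either straight or curly quotes around the database name).

```python
import re

PROGRESS_RE = re.compile(
    r"Recovery of database .(?P<db>\w+). \(\d+\) is (?P<pct>\d+)% complete "
    r"\(approximately (?P<secs>\d+) seconds remain\)\. "
    r"Phase (?P<phase>\d+) of (?P<total>\d+)"
)


def parse_recovery_progress(line):
    """Extract database, percent complete, whole minutes remaining and phase from a recovery message."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    return {
        "database": match.group("db"),
        "percent": int(match.group("pct")),
        "minutes_remaining": int(match.group("secs")) // 60,
        "phase": (int(match.group("phase")), int(match.group("total"))),
    }


# Sample line copied from the ERRORLOG excerpt above.
line = (
    "2018-06-22 20:00:01.390 spid6s       Recovery of database 'master' (1) is 3% complete "
    "(approximately 2001 seconds remain). Phase 2 of 3. This is an informational message only."
)
print(parse_recovery_progress(line))
```

For the first message above, 2001 seconds is about 33 minutes, which lines up with the roughly 32 minutes that recovery eventually took.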

[Screenshot: huge-mast-01-800x177]

SOLUTION/WORKAROUND

The only option I could think of was to stop SQL Server, start it from the command prompt, and bypass the cluster. As expected, it took around 32 minutes to recover the master database (we were watching the ERRORLOG continuously).

  1. Start the SQL service from the command prompt.

NET START MSSQLSERVER

  2. Monitor the ERRORLOG file.
  3. Once recovery is complete, connect to SQL Server via SSMS.
  4. Run the queries below.
DBCC loginfo
GO
SELECT log_reuse_wait_desc
,recovery_model_desc
,name
FROM sys.databases
WHERE database_id = 1
  5. Make sure the recovery model is SIMPLE (in the second result set).
  6. Make sure there are only a few VLFs and that all of them have a status of zero.
  7. Shrink the master database. Here is another blog on the same topic.
    SQL SERVER – master Database Log File Grew Too Big
  8. Check for any abnormally large tables.
  9. Stop the SQL service from the command prompt.

NET STOP MSSQLSERVER

Now we started SQL Server in the cluster and it worked fine. SQL Server had been failing to start because the master database was stuck in recovery. When the cluster service tried to connect to perform its health check, it could not, so it assumed SQL Server was down and failed the SQL resource in the cluster. Later, we found that an application was configured to use the master database for its temporary operations. I asked them to work with the application vendor to modify the code and avoid using master.

Have you seen such a situation earlier? How did you fix it? Please share via comments with others.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Unable to Start SQL Resource in Cluster – HUGE Master Database!

SQL SERVER – Event ID: 1135 – Cluster node ‘NodeName’ was Removed From the Active Failover Cluster Membership

When I work with customers, there are situations when I get a chance to learn something from them. I was working on an AlwaysOn availability group engagement and got some interesting information from a customer, which I am sharing here. In this blog, we will learn how to solve event ID 1135: Cluster node ‘NodeName’ was removed from the active failover cluster membership.

[Screenshot: clus-mem-err-01-800x555]

Here are two “Critical” errors which you might see in System Event logs:

Event ID: 1135   

Message: Cluster node ‘N2’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Event ID: 1177

Message: The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.  Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Based on my knowledge of clustering, event ID 1135 indicates that heartbeat communication failed between some nodes, most often because the network connection between the cluster nodes failed. Event ID 1177 indicates that the Cluster service shut down because quorum was lost, which can be caused by a loss of network connectivity between some or all nodes in the cluster or by a failover of the witness disk.

SOLUTION/WORKAROUND

Of course, your networking team needs to be engaged first to understand the root cause of the network issue. If it happens randomly and the network team has no clue about it, here are a few things a DBA can also do. The PowerShell below relaxes the cluster heartbeat delay and threshold settings.

$cluster = Get-Cluster
$cluster.SameSubnetDelay=2000
$cluster.SameSubnetThreshold=10
$cluster.CrossSubnetThreshold=10
$cluster.CrossSubnetDelay=4000
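To put these numbers in perspective: as I understand these properties, the delay is the interval between heartbeats in milliseconds, and the threshold is the number of consecutive heartbeats that can be missed before a node is considered down. Under that assumption, the tolerated silence is simply delay times threshold. Here is a small worked example in Python (the function name is my own):

```python
def heartbeat_tolerance_seconds(delay_ms, threshold):
    """Seconds without heartbeats before a node is treated as down (delay x threshold)."""
    return delay_ms * threshold / 1000


# Values from the PowerShell settings above.
same_subnet = heartbeat_tolerance_seconds(2000, 10)
cross_subnet = heartbeat_tolerance_seconds(4000, 10)
print(same_subnet)   # 20.0
print(cross_subnet)  # 40.0
```

So with these settings, a node on the same subnet has roughly 20 seconds of missed heartbeats before it is removed from membership, and a node across subnets has roughly 40 seconds.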

Along with the cluster settings, one of my clients also told me to disable TCP offloading and a few more properties. According to him, they might cause network delays and intermittent failures. You can run the following commands in CMD (run as administrator) on all nodes.

Netsh int tcp set global chimney=disabled
Netsh int tcp set global rss=disabled
Netsh int tcp set global netdma=disabled
Netsh int tcp set global autotuninglevel=disabled
netsh interface teredo set state disabled
netsh int ipv4 set global taskoffload=disabled

Also, update the NIC drivers, firmware, and teaming software (if any) on all cluster nodes.

The above steps solved the issue for them on several servers, and they gave me permission to blog about it. If the above steps solve the issue for you, please comment and let them know.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Event ID: 1135 – Cluster node ‘NodeName’ was Removed From the Active Failover Cluster Membership


SQL SERVER – Always On AG – HADRAG: Did not Find the Instance to Connect in SqlInstToNodeMap Key

During my On Demand (50 Minutes) consultancy, I solve issues that my clients need resolved quickly. SQL not starting, AlwaysOn not failing over, and a cluster not working are a few of the things for which clients engage me. In this blog, I will share a situation where an Always On Availability Group was not coming online due to the error: Did not find the instance to connect in SqlInstToNodeMap key.

THE SITUATION

There was some instability in the cluster which caused a few unexpected failovers of the Always On availability group between node1 and node2, sometimes back and forth. When they contacted me, we found that the clustered resource for the availability group was not coming online.

My first step, always, is to get the error reported by SQL Server, the cluster, or Windows. The event log reported the error below:

Cluster resource ‘PRODAG’ of type ‘SQL Server Availability Group’ in clustered role ‘PRODAG’ failed.

Based on the failed policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

The above error is very generic and does not tell us more than what we already know.

When I checked SQL Server Management Studio, we saw that the secondary replica was not connected to the primary replica. The connected state was “DISCONNECTED” in the DMV and SSMS showed a “red” symbol for this replica. The next step was to generate a cluster log.

SQL SERVER – Steps to Generate Windows Cluster Log?

And BINGO! We were able to see some relevant messages there.

INFO  [RES] SQL Server Availability Group <PRODAG>: [hadrag] The DeadLockTimeout property has a value of 300000
INFO  [RES] SQL Server Availability Group <PRODAG>: [hadrag] The PendingTimeout property has a value of 180000
ERR   [RES] SQL Server Availability Group <PRODAG>: [hadrag] Did not find the instance to connect in SqlInstToNodeMap key.
ERR   [RHS] Online for resource PRODAG failed.

“ERR” is the tag I look for in the cluster log, and it is the one you should focus on. Just before the failure, we see this error: Did not find the instance to connect in SqlInstToNodeMap key. I searched and found that SqlInstToNodeMap is a registry key which should have the same information as sys.dm_hadr_instance_node_map.
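Cluster logs are large, so filtering on the “ERR” tag is usually the first thing to do. A minimal Python sketch of that filter follows. The function name is hypothetical, and it assumes the level tag is followed by a bracketed component such as [RES] or [RHS], as in the lines quoted above.

```python
import re

# The level tag "ERR" followed by a bracketed component, e.g. "ERR   [RES]".
ERR_RE = re.compile(r"\bERR\s+\[")


def error_lines(cluster_log_lines):
    """Keep only the cluster log lines whose level tag is ERR."""
    return [line for line in cluster_log_lines if ERR_RE.search(line)]


# Sample lines copied from the cluster log excerpt above.
sample = [
    "INFO  [RES] SQL Server Availability Group <PRODAG>: [hadrag] The PendingTimeout property has a value of 180000",
    "ERR   [RES] SQL Server Availability Group <PRODAG>: [hadrag] Did not find the instance to connect in SqlInstToNodeMap key.",
    "ERR   [RHS] Online for resource PRODAG failed.",
]
for line in error_lines(sample):
    print(line)
```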

When I checked the primary replica, we were not able to see the AG under the “availability group” node in SSMS. Also, there were no replicas listed under the “availability replica” node. When we tried querying sys.dm_hadr_database_replica_states, we did not get any results.

WORKAROUND/SOLUTION

All the above symptoms mean that there is a metadata mismatch between the information in the cluster and the information in SQL Server. Even the two replicas had mismatched information about the availability group. We ran the command below on the secondary to remove the information about the AG. We could not use the UI because it was giving an error.

DROP AVAILABILITY GROUP PRODAG

As soon as we executed it, the databases went into the restoring state and the AG information was cleared from all DMVs and from the cluster as well. Then we recreated the availability group using the AG wizard, and we were back in business in less than 20 minutes on the call with me.

I truly hope that this blog can help someone who is getting the same issue with AG.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Always On AG – HADRAG: Did not Find the Instance to Connect in SqlInstToNodeMap Key

SQL SERVER – Steps to Change IP Address of SQL Server Failover Cluster Instance

There have been many questions in my mailbox asking about the exact steps to move the Windows cluster and SQL failover cluster IP addresses to a new VLAN. In this blog, I will outline the steps needed to perform this migration.

If we look at the IP addresses in a SQL cluster running on a Windows cluster, we can see the IPs below:

  • Cluster nodes IP addresses.
  • SQL IP address in Cluster (also called the SQL Virtual IP)
  • Cluster Access Point IP address (also called Cluster IP)

SOLUTION/WORKAROUND

  1. To change the cluster nodes’ IP addresses:
    • Present new adapters with the new IP addresses.
    • Confirm that the new network appears under “Cluster network”. Make sure there are two connections from all cluster nodes and that they are online.
    • Make sure that this network is marked for both cluster use and heartbeat.
  2. To change the SQL IP address:
    • Select the IP address resource to change, right-click it and go to Properties, change the subnet to the new subnet, and then enter the new IP address.
    • Change the dependency, if needed (that is, if you created a new IP resource rather than modifying the existing one).
    • Make sure that the new IP address resource comes online.
    • Now you can safely remove the old IP address, if required.
  3. To change the Cluster Access Point IP address:
    • From the properties of the cluster name, click ADD to add a new IP address in the new subnet.
    • Make sure that the IP address is online.
    • If everything is fine, you can safely remove the old IP.
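While planning such a move, it is worth double-checking on paper that every new address really belongs to the new subnet before touching the cluster. Here is a small sketch using Python's standard ipaddress module. The addresses and the subnet are made-up examples, not values from my client's environment.

```python
import ipaddress


def ip_in_subnet(ip, subnet):
    """Return True if the IP address belongs to the given subnet (CIDR notation)."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(subnet)


# Hypothetical plan: new VLAN subnet and the addresses chosen for each role.
new_subnet = "10.20.30.0/24"
planned = {
    "SQL Virtual IP": "10.20.30.15",
    "Cluster IP": "10.20.30.16",
    "Node1": "10.10.10.21",  # deliberately still on the old subnet
}
for role, ip in planned.items():
    print(role, ip, ip_in_subnet(ip, new_subnet))
```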

For my client, this whole process took around 15 minutes. I hope this helps you in planning such an activity.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Steps to Change IP Address of SQL Server Failover Cluster Instance

SQL SERVER – Error: Parameter ‘ProbePort’ does not exist on the cluster object. Unable to set Probe Port for Azure Load Balancer

Azure is gaining popularity and I am getting clients who want to create an Always On availability group as their high availability solution in Azure virtual machines. To keep myself up to date, I also try recreating the customer’s scenario in my lab. In this blog, we will see how to fix the error Parameter ‘ProbePort’ does not exist on the cluster object while configuring the probe port.

I was following Microsoft’s article about Configure a load balancer for an Always On availability group in Azure. Everything went flawlessly, without any error, until I ran the script below.

$ClusterNetworkName = "WinCluster"
$IPResourceName = "MyListenerIP"
$ListenerILBIP = "10.0.0.22"
[int]$ListenerProbePort = 59999
Import-Module FailoverClusters
Get-ClusterResource $IPResourceName | Set-ClusterParameter -Multiple @{"Address"="$ListenerILBIP";"ProbePort"=$ListenerProbePort;"SubnetMask"="255.255.255.255";"Network"="$ClusterNetworkName";"EnableDhcp"=0}

It failed with the long error below.

Set-ClusterParameter : Parameter ‘ProbePort’ does not exist on the cluster object ‘WinCluster’. If you are trying to update an existing parameter, please make sure the parameter name is specified correctly. You can check for the current parameters by passing the .NET object received from the appropriate Get-Cluster* cmdlet to “| Get-ClusterParameter”. If you are trying to update a common property on the cluster object, you should set the property directly on the .NET object received by the appropriate Get-Cluster* cmdlet. You can check for the current common properties by passing the .NET object received from the appropriate Get-Cluster* cmdlet to “| fl *”. If you are trying to create a new unknown parameter, please use -Create with this Set-ClusterParameter cmdlet.
At line:5 char:39
+ … ourceName | Set-ClusterParameter -Multiple @{“Address”=”$ILBIP”;”Prob …
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (:) [Set-ClusterParameter], ClusterCmdletException
+ FullyQualifiedErrorId : InvalidOperation,Microsoft.FailoverClusters.PowerShell.SetClusterParameterCommand

The message says that ProbePort is not the right parameter. I initially thought that my PowerShell might be old and did not understand the parameter. This was not the cause.

WORKAROUND/SOLUTION

Actually, I was using the wrong parameter values, which caused the error. If you look closely at my command, I was using the Windows cluster’s network name as the value of the first variable, “$ClusterNetworkName”. This should ideally be “Cluster Network 1” (or a value shown by Get-ClusterNetwork).

The second mistake was the “$IPResourceName” value. This should be the name of the IP Address resource, not the value shown by the cluster manager UI. We need to right-click the IP resource, go to Properties, and pick the Name from there.

Once I fixed both the parameters, I was able to run the script and configure ILB correctly.
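Both mistakes come down to using a name that the cluster does not know in that role. A simple pre-flight check can catch this before running the script. The Python sketch below is purely illustrative: the function name is mine, and the lists are hypothetical values typed in by hand from the Get-ClusterNetwork output and from the Name field in the IP resource's properties.

```python
def check_listener_inputs(network_name, ip_resource_name, cluster_networks, ip_address_resources):
    """Return a list of problems with the two names used by the probe port script."""
    problems = []
    if network_name not in cluster_networks:
        problems.append(f"'{network_name}' is not a cluster network name")
    if ip_resource_name not in ip_address_resources:
        problems.append(f"'{ip_resource_name}' is not an IP Address resource name")
    return problems


# Hypothetical lab values, not taken from a real cluster.
networks = ["Cluster Network 1"]
ip_resources = ["AGListener_10.0.0.22"]

# The values I used first: both are rejected.
print(check_listener_inputs("WinCluster", "MyListenerIP", networks, ip_resources))
# Corrected values: no problems reported.
print(check_listener_inputs("Cluster Network 1", "AGListener_10.0.0.22", networks, ip_resources))
```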

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Error: Parameter ‘ProbePort’ does not exist on the cluster object. Unable to set Probe Port for Azure Load Balancer

SQL SERVER – Always On Availability Group Listener Missing in SSMS but Working Fine in Failover Cluster Manager

I have helped many customers solve complex issues in their environment through the Comprehensive Database Performance Health Check. Sometimes an issue looks very complex, but once the solution is found it seems very easy. In this blog, we will learn about a situation where the listener is missing in SSMS but working fine in Failover Cluster Manager (cluadmin.msc).

While doing checks of their database, they showed me an interesting situation. Here it goes.

THE SITUATION

My client had a two-node Always On Availability Group on SQL Server 2017 and Windows Server 2016. They noticed the following:

  1. In Failover Cluster Manager, we were able to see the Network Name resource for the listener.
  2. In SQL Server Management Studio (SSMS), we were not able to see anything under “Availability Group Listener”. It was empty!
  3. The queries below also did not show any listener in SQL Server (0 rows affected).
SELECT *
FROM sys.availability_group_listeners
GO
SELECT *
FROM sys.availability_group_listener_ip_addresses
GO

[Screenshot: list-miss-02]

  4. We were able to connect to the listener and it was working fine, even after a failover.

SOLUTION/WORKAROUND

When I asked them about the history of the listener creation, they informed me that it was created by the Windows admin team. The SQL DBA team could not create the listener due to an issue which I have written about in my previous blog:

SQL SERVER – AlwaysOn Listener Error – The WSFC Cluster Could Not Bring the Network Name Resource With DNS Name ‘DNS name’ Online

When I checked the availability group resource in cluster manager, I found that it did not have any dependency on the listener. As soon as the dependency was added (no downtime needed), we were able to see the listener in #2 and #3 above.

[Screenshot: list-miss-03]

In short, if you are creating the listener via cluster manager, a dependency must be added to the AG resource in Windows Failover Cluster Manager to make the AG dependent upon the listener. If you create it via SQL Server (using SSMS, T-SQL or PowerShell) you should not face this issue.

Have you seen such a situation in your production? Check it now and fix it!

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Always On Availability Group Listener Missing in SSMS but Working Fine in Failover Cluster Manager

SQL SERVER – SQL Clustered Resource in Online Pending State for Long Time Before Coming Online

While doing a Comprehensive Database Performance Health Check, I always ask my client if there is any pain point with the current state of the database/server. Once I got an interesting question which I am going to answer in this blog post: why is my SQL clustered resource in the Online Pending state for a long time before coming online?

Before I show you how I found the cause, note that I have written earlier blogs about a different situation, where SQL was not coming online at all.

When SQL is in the Online Pending state, the SQL service is not fully ready for connections or the cluster is unable to make a connection. To dig deeper, I generated the cluster log: SQL SERVER – Steps to Generate Windows Cluster Log?

Here are a few important events:

  1. Here is the offline event of Node1

Log Name: Microsoft-Windows-FailoverClustering/Operational
Source: Microsoft-Windows-FailoverClustering
Date: 1/28/2018 1:49:29 PM
Event ID: 1204
Task Category: Resource Control Manager
Level: Information
User: SYSTEM
Computer: NODE1.domain.com
Description: The Cluster service successfully brought the clustered service or application ‘SQL Server (MSSQLSERVER)’ offline.

  2. Here is the online event on Node2

Log Name: Microsoft-Windows-FailoverClustering/Operational
Source: Microsoft-Windows-FailoverClustering
Date: 1/28/2018 1:57:11 PM
Event ID: 1201
Task Category: Resource Control Manager
Level: Information
User: SYSTEM
Computer: NODE2.domain.com
Description: The Cluster service successfully brought the clustered service or application ‘SQL Server (MSSQLSERVER)’ online.

If you observe closely, there is a gap of roughly 8 minutes between the two events above.
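The exact gap is easy to work out from the two Date fields. A quick Python calculation, assuming the month/day/year 12-hour timestamp format shown in the events (the function name is my own):

```python
from datetime import datetime

# Format of the "Date:" field in the events above, e.g. "1/28/2018 1:49:29 PM".
EVENT_TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def gap_between(offline_time, online_time):
    """Return the time between the offline event and the online event as a timedelta."""
    start = datetime.strptime(offline_time, EVENT_TIME_FORMAT)
    end = datetime.strptime(online_time, EVENT_TIME_FORMAT)
    return end - start


gap = gap_between("1/28/2018 1:49:29 PM", "1/28/2018 1:57:11 PM")
print(gap)  # 0:07:42
```

That is 7 minutes and 42 seconds during which the SQL Server resource was not online on either node.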

  3. In the cluster log, we found the messages below.
  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] Service status checkpoint was changed from 0 to 1 (wait hint 20000). Pid is 2431
  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] Service status checkpoint was changed from 1 to 2 (wait hint 20000). Pid is 2431

The number kept on increasing continuously: 2 to 3, 3 to 4, and so on. Finally, after around 40 attempts, it came online. There was a gap of 2 seconds between consecutive lines.

  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] Service is started. SQL Server pid is 2431
  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] Connect to SQL Server …
  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] The connection was established successfully
  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] Diagnostics is started
  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] Online worker helper is started
  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] SQL Server component ‘system’ health state has been changed from ‘’ to ‘clean’
  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] SQL Server resource state is changed from ‘ClusterResourceOnlinePending’ to ‘ClusterResourceOnline’
  • Resource SQL Server (MSSQLSERVER) has come online. RHS is about to report status change to RCM
  • HandleMonitorReply: ONLINERESOURCE for ‘SQL Server (MSSQLSERVER)’, gen(0) result 0.
  • TransitionToState(SQL Server (MSSQLSERVER)) OnlinePending–>Online.

WORKAROUND/SOLUTION

When we looked at the ERRORLOG, I found recovery messages spanning 8 minutes. We also found that there was a huge number of VLFs, which seemed to be the root cause of the issue.

After reducing the VLF count by taking log backups and shrinking the log file, the issue was resolved and SQL failover became very quick.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – SQL Clustered Resource in Online Pending State for Long Time Before Coming Online
