Channel: SQL Server Cluster Archives - SQL Authority with Pinal Dave

SQL SERVER – Unable to Bring SQL Cluster Resource Online – Online Pending and then Failed

Here is a situation that a client described when asking me for help with a SQL cluster resource.

Hi Pinal,
We have a 2-node Windows cluster with 3 clustered SQL Server instances, running on Windows 2012 R2 on VMware. One instance will start from services.msc but not from Failover Cluster Manager when we attempt to bring the service online. The service does actually start: while the resource is in the 'Online pending' state, I am able to connect and query the databases on that instance.

Do you know what could be the problem?

The post SQL SERVER – Unable to Bring SQL Cluster Resource Online – Online Pending and then Failed appeared first on Journey to SQL Authority with Pinal Dave.


SQL SERVER – Attaching and Restoring Database in Clustering Generates An Error – Notes from the Field #115

[Notes from Pinal]: In my career, I have seen many database experts who are great at what they do, but who avoid working with clustering or AlwaysOn solutions whenever they can. The reason is that there are not many experts who know this subject well enough. I have also always felt that documentation about clustering is not widely available, so anyone who receives an error is usually lost. This is why I reached out to Eduardo and asked him what we can do if we face an error while attaching or restoring a database in a clustered SQL Server environment.

The post SQL SERVER – Attaching and Restoring Database in Clustering Generates An Error – Notes from the Field #115 appeared first on Journey to SQL Authority with Pinal Dave.

SQL SERVER – Network Name resource fails to come online in a Windows Server 2008 R2 Failover Cluster

Even if you are a DBA, sometimes you need to deal with issues which are not related to SQL Server. It is not by design, but it is part of our job description. It is always interesting to troubleshoot such issues and find a solution. I believe that sharing such strange troubleshooting stories helps those who run into the same problem without understanding what is happening behind the scenes. Let us learn about one such issue in a Windows Server 2008 R2 Failover Cluster.

The post SQL SERVER – Network Name resource fails to come online in a Windows Server 2008 R2 Failover Cluster appeared first on Journey to SQL Authority with Pinal Dave.

SQL SERVER – Q&A: SQL Clustering Virtual Server Name and Instance Name

Of late, some of the troubleshooting scenarios I have been involved in are amazing. Though some of them are complex, they give me a unique opportunity to learn and try something new to share later with you folks. Recently I was contacted by a team that takes care of SQL installations. They were very new to SQL Server cluster installation and had some basic questions. Let us learn more about SQL Clustering.

The post SQL SERVER – Q&A: SQL Clustering Virtual Server Name and Instance Name appeared first on Journey to SQL Authority with Pinal Dave.

SQL SERVER – Understanding FAILOVERCLUSTERROLLOWNERSHIP with SQL Server Cluster Rolling Upgrade

I rarely get questions about clustering, and I tend to keep away from them because some of these can lead to deep-level work with the cluster. Still, some conversations lead to a great learning experience. I started to hunt through the blog and saw a number of posts around the cluster (FAILOVERCLUSTERROLLOWNERSHIP) that caught my eye:

The post SQL SERVER – Understanding FAILOVERCLUSTERROLLOWNERSHIP with SQL Server Cluster Rolling Upgrade appeared first on Journey to SQL Authority with Pinal Dave.

SQL SERVER – Installation Error – The wrong diskette is in the drive. Insert (Volume Serial Number: ) into drive.

The world of working with errors always gets the better of me. It is a wonderful way to understand why SQL Server behaves in a certain way and, most importantly, it helps me solve some of the problems people face on a day-to-day basis. As you might know, along with performance consultancy, I also reply to personal emails asking for help. This blog post is the result of one such interaction, where the client was facing an error while installing Service Pack 4 for SQL Server 2008 in a clustered environment. Let us learn about this installation error.

[Image: insert-disk-01]

I always ask for the setup logs and start from there. One of my older blog posts talks about the location of the setup log files.

SQL SERVER – Installation Log Summary File Location – 2012 – 2008 R2

I looked into %programfiles%\Microsoft SQL Server\110\Setup Bootstrap\Log\ and found the folder corresponding to the date and time of the installation. I opened the Summary.txt file and found the errors.

Overall summary:
Final result: The patch installer has failed to update the following instance: MSSQLSERVER. To determine the reason for failure, review the log files.
Exit code (Decimal): -568706566
Exit facility code: 1562
Exit error code: 14842
Exit message: The patch installer has failed to update the following instance: MSSQLSERVER. To determine the reason for failure, review the log files.
Start time: 2016-05-17 03:18:32
End time: 2016-05-17 03:23:33
Requested action: Patch

Instance MSSQLSERVER overall summary:
Final result: The patch installer has failed to update the shared features. To determine the reason for failure, review the log files.
Exit code (Decimal): -568706566
Exit facility code: 1562
Exit error code: 14842
Exit message: The wrong diskette is in the drive. Insert (Volume Serial Number: ) into drive . (Exception from HRESULT: 0x80070022)
Start time: 2016-05-17 03:19:32
End time: 2016-05-17 03:22:33
Requested action: Patch
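
As a side note, the exit values in the summary above are one number seen in different ways. The sketch below is only an illustration of that relationship: the helper name `split_hresult` is hypothetical (it is not part of SQL Server setup), and it assumes the standard HRESULT layout of a failure bit, an 11-bit facility and a 16-bit code.

```python
def split_hresult(value: int) -> dict:
    """Split a 32-bit HRESULT-style value into its parts.

    Assumes the standard layout: bit 31 = failure flag,
    bits 16-26 = facility, bits 0-15 = code.
    """
    unsigned = value & 0xFFFFFFFF  # a signed decimal becomes unsigned 32-bit
    return {
        "hex": f"0x{unsigned:08X}",
        "failure": bool(unsigned >> 31),
        "facility": (unsigned >> 16) & 0x7FF,
        "code": unsigned & 0xFFFF,
    }


# "Exit code (Decimal)" from Summary.txt
setup_exit = split_hresult(-568706566)
print(setup_exit)   # facility 1562 and code 14842, as listed in the summary

# HRESULT from the exit message
wrong_disk = split_hresult(0x80070022)
print(wrong_disk)   # facility 7 (Win32), code 34
```

Facility 7 means the low 16 bits are a plain Win32 error code, and Win32 error 34 is ERROR_WRONG_DISK, whose text is the "wrong diskette" message shown in the summary.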

The error message suggested reviewing the log files, so I looked further and opened Detail.txt for this instance.

Exception summary:
The following is an exception stack listing the exceptions in outermost to innermost order
Inner exceptions are being indented

Exception type: System.Runtime.InteropServices.COMException
Message:
The wrong diskette is in the drive. Insert (Volume Serial Number: ) into drive . (Exception from HRESULT: 0x80070022)
Data:
DisableWatson = true
Stack:
at Microsoft.SqlServer.Interop.MSClusterLib.ISClusResource.get_Disk()
at Microsoft.SqlServer.Configuration.Cluster.ClusterPhysicalDisk.get_Partitions()
at Microsoft.SqlServer.Configuration.ClusterConfiguration.ClusterDiskPublicConfigObject.IsPathOnSharedDisk(String path)
at Microsoft.SqlServer.Configuration.SetupExtension.SlpInputSettings.ValidateNotOnSharedDisk(ValidationState vs, String directoryName, String bindingKey, String errorMessage)
at Microsoft.SqlServer.Configuration.SetupExtension.SlpInputSettings.Validate_InstallSharedDir(ValidationState vs)
at Microsoft.SqlServer.Configuration.SetupExtension.SlpInputSettings.ValidateSettings()
at Microsoft.SqlServer.Configuration.SetupExtension.ValidateFeatureSettingsAction.ExecuteAction(String actionId)
at Microsoft.SqlServer.Chainer.Infrastructure.Action.Execute(String actionId, TextWriter errorStream)
at Microsoft.SqlServer.Setup.Chainer.Workflow.ActionInvocation.InvokeAction(WorkflowObject metabase, TextWriter statusStream)
at Microsoft.SqlServer.Setup.Chainer.Workflow.PendingActions.InvokeActions(WorkflowObject metaDb, TextWriter loggingStream

Here are the key frames in the above stack, which any SQL DBA can understand.

Interop.MSClusterLib.ISClusResource.get_Disk()
ClusterPhysicalDisk.get_Partitions()
ClusterDiskPublicConfigObject.IsPathOnSharedDisk(String path)
ValidateNotOnSharedDisk
Validate_InstallSharedDir

From the stack it is clear that setup hit the error while querying a clustered disk. So, I asked them to contact their hardware team, who fixed the disk issue. Once it was fixed, the Service Pack installation went fine.

I believe it is not always possible to provide a complete solution, but a guideline is often sufficient. I hope this helps you; let me know via comments if you have encountered this installation error.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

First appeared on SQL SERVER – Installation Error – The wrong diskette is in the drive. Insert (Volume Serial Number: ) into drive.

SQL SERVER – GetRegKeyAccessMask : Could Not Get Registry Access Mask For Registry Key – SQL Server Cluster

I have been getting many requests from my HIRE-ME page, and a few of them get a chance to appear on my blog. This post is the outcome of working with one of my clients who was having a strange issue. They had a 2-node SQL Server cluster (NODE1 and NODE2), and the SQL Server resource was not able to come online on NODE1, but it worked fine when they failed over to NODE2.

[Image: fci-no-online-01]

I asked them to look at the ERRORLOG from an attempt where they fail over to NODE1 and the resource doesn’t come online. SQL SERVER – Where is ERRORLOG? Various Ways to Find ERRORLOG Location

I was surprised when they told me that the file was not being generated when they attempted to bring SQL online on NODE1. I thought it could be a permission issue, but there was no permission error in the Application and System event logs. Finally, I asked them to generate the cluster log: SQL SERVER – Steps to Generate Windows Cluster Log?

Here is the relevant information I found in Cluster log

Add-ClusterCheckpoint -ResourceName "SQL Server (SQL_INST1)" -RegistryCheckpoint "Software\Microsoft\Microsoft SQL Server\MSSQL12.SQL_INST1\Replication"
Add-ClusterCheckpoint -ResourceName "SQL Server (SQL_INST1)" -RegistryCheckpoint "SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL12.SQL_INST1\MSSQLServer"
Add-ClusterCheckpoint -ResourceName "SQL Server (SQL_INST1)" -RegistryCheckpoint "SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL12.SQL_INST1\Cluster"
Add-ClusterCheckpoint -ResourceName "SQL Server (SQL_INST1)" -RegistryCheckpoint "SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL12.SQL_INST1\SQLServerAgent"
Add-ClusterCheckpoint -ResourceName "SQL Server (SQL_INST1)" -RegistryCheckpoint "SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL12.SQL_INST1\Providers"
Add-ClusterCheckpoint -ResourceName "SQL Server (SQL_INST1)" -RegistryCheckpoint "SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL12.SQL_INST1\CPE"
Add-ClusterCheckpoint -ResourceName "SQL Server (SQL_INST1)" -RegistryCheckpoint "SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL12.SQL_INST1\SQLServerSCP"

Error 2 = The system cannot find the file specified. So now it is clear that we have an issue with a registry key missing on this node. When I compared the keys, I found that a startup parameter had been missing earlier, and they had added it manually. Now the next key was missing. In general, the Windows cluster registry checkpoint takes care of syncing the values of the SQL Server registry keys across nodes. Here is an article by Balmukund about registry checkpoints: Information: Checkpoint in SQL Server Cluster Resources

I was able to re-add the checkpoints using the Add-ClusterCheckpoint PowerShell commands listed above.

As soon as the checkpoints were enabled, the registry came in sync on both nodes and the issue was resolved. Have you ever come across such issues?

Reference: Pinal Dave (http://blog.sqlauthority.com)

First appeared on SQL SERVER – GetRegKeyAccessMask : Could Not Get Registry Access Mask For Registry Key – SQL Server Cluster

SQL SERVER – Rule Windows Server 2003 FILESTREAM Hotfix Check failed on Windows 2012 R2 Cluster

For my upcoming training, I was trying to deploy a SQL Server 2008 R2 cluster on my client’s lab machines. I encountered a strange error and I was clueless. Let us learn about the rule “Windows Server 2003 FILESTREAM Hotfix Check” failing on a Windows 2012 R2 cluster.

[Image: setup-wrong-error]

I tried the following:

  1. Ran cluster validation; it was all green.
  2. I even downloaded the patch mentioned but, as expected, it was only for the 2003 OS.

I looked into the setup logs and found the following:

2013-02-04 10:51:05 Slp: Initializing rule : Windows Server 2003 FILESTREAM Hotfix Check
2016-09-06 10:51:05 SQLEngine: –FilestreamRequiredClusterPatchFacet: Engine_FilestreamRequiredHotfixesCheck: Version: 5.2.3790.4083
2016-09-06 10:51:05 SQLEngine: –FilestreamRequiredClusterPatchFacet: Engine_FilestreamRequiredHotfixesCheck: C:\Windows\system32\Drivers\Clusdisk.sys version : 6.2.9200.16384
2016-09-06 10:51:05 SQLEngine: –FilestreamRequiredClusterPatchFacet: Engine_FilestreamRequiredHotfixesCheck: C:\Windows\Cluster\Clusres.dll version : 6.2.9200.16384
2016-09-06 10:51:05 Slp: C:\Windows\system32\W03a2409.dll
2016-09-06 10:51:05 Slp: Rule initialization failed – hence the rule result is assigned as Failed
2016-09-06 10:51:05 Slp: Send result to channel : RulesEngineNotificationChannel

Later I came across the article “Using SQL Server in Windows 8 and later versions of Windows operating system”. It pointed me in the right direction: it mentions that we should slipstream the media.

SOLUTION/WORKAROUND

I slipstreamed the SQL Server 2008 R2 media using the KB article “How to update or slipstream an installation of SQL Server 2008”. Once I had slipstreamed the media, I was able to install the 2-node SQL Server failover cluster.

Have you seen any such incorrect messages with SQL Server?

Reference: Pinal Dave (http://blog.sqlauthority.com)

First appeared on SQL SERVER – Rule Windows Server 2003 FILESTREAM Hotfix Check failed on Windows 2012 R2 Cluster


SQL SERVER – Fix: The Cluster Resource Could not be Deleted Since it is a Core Resource

Working with SQL Server is fun. Since both SQL clustering and AlwaysOn availability groups need Windows clustering, there are sometimes cluster issues which I have to deal with and fix. Let us learn about an error related to a core resource.

Recently, one of my clients made some changes to the cluster and wanted to change the file share witness to a new share. They modified the file share witness, and after that they started seeing two file share resources in the cluster core resources. When they tried deleting the unused one, they got the error below.

If we press Ctrl+C on the message box, we can paste its text and get the following.

[Image: clus-core-01]

[Window Title] Error
[Main Instruction] The operation has failed
[Content] The cluster resource could not be deleted since it is a core resource.
[^] Hide Details [OK] [Expanded Information] Error Code: 0x800713a2
The cluster resource could not be deleted since it is a core resource.

I was not able to reproduce the error in my lab. When I provide a new share name, the old one gets removed automatically, but that was not the case for my client.

SOLUTION

In this case the solution was not hard. We ran the “Quorum Configuration Wizard” and modified the quorum model from “Node and File Share Majority” to “Node Majority”. As soon as we did that, both file shares disappeared from Failover Cluster Manager. Later we changed the quorum model back to “Node and File Share Majority” and selected the share which we needed for the witness.

Reference: Pinal Dave (http://blog.sqlauthority.com)

First appeared on SQL SERVER – Fix: The Cluster Resource Could not be Deleted Since it is a Core Resource

SQL SERVER – Added New Node in Windows Cluster and AlwaysOn Availability Databases Stopped Working

Almost all the time, whenever there is a wizard, it is human habit to go with the defaults and finally click Finish. One of my clients sent the email below to me. In this blog post we are going to learn why AlwaysOn availability databases stopped working after a new node was added to a Windows cluster.

Hi Pinal,
We are trying to add a new node to the AlwaysOn Availability Group, and for that we must add the new node to the Windows cluster. Before doing this in production, we tried it in our test environment and ran into issues. We noticed that as soon as the node was added to the Windows cluster, our databases which were part of an availability group went into a not synchronizing state. Later I noticed that local disks had been added to the cluster under “Available Storage”.

Have you seen this issue? What is wrong with our setup?

Thanks!

I asked whether there were any errors in the event log, and they shared the one below.

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Event ID: 1069
Task Category: Resource Control Manager
Level: Error
Description: Cluster resource ‘Cluster Disk 2’ of type ‘Physical Disk’ in clustered role ‘Available Storage’ failed. The error code was ‘0x1’ (‘Incorrect function.’)

I told them that they must have followed the wizard and forgotten to uncheck the highlighted checkbox (“Add all eligible storage to the cluster”).

[Image: add-node-disk-01]

This is the default setting and it has caused issues for many, including me during a demo. They confirmed that this is what had happened. What is the solution now?

SOLUTION/WORKAROUND

To work around this problem, we must remove the disk resources from Failover Cluster Manager. Once they are removed, we need to bring these drives online in Disk Management.

Have you run into the same issue? Is there anywhere else a default setting in a wizard has caused a problem?

Reference: Pinal Dave (http://blog.sqlauthority.com)

First appeared on SQL SERVER – Added New Node in Windows Cluster and AlwaysOn Availability Databases Stopped Working

SQL SERVER – FIX Error – Cluster Network Name showing NETBIOS status as “The system cannot find the file specified”

Over time, I have learned that fixing an issue is easier if we know which log to look at and what to search for on Google. Here is a situation I ran into a few days back. In this blog post we will learn how to fix the error where the Cluster Network Name shows the NETBIOS status as “The system cannot find the file specified”.

I was at a client site for a performance tuning exercise and they asked me if I knew Windows Clustering. Of course, I can’t call myself an expert, but I at least know whatever is needed for AlwaysOn availability groups. So, I asked them about the issue. They told me that they were seeing the Cluster Network Name in a failed state, as below.

[Image: cluster-core-01]

We tried to move the cluster group between the nodes and tried bringing the CNO (Cluster Name Object) online, but it failed. In the CNO parameters we saw the NETBIOS status as “The system cannot find the file specified”. So, I asked them to capture the cluster logs.

SQL SERVER – Steps to Generate Windows Cluster Log?

Here is the snippet from the cluster log.

548020 000022c8.00004208::2016/10/08-10:38:29.991 ERR [RES] Network Name: Agent: InitializeModule, Trying to initialize Module(fb729fe4-79ea-4a0d-857e-411636879e67,Identity) when there is one already in Initialized/Idle state
548064 000022c8.00000368::2016/10/08-10:38:29.991 ERR [RES] Network Name: [NNLIB] Unable to add server name WindowsCluster to transport \Device\NetBt_If1, status 2
548083 000022c8.000038b8::2016/10/08-10:38:29.991 ERR [RES] Network Name : Online thread Failed: ERROR_SUCCESS(0)’ because of ‘Initializing netname configuration for Cluster Name failed with error 2.’
548088 000022c8.000038b8::2016/10/08-10:38:29.991 ERR [RHS] Online for resource Cluster Name failed.
548102 00001cac.00002f94::2016/10/08-10:38:29.991 ERR [RCM] rcm::RcmResource::HandleFailure: (Cluster Name)

As seen above, status 2 indicates “The system cannot find the file specified”. My search for “NetBt” pointed to an issue with the “NetBIOS over TCP/IP” setting. When I checked the network adapters, I noticed that under the WINS section, NetBIOS over TCP/IP had been disabled.
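
When a cluster log is large, it helps to pull out only the ERR lines and translate their numeric status codes. Here is a minimal sketch of that idea, not a full cluster log parser. The function name `explain_errors` and the tiny lookup table are my own assumptions for illustration; the two sample lines are copied from the excerpt above.

```python
import re

# Two ERR lines copied from the cluster log excerpt above.
LOG_LINES = [
    r"548064 000022c8.00000368::2016/10/08-10:38:29.991 ERR [RES] Network Name: "
    r"[NNLIB] Unable to add server name WindowsCluster to transport \Device\NetBt_If1, status 2",
    "548088 000022c8.000038b8::2016/10/08-10:38:29.991 ERR [RHS] "
    "Online for resource Cluster Name failed.",
]

# Hand-picked subset of Win32 error codes; extend it as needed.
WIN32_MESSAGES = {
    2: "The system cannot find the file specified.",
    5: "Access is denied.",
}

COMPONENT_RE = re.compile(r"ERR\s+\[(\w+)\]")
STATUS_RE = re.compile(r"\b(?:status|error) (\d+)\b")


def explain_errors(lines):
    """Yield (component, code, message) for ERR lines that carry a numeric status."""
    for line in lines:
        component = COMPONENT_RE.search(line)
        if component is None:
            continue  # not an ERR line
        for match in STATUS_RE.finditer(line):
            code = int(match.group(1))
            yield component.group(1), code, WIN32_MESSAGES.get(code, "unknown")


results = list(explain_errors(LOG_LINES))
for component, code, message in results:
    print(f"[{component}] status {code}: {message}")
```

Only the first sample line carries a status code, so the sketch reports a single hit for the RES component with status 2.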

WORKAROUND/SOLUTION

Here is the screen where I changed the settings.

[Image: cluster-core-02]

We changed it to Default, the issue was resolved, and we were able to bring the resource online.

Have you ever found a solution using the cluster log?

Reference: Pinal Dave (http://blog.sqlauthority.com)

First appeared on SQL SERVER – FIX Error – Cluster Network Name showing NETBIOS status as “The system cannot find the file specified”

SQL SERVER – Clustered Instance Online Error – SQL Server Network Interfaces: Error Locating Server/Instance Specified [xFFFFFFFF]

While I was playing with a SQL cluster in my lab, I restarted the VMs and found that I was not able to bring SQL Server online. As always, I looked for an error message, but there was nothing interesting. Let us see in this blog post how to fix this clustered instance online error.

Here were the observations:

  1. SQL ERRORLOG is getting created.
  2. If I start SQL from the services it runs fine.
  3. If I try to bring the SQL resource online in the cluster, it stays in “Online Pending” for a while and then goes to the “Failed” state.

To learn more about the failure in the cluster, I generated the cluster log using the steps in my own article.

INFO [API] s_ApiGetQuorumResource final status 0.
INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:5447358a-a102-4fc9-95f4-c040e8716859:Netbios
ERR [RES] SQL Server : [sqsrvres] ODBC Error: [08001] [Microsoft][SQL Server Native Client 11.0]SQL Server Network Interfaces: Error Locating Server/Instance Specified [xFFFFFFFF]. (268435455)
ERR [RES] SQL Server : [sqsrvres] ODBC Error: [HYT00] [Microsoft][SQL Server Native Client 11.0]Login timeout expired (0)
ERR [RES] SQL Server : [sqsrvres] ODBC Error: [08001] [Microsoft][SQL Server Native Client 11.0]A network-related or instance-specific error has occurred while establishing a connection to SQL Server. Server is not found or not accessible. Check if instance name is correct and if SQL Server is configured to allow remote connections. For more information see SQL Server Books Online. (268435455)
INFO [RES] SQL Server : [sqsrvres] Could not connect to SQL Server (rc -1)
INFO [RES] SQL Server : [sqsrvres] SQLDisconnect returns following information
ERR [RES] SQL Server : [sqsrvres] ODBC Error: [08003] [Microsoft][ODBC Driver Manager] Connection not open (0)
INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:52cf277d-234b-4a81-a9a7-0f078fca2a17:Netbios

As per the cluster log, the cluster is not able to connect to the SQL Server service.

WORKAROUND / SOLUTION

Here are the normal causes of the above error:

  1. Incorrect client alias created in the configuration manager
  2. SQL Browser isn’t running when SQL is listening on a non-default port or a named instance.
  3. TCP port connection issue.

I already have a detailed checklist of common causes.

SQL SERVER – FIX : ERROR : (provider: Named Pipes Provider, error: 40 – Could not open a connection to SQL Server) (Microsoft SQL Server, Error: )

In my lab, I found that I had a TCP alias created, and the port of SQL Server had changed after the reboot, causing the SQL cluster issue.

[Image: sql-clus-01]

To fix that permanently, I changed SQL Server to listen on a static port instead of a dynamic port.

Have you ever encountered a similar situation where the cluster log has helped you?
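
A quick way to confirm this kind of mismatch is to test whether anything is listening on the port that the alias points to. The sketch below is a generic TCP reachability check, not a SQL Server specific tool. The function name `port_is_open` is my own, and the host and port in the demo are placeholders that you would replace with the values from your alias.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder values: use the server name and port configured in the alias.
    print(port_is_open("127.0.0.1", 1433))
```

If the check fails for the port in the alias while SQL Server itself is running, the instance has most likely moved to a different dynamic port.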

Reference: Pinal Dave (http://blog.SQLAuthority.com)

First appeared on SQL SERVER – Clustered Instance Online Error – SQL Server Network Interfaces: Error Locating Server/Instance Specified [xFFFFFFFF]

SQL SERVER – Error After Cluster Patching – Error: 5184, Severity: 16, State: 2

During my last consulting engagement, my client pulled me in to look at an issue they were facing. They informed me that they had applied a service pack on one of their clustered environments, and since then SQL Server was not coming online. I asked them to share the ERRORLOG from the SQL instance (see SQL SERVER – Where is ERRORLOG? Various Ways to Find ERRORLOG Location). Let us learn how to fix this error after cluster patching.

2016-11-20 21:09:49.44 spid9s Starting execution of PREINSTMSDB100.SQL
2016-11-20 21:09:49.44 spid9s —————————————-
2016-11-20 21:10:01.67 spid9s Error: 5184, Severity: 16, State: 2.
2016-11-20 21:10:01.67 spid9s Cannot use file ‘D:\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\temp_MS_AgentSigningCertificate_database.mdf’ for clustered server. Only formatted files on which the cluster resource of the server has a dependency can be used. Either the disk resource containing the file is not present in the cluster group or the cluster resource of the Sql Server does not have a dependency on it.
2016-11-20 21:10:01.67 spid9s Error: 1802, Severity: 16, State: 1.
2016-11-20 21:10:01.67 spid9s CREATE DATABASE failed. Some file names listed could not be created. Check related errors.
2016-11-20 21:10:01.67 spid9s Error: 912, Severity: 21, State: 2.
2016-11-20 21:10:01.67 spid9s Script level upgrade for database ‘master’ failed because upgrade step ‘sqlagent100_msdb_upgrade.sql’ encountered error 598, state 1, severity 25. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the ‘master’ database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.
2016-11-20 21:10:01.67 spid9s Error: 3417, Severity: 21, State: 3.
2016-11-20 21:10:01.67 spid9s Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.
2016-11-20 21:10:01.67 spid9s SQL Trace was stopped due to server shutdown. Trace ID = ‘1’. This is an informational message only; no user action is required.

The start of the problem is Error: 5184, Severity: 16, State: 2.

If we look at the error message, it is clear that the SQL Server resource does not have a dependency on the D drive. We checked Failover Cluster Manager and found the following.

[Image: clus-sp-err-01]

As we can see, the only dependency was Cluster Disk 4, which was the E drive. We added the disk for the D drive by clicking on the highlighted area. Once we had added the disk, we found that the issue was still not solved and SQL was not coming online. We checked the ERRORLOG again and found a new problem.

2016-11-20 21:09:48.32 Logon Error: 18456, Severity: 14, State: 11.
2016-11-20 21:09:48.32 Logon Login failed for user ‘NT AUTHORITY\SYSTEM’. Reason: Token-based server access validation failed with an infrastructure error. Check for previous errors. [CLIENT: 100.168.11.171]

I asked them for the series of actions they had taken, and they informed me that they had already attempted to rebuild the system databases, which was news to me. So now the problem was that this login did not exist in SQL Server, because the system databases had been rebuilt. Here are the steps we took to fix this issue.

  1. Started SQL Server in single-user mode from the command prompt:
NET START MSSQLSERVER /m
  2. Added the ‘NT AUTHORITY\SYSTEM’ account as a login.
  3. Stopped SQL Server:
NET STOP MSSQLSERVER

After this we could bring SQL Server online and the issue was resolved. Have you seen a similar issue where a rebuild was done in the cluster and it didn’t work?

Reference: Pinal Dave (http://blog.sqlauthority.com)

First appeared on SQL SERVER – Error After Cluster Patching – Error: 5184, Severity: 16, State: 2

SQL SERVER – Why Cluster Network is Unavailable in Failover Cluster Manager?

It’s always a good experience to visit customer sites and talk to people. Sometimes I get to see things outside the SQL world as well. There is a lot to learn, and I believe that I can do that by sharing what I have learned. In this blog post we will discuss why a cluster network is unavailable in Failover Cluster Manager.

During my last visit to an India-based company, I was talking to a Windows admin during lunch and he was describing a cluster issue. It was an interesting conversation, in which he told me that sometimes a reboot is THE solution to a problem. He told me about an incident where cluster networks were shown as unavailable in Failover Cluster Manager. After lunch, I went to his desk to get more details.

[Image: cluster-down]

As we can see in the box drawn around Nodes, this was happening with only one node.

When we looked at the cluster logs, we saw the messages below.


========B02===========
00000648.00002464::2016/11/29-08:58:45.173 INFO [FTI][Initiator] This node (1) is initiator
00000648.00002464::2016/11/29-08:58:45.173 WARN [FTI][Initiator] Ignoring duplicate connection: usable route already exists
00000648.00002464::2016/11/29-08:58:45.173 INFO [CHANNEL 147.170.123.251:~3343~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)
00000648.00002464::2016/11/29-08:58:45.174 WARN cxl::ConnectWorker::operator (): GracefulClose(1226)’ because of ‘channel to remote endpoint 147.170.123.251:~3343~ is closed’


========B01============
00004090.00005db0::2016/11/29-08:58:45.157 INFO [FTI][Follower] This node (2) is not the initiator
00004090.00005db0::2016/11/29-08:58:45.157 DBG [FTI] Stream already exists to node 1: false
00004090.00005db0::2016/11/29-08:58:45.157 DBG [CHANNEL 147.170.123.252:~54783~] Close().
00004090.00005db0::2016/11/29-08:58:45.157 INFO [CHANNEL 147.170.123.252:~54783~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)
00004090.00005db0::2016/11/29-08:58:45.157 INFO [CORE] Node 2: Clearing cookie 63cfe37d-42be-4211-8cd8-6db6b3344b52
00004090.00005db0::2016/11/29-08:58:45.157 DBG [CHANNEL 147.170.123.252:~54783~] Not closing handle because it is invalid.
00004090.00005db0::2016/11/29-08:58:45.157 WARN mscs::ListenerWorker::operator (): GracefulClose(1226)’ because of ‘channel to remote endpoint 147.170.123.252:~54783~ is closed’

Based on the cluster logs and the highlighted message “Ignoring duplicate connection: usable route already exists”, we can say that this issue is caused by stale network information on the rejecting node.

The only solution that fixed the error was to reboot the active node.

I searched the internet and found that this could also be caused by a real network issue or by some antivirus software. So, if the above message is not shown in your cluster log, you will need to search further. Please share the solution if you find one.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Why Cluster Network is Unavailable in Failover Cluster Manager?

SQL SERVER – FIX: Error 19456: None of the IP Addresses Configured for the Availability Group Listener can be Hosted by the Server

Recently, while deploying a hybrid AlwaysOn availability group for a client, I faced this error. Since I was not able to find many hits for it in an internet search, I thought of sharing it via this blog. I am sure it will help others.

Topology

The client was trying to deploy a hybrid cluster with both on-premises instances and instances hosted in Microsoft’s Azure cloud. All machines were domain-joined and they were part of a multi-subnet network connected via ExpressRoute.

Error message

Here is the error message they got when they were trying to add a replica on an Azure VM.

Msg 19456, Level 16, State 1, Line 3
None of the IP addresses configured for the availability group listener can be hosted by the server ‘AZURESQL-1’. Either configure a public cluster network on which one of the specified IP addresses can be hosted, or add another listener IP address which can be hosted on a public cluster network for this server.


Msg 41158, Level 16, State 3, Line 3
Failed to join local availability replica to availability group ‘HR_AG’. The operation encountered SQL Server error 19456 and has been rolled back. Check the SQL Server error log for more details. When the cause of the error has been resolved, retry the ALTER AVAILABILITY GROUP JOIN command.


Workaround / Solution

I looked at the IP addresses under the network name in the “Failover Cluster Manager” and found that there were two IPs: one in the 10.150.xx.xx range and one in the 10.160.xx.xx range. We saw that the replica we were attempting to add was in 10.140.xx.xx. So, we added an IP address in the appropriate subnet as a dependency of the network name.

In short, this error can be resolved by adding an IP address from the right subnet to the listener. The IP address chosen for each subnet must not already be in use, i.e. it cannot be the IP address of one of the nodes. After adding it, we again attempted to join the replica and, as expected, the operation succeeded.
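
The subnet check described above can be sketched in a few lines of Python. This is only an illustration, not part of the original fix: the addresses and the /16 masks are hypothetical (the real ones are masked in this post), and the function name is made up for the example.

```python
import ipaddress

def uncovered_replicas(listener_subnets, replica_ips):
    """Return the replica IPs that do not fall inside any listener subnet."""
    networks = [ipaddress.ip_network(s) for s in listener_subnets]
    missing = []
    for ip in replica_ips:
        addr = ipaddress.ip_address(ip)
        if not any(addr in net for net in networks):
            missing.append(ip)
    return missing

# Hypothetical values: the listener has IPs in two subnets, the new replica sits in a third.
listener_subnets = ["10.150.0.0/16", "10.160.0.0/16"]
replica_ips = ["10.150.1.10", "10.160.1.10", "10.140.1.10"]

print(uncovered_replicas(listener_subnets, replica_ips))  # prints ['10.140.1.10']
```

Any replica reported here would need a listener IP in its own subnet before the join can succeed.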

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – FIX: Error 19456: None of the IP Addresses Configured for the Availability Group Listener can be Hosted by the Server


SQL SERVER – Add Failover Cluster Node Fails With Error – This SQL Server Edition Does Not Support the Installed Number of Cluster Nodes


In this blog post we will discover how to fix the error when Add Failover Cluster Node fails. If you have installed a SQL Server cluster, then it would be easy for you to remember that it’s a two-step process.

  1. InstallFailoverCluster
  2. AddNode

My client completed step 1 successfully, but while adding the second node, he was getting the below error in Detail.txt.

(13) 2017-01-08 14:33:01 Slp: Executing rules engine…
(13) 2017-01-08 14:33:01 Slp: Start rule execution, total number of rules loaded: 18
(13) 2017-01-08 14:33:01 Slp: Initializing rule : Number of cluster nodes supported for edition
(13) 2017-01-08 14:33:01 Slp: Rule is will be executed : True
(13) 2017-01-08 14:33:01 Slp: Init rule target object: Microsoft.SqlServer.Configuration.SetupExtension.NumberOfNodesFacet
(13) 2017-01-08 14:33:01 Slp: Rule ‘Cluster_NumberOfNodes’ edition Invalid allows 0 cluster nodes.
(13) 2017-01-08 14:33:01 Slp: Rule ‘Cluster_NumberOfNodes’ detected 1 cluster nodes.
(13) 2017-01-08 14:33:01 Slp: Evaluating rule : Cluster_NumberOfNodes
(13) 2017-01-08 14:33:01 Slp: Rule running on machine: SQLNODE02
(13) 2017-01-08 14:33:01 Slp: Rule evaluation done : Failed
(13) 2017-01-08 14:33:01 Slp: Rule evaluation message: This SQL Server edition does not support the installed number of cluster nodes. To continue, remove nodes and then complete cluster installation.

Initially, I thought it was because of Standard edition, as in my earlier blog: SQL SERVER – Add failover cluster node fails with “number of cluster nodes supported for edition”

So, I asked them to check SELECT SERVERPROPERTY('Edition') and it was Enterprise! Enterprise edition doesn’t have a limitation on the number of nodes in the cluster. The error was very strange, but when I went through the log line by line, I found something interesting, as below.

(13) 2017-01-08 14:33:01 Slp: Rule ‘Cluster_NumberOfNodes’ edition Invalid allows 0 cluster nodes.

From the above, it looks like SQL setup is not able to get the edition, and it is marked as “Invalid”. I checked further and found the below message as well.

(04) 2017-01-08 14:33:01 Slp: Loading rule: AddNodeEditionBlock
(04) 2017-01-08 14:33:01 Slp: Creating rule target object: Microsoft.SqlServer.Configuration.SetupExtension.AddNodeEditionBlock
(04) 2017-01-08 14:33:01 Slp: Rule applied features : ALL
(04) 2017-01-08 14:33:01 Slp: ———————————————————————-
(04) 2017-01-08 14:33:01 Slp: Skipping rule AddNodeEditionBlock
(04) 2017-01-08 14:33:01 Slp: Rule will not be evaluated due to the following failed restriction(s):
(04) 2017-01-08 14:33:01 Slp: Condition “Is requested input setting is set to PID” did not pass as it returned false and true was expected.
Returning false as an unhandled exception was caught:
Microsoft.SqlServer.Chainer.Infrastructure.ChainerInvalidOperationException: The input ‘PID’ requested by the StringInputSettingExistsCondition is not of string type.

Based on my search on the internet, it looks like sqlboot.dll is used to get version using checksum value. The DLL is located in the path “SharedCode” stored in “HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\120” and “checksum” is located in “HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL12.<InstanceName>\Setup”

In my client’s case the file was missing from C:\Program Files\Microsoft SQL Server\120\Shared. I am not sure how that happened. They did tell me that setup was incomplete on Node1 and they did a manual hack to fix that.

Have you ever seen such a weird error? Thanks to the internet for providing the internals.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Add Failover Cluster Node Fails With Error – This SQL Server Edition Does Not Support the Installed Number of Cluster Nodes

SQL SERVER – The Lease between Availability Group ‘PRODAG’ and the Windows Server Failover Cluster has Expired


This is one of the common errors I have seen while working with customers who are using SQL Server AlwaysOn availability groups. Once this error occurs, the resource in the cluster goes to a failed state, and in SQL Server Management Studio we see the availability group in a resolving state. Resolving state essentially means that the role of the availability group is neither primary nor secondary.


Here is the snippet from ERRORLOG when the lease expires. I have tried to explain the meaning of each line.

2017-02-27 19:31:07.34 Server Error: 19407, Severity: 16, State: 1.
2017-02-27 19:31:07.34 Server The lease between availability group ‘PRODAG’ and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.

The above message means that the lease between the Windows cluster and SQL Server has expired.

2017-02-27 19:31:07.34 Server      AlwaysOn: The local replica of availability group ‘PRODAG’ is going offline because either the lease expired or lease renewal failed. This is an informational message only. No user action is required.

Due to lease renewal failure, the AG resource in the cluster would go to failed state.

2017-02-27 19:31:07.34 Server      The state of the local availability replica in availability group ‘PRODAG’ has changed from ‘PRIMARY_NORMAL’ to ‘RESOLVING_NORMAL’.  The state changed because the lease between the local availability replica and Windows Server Failover Clustering (WSFC) has expired.  For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log.

Since the AG has failed, the AG state in SQL Server would change from PRIMARY_NORMAL to RESOLVING_NORMAL. At the same time, on the secondary, we should see the state change from SECONDARY_NORMAL to RESOLVING_NORMAL.

Now the next challenge would be to find WHY the lease expired.

SOLUTION/WORKAROUND

Based on my research on the internet, I found that in most cases the lease expires due to a shortage of resources on the machine. You can think of this as a momentary hang of Windows operations. Generally, we should look for the cause of the slowness. For the client I worked with, I could see tons of I/O-related messages like the one below.

2017-02-27 19:23:26.09 spid35s     SQL Server has encountered 244 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [M:\MSSQL\SAM_PRODUCTION.MDF] in database id 21.  The OS file handle is 0x0000000000001B44.  The offset of the latest long I/O is: 0x00004fcfb76000

I have explained the above issue in the below blog: SQL SERVER – WARNING – SQL Server Has Encountered N Occurrence(s) of I/O Requests Taking Longer Than 15 Seconds
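
When there are many such messages, it helps to total them per file. Here is a minimal Python sketch, assuming the ERRORLOG lines follow exactly the wording shown above (other SQL Server versions may word the message differently, so treat the pattern as an assumption). The function name is hypothetical.

```python
import re

# Pattern built from the message text quoted in this post; adjust if your ERRORLOG wording differs.
IO_WARNING = re.compile(
    r"encountered (?P<count>\d+) occurrence\(s\) of I/O requests taking longer than "
    r"(?P<seconds>\d+) seconds to complete on file \[(?P<file>[^\]]+)\] "
    r"in database id (?P<dbid>\d+)"
)

def summarize_slow_io(lines):
    """Return a dict mapping each data file to its total count of slow I/O occurrences."""
    totals = {}
    for line in lines:
        match = IO_WARNING.search(line)
        if match:
            path = match.group("file")
            totals[path] = totals.get(path, 0) + int(match.group("count"))
    return totals

# Sample line copied from the ERRORLOG snippet above.
sample_errorlog = [
    r"2017-02-27 19:23:26.09 spid35s     SQL Server has encountered 244 occurrence(s) of "
    r"I/O requests taking longer than 15 seconds to complete on file "
    r"[M:\MSSQL\SAM_PRODUCTION.MDF] in database id 21.",
]

for path, count in summarize_slow_io(sample_errorlog).items():
    print(path, count)
```

The files with the highest totals are the ones to take to the storage team first.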

We also saw below, which again points to IO slowness.

2017-02-27 19:31:05.09 spid18s                 average writes per second: 1225.28 writes/sec
average throughput:  86.19 MB/sec, I/O saturation: 149511, context switches 190629
2017-02-27 19:31:05.09 spid18s                 last target outstanding: 278, avgWriteLatency 38

Until you find the actual cause, you can increase the LeaseTimeout value so that the AG remains healthy.

But remember that we have not fixed the issue, only applied a band-aid.

Here are few more things to do.

  • Limit Max Server Memory of SQL Server, if not capped.
  • Consult your storage team for storage performance issues, since we see many IO stalled messages.
  • Enable lock pages in memory. This will prevent working set trimming and prevent SQL Server memory from being paged out. Please refer to the below link: https://msdn.microsoft.com/en-IN/library/ms190730.aspx

Have you found something more than above?

Reference: Pinal Dave (http://blog.SQLAuthority.com)

First appeared on SQL SERVER – The Lease between Availability Group ‘PRODAG’ and the Windows Server Failover Cluster has Expired

SQL SERVER – Unable to Install Service Pack in Cluster – There was an Error to Lookup Cluster Resource Types


Learning never stops for me with SQL Server. Even though I have written articles to solve many issues, I still get pinged from various clients and I find one new thing every day. In this blog post we will learn how to fix the error related to installing the service pack in the cluster.

The other day, my client was trying to apply a service pack for SQL Server and it was failing with an error. When I asked them to share the Detail.txt, I found the below exception, which was causing the failure.

The following is an exception stack listing the exceptions in outermost to innermost order

Inner exceptions are being indented

ErrorType = 2
Operation = GetObject
ParameterInfo = MSCluster_ResourceType.Name=”Double-Take Source Connection”
ProviderName = MS_CLUSTER_PROVIDER
StatusCode = 4104
Stack:

When I searched on the internet, I found that someone else had hit the same problem due to “Double-Take Source Connection”. We can also see that in our stack.

WORKAROUND/SOLUTION

I gave the customer the option to contact either Microsoft or the Double-Take software vendor to find the cause of the issue. But they informed me that they were not using Double-Take and it had been installed for testing purposes. There was no Double-Take resource in their cluster.

Since they were OK to get rid of double-take, we ran below command.

Remove-ClusterResourceType -Name "Double-Take Source Connection"

This deleted the “Double-Take Source Connection” cluster resource type, and it was no longer listed by the below command.

Get-ClusterResourceType

Once we removed above resource type, they were able to install the service pack successfully.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

First appeared on SQL SERVER – Unable to Install Service Pack in Cluster – There was an Error to Lookup Cluster Resource Types

SQL SERVER – Slow Installation Wizard on Cluster – Please Wait While Microsoft SQL Server 2016 Setup Process the Current Operation


Have you come across a situation where setup is taking many hours while navigating from one screen to another in clustered environment?  Have you ever waited for many hours watching following message related to slow installation wizard on the cluster?

Please wait while Microsoft SQL Server 2016 Setup process the current operation


When I looked into Detail.txt, I found something like below

(01) 2017-04-08 16:14:46 Slp: Running Action: RunRemoteDiscoveryAction
(01) 2017-04-08 16:14:46 Slp: Running discovery on local machine
(01) 2017-04-08 16:14:47 Slp: Discovery on local machine is complete
(01) 2017-04-08 16:14:47 Slp: Running discovery on remote machine: CRM-SQLNODE1
(01) 2017-04-08 21:13:35 Slp: Discovery on CRM-SQLNODE1 is complete
(01) 2017-04-08 21:13:35 Slp: Completed Action: RunRemoteDiscoveryAction, returned True

Notice the time in each row. There is a HUGE gap between two of the lines, and that is the time SQL setup spends doing a remote discovery to find out what is there on the other node of the cluster.
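
In a long Detail.txt, such a gap is easy to miss by eye. Below is a minimal Python sketch that finds the biggest pause between consecutive lines, assuming every line of interest starts with the "(NN) yyyy-mm-dd hh:mm:ss" prefix seen above. The function name is hypothetical.

```python
import re
from datetime import datetime

# Matches the "(01) 2017-04-08 16:14:46 " prefix used in Detail.txt.
STAMP = re.compile(r"^\(\d+\) (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ")

def largest_gap(lines):
    """Return (gap, earlier_line, later_line) for the biggest pause, or None if fewer than two stamped lines."""
    best = None
    prev_time, prev_line = None, None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # skip continuation lines that carry no timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if prev_time is not None:
            gap = stamp - prev_time
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_time, prev_line = stamp, line
    return best

# Lines copied from the Detail.txt snippet above.
sample_detail = [
    "(01) 2017-04-08 16:14:46 Slp: Running Action: RunRemoteDiscoveryAction",
    "(01) 2017-04-08 16:14:46 Slp: Running discovery on local machine",
    "(01) 2017-04-08 16:14:47 Slp: Discovery on local machine is complete",
    "(01) 2017-04-08 16:14:47 Slp: Running discovery on remote machine: CRM-SQLNODE1",
    "(01) 2017-04-08 21:13:35 Slp: Discovery on CRM-SQLNODE1 is complete",
    "(01) 2017-04-08 21:13:35 Slp: Completed Action: RunRemoteDiscoveryAction, returned True",
]

gap, before, after = largest_gap(sample_detail)
print(gap)     # prints 4:58:48
print(before)  # the line logged just before the pause
```

For this snippet, the line before the pause is the one that starts the remote discovery, which is almost five hours long.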

When I asked, they informed me that they had hardened the server as per the company standards. While checking their hardening document, we found that admin shares were disabled.

WORKAROUND/SOLUTION

The remote discovery issue was resolved by setting the following registry key to 1.

HKLM\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters\AutoShareServer

Reference https://technet.microsoft.com/en-us/library/aa997392(v=exchg.80).aspx

  • Open a registry editor, start > Run > Regedit.exe.
  • Navigate to: HKLM\System\CurrentControlSet\Services\LanmanServer\Parameters
  • In the right pane, locate and double-click AutoShareServer.
  • Change the value from 0 to 1.
  • Close the registry editor, and restart the Server service for the change to take effect.

Have you seen this issue earlier? Do you know any other such issues due to hardening?

Reference: Pinal Dave (http://blog.SQLAuthority.com)

First appeared on SQL SERVER – Slow Installation Wizard on Cluster – Please Wait While Microsoft SQL Server 2016 Setup Process the Current Operation

SQL SERVER – Unable to Add Server Name SQLAUTH-LISTENER to Transport Device NetBt_If4 Status 2 – Windows Cluster


During one of my consulting engagements on Always On availability group configuration, I found that the listener network name was not coming online. As a rule of thumb, I always generate the cluster log for any cluster-related error: SQL SERVER – Steps to Generate Windows Cluster Log?

Here is the information in the cluster log. I have removed the timestamp column to make it readable.

DBG [RHS] Resource SQLAUTH-PRODAG_SQLAUTH-LISTENER called SetResourceStatusEx: checkpoint 2. Old state OnlinePending, new state OnlinePending, AppSpErrorCode 0, Flags 0, nores=false
ERR [RES] Network Name: [NNLIB] Unable to add server name SQLAUTH-LISTENER to transport \Device\NetBt_If4, status 2
INFO [RES] Network Name : Netbios: Registered Name: SQLAUTH-LISTENER into Netbios (Type: Singleton), result: 2 (0 transports registered)
INFO [RES] Network Name : Netbios: Slow Operation, FinishWithReply: 2
INFO [RES] Network Name : Netbios: End of Slow Operation, state: Initializing/Idle, prevWorkState: Idle
INFO [RES] Network Name: Agent: OnInitializeReply, Failure on (1dac30da-a46a-4c44-923b-d428c6329d6d,Netbios): 2

I looked for ERR in the cluster log, and here is the error which we can see above.

Unable to add server name SQLAUTH-LISTENER to transport \Device\NetBt_If4, status 2
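
Searching for ERR can be scripted when the cluster log is large. This is a minimal Python sketch, assuming the level token (ERR, WARN, INFO or DBG) appears as a separate word near the start of each line, as in the snippets in this post. The function name is hypothetical.

```python
import re

# The first level token on the line decides its level; works with or without the timestamp column.
LEVEL = re.compile(r"(?:^|\s)(ERR|WARN|INFO|DBG)\s")

def lines_at_level(lines, level="ERR"):
    """Return the stripped lines whose first level token equals the requested level."""
    hits = []
    for line in lines:
        match = LEVEL.search(line)
        if match and match.group(1) == level:
            hits.append(line.strip())
    return hits

# Lines copied from the cluster log snippet above (raw strings keep the backslashes intact).
sample_cluster_log = [
    r"DBG [RHS] Resource SQLAUTH-PRODAG_SQLAUTH-LISTENER called SetResourceStatusEx: checkpoint 2.",
    r"ERR [RES] Network Name: [NNLIB] Unable to add server name SQLAUTH-LISTENER to transport \Device\NetBt_If4, status 2",
    r"INFO [RES] Network Name : Netbios: Slow Operation, FinishWithReply: 2",
]

for hit in lines_at_level(sample_cluster_log):
    print(hit)
```

Passing level="WARN" gives the warnings the same way, which is handy when no ERR line is present.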

Status 2 indicates: “The system cannot find the file specified”. My search for “NetBt” pointed to an issue with the “NetBIOS over TCP/IP” setting.

WORKAROUND/SOLUTION

When we checked the Network Name resource properties, we saw “The system cannot find the file specified.” for the NetBIOS status. To clear that, I looked at the IP address resource and unchecked the NetBIOS-related setting.


If you want to change that setting at the NIC level, then you can follow my earlier blog.

SQL SERVER – FIX Error – Cluster Network Name showing NETBIOS status as “The system cannot find the file specified”

After fixing above, we were able to bring listener online.

Let me know if you have ever faced a similar error in the Windows cluster log. I would like to know your opinion and workaround as well. Remember, sharing is caring.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

First appeared on SQL SERVER – Unable to Add Server Name SQLAUTH-LISTENER to Transport Device NetBt_If4 Status 2 – Windows Cluster
