Just recently, I violated one of my guiding work principles as a DBA … Be Proactive, Not Reactive … What this principle means to me, is that I want to try to anticipate problems arising and fix them, rather than not noticing them until they bloom into a full-fledged issue that has a greater impact on my users.
Here was the situation … It was just after a maintenance window, and there were some infrastructure issues. I probably could stop the story there, but I will continue … On one of my servers, I am using AlwaysOn (https://msdn.microsoft.com/en-us/library/ff877884.aspx) and my disk space monitoring process noticed rapid growth on the primary replica with one of the database’s log file to the point of almost filling the drive. As I checked the log backups, everything was working fine. So I queried the log_reuse_wait_desc column of sys.databases (https://msdn.microsoft.com/en-us/library/ms178534.aspx), which will give a description of why the log file can’t be reused. And as you might have guessed, it pointed to AlwaysOn (AVAILABILITY_REPLICA). I immediately realized that the problem was that the data was not moving to the replica and I resumed the data movement once the underlying issue was fixed. And for those wondering, the issue was caused because the secondary replica, which was our DR server, had a networking issue. Although the Hyper-V host was up and the virtual server appeared up, it was not allowing any traffic in or out.
At this point, I realized that I had no monitoring in place to determine whether or not data is moving between the replicas or if a replica is falling behind. So, I needed to setup something and I went with Policy Based Management (https://msdn.microsoft.com/en-us/library/bb510667.aspx). The basic idea is that you define a condition that specifies the state of a management facet. A management facet is defined by as “a set of logical properties that are related to an area of management interest”. Then, you define a policy that references that condition and specify how and when to evaluate it.
How to use Policy Based Management to monitor AlwaysOn?
There is a great post, which I feel is a must read, about monitoring performance for AlwaysOn at: https://msdn.microsoft.com/en-us/library/dn135338.aspx.
In that post, it explains how data is synchronized, how to estimate recovery time, how to estimate how much data will be loss, and how to setup monitoring. As you can see in the picture above, the monitoring consists of setting up 3 conditions and 2 policies. The policies evaluate the current situation to see if the failover time will exceed 10 minutes or if the potential for data loss is greater than 1 hour.
So, there you have it, a simple way to monitor your AlwaysOn instances. Of course, some of the SQL Server monitoring solutions have built in functionality to do something similar. But, if you aren’t using one of them on that server, this is an easy way to implement something.
In my case, if I had setup this policy based management, I would have been alerted and the log drive would not nearly had grown so large. And I would have been proactive, instead of reactive.