
Identifying Server Problems with Trend Analysis

One of the first tasks when beginning to test SQL servers or operating systems is to identify the normal level of activity of the system, or, in other words, to establish the baseline. Understanding and documenting the normal functioning of the system makes it easy to spot unusual behavior should it occur. Likewise, one must know how the system parameters behave during certain hours and on special days (such as paydays, monthly fluctuations in activity, and increased load around holidays and festivals).
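As a rough illustration of what documenting a baseline can look like, here is a minimal Python sketch. It assumes you already collect timestamped samples of some metric (the CPU figures below are made up), and it summarises them per weekday and hour so that busy hours and quiet hours each get their own notion of normal. The function name `build_baseline` and the bucketing scheme are illustrative choices, not part of any particular monitoring product.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def build_baseline(samples):
    """Group (timestamp, value) samples by (weekday, hour) and
    return {(weekday, hour): (mean, standard deviation)}."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {key: (mean(vals), pstdev(vals)) for key, vals in buckets.items()}


# Hypothetical CPU-percent samples: two Mondays at 09:00, one Monday at 14:00.
samples = [
    (datetime(2024, 1, 1, 9, 0), 40.0),
    (datetime(2024, 1, 8, 9, 0), 50.0),
    (datetime(2024, 1, 1, 14, 0), 70.0),
]
baseline = build_baseline(samples)
```

Special days such as paydays or holidays could be handled the same way by adding a label for them to the bucket key.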
Alerts from the system increase awareness of unusual activity in the service and encourage proactive measures to head off potential problems. Among the abnormal activities that should be monitored are increased CPU consumption, rapid consumption of drive space, excessive growth or overuse of the database log, connection errors and others.
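To make the idea of flagging abnormal activity concrete, the following Python sketch compares current readings against a stored baseline and reports the metrics that deviate too far from it. The metric names, the baseline numbers and the three-standard-deviation threshold are all assumptions chosen for illustration; a real monitoring tool may use quite different rules.

```python
def is_abnormal(value, baseline_mean, baseline_std, z_threshold=3.0, min_std=1.0):
    """Return True when value lies more than z_threshold standard
    deviations from the baseline mean."""
    # Floor the spread so a perfectly flat baseline does not divide by zero.
    spread = max(baseline_std, min_std)
    return abs(value - baseline_mean) / spread > z_threshold


def check_metrics(current, baseline):
    """Return the sorted names of metrics whose current value is abnormal."""
    return sorted(
        name
        for name, value in current.items()
        if name in baseline and is_abnormal(value, *baseline[name])
    )


# Hypothetical baseline: metric name -> (mean, standard deviation).
baseline = {
    "cpu_percent": (45.0, 5.0),
    "tempdb_gb": (8.0, 1.0),
    "failed_logins": (2.0, 1.0),
}
current = {"cpu_percent": 52.0, "tempdb_gb": 58.0, "failed_logins": 3.0}
alerts = check_metrics(current, baseline)
```

Here only `tempdb_gb` is reported, since the other two readings stay within their usual range.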

The Big Problem

Recently, with the help of AimBetter, a serious blunder was avoided when a report run by a developer created unusual activity in TEMPDB; within half an hour the drive lost 50GB of free space. Such a lack of drive space could bring things to a grinding halt, but in this case the problem wasn’t that the IT team had left too little space to run the report. Instead, the error was potentially caused by a developer who ran a complex report with secondary data and forgot to set the report's date parameters.

The Quick Resolution

First, we received an alert by email.
[Image: alert-mail]

We entered the system and realized that the problem began within the past half hour.

[Image: low-size]

We reviewed the logs and discovered the problem.

[Image: log-list]

Eventually we alerted the developer to the serious nature of his error: the report requires proper date parameters in order to run correctly. We then shrank TEMPDB without any resulting downtime and returned the system to optimal function.
Without a proper trend analysis and documentation of the source of the decline, we would not have been able to identify the problem or its starting point so quickly. As it turned out, a mere half hour to evaluate, inspect and repair prevented significant damage, downtime and frayed nerves.
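For readers who want to see how a starting point can be pulled out of recorded history, here is a small Python sketch. It assumes free drive space is sampled at regular intervals and looks for the first interval in which the space fell by more than an allowed amount. The readings, the five-minute interval and the 2GB limit are invented for the example and are not taken from the incident described above.

```python
from datetime import datetime, timedelta


def find_decline_start(readings, max_drop_per_interval):
    """Return the timestamp of the last reading before the first
    unusually large drop in free space, or None if there is none."""
    for (prev_ts, prev_free), (_, free) in zip(readings, readings[1:]):
        if prev_free - free > max_drop_per_interval:
            return prev_ts
    return None


# Hypothetical free-space readings in GB, taken every five minutes.
start = datetime(2024, 1, 1, 10, 0)
free_gb = [120.0, 119.8, 119.9, 110.0, 98.0, 85.0, 70.0]
readings = [(start + timedelta(minutes=5 * i), gb) for i, gb in enumerate(free_gb)]

onset = find_decline_start(readings, max_drop_per_interval=2.0)
```

With these sample readings the decline is traced back to 10:10, which is the point from which the logs would be worth reviewing.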