Anyone working in an IT shop knows server uptime is crucial. Server downtime can cause major problems, from reduced worker productivity to a degraded customer experience and even lost sales. But keeping downtime to a minimum isn't as simple as fixing problems as they arise. To keep your servers online, you need to understand what’s happening with them.
That's where network monitoring comes in.
Monitoring all the system resources associated with a server will help you build an understanding of resource usage patterns that will let you know when things are running right, and when they're starting to go awry. That way, you can always ensure that your servers are optimized accordingly, take care of problems before they arise, and provide a better end-user experience.
But accessing all of this information and synthesizing it into digestible, actionable alerts and reports is easier said than done.
In this blog post, I’m going to take you through a few of the ways you can use WhatsUp Gold (WUG) to monitor physical servers, from server health to utilization. If you want a more in-depth look at the topics discussed here, as well as a demo of WUG's server monitoring capabilities in action, check out our recent Server Monitoring webinar, now available on demand.
Monitoring Your Server's Health
Sometimes it can seem like there’s always something wrong with server hardware. From CPU errors to memory overloads, issues can frequently arise in normal use, and only become more frequent as your shop grows and incorporates more and more devices—which may not always play nice with each other.
The best way to stay on top of all of this is to monitor the essential indicators of server health (CPU, memory, and disk utilization) with active monitors, and to set up alerts that will let you know when things aren’t quite right.
That means you can track CPU load over specific time frames to see when it peaks unexpectedly (perhaps due to processor bottlenecks, denial-of-service attacks, or other service incidents) or when it is abnormally idle, for example because a server has dropped out of a load balancer configuration or suffered a kernel panic. And of course, you can be alerted when utilization falls outside your chosen thresholds.
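To make the threshold idea concrete, here is a minimal sketch in Python. It is not WUG's implementation or API: the function name `classify_cpu_window` and the default thresholds of 90% and 5% are assumptions chosen purely for illustration. It averages a window of CPU utilization samples and flags the window as too busy, suspiciously idle, or fine.

```python
from statistics import mean


def classify_cpu_window(samples, high=90.0, low=5.0):
    """Classify a window of CPU utilization percentages.

    `high` and `low` are hypothetical thresholds; pick values that
    match the normal usage pattern of the server being watched.
    """
    if not samples:
        raise ValueError("need at least one sample")
    avg = mean(samples)
    if avg >= high:
        return "high"   # possible bottleneck or attack
    if avg <= low:
        return "idle"   # possibly dropped from the load balancer
    return "ok"


if __name__ == "__main__":
    print(classify_cpu_window([95, 97, 92]))
    print(classify_cpu_window([1, 2, 0]))
```

Averaging over a window rather than reacting to a single sample keeps one short spike from triggering an alert.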
You can do the same for memory utilization, and you can also set up reports that compare disk storage capacity to actual utilization for devices with on-disk storage, which is useful for capacity planning.
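As a rough illustration of a capacity-versus-utilization report, the sketch below ranks devices by the share of disk they have used. The device names, the numbers, and the `capacity_report` function are all made up for the example; a real report would pull these figures from your monitoring data.

```python
def capacity_report(devices):
    """Return (name, capacity_gb, used_gb, percent_used) rows, fullest first.

    `devices` maps a device name to a (capacity_gb, used_gb) pair.
    """
    rows = []
    for name, (capacity_gb, used_gb) in devices.items():
        percent_used = round(100.0 * used_gb / capacity_gb, 1)
        rows.append((name, capacity_gb, used_gb, percent_used))
    # Sort so the devices closest to full appear at the top of the report.
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = {"web01": (500, 450), "db01": (2000, 800)}
    for name, capacity, used, pct in capacity_report(inventory):
        print(f"{name}: {used}/{capacity} GB ({pct}%)")
```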
Monitoring Hardware Components
Hardware components are another good way to keep track of the health of your servers. If a server operates at a high temperature for an extended period of time, that can indicate deeper issues. If possible, you should set up a temperature monitor that checks the status of a device’s temperature sensors: if a sensor returns a “normal” or “ok” state indicator, it is considered up; if not, it is considered down.
WUG can also be configured to display details such as fan and power supply status. The information available about the server depends on the device being monitored. Typically, we are able to monitor all of this information for Dell, Cisco, HP, and EMC devices.
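The up/down rule for temperature sensors can be sketched in a few lines of Python. This is only an illustration of the rule described above, not WUG code: the function name `sensor_status` is hypothetical, and treating the state string case-insensitively is an assumption.

```python
# States that count as healthy, per the rule described above.
HEALTHY_STATES = {"normal", "ok"}


def sensor_status(state):
    """Map a sensor's reported state string to "up" or "down"."""
    # Assumption: ignore case and surrounding whitespace in the reported state.
    if state.strip().lower() in HEALTHY_STATES:
        return "up"
    return "down"


if __name__ == "__main__":
    for reported in ("OK", "normal", "critical"):
        print(reported, "->", sensor_status(reported))
```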
Setting Up Critical Alerts
Of course, none of these cool monitoring capabilities matter if you can't tell when something’s wrong. That’s where alerting comes in. If a server or your entire network is strained, WUG will tell you right away via customizable alerts delivered through email, SMS, or even Slack. You can then intervene quickly and save your company from potentially serious consequences.
But alerting can also be a headache if poorly configured. You should not, for example, get alerts for each dependent device that goes down. If a gateway device goes down, that’s the only alert you need; you don’t need an alert from every connected device behind it telling you that it has lost its connection.
With WUG, those alert storms are easily avoided, as it automatically applies dependency rules to discovered layer 2 and layer 3 devices. These dependencies can also be configured manually.
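To show why dependency rules prevent alert storms, here is a small Python sketch. It does not reflect how WUG stores or applies dependencies: the `parent` map, the device names, and the `root_cause_alerts` function are assumptions for illustration. A down device is reported only if nothing upstream of it is also down.

```python
def root_cause_alerts(down, parent):
    """Return the down devices that have no down device upstream.

    `down` is a set of device names that failed their checks.
    `parent` maps each device to the device it depends on.
    """
    alerts = []
    for device in sorted(down):
        upstream = parent.get(device)
        visited = set()  # guards against accidental loops in the map
        suppressed = False
        while upstream is not None and upstream not in visited:
            if upstream in down:
                suppressed = True  # an upstream outage explains this one
                break
            visited.add(upstream)
            upstream = parent.get(upstream)
        if not suppressed:
            alerts.append(device)
    return alerts


if __name__ == "__main__":
    # Hypothetical topology: two servers behind a switch behind a gateway.
    topology = {"switch1": "gateway", "server1": "switch1", "server2": "switch1"}
    print(root_cause_alerts({"gateway", "switch1", "server1", "server2"}, topology))
```

In the example topology, a gateway outage produces one alert instead of four.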
With Alert Escalation, notification policies in the Alert Center can be configured to escalate alerts based on the criticality of the network components: alerts can move up from automatic trouble ticket generation to notifying pre-designated administrators.
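An escalation policy can be thought of as a list of steps, each with a delay. The Python sketch below is a simplified model, not the Alert Center's actual configuration format: the `POLICY` steps, their timings, and the `actions_due` function are invented for the example.

```python
# Hypothetical policy: (minutes since the alert opened, action to take).
POLICY = [
    (0, "open trouble ticket"),
    (15, "email on-call administrator"),
    (30, "send SMS to team lead"),
]


def actions_due(policy, minutes_open):
    """Return every action whose delay has elapsed, earliest step first."""
    return [action for delay, action in sorted(policy) if minutes_open >= delay]


if __name__ == "__main__":
    for minutes in (0, 20, 45):
        print(minutes, actions_due(POLICY, minutes))
```

A more critical component could simply be assigned a policy with shorter delays or more steps.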
With the Alert Acknowledgement feature, the first responder’s acknowledgment of an alert is treated as an indication that the issue is being addressed, and no further alerts are sent unless they are triggered by the notification policy or sent as log messages after the issue has been resolved. This ensures that issues that are not fixed within the timeframe are addressed appropriately. Likewise, information about the action taken can be added during the acknowledgment process, providing problem resolution data that can be used if the issue recurs.
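The acknowledgment idea can be modeled with a small Python class. This is a deliberately simplified sketch and not WUG's data model: the `Alert` class and its fields are hypothetical, and it ignores the case where the notification policy triggers further alerts after acknowledgment.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Alert:
    device: str
    message: str
    acknowledged_by: Optional[str] = None
    notes: List[str] = field(default_factory=list)

    def acknowledge(self, responder, note=""):
        """Record who picked up the alert and what they did about it."""
        self.acknowledged_by = responder
        if note:
            self.notes.append(note)

    def should_renotify(self):
        """Keep sending repeat notifications only until someone acknowledges."""
        return self.acknowledged_by is None


if __name__ == "__main__":
    alert = Alert("db01", "disk almost full")
    alert.acknowledge("sam", "cleared temp files")
    print(alert.should_renotify(), alert.notes)
```

Storing the note alongside the alert is what makes the resolution history searchable the next time the same problem appears.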