Eight Ways to Achieve System Stability

Eric Bruno | Posted on April 16, 2017 | Monitoring


Unfortunately, no one template can guarantee 100-percent system stability. There are, however, guidelines any sysadmin can follow to define policies and procedures that proactively keep the network from turning into a random fire drill.

What Constitutes Stability

To be clear, system stability is a measure of overall system performance, accessibility and usability. It includes ensuring uptime in components such as Web and database servers, sure, but it goes beyond that. It's also about maintaining end-to-end reliability from the user's point of view. That means measuring the user's experience, not just individual server statistics. For instance, if your application processes a million requests per day, and even a small segment of those transactions (just 1 to 2 percent) experience difficulty, that is 10,000 to 20,000 affected requests every day.
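The arithmetic behind that example is worth keeping at hand when you set thresholds. Here is a minimal Python sketch; the function name affected_requests is made up for illustration, and the figures are the ones from the paragraph above.

```python
def affected_requests(requests_per_day: int, failure_rate: float) -> int:
    """Return how many requests per day run into trouble at a given failure rate."""
    return round(requests_per_day * failure_rate)


# The example from the text: one million requests per day, 1 to 2 percent failing.
for rate in (0.01, 0.02):
    count = affected_requests(1_000_000, rate)
    print(f"{rate:.0%} of 1,000,000 requests: {count:,} affected per day")
```

A failure rate that looks negligible as a percentage can still be a large absolute number, which is why the user-facing count belongs next to the server statistics.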

Defining what you consider to be a stable system establishes a benchmark you can use to measure performance, accessibility, change management and supportability. From there, you can work on strategies to achieve as close to 100-percent system uptime as you can.

Here are eight recommended protocols and workplace policies you can help enforce to get your systems stable and keep them that way.

1. Define (Your) System Stability

Define and establish what the department can consider a stable computing environment, including server metrics and their effect on UX. These may include both a Recovery Time Objective (RTO), the maximum time tolerable without access to the application, and a Recovery Point Objective (RPO), the maximum data loss that can be tolerated. Foster a company-wide view of holistic system metrics instead of one that's technology focused, and measure risk factors that can disrupt your (and your clients') business. Otherwise, you end up with a siloed approach wherein individual system owners view and report only on their individual components.
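Once RTO and RPO are written down, each incident can be scored against them. The sketch below is one possible way to do that in Python. It assumes the RPO is expressed as a time window (how far back the last good backup may be), and the names RecoveryTargets and check_incident, along with the sample timestamps and targets, are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryTargets:
    rto: timedelta  # maximum tolerable time without access to the application
    rpo: timedelta  # maximum tolerable window of lost data


def check_incident(targets, outage_start, service_restored, last_good_backup):
    """Compare one incident against the agreed RTO and RPO."""
    downtime = service_restored - outage_start
    # Anything written after the last good backup is assumed lost.
    data_loss_window = outage_start - last_good_backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= targets.rto,
        "rpo_met": data_loss_window <= targets.rpo,
    }


# Illustrative targets: one hour of downtime, fifteen minutes of data.
targets = RecoveryTargets(rto=timedelta(hours=1), rpo=timedelta(minutes=15))
report = check_incident(
    targets,
    outage_start=datetime(2017, 4, 16, 10, 0),
    service_restored=datetime(2017, 4, 16, 10, 45),
    last_good_backup=datetime(2017, 4, 16, 9, 40),
)
print(report)
```

In this made-up incident the service came back within the RTO, but the last backup was 20 minutes old, so the RPO was missed. Reporting both numbers together is one way to avoid the siloed view described above.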

2. Create Change Management Policies

Create and enforce a strict, well-defined change management process so failures don't occur when something is modified. This includes hardware and network configuration, patch installation and software version upgrades.

3. Enforce End-to-End Test Procedures

Common sense suggests higher quality software results in greater uptime, so make sure your company implements proper testing procedures to ensure quality across the board. Every component and modification, from code changes to system reconfiguration and even network infrastructure upgrades, needs to be regression tested end-to-end.

4. Map and Monitor Your Network

Slow or compromised communications can appear as an outage, directly affecting stability. Know what's out there on your global network: physical and virtual servers, network infrastructure, which ports are open, where vital communications occur and where your weak points are. The best way to do this is visually, using tools that help you interpret complexity at a glance.
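Dedicated mapping tools do this at scale, but the basic idea of checking which ports are open can be sketched with Python's standard socket module. The function name open_ports, the host and the port list below are placeholders, not a recommendation, and you should only probe machines you are authorized to test.

```python
import socket


def open_ports(host, ports, timeout=0.5):
    """Return the subset of ports on host that accept a TCP connection."""
    found = []
    for port in ports:
        try:
            # create_connection raises OSError if the port is closed,
            # filtered, or the attempt times out.
            with socket.create_connection((host, port), timeout=timeout):
                found.append(port)
        except OSError:
            pass
    return found


if __name__ == "__main__":
    # Placeholder target: the local machine and a few common service ports.
    print(open_ports("127.0.0.1", [22, 80, 443]))
```

A check like this only tells you that something is listening. It says nothing about what the service is or whether it should be there, which is where a visual map of the network earns its keep.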

5. Proper Server Monitoring

To avoid downtime, you need to know about an issue as it occurs, with enough insight to fix it quickly. Use a unified monitoring and analysis tool to help you discover all devices and servers, then isolate performance issues to help focus your efforts. Root cause analysis, a problem-solving technique that traces an incident back to its underlying cause rather than its symptoms, looks at the system as a whole, not just individual pieces, and helps you improve over time.

6. Implement Corporate Collaboration Tools

A critical factor in restoring system stability is staff communication, especially with geographically distributed teams. Collaboration tools that work across mobile devices and desktops are important for limiting downtime when issues occur.

7. Test System Restoration Procedures

Develop the ability to quickly restore or deploy new server images from a trusted repository in case of catastrophic failure. This includes source code management and continuous integration systems. It's important to test these procedures before you need them the most.
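One small piece of a restoration drill that is easy to automate is confirming that the image you are about to deploy matches the one in the trusted repository. The Python sketch below assumes the repository publishes a SHA-256 checksum for each image; the function names sha256_of and image_is_trusted are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_is_trusted(path, expected_sha256):
    """Return True if the file's checksum matches the published one."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

A checksum match shows the image is intact, not that it boots. A full drill still means actually restoring from it on a schedule.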

8. Use Big Data Analytics to Predict Outages

The best way to ensure stability? Stop outages before they occur. By collecting large volumes of data from every system, both when it is running properly and when it fails, you can use analytics tools to discover trends that help predict future outages.
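Real analytics platforms use far richer models, but the core idea of projecting a trend forward can be shown with a straight-line fit. The following Python sketch estimates how many days remain before disk usage reaches a threshold. The function names and the sample readings are invented for illustration, and a linear trend is an assumption that will not hold for every metric.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(days, usage_pct, threshold=100.0):
    """Estimate days left after the last sample until usage hits the threshold.

    Returns None when usage is flat or falling, since the fitted line
    never reaches the threshold in that case.
    """
    slope, intercept = linear_fit(days, usage_pct)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Invented sample: disk usage (percent) recorded once a day for five days.
days = [0, 1, 2, 3, 4]
usage = [50.0, 52.0, 54.0, 56.0, 58.0]
print(days_until_threshold(days, usage))
```

With these sample readings, usage grows two points a day, so the fitted line reaches 100 percent 21 days after the last sample. That is the kind of lead time that turns an outage into a scheduled maintenance task.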

Ultimately, pursue a proactive approach to stability rather than reacting after a failure occurs. This approach has been proven to work in other areas, and it is working for help desks.