Are computing infrastructure outages killing your reputation?
In our 24/7/365 world, computing infrastructure outages can kill a CIO’s reputation and career prospects swiftly and dramatically. Outages have attained an extremely high profile in most organizations because they visibly and quickly:
- Cost revenue.
- Undermine customer service.
- Cause work to grind to a halt.
- Undermine brand reputation.
Computing infrastructure outages occur for many reasons including:
- Insufficient capacity.
- Failing to monitor end-to-end response time.
- Sloppy server management.
- Gaps in configuration management processes.
- External and internal network issues.
- Database administrator (DBA) errors, such as mistyped commands.
- Flaky application execution.
- External and internal electrical power outages.
- Scheduled maintenance taking too long.
At the recent Collision from Home virtual conference, Sebastien Stormacq, Principal Developer Advocate at Amazon Web Services (AWS), explored design patterns to achieve high availability. AWS is a well-known supplier of cloud computing infrastructure. He said that “Modern computing infrastructures embrace failure, rather than trying to avoid it. Best practices systems are designed to handle and recover from unexpected conditions.”
The best practices for minimizing outages and achieving high availability of applications focus on consciously planning for as many failure scenarios as possible. Below are effective measures that CIOs can implement to:
- Reduce the risk of computing infrastructure outages.
- Improve computing infrastructure resilience.
- Achieve high availability.
- Enhance their reputation and preserve their career prospects.
Migrate applications to the cloud
Most organizations struggle to achieve continuous high availability for their on-premise computing infrastructure. It’s expensive to buy the components and then implement them. It’s difficult to justify the wide range of technical specialists required to operate with high availability because most of the specialists are not required on a full-time basis.
A better approach is to contract with the suppliers of cloud computing infrastructure. They have accumulated the experience and employ capable technical teams to achieve the high availability that is often elusive on-premise. Because these suppliers operate at a larger scale, the cost of technical specialists is amortized over many customers, which makes both the cost per customer and the results attractive.
Buy failover services
Organizations can implement a failover environment for their on-premise computing infrastructure. However, implementing changes to applications and operating procedures to seamlessly switch to the failover environment during an outage can be challenging. Unfortunately, some organizations discover gaps in their failover configuration only during the first outage, with significant negative consequences.
A better approach is to buy one of the levels of failover service that suppliers of cloud computing infrastructure all offer. These automatic failover services often eliminate or at least minimize the impact of:
- Computing infrastructure component failures.
- Internet backbone outages.
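The core idea behind automatic failover can be sketched in a few lines of Python. This is a minimal illustration, not how any particular cloud supplier implements it: the endpoint names, the `pick_active_endpoint` function, and the simulated health table are all hypothetical, and a real service would probe the servers over the network rather than read a dictionary.

```python
from typing import Callable, Optional, Sequence


def pick_active_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint, in priority order, that passes its health check."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated the same as a failed check.
            continue
    # No endpoint is healthy: the caller must raise an alert or degrade gracefully.
    return None


# Hypothetical endpoint names, listed with the primary first.
ENDPOINTS = ["primary.example.internal", "standby.example.internal"]

# Simulated health state for the illustration: primary down, standby up.
simulated_health = {
    "primary.example.internal": False,
    "standby.example.internal": True,
}

active = pick_active_endpoint(ENDPOINTS, lambda name: simulated_health[name])
print(active)
```

The value of buying this as a service is that the supplier runs the health checks and the traffic switch continuously, so the switch does not depend on someone noticing the outage.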
Architect applications for high availability
The post-incident review of computing infrastructure outages most often discovers:
- Single points of failure in the computing infrastructure.
- Application software defects.
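A little arithmetic shows why single points of failure matter so much. The sketch below assumes that components fail independently of each other, which is a simplification, and the 99% availability figures are illustrative assumptions rather than measurements. The function names are hypothetical.

```python
from math import prod


def series_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def redundant_availability(availability: float, copies: int) -> float:
    """Availability of a tier that stays up while at least one copy is up."""
    return 1 - (1 - availability) ** copies


# Assumed figures: one web server and one database server, each 99% available.
single_chain = series_availability(0.99, 0.99)

# Duplicate only the database server.
redundant_db = redundant_availability(0.99, 2)
partly_redundant_chain = series_availability(0.99, redundant_db)

# Duplicate both tiers.
redundant_web = redundant_availability(0.99, 2)
fully_redundant_chain = series_availability(redundant_web, redundant_db)

print(f"single chain:          {single_chain:.4%}")
print(f"redundant database:    {partly_redundant_chain:.4%}")
print(f"both tiers redundant:  {fully_redundant_chain:.4%}")
```

Under these assumed figures, duplicating the database lifts that tier from 99% to 99.99%, but the overall chain stays close to 99% until the remaining single point of failure, the web server, is duplicated as well.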
Architecting applications for high availability typically includes the following features:
- Application software that performs extensive data validation on data input.
- An implemented backup and recovery strategy.
- A plan for addressing data loss or corruption before the fact, not after.
- Application software that is aware of the active servers in the cluster.
- Application segmentation onto multiple servers to distribute functions such as web services, authentication, compute, database access, content management, email, and reporting.
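As an illustration of the first feature above, here is a minimal sketch of input validation in Python. The `OrderInput` record, its fields, and the specific limits are hypothetical; a real application would apply the rules its own data requires. The point is that bad data is rejected at the door, with a clear list of reasons, before it can corrupt a database or crash a downstream process.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OrderInput:
    """Hypothetical input record used only for this illustration."""
    customer_id: str
    quantity: int
    email: str


def validate_order(order: OrderInput) -> List[str]:
    """Return a list of validation errors; an empty list means the input is acceptable."""
    errors: List[str] = []

    if not order.customer_id.strip():
        errors.append("customer_id is required")

    # bool is a subclass of int in Python, so exclude it explicitly.
    is_whole_number = isinstance(order.quantity, int) and not isinstance(order.quantity, bool)
    if not is_whole_number or not 1 <= order.quantity <= 1000:
        errors.append("quantity must be a whole number between 1 and 1000")

    # A deliberately simple check: exactly one @ with text on both sides.
    local, _, domain = order.email.partition("@")
    if order.email.count("@") != 1 or not local or not domain:
        errors.append("email must contain a single @ with text on both sides")

    return errors


good = OrderInput(customer_id="C-1001", quantity=5, email="buyer@example.com")
bad = OrderInput(customer_id="  ", quantity=0, email="not-an-email")

print(validate_order(good))
print(validate_order(bad))
```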
Upgrade your on-premise network
At many organizations, the on-premise network works reasonably well most of the time. However, it is at risk of outages due to:
- Bottlenecks created by high-volume, local traffic.
- Single points of failure.
- Broadcast storms originating from defective Ethernet network cards and switches.
To achieve high availability of your on-premise network, upgrade it even when you have migrated most of your applications to the cloud. Network upgrades to consider include:
- Fewer end-user devices per switch to minimize sharing of the local network segment capacity.
- Multiple paths from switches to the edge of the local network to provide network redundancy.
- More cabled Gigabit Ethernet and less Wi-Fi to improve network reliability and performance.
- Use of subnets to split a large network into multiple smaller, interconnected networks to isolate local traffic to the local subnet whenever possible.
- Load balancing routers at the edge of your network to spread network traffic and server workload.
- Multiple ISP connections to spread the external network traffic and to create redundant paths to the Internet.
- Active network monitoring to detect intrusions.
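The subnet recommendation above can be made concrete with Python's standard `ipaddress` module. This is a sketch under assumed values: the 10.20.0.0/22 address block and the example device addresses are hypothetical, and `subnet_of` and `same_subnet` are illustrative helper names. It splits one larger network into four smaller subnets and checks whether two devices would keep their traffic local to one subnet.

```python
import ipaddress
from typing import Optional

# Hypothetical private address block for the whole office network.
network = ipaddress.ip_network("10.20.0.0/22")

# Split the /22 block into smaller /24 subnets.
subnets = list(network.subnets(new_prefix=24))


def subnet_of(address: str) -> Optional[ipaddress.IPv4Network]:
    """Return the subnet that contains the address, or None if it is outside the block."""
    ip = ipaddress.ip_address(address)
    for subnet in subnets:
        if ip in subnet:
            return subnet
    return None


def same_subnet(first: str, second: str) -> bool:
    """True when both addresses fall inside the same subnet of the block."""
    first_subnet = subnet_of(first)
    return first_subnet is not None and first_subnet == subnet_of(second)


for subnet in subnets:
    print(subnet)

# Two devices on one subnet exchange traffic locally; devices on
# different subnets must go through a router.
print(same_subnet("10.20.1.15", "10.20.1.200"))
print(same_subnet("10.20.1.15", "10.20.2.15"))
```

Grouping devices that talk to each other most often, such as a department and its printers, into the same subnet keeps that traffic off the rest of the network.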
What strategies would you recommend to successfully improve computing infrastructure resilience? Let us know in the comments below.